BUILDING A SEARCH FROM AN INDEX
In part one we described how a modern search engine utilizes meta tags, indexing, and hashing to form a database of indexes. In this post we will continue where we left off and describe how search queries are built and how searches are conducted.
Searching through an index requires the user to build a query and submit it to the search engine. Queries can be one word or they can be more complex. The engine looks for specific Boolean operators that may be present in the search query in order to clearly define the terms of the search.
Some of the most common Boolean operators and what they mean to the engine are:
AND – All terms joined by this word must appear in the pages or documents of results.
OR – At least one term joined by this word must appear in the pages or documents of results.
NOT – The terms following this word must not appear at all in the pages or documents.
NEAR – The terms must appear within a certain number of words of each other. The exact distance is dictated by the search engine.
There are many other less common operators, each with rules the engine uses to guide its search. Punctuation marks usually affect the results as well. For example, many engines treat a phrase wrapped in quotation marks as an exact match.
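To make the four operators concrete, here is a minimal sketch of how they could be evaluated against a small positional index. The function names, the sample documents, and the default NEAR distance of 3 words are all assumptions made for illustration, not the behavior of any particular engine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to {doc_id: [word positions]} (a positional index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def docs_with(index, term):
    """Set of document ids that contain the term."""
    return set(index.get(term.lower(), {}))


def search_and(index, a, b):
    """Documents that contain both terms."""
    return docs_with(index, a) & docs_with(index, b)


def search_or(index, a, b):
    """Documents that contain at least one of the terms."""
    return docs_with(index, a) | docs_with(index, b)


def search_not(index, a, b):
    """Documents that contain a but do not contain b."""
    return docs_with(index, a) - docs_with(index, b)


def search_near(index, a, b, max_distance=3):
    """Documents where a and b occur within max_distance words of each other.

    The default of 3 is an arbitrary choice for this sketch.
    """
    hits = set()
    for doc_id in search_and(index, a, b):
        positions_a = index[a.lower()][doc_id]
        positions_b = index[b.lower()][doc_id]
        if any(abs(x - y) <= max_distance
               for x in positions_a for y in positions_b):
            hits.add(doc_id)
    return hits


# Hypothetical sample documents.
DOCS = {
    1: "The fire spread quickly through the old warehouse",
    2: "The manager had to fire an employee on Friday",
    3: "Crews kept the warehouse fire away from the employee parking lot",
}
INDEX = build_index(DOCS)
```

With these sample documents, `search_and(INDEX, "fire", "warehouse")` returns documents 1 and 3, while `search_near(INDEX, "fire", "employee")` returns only document 2, because in document 3 the two words sit four positions apart.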
THE PROBLEM WITH LITERAL SEARCHES
Using the Boolean operators previously described, the engine looks for words or phrases exactly as they are entered. When words have multiple meanings, this can skew the results. Perhaps a user is only interested in one particular meaning of a word or group of words and only wants results relevant to that meaning. In that case the user would have to build a literal search that attempts to eliminate the unwanted meanings, but it would be far more efficient if the search engine itself could accomplish this.
CONCEPT BASED SEARCHING
A concept search is a retrieval method used to search through documents of unstructured text for information that is conceptually similar to the information provided in a search query. This method attempts to search for the ideas rather than the exact words, which opens the floodgates for more sophisticated search engines to be created.
These kinds of engines were created to combat the problem with literal searches described above. Classical Boolean keyword searches all too often result in many non-relevant items and false positives finding their way into the sea of results. Many English words have two or more meanings, which is a major obstacle for computer systems trying to deal with human language. For example, the word fire can mean a combustion activity, to terminate employment, to launch, or to excite. Engines need a way to filter through all the noise of multiple definitions.
GENERAL APPROACHES TO CONCEPT BASED SEARCHING
Information retrieval in general can be divided into two categories: semantic and statistical. Search engines that fall into the semantic category attempt to implement some degree of syntactic and semantic analysis of the text a human user would provide. Systems that fall into the statistical category attempt to find results based on a statistical measure of how closely they match the query provided by the user. The semantic method is the main method behind an engine that utilizes concept based searching.
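As a rough illustration of the statistical category, the sketch below scores each document by the cosine similarity between its word counts and the word counts of the query. Cosine similarity is just one common choice of measure, picked here as an assumption; the post does not say which measure any given engine uses, and the function names are made up for this example.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Count how often each lowercase word appears in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (score, doc_id) pairs, best match first."""
    query_counts = term_counts(query)
    scored = [
        (cosine_similarity(query_counts, term_counts(text)), doc_id)
        for doc_id, text in docs.items()
    ]
    # Sort by descending score, breaking ties by document id.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))
```

A document that shares more of the query's words, relative to its length, ranks higher. Notice that nothing here looks at what the words mean, which is exactly the gap the semantic category tries to fill.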
SEARCH ENGINE SEMANTIC TECHNIQUES
Multiple techniques based on artificial intelligence have been applied to semantic processing in order to create engines that can conduct searches based on the concept of the search rather than the literal words of the search. Most of these programs have relied on auxiliary structures such as controlled vocabularies, which are one way to overcome the weaknesses of Boolean keyword searches. Large scale systems of synonyms have been constructed that can aid a search engine in dealing with words that have multiple definitions. WordNet is one of those systems.
WordNet essentially groups English words into sets of synonyms referred to as synsets, provides definitions and usage examples, and takes note of relationships between certain sets of synonyms. It is a rough combination of a dictionary and a thesaurus. It is primarily used in artificial intelligence applications and various text analysis methods. From what I understand it is free to download, in case anyone wants to check it out.
[Image: WordNet search interface]
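To show how a synonym database can feed a search, here is a small sketch of query expansion. The `SYNSETS` table is a hand-written stand-in that I made up for this example. It is not real WordNet data and it does not use the WordNet API; the sense labels and function names are likewise assumptions.

```python
import re

# Hypothetical, hand-written synonym sets keyed by (word, sense label).
# A real system would load these from a resource such as WordNet.
SYNSETS = {
    ("fire", "combustion"): {"fire", "blaze", "flames"},
    ("fire", "dismiss"): {"fire", "dismiss", "sack", "terminate"},
}


def expand(word, sense):
    """Return the synonym set for a word sense, or just the word itself."""
    return SYNSETS.get((word.lower(), sense), {word.lower()})


def concept_search(word, sense, docs):
    """Ids of documents that contain any synonym of the chosen sense."""
    terms = expand(word, sense)
    hits = []
    for doc_id, text in docs.items():
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        if tokens & terms:
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical sample documents.
DOCS = {
    1: "A blaze destroyed the barn overnight",
    2: "The board voted to dismiss the director",
    3: "Fire crews arrived within minutes",
}
```

Here `concept_search("fire", "combustion", DOCS)` finds documents 1 and 3, even though document 1 never uses the word fire. The sketch also shows the limit of plain expansion: searching the "dismiss" sense still returns document 3, because the word fire belongs to both synonym sets. Telling the senses apart takes more context than a synonym list alone provides.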
In addition to using databases like WordNet to deal with concept searching, there is another common method that relies on statistics about the number of times a group of terms appears together within a certain window of text. For example, a program might check whether a particular group of terms is repeated within a 5 sentence window or a 50 word window within a document. This process works on the idea that words that occur together in similar contexts have similar meanings. The window used to check this is relatively small, which makes for a simple yet effective way of grouping semantic information.
This approach is simple, but it is only able to deal with small bits of the semantic information contained in a document of text. Experiments with this method in the past have shown that only approximately ¼ of the information can be extracted using it, because getting more out of it requires prior knowledge about the content of the text. That knowledge is difficult to teach to AI programs because most of the documents being searched are large, unstructured collections. However, despite these drawbacks, co-occurrence methods are an integral part of a concept based search engine.
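The windowed co-occurrence counting described above can be sketched in a few lines. This version uses a word window rather than a sentence window, and the default of 50 words simply mirrors the example figure mentioned earlier. The function name and the choice to count unordered pairs are assumptions for illustration.

```python
import re
from collections import Counter


def cooccurrence_counts(text, window=50):
    """Count unordered word pairs that fall inside the same word window.

    Two words are counted together when fewer than `window` positions
    separate them, so both fit in a span of `window` consecutive words.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for i, left in enumerate(tokens):
        # Look ahead only, so each pair of positions is counted once.
        for right in tokens[i + 1 : i + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts
```

Pairs with high counts across a collection are candidates for belonging to the same concept. With a tiny window of 3 words, the text "smoke and fire filled the room as smoke and fire spread" pairs smoke with fire twice, but never pairs fire with room.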
FUTURE SEARCH ENGINES
Many companies, particularly in the eDiscovery field, have put together programs that utilize concept based search methods. The statistical analysis methods are constantly being improved to deal with the increased processing requirements of each individual search. Many groups are working on ways to improve the performance and results of this kind of search engine. Other groups of researchers have moved on to another area called natural-language queries.
The idea behind this is that users would be able to type a question the same way they would ask it of a person sitting beside them. There would be no need to keep track of Boolean operators or overly complicated query structures. The most popular natural language query site today is www.AskJeeves.com. For now it only works with simple queries, but competition driven by major search engines such as Google will ensure that a natural language query engine that can deal with extremely complicated queries arrives sometime in the near future.
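As a very naive sketch of the first step such an engine might take, the snippet below strips common filler words from a typed question and keeps the remaining keywords. The stopword list and function name are made up for this example. A real natural-language engine would need far more than this, and nothing here reflects how AskJeeves or Google actually work.

```python
import re

# Hypothetical, deliberately short list of filler words to ignore.
STOPWORDS = {
    "a", "an", "the", "is", "are", "was", "do", "does", "did",
    "how", "what", "when", "where", "who", "why", "which",
    "i", "you", "my", "to", "of", "in", "on", "for", "can",
}


def question_to_keywords(question):
    """Return the non-filler words of a question, in order, without repeats."""
    keywords = []
    for token in re.findall(r"[a-z0-9]+", question.lower()):
        if token not in STOPWORDS and token not in keywords:
            keywords.append(token)
    return keywords
```

The keywords could then be joined with AND behind the scenes, so the user never has to type an operator. For example, `" AND ".join(question_to_keywords("How do I put out a grease fire?"))` produces `put AND out AND grease AND fire`.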