Today, Google and other search engines are more sophisticated than ever. They use advanced machine learning algorithms to interpret human language and rank search results. Anyone who has used a search engine has probably wondered how it does such a good job of finding and ranking relevant results. In this post we briefly describe how search engines came to be and then explain how modern engines crawl and index the web. The ranking algorithms themselves, and what the future has in store for search engines, are covered in part 2.
EARLY SEARCH ENGINES
Before search engines existed, the internet was a small collection of FTP (File Transfer Protocol) sites where users could browse shared files. There wasn’t a whole lot you could actually do on the internet back then, but as the number of servers grew, people needed a way to sort through everything.
The very first program that even remotely resembled a search engine was called “Archie”. Created in 1990, it acted as an archive (hence the name, “archive” without the “v”) of the files stored on anonymous FTP sites. It was very primitive compared to today’s search engines: it could only match file names against keywords, not the contents of the files.
In 1994 the first search engine to offer full-text search, WebCrawler, was created by an engineering student in his spare time. Instead of matching only titles, it searched the entire text of every page it had indexed against the user’s query. In a lot of ways it resembled an actual spider, crawling over whole bodies of text to make a more accurate match. Later versions of WebCrawler used more specialized algorithms to increase its accuracy.
The most popular search engine we all know and love (well, not love exactly), Google, grew out of a research project that began in 1996 under the name “BackRub”; the google.com domain was registered in 1997 and the company was founded in 1998. The name BackRub came from the way the engine counted the links pointing back to a website to estimate its value. Google now accounts for over 70% of web searches. Growing privacy concerns may hinder its growth in the future, but for now its market share continues to increase steadily.
An old WebCrawler interface. Things have changed a lot that’s for sure.
HOW DO MODERN SEARCH ENGINES WORK?
Internet search engines are sites designed to help people find web pages related to the keywords they enter. They let us sort through the enormous number of pages out there to find answers to specific questions or general information about a subject. Today’s search engines use sophisticated algorithms, but at the most basic level they all perform three tasks: they crawl the internet, or a selected part of it, looking for important words; they keep an index of the words they find and where they found them; and they let users look up words or combinations of words in that index.
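To make the second and third tasks concrete, here is a minimal sketch of a word index and a lookup over it. The page names, the sample text and the function names `build_index` and `search` are made up for illustration; real engines store far more than a set of page names per word.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page names that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in text.lower().split():
            index[word].add(name)
    return index


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        # Keep only pages that also contain this word.
        results &= index.get(word, set())
    return results


# Hypothetical pages standing in for crawled documents.
pages = {
    "a.html": "search engines crawl the web",
    "b.html": "spiders crawl pages and follow links",
    "c.html": "a hash table speeds up lookups",
}
index = build_index(pages)

print(sorted(search(index, "crawl")))      # ['a.html', 'b.html']
print(sorted(search(index, "crawl web")))  # ['a.html']
```

A query with several words is answered by intersecting the page sets of each word, which is one simple way to support "combinations of words".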
Today’s engines index hundreds of millions of documents and respond to millions of inquiries in a single day. To find those documents they rely on automated programs commonly referred to as “spiders”. The way spiders crawl over the content of web pages has changed a lot since WebCrawler peaked in popularity, but most of the basic concepts are the same.
Spiders build lists of the words they encounter while crawling a web page, then follow each link on that page and build more lists from the pages they reach, and so on. The crawl usually starts from heavily trafficked servers or from popular pages that the search engine’s ranking system already values highly. Large search engines like Google run many spiders at once, branching out in different directions at high speed. Not every word on a page counts equally: words that appear in prominent places, such as the title, the headings and the page’s meta tags, are typically given more weight in the index, which allows more specific search results to be delivered.
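The crawl loop described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the "web" is a hypothetical in-memory dictionary called `FAKE_WEB` rather than real HTTP requests, and the names `PageParser` and `crawl` are invented for the example.

```python
from collections import deque
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collect the words and the outgoing links of one HTML page."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


# Hypothetical in-memory "web": URL -> HTML. A real spider would fetch over HTTP.
FAKE_WEB = {
    "/home": '<p>Welcome to search</p><a href="/about">About</a><a href="/faq">FAQ</a>',
    "/about": '<p>About spiders</p><a href="/home">Home</a>',
    "/faq": '<p>Common questions</p><a href="/missing">Broken</a>',
}


def crawl(seed):
    """Breadth-first crawl from the seed URL; return {url: list of words}."""
    seen = {seed}
    queue = deque([seed])
    word_lists = {}
    while queue:
        url = queue.popleft()
        html = FAKE_WEB.get(url)
        if html is None:
            continue  # dead link: nothing to parse
        parser = PageParser()
        parser.feed(html)
        parser.close()
        word_lists[url] = parser.words
        for link in parser.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return word_lists


word_lists = crawl("/home")
for url, words in word_lists.items():
    print(url, words)
```

The `seen` set is what keeps the spider from looping forever when pages link back to each other, as `/about` does to `/home` here.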
Meta tags are HTML elements in a page’s header that let the page owner describe the page and suggest keywords under which it should be indexed by any spider that crawls it. These tags can guide a search engine towards relevant results even when a keyword has multiple meanings. Because owners could otherwise assign popular keywords to pages that have no matching content, the tags are checked against the content of the page as a whole. This is not a new concept, and major engines such as Google now give the keywords tag little or no weight in ranking, but description tags are still widely used when presenting results.
An example of a description for a Meta Tag.
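For readers who have not seen one, the sketch below shows what description and keywords meta tags look like in a page and how a spider might read them. The sample page, the class name `MetaTagParser` and the substring check at the end are assumptions made for illustration, not how any particular engine validates tags.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect <meta name=... content=...> pairs and the visible text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name = attrs.get("name")
            content = attrs.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content

    def handle_data(self, data):
        self.text_chunks.append(data)


# Hypothetical page used only for this example.
PAGE = """<html><head>
<title>Beginner's Guide to Search Engines</title>
<meta name="description" content="How spiders crawl and index the web.">
<meta name="keywords" content="search engines, spiders, indexing">
</head><body><p>Spiders crawl pages and build an index.</p></body></html>"""

parser = MetaTagParser()
parser.feed(PAGE)
parser.close()

keywords = [k.strip() for k in parser.meta.get("keywords", "").split(",")]
page_text = " ".join(parser.text_chunks).lower()

# Crude check: keep only the keywords that actually occur in the page text.
supported = [k for k in keywords if k in page_text]

print(parser.meta["description"])
print(keywords)
print(supported)
```

Here "indexing" is declared as a keyword but never appears in the page text, so this naive check would discard it.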
BUILDING A DATABASE OF INDEXES
The spiders never really stop, because the world wide web changes so quickly. Once a page or set of pages has been parsed, the information is stored in an index that the search engine consults in response to a query. The engine also needs a way of assigning a “weight” to each URL the spiders have parsed. Some methods are simple, such as counting how many times a keyword appears on the page; in theory, a higher count signifies a page closer to what the user is looking for. Mainstream search engines use a variety of formulas for this process.
Other methods of weighting potential search results are much more subtle, but the end result is the same: a database of indexes. This data is encoded so that the information in each index can be retrieved as quickly as possible. One common way of doing this is to build a hash table.
“Hashing” means applying a formula that assigns a numerical value to each word. A good formula distributes the entries evenly across all of the divisions in the table. This numerical distribution is different from the distribution of words across the alphabet, and that is what makes a hash table so effective.
In English, some letters begin far more words than others. If you look up a word in a dictionary that begins with a common letter, there are many more entries to search through and finding it takes longer. A hash table evens out this discrepancy. There are several ways to organize a hash table for maximum efficiency, which are beyond the scope of this post. The basic idea, however, is that with effective indexing and a hash table, results of all kinds can be pulled very quickly from the database of indexes built by the crawling spiders.
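The simplest weighting method mentioned above, counting keyword occurrences, might look like the following. The URLs, the sample text and the function names `keyword_weight` and `rank` are hypothetical, and real engines combine many more signals than a raw count.

```python
import re
from collections import Counter


def keyword_weight(text, keyword):
    """Weight of a page for a keyword: how many times the keyword appears."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)[keyword.lower()]


def rank(pages, keyword):
    """Return (weight, url) pairs for matching pages, heaviest first."""
    scored = [(keyword_weight(text, keyword), url) for url, text in pages.items()]
    matches = [(weight, url) for weight, url in scored if weight > 0]
    # Sort by descending weight, then by URL so ties come out in a stable order.
    return sorted(matches, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical parsed pages.
pages = {
    "/hash": "A hash table stores keys by hash value. Hash lookups are fast.",
    "/crawl": "Spiders crawl the web and hash nothing.",
    "/index": "An index maps words to pages.",
}

print(rank(pages, "hash"))  # [(3, '/hash'), (1, '/crawl')]
```

A pure count is easy to game by repeating a word, which is one reason engines rely on subtler signals as well.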
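The difference between grouping by first letter and grouping by hash value can be seen in a small sketch. It uses `zlib.crc32` from the Python standard library purely as a stand-in hashing formula; the bucket count, the word list and the names `bucket_for`, `build_table` and `lookup` are assumptions for the example, and how even the spread is will depend on the formula and the data.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 8  # arbitrary table size chosen for the example


def bucket_for(word, num_buckets=NUM_BUCKETS):
    """Hash a word to a bucket number (CRC32 used as a stand-in formula)."""
    return zlib.crc32(word.encode("utf-8")) % num_buckets


def build_table(index_entries, num_buckets=NUM_BUCKETS):
    """Place each (word, urls) entry in the bucket its hash selects."""
    table = [[] for _ in range(num_buckets)]
    for word, urls in index_entries.items():
        table[bucket_for(word, num_buckets)].append((word, urls))
    return table


def lookup(table, word):
    """Scan only the one bucket that the word hashes to."""
    for stored_word, urls in table[bucket_for(word, len(table))]:
        if stored_word == word:
            return urls
    return []


# Hypothetical index entries; most of these words start with "s" on purpose.
words = ["search", "spider", "server", "site", "score",
         "sort", "store", "string", "zebra", "query"]
entries = {word: ["/pages/" + word] for word in words}
table = build_table(entries)

by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)

print("largest first-letter group:", max(len(g) for g in by_letter.values()))
print("hash bucket sizes:", [len(bucket) for bucket in table])
print(lookup(table, "spider"))
```

Grouped alphabetically, eight of the ten words land under "s"; the hash buckets do not depend on spelling, so a lookup typically has fewer candidates to scan.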
THE ACT OF SEARCHING THROUGH THE INDEX
Searching through the index requires the user to build a query and submit it to the search engine. Several high-level algorithms govern the process that begins when the user hits “search”. These will be analyzed in detail in part 2 of this series on search engines, where we will also speculate about what the future has in store for search engines.