Searching for information
With millions of web pages available, it is virtually impossible to find the one you need without the help of a search engine. Search engines operate by retrieving (spidering) all pages from a site and indexing the words therein in a database. A query is then compared against the database, and links to matching pages are presented to the user.
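The indexing step described above can be sketched as a simple inverted index. The pages and their contents below are invented placeholders, not real web pages:

```python
from collections import defaultdict

# Invented placeholder pages standing in for spidered web documents.
pages = {
    "page1": "search engines index web pages",
    "page2": "web crawlers retrieve pages for indexing",
    "page3": "music recognition uses signatures",
}

# Inverted index: each word maps to the set of page IDs containing it.
index = defaultdict(set)
for page_id, text in pages.items():
    for word in text.lower().split():
        index[word].add(page_id)

def search(query):
    """Return IDs of pages containing every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = index.get(words[0], set()).copy()
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

A query such as `search("web pages")` then returns the IDs of all pages containing both words, which the engine can present as links.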
A difficult task is to determine the relevance of the matching pages to a query. Since no standard exists to indicate the relevance of a page (and any such standard would be doomed to fail through abuse by unscrupulous website owners wanting to rank first for any query), search engines employ various heuristics with varying degrees of success. It is desirable to have a heuristic with a high probability of success.
Some known heuristics are:
- a page is more relevant to a query if the words from the query are present in the page title or in the heading(s) on the page;
- a page is more relevant to a query if other pages link to that page using the words from the query in or near the link;
- a page is more relevant to a query if many pages that are also relevant to the query link to it;
- a page which mentions some words many times, or the same word repeatedly in the same color as the background color, is trying to appear more relevant and so is actually not relevant at all.
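A toy scoring function combining some of these heuristics might look as follows. The weights, the field names, and the spam threshold are purely illustrative assumptions, not values used by any real engine:

```python
def relevance_score(query_words, page):
    """Toy relevance score combining title and body heuristics.

    `page` is a dict with illustrative "title" and "body" fields;
    all weights are invented for the sketch.
    """
    score = 0.0
    title_words = set(page["title"].lower().split())
    body_words = page["body"].lower().split()
    for word in query_words:
        w = word.lower()
        if w in title_words:
            score += 3.0                     # query word appears in the title
        score += body_words.count(w) * 0.5   # occurrences in the body
    # Crude spam penalty: one word dominating the body looks like keyword
    # stuffing, so the page is demoted rather than boosted.
    if body_words:
        top = max(body_words.count(w) for w in set(body_words))
        if top / len(body_words) > 0.5:
            score *= 0.1
    return score
```

The last rule implements the fourth heuristic above: a page repeating the same word excessively is treated as less relevant, not more.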
One heuristic which appears to work exceptionally well is employed by Google. This search engine computes a so-called PageRank (patent pending, not published) for every page it indexes, and orders the results by this PageRank.
The PageRank is computed based on two assumptions. First, a page to which many pages link should get a higher ranking. Second, a link from a page with a high ranking is worth more than a link from a low-ranked page. The importance of a page is also determined from anchor text, font size, location in the document, and proximity. All of these factors are combined to compute the overall relevance of a page. This computation corresponds to computing the principal eigenvector of the normalized link matrix of the web (which is not for the faint of heart -- see the PageRank paper for details). The results are surprisingly accurate.
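The eigenvector computation can be approximated by power iteration, as in this minimal sketch. The three-page link graph and the damping factor of 0.85 are illustrative assumptions for the sketch, not details taken from the PageRank paper:

```python
# Invented three-page link graph: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration sketch of PageRank.

    Each iteration redistributes every page's rank over its outgoing
    links, so highly linked-to pages accumulate a higher rank.
    """
    page_ids = list(links)
    n = len(page_ids)
    rank = {p: 1.0 / n for p in page_ids}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in page_ids}
        for p, outgoing in links.items():
            share = rank[p] / len(outgoing)
            for q in outgoing:
                new[q] += damping * share
        rank = new
    return rank

ranks = pagerank(links)
```

In this toy graph, C receives links from both A and B while B receives only half of A's rank, so C ends up ranked above B, matching the two assumptions described above.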
Another method for ranking documents is used by Altavista (US 6,112,203). Using this method, a graph is created of all the indexed documents based on the way they link to each other. The graph is analyzed, and documents are grouped based on the fact that they have a large number of links in common. Such a group is then related to a particular topic, which can be found by scanning for common words and themes. In each group, each document is assigned a weight based on its relevance to the topic. When the user then enters a query, only the most relevant documents from groups related to the query are presented.
When the documents in question comprise music, indexing and matching are much more difficult than with textual documents. To allow searching, identifiers can be embedded in pieces of music using watermarking techniques. This is known as content identification. A server can then automatically detect the identifier and perform a database lookup to determine the title, artist and so on of the piece of music in question. However, this has the disadvantage that the watermark must be inserted beforehand.
As an alternative, music recognition technologies have been developed. These technologies attempt to detect characteristic features in the piece of music and compute a signature or hash based on these features. The signature is then stored in a database. When a piece of music is received, its signature is computed and compared against the signatures stored in the database. This way, no identifying information needs to be present in the music itself, although reliably detecting characteristic features is very difficult technically. Further, the database lookup can be very time-consuming.
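A sketch of the lookup side of such a system, assuming the hard part -- extracting robust characteristic features from audio -- is already done. Here a "feature" is just a stand-in string, and the function and database names are invented for the sketch:

```python
import hashlib

# Signature database: signature -> identifying metadata.
database = {}

def signature(features):
    """Derive a fixed-size signature (hash) from an extracted feature string.

    Real systems compute robust features from the audio itself; the
    string here is a placeholder for that step.
    """
    return hashlib.sha256(features.encode()).hexdigest()

def register(features, title, artist):
    """Store the signature of a known piece together with its metadata."""
    database[signature(features)] = {"title": title, "artist": artist}

def identify(features):
    """Look up a received piece by its signature; None if unknown."""
    return database.get(signature(features))

register("spectral-peaks-xyz", "Example Song", "Example Artist")
```

A real lookup would also need to tolerate signatures that are close but not identical, which is one reason the database search can be so time-consuming.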
Watermarks and hashing technologies allow many new business opportunities. For example, a service could be offered in which music is recognized for a fee. The service could operate by letting a consumer call a special phone number to transmit a piece of music to a server using the microphone in his mobile telephone. The server computes the hash or detects the watermark. Using this information, the server then performs a database lookup to extract identifying information and sends the identifying information to the consumer, for example as an SMS message.
When presenting the results of a query to the user, it is important to avoid cluttering the presentation with multiple documents that are (virtually) the same. To this end, documents that match the query are clustered. One simple way to cluster is to group documents by site or common URL prefix. When presenting results, only a limited number of documents having a common URL prefix are included, preferably with the option to show more results from that site. This approach is based on the assumption that documents on one site are probably very much related, and the user will be able to find the most relevant one by starting at the one shown.
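Grouping results by common URL prefix can be sketched as follows, here using the host name as the prefix and an assumed limit of two results per site:

```python
from urllib.parse import urlsplit

def cluster_by_site(urls, max_per_site=2):
    """Keep at most max_per_site results per host, preserving order.

    The host name serves as the common URL prefix; a real engine could
    use a longer prefix, and would offer a "more from this site" link
    for the suppressed results.
    """
    shown = []
    per_site = {}
    for url in urls:
        host = urlsplit(url).netloc
        count = per_site.get(host, 0)
        if count < max_per_site:
            shown.append(url)
        per_site[host] = count + 1
    return shown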
This approach does not solve the problem of mirrors, where the same document is made available from multiple independent sites, possibly in different languages. In such a case, the document will be listed multiple times, once for each mirror. A text-based comparison can be employed to find these mirrors, but this is not always accurate. Often, mirrored documents contain extra text referring to the original site, or an indication that they are a mirror. This extra text can easily cause an automated comparison to fail. And of course a translation into another language is always marked as a different document.
Altavista solved this problem by comparing the hyperlinks in both documents (US 6,138,113). If both documents link to the exact same set of other pages, the documents are considered to be the same and only one of them is shown.
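The comparison itself reduces to a set equality test over the outgoing links of each document; the function name below is hypothetical:

```python
def same_by_links(links_a, links_b):
    """Treat two documents as duplicates when their sets of outgoing
    hyperlinks are exactly equal, regardless of link order."""
    return set(links_a) == set(links_b)
```

This works for mirrors and translations alike, since the extra "this is a mirror" text and the translated prose do not change which pages the document links to (as long as the mirror does not rewrite the links themselves).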
Another clustering algorithm compares the frequency of words in both versions, and clusters documents if many words have almost the same frequency in every document. A dictionary can be employed to roughly translate documents in different languages before comparing frequencies. Since frequency counting is typically only done after words have been converted to their normal form ("taught" is converted to "teach", for instance), a rough dictionary-based translation is sufficient for this purpose.
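Comparing frequency profiles can be done with cosine similarity over relative word frequencies, as in this sketch; no normal-form conversion or dictionary translation is attempted here:

```python
import math
from collections import Counter

def word_frequencies(text):
    """Relative frequency of each word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

def similarity(freq_a, freq_b):
    """Cosine similarity of two frequency profiles: 1.0 means identical
    word distributions, 0.0 means no words in common."""
    words = set(freq_a) | set(freq_b)
    dot = sum(freq_a.get(w, 0) * freq_b.get(w, 0) for w in words)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Two documents would then be clustered when their similarity exceeds some threshold; the stemming and rough translation described above would be applied before counting.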
The relatedness of two documents can also be derived from the fact that they relate to the same topic (as determined using summarizing algorithms), and link to each other, or both are linked to from a common third document.
These clustering algorithms can also be used to generate a listing of documents that are similar to a chosen one. Many search engines offer with each result the option to "show similar documents". This allows a user to easily narrow down his search.
Some known solutions to common problems are:
- If a user makes a spelling error in his query, he is likely to get the wrong results, or no results at all. Some search engines perform a spell check on the query to notify the user of this problem. The user can then correct his query and try again. Alternatively, the search engine attempts to correct the misspelling itself so the user will get the correct results (US 6,144,958).
- When a user enters a broad query, the number of matching documents is usually far too large to handle. Still, users expect highly accurate results. In such a case, a search engine should provide a way to narrow down the query. One way to solve this problem is to suggest further terms that can be used to narrow down the search, based on previous queries performed by the same or other users. For instance, if a user searches for cars and many people searching for cars accessed information on Toyota cars, then a possible term to narrow down his search is Toyota (US 6,185,558).
- A correlation can be made based on the search patterns of multiple users. If the same title is used for multiple books or CDs, but most users pick the book or CD from one particular author or artist, then a suitable term to narrow down the search is the name of that author or artist (US 6,006,225; US 6,169,986).
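The spelling-correction approach mentioned in the first item can be sketched with a Levenshtein edit distance against a vocabulary of known query terms. The distance threshold of 2 is an assumption for the sketch:

```python
def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def suggest(term, vocabulary, max_distance=2):
    """Return the closest known term within max_distance edits, or None.

    A real engine would draw the vocabulary from its index or from
    previous user queries.
    """
    candidates = [(edit_distance(term, w), w) for w in vocabulary]
    candidates = [(d, w) for d, w in candidates if d <= max_distance]
    return min(candidates)[1] if candidates else None
```

The engine can either show the suggestion to the user ("did you mean...?") or silently substitute it, as in the automatic-correction variant described above.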
Further, it is desirable to offer features that make the search engine more attractive to use. Some known features are:
- If the number of query terms entered is sufficient to uniquely determine one matching document, the search engine can directly deliver the one matching document. For instance, in a book search, the full ISBN will match one particular book, so it makes sense to directly show the details page for that book. Google provides a button labeled "I'm feeling lucky" on the query page, which, when pressed, directly takes the user to the first matching document.
- Search engine Google offers a so-called "Google Toolbar", which is an add-on for Internet Explorer. This toolbar installs as an extra set of buttons in the browser, and allows the user to directly enter query terms, which are submitted to the Google search engine. This saves the user from having to go to the Google homepage first. It also provides a button which, when pressed, directly shows documents similar to the current one.
- Some search engines allow users to view previous queries by other users, for example in the form of a "top 10" list, or a "real-time sneak peek" showing what others are searching for right now. This can make searching more interesting or exciting, although there are obvious privacy ramifications that need to be taken into account.
A crawler is a program that retrieves webpages, commonly for use by a search engine or a web cache. Basically, the crawler takes an initial webpage, extracts all the URLs in it, and retrieves those. It then repeats this procedure for every webpage retrieved. It is clear that this procedure generates a large amount of network traffic, and cannot succeed in retrieving every document on the WWW. And by the time it has retrieved a substantial portion of the WWW, the pages it retrieved first will already be out of date.
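The crawl loop can be sketched as a breadth-first traversal with a visited set. Here `fetch()` reads from an invented in-memory "web" instead of the network (a real crawler would download each page and parse its HTML for links), and the page limit guards against unbounded crawling:

```python
from collections import deque

# Invented stand-in for the web: url -> links found on that page.
fake_web = {
    "http://example.org/":  ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b"],
    "http://example.org/b": [],
}

def fetch(url):
    """Placeholder for downloading a page and extracting its URLs."""
    return fake_web.get(url, [])

def crawl(seed, max_pages=100):
    """Breadth-first crawl from a seed URL, retrieving each page once."""
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch(url):
            if link not in seen:      # never retrieve the same page twice
                seen.add(link)
                queue.append(link)
    return visited
```

A polite crawler would additionally insert a delay between requests to the same host, to avoid the overloading problem mentioned above.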
Further, it is desirable that a crawler does not overload a single site by repeatedly downloading webpages at high speed.
When presenting the results of a query, it is desirable to show a short summary of what each result is about. The most common way to do this is to show the first few sentences or words from the document in question. However, these words often are not related to the topic of the document, but are advertisements, navigational elements ("Home", "People", "Search", and so on) or other generic texts. This makes them a suboptimal choice for summarizing the document.
To overcome this problem, the Hypertext Markup Language provides a tag with which an abstract can be provided. The author of a document simply adds this tag to the document, and a search engine can use its contents as the summary when presenting the document. Unfortunately, a substantial number of authors do not accurately summarize their documents.
Another approach, which requires storing the full text of every document that was indexed, is to show sentences from the document in which at least one word from the query occurs. This allows the user to quickly determine whether the document is relevant to his query (instead of being an accidental match, for instance because the word in the document is used in a different sense than in the query).
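This last approach can be sketched as follows; the sentence splitting and punctuation handling are deliberately crude:

```python
import re

def snippet(document, query, max_sentences=2):
    """Pick sentences from the stored full text that contain at least
    one word of the query, to show as a result summary."""
    query_words = {w.lower() for w in query.split()}
    # Naive sentence split on punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", document)
    hits = [s for s in sentences
            if query_words & {w.lower().strip(".,!?") for w in s.split()}]
    return " ".join(hits[:max_sentences])
```

Because the snippet shows the query word in its actual context, the user can see at a glance whether the word is used in the sense he intended.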