A search engine is a program designed to help find files stored on a computer, for example a public server on the World Wide Web, or one's own computer. The search engine lets a user ask for content meeting specific criteria (typically files containing a given word or phrase) and retrieves a list of files that match those criteria. A search engine often uses a previously built and regularly updated index to look for files after the user has entered search criteria.
In the context of the Internet, the term search engine usually refers to a search of the World Wide Web rather than of other protocols or areas.
Some search engines also mine data available in newsgroups, large databases, or open directories like DMOZ.org. Because their data collection is automated, search engines are distinguished from Web directories, which are maintained by people.
The vast majority of search engines are run by private companies using proprietary algorithms and closed databases, the most
popular currently being Google (with MSN
Search and Yahoo! closely behind). There have been several attempts to create
open-source search engines, among which are Htdig, Nutch, Egothor, and
OpenFTS. [1] (http://www.searchtools.com/tools/tools-opensource.html)
History
The first Web search engine was "Wandex", a now-defunct index collected by the World Wide Web Wanderer, a web crawler
developed by Matthew Gray at MIT in 1993. Another very early search engine,
Aliweb, also appeared in 1993 and still runs today. One of the first engines to later
become a major commercial endeavor was Lycos, which started at Carnegie Mellon University as a research project in
1994.
Soon after, many search engines appeared and vied for popularity. These
included WebCrawler, Hotbot, Excite, Infoseek, Inktomi, and AltaVista. In some ways they competed with popular
directories such as Yahoo!. Later, the directories integrated or added on search engine
technology for greater functionality.
In 2002, Yahoo! acquired Inktomi, and in 2003 it acquired Overture, which owned AlltheWeb and AltaVista. In 2004, Yahoo! launched its own search engine based on the combined technologies of its acquisitions, providing a service that gave pre-eminence to the Web search engine over the directory.
Search engines were also among the brightest stars of the Internet investing frenzy of the late 1990s. Several companies entered the market spectacularly, posting record gains during their initial public offerings. Some have since taken down their public search engines entirely and now market enterprise-only editions; one example is Northern Light (http://www.northernlight.com/), which was among the eight or nine early search engines that appeared after Lycos.
Before the advent of the Web, there were search engines for other protocols or uses, such as the Archie search engine for anonymous FTP sites
and the Veronica search engine for the Gopher protocol.
Osmar R. Zaïane's From Resource Discovery to Knowledge Discovery on the
Internet details the history of search engine technology prior to the emergence of Google.
Recent additions to the list of search engines include a9.com, AlltheWeb, Ask Jeeves, Clusty, Gigablast, Ez2Find, Teoma, WiseNut, GoHook, Walhello, Kartoo,
Snap and Mamma.
Google
Around 2001, the Google search engine rose to prominence. Its success was based in part on the concepts of link popularity and PageRank. PageRank takes into consideration how many other web sites and web pages link to a given page, on the premise that good or desirable pages are linked to more often than others. The PageRank of the linking pages and the number of links on those pages both contribute to the PageRank of the linked page. This lets Google order its results partly by how many web sites, and how highly ranked ones, link to each page found. Google's minimalist user interface was also very popular with users, and has since spawned a number of imitators.
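The link-popularity idea described above can be illustrated with a small sketch. This is not Google's implementation: it is the textbook iterative formulation of PageRank, run on a made-up four-page link graph. The function name `pagerank`, the `toy_web` data, the damping factor of 0.85 (the value conventionally used in descriptions of the algorithm), and the handling of pages without outgoing links are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank for a link graph.

    `links` maps each page to the list of pages it links to.
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with rank spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank equally among the pages it links to,
                # so a link from a page with many links counts for less.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # rank evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: "c" is linked to by three pages, "d" by none.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page with the most incoming links ends up ranked first, and the page nobody links to ends up last, which is the behaviour the paragraph describes.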
Researchers at NEC Research Institute claim to have improved upon Google's patented PageRank technology by using web crawlers to find "communities" of websites. Instead of ranking pages, this technology uses an algorithm that follows links on a webpage to find other pages that link back to the first one, and so on from page to page. The algorithm "remembers" where it has been, indexes the number of cross-links, and relates these into groupings. In this way virtual communities of webpages are found. Teoma's search technology uses a communities approach in its ranking algorithm.
PageRank is based on citation analysis, which was developed in the 1950s by Dr. Eugene Garfield at the University of Pennsylvania; Google's founders cite Garfield's work in their original paper. Web link analysis was first developed by Dr. Jon Kleinberg and his team while working on the CLEVER project at IBM's Almaden research lab. Google and most other web search engines use not only PageRank-style link analysis but more than 150 criteria to determine relevancy.
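The community-finding idea mentioned in this section (follow links, keep only pages that link back, remember where you have been) can be sketched very roughly as follows. This is not the NEC or Teoma algorithm, whose details are not given here; it is a simplified breadth-first traversal over reciprocal links. The function name `find_community` and the example host names are hypothetical.

```python
from collections import deque


def find_community(links, start):
    """Return the set of pages reachable from `start` via reciprocal links.

    `links` maps each page to the list of pages it links to. Two pages are
    treated as connected only if each one links to the other.
    """
    def links_back(a, b):
        return b in links.get(a, ()) and a in links.get(b, ())

    visited = {start}          # the crawler "remembers" where it has been
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in visited and links_back(page, target):
                visited.add(target)
                queue.append(target)
    return visited


# Hypothetical link graph used only for illustration.
toy_web = {
    "cats.example": ["kittens.example", "news.example"],
    "kittens.example": ["cats.example", "catfood.example"],
    "catfood.example": ["kittens.example"],
    "news.example": ["weather.example"],
    "weather.example": [],
}

community = find_community(toy_web, "cats.example")
print(sorted(community))
```

Here the news page is left out of the community because it does not link back, even though the starting page links to it.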
Challenges faced by search engines
- The web is growing much faster than any present-technology search engine can possibly index (see distributed web crawling).
- Many web pages are updated frequently, which forces the search engine to revisit them periodically.
- The queries one can make are currently limited to searching for key words, which may result in many false positives.
- Dynamically generated sites may be slow or difficult to index, or may produce an excessive number of results from a single site.
- Many dynamically generated sites are not indexable by search engines; this phenomenon is known as the invisible web.
- Some search engines do not order the results by relevance, but rather according to how much money the sites have paid
them.
- Some sites use tricks to manipulate the search engine to display them as the first result returned for some keywords. This
can lead to some search results being polluted, with more relevant links being pushed down in the result list.
How search engines work
Web search engines work by storing information about a large number of web pages, which they retrieve from the WWW itself. These pages are retrieved by a web crawler (sometimes also known as a spider), an automated web browser which follows every link it sees. The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Data about web pages is stored in an index database for use in later queries. Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages. The cached page always contains the text that was searched, since it is the copy that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms no longer appear in it. This problem might be considered a mild form of linkrot, and Google's handling of it increases usability by satisfying the user's expectation that the search terms will be on the returned web page.
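The crawl, analyze, index, and cache steps described above can be sketched in a few lines. To stay self-contained, the example crawls a tiny in-memory "web" (a dictionary of made-up pages) instead of making network requests; the names `PageParser`, `crawl_and_index`, and `TOY_WEB` are hypothetical, and a production crawler would also need to handle robots rules, relative URLs, politeness delays, and much more.

```python
import re
from collections import defaultdict, deque
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects link targets and text content from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def crawl_and_index(fetch, start_url):
    """Crawl from `start_url`, following every link that is found.

    `fetch(url)` must return the page's HTML, or None for a dead link.
    Returns (index, cache): `index` maps each word to the set of URLs
    containing it, and `cache` maps each URL to the HTML as it was indexed.
    """
    index = defaultdict(set)
    cache = {}
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # dead link: nothing to index
        cache[url] = html  # keep the copy that was actually indexed
        parser = PageParser()
        parser.feed(html)
        parser.close()
        text = " ".join(parser.text_parts).lower()
        for word in re.findall(r"[a-z0-9]+", text):
            index[word].add(url)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index, cache


# Made-up pages standing in for the real Web.
TOY_WEB = {
    "/home": '<title>Home</title><h1>Search engines</h1>'
             '<a href="/crawlers">crawlers</a>',
    "/crawlers": '<h1>Web crawlers</h1><p>A crawler follows links.</p>'
                 '<a href="/home">home</a><a href="/missing">gone</a>',
}

index, cache = crawl_and_index(TOY_WEB.get, "/home")
print(sorted(index["crawlers"]))
```

Because the cache stores the HTML exactly as it was when indexed, a later change to the live page does not affect which words the cached copy contains.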
When a user comes to the search engine and makes a query, typically by giving key words, the engine looks up the index and provides a listing of best-matching web pages according to its
criteria, usually with a short summary containing the document's title and sometimes parts of the text.
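A query against such an index might look roughly like the sketch below. The scoring rule (counting how often the query words occur in each page) is only a stand-in chosen for illustration; real engines use their own, far more elaborate criteria. The names `tokenize`, `build_index`, `search`, and the sample `DOCUMENTS` are hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of URLs whose title or text contains it."""
    index = {}
    for url, doc in documents.items():
        for word in tokenize(doc["title"] + " " + doc["text"]):
            index.setdefault(word, set()).add(url)
    return index


def search(query, documents, index, snippet_length=60):
    """Return (url, title, snippet) tuples for pages containing every query word."""
    terms = tokenize(query)
    if not terms:
        return []
    # Look each key word up in the index and keep pages matching all of them.
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    scored = []
    for url in matches:
        doc = documents[url]
        words = tokenize(doc["title"] + " " + doc["text"])
        # Assumed scoring rule: total occurrences of the query words.
        score = sum(words.count(term) for term in terms)
        scored.append((score, url, doc["title"], doc["text"][:snippet_length]))
    # Best score first; ties broken by URL so the order is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(url, title, snippet) for _, url, title, snippet in scored]


# Made-up documents used only for illustration.
DOCUMENTS = {
    "/a": {"title": "Web crawlers",
           "text": "A crawler follows links between web pages."},
    "/b": {"title": "Cooking",
           "text": "Recipes collected from the web."},
    "/c": {"title": "Web search",
           "text": "A web search engine indexes web pages."},
}
INDEX = build_index(DOCUMENTS)

results = search("web pages", DOCUMENTS, INDEX)
for url, title, snippet in results:
    print(url, "|", title, "|", snippet)
```

Each result carries the document's title and the beginning of its text, mirroring the short summary a search engine shows next to each listing.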
There is another main type: real-time search engines (such as Orase (http://www.orase.com), which is now defunct).
Such search engines do not use an index; the information a search needs is collected only when a new query is started.
Compared with the index-based systems of Google-like search engines, a real-time system has some advantages: the information is always up to date, there are (almost) no dead links, and fewer system resources are needed. (Google uses almost 100,000 computers, Orase only one.) There are disadvantages too: for example, a search takes longer to finish.
The usefulness of a search engine
depends on the relevance of the results it gives back. While there may be
millions of Web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than
others. Most search engines employ methods to rank the results to provide the "best" results first. How a search engine decides which pages are the best matches, and what order the results
should be shown in, varies widely from one engine to another. The methods also change over time as Internet usage changes and new
techniques evolve.
Most Web search engines are commercial ventures supported by advertising
revenue and, as a result, some employ the controversial practice of allowing advertisers to pay money to have their listings
ranked higher in search results.