Friday, August 04, 2006

A Brief History of Internet Search

The following is part of a paper on Google Book Search that I wrote for a law school seminar. I am currently reworking it for inclusion in a larger paper that addresses a broader spectrum of Google's copyright issues, including Book Search, Video, Images, News, and search results generally. It describes the development, both physical and intellectual, of search engines and the need to organize information. While it is difficult to say with certainty whether search engines spurred the creation of content or whether they were a response that leeches off the value of content, this paper takes the position that effective search technology enables a new level of creation, both more in-depth and more efficient, without which creation suffers. Looking at the history of Internet search shows that Google is but the latest in a long line of search tools and that, if further development is prevented through the threat of copyright lawsuits, Google may very well be the last great search development that can economically be put to widespread use. Any thoughts or comments would be greatly appreciated.

A Brief History of Internet Search [1]

In arguing for the importance of search engines as enablers of speech, it is necessary to discuss the history of the Internet. From the Internet's very beginnings, the emphasis has been on organizing the wealth of human knowledge into a comprehensible, useful format in order to serve a larger public interest.

One of the first calls for an information-organizing system resembling the Internet came from Vannevar Bush. His Atlantic Monthly article of July 1945, entitled As We May Think,[2] called attention to the problem of an ever-growing body of scientific research without a practical way of putting it to use. He believed that for new research to be useful, it needed to be recorded, continually added to, and continually consulted. He further noted the inefficiencies of the indexes of his time:

Our ineptitude in getting at the record is largely caused by the artificiality of the systems of indexing. ... Having found one item, moreover, one has to emerge from the system and re-enter on a new path.

The human mind does not work this way. It operates by association. ... Man cannot hope fully to duplicate this mental process artificially, but he certainly ought to be able to learn from it. In minor ways he may even improve, for his records have relative permanency.

Presumably man's spirit should be elevated if he can better review his own shady past and analyze more completely and objectively his present problems. He has built a civilization so complex that he needs to mechanize his records more fully if he is to push his experiment to its logical conclusion and not merely become bogged down part way there by overtaxing his limited memory.

Bush introduced the concept of the memex, a microfilm device that could store all of a person's books and papers and could be consulted when needed.[3] Though the memex was ultimately a flawed system, Bush's approach to organizing information led to the creation of hypertext, a system of automated cross-references that connect text in one document to another related document.[4]

The precursor to the Internet was the Advanced Research Projects Agency Network (ARPANET), which was developed by the US Department of Defense and went online October 29, 1969.[5] The first ARPANET link connected only a computer at UCLA to one at the Stanford Research Institute; by 1981 the network had grown to 213 hosts. Though it is often claimed that ARPANET was initially funded out of a desire to preserve a military command structure in the event of a nuclear attack, ARPA Director Charles Herzfeld has said that ARPANET grew out of frustration that only a few large, powerful research computers were available in the country and out of an interest in giving researchers access to them despite geographical limits.[6] ARPANET pioneered "packet switching," the process that powers the network underlying the Internet today.[7]

The Internet progressed slowly from 1969 to 1990 as new computers trickled onto the network. During this time there was little way to find a particular document on the network without knowing exactly where it was or learning of it through word of mouth. Files were shared through the File Transfer Protocol[8] (FTP), which requires one user to upload a file onto a server, making it available to anyone who finds it. Many people uploaded their own files to their own servers, leaving information scattered across the network. Because there was no central clearinghouse for information and no way to search across all the servers, it was difficult to find what one was looking for.

Archie changed all that in 1990 when it became the first search engine.[9] Archie was created by Alan Emtage, a student at McGill University in Montreal, and got its name because a peculiarity of Unix naming conventions required a shorter version of its intended name, Archives. It worked by contacting individual FTP servers and requesting a list of all their files. Archie then compiled those lists and made them searchable through a Unix command similar to the "find" function on a PC.[10] Archie was merely a database of filenames that users could search and scroll through.

Several other search tools popped up after Archie. Gopher[11] was created at the University of Minnesota and functioned much like Archie in that it organized text files into a searchable structure. The popularity of Gopher spawned two search engines, Veronica and Jughead. Veronica,[12] created by the University of Nevada System Computing Services, was much like Archie but dedicated to searching for Gopher files across multiple servers. Jughead,[13] a further attempt at an Archie-like search engine, searched for text files on particular servers.

It was during this period that the first modern website was created by Tim Berners-Lee in 1991, while he was working at the European Organization for Nuclear Research (CERN).[14] Berners-Lee is credited with creating what we know as the World Wide Web. The Web is a service that functions on top of the Internet: a system that rests on the Internet's infrastructure and makes the network more navigable. By combining hypertext and links with web pages viewable through web browsers, people could now create graphical web pages rather than mere directories of files, and could link to other documents without the permission of the host.[15] In 1993, after the University of Minnesota decided to charge for the use of Gopher, CERN declared that the Web would be free for anyone to use. Berners-Lee had originally been concerned with making his research available to a larger audience, and the decision to make the Web free was made to ensure that its use would become widespread.[16]

The amount of available content began to increase steadily with the creation of the Web. The World Wide Web Wanderer,[17] created in 1993 to measure the growth of the Web, was the world's first Internet robot, or "spider." A robot is an automated program that explores the Internet looking for web pages; the term spider was chosen to fit the metaphor of a web of pages connected by links. The spider travels through the Internet looking for web pages, sends copies of the pages back to its server, and records information about all the links contained in them. The information gathered by the Wanderer was used to create the first web search engine, Wandex, in 1993.[18] The Wanderer itself was controversial in that its robot consumed a great deal of bandwidth while doing its work, slowing down the entire Internet.
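
As a rough illustration of how a spider works, here is a minimal sketch of a crawler in Python; the seed URL, the page limit, and the choice of the standard urllib and html.parser modules are assumptions made for the example, not a description of how the Wanderer was actually built.

# A toy "spider": fetch pages, keep copies, and record the links found on each.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    # Collects the href targets of <a> tags on a page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    # Breadth-first crawl: returns copies of pages and the links found on each.
    pages, link_graph = {}, {}
    queue, seen = deque([seed_url]), {seed_url}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip pages that cannot be fetched
        parser = LinkParser()
        parser.feed(html)
        links = [urljoin(url, href) for href in parser.links]
        pages[url], link_graph[url] = html, links
        for link in links:
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages, link_graph

if __name__ == "__main__":
    pages, graph = crawl("https://example.com", max_pages=3)
    for url, links in graph.items():
        print(url, "->", len(links), "links")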

Based on Wandex's success, Archie-Like Indexing for the Web (Aliweb) was introduced late in 1993.[19] Unlike Wandex, Aliweb relied on users to submit their sites to the index rather than on robots that would slow down the Web. This allowed site owners to customize how their sites would be displayed to users searching for information. Aliweb was the first "modern" search engine in that it was more than an index or directory of pages or documents located on servers.[20] The information that users supplied about their sites, the first use of "meta-data"[21] to classify the contents of web pages, was searched by key words, returning results that matched the original query.

It was around this time that web search came to be seen as a profitable business, owing to the growth of content and to increased Internet use beyond academic research, and an explosion of search engines and their functions followed. The first big innovation came from WebCrawler in 1994, the first search engine to perform full-text searches rather than searches of only meta-data or page and document titles.[22] Later in 1994, Lycos introduced the first ranked relevance system for text searches, allowing the engine to return better matches by analyzing the full text of pages and documents.[23] Lycos was also the first search engine to be commercially successful as a business.[24] Then came AltaVista in 1995, which both provided faster service and allowed users to search for all of the sites that linked to a particular URL.[25] AltaVista also assisted users by offering a "tips" option that helped formulate an effective search, and it was the first engine to allow "natural language" searches as opposed to Boolean[26] searches. HotBot, created by Inktomi Corporation in 1996, further increased the speed with which results were returned, helping to increase its popularity. It was also the first search engine to use "cookies" to track user behavior and to customize a user's experience.[27] HotBot used a computerized system to analyze links, traffic, and other factors to determine a site's place in its search results.[28]
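
To illustrate the step from meta-data search to full-text, ranked search, here is a minimal sketch of an inverted index with a crude term-frequency ranking; the sample documents and the scoring scheme are invented for the example and are not how WebCrawler or Lycos actually worked.

# A toy full-text index: every word of every document is searchable,
# and results are ranked by how often the query terms appear.
from collections import defaultdict

def build_index(documents):
    # Map each word to the documents containing it and the count of occurrences.
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word][doc_id] += 1
    return index

def search(index, query):
    # Score each document by the total occurrences of the query terms, highest first.
    scores = defaultdict(int)
    for word in query.lower().split():
        for doc_id, count in index.get(word, {}).items():
            scores[doc_id] += count
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

documents = {
    "page1": "hypertext links connect documents across the web",
    "page2": "the web grew as hypertext pages linked to other pages",
    "page3": "gopher served menus of text files",
}
index = build_index(documents)
print(search(index, "hypertext web"))  # only pages containing the query terms appear, ranked by count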

While the mid-1990s saw an increase in the number of search engines and a burst of innovation, that did not necessarily translate into better search results. Because each search engine used its own formula for searching the web, results varied from one engine to the next. Further, the criteria for judging the relevancy of any given result were crude, leading to many false positives. Given the technology of the time, there was almost too much information for the search engines to make useful.

Google was founded in 1998 and is currently the number one search engine and the second most visited web site on the Internet, with about 275,000 daily searches.[29] Google has set itself apart through its PageRank[30] system for returning relevant search results. PageRank relies on the "uniquely democratic nature of the web" by counting each inbound link to a site as a "vote." A page is deemed popular if it has many votes. Thus, by tabulating the number of votes a web page has and weighting those votes by the popularity of the voters, Google is able to return the most relevant results from the most popular sites as determined by web users. The theory is that individuals are good at deciding whether a page is relevant and that people will link to sites that provide useful information. Google is then able to sort through the enormous number of web pages by harnessing the collective judgment of web users, who have themselves determined which pages are relevant.
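
As a rough illustration of the voting idea, here is a minimal sketch of the PageRank recurrence in Python; the tiny link graph, the fixed number of iterations, and the simplified handling of dangling pages are assumptions made for the example, and this is not Google's production system.

# Toy PageRank: a page's score is built from the scores of the pages linking
# to it, so a "vote" from a popular page counts for more than one from an
# obscure page.

def pagerank(links, damping=0.85, iterations=50):
    # links maps each page to the list of pages it links to.
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue  # dangling pages are simply ignored in this sketch
            share = rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

# A tiny invented web: both other pages link to A, so A ends up ranked highest.
links = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["A"],
}
for page, score in sorted(pagerank(links).items(), key=lambda item: -item[1]):
    print(page, round(score, 3))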

In the history of search engines, from Vannevar Bush's vision of the memex to Google, the recurring theme has been making sense of the information available to mankind. Information must be made useful for it to be used. Every technology has struggled to give people access to what they are looking for, and while Google is the best today, if history is a guide, a new search engine will arise that makes search even more useful.

Another theme is the slow commercialization of search technology and of the Internet as a whole. What started as a government project to facilitate scientific research has turned into the free and public Internet. What started as a way for academics to find papers and other documents from faraway universities has turned into a lucrative business in Internet search. While commerce may have changed how search engines are developed and implemented, it has not changed their underlying purpose of connecting people with the information they seek. The market created by the advent of Internet search is a market for finding answers to questions, and no matter what else changes, a search engine that fails to produce useful search results will fail to produce financial results.

One question that remains relevant is the causality dilemma in the development of search technology. Does better search technology help spur the creation of more content, or does the existence of more content force search technology to improve? What can be teased out from the history of search engines is that access to the Internet, coupled with the public's ability to create content, appears to be the driving force behind the growth of content. When the Internet was largely an academic affair with few users, it grew slowly. Once the Web was introduced, the number of pages skyrocketed. Today the amount of new content online is accelerating, due in large part to the popularity of web logs (blogs) that empower any person to publish their thoughts online with a few clicks of the mouse. Better search engines allow these new publishers to find information to link to more easily, giving them source material to work with and allowing any Joe Public with an Internet connection to have as much research power at his fingertips as any researcher at a major university. It is possible that people would create as much content without effective search technology – life off-line is interesting enough to keep a writer busy for a lifetime – but the kind of progress envisioned by Bush, real progress built on the findings of others, would certainly suffer. The kind of creation Bush was concerned with has been characterized as "glomming on,"[31] the process of appropriating information from other sources and using it as the building blocks for innovation and commentary. It is this kind of thinking about information that led to the creation of the Internet, and with it the creation of search engines, making the two inseparable and dependent on each other for their utility.

Whether or not the creative process of “glomming on” will be allowed to flourish is something that will be decided in the courts over the next few years, barring Congressional action. Google understands that its relevance depends in large part on the answers it provides to those seeking information. However, Google’s ability to serve that information on request is being challenged in a number of areas.



[1] For an overview of search engine history, see History of Search Engines and Web History by Aaron Wall at http://www.search-marketing.info/search-engine-history/, A Brief History of Search Engines by Lee Underwood at http://www.webreference.com/authoring/search_history/, A History of Search Engines by Wes Sonnenreich at http://www.wiley.com/legacy/compbooks/sonnenreich/history.html, and Wikipedia entry for “search engine” at http://en.wikipedia.org/wiki/Search_engine.

[2] Vannevar Bush, As We May Think, The Atlantic Monthly, July 1945, Vol. 176, No. 1, 101-108, available at http://www.theatlantic.com/doc/194507/bush.

[3] Id.

[4] See Wikipedia entry for “hypertext” available at http://en.wikipedia.org/wiki/Hypertext (last checked 5.20.2006). Systems of what can be called hypertext made up the early “Internet” in the 1970s and 1980s. The introduction of the World Wide Web in 1990 by Tim Berners-Lee incorporated hypertext as a way to help researchers connect with each other from all over the world. The commonly known “link” is composed of hypertext, the text that serves as a marker, and the hyperlink, the actual connection between two documents.

[5] See Wikipedia entry for “arpanet” available at http://en.wikipedia.org/wiki/Arpanet (last checked 5.20.2006).

[6] Id.

[7] See Wikipedia entry for “packet switching” available at http://en.wikipedia.org/wiki/Packet_switching (last checked on 5.20.2006). Packet switching increases the speed of communication as each message is broken down into numerous packets and each is sent across the network, each finding the fastest way to its target where the packets are reassembled and displayed for the user.

[8] See Wikipedia entry for “file transfer protocol” available at http://en.wikipedia.org/wiki/FTP_server (last checked 5.20.2006).

[9] See William Slawski, Just what was the first search engine? SEO by the Sea, 2.5.2006. Available at http://www.seobythesea.com/?p=106 (Last checked 5.20.2006).

[10] See Wikipedia for “archie search engine” available at http://en.wikipedia.org/wiki/Archie_search_engine (Last checked 5.20.2006).

[11] See Wikipedia for “gopher protocol” available at http://en.wikipedia.org/wiki/Gopher_protocol (Last checked 5.20.2006). Gopher was chosen as the name because (1) users instructed the system to “go for” information, (2) the menus were analogous to gopher holes, and (3) the University of Minnesota’s mascot is the Golden Gopher.

[12] See Wikipedia for “veronica (computer)” available at http://en.wikipedia.org/wiki/Veronica_%28computer%29 (Last checked 5.20.2006). Veronica stands for Very Easy Rodent-Oriented Net-wide Index to Computer Archives and was most likely chosen as the name to fit in with Archie, based on the Archie comic books.

[13] See Wikipedia for “jughead (computer)” available at http://en.wikipedia.org/wiki/Jughead_%28computer%29 (Last checked 5.20.2006). Jughead stands for Jonzy's Universal Gopher Hierarchy Excavation And Display and was also probably chosen to fit with the Archie Comics theme.

[14] See Wikipedia for “Tim Berners-Lee” available at http://en.wikipedia.org/wiki/Tim_Berners-Lee (last checked 5.20.2006).

[15] See Wikipedia for “world wide web” under “origins” available at http://en.wikipedia.org/wiki/World_wide_web#Origins (last checked 5.20.2006).

[16] See Wikipedia for “history of the internet” under “a world library” available at http://en.wikipedia.org/wiki/History_of_the_Internet#A_world_library.E2.80.94From_gopher_to_the_WWW (last checked 5.20.2006).

[17] See Wikipedia for “web crawler” available at http://en.wikipedia.org/wiki/Web_crawler (last checked 5.20.2006).

[18] See Wikipedia for “search engine” under “history” available at http://en.wikipedia.org/wiki/Search_engine (last checked 5.20.2006).

[19] See Wikipedia for “aliweb” available at http://en.wikipedia.org/wiki/Aliweb (last checked 5.20.2006).

[20] See Historical Web Services available at http://www.greenhills.co.uk/mak/historical.html (last checked 5.20.2006).

[21] See Wikipedia for “meta data” available at http://en.wikipedia.org/wiki/Meta_data (last checked 5.20.2006). Meta data, literally “data about data” from Greek, is “structured, encoded data that describes characteristics of information-bearing entities to aid in the identification, discovery, assessment, and management of the described entities" (Committee on Cataloging Task Force on metadata Summary Report, http://www.libraries.psu.edu/tas/jca/ccda/tf-meta3.html, 1999). It functions on a web page much like a library card catalog card, providing general information about the source being sought and is in itself searchable. The manipulation of meta data can help ensure that a given site appears as a search result when a query matching a meta data term is entered.

[22] See WebCrawler Facts at http://www.thinkpink.com/bp/WebCrawler/History.html (last checked 5.20.2006).

[23] See Sonnenreich, supra, at Mellon-Mania: The Birth of Lycos.

[24] See Wikipedia entry for “history of the internet” at Finding What You Need, available at http://en.wikipedia.org/wiki/History_of_the_Internet#Finding_what_you_need.E2.80.94The_search_engine (last checked 5.20.2006).

[25] See Sonnenreich, supra, at Return of the DEC.

[26] See Wikipedia entry for "Boolean search" available at http://en.wikipedia.org/wiki/Boolean_search (last checked 5.20.2006). Named after the English mathematician George Boole, who defined an algebraic system of logic in the 1800s. Boolean searches require the use of the "and," "or," and "not" operators (among others) to help the system understand the order and importance of certain search terms. Natural language search does not require these operators, hence the name.

[27] See Sonnenreich, supra, at A Spider Named "Slurp!": The Powerful HotBot. HotBot was subsequently purchased by Yahoo! in 2002 and is the engine that produces its search results.

[28] See Underwood, supra, at Enter the Accountants.

[30] Information on Google PageRank is available on Google’s website at http://www.google.com/technology/ (last checked 5.20.2006).

[31] See Jack M. Balkin, Digital Speech and Democratic Culture: A Theory of Freedom of Expression for the Information Society, 79 N.Y.U. L. Rev. 1 at 10 (2004). In discussing mass media’s effect on the Internet, Balkin argues that mass media acts as a bottleneck on the dissemination of ideas because of its monopoly on content relevant to the public. “Glomming on” represents the essence of Internet speech and is built into the Net’s architecture through linking, fundamentally changing speech with the ability to link to primary source proof to show the validity of one’s argument.
