Accessibility of information on the web

S Lawrence, CL Giles - intelligence, 2000 - dl.acm.org
Steve Lawrence, C. Lee Giles

… es, and tested for a Web server at the standard port. There are currently 256^4 (about 4.3 billion) possible IP addresses (IPv6, the next version of the IP protocol, which is under development, will increase this substantially); some of these are unavailable, while some are known to be unassigned. We have tested random IP addresses (with replacement), and have estimated the total number of Web servers using the fraction of tests that successfully locate a server. Many sites are temporarily unavailable because of Internet connectivity problems or Web-server downtime; to minimize this effect, we rechecked all IP addresses after a week. Testing 3.6 million IP addresses (requests timed out after 30 seconds of inactivity), we found a Web server for one in every 269 requests, leading to an estimate of 16.0 million Web servers in total. For comparison, Netcraft found 4.3 million Web servers in February 1999 based on testing known host names (aliases for the same site were considered as distinct hosts in the Netcraft survey; see www.netcraft.com/survey/).

The estimate of 16.0 million servers is not very useful, because there are many Web servers that would not normally be considered part of the publicly indexable Web. These include servers with authorization requirements (including firewalls), servers that respond with a default page, those with no content (sites 'coming soon', for example), Web-hosting companies that present their home page on many IP addresses, and printers, routers, proxies, mail servers, CD-ROM servers and other hardware that provides a Web interface. We built a database of regular expressions to identify most of these servers. For the results reported here, we manually classified all servers and removed those that are not part of the publicly indexable Web. Sites that serve the same content on multiple IP addresses were accounted for by considering only one address as part of the publicly indexable Web. Our resulting estimate of the number of servers on the publicly indexable Web as of February 1999 is 2.8 million. Note that it is possible for a server to host more than one site. All further analysis presented here uses only those servers considered part of the publicly indexable Web.
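The arithmetic behind the 16.0 million figure can be sketched in a few lines. The sketch below scales the observed hit rate (one server per 269 requests) up to the full 256^4 address space. The function names are hypothetical, and the standard-error calculation is an added assumption (simple binomial sampling with replacement), not something the authors report.

```python
import math

# Size of the IPv4 address space quoted in the text: 256^4, about 4.3 billion.
TOTAL_IPV4 = 256 ** 4

def estimate_servers(hit_rate: float, address_space: int = TOTAL_IPV4) -> float:
    """Scale the fraction of successful probes up to the whole address space."""
    return address_space * hit_rate

def binomial_std_error(n_tests: int, hit_rate: float,
                       address_space: int = TOTAL_IPV4) -> float:
    """Standard error of the estimate, ASSUMING independent probes drawn
    with replacement (a plain binomial model, not taken from the source)."""
    se_rate = math.sqrt(hit_rate * (1.0 - hit_rate) / n_tests)
    return address_space * se_rate

hit_rate = 1 / 269          # one Web server per 269 requests (from the text)
n_tests = 3_600_000         # 3.6 million addresses tested (from the text)

estimate = estimate_servers(hit_rate)
std_error = binomial_std_error(n_tests, hit_rate)
print(f"estimated servers: {estimate / 1e6:.1f} million")
print(f"binomial standard error: {std_error / 1e6:.2f} million")
```

Under this simplified model the sampling error is small relative to the estimate; the larger uncertainty discussed in the text comes from deciding which servers count as part of the publicly indexable Web.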
To estimate the number of indexable Web pages, we crawled all the pages on the first 2,500 random Web servers. The mean number of pages per server was 289, leading to an estimate of the number of pages on the publicly indexable Web of about 800 million. It is important to note that the distribution of pages on Web servers is extremely skewed, following a universal power law (ref. 4). Many sites have few pages, and a few sites have vast numbers of pages, which limits the accuracy of the estimate. The true value could be higher because of very rare sites that have millions of pages (for example, GeoCities reportedly has 34 million pages), or because some sites could not be crawled completely because of errors. The mean size of a page was 18.7 kilobytes (kbytes; median 3.9 kbytes), or 7.3 kbytes (median 0.98 kbytes) after reducing the pages to only the textual content (removing HTML tags, comments and extra white space). This allows an estimate of the amount of data on …
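As a rough check of the page-count estimate, the sketch below multiplies the 2.8 million publicly indexable servers by the mean of 289 pages per server. It then draws a synthetic heavy-tailed sample to illustrate why a skewed distribution makes a mean-based estimate fragile. The synthetic data, the Pareto shape parameter and all function names are illustrative assumptions; none of them come from the paper's crawl.

```python
import random
import statistics

SERVERS = 2.8e6      # publicly indexable servers (from the text)
MEAN_PAGES = 289     # mean pages per crawled server (from the text)

def estimate_pages(servers: float, mean_pages: float) -> float:
    """Total pages = number of servers x mean pages per server."""
    return servers * mean_pages

def synthetic_page_counts(n: int, alpha: float, seed: int) -> list[float]:
    """Draw n values from a Pareto distribution as a stand-in for a
    power-law page-count distribution (ASSUMED shape, synthetic data)."""
    rng = random.Random(seed)
    return [rng.paretovariate(alpha) for _ in range(n)]

total_pages = estimate_pages(SERVERS, MEAN_PAGES)
print(f"estimated pages: {total_pages / 1e6:.0f} million")

# 2,500 matches the number of servers crawled; alpha = 1.2 is arbitrary.
sample = synthetic_page_counts(n=2500, alpha=1.2, seed=0)
sample_mean = statistics.mean(sample)
sample_median = statistics.median(sample)
print(f"synthetic mean: {sample_mean:.2f}, median: {sample_median:.2f}")
```

In a heavy-tailed sample like this one, a handful of very large values pulls the mean well above the median, which mirrors the gap between the mean and median page sizes reported above and shows why missing a few very large sites can bias the total downward.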