ABSTRACT
The web has greatly improved access to scientific literature. However, scientific articles on the web are largely disorganized: research articles are spread across archive sites, institution sites, journal sites, and researcher homepages. No index covers all of the available literature, and the major web search engines typically do not index the content of PostScript/PDF documents at all. This paper discusses the creation of digital libraries of scientific literature on the web, including the efficient location of articles, full-text indexing of the articles, autonomous citation indexing, information extraction, display of query-sensitive summaries and citation context, hubs and authorities computation, similar document detection, user profiling, distributed error correction, graph analysis, and detection of overlapping documents. The software for the system is available at no cost for non-commercial use.