
Document retrieval


Document retrieval is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words.

Document retrieval is sometimes called text retrieval, or is treated as a branch of it. Text retrieval is the branch of information retrieval in which the information is stored primarily as text. Text databases became decentralized thanks to the personal computer and the CD-ROM. Text retrieval remains a critical area of study, since it underlies Internet search engines.




Description

Document retrieval systems find information matching given criteria by comparing text records (documents) against user queries, as opposed to expert systems, which answer questions by inference over a logical knowledge base. A document retrieval system consists of a database of documents, a classification algorithm to build a full text index, and a user interface to access the database.

A document retrieval system has two main tasks:

  1. Find documents relevant to user queries.
  2. Evaluate the matching results and sort them according to relevance, using algorithms such as PageRank.
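The two tasks can be illustrated with a minimal Python sketch. It is only an assumption-laden example: it scores documents with a simple TF-IDF-style weight rather than PageRank (which relies on link structure between documents), and the names `tokenize` and `rank` as well as the sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a real system would do far more."""
    return text.lower().split()


def rank(query, documents):
    """Return (doc_id, score) pairs for matching documents, best first."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(documents)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for doc_id, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            if tf[term]:
                # Task 1: the document matches on this term.
                # Terms found in few documents weigh more than common ones.
                score += tf[term] * math.log(1 + n_docs / df[term])
        if score > 0:
            scores.append((doc_id, score))

    # Task 2: sort by descending score, breaking ties by document id.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores


docs = [
    "house for sale near the river",
    "river flooding damages house",
    "manual for the washing machine",
]
results = rank("river flooding", docs)
print([doc_id for doc_id, _ in results])  # [1, 0]
```

Document 1 ranks first because it contains both query terms, and "flooding" occurs in only one document, so it carries more weight than "river".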

Internet search engines are classical applications of document retrieval. Retrieval systems currently in use range from simple Boolean systems to systems using statistical or natural language processing techniques.
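At the simple end of that range, a Boolean system only tests whether terms are present or absent. The following Python sketch assumes whitespace tokenization and a hypothetical `boolean_search` function with AND, OR and NOT term lists; it scans every document and performs no ranking.

```python
def term_set(text):
    """Set of lowercase whitespace-separated terms in a text."""
    return set(text.lower().split())


def boolean_search(documents, must=(), should=(), must_not=()):
    """Return ids of documents satisfying a simple Boolean query.

    must:     every term has to appear (AND)
    should:   if given, at least one term has to appear (OR)
    must_not: none of the terms may appear (NOT)
    """
    hits = []
    for doc_id, text in enumerate(documents):
        terms = term_set(text)
        if not all(t in terms for t in must):
            continue
        if should and not any(t in terms for t in should):
            continue
        if any(t in terms for t in must_not):
            continue
        hits.append(doc_id)
    return hits


docs = [
    "cheap flat in the city centre",
    "cheap house outside the city",
    "large house with garden",
]
# cheap AND (flat OR house) AND NOT centre
print(boolean_search(docs, must=["cheap"], should=["flat", "house"],
                     must_not=["centre"]))  # [1]
```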


Variations

There are two main classes of indexing schemata for document retrieval systems: form based (or word based), and content based indexing. The document classification scheme (or indexing algorithm) in use determines the nature of the document retrieval system.

Form based

Form based document retrieval addresses the exact syntactic properties of a text, comparable to substring matching in string searches. The text is generally unstructured and not necessarily in a natural language; the system could, for example, be used to process large sets of chemical representations in molecular biology. Suffix tree indexing is an example of form based indexing.
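As a rough illustration of form based indexing, the Python sketch below uses a suffix array, a simpler relative of the suffix tree, to locate every occurrence of a substring. The function names, the naive construction and the sample sequence are assumptions made for this example only.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order.

    Naive construction that compares whole suffixes; adequate for a sketch,
    too slow for large collections.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of pattern in text."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the k-th smallest suffix.
        start = suffix_array[k]
        return text[start:start + m]

    # Suffixes that begin with the pattern form one contiguous run.
    # First binary search: start of the run.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Second binary search: one past the end of the run.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


sequence = "ACGTACGGACG"
sa = build_suffix_array(sequence)
print(find_occurrences(sequence, sa, "ACG"))  # [0, 4, 8]
```

The search never interprets the text; it only compares character sequences, which is why the same index works for non-linguistic data such as sequence strings.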

Content based

The content based approach exploits semantic connections between documents and parts thereof, and semantic connections between queries and documents. Most content based document retrieval systems use an inverted index algorithm.
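A minimal inverted index can be sketched in Python as a mapping from each term to the set of documents containing it. The function names and sample documents are hypothetical, and the sketch covers only term-level matching, not the richer semantic connections described above.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return sorted ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    # Look up the posting set of each term, then intersect them.
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


docs = [
    "newspaper article about housing",
    "manual paragraph about printers",
    "housing records and newspaper archive",
]
index = build_inverted_index(docs)
print(search(index, "newspaper housing"))  # [0, 2]
```

A query touches only the posting sets of its own terms, so the cost depends on how common those terms are rather than on the size of the whole collection.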

A signature file is a technique that creates a quick and dirty filter, for example a Bloom filter, that keeps all the documents that match the query and, ideally, only a few that do not. This is done by creating a signature for each file, typically a hash coded version of its content. One method is superimposed coding. A post-processing step then discards the false alarms (false positives). Since in most cases this structure is inferior to inverted files in terms of speed, size and functionality, it is not widely used. However, with proper parameters it can beat inverted files in certain environments.
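The filter-then-verify idea can be sketched in Python with superimposed coding: each word sets a few bits, and a document signature is the bitwise OR of its word signatures. The signature width, the number of bits per word, the use of SHA-256 and all function names are assumptions chosen for illustration, not parameters taken from the literature.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed signature width
BITS_PER_WORD = 3    # assumed number of bits set per word


def word_signature(word):
    """Set BITS_PER_WORD hash-selected bits for one word."""
    signature = 0
    for k in range(BITS_PER_WORD):
        digest = hashlib.sha256(f"{k}:{word}".encode("utf-8")).digest()
        position = int.from_bytes(digest[:4], "big") % SIGNATURE_BITS
        signature |= 1 << position
    return signature


def document_signature(text):
    """Superimpose (bitwise OR) the signatures of all words in the text."""
    signature = 0
    for word in set(text.lower().split()):
        signature |= word_signature(word)
    return signature


def candidates(signatures, query):
    """Filter step: keep documents whose signature covers the query's bits.

    True matches always pass; some non-matching documents may pass too.
    """
    query_sig = document_signature(query)
    return [i for i, sig in enumerate(signatures)
            if sig & query_sig == query_sig]


def search(documents, signatures, query):
    """Filter by signature, then discard false alarms by checking the text."""
    query_words = set(query.lower().split())
    return [i for i in candidates(signatures, query)
            if query_words <= set(documents[i].lower().split())]


docs = [
    "signature files filter documents quickly",
    "inverted files map terms to documents",
    "bloom filter tests set membership",
]
signatures = [document_signature(d) for d in docs]
print(search(docs, signatures, "files documents"))  # [0, 1]
```

Because a matching document has every bit of the query signature set, the filter step cannot miss a true match; the final check against the stored text is what removes the false alarms.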

Example: PubMed

The PubMed[1] form interface features a "related articles" search, which works by comparing words from the documents' titles, abstracts, and MeSH terms using a word-weighted algorithm.[2][3]

References

  1. ^ Kim W, Aronson AR, Wilbur WJ (2001). "Automatic MeSH term assignment and quality assessment". Proc AMIA Symp: 319–23.  
  2. ^ "Computation of Related Citations". 
  3. ^ Lin J, Wilbur WJ (Oct 30, 2007). "PubMed related articles: a probabilistic topic-based model for content similarity". BMC Bioinformatics 8: 423.

Further reading

  • Faloutsos, Christos; Christodoulakis, Stavros (1984). "Signature files: An access method for documents and its analytical performance evaluation". ACM Transactions on Information Systems (TOIS) 2 (4): 267–288.  
  • Justin Zobel, Alistair Moffat and Kotagiri Ramamohanarao (1998). "Inverted files versus signature files for text indexing" (PDF). ACM Transactions on Database Systems (TODS) 23 (4): 453–490.  
  • Ben Carterette and Fazli Can (2005). "Comparing inverted files and signature files for searching a large lexicon" (PDF). Information Processing and Management 41 (3): 613–633.  

External links

  • Formal Foundation of Information Retrieval, Buckinghamshire Chilterns University College