Information Retrieval

J. Morato
 Incorporated contributions
Morato (11/2009)
 Usage domain
Information management, LIS, Linguistics, Informatics
French Recherche d'information
 German Informationswiedergewinnung

  1. Changes in the meaning of the term, 
  2. Information Retrieval and Knowledge Retrieval, 
  3. Information Retrieval Languages and Information Retrieval Systems, 
  4. Metadata, descriptors, and indexing, 
  5. Information Retrieval by controlled vocabularies, 
  6. Relevance, 
  7. Retrieval Measures, 
  8. Retrieval Models.

Information retrieval is the academic discipline that studies the methodologies, tools, techniques, and languages for searching for and retrieving data relevant to an information need. Information retrieval draws on techniques from linguistics, computer science, information science, and text mining.

1. Changes in the meaning of the term

Originally, the term was used only to denote the set of techniques and processes for retrieving data from databases in computer systems. In the early nineties, with the increasing amount of text documents on the Web, text retrieval became the main goal. Most of these tools work by matching words shared between the textual query and the textual documents. As multimedia resources grew, search engines began to search audio, image, and video resources as well. In the literature, document retrieval, text retrieval, information retrieval, and data retrieval are often employed as synonyms, although each has its own specific meaning.

Traditionally, in the Web context the answer to a query is a set of documents that probably contain relevant data about the topic. A related area is question answering, whose systems answer a query with specific data rather than with a set of documents.

2. Information Retrieval and Knowledge Retrieval

Usually, information is regarded as data in context: it is tied to definitions and domains. However, information by itself says little about how it relates to other information elements in a specific context. The integration and combination of information items is what is regarded as knowledge. Thus, any explanation of know-how has to define how the items are related and how the process unfolds. This approach assumes two concepts needed to perform a task: the existence of a goal and the existence of relationships between the concepts in the system.

On the one hand, the existence of a goal implies a purpose and a desire, on the part of an individual or group, to achieve it. Therefore, the information retrieved only becomes knowledge when it makes sense in the mind of the person who performs the query.

On the other hand, knowledge implies that the information is interrelated in service of the goal: the information is connected by means of a set of rules and restrictions. The inclusion of these rules in computer applications is the reason for the change of name from Information Retrieval Systems to Knowledge Retrieval Systems. These systems have their origin in the Artificial Intelligence (AI) field. AI tries to emulate human reasoning, which involves having goals, rules, and relationships. Intelligent agents and ontologies are resources needed to emulate the human brain; this emulation, if successful, is the justification for renaming information retrieval as knowledge retrieval. Knowledge Retrieval Systems try to implement search engines that not only search for words in documents but also attempt to infer relationships.

3. Information Retrieval Languages and Information Retrieval Systems

Information retrieval on computer systems, as distinct from library methods, which have a wider scope, implies that some retrieval languages are tied to a specific technology or system. Well-known retrieval languages include SQL, SPARQL, and Boolean query languages.
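A Boolean-style retrieval query in SQL can be sketched as follows. This is a minimal illustration using an in-memory SQLite database; the table name, columns, and sample documents are invented for the example, and a real system would use full-text indexing rather than `LIKE` pattern matching.

```python
import sqlite3

# Hypothetical toy document table, built in memory for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [
        ("IR intro", "information retrieval and search"),
        ("DB intro", "relational databases and SQL"),
        ("Hybrid",   "retrieval of data from databases"),
    ],
)

# Boolean AND: only documents whose body contains both
# 'retrieval' AND 'databases' are retrieved, with no ranking.
rows = conn.execute(
    "SELECT title FROM docs "
    "WHERE body LIKE '%retrieval%' AND body LIKE '%databases%'"
).fetchall()
print(rows)  # → [('Hybrid',)]
```

Note how the query language expresses the information need as a logical condition, which is characteristic of the Boolean model discussed in section 8.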

4. Metadata, descriptors, and indexing

In the 1960s and 1970s, computers had limited storage capacity and slow processors. Documents in these systems had their content represented by metadata (literally, data about data) and a small set of terms, called descriptors. Metadata fields typically included author, title, source, and date. Metadata and descriptors were assigned by hand.

Nowadays, such metadata are used in the Semantic Web because of their simplicity, which facilitates interoperability and navigation on the Web.

Automatic indexing deals with techniques to automatically assign relevant terms to a document. Relevance is computed by means of statistics and the term's location in the document. Examples are term frequency combined with inverse document frequency (known as tf-idf), stop-word removal, and assigning a higher weight to words from the title or words with stressed typography (e.g. bold or italic type). Most of these factors are used in web search engines.
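The tf-idf weighting just mentioned can be sketched in a few lines. This is a minimal illustration with an invented three-document corpus and a deliberately tiny stop-word list; real indexers also apply stemming and more refined weighting variants.

```python
import math

# Toy corpus; in practice documents would be tokenized more carefully.
docs = [
    "information retrieval studies search engines",
    "search engines index web documents",
    "thesauri help information retrieval",
]

STOP_WORDS = {"and", "the", "a"}  # minimal stop-word list for illustration

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

tokenized = [tokenize(d) for d in docs]
n_docs = len(tokenized)

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term)                      # term frequency in this document
    df = sum(1 for d in tokenized if term in d)      # documents containing the term
    idf = math.log(n_docs / df) if df else 0.0       # rarer terms weigh more
    return tf * idf

# "retrieval" appears in 2 of 3 documents, "web" in only 1,
# so "web" receives the higher weight even at equal term frequency.
print(tf_idf("retrieval", tokenized[0]))
print(tf_idf("web", tokenized[1]))
```

The key intuition is the idf factor: a term occurring in every document discriminates nothing and gets weight zero, while a term confined to few documents is a strong content indicator.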

5. Information Retrieval by controlled vocabularies

In Information Science, terms from a specific domain are often listed in a normalized way. This list is called a controlled vocabulary, and each term is known as a descriptor. Such a vocabulary can discover and display relationships between terms. Vocabulary control tries to avoid typical problems of natural language: polysemy, homonymy, and synonymy.

The relationship types available in these vocabularies affect retrieval results. In a thesaurus, the relationships are equivalence, hierarchy, and semantic relatedness. A faceted thesaurus groups terms under different facets to facilitate retrieval.
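The thesaurus relationships above can be sketched as a simple data structure used for query expansion. The relationship labels follow the conventional thesaurus abbreviations (USE/UF for equivalence, BT/NT for hierarchy, RT for related terms); the terms themselves are invented for the example, not taken from any real vocabulary.

```python
# Minimal thesaurus sketch: equivalence (UF), hierarchy (BT/NT), relatedness (RT).
thesaurus = {
    "automobile": {"UF": ["car", "motorcar"], "BT": ["vehicle"], "RT": ["road"]},
    "vehicle":    {"NT": ["automobile", "bicycle"]},
}

# Invert the UF lists so any synonym maps to its preferred descriptor.
synonyms = {alt: pref
            for pref, rels in thesaurus.items()
            for alt in rels.get("UF", [])}

def expand(term):
    """Normalize a query term to its descriptor (vocabulary control),
    then expand it with narrower and related terms."""
    pref = synonyms.get(term, term)
    rels = thesaurus.get(pref, {})
    return [pref] + rels.get("NT", []) + rels.get("RT", [])

print(expand("car"))      # → ['automobile', 'road']
print(expand("vehicle"))  # → ['vehicle', 'automobile', 'bicycle']
```

The first step illustrates how vocabulary control neutralizes synonymy; the expansion step shows how hierarchical and associative relationships can broaden a query to improve recall.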

6. Relevance

Relevance is a measure of the degree to which a certain element answers a query. This measure is subjective, in the sense that it depends on the knowledge of the person assessing the relevance.

7. Retrieval Measures

Performance of an information retrieval system can be measured by the following coefficients:

  • Precision: proportion of relevant data retrieved out of the total data retrieved. 
  • Recall: proportion of relevant data retrieved out of all relevant data in the database.

Both measures have an inverse relationship (Cleverdon's law): increasing precision typically produces a decrease in recall. These coefficients relate to two kinds of retrieval error: noise and silence.

  • Noise: non-relevant data that have been retrieved.
  • Silence: relevant data that have not been retrieved from the database.
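The four quantities above can be computed from two sets: the documents a system retrieves and the documents judged relevant. A minimal sketch, with invented document identifiers:

```python
def precision_recall(retrieved, relevant):
    """Compute precision, recall, noise, and silence for one query.

    retrieved: set of document ids returned by the system
    relevant:  set of document ids judged relevant (ground truth)
    """
    hits = retrieved & relevant                       # relevant AND retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    noise = retrieved - relevant                      # retrieved but not relevant
    silence = relevant - retrieved                    # relevant but missed
    return precision, recall, noise, silence

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
p, r, noise, silence = precision_recall(retrieved, relevant)
print(p)  # → 0.5 (2 relevant of 4 retrieved)
print(r)  # 2 of the 3 relevant documents found
```

Note that noise and silence are simply the complements of the two hits-based measures: noise lowers precision, silence lowers recall.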

A test collection comprises a set of documents, a set of queries, and relevance judgments indicating which documents are relevant to each query; each query-document pair is judged for relevance by hand. Test collections are used in international competitions to evaluate retrieval systems. TREC (Text REtrieval Conference) is the best-known such evaluation.

8. Retrieval Models

Retrieval models compute the degree to which certain elements answer a query. As a general rule, the degree is computed by means of a similarity coefficient (cosine, Phi, etc.). The most popular models are:

  • Boolean: only two values are computed, relevant/non-relevant. Only documents judged relevant are retrieved, without any ranking. An example is SQL in relational databases, although an extended Boolean model exists that provides a way to rank results.
  • Vector: a vector is built to represent the query terms. This query vector is compared against each document vector, measuring the similarity between them.
  • Probabilistic: the probability of a document answering a query is computed. Often, relevance feedback is used to improve the probability estimate. Feedback is based on user judgments about the set of documents retrieved; words from documents judged relevant are given a higher weight when the query is recomputed.
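The vector model can be sketched with the cosine similarity coefficient mentioned above. This toy version uses raw term-frequency vectors and an invented three-document corpus; a realistic system would weight terms with tf-idf.

```python
import math
from collections import Counter

def vector(text):
    # Sparse term-frequency vector; real systems would use tf-idf weights.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine of the angle between two sparse term vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

docs = [
    "information retrieval with vector models",
    "databases store structured data",
    "retrieval of information from text",
]

# The query is turned into a vector and compared against each document
# vector; documents are then ranked by decreasing similarity.
query = vector("information retrieval")
ranked = sorted(docs, key=lambda d: cosine(query, vector(d)), reverse=True)
print(ranked)  # the database document, sharing no query terms, ranks last
```

Unlike the Boolean model, every document receives a graded score, so results can be sorted by estimated relevance rather than partitioned into retrieved and not retrieved.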

  • Antoniou, G., Van Harmelen, F. (2004). A Semantic Web Primer. Massachusetts: MIT Press.
  • Baeza-Yates, R., Ribeiro-Neto, B. (1999). Modern Information Retrieval. New York: ACM Press; Madrid [etc.]: Addison-Wesley. 
  • Cleverdon, C. W. (1972). "On the inverse relationship of recall and precision". Journal of Documentation, Vol. 28, pp. 195-201.
  • Sparck Jones, K., Willett, P. (eds.) (1997). Readings in Information Retrieval. San Francisco: Morgan Kaufmann.

Incorporated entries
Jorge Morato (5/11/2009)
[This corresponds to the first version of the article, now shown in the left column.]