Information retrieval is the academic discipline that studies the methodologies, tools, techniques, and languages for searching and retrieving relevant data for an information need. Information retrieval comprises techniques from linguistics, computer science, information science, and text mining.
1. Changes in the meaning of the term
Originally, the term denoted only the set of techniques and processes for retrieving data from databases in computer systems. In the early nineties, as the number of text documents on the Web grew, text retrieval became the main goal. Most text retrieval tools look for words that the textual query and the textual documents have in common. As multimedia resources grew, search engines began to search audio, image, and video resources as well. In the literature, document retrieval, text retrieval, information retrieval, and data retrieval are often used as synonyms, although each has its own specific meaning.
Traditionally, in the Web context, the answer to a query is a set of documents that are likely to contain relevant data about the topic. A related area is question answering, in which systems answer a query with specific data rather than with a set of documents.
2. Information Retrieval and Knowledge Retrieval
Information is usually regarded as data in context; in other words, information is tied to definitions and domains. However, information by itself seldom captures how it relates to other information elements in a specific context. The integration and combination of information items is what is regarded as knowledge. Any explanation of know-how therefore has to define how the items are related and how the process unfolds. This approach assumes that two things are needed to perform a task: the existence of a goal and the existence of relationships between the concepts in the system.
On the one hand, the existence of a goal implies a purpose and a desire to achieve the goal on the part of an individual or group. Therefore, the information retrieved becomes knowledge only when it makes sense in the mind of the person who performs the query.
On the other hand, knowledge implies that the pieces of information are interrelated in the service of achieving the goal, by means of a set of rules and restrictions. The inclusion of these rules in computer applications is the reason for the change of name from Information Retrieval Systems to Knowledge Retrieval Systems. These systems have their origin in the field of Artificial Intelligence (AI). AI tries to emulate human reasoning, which involves goals, rules, and relationships. Intelligent agents and ontologies are necessary resources for this emulation, and the emulation, if successful, is the justification for renaming information retrieval as knowledge retrieval. Knowledge Retrieval Systems try to implement search engines that not only search for words in the documents but also attempt to infer relationships.
3. Information Retrieval Languages and Information Retrieval Systems
In information retrieval on computer systems, as distinct from library methods, which have a wider scope, some retrieval languages are tied to a specific technology or system. Well-known retrieval languages include SQL, SPARQL, and Boolean query languages.
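As a minimal sketch of how a Boolean retrieval language can be evaluated, the following Python code builds an inverted index over a toy collection and answers AND, OR, and NOT queries with set operations. The documents, the function names, and the whitespace tokenizer are illustrative assumptions and do not describe any particular system.

```python
from collections import defaultdict

# Toy collection (hypothetical documents, for illustration only).
DOCS = {
    1: "information retrieval on the web",
    2: "database retrieval with query languages",
    3: "knowledge retrieval and ontologies",
}


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, a, b):
    """Documents that contain both terms."""
    return index.get(a, set()) & index.get(b, set())


def boolean_or(index, a, b):
    """Documents that contain at least one of the terms."""
    return index.get(a, set()) | index.get(b, set())


def boolean_not(index, all_ids, a):
    """Documents that do not contain the term."""
    return set(all_ids) - index.get(a, set())


INDEX = build_index(DOCS)
print(sorted(boolean_and(INDEX, "retrieval", "web")))   # [1]
print(sorted(boolean_or(INDEX, "web", "ontologies")))   # [1, 3]
print(sorted(boolean_not(INDEX, DOCS, "database")))     # [1, 3]
```

A Boolean query returns an unranked set: a document either satisfies the expression or it does not.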
4. Metadata, descriptors, and indexing
In the 1960s and 1970s, computers had limited storage capacity and slow processors. Documents in these systems had to represent their content with metadata (literally, data about data) and a small set of terms, called descriptors. Metadata typically included author, title, source, and date. Metadata and descriptors were assigned by hand.
Nowadays, these metadata are used in the Semantic Web because of their simplicity, which facilitates interoperability and navigation on the Web.
Automatic indexing covers the techniques for automatically assigning relevant terms to a document. Relevance is computed from statistics and from the term's location in the document. Examples are term frequency and inverse document frequency (combined as tf-idf), stop-word removal, and assigning a higher weight to words from the title or words with emphasized typography (e.g. bold or italic letters). Most of these factors are used in web search engines.
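The factors listed above can be combined in a short sketch. The Python code below removes stop words, counts term frequencies, adds extra weight to title words, and multiplies by an inverse document frequency of the form log(N / df). The stop-word list, the title boost of 2.0, the idf variant, and the sample documents are all assumptions chosen for illustration; real systems use many variants of these weights.

```python
import math
from collections import Counter

# Small illustrative stop-word list (an assumption; real lists are longer).
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is", "for"}
TITLE_BOOST = 2.0  # hypothetical extra weight added for each title word


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tf_idf(docs):
    """docs: {doc_id: (title, body)} -> {doc_id: {term: weight}}."""
    counts = {}
    for doc_id, (title, body) in docs.items():
        tf = Counter(tokenize(body))
        for term in tokenize(title):
            tf[term] += TITLE_BOOST  # title words count more than body words
        counts[doc_id] = tf
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tf in counts.values() for term in tf)
    return {
        doc_id: {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
        for doc_id, tf in counts.items()
    }


DOCS = {
    "d1": ("Indexing", "automatic indexing of the documents"),
    "d2": ("Search", "search engines rank documents"),
}
weights = tf_idf(DOCS)
for term, weight in sorted(weights["d1"].items()):
    print(term, round(weight, 3))
```

In this toy collection the word "documents" occurs in every document, so its idf, and therefore its weight, is zero, while "indexing" is boosted because it also appears in the title.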
5. Information Retrieval by controlled vocabularies
In Information Science, the terms of a specific domain are often listed in a normalized way. This list is called a controlled vocabulary, and each term is known as a descriptor. Such a vocabulary can be used to discover and display relationships between terms. Vocabulary control tries to avoid typical problems of natural language: polysemy, homonymy, and synonymy.
The types of relationship recorded in a vocabulary affect the results it returns. In a thesaurus, the relationships are equivalence, hierarchy, and semantic relatedness. A faceted thesaurus uses different scopes to facilitate retrieval.
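To illustrate how thesaurus relationships can influence retrieval, the following Python sketch maps a query term to its descriptor through the equivalence relationship and then expands the query along selected relationship types. The mini-thesaurus, its relation names, and the functions `to_descriptor` and `expand` are hypothetical and serve only as an example.

```python
# Hypothetical mini-thesaurus; terms and relations are illustrative only.
THESAURUS = {
    "vehicle": {
        "equivalent": ["conveyance"],
        "narrower": ["car", "bicycle"],
        "related": ["transport"],
    },
    "car": {
        "equivalent": ["automobile", "motorcar"],
        "broader": ["vehicle"],
        "related": ["road"],
    },
}


def to_descriptor(term):
    """Return the descriptor for a term or for one of its synonyms."""
    term = term.lower()
    for descriptor, relations in THESAURUS.items():
        if term == descriptor or term in relations.get("equivalent", []):
            return descriptor
    return None  # the term is not in the controlled vocabulary


def expand(term, relation_types=("equivalent", "narrower")):
    """Expand a query term along the chosen relationship types."""
    descriptor = to_descriptor(term)
    if descriptor is None:
        return [term.lower()]  # fall back to the free-text term
    expanded = [descriptor]
    for relation in relation_types:
        expanded.extend(THESAURUS[descriptor].get(relation, []))
    return expanded


print(expand("automobile"))
print(expand("vehicle"))
print(expand("vehicle", relation_types=("related",)))
```

Choosing which relationship types to follow changes the result set: following narrower terms typically retrieves more documents, while staying with the descriptor and its equivalents keeps the query tighter.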
6. Relevance
Relevance is a measure of the degree to which a certain element answers a query. This measure is subjective, in the sense that it depends on the knowledge of the person assessing the relevance.
7. Retrieval Measures
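The performance of an information retrieval system can be measured by two coefficients, precision and recall. Precision is the fraction of the retrieved elements that are relevant, and recall is the fraction of the relevant elements that are retrieved.
Both measures have an inverse relationship (Cleverdon's law): increasing precision tends to decrease recall. These coefficients measure two different factors: noise (irrelevant elements that are retrieved, which lowers precision) and silence (relevant elements that are not retrieved, which lowers recall).
A test collection records, for each query, all the elements in the database that are relevant to it. Each query-document pair is judged by hand. Test collections are used in international competitions to test retrieval systems. TREC (Text REtrieval Conference) is the best-known conference on retrieval.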
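Taking the two coefficients to be precision and recall in their usual set-based form (an assumption about the definitions intended here), they can be computed as in the following Python sketch. Noise and silence are taken as the complements of precision and recall, which is a common convention; the document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and hand-made relevance judgments for one query.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]

p, r = precision_recall(retrieved, relevant)
noise = 1 - p    # share of retrieved elements that are irrelevant
silence = 1 - r  # share of relevant elements that were missed
print(f"precision={p:.2f} recall={r:.2f} noise={noise:.2f} silence={silence:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found, so a system that returned more documents to find "d5" would probably also return more irrelevant ones.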
8. Retrieval Models
Retrieval models compute the degree to which certain elements answer a query. As a general rule, the degree is computed by means of a similarity coefficient (cosine, phi, etc.). Widely used models include the Boolean, vector space, and probabilistic models.
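The cosine coefficient mentioned above can be sketched in Python as follows. The query and documents are represented as sparse term-weight vectors; the weights and document identifiers are made up for illustration, and the function name `cosine_similarity` is an assumption rather than a reference to any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical query and document vectors.
query = {"information": 1.0, "retrieval": 1.0}
docs = {
    "d1": {"information": 1.0, "retrieval": 1.0, "systems": 1.0},
    "d2": {"retrieval": 2.0, "models": 1.0},
    "d3": {"ontologies": 1.0},
}

# Rank documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranking)  # ['d1', 'd2', 'd3']
```

Unlike a Boolean match, the coefficient yields a graded score between 0 and 1 for non-negative weights, so the answer to a query can be presented as a ranked list.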
Jorge Morato (5/11/2009)
[This entry corresponds to the first version of the article.]