|
HOME
PREVIOUS PAGE
NEXT PAGE
PART II: TEXT RETRIEVAL
[Page 141 ]
4 ANATOMY OF TEXT RETRIEVAL
4.1 Introduction
4.1.1 Defining an information retrieval system
The concept of "information retrieval" is broad and often employed
in an imprecise meaning. It is used to denote systems designed to
provide users or a group of users with information. Salton/McGill
(1983:x1) give this definition:
"An information retrieval system is an information
system, that is, a system used to store items of information that need
to be processed, searched, retrieved, and disseminated to various user
populations."
In sect 3.4 above we have given a definition of an information
system as part of the legal communication process - a definition which
corresponds to the more general definition cited above. In our context
also the concept of "information" has been defined, and we shall
remember from our earlier discussion that it was defined roughly as
the meaning or content of data.
In the literature, there are a number of different definitions of
"information retrieval". The concept is generally used in a more
precise meaning than the one above, making retrieval systems one type
of information systems. But a further limitation is generally made,
often implicit - to computerized retrieval systems or bibliographical
retrieval systems. Perhaps one might say that "information retrieval
system" has become accepted as denoting the type of systems discussed
in the works of Salton, Cleverdon, Lancaster, Sparck Jones, and others
(cfr Rijsbergen 1979:1). Lancaster gives the following definition of
an "information retrieval system" (Lancaster 1978:11-12):
[Page 143 ]
"As it is most commonly used, the term information
retrieval is really synonymous with literature searching. Information
retrieval is the process of searching some collection of documents,
using the term document in its widest sense, in order to identify
those documents which deal with a particular subject. Any system that
is designed to facilitate this literature searching may legitimately
be called an information retrieval system. "
Common to the authors mentioned is that they work within the field
of literature retrieval, most of them having their background in
library environments. The phrase "information retrieval" then becomes
synonymous to "document retrieval", "literature searching", "reference
retrieval", etc.
The term "reference retrieval" is reserved by some
authors for a special user situation, but does really originate within
systems for literature searching where it is possible to retrieve
references only to books.
In limiting the concept of information retrieval to document
retrieval, one excludes systems which are not oriented towards the
retrieval of texts, like "data base management systems" or "question
answering systems". Such systems are often called "data retrieval
systems" (Rijsbergen 1979:1-3) or "fact retrieval systems" (Salton),
where "data" and "fact" both denote an actual data element referring
to some fact in reality. Some authors distinguish between "fact
retrieval" and "question answering systems". Kochen uses the phrase
"fact retrieval systems" of systems where the answer (result of the
retrieval) is copied directly from the "data base", and "question
answering systems" of more intelligent systems which themselves make
conclusions on the basis of stored data. The distinction is
illustrated by the following example (Kochen 1974:66):
"A map which gives the average driving time between major
U.S. cities can be used for fact-retrieval if it is asked for the time
from Los Angeles to San Fransisco; it is used as a question-answering
system if it is asked for the time from Los Angeles to Portland,
Oregon, even if the map shows only the time from Los Angeles
[Page 144 ]
to San Fransisco, and from San Fransisco to Portland,
because the time from Los Angeles to Portland has to be deduced."
Many of the definitions are diverging, and may at first seem rather
confusing. Lancaster (1968:1) defines an information system as
something not intended to increase the level of knowledge, as this
increase in information is achieved only after the user has read and
understood the documents to which he has been referred. As an
information retrieval system gives references only to
literature, one will not - according to Lancaster - retrieve
information. Lancaster is himself critical to the concept of
information retrieval:
"It should be clear from the discussion above that
information retrieval is not a particularly satisfactory term to
describe the type of activity to which it is usually applied. An
information retrieval system does not retrieve information. Indeed,
information is something quite intangible; it is not possible to see,
hear, or feel it. We are 'informed" on a subject if our state of
knowledge on it is somewhat changed ... Information transfer can only
take place if the user reads the document and understands it.
Information, then, is something that changes a person's state of
knowledge on a subject."
It should be noted that the cited definitions of information
retrieval systems deviates from those developed in this book, and that
this may be due to the different user situation and the typical
different design of a legal information retrieval systems
compared to those literature systems implied in the general
discussion.
4.1.2 Sketch of an information retrieval system
In fig 4/1 are sketched the elements of an information retrieval
system. The elements are grouped in two parts: the one on the top
describes the updating, the one below the retrieval.
[Page 145 ]
Fig 4/1 - Sketch of an information retrieval system
The updating consists of the subsystem preparing data for
retrieval. The objective is to have the data represented in the system
in such a way that it may efficiently be exploited in the retrieval
process. The updating consists of two subprocesses:
- (1) Preparing data (text) for retrieval (adding value)
- (2) Storing of data
The first process will be to transform information on the object to
be retrieved (and the object may be a book or a text) to data. This
will generally imply an analysis of the information and representation
of this as physical symbols with a structure corresponding to the one
given for the retrieval system. In a document retrieval system one
will frequently, from practical or economical reasons, select a
representation of the source different from the authentic text - for
instance an abstract or a set of indexing terms. The document design,
for instance the
[Page 146 ]
intellectual indexing, is such a transformation of source to
document.
Preparing data for retrieval often takes place manually, but in
some systems there are automatic routines for some of this activity.
Such a computerizations implies that it is possible to specify exactly
how the data is to be processed, and this may be quite difficult. Much
research has been aimed at developing efficient and advanced methods
for automatic indexing, but few of these methods have been implemented
(cfr the work of Salton). During the last years, the work within the
area of artificial intelligence has opened new possibilities for such
work.
The storage includes a systematization of data in such a way that
it may be retrieved fast and efficiently. In a manual information
retrieval system one would, for instance, sort indexing terms
alphabetically, while a computerized system has a number of more
sophisticated possibilities. In a document retrieval system, one will
generally store each word in an inverted file, and associate each word
with references to the documents in which the word occurs. The
retrieval process can be made more efficient by creating an index to
the inverted file.
The retrieval process is the part of the system most characteristic
of an information retrieval system. Its objective is to retrieve the
documents required by the user - and these only. This subsystem
consists of three major processes:
- (3) Transformation of problem to search request
- (4) Retrieval
- (5) Presentation of the result
The first step in a retrieval process is the construction of search
requests. The user transforms his problem into a form (the request)
which is compatible to the data base design and the requirements of
the retrieval system. It is important for the retrieval performance
that the request corresponds to the way in which the information is
represented in the retrieval system. If there is a lack of
correspondence, it will be difficult - or impossible - to achieve a
satisfactory result.
The retrieval itself consists of matching the search
[Page 147 ]
request against the data base, ie retrieving a certain telephone
number from the directory, or all documents containing a certain
phrase.
The way in which the result is represented to the user will
primarily depend on the type of information retrieval system employed.
In a manual system, the result is in a fixed format, while a
computerized system may give a choice of different formats and
responses. Several systems also have the possibility to rank documents
according to criteria for possible relevance.
4.1.3 Different user situations
In a discussion of information retrieval, it is desirable to
differentiate between various user or retrieval situations. Some users
desire a precise answer to a request, for instance, "Which sum is
allocated to social securities in the budget for 1984?", while others
desire everything written on a subject, for instance, "Find all
documents discussing the social security budget for 1984". It is
important to be specific about which type of requests the system is
intended to handle before designing the retrieval system.
In this book, we shall distinguish between the different user
situations by looking at the information needs. We shall use the terms
"fact retrieval" and "interest retrieval", and let the meaning of
these correspond roughly to what is generally meant by "data
retrieval" and "information retrieval". It is, however, important to
emphasize that the objectives and background of the definitions vary.
Some authors use the term "reference retrieval" in stead
of "interest retrieval" - for instance Bing/Harvold 1977. We have
discarded this term because many document retrieval systems allow
different and more adequate representation of the source than a
reference, for instance the authentic text or an abstract. Also, the
term - like "document retrieval" - does not characterize the essential
elements of interest retrieval, as fact retrieval to a certain extent
may also be carried
[Page 148 ]
out by a document retrieval system. It should be noted
that the term "interest retrieval" is associated with the theory of
communication processes presented in chapter 2, where the situation of
the user is characterized by his "area of interest", cfr sect 3.5.2.
Typical for interest retrieval is a user with a problem within his
area of interest, and who wants to find documents illuminating this
problem - for instance "Retrieve all documents discussing the
importance of the morale of the mother in cases on child custody". In
interest retrieval, one does not look for an exact answer to the
problem, but documents discussing the problem. In such a situation the
result will depend primarily on the basis of retrieval and only
secondarily upon the retrieval system. The basis for retrieval is the
documents contained in the system and the search request. The better
the sources are represented in the system, the better the chances for
a satisfactory result. But it is impossible to design a system which
will guarantee that all and only relevant documents will be retrieved,
since the assessment of relevance is partly subjective to each user.
The retrieval result therefore may be considered only as a proposal of
what documents should be considered relevant.
Fact retrieval is characterized by a user looking for a specific
response (facts) to his problem, for instance "Who has written the
book Lord Jim?" The illustration of Salton of the difference between a
"reference retrieval system" and a "question answering system" also
gives an adequate illustration of the difference between interest and
fact retrieval (Salton 1968:392):
"... if a potential customer were to ask, 'What is the
boiling point of water?', an effective question answering system would
reply 'one hundred degrees centigrade', whereas a reference retrieval
system might provide a citation to a textbook in elementary physics
which in turn would contain information about the physical properties
of water."
To satisfy fact retrieval, information must be represented and
stored in the system in such a way that the system may make the
necessary identification.
[Page 149 ]
This implies for instance that the request to be satisfied must be
foreseen. Correspondingly, the search request must contain a detailed
and unambigious description of the problem, giving the necessary
instructions to the retrieval system. The retrieval system will either
respond to the request - which presumes a one hundred per cent match
with the request, or it will not respond at all.
[Page 150 ]
4.2 Characteristics of text retrieval
4.2.1 Document retrieval
The concept "text retrieval" is used today to describe a certain
computerized method for retrieving documents. The process itself is
quite general and often used in everyday situations. One does, for
instance, suspect that one has caught a certain germ which has been
reported in the news lately. The most usual way to find the news items
would probably be to browse through newspapers until hitting the
relevant articles. Reading with sufficient care, one would obtain a
perfect result in the sense that all relevant articles are being
retrieved.
This manual retrieval does not offer problems as long as the number
of documents (in our example, the stack of newspapers) is small.
Confronted with all the newspapers of a country, however, the
situation would be different. It would indeed be doubtful whether one
would consider the task at all. In such cases computerized retrieval
would be advantageous.
The computer has a capacity which would make it possible to search
several thousand pages of text in a few seconds. It would not
understand what it reads, and would have to be instructed in great
detail for what to look. Requested to retrieve all items containing
the word "germ", it would not find items in which the word "bacteria"
was used. It would not consider the two synonymous words unless this
is specified.
4.2.2 Full text retrieval
The term "full text retrieval" is not used in this book as it is
ambiguous. It may either denote that the source is represented in
authentic form, or that the principle of text retrieval is employed,
regardless of the document design. In this book, text retrieval is the
name only of a certain retrieval method and implies nothing of the
document design.
Text retrieval does not presume that the source is edited before
being converted to machine readable
[Page 151 ]
form. It is usual to represent the sources in a general text
format. In such cases, one often characterizes the method as full text
retrieval. In the input phase, it is in principle sufficient that "end
of document" and "end of word" is defined. In some systems, also "end
of sentence" and "end of paragraph" may be defined, allowing retrieval
also on the distance between words in terms of these text segments.
Retrieval is based on word occurences and combinations, and in a
text retrieval system any word in the material may be retrieved. In
order to obtain a high degree of efficiency, one does however exclude
some common words or stop words like prepositions, conjunctions,
articles, pronouns, certain verbs, etc occurring frequently in the
text and which are rarely helpful in describing the content of a
document with respect to the retrieval system. Such stop words
generally make up some 35-45 per cent of the text volume.
4.2.3 Interest retrieval
Text retrieval systems are well suited for interest retrieval. The
documents are stored in natural language, making it possible for the
user to employ any word of the text in the search request. For
specific problems, specific requests may be constructed; and
properties of natural language documents make it possible to rank
retrieved documents, for instance using word statistics.
They are, however, less suited for fact retrieval. The system has
no exact data on the content of the documents (only on which words
they contain), and cannot respond to requests presuming such data. The
system may furnish information on which words the documents contain
and in what sequence, and if this is sufficient for the fact retrieval
request, it may serve such a purpose. Retrieving on a compilation of
statutes, it may for instance specify how many documents are citing a
certain statute by using the identification of this statute as a
search request.
Though free text formats do not presume editing, inclusion of fixed
fields containing predefined data (name of author, title, date,
citations, etc), editing is in practice often implied. The actual
[Page 152 ]
computer system has facilities to exploit such fields as well. In
this chapter we shall, however, concentrate on the simple, unedited
form in which only a raw text is inputted. But systems having a
superstructure of fixed fields have the possibility of being
successfully used both for fact and interest retrieval.
[Page 153 ]
4.3 The retrieval process
The retrieval process is defined as the total process from the
emerging problem (information need) to the final answer from the
system. Fig 4/2 is a sketch of the retrieval process consisting of two
subprocesses - one manual and one computerized (Harvold 1980a:133).
The first process (the transformation of the problem to a request)
is the responsibility of the user. In a certain search language, the
user attempts to express his information need according to the
requirements of the system. This is generally the only possible way in
which the user is permitted to specify his problem. It is therefore
important that the search language is flexible, and is exploited in
such a way that the request reflects the problem.
Some text retrieval systems are based on mid-users. The end-user is
not allowed to use the terminal, but has to rely on certain users
trained in the use of the system. In such cases, the transformation
process itself will have two steps - initially the formulation of the
information need in natural language, and then the translation of this
into the retrieval language of the system. In cases where the system
permits natural language request, such a transformation does not have
to take place even if mid-users are introduced.
The second subprocess is carried out by the retrieval system. Based
on the request, the system attempts to find all and only the relevant
documents.
Generally the formulation of a search request and the search
process has to be repeated several times before the final result is
achieved, reflecting the iterative process described above under sect
2.5. The user may, for instance, in browsing through the retrieved
documents, get ideas for improving the search requests. There have
been quite successful experiments in computerization of this feedback
process, cfr for instance Salton's experiment on relevance feedback
and the experiments within the Responsa project on local metrical
feedback.
[Page 154 ]
Fig 4/2 - The retrieval process
[Page 155 ]
Assessing the result generally implies a comparison of the result
to the set of relevant documents the user would have identified in
reading the total number of documents in the data base. Naming this
set S1, and the retrieved documents S2, the intersection between the
two sets will be a measure of the quality of the computerized search.
Fig 4/3 - Definition of recall and precision
S1: relevant documents
S2: retrieved documents
S3: retrieved documents being relevant
Recall: S3/S1
Precision: S3/S2
The result is usually measured in recall and precision - recall
being a measure for the number of relevant documents being retrieved,
and precision a measure for how many of the retrieved documents are
relevant.
The concept of relevance is referred to several times in
this chapter without being specified or defined. In sect 2.6.1 above
we have discussed the relevance of legal sources and suggested a
definition for legal information retrieval systems. A more general
discussion of the concept will follow below as an appendix to this
chapter
[Page 156 ]
4.4 Choosing the data base
In this section, we shall discuss some general aspects related to
document design and choice of data base. To some extent we shall
repeat some of the arguments offered above under sect 3.2 and 3.3, but
this time related specifically to text retrieval systems.
When establishing a computerized system, one will start by defining
a documentation area, and within this area choose the sources to be
represented in the system as documents. In this book, the term
"document" is reserved for the representation in the system (cfr above
under sect 3.3.1), though frequently it is used both for the original
source and its representation in the system.
The choice of documentation area and the document selection and
design are central elements in establishing an information system. An
information retrieval system will never be yield more information than
has been put into it. The data base limits the possible performance of
the system.
When discussing the retrieval process, one will perceive both the
problem of the user and the data base of the system as determined
outside the process. It should, however, be emphasized that the data
base and the document design are essential for system performance, and
should not be overlooked in the search for factors explaining this
performance. Especially document design escapes notice surprisingly
often as a critical element in the system.
There are a number of considerations in selecting sources from the
documentation area. The sources should be up to date, the data base to
be constructed should be representative for the area selected and -
not least important - the system should operate within acceptable
economic limits.
From the user's point of view, it is essential that the information
need is covered as good as possible by one system only, thus
eliminating the necessity of employing frequently more than one system
to solve the problem. This is usually measured by coverage, which is
discussed above at sect 3.5.3 (2) in terms
[Page 157 ]
>of user-constructed information systems, and including some
critical comments to the general use of the concept.
In sect 3.3 above is also discussed the problem of document design,
a discussion concentrating on three typical designs - indexing terms,
abstracts and authentic texts. As stated several times, many
properties of the retrieval systems are determined by this design
rather than by the retrieval software. Also, this design will largely
determine the costs of establishing the data base - cost factors like
- (1) the cost of editing the sources for input
- (2) the cost of converting the text to machine readable form
- (3) the cost of updating and storaging the data base
- (4) the cost of processing the request
In principle the authentic text requires no editing, and is
consequently the less costly solution according to factor (1). Often
even abstracts are established outside the system (for instance as an
introduction or headnote to the authentic text), but indexing terms
are more seldom easily available in this way.
The editing of an abstract or assignment of indexing terms are
processes which are rather expensive. Each source must be assessed by
at least one qualified person and the document must be designed
according to the prevailing norms. In intellectual indexing, complex
indexing schemes are frequently employed - the development and
maintenance of these (like thesauri of controlled indexing
vocabularies) may by themselves be quite expensive.
The costs of factors (3)-(5) are relative to the volume of the
material. Because indexing characteristics will generally be briefer
documents than the other possibilities, they will also be less
expensive.
The technological development produces cheaper hardware with
increased capacity. The differences in the cost of converting etc for
the three types of documents have been reduced over the last decade,
and this will continue in the future. If sources are
[Page 158 ]
available in machine readable form from word processing
equipment, the authentic form may become the cheapest one to be put
into the system.
But even if the text is available in machine readable
form, considerable effort may be needed to bring them into the form
required by the information system. If, for instance, the text is
riddled with printer's codes, it will have to be cleaned and perhaps
new codes for the information system will have to be introduced. The
fact that the text is machine readable is a minimum condition for
reuse in an information system, but it is by no means a sufficient
condition.
The conclusion therefore will be that one should exploit existing
material from outside the system as widely as possible. Costs of
manual resources should be assessed in respect to the cost for
processing resources, and in this assessment, the probable relative
cost development should be taken into account. One should consider
both to possible technological development and the possible
availability of qualified persons.
[Page 159 ]
4.5 Retrieval strategies
4.5.1 Introduction
A retrieval strategy is a method on which the retrieval process
(cfr fig 4/2) is based. It will include
- (1) guidelines for formulating the search request
- (2) choices for selection and ranking of documents
- (3) modification of search requests based on feedback from the
system to the user
The retrieval strategies are those aspects of text retrieval system
attracting most attention. The choice of retrieval strategy will be
decisive for the way in which retrieval system will function in
practice. It will influence factors like
- (1) user friendliness, ie how easily the user is able to employ
the text retrieval system;
- (2) what information may be fed back to the user;
- (3) retrieval performance (measured in recall and precision);
- (4) response time, ie the time interval between entering a search
request and the feed-back information on retrieved documents.
A retrieval strategy must be developed to suit the actual user
environment. Requirements must often be given priorities, as it is
difficult to develop a strategy which satisfy simultaneously all
requirements equally well. The requirement for high user friendliness
will, for instance, easily come into conflict with high retrieval
performance and short response time. When basing retrieval strategies
on requests in natural language, it will be difficult to achieve the
same level of retrieval performance as when done with "artificial"
search languages.
4.5.2 The retrieval function
(1) Simulating relevance assessment
Comparing the retrieval process with a manual reading
[Page 160 ]
of the data base, the retrieval function may be perceived as a
simulation of the relevance assessment. Its objective is to identify
all and only relevant documents - ie the same documents as would be
identified by reading the data base. A retrieval function therefore
must be based on the same criteria for relevance, and should be
constructed on the basis of an analysis of relevance assessment.
It is obviously no easy task to design a perfect retrieval function
- ie a function retrieving all and only relevant documents. Firstly,
one will not always agree on which documents are actually relevant in
respect to a certain problem (cfr the appendix to this chapter on the
concept of relevance). Secondly, it may pose problems to represent the
criteria of relevance in a computerized system. Also, the retrieval
criteria must be selected from the characteristics of natural language
texts. Traditionally, one have three such characteristics:
- (1) the words (the vocabulary) of the document
- (2) the frequency by which each word occurs in the document
- (3) the sequence of the words in the document
The difference in the relevance assessment in fact and interest
retrieval must be appreciated also when designing retrieval functions.
Generally one makes a distinction between two categories of retrieval
functions - identity and nearness functions.
(2) Identity functions
An identity function is characterized by the selection only of
documents which satisfy all criteria in the search request. It
actually bisects the data base, one segment being the retrieved
documents, the other the unretrieved documents. Retrieved documents
typically are not ranked before presented to the user. This strategy
is well suited for fact retrieval, but less suitable for interest
retrieval.
The simplest form of an identity function is to retrieve all
documents containing a certain, specified search term. This is typical
of retrieval in defined fields, where one makes requirements according
to the
[Page 161 ]
value of the field (for instance "author" equals "SALTON").
Boolean retrieval. Most text retrieval systems offer more
advanced identity functions, making it possible both to define several
search terms and to formulate requirements concerning the relations
between the terms. The best known is based on Boolean algebra, named
after the British mathematician George Boole (who actually based some
of his classical examples on Jewish dietary law).
Boolean logic gives the user the possibility of qualifying
relations between search terms by Boolean operators like AND
(conjunction), OR (disjunction) or NOT (negation). For instance the
search request
R = T1 AND T2 NOT T3
make the system retrieve all documents containing both the search
terms T1 and T2, but excluding those containing these two terms as
well as those containing T3. In fig 4/4 this set of documents is
indicated:
Fig 4/4 - Illustration of Boolean retrieval
Di = set of documents containing Ti
The operator OR may be used to represent synonyms:
| S' = | | (T1 OR
T12 OR ... OR T1m) |
AND
(T2 OR T22 OR ... OR T2n)
NOT
(T3 OR T32 OR ... OR T3j)
[Page 162 ]
The strength of the retrieval strategy is its flexibility, and the
possibility it offers for experienced users to construct complicated
and well-performing requests. In principle, high retrieval performance
is always possible.
The drawback of the function is the high demands posed to the user,
both with respect to search language and to specification of search
terms. An inexperienced user will have difficulties in exploiting the
possibilities, and a request may therefore easily have a structure
which is different from the one intended by the user (for instance the
user mixes up the ORs or ANDs, or is unaware of the sequence in which
they are executed).
Since Boolean requests presume that all specified requirements must
be satisfied by a document, this implies a strict demand on the user
to specify terms. If the user excludes a synonym to T1 (cfr the
example above), this will result in the non retrieval of a document
only using this term to express the relevant idea - even if the other
requirements in the request are met.
Or more generally - a request of the type "T1 AND T2 AND ... AND
Tn) will result in the non retrieval of a document containing all
terms except one. This document will be qualified as just as
uninteresting as a document not containing any of the specified terms.
As we know that specificity (less than exhaustive specification of
search terms) is the most important cause for recall failure (60 - 70
per cent), this is a serious objection to the use of Boolean requests
in text retrieval (cfr Fjeldvig 1980:161-163).
Nevertheless, Boolean retrieval is by far the most widely used
strategy - also for text retrieval. This does not imply that this
retrieval strategy is better than others, but rather that its
traditions are long, and that it works with short response-times in an
on-line system. The latter property is important, as it gives the user
a real possibility to reformulate and improve his request, and in that
way introduce a dynamic property to the otherwise quite static
identity function.
[Page 163 ]
(3) Nearness functions
Nearness functions are designed for interest retrieval.
Characteristic for interest retrieval is the difficulty - if not
impossibility - of defining unambigious criteria for relevance. The
relevance assessment is often complicated by and based on vague
considerations related to the documents. Nearness functions therefore
result in a list of documents, ranked according to the probability of
relevance. In browsing through the documents ranked highest, one will
get the highest probability of identifying relevant documents or get
ideas on how to improve the search request.
A nearness function calculates similarity between the search
request and the documents from criteria assumed to indicate relevance.
Identity between the document and the search request denotes a high
probability of relevance.
A nearness function presumes that
- (1) it is possible to specify criteria indicating relevance, and
- (2) that these criteria may be used to rank documents in such a
way that probable relevant documents will be ranked high.
Word frequency based ranking algorithms. There are several
different nearness functions. Most are based on one of the following
ranking criteria.
- (1) The total number of search terms occur ing in the document.
- (2) The number of different search terms occurring in the
document.
- (3) The number of conceptors in the document, a conceptor being
defined as a set of terms denoting the same idea.
The simplest and most usual forms of ranking are based on frequency
(1) or (2). These criteria are based on a hypothesis that the more
search terms a document contain, the higher the probability for an
idea to be present in the document. The hypothesis is tested in for
instance the controlled experiments in
[Page 164 ]
text retrieval at the Norwegian Research Center for Computers and
Law, where both frequencies gave positive results when used as ranking
criteria (cfr Bing/Harvold 1974:52-55). Also later experiments have
demonstrated that the ranking improved when the frequencies were
modified with respect to document length (cfr Fjeldvig 1976:57-66).
This would seem to correspond to intuitive ideas, as the higher
frequency of search terms in one document compared to another might
simply be caused by the first document being relatively longer than
the second.
There are ranking algorithms combining frequencies (1) and (2), and
also using them combined with other criteria. In the IBM text
retrieval system STAIRS/VS the user may choose between 5 different
algorithms of this kind, and several of these combine the frequencies
with the number of documents containing a certain term (cfr
Bing/Harvold/Kjønstad/Stabell 1976:94-101).
Vector retrieval. Best known of the nearness strategies is
perhaps the cosine function in vector retrieval, which for instance
Salton has analyzed in the SMART project (cfr for instance Salton
1971).
The cosine function is based on a vector representation of the
documents. Each document is described in the form of an n-dimensional
vector, where n is the number of different terms in the data base.
d = T1, T2, T3, .., Tn
Each component, Ti, represents a certain term in the data base, and
its value equals the weight assigned to this term in the document in
question. If the term does not occur, its weight equals zero. In the
example in fig 4/5, the data base contains only 11 different terms.
Document 1, represented by vector d1, contains 6 of these words, and
is assigned the weights 1, 10,0, 3, 0, .., 0. If the weights equal the
frequency of the words in the documents, this implies that word 3, 5,
6, 7 and 11 do not occur in the document - and they are consequently
assigned the weight 0.
When retrieving, the search request is also represented by a vector
constructed in the same way, s (cfr the example). The similarity
between the search
[Page 165 ]
request (search vector s) and each of the documents (document
vectors, di) is measured by cosine as the angle between the vectors:
The smaller the angle, the greater the similarity between the
vectors and the higher the cosine value. At the cosine value equaling
1 the search vector is identical to the document vector - ie the
search request and the document contains exactly the same terms.
Fig 4/5 - Example of vector retrieval
Let there be 6 documents in the data base, represented by
the vectors below:
d1 = [1,10,0, 3,0,0,0,4,6,8,0]
d2 = [0, 0,1, 2,1,0,0,3,1,0,0]
d3 = [2, 0,4,10,3,0,7,1,0,0,3]
d4 = [3, 0,1, 8,2,0,0,2,0,1,1]
d5 = [0, 0,5, 2,4,0,0,3,0,1,1]
d6 = [2, 3,5, 3,0,0,2,0,1,0,0]
The search request consists of 5 search terms each occur
ing once:
s = [1, 0,0, 1,1,1,0,0,0,1,0]
Cosine values
cos(s,d1) = 0,103
cos(s,d2) = 0,194
cos(s,d3) = 0,126
cos(s,d4) = 0,183
cos(s,d5) = 0,158
cos(s,d6) = 0,139
Ranked result
Document 2
Document 4
Document 5
Document 6
Document 3
Document 1
The cosine function differs from a function based on word
frequencies (cfr frequency (1) above) as it is also a function of the
length of the document vector.
[Page 166 ]
The scalar product will generally correspond to the total number of
terms in the search request (frequency (1)), as the terms in the
request generally occur only once. If the search term frequency is
identical for more documents, the cosine function will give priority
to documents with the lesser vector length - ie small document length
and low and even term frequency. The latter may lead to the result
that documents treating only one topic may be ranked above documents
treating several topics - an effect which do not adequately reflect
the probability of relevance of the document.
In the project NORIS (8) III at the Norwegian Research Center for
Computers and Law, the cosine function was compared to a number of
other ranking algorithms, among these frequency (1) adjusted for
document length. The only difference in these two functions is that
the cosine function favours documents with a low and even term
frequency. Controlled experiments demonstrated, however, that this
effect gave little effect (cfr Fjeldvig 1976:133-137).
Vector retrieval is a special retrieval technique giving
possibilities for a number of different proximity functions - the
cosine function being only one of them (cfr Salton 1968). Intuitively
the technique would seem well adapted to measure similarity between
document and search request vectors. But like any other search
strategy, it does have its drawbacks.
One of the drawbacks is that it requires higher processing
resources than the traditional strategies based on an inverted file
structure. Even if document vectors are usually gererated only once,
it is nevertheless necessary to gererate for each request a search
vector and (in principle) compare this to all the document vectors.
In the retrieval system of the Norwegian Research Center
for Computers and Law, developed by Tove Fjeldvig within the project
NORIS (28) - cfr Fjeldvig 1978 - a somewhat different procedure was
selected. Rather than comparing the search vector to all the document
vectors, a preliminary selection was made of the sub-set of vectors
having at least one term in common with the search vector. This was
made possible by integration with the text retrieval system
NOVA*STATUS,
[Page 167 ]
using its inverted file structure. Studies demonstrated
that even for small data bases this required far less resources than
comparing the search vector to all document vectors.
Clustering based retrieval strategies are one attempt to
solve the problem of processing resources (for instance Salton
1968:254- 265, Rijsbergen 1975:85-89). These imply that documents with
similar properties are clustered, as they also tend to be relevant to
the same requests. Each cluster (or group) is represented by a vector
- a "profile" or "cluster profile", a characteristic of the documents
in the cluster. These are in turn compared to the search vector. There
are a number of different ways to create the clusters (they may for
instance be organized in hierarchies), and there are many proposals as
to how they may be dynamically adapted to the type of requests
directed to the information system. Salton states that this type of
strategies requires less resources than the traditional vector
retrieval, but reduces retrieval performance. The tendency is that the
bigger the cluster, the less resources are required, and the lower the
retrieval performance (cfr Salton 1968:254-265).
The most serious objecton to the vector technique is nevertheless
that it does not offer the user a possibility of specifying the
relations between the terms of the search request. When generating the
document vectors, one may, of course, group synonymous terms and have
these represented in a common vector component. This does not,
however, solve the problem of context dependent synonyms, as they must
be specified relative to the actual search request. The resources
available, however, make the regeneration of vectors for each search
less than realistic.
Vector retrieval is not alone in ignoring the semantic relation
between terms. This holds true for all ranking algorithms based on the
frequency of search terms and the weight of search terms. If search
requests contain two or more ideas, this may easily lead to search
functions not taking into consideration the relevance criteria of the
user.
Vector based retrieval strategies are not often used in operational
systems, but are popular tools in experimental systems (for instance
SMART, cfr Salton
[Page 168 ]
1971, or Tapper's case citation retrieval system, cfr Tapper
1982).
Conceptor based retrieval. The last ranking criterion, the
frequency of conceptors, takes into consideration the relation between
search terms and the ideas they represent. A conceptor is defined as a
class of terms representing the same idea (usually synonyms). A
strategy of ranking documents primarily on the frequency of
conceptors, is called a conceptor based retrieval strategy.
An "idea" is used in this book to denote a simple,
semantic concept - defined within the pragmatic context of the problem
which the user tries to solve using an information system. It is
preferred to "concept" as an "idea" may be rather vague or temporary.
The reasoning behind conceptor based retrieval is centered on an
assumption that the request may be given a structure corresponding to
ideas inherent in the user's problem. A necessary condition for a
document to be relevant, is that these ideas are contained in that
document. Even if this condition is not sufficient, it is necessary
for relevancy. It would therefore be reasonable to rank documents on
the basis of the probability that they contain all the ideas. It is
also reasonable to assume that this probability is a function of the
number of ideas to be identified in a document. If, for instance, a
document contain for instance all the ideas of the search request,
this should be ranked on top. If only two of three ideas are
identified, this should be ranked lower. But it is more probable for
this document to be relevant than for a document in which only one of
the ideas is identified.
When formulating the search request, each idea is described by a
conceptor. In the Norwegian text retrieval system NOVA*STATUS, for
instance, the user describes his ideas through a dialog with the
system, where the user is prompted for each conceptor. Each conceptor
is described as a class in the following way:
[Page 169 ]
CLASS 1: T11, T12, T13, .., T1n
CLASS 2: T21, T22, T23, .., T2m
CLASS 3: T31, ...
where the term Tij describes the idea i.
An idea is considered identified in the document when at least one
of the terms describing the idea is present in the document. This is
not always true - for instance, the word may be a homograph. In
NOVA*STATUS one has decided to introduce the frequency of the total
number of terms from the search request occurring in the document as a
secondary rank criterion. If two documents have an equal number of
conceptors, the system will favour that document in which the
conceptors have the strongest representation - ie which contain more
search terms.
An example of conceptor based retrieval using NOVA*STATUS
is given below, modelled on the example of vector retrieval described
above. Notice that the ranking of retrieved documents is quite
different from vector retrieval using the cosine function.
| Document 1: | | | T1(1), T2(10), T4(3), T8(4), T9(6), T10(8) |
| Document 2: | | | T3(1), T4(2), T5(1), T8(3), T9(1) |
| Document 3: | | | T1(2), T3(4), T4(10), T5(3), T7(7), T8(1), T11(3)
|
| Document 4: | | | T1(3), T3(1), T4(8), T5(2), T8(2), T10(1), T11(1)
|
| Document 5: | | | T3(5), T4(2), T5(4), T8(3), T10(1), T11(1) |
| Document 6: | | | T1(2), T2(3), T3(5), T4(3), T7(2), T9(1) |
Search request:
CLASS 1: T1
CLASS 2: T4, T5
CLASS 3: T6, T10
[Page 170 ]
Ranked result
| Document | Class frequency (conceptors)
| Number of search terms |
| 4 | 3 | 14
|
| 1 | 3 | 12
|
| 3 | 2 | 15
|
| 5 | 2 | 7
|
| 6 | 2 | 5
|
| 2 | 1 | 3
|
Conceptor based retrieval is the strategy which has given the best
performance in the controlled experiments of the Norwegian Research
Center for Computers and Law. The curves of fig 4/6/1 and 4/6/1 are
reproduced from the projects NORIS (8) I and III, and give the
relative difference in performance by different strategies, cfr
Bing/Harvold/Kjønstad/Stabell 1976:49 and Fjeldvig 1976:18. Results
are presented as average rp-curves (recall-precision curves).
Each search is represented as a rp-curve. The curve is
drawn by dividing the list of retrieved documents in rank sets, and
for calculating a value for recall and precision for each set. The
points are plotted into a rp-diagram, and connected to form a simple
and individual rp-curve. An average rp-curve is calculated on the
basis of several individual rp-curves. In the experiments of Norwegian
Research Center for Computers and Law, the average curves were
constructed by calculating average precision values for the recall
values 0.1, 0.2 ... 1.0. The points were plotted in a rp-diagram and
connected to a curve.
[Page 171 ]
Fig 4/6/1 - Average rp-curves based on different ranking
algorithms, project NORIS (8) I
Fig 4/6/2 - Average rp-curves based on different ranking
algorithms, project NORIS (8) III
[Page 172 ]
(4) Combinations of identity and nearness functions
Most conventional text retrieval systems are based on Boolean
strategies, though several of these are used as much for interest
retrieval as for fact retrieval. In the last years, several
researchers have been developing methods combining Boolean retrieval
with ranking. This will increase precision and make the users more
satisfied with the results.
In typical Boolean systems the terms are considered binary - a term
which either occur in a document or does not cccur. This makes
implementing a ranking algorithm difficult, but not impossible.
However, it is easier in systems where the terms are weighted,
reflecting the importance of that term in the document (cfr the
description of nearness functions above). In the experimental system
SIRE and the VEXT text retrieval system of the Norwegian Research
Center for Computers and Law the results of a Boolean retrieval is
ranked by calculating the similarity between the retrieved documents
and the search request. Similarity is calculated using only the terms
contained in the search request - the operators are not taken into
account. In the examples above, the cosine function is used as a
measure for similarity.
This strategy may be profitable to the user if the Boolean request
has retrieved a great number of documents. When using Boolean requests
with a conjunctive structure the requirements for retrieving a
document may be too strict for retrieving a list of documents of
sufficient length for ranking to have any effect.
Conceptor based retrieval may actually be considered an
example of a combination of Boolean retrieval and ranking in a way
favouring interest retrieval. If each conceptor is considered a
Boolean argument, the ranking will be determined by how many such
arguments are satisfied in the documents. In this way one may express
the requirements to each rank set as a Boolean argument. If for
instance the search request contain three conceptors (classes), the
requirement of the first (S1), the second (S2) and the third (S3) rank
sets may be expressed as below (cfr Harvold 1980a:142-145).
[Page 173 ]
| S1 = | | class 1
AND class 2 AND class 3 |
| S2 = | | ((class
1 AND class 2) OR (class 1 AND class 3) OR (class 2 AND class 3)) NOT
S1 |
| S3 = | | (class 1
OR class 2 OR class 3) NOT (S1 OR S2) |
Another way of using term weights in Boolean retrieval, is
permitting using weighted terms directly in the Boolean request. This
is a considerably more complex procedure, as a normal Boolean
retrieval strategy is based on the occurence or non-occurence of a
term. Introducing weights may imply, for instance, that a term occur
only partially, or that one term is more important than another. It
is, however, necessary to arrive at an interpretation of requests
composed of weighted terms combined by Boolean operators.
"Fuzzy set retrieval" is an example of such a procedure. It
is based on the "fuzzy set" theory which is again based on the
presumption that an element may be a partial member of a set. In the
model of retrieval, the weight of the term T is a document set equal
to the membership degree of the document in the set of documents
containing term T. If the weight equals 1, the document is a full
member, and if the weight equals 0 the document is not a member. A
weight between 0 and 1 will denote partial membership.
The model also takes into account dependency between terms in such
a way that several terms may belong to the same concept class with
variable membership.
The theory is especially attractive as it satisfy the general rules
for Boolean retrieval when weights equal 1 or 0. It does not, however,
contain the possibility of assigning weights to the terms of the
request. One may therefore characterize fuzzy set retrieval as an
extention of normal Boolean retrieval. It has been experimentally
tested in different contexts (cfr Salton 1983:421-422), but is not yet
implemented in any operative system.
[Page 174 ]
4.5.3 Iterative techniques
(1) Introduction
In an interactive retrieval system, the retrieval process will
often be a process of learning. One may start with rather vague
notions of how the problem at hand may be represented in the
documents, and only after several searches formulate the request in
such a way that it will correspond to the way in which the problem is
described in the data base.
Usually the request is reformulated without the assistance of the
system. The user will browse through the retrieved documents and
discover how the problem is described in these. The user will have
some information on how the request has performed, for instance how
many documents have been retrieved and which search terms do not occur
at all.
During this iterative process the user may change his understanding
of the problem. The retrieved documents may for instance disclose
aspects of the problem of which the user had not been aware. This form
of learning was clearly demonstrated in one of the controlled
experiments at the Norwegian Research Center for Computers and Law,
where the user after having read the result of the computerized
retrieval, made major adjustments to his predefined list of relevant
documents (cfr Bing/Harvold/Kjønstad/Stabell 1976:127-131 and the
detailed discussion in Kjønstad 1976).
This experiment in reassessment of predefined relevant
documents was conducted within the project NORIS (8) I. The predefined
documents were determined by manual reading of a collection of 100
decisions by the Social Security Court in respect of 20 problems. The
answer set contained a total of 162 documents as relevant to these
problems. After the computerized retrieval, the same person qualifying
the original answer set made a reassessment on the basis of the
results. In this reassessment, 16 of the original target documents
were eliminated, and as much as 61 new were added. The person
responsible for this selection was professor Asbjørn Kjønstad, at this
time working on his doctorate degree within the area covered by the
100 documents. In Kjønstad
[Page 175 ]
1976, a unique, detailed discussion of each reassessment
is made. As the most important cause of the major adjustments is
stated that originally the documents either were not fully understood
or they were even misunderstood. Several of the problems basic to the
requests were also amended on the strength of the information in the
retrieved documents. Due to the qualification of the experimenter,
together with the detailed analysis, we find this experiment of
considerable general value.
During the last years a need has been felt of making the
information system more active in this feedback process. Several
different methods have been proposed, and several of these have also
been tested in different experimental systems (for instance SMART,
FAKYR, Responsa and MEDUSA - the latter based on MEDLINE). Below some
types of feedback mechanisms will be discussed.
(2) Relevance feedback
Relevance feedback is a common term for methods by which the system
takes part in governing the feedback process based on an
identification of relevant documents. The method was introduced some
years ago, and is described for instance by Salton 1968.
In short, the method implies that after a search, the user is asked
to make a relevance assessment of the first documents on the list of
results - for example the 10 first. The user responds by indicating
which documents are relevant (or irrelevant). On this basis, the
system identifies which properties characterize the relevant in
contrast to the irrelevant documents, and amends or proposes
amendments to the search request - for instance by increasing the
weights of those terms more typical to relevant than to irrelevant
documents. The system may propose new terms, which for instance occur
only in the relevant documents, or delete terms which occur only in
the irrelevant documents. The actual reformulation of the request is
in some system a manual process based on suggestions from the system
(like CITE), or in other system an automatic process (like SMART).
[Page 176 ]
Most experiments with relevance feedback have demonstrated
improvements in performance (for instance Salton 1983:243). It may,
however, be rather difficult to measure this improvement exactly, as
it may be caused both by the relevance assessment, and by the
inclusion of new, relevant documents. It is the latter improvement
which is desireable from the user's point of view.
A drawback with this technique is that it requires information on
the terms contained in the retrieved and assessed documents. This
presumes either a data structure which conserves the relation between
the individual document and the terms contained in that document (like
for instance a vector based system), or that these documents are read
sequentially.
One of the few examples - perhaps the only - of operative use of
relevance feedback is the system CITE, which is an interface to
MEDLINE based on requests in natural language (cfr Doszkocs 1982).
The vector retrieval system VEXT of the Norwegian
Research Center for Computers and Law lends itself to a simple example
of relevance feedback. The system gives the user the possibility to
define a whole document as a search request by a simple command. If a
relevant document is retrieved by the usual strategy, one may as a
second step use this document as a request. The method is heavy on
processing resources, but has proved its usefulness in retrieving
bibliographical documents, which are generally rather brief.
(3) Local metrical feedback
Local metrical feedback is an example of another type of feedback
techniques which has been tested for instance in the Responsa project
giving encouraging results (cfr Attar/Frankel 1980). The method
attempts to identify synonyms to the search terms ("searchonyms") by
mapping the words occurring in proximity with the search terms in the
retrieved documents. These words are either listed to the user, who
then reformulates the search request; or they cause an
[Page 177 ]
automatic expansion of the request.
The method is qualified as "local" to stress that it exploits only
information from documents already retrieved by a request - and not
from the whole data base. The method is qualified as "metrical" to
characterize the basic algorithm which takes into accoun the distance
to the search words in the text (for instance the number of words,
sentences or paragraphs).
The method is also mentioned in chapter 5 under the description of
the Responsa project.
(4) Computerized processing of search requests - snowball
functions
Snowball functions is the name of another retrieval technique which
differs from those mentioned above as it allows for the system to
retrieve new documents similar to those already retrieved. The system
loops, ie results in one search initiating a new search, and so on. In
this way the number of retrieved documents increases rapidly, and the
process does not halt until a certain number of loops have been made,
or until the similarity between retrieved documents and the search
request have reached a minimum value.
Snowball functions may be especially useful in citation systems.
For instance, they may be used to trace all documents citing or cited
by a certain target document. This document may be a case, and one
might be interested in older cases cited by this one, or more recent
cases citing the target case (cfr Loosjes 1973:77-84).
(5) A preprocessor to text retrieval systems
As an alternative to the techniques mentioned, one might want to
invoke the idea of a more advanced process governed by the system,
which could, for instance through a dialogue with the user produce a
richer search request than the request specified by the unaided user.
[Page 178 ]
In 1982 a project was initiated at the Norwegian Research Center
for Computers and Law - NORIS (58) FORT - which will analyse this
possibility. The objective is to develop an "intelligent" preprocessor
to a text retrieval system, which, based for instance on knowledge of
earlier terminal sessions, properties of the documents, thesauri,
analysis of the search requests etc, suggests improvements in the
search request. These suggestions may be accepted by the user, or they
may be supplemented or ignored. The preprocessor may be turned on or
off. Turning it on, the preprocessor would start, for instance with an
analysis of the request and explain its effect to the user. For
example, it may suggest that the Boolean structure is not proper, it
may further suggest deletion of terms with a high frequency in the
data base, improved truncation, splitting of compounded words or
supplemental terms presumed to be synonymous to the search terms etc.
The project will investigate techniques from artificial intelligence
research for constructing a network of concepts covering sectors of
the documentation area of the system. Rather than structuring a
universe as a model of real objects and activities, one may structure
a universe of the words and phrases used to describe this part of
reality. The objective is to make this network for suggesting synonyms
available to the user.
As a last possibility, one will look at the usefulness of
implementing a model of automatic estimation of recall and precision
(Harvold 1980b:172-219). On the basis of such estimated values the
user may decide whether a new search should be conducted by a
reformulated request, or whether to quit the terminal session.
This project is funded by the Norwegian Research Council for
Science and Industry, with a planned termination in 1985.
[Page 179 ]
4.6 Aids in formulating search requests
4.6.1 Formulating the request
The basic problem in formulating the search request concerns the
transformation of the semantic content of the problem into the
syntactic criteria accepted by the retrieval function. This presumes
an ability to analyse the information need and to identify the ideas
("semantic building blocks") from which this need is constructed.
Secondly, it presumes that each idea may be described by terms
assumably used in the documents representing that idea.
A well formulated search request presumes satisfactory results of
both these processes. In practice, only a few users make a distinction
between them, analysing and specifying simultaneously the search
terms. Experience shows, however, that the formulation of the search
request may be found somewhat easier, and give better result, if a
more stepwise approach is adopted.
In the literature great attention has been paid to the
specification of search terms, search languages and tools used in the
formulation of the request. Many users experience the formulation of a
request as the most difficult task, and it is also the specification
of search terms which represents the major cause of performance
failure (Fjeldvig 1980:161-167).
For a request to be complete, it must contain all terms used in the
document collection to represent the ideas - including the grammatical
variations. This task is rather difficult, as there may be many words
and phrases representing the same idea, and a single word may
represent many ideas. The idea 'the moral of the mother' may be
represented for instance by "the moral of the mother", "her moral"
(implicitly that of the mother), "behaviour of the mother" or "dock
girl" (examples from the source material to NORIS (34) STANS). The
word "bar" may denote an organization of lawyers, a place serving
beverages or it may mean a piece of metal.
Some systems offer aids to the user in specifying synonyms or
resolve ambiguities (for instance thesauri), while others require this
to be done by the user
[Page 180 ]
for and is restricted to giving simple aids like right hand
trunction. It is important, however, the the retrieval function gives
the user the possibility of specifying relation between terms, for
instance synonyms. In many cases there will also exist a need to
specify the context of the terms, for instance "retrieval" in the
sense of 'text retrieval system' or "computation" in the sense of
'computational linguistics '.
4.6.2 Examples of different aids
(1) Truncation
Truncation is a simple and valuable aid offered by most text
retrieval systems. The user is given the possibility of specifying
several search terms by specifying only a certain string of characters
to be contained by the search term. Any word containing the defined
string is qualified as a search term.
The most usual form of truncation is right hand truncation.
For instance the request "car*" (where "*" is used as a symbol
denoting truncation) will include any word beginning with the three
letters "car" - for instance "cars", "carriage", "carload", "caravan",
"caricature", etc. A drawback is the inclusion of improper search
terms (like "caravan" and "caricature"), and its inability to cover
vowel changes (for instance strong verbs or irregular substantives).
In general, however, the improper terms are rather peripheral in
respect to the problem. If the retrieved documents are ranked, the
improper terms consequently will not effect the upper part of the
ranked list of documents.
The Norwegian Research Center for Computers and Law has
developed a method for automatic right hand truncation which performs
quite as good as manual truncation for Norwegian texts.
Left hand truncation is especially important in language
making frequent use of compounded words (for instance German and
Scandinavian languages) and
[Page 181 ]
prefixes (for instance German, Finnish and Hebrew). The request
"*boat" would include words like "steamboat", "riverboat",
"cargoboat", etc (though English is a language in which adequate
examples are difficult to conceive). Left hand truncation presumes
that the words are stored in a sequence different from right hand
truncation (sorted backwards). As most users have given priority to
right hand truncation, the left hand truncation may require
considerable resources. Therefore, rather few systems offer left hand
truncation.
(2) Mask functions
There are systems which permit masking of any part of a word, not
only truncations (for instance 3RIP and the Responsa project). The
user is free to choose any string of characters for specifying the
search words, leaving any characters masked. For instance the word
"*col*rful*" would retrieve any word containing the two strings "col"
and "rful" in the specified sequence - solving the problem of
different spellings in English and American.
(3) Automatic stemming
As an alternative to right hand truncation, some systems offer to
supply the request automatically with all grammatical variations of
the search terms. This is either solved by having the system consult a
dictionary where this type of information is stored, or using a
routine for automatic stemming. Such methods are, for instance,
developed for English, Norwegian, and Finnish.
(4) Thesaurus
A popular aid is synonym thesauri - or synonym dictionaries, which
they are also named. Especially in France and Italy considerable
resources have been
[Page 182 ]
used in developing quite complex thesauri with relations in all
directions. Attempts have been made to develop an automatic synonym
thesauri (for instance in the SMART and CONDOR projects), but we are
not aware of their implementation in operational systems.
A synonym thesaurus will be able to specify only context
independent synonyms. The actual efficiency of synonym thesauri in
relation to their development and maintenance cost is therefore a
matter of considerable dispute. Experiments have failed to prove large
increases in performance. Saracevic (1970b:677) found that the use of
a thesaurus did not increase performance in terms of recall and
precision, while Salton (Salton/Lesk 1968:330-334) found a small
improvement. It should, however, be made clear that both these
experiments were based on rather small data bases, and that there is
reason to believe that the importance of context independent synonyms
increases with the size of the data base.
A large part of the context independent synonyms will be
grammatical variations of the same word stem. Experiments show that
approximately 75 per cent of the context independent synonyms are
resolved by right hand truncation (Harvold 1974). In supplement of
right hand truncation many systems offer the users the possibility to
specify their own synonym thesauri (for instance the macro command in
NOVA*STATUS).
4.6.3 Choosing the level of performance
In spite of well developed aids to specify search terms, experience
shows that it may be difficult - or perhaps impossible - to achieve
both 100 per cent recall and precision, ie to retrieve all and only
relevant documents. Even if all search terms are specified, there will
be a possibility of loosing an idea which may be "hidden" in the text,
or of finding a document containing a specified idea not related to
the other ideas in the search argument.
In addition, it is difficult to find terms which are sufficiently
specific to secure that only relevant documents are retrieved, and
sufficiently general to see to all relevant documents being retrieved,
High
[Page 183 ]
recall performance generally implies low precision, and vice
versa - they are actually inversely proportional.
When specifying a search request, one may influence to a certain
degree the result in the direction of high recall or high precision by
- (1) extending or limiting the description of each idea;
- (2) choosing the level of generality in the description;
- (3) determining which or how many classes (descriptions of ideas)
shall co-occur in a relevant document.
In using nearness functions, it may also be decided how far down
the list of ranked documents one wants to hunt for possibly relevant
documents. The further down, the higher the recall and the lower the
precision.
[Page 184 ]
4.7 Examples of file structures
Finally, we shall mention in short how the data are stored and
structured in a text retrieval system - ie which files are created,
how are the files interrelated, and how does the system retrieve the
correct data.
A common misconception among users (and others) is that the text
retrieval system searches the text sequentially, hunting for search
terms. This is, of course, quite possible, but would demand large
resources and is therefore quite unusual. In the text retrieval
systems of today, the documents are processed by the system for
efficient retrieval. In this section we shall give examples of two
different ways of representing the data (file structures) - one based
on traditional text retrieval (like Boolean retrieval), and one based
on vector retrieval.
4.7.1 Inverted file structure
Most text retrieval systems are based on an inverted file structure
with access through specified search terms. The file system consists
of an index file (also called concordance, search file or dictionary
file) in addition to a document file (also called text file). An
example of an inverted file structer is given i fig 4/8.
The index file contains all words which may be used as
search terms. To each word is associated an address or reference to
each location in the documents where that word occurs. In NOVA*STATUS,
the address contains a reference to the document, the sentence within
the document and the word within the document. Often the address have
more levels (for instance a paragraph of the document, and a field
within the document).
In some systems has also been integrated a pointer structure
(chains) in the index file, making possible a specification of the
interrelations between the words as well (like synonyms). In the new
Norwegian text retrieval system SIFT (cfr SIFT 1980) it is
[Page 185 ]
possible to associate each word with an unlimited number of
pointers.
In order for the system to efficiently retrieve a certain word in
the index file, the words are sorted in a certain sequence. Usually,
they are sorted alphabetically. For further increasing retrieval
performance, there is often established an index to the index file,
specifying the position of the word within the file (cfr fig 4/8).
The drawback of the traditional file structure (as illustrated in
fig 4/8), is the difficulties in maintaining the files. The words are
stored in a certain sequence, and when new words are introduced into
the data base, the whole file must be reorganized. It is, of course,
possible to create garbage collection (extra areas linked to the main
file), but after a certain number of new words, the files must
nevertheless be reorganized in order to keep response time down to an
acceptable level.
In the text retrieval system SIFT one has solved this problem by
opting for a slightly different structure, the so called B-trees. In
this structure, the words are stored alphabetically. But rather than
in a sequential file, the words are organized in blocks. In
introducing a new word, this is placed on the first free slot in the
correct block. If this block is full, it is automatically split into
two new blocks. In this way a complete reorganization of the file is
avoided. Consequently B-trees make it possible to maintain a short
response time while at the same time the system may easily be updated.
Fig 4/7 gives a sketch of the structure of B-trees.
[Page 186 ]
Fig 4/7 - B-trees
[Page 187 ]
Fig 4/8 - Example of the file system in NOVA*STATUS
[Page 188 ]
The text file contains all documents as they were input.
This makes it possible for the user to list a whole document or only
part of the document on his terminal. Usually, the documents are
stored in chronological order, and when updating, new documents are
added at the end of the file. To make the access to the text file more
efficient, NOVA*STATUS have chosen for instance to have an index added
to the text file giving information on the position in the file of a
certain document. Fig 4/8 illustrates the file system of NOVA*STATUS
as implemented at the University of Oslo, giving an example of such a
structure of the text file as well.
4.7.2 Vector based systems
In a vector based system, the data are centred on the documents
rather than on search terms. Each document is represented as a vector,
and these vectors are employed in retrieval. A vector consists of as
many components as there are different terms in the data base. Each
word is represented by a fixed component in the vector, and the value
of the component equals that one given to the word in the document -
for instance by utilizing the frequency in which the word occurs in
the document. If a word does not occur, the weight is set equal to
zero. A document vector will therefore generally contain numerous
zeroes (since many of the different words in the data base do not
occur in any single document), and data compression is therefore
necessary when storing vectors (cfr Fjeldvig 1978:24-27).
Document vectors are constructed only once. Even if there are
introduced new documents with new words at a later date, this will
have no influence on the existing document vectors. For each search,
however, a search vector must be constructed with a structure
identical to that of the document vectors. This presumes that the
system keeps track of which components represent which words. In the
vector retrieval system VEXT of the Norwegian Research Center for
Computers and Law, this problem is solved by the creation of a file
containing all the different words in the data base and the associated
component numbers. The file is sorted alphabetically, and has an index
sequential structure.
[Page 189 ]
4.8 Appendix I: Some important text retrieval systems
As far as the systems mentioned below is used in a legal
information service, a description will be included in one or more of
the discussions on these systems following in chapter 6-7.
(1) 3RIP
This is a Swedish system developed at the end of the 1970ies,
running on DEC hardware under the operative system TOPS. The system is
based on Boolean retrieval, interval retrieval, distance operators and
masking. The search may be limited to certain areas of the document.
It includes several output formats and and makes it possible to edit
on DEC-10. The search language includes CCL (Common Command Language).
We are not aware of any legal application of the system, though
experiments have been made in Dublin (Republic of Ireland). The system
is marketed by Paralog AB and is documented in 3RIP manual
1979.
(3) DATA+PLUS
This is a Swiss system constructed in 1968-70. The system was
initially named CONTEXT, but was marketed under the name DATA+PLUS
(Bing/Harvold 1977:140). This is one of the few - perhaps the only -
operational system based on vector retrieval. The search strategy is
designed for natural language requests, and a document in the data
base may be used as a request. The system implies the possibility of a
synonym thesaurus. A description may be found for instance in
DATA+PLUS 100 (1976).
(4) DIALOG
A US system developed by Lockheed Information Systems (Alto,
California). The system is very simple and is based on Boolean
retrieval and right hand truncation. In May 1980 as much as 122
different data bases were accessible under the system (Salton
1983:30-34), of which some are of legal interest.
[Page 190 ]
(5) IMDOC
A Swedish system developed at the end of the 1960ies by Industri-
Matematik AB, marketed in different versions and sometimes under
different names (Factfinder). The system is simple in both design and
use. It offers Boolean retrieval, interval retrieval, synonyms (using
a thesaurus) and right hand truncation. Left hand truncation is
permitted for intra-document search. The system does not include the
possibility of searching for defined fields of the document.
IMDOC is widely used in Sweden, both by public agencies and private
industries. It is used in RAETTSDATA, which is a group of coordinated
legal information systems. The program is implemented on various
hardware. A brief description of the system is given for instance by
Bing/Harvold 1977:135-136, Svoboda 1978 and IR-meddelanden nr 21.
(6) FIND
This is an Italian retrieval system used in the national legal
information system ITALGIURE. The system is based on Boolean retrieval
and permits interval retrieval, right hand truncation, masking,
distance operators etc. It is also possible to search defined fields
of the document. The system permits the use of synonym thesauri, and
will expand the request automatically with synonyms and grammatical
variations of the search terms. A description is given by Svoboda
1968.
(7) GOLEM
This is a German system developed by Siemens AG, used for instance
in the legal information retrieval system JURIS. The system has
roughly the same possibilities as FIND, including the use of synonym
thesauri. An accessory program for automatic indexing known as PASSAT
has been developed, analyzing the documents and performing som
computational linguistics.
GOLEM is released in a new version approximately once a year. At
one time there were plans for replacing
[Page 191 ]
the system by a new and more advanced system known as CONDOR.
This project was terminated in 1981.
The system runs only on Siemens hardware. It is described by
Svoboda 1978 and IR-meddelanden nr 21.
(8) MEDLARS
A US system used in the medical information system MEDLINE at the
National Library of Medicine. The system is actually not a true text
retrieval system, but is well known and should therefore be mentioned.
It was developed in 1964 and was the first on-line information
retrieval system.
MEDLARS is especially designed for retrieval in bibliographical
data, and is based on documents designed in a controlled vocabulary.
It is based on Boolean retrieval, with a retrieval process limited to
certain fields. To make the system more friendly for end-users, a
general user interface has been developed (CITE) based on natural
language requests and relevance feedback.
The system has been used in many experiments in text retrieval. A
description is given by Salton 1983:30-34.
(9) MINTU
A Finnish system developed by Statens datamaskincentral (Government
computer center) in Finland and used in the Finnish legal information
service FINLEX. The retrieval is based on Boolean requests and
interval retrieval. Synonym retrieval is an accessory function. It is
not possible to search within defined fields of the documents. In the
basic version, document length is limited. It is implemented for
several minicomputers.
The system is documented in a manual from 1982.
(10) MISTRAL
This is a French system developed by Honeywell Bull.
[Page 192 ]
In a somewhat modified version it is operated by Telesystems in
Paris under the name QUESTEL, a system which (in 1980) included 20
data bases and some 1.8 million bibliographic entries.
The system is designed for bibliographical data bases, and the user
must specify which fields of the document to be included in the
search. The retrieval is governed by arithmetical and Boolean
operators. Help functions, synonym search, right hand truncation and
masking are available. The search language includes CCL (Common
Command Language). In France, an accessory system for automatic
indexing has also been developed. This system will flag spelling
errors and recognize grammatical versions of the same word.
(11) NOVA*STATUS
The Norwegian version of the British STATUS system developed by the
Government Institution for Organization and Management in cooperation
with Norwegian Research Center for Computers and Law. The system is
based on the British STATUS I, but it is estimated that at present it
retrains less than 30 per cent of the British coding.
The system permits both Boolean and conceptor based retrieval,
right hand truncation, interval retrieval and use of distance
operators. It is not possible to create a synonym thesaurus in the
system, but macroes may be established.
The system is available on several different installations in
government agencies, including the Ministry of Justice and LAWDATA.
(12) LEXIS
A US system developed by Mead Data Central (MDC) for retrieving
legal and other text data bases. LEXIS is considered the largest text
retrieval system in the world (measured in data bases). The system has
a large number of users, not only in the US, but also in the United
Kingdom, France and elsewhere.
[Page 193 ]
The retrieval system is quite simple, based on traditional use of
Boolean and distance operators. The system does also have simple
routines for recognition of grammatical variants, and will
consequently be able to recognize and process similar words with the
same basic form. The retrieval system is documented for instance in
Salton 1983:46.
(13) POLYDOC
A Norwegian system developed by the Norwegian Center for
Informatics Ltd, and especially designed for retrieving
bibliographical data - but may also be used for text retrieval. In
retrieving, the field within the document must be specified, but this
may have the form of a lengthy text. The system is based on Boolean
requests and permits right hand truncation.
POLYDOC is characterized by its output format, producing KWAC, KWOC
or KWIC lists and a number of other indexes. A micro computer oriented
version has also been developed, MICROPOLYDOC. A description may be
found in IR-meddelanden nr 21.
(14) RESPONSA
An Israeli system developed as part of the famous Responsa project
at the Institute for Information Retrieval and Computational
Linguistics at the University of Jerusalem. The system is especially
constructed for text retrieval and large text bases. Today there are
some 180 data bases and 30-35.000 documents.
The system is being continually developed. Today it includes in
addition to the functions found in conventional retrieval systems,
several advanced forms of automatic indexing. A linguistic program has
been developed, KEDMA, which inflexes verbs and substantives and
performs a morphological analysis of the words. The language is
Hebrew. In the project, great effort has been made to develop methods
for retrieval and storage of data which do not require large
resources.
The system is used for experiments in information retrieval. A
description is given by Choueka 1980.
[Page 194 ]
(15) SIFT
A Norwegian system under development in a cooperation between
several institutions, and under the direction and inspiration of the
Government Institution for Organization and Management.
The system opens the possibility for Boolean and conceptor based
retrieval. It enables users to search defined fields of documents and
if desirable in several data bases simultaneously. The inverted file
structure is based on B-trees, and it contains possibilities for
several different relations between words (for instance synonyms).
Interval retrieval, right hand truncation, masking and distance
operators are available. Operators make possible a relational
searching in table structures. Historical versions of a document are
available in browse mode. The system does not imply any limits in
document length, data base size or number of data bases. A description
is given in the SIFT specification of 1980.
(16) SPIRIT (Systeme syntactique et probabiliste d'indexation et
de recherche d'information textuelles)
A French system developed by Systex. The system is released, but
has still the imprint of a research project and is so far used only
for experiments (IR-meddelanden nr 4). It has been tested on legal
data.
The system is different from traditional text retrieval systems in
several ways. The requests are based on natural language. The
similarity between the request and the documents is measured on the
basis of a morphological, syntactic and statistic analysis, and is
supposed to express the semantic distance between the request and the
document. The system contains advanced methods for automatic indexing,
in which the documents are analysed grammatically, the root words
lemmatized, insignificant words deleted, compounded words split,
homographs identified and resolved, typing errors corrected, etc.
Synonym thesauri are included.
The system operates on mini computers, and is
[Page 195 ]
described for instance in Fluhr 1981.
(17) STAIRS (Storage and information retrieval system)
This is an IBM program product running on IBM hardware and existing
in several versions (for instance STAIRS/VS, STAIRS/TLS or
STAIRS/DL/1).
The system is based on Boolean retrieval and proximity functions.
CCL (Common Command Language) is included. Right hand truncation, left
hand truncation (through sequential retrieval) and distance operators
are available. Several data bases may be concatenated for simultaneous
retrieval (up to 16). For the French, English and German version the
extension TLS (Thesaurus and Linguistic Integrated System) is
available, which permits synonym retrieval and identification of word
with the same basic form.
A description of the system is given in IR-meddelanden nr 25 and
Svoboda 1978.
(18) STATUS II
A British system used today in several information services, for
instance EUROLEX. The system is a new version of STATUS I, and it also
has several common characteristics with NOVA*STATUS. It is based on
Boolean retrieval. The retrieval may be limited to certain fields of
the document, and to certain sub-sets of the data base. Right hand and
left hand (sequential retrieval) truncation are available, and macroes
may be defined. The document length, the size of the data base and the
number of data bases are unlimited.
The system is running on different hardware. A brief description is
found in IR-meddelanden nr 7.
[Page 196 ]
4.9 Appendix II: The concept of relevance
The measure of performance of text retrieval systems in terms of
recall and precision may seem simple at first glance, but this
impression is something of an illusion. The measures depend entirely
on the concept of relevance, and this is probably one of the most
difficult concepts in the whole field of information retrieval.
There are at least three issues of importance to our understanding
of relevance. The first issue concerns the type of relevance. Are we
really dealing with formal, content, or subjective relevance
(Koenigova 1971)? The second issue concerns the nature of relevance.
Is relevance absolute or is it relative to each user? The last issue
conserns the grading of relevance. Is relevance a matter of degree or
is it an either/or proposition?
4.9.1 Types of relevance: Formal, content, and subjective
relevance
Relevance is a relator in the sense that it says something about
one thing in relation to another thing. The two "things" might be the
syntactics of two texts, in which case we talk of formal relevance. Or
they may be a problem and the content of a text, in which case we talk
of content or subjective relevance, depending on the type of situation
in which the evaluation takes place.
Formal relevance measures the syntactic similarity between two
texts. The texts may be search request and document or two documents.
Thus the formal relevance of a document is a value assigned to the
document by a matching function. Formal relevance is based on the
syntactic structure of the document, not on the content of the
document, nor on its usefulness to the user. Formal relevance will
usually reflect the similarity between two texts as measured, for
example by a matching of words. But it may also reflect general
criteria like type and age of document, author, and so on. The nature
of formal relevance is absolute. Given two texts and a matching
function,
[Page 197 ]
the relevance value is unambiguously defined. Depending on the
matching function, the grading may be either/or (binary) or by
degrees.
Content relevance is defined as the adequacy of the content of a
document as a response to the request. Subjective relevance will
depend on a host of factors, inclusing the user's previous knowledge.
In the literature subjective relevance has also been characterized
"utility" (Cooper 1971) and "pertinence" (Foskett 1972).
The choice between content and subjective relevance must to a
certain extent reflect the type of decision-making situation in which
the user finds himself.
In an informal decision-making situation where the value of the
decision depends on future consequences, there are no rules that the
decision-maker is forced to consider. The only thing that counts is
the decision itself. How it is arrived at is irrelevant.
The quite opposite situation exists where the decision process is
formalized in such a way that the validity of the decision depends
mainly on the premises on which the decision is based. The validity of
a decision made by a court of law, for instance, will depend on
whether or not certain procedures demand that certain legal sources
shall be consulted. If the judge neglects to consult one of the
required sources, his decision is invalid, even if it turns out that
the same decision would have been arrived at had the source been taken
into account.
In legal decision-making, as in other formal situations, it is
therefore not appropriate to regard relevance as entirely subjective.
The relevance of a document cannot be defined only in terms of its
usefulness to the user, since such a definition implies that only
documents which in some way cause the decision-maker to change his
mind are relevant. Instead we must base relevancy on the content of
the document. Whether or not the document is relevant will depend on
the adequacy of the content of the document as a response to the
request. As part of the content will be things like date of
publication, author, and so on.
[Page 198 ]
4.9.2 The nature of relevance: Absolute and relative relevance
We have earlier remarked that formal relevance is absolute in the
sense that it does not depend on individual value judgements.
Subjective relevance, on the other hand, is relative in the sense that
it is particular to each individual user. In fact subjective relevance
is not only relative to each user, but to each user situation, since
the background knowledge of the user, which changes constantly, is a
main factor affecting the utility of additional information.
The nature of content relevance is more difficult to assess.
Content relevance measures the adequacy of a document as a response to
the request. Content relevance does not depend on whether or not the
user himself finds the information useful in the sense that it is new
to him.
In a strictly formal system there are definite rules for evaluating
relevance. There is little or no room for the personal opinion of the
individual user. In such a system content relevance tends to be
absolute.
While the legal system has certain characteristics of a formal
system, these are not sufficiently prominent to make the assessment of
content relevance absolute. This is not only evident in legal theory
with its emphasis on human judgement within the often broad limits of
the law, but is also born out of several empirical investigations
regarding relevance assessment. This is not to say that relevance
assessment in the legal system is completely relative. Clearly it is
not. As so often in the law, the solution must be found somewhere in
between the extremes, the different solutions in each case deciding to
which side the scale will tip.
4.9.3 The grading of relevance: Grading by degrees or binary
grading?
Even when we know the type and nature of the relevance assessment,
we are still left with the question of
[Page 199 ]
grading. In fact, the question of whether or not the relevance
assessment should be graded by degrees or given a binary grading does
not follow automatically from our previous choices of type and nature.
Yet the grading of relevance is never completely independent of type
and nature.
Consider for example formal relevance. We have established that the
assessment of formal relevance as performed by a matching function is
absolute. Since there is no uncertainty regarding formal relevance, it
might seem to be an appropriate candidate for binary grading. However,
such a conclusion would be premature. It must be remembered that
formal relevance as applied in a retrieval system is only an
approximation of content or subjective relevance, the approximation
will not be perfect and may even be quite inaccurate. As long as
formal relevance is an approximation, the system should grade
documents by degrees. This is appropriate even in the cases where the
user himself would grade documents according to a binary scale. In
these situations formal relevance is only a measure of similarity and
should not be presumed as a measure of identity.
The grading of documents by the user according to content or
subjective relevance is an entirely different matter. If it is assumed
that the user assesses documents according to subjective utility, it
is obvious that some documents are going to be more useful than
others. It is doubtful, however, whether or not the user will be able
to assign to every document a unique rank according to utility. It
seems safe, at the very least, to assume that the user will not make
any distinctions among the documents which are clearly irrelevant. And
in most cases it also seems safe to assume that the user will in fact
be unable to assign a unique rank to each relevant document.
Experience has shown that humans are generally incapable of mentally
making a complete comparision of more than a few items.
The user may be able to classify the relevant documents under a few
headings like:
- on point
- relevant
- related
[Page 200 ]
But again it is doubtful how much use he will have of such a
classification. It is also a doubtful empirical question whether users
of retrieval systems normally classify documents in this manner, or if
the practise is reserved for panelists taking part in relevance
experiments.
The behaviour of panelists has often been cited in support of the
proposition that relevance is a matter of degree. Gebhardt (1975) for
example, refers the the Joint American Bar Foundation and IBM Project
(Eldridge 1968) and points out that the panelists seem to disagree as
often as they agree on relevance assignments. The typical user
situation, however, does not consist of a panel, but of one single
user. For a given document in a given situation the user will normally
be able to decide whether or not the document is clearly irrelevant,
or whether it might be relevant. Most of the time he will probably not
assign a unique rank to the document, and an attempt to do so might
prove difficult.
Assuming, however, that legal documents are assessed according to
their content, then relevance values corresponding to their
hierarchical rank may be assigned. Thus the constitution may be given
a higher relevance value than a statute, a statute may be given a
higher value than a regulation, and so on. But such a scheme is
neither very realistic nor, probably, very useful. The rank of a legal
source is not in itself absolute. The respective rank of a supreme
court decision, a statute, and a regulation, for example, depend on
several factors, including how long ago the respective sources were
written, how directly they affect the issue at hand, the
reasonableness of the result which each document favours, and so on
(cfr the discussion of the weight of arguments above at sect 2.6.1).
In fact, in most areas of the law so much depends on human judgement
that it does not seem practicable to implement any kind of rigid
scheme for assigning relevance values to documents.
A complete ranking of documents on a utility basis, rather than on
the basis of content, is not normally performed by users. One of the
findings reported by Eldridge (1968) was that each panelist seemed to
have his own favourite relevance group in which he tended to place
documents. It is likely that the same tendency is found among ordinary
users.
[Page 201 ]
If a request is complex, the user may rank documents according to
the aspect of the request to which the document refers. In a criminal
case, for example, the user may find a document discussing the correct
legal reaction once guilt is established. It is probably appropriate,
however, to regard this kind of preference as a ranking of the various
aspects of the problem rather than as a ranking of retrieved
documents.
The user may change his relevance assessment as he gains new
insight into the nature of the problem. He may disregard documents he
previously overlooked. Cfr the results reported above in sect 4.5.4
(1) on the reassessment of the relevance of documents, causing major
changes. A re-evaluation of previous assessment results is probably
common and illustrates the relativity of relevance. However, this fact
by itself has little bearing on the appropriate relevance grading of
the documents.
What we are left with as a conclusion is that the user at any one
time disregards irrelevant documents. The remaining documents may be,
and sometimes are, classified in a few relevance categories. But they
will almost never be assigned unique rank values. Documents which at
first glance seem to be of doubious relevance are usually reassessed
and either disregarded or accepted as relevant. The user will not
normally leave them in an uncertain state, as this is of little value
to him.
The grading of content and subjective relevance must therefore,
generally speaking, be regarded as binary, that is as an either/or
proposition. The user can make use of a few relevance categories, but
will not, as a rule, make a full ranking of the documents.
4.9.4 The relevance concept as used in this book
In this book the relevance concept has been developed in sect 1-3,
cfr especially sect 2.6.1 for a discussion and a definition.
It should be clearly understood that the selected
[Page 202 ]
definition is designed to meet the special needs of the analysis
of the decision and communication process. The definition is based
mainly on content relevance (referring to legal meta-norms), but
supplemented by subjective relevance. Consequently the concept is
relative, though a core of objectivity is to be found to the extent
that the legal meta-norms are well defined and not controversial.
The concept is binary, based on the arguments presented above - but
some of the grading has been pushed into further stages of the
decision process, a grading of weights of the arguments derived from
the documents.
This is a concept developed for analytical purposes, the main point
being its interpretation with respect to the measure of the
performance of retrieval systems. We would like to maintain that a
high performance is achieved if such documents are retrieved which the
user thinks worthwhile to consider, even if finally put aside as of no
use. In order to arrive at such a concept, we have to be rather
"generous", including as relevant documents also such documents which
have doubious utility. The definition is not intended to be
descriptive (a description of how relevance assessment actually is
carried out by lawyers), but rather prescriptive (determining our own
use of the concept in the discussion of legal information retrieval
systems).
[Page 203 ]
5 RESEARCH REGARDING THE PERFORMANCE AND DESIGN OF TEXT
RETRIEVAL SYSTEMS
5.1 Introduction
The science of information retrieval lacks a comprehensive
theoretical foundation. Part of the explanation lies probably in the
fact that information retrieval, like other computer-dependent
sciences, is a relatively young science. Another explanation may be
found in the nature of information retrieval itself, and especially in
the problems related to the concept of relevance. The main questions
concerning the type, nature, and grading of relevance are central in
any attempt to measure the performance of a retrieval system, but so
far, there has been no accepted general way of treating them.
An explanation can also be traced back to the relatively slow
progress made in the field of artificial intelligence. In a
fundamental sense many of the problems of information retrieval depend
on the ability to understand text, which machines cannot do - yet. And
it is a sobering thought, that even if we had in our possession a
clever machine capable of text comprehension, we would still be left
with many of the problems associated with subjective relevance. As for
the time being, we are not only left to deal with the problems of
relevance, but also with the problems connected with retrieving
documents according to purely syntactic criteria.
There are many serious practical limitations connected with
empirical investigations of retrieval performance. Most of these
limitations are again caused by problems connected with relevance.
In order to measure retrieval performance, the relevance of both
the retrieved and the non-retrieved documents must be evaluated.
Essentially this means that every document in the test data base must
be read and evaluated by a juror. Ideally, only one juror should be
used, since experience shows that different jurors arrive at quite
divergent results,
[Page 204 ]
depending on their familiarity with the problems and their
general background. Differences in the relevance assessment can
probably be reduced if the jurors are instructed to look for content
relevance and not subjective relevance, and if they are given
guidelines for assessing content relevance. However, as long as there
is some element of judgement involved, the opinion of jurors will, and
should, differ. Even one single juror will normally change his
assessment over time. We must remember that the assessment process is
also a learning process. By reading the documents the juror gains new
insight, which may cause him to reconsider some of his earlier
evaluations.
The uncertainty associated with relevance assessment need not,
however, be a serious obstacle to the general credibility of the
results of retrieval system evaluation. Salton/Lesk (1968) report that
in an experiment designed especially to test the questions, no
significant relationship was found between differing relevance
assessments and the evaluation results. There is a sound reason for
such a result. In view of the definitions of the recall and precision
ratios, it is likely that uncertainty regarding the relevance of
marginal documents will, on the average, affect the numerator and
denominator of each ratio proportionately and thus leave the values of
the ratios unchanged. If, however, a shift in the relevance assessment
is caused by the juror's misunderstanding of the question, the mistake
would very likely show up as a corresponding shift in the performance
figures.
Another and more serious problem connected with relevance is that,
in order to get accurate values for recall, it is necessary to assess
manually the relevance of all documents in the data base. This imposes
severe practical limitations on the size of the data base. An
unfortunate situation, since we have no reason to believe that the
effect which the different system factors have on performance can be
extrapolated linearily. Two different solutions to the problem suggest
themselves. One solution is to estimate recall on the basis of the
results from two independent searches. The other solution involves
constructing general models of the retrieval system in which the size
of the data base is one of the variables. A model of this type has
several other advantages as well. It can be used to gain insight into
how the
[Page 205 ]
factors which determine performance interact, and the relative
importance of each factor. If the model is flexible in the sense that
the parameters of the model can be set to reflect varying retrieval
situations, it will also be possible to predict performance in extreme
situations - situations which may appear in the future, but which is
out of the question to test empirically for the time being, Harvold
1976.
We now turn our attention to a survey of selected experimental
projects in which various aspects of retrieval performance and design
of text retrieval systems have been tested. We certainly do not
pretend that our survey is complete. Our selection reflects both our
familiarity with these projects as well as our special interest in
legal retrieval systems.
It will be noted that a number of relevant studies and
research will not be discussed in this book. The major studies of user
interaction with information systems ("user research") will not be
included in the survey. We shall not comment on tools or methods for
designing information retrieval system, for instance for system
analysis, user participation, etc.
[Page 206 ]
5.2 General research
5.2.1 The Aslib-Cranfield Projects: 1960-1966
The aim of the first Aslib-Cranfield project was to investigate the
operational performance of four different indexing systems. The
project is described by Cleverdon (1960, 1962) and by
Aitchison/Cleverdon 1963.
The most important result of the project was probably not its
findings - the general conclusion being that the four systems, the
Universal Decimal Classification, a facet classification, an
alphabetical subject catalogue, and the Uniterm system of coordinate
indexing all operated at about the same level of effectiveness. In the
course of the project, however, valuable experience was gained in the
testing of retrieval systems, and a methodology was developed upon
which later investigations have been extensively built. For the first
time performance was measured by the five criteria coverage, recall,
precision, response time, presentation and user effort.
These criteria have later become classical and widely used in a
variety of retrieval experiments. Coverage, response time,
presentation and user effort measure the quality of the operational
features of the system, while recall and precision measure the quality
of the individual retrieval results.
On the basis of his tests, Cleverdon suggested that a retrieval
system is made up of a basic vocabulary and a number of retrieval
devices. The retrieval devices are made up of recall and precision
devices. Examples of recall devices are:
- grouping of synonyms
- confounding of word forms
- formation of classes of related terms
Examples of precision devices are:
- coordination
- links, eg wind tunnel, welding of aluminium
- roles, eg wood (fuel), wood (forestry)
This conceptual apparatus was used in the second
[Page 207 ]
Cranfield project to investigate retrieval devices in isolation and
in all practical combinations in order to measure the effect of each
device on performance. The project is documented by
Cleverdon/Mills/Keen 1966 and Cleverdon 1967.
During the tests, factors affecting both indexing and search
strategy were varied. The test conditions were rigidly controlled, and
only one factor was varied at a time. Three main types of index
languages were investigated. The first type was based on single terms
selected from the authentic text of the source (eg axial, flow,
compressor). The second type was based on concepts selected from the
authentic text of the source (eg axial flow compressor). The third
type was based on various groupings of a set of controlled terms.
Search strategies used included:
- grouping of synonyms
- confounding of word forms
- use of hierarchical classes
The test data base consisted of 1.400 research papers mainly in the
field of aerodynamics. The number of questions used was 221. The
retrieved documents were ranked according to the term co-occurrence
level, and a normalized recall figure was calculated for each test.
The results, as it turned out, were unexpected at the time,
although they seem more reasonable today. The best indexing language
turned out to be the single term authentic language, which consisted
of words selected from the text of the papers. Furthermore, the best
results were obtained when the endings of these terms were confounded
(stemming). Slightly inferior results were obtained when synonyms were
grouped or the terms used in their natural language form. Any further
reduction in specificity, for example by grouping quasi synonyms or
using hierarchies, resulted in reduced performance. The use of all
other index languages gave inferior results. Languages based on
controlled terms performed, on the average, better than concept
languages, but were still inferior to the single term language.
The results seem to indicate that the best results are obtained
when preprocessing, in the way of standardizing the documents, is kept
to a minimum.
[Page 208 ]
The only preprocessing of the documents which will improve
performance is the grouping of context-independent synonyms - the so-
called true synonyms. Grouping of context-dependent synonyms (quasi
synonyms) or any other form of standardization, like the use of
controlled vocabularies, hierarachies, or concepts, will lead to
inferior results, presumably because standardization at this stage
implies that information is deleted, not added, to documents. It is
the context-dependent synonyms that are the main cause of trouble. By
definition, however, these cannot be grouped once and for all, but
can, if information is not to be lost, be grouped only by the search
request itself.
5.2.2 The SMART project: 1964-1983
The results of the Cranfield tests increased the belief in
automatic indexing as an alternative to intellectual indexing. Though
the tests had been based on intellectual indexing, the results had
proved that the best indexing language was also the most simple one -
selecting the indexing terms from the authentic language of the
sources. Application of advanced linguistic tools did not yield the
expected results - they actually proved quite disappointing, with the
exception of thesauri for context-independent synonyms. The results
were somewhat unexpected and very challenging, strengthening the
alternative of automatic indexing.
Professor Gerard Salton launched a major research program in
automatic indexing in the early 1960ies. At that time professor Salton
was at Harvard University, but in 1965 both he and the research
program moved to Cornell University. The experiments were carried out
using the experimental retrieval system SMART under the direction of
Salton. It is perhaps the best known and most comprehensive research
program in text retrieval ever carried out. Its rationale is the
shortcomings of the conventional retrieval methodology based on
Boolean search requests and inverted file methods.
SMART is known for, among other things, its unique
[Page 209 ]
file structure where the documents are represented by vectors
denoting which terms the documents contain and the weight of the term
(its "importance") within the document. Similarity between documents
and the search request is calculated by a similarity quotient, and the
calculated values are used to rank the retrieved documents (cfr above
on vector based retrieval at sect 4.5.2 (3)).
The experiments with SMART are documented in a number of
publications. Salton/McGill 1983 contains a recent survey of these.
In the first version of SMART, different methods for automatic
indexing were tried out. The system was aided by sophisticated
linguistic tools like synonym thesauri, hierarchical term structures,
statistical and syntactic techniques for construction of phrases and a
semantic language analysis. Different algorithms for weighing the
terms were also included.
The methods were tested in controlled experiments in text
retrieval, and the results presented as recall-precision graphs.
Altogether four different data bases on three different subjects were
used: medicine, computer science and documentation.
The results verified the conclusions of the Cranfield tests. The
use of sophisticated tools did not contribute to increased retrieval
performance with the exception of the synonym thesaurus. The results
may be briefly summarized as follows:
- Use of word stems gave better results than the use of the
individual words. A term was therefore equalled to the group of words
with common word stem.
- Use of synonym thesaurus gave better results than the use of word
stems.
- Terms associated with weights gave better results than those not
associated with weights. For instance the weight was equalled to the
relation between the occurrence of terms in the document and the
number of documents in which the term occurred (inverse document
frequency).
- Use of hierarchical term structures did not improve results. The
idea behind such structures is to expand the search request by more
general or more specific terms, working up or down the
[Page 210 ]
hierarchy.
- Use of syntactic or semantic analysis did not improve the result.
These analyses were an attempt to improve more efficient
identification and weighting of phrases.
The experiments also proved that abstracts as a document form gave
better performance than just the titles. Full summaries (2.000 words)
performed better than abstracts (150 words), and abstracts performed
considerably better than titles only.
The results were very interesting, and the experiments using SMART
continued. The linguistic tools were now discarded, and attention paid
to simple and automatic means which might improve retrieval
performance.
This does not imply that the idea of the more
sophisticated tools were discarded as uninteresting for text
retrieval, merely that it was difficult to find adequate methods for
any kind of search request without demanding too large resources. Even
with the progress made since then, it is difficult to find
satisfactory methods with respect to the use of resources.
In further experiments one decided to discard intellectually
developed synonym thesauri as well. It may actually be questioned
whether the improved performance using such thesauri found in the
early experiments opened up a promising avenue for further
development. With respect to large, heterogeneous data bases with a
high growth rate, the development and maintenance of such thesauri
would prove very expensive.
The linguistic tools were replaced by procedures for automatic
generation of thesaurus classes and the construction of phrases. The
phrases were constructed around terms with a high frequency (at least
one of the words of the phrase had to have a high frequency) within
the same sentence. The quality of the phrase was determined by the
distance between the words and the frequency of the phrase itself. The
thesaurus classes were composed of words with a rather low frequency
occurring in the same document. In both cases the objective was to
compose "search units" with an average frequency. Experiments with
[Page 211 ]
weighting of terms had demonstrated that the best search terms
were those of average document frequency (number of documents in which
the term occur) or positive discrimination value. Terms with a high
document frequency were often too general as search terms, and would
consequently reduce precision. On the other hand, terms occurring with
low document frequency were often too specific as search terms, and
would have little effect on retrieval performance.
Experiments proved that both the automatic generation of phrases
and the thesaurus classes had a positive effect on retrieval
performance. As a conclusion of the work on automatic indexing, the
following procedure was indicated:
- (1) Use abstracts of documents when possible
- (2) Eliminate stop words
- (3) Generate word stems for the remaining words and consider words
with a common stem as one term.
- (4) Calculate the inverse document frequency of each term
(discrimination value)
- (5) Construct phrases of terms of high frequency (or negative
discrimination value)
- (6) Collect groups of terms of low frequency (discrimination value
approaching zero) in thesaurus classes.
- (7) Calculate weights for the remaining terms, phrases and
thesaurus classes for use in document vectors. A suggested weight is
the product of the inverse document frequency and the frequency of the
term in the document.
- (8) Generate for each document vectors denoting which terms,
phrases and theasurus classes are contained in that document.
Further improvement is possible utilizing the dialog between user
and system. SMART has been employed to reformulate automatically the
search request on the basis of knowledge of the relevance assessment
of the user. This "relevance feedback technique" is based on the user
making a relevance assessment of the first documents - for instance
the first 10 documents - on the ranked result list. The result is then
analysed by the system, which attempts to reformulate the search
request in order to make it more "similar" to the relevant documents.
For instance the request is
[Page 212 ]
supplemented by terms occurring only in the relevant documents,
or the request is modified by deletion of terms occurring only in the
irrelevant documents. Also the weight of the search terms may be
amended, as well as the weights of the terms in the document vectors.
Experiments with relevance feedback have shown positive results. On
an average precision was improved by 50 per cent for high recall
retrieval, and 20 per cent for low recall retrieval.
To assess the methods for automatic indexing with respect to the
intellectual indexing, it was compared to the intellectual methods
employed by MEDLARS at the National Library of Medicine. In this
information system both documents and search requests are manually
prepared for retrieval.
The experiment compared retrieval performance in MED-LARS and SMART
with the same data as its basis. In SMART the documents were
automatically indexed and the original search requests were used
(Salton/McGill 1983:103-105). Experiments proved that a SMART indexing
using an intellectually developed thesaurus gave retrieval performance
on the same level as intellectual indexing. Supplementing the SMART
indexing with methods like relevance feedback, retrieval performance
increased by some 30 per cent (Salton 1981:322).
SMART has also been used in several other experiments, for instance
in automatic document classification (clustering). The objective of
this technique is to have grouped documents treating the same topic.
In this way, retrieval may be limited to one or several of these
topics.
Several methods for clustering have been tested. The procedure is
similar to that of vector retrieval, but in this case every document
is compared to all the other documents. Documents with a sufficient
similarity are grouped in the same cluster. A document may be a member
of several clusters, and it is possible to establish a hierarchy of
clusters by splitting one cluster into smaller clusters, etc. For each
cluster a representative is constructed, called the "centroid
document". In retrieval, the search request is first compared to these
representatives, and afterwards to the documents within the clusters
selected on the
[Page 213 ]
basis of the similarity between the request and the centroid
document.
Clustering has yet to be tested on really large data bases, as the
construction of clusters demands quite large resources. However,
experiments have been conducted on smaller data bases, but the
indications do not promise improved retrieval performance. Even with
respect to extremely small data bases it has proved difficult to
achieve high recall values. But the method may lead to shorter
response time and lower costs. Clusters are also easily updated, as
new documents are treated as search requests.
5.2.3 The MEDLARS evaluation: 1966-1967
The MEDLARS evaluation represented something new in retrieval
experimentation. For the first time the analysis of retrieval
performance was broadened to include also an analysis of the causes of
retrieval failure. This is a kind of hindsight analysis involving
manual examination of
- the sources in authentic text
- the document representing the source by a set of indexing terms
etc
- the question
- the search request
- the relevance assessment
The analysis is time-consuming, but valuable, as it tells us not
only to what extent a search failed, but also why it failed. The
MEDLARS evaluation is described by Lancaster 1968b and 1969.
The MEDLARS data base consisted of more than 800.000 citations from
biomedical journal articles prior to January 1964 and all subsequent
issues of the monthly Index Medicus. The data base was too big
for the answer set to be found by manual means. Recall was estimated
by obtaining two independent samples of relevant documents. One sample
was obtained by MED-LARS, another sample was obtained by other means,
for example through a local librarian, through the professional
knowledge of the searcher or his colleagues, or in some other way
independent of MEDLARS.
[Page 214 ]
A total of 302 searches were completely analzyed, and recall and
precision ratios were obtained for 299 (3 questions had no answer
set). Overall recall and precision ratios were 50.4 and 57.7 per cent,
and there were 797 cases of recall failure (irrelevant documents that
were retrieved).
The failures were analysed in terms of
- exhaustivity
- specificity
- entry vocabulary
The terms apply both to the indexing of a source and the
formulation of a search request. Exhaustivity means the extent to
which all the concepts in the source (question) is represented.
Specificity means the generic level at which a concept is represented.
The entry vocabulary refers to the text form in which the document is
formulated. The entry vocabulary may be the title, an abstract, or the
authentic text of the source. In the case of a text retrieval system,
the entry vocabulary will be identical to the indexing terms.
A detailed account of all the results will not be given here. It is
interesting to note, however, that system dependent factors of the
index, including the inadequacy of the indexing language, and of the
user-system interaction, accounted for 74 per cent of total recall
failures and 65.6 per cent of total precision failures. Searching
factors accounted for 35 per cent of total recall failures and 32 per
cent of total precision failures. The totals add up to more than 100
per cent because the same failure may be caused by more than one
factor. For a more detailed account of the failures, see tables in
figure 5/1 and 5/2.
The evaluation results are especially interesting in the light of
the retrieval systems that are available today. Had MEDLARS been an
on-line text retrieval system, most of the failures caused by the
system dependent factors would have been avoided. Performance would
have been significantly improved. In fact, if we use the average
performance figures given for the evaluation, recall would increase
from about 60 to about 85 per cent and precision would increase
[Page 215 ]
from about 50 to about 75 per cent - a result which would have
been surprisingly good.
Fig 5/1 - Reasons for recall failures in the MED-LARS evaluation
(Lancaster 1969)
| Indexing language |
| - lack of appropriate specific
terms | 10.2 % |
| Searching |
| - all reasonable approaches not
covered | 21.5 % |
| - request too exhaustive
| 8.4 % |
| - request too specific
| 2.5 % |
| - other |
2.6 % |
| Indexing |
| - insufficiently specific
| 5.8 % |
| - insufficiently exhaustive
(topics) | 20.3 % |
| - important concept omitted
| 9.8 % |
| - other |
1.5% |
| Computer processing |
1.4 % |
| Inadequate user-system interaction
| 25.0 %
|
Fig 5/2 - Reasons for precision failures in the MEDLARS
evaluation (Lancaster 1969)
| Indexing language |
| - lack of appropriate specific
term | 17.6 % |
| - false coordinations
| 11.3 % |
| - incorrect term relationship
| 6.8 % |
| - defective hierarchical
structure | 0.3 % |
| Searching |
| - not specific |
15.2 % |
| - not exhaustive
| 11.7 % |
| - inappropriate terms
| 4.3 % |
| - inappropriate logic
| 1.1 % |
| Indexing |
| - exhaustive |
11.5 % |
| - other |
1.4% |
| Inadequate user-system interaction
| 16.6 % |
| Computer processing |
0.1 % |
| Value judgements |
2.3 % |
| "Inevitable" retrieval |
0.1 %
|
[Page 216 ]
5.2.4 The "Comparative Systems Laboratory Experiments" Project:
1963- 1968
At Case Western Reserve University a rather large project was
undertaken in the mid 1960ies to investigate the relationship between
the variable components of retrieval systems and performance. The
project is documented by Saracevic et al (1968) and by Saracevic
(1970).
The components of a retrieval system were described in terms of the
purpose and function of the system. The purpose of a retrieval system
was subdivided into
- discipline
- class of users
- size of file
while the function of the system was subdivided into
- acquisition (content of file)
- source of input (input vocabulary)
- indexing language
- coding
- organization of file
- question analysis
- searching strategy
- format of output
The data base used in the experiment consisted of 600 documents
selected from the 1960 volume of Tropical Diseases Bulletin
(indexed in five languages). On the basis of 124 questions, 4.448
search requests were submitted for searching. Answer sets to the
questions were established by asking the users to evaluate the
retrieved documents. The non-retrieved documents were evaluated by a
separate expert, who tried to interpolate the relevance judgments of
the users. It turned out that of the 124 questions, only 63 had
relevant answers.
"Sensitivity" and "specificity" were used to measure performance.
Sensitivity (Se) was defined identical to recall, while specificity
(Sp) was defined as the ratio of the number of non-relevant
unretrieved documents to the total number of non-relevant documents
in the data base. Effectiveness (E) was defined as: E = Se + Sp - 1.
[Page 217 ]
A main purpose of the experiment was to investigate the relative
effectiveness of various indexing languages. However, the
effectiveness of the indexing languages compared to authentic text was
not investigated. Of greater interest to us is therefore the analysis
of different search strategies. The tests included the use of two type
of search requests:
- a narrow request that represented the stated question as
completely and accurately as possible
- a broad request that represented the question by only one subject
aspect.
The search request could be expanded by use of a thesaurus or by
use of any other available source. Use of the thesaurus did not prove
as effective as manual elaboration of the request. One of the most
important findings, however, showed that it was nearly impossible by
any means to expand the narrow requests to the extent where all
relevant documents could be found. It was not until all but one
category were dropped (broad search) that most relevant documents were
found, but then at the expense of a considerable drop in precision.
These results correspond well to the expected behaviour of text
retrieval systems (Harvold 1976).
Rather more unexpected was the observation that, when a narrow
request was expanded, an almost linear relationship was found to exist
between total output and the number of relevant and non relevant
answers.
The experiment included a test on relevance judgment based on
different formats of output. The formats used were title, abstracts
and the authentic text. The results are cited in fig 5/3.
|