PART II: TEXT RETRIEVAL
4 ANATOMY OF TEXT RETRIEVAL
4.1.1 Defining an information retrieval system
The concept of "information retrieval" is broad and often used with an imprecise meaning. It denotes systems designed to provide a user or a group of users with information. Salton/McGill (1983:xi) give this definition:
"An information retrieval system is an information system, that is, a system used to store items of information that need to be processed, searched, retrieved, and disseminated to various user populations."
In sect 3.4 above we have defined an information system as part of the legal communication process - a definition which corresponds to the more general definition cited above. In our context the concept of "information" has also been defined, and we shall remember from our earlier discussion that it was defined roughly as the meaning or content of data.
In the literature, there are a number of different definitions of "information retrieval". The concept is generally used in a more precise meaning than the one above, making retrieval systems one type of information system. But a further limitation is generally made, often implicitly - to computerized retrieval systems or bibliographical retrieval systems. Perhaps one might say that "information retrieval system" has become accepted as denoting the type of systems discussed in the works of Salton, Cleverdon, Lancaster, Sparck Jones, and others (cfr Rijsbergen 1979:1). Lancaster gives the following definition of an "information retrieval system" (Lancaster 1978:11-12):
"As it is most commonly used, the term information retrieval is really synonymous with literature searching. Information retrieval is the process of searching some collection of documents, using the term document in its widest sense, in order to identify those documents which deal with a particular subject. Any system that is designed to facilitate this literature searching may legitimately be called an information retrieval system. "
Common to the authors mentioned is that they work within the field of literature retrieval, most of them having their background in library environments. The phrase "information retrieval" then becomes synonymous with "document retrieval", "literature searching", "reference retrieval", etc.
The term "reference retrieval" is reserved by some authors for a special user situation, but does really originate within systems for literature searching where it is possible to retrieve references only to books.
In limiting the concept of information retrieval to document retrieval, one excludes systems which are not oriented towards the retrieval of texts, like "data base management systems" or "question answering systems". Such systems are often called "data retrieval systems" (Rijsbergen 1979:1-3) or "fact retrieval systems" (Salton), where "data" and "fact" both denote an actual data element referring to some fact in reality. Some authors distinguish between "fact retrieval" and "question answering systems". Kochen uses the phrase "fact retrieval systems" for systems where the answer (the result of the retrieval) is copied directly from the "data base", and "question answering systems" for more intelligent systems which themselves draw conclusions on the basis of stored data. The distinction is illustrated by the following example (Kochen 1974:66):
"A map which gives the average driving time between major U.S. cities can be used for fact-retrieval if it is asked for the time from Los Angeles to San Fransisco; it is used as a question-answering system if it is asked for the time from Los Angeles to Portland, Oregon, even if the map shows only the time from Los Angeles
[Page 144 ]
to San Fransisco, and from San Fransisco to Portland, because the time from Los Angeles to Portland has to be deduced."
Many of the definitions diverge, and may at first seem rather confusing. Lancaster (1968:1) defines an information retrieval system as something not intended to increase the level of knowledge, as this increase is achieved only after the user has read and understood the documents to which he has been referred. As an information retrieval system gives references only to literature, one will not - according to Lancaster - retrieve information. Lancaster is himself critical of the concept of information retrieval:
"It should be clear from the discussion above that information retrieval is not a particularly satisfactory term to describe the type of activity to which it is usually applied. An information retrieval system does not retrieve information. Indeed, information is something quite intangible; it is not possible to see, hear, or feel it. We are 'informed" on a subject if our state of knowledge on it is somewhat changed ... Information transfer can only take place if the user reads the document and understands it. Information, then, is something that changes a person's state of knowledge on a subject."
It should be noted that the cited definitions of information retrieval systems deviate from those developed in this book. This may be due to the different user situation and the typically different design of a legal information retrieval system compared to the literature systems implied in the general discussion.
4.1.2 Sketch of an information retrieval system
In fig 4/1 are sketched the elements of an information retrieval system. The elements are grouped in two parts: the one on the top describes the updating, the one below the retrieval.
[Page 145 ]
Fig 4/1 - Sketch of an information retrieval system
The updating is the subsystem preparing data for retrieval. The objective is to have the data represented in the system in such a way that it may be efficiently exploited in the retrieval process. The updating consists of two subprocesses:
The first process is to transform information on the object to be retrieved (and the object may be a book or a text) into data. This will generally imply an analysis of the information and its representation as physical symbols with a structure corresponding to the one given for the retrieval system. In a document retrieval system one will frequently, for practical or economic reasons, select a representation of the source different from the authentic text - for instance an abstract or a set of indexing terms. The document design, for instance the intellectual indexing, is such a transformation of source to document.
Preparing data for retrieval often takes place manually, but in some systems there are automatic routines for some of this activity. Such computerization implies that it is possible to specify exactly how the data is to be processed, and this may be quite difficult. Much research has been aimed at developing efficient and advanced methods for automatic indexing, but few of these methods have been implemented (cfr the work of Salton). In recent years, work within the area of artificial intelligence has opened new possibilities.
The storage includes a systematization of data in such a way that it may be retrieved quickly and efficiently. In a manual information retrieval system one would, for instance, sort indexing terms alphabetically, while a computerized system has a number of more sophisticated possibilities. In a document retrieval system, one will generally store each word in an inverted file, and associate each word with references to the documents in which the word occurs. The retrieval process can be made more efficient by creating an index to the inverted file.
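The inverted file described above may be sketched in a few lines of Python (the function name and the toy collection are illustrative, not taken from any actual system):

```python
# Build an inverted file: each word maps to the list of documents
# in which it occurs.
def build_inverted_file(documents):
    index = {}
    for doc_id, text in documents.items():
        for word in set(text.lower().split()):
            index.setdefault(word, []).append(doc_id)
    return index

docs = {1: "germ reported in the news", 2: "news on the budget"}
index = build_inverted_file(docs)
print(sorted(index["news"]))  # both documents contain "news"
```

Given a word, the system then retrieves the associated document references directly, without scanning the texts themselves.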
The retrieval process is the part of the system most characteristic of an information retrieval system. Its objective is to retrieve the documents required by the user - and these only. This subsystem consists of three major processes:
The first step in a retrieval process is the construction of search requests. The user transforms his problem into a form (the request) which is compatible with the data base design and the requirements of the retrieval system. It is important for the retrieval performance that the request corresponds to the way in which the information is represented in the retrieval system. If there is a lack of correspondence, it will be difficult - or impossible - to achieve a satisfactory result.
The retrieval itself consists of matching the search request against the data base, ie retrieving a certain telephone number from the directory, or all documents containing a certain phrase.
The way in which the result is presented to the user will primarily depend on the type of information retrieval system employed. In a manual system, the result is in a fixed format, while a computerized system may offer a choice of different formats and responses. Several systems also have the possibility of ranking documents according to criteria for possible relevance.
4.1.3 Different user situations
In a discussion of information retrieval, it is desirable to differentiate between various user or retrieval situations. Some users desire a precise answer to a request, for instance, "Which sum is allocated to social security in the budget for 1984?", while others desire everything written on a subject, for instance, "Find all documents discussing the social security budget for 1984". It is important to be specific about which type of requests the system is intended to handle before designing the retrieval system.
In this book, we shall distinguish between the different user situations by looking at the information needs. We shall use the terms "fact retrieval" and "interest retrieval", and let the meaning of these correspond roughly to what is generally meant by "data retrieval" and "information retrieval". It is, however, important to emphasize that the objectives and background of the definitions vary.
Some authors use the term "reference retrieval" instead of "interest retrieval" - for instance Bing/Harvold 1977. We have discarded this term because many document retrieval systems allow a different and more adequate representation of the source than a reference, for instance the authentic text or an abstract. Also, the term - like "document retrieval" - does not characterize the essential elements of interest retrieval, as fact retrieval to a certain extent may also be carried out by a document retrieval system. It should be noted that the term "interest retrieval" is associated with the theory of communication processes presented in chapter 2, where the situation of the user is characterized by his "area of interest", cfr sect 3.5.2.
Typical for interest retrieval is a user with a problem within his area of interest who wants to find documents illuminating this problem - for instance "Retrieve all documents discussing the importance of the morale of the mother in cases on child custody". In interest retrieval, one does not look for an exact answer to the problem, but for documents discussing the problem. In such a situation the result will depend primarily on the basis of retrieval and only secondarily upon the retrieval system. The basis for retrieval is the documents contained in the system and the search request. The better the sources are represented in the system, the better the chances for a satisfactory result. But it is impossible to design a system which will guarantee that all and only relevant documents will be retrieved, since the assessment of relevance is partly subjective for each user. The retrieval result may therefore be considered only a proposal of which documents should be considered relevant.
Fact retrieval is characterized by a user looking for a specific response (facts) to his problem, for instance "Who has written the book Lord Jim?" Salton's illustration of the difference between a "reference retrieval system" and a "question answering system" also adequately illustrates the difference between interest and fact retrieval (Salton 1968:392):
"... if a potential customer were to ask, 'What is the boiling point of water?', an effective question answering system would reply 'one hundred degrees centigrade', whereas a reference retrieval system might provide a citation to a textbook in elementary physics which in turn would contain information about the physical properties of water."
To satisfy fact retrieval, information must be represented and stored in the system in such a way that the system may make the necessary identification. This implies for instance that the requests to be satisfied must be foreseen. Correspondingly, the search request must contain a detailed and unambiguous description of the problem, giving the necessary instructions to the retrieval system. The retrieval system will either respond to the request - which presumes a one hundred per cent match with the request - or it will not respond at all.
4.2 Characteristics of text retrieval
4.2.1 Document retrieval
The concept "text retrieval" is used today to describe a certain computerized method for retrieving documents. The process itself is quite general and often used in everyday situations. One does, for instance, suspect that one has caught a certain germ which has been reported in the news lately. The most usual way to find the news items would probably be to browse through newspapers until hitting the relevant articles. Reading with sufficient care, one would obtain a perfect result in the sense that all relevant articles are being retrieved.
This manual retrieval does not offer problems as long as the number of documents (in our example, the stack of newspapers) is small. Confronted with all the newspapers of a country, however, the situation would be different. It would indeed be doubtful whether one would consider the task at all. In such cases computerized retrieval would be advantageous.
The computer has a capacity which makes it possible to search several thousand pages of text in a few seconds. It does not understand what it reads, and has to be instructed in great detail what to look for. Requested to retrieve all items containing the word "germ", it will not find items in which the word "bacteria" is used. It will not consider the two words synonymous unless this is specified.
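This literal matching may be illustrated with a small sketch (the collection and its wording are invented for the example):

```python
# A literal search matches character strings only: "germ" does not
# retrieve an item using "bacteria" unless the synonym is specified.
items = {
    1: "a new germ reported this week",
    2: "outbreak of bacteria in the water supply",
}

def search(word, collection):
    return [i for i, text in collection.items() if word in text.split()]

print(search("germ", items))                              # item 1 only
print(search("germ", items) + search("bacteria", items))  # both items, once the synonym is added
```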
4.2.2 Full text retrieval
The term "full text retrieval" is not used in this book as it is ambiguous. It may either denote that the source is represented in authentic form, or that the principle of text retrieval is employed, regardless of the document design. In this book, text retrieval is the name only of a certain retrieval method and implies nothing of the document design.
Text retrieval does not presume that the source is edited before being converted to machine readable form. It is usual to represent the sources in a general text format. In such cases, one often characterizes the method as full text retrieval. In the input phase, it is in principle sufficient that "end of document" and "end of word" are defined. In some systems, "end of sentence" and "end of paragraph" may also be defined, allowing retrieval on the distance between words in terms of these text segments.
Retrieval is based on word occurrences and combinations, and in a text retrieval system any word in the material may be retrieved. In order to obtain a high degree of efficiency, one does, however, exclude some common words or stop words - prepositions, conjunctions, articles, pronouns, certain verbs, etc - occurring frequently in the text and rarely helpful in describing the content of a document with respect to the retrieval system. Such stop words generally make up some 35-45 per cent of the text volume.
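A minimal sketch of stop-word removal, assuming a tiny illustrative stop list (actual lists run to a few hundred words):

```python
# Illustrative stop list; real systems use much larger ones.
STOP_WORDS = {"the", "of", "a", "in", "is", "and", "to", "on"}

def index_terms(text):
    # Keep only words not on the stop list.
    return [w for w in text.lower().split() if w not in STOP_WORDS]

text = "the importance of the morale of the mother in cases on child custody"
terms = index_terms(text)
print(terms)  # only content-bearing words remain
```

In this small example the stop words alone account for roughly half the word tokens, in line with the 35-45 per cent figure cited above.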
4.2.3 Interest retrieval
Text retrieval systems are well suited for interest retrieval. The documents are stored in natural language, making it possible for the user to employ any word of the text in the search request. For specific problems, specific requests may be constructed; and properties of natural language documents make it possible to rank retrieved documents, for instance using word statistics.
They are, however, less suited for fact retrieval. The system has no exact data on the content of the documents (only on which words they contain), and cannot respond to requests presuming such data. The system may furnish information on which words the documents contain and in what sequence, and if this is sufficient for the fact retrieval request, it may serve such a purpose. Retrieving on a compilation of statutes, it may for instance specify how many documents cite a certain statute by using the identification of this statute as a search request.
Though free text formats do not presume editing, the inclusion of fixed fields containing predefined data (name of author, title, date, citations, etc) in practice often implies editing. The actual computer system has facilities to exploit such fields as well. In this chapter we shall, however, concentrate on the simple, unedited form in which only a raw text is input. But systems having a superstructure of fixed fields may be successfully used both for fact and interest retrieval.
4.3 The retrieval process
The retrieval process is defined as the total process from the emerging problem (information need) to the final answer from the system. Fig 4/2 is a sketch of the retrieval process consisting of two subprocesses - one manual and one computerized (Harvold 1980a:133).
The first process (the transformation of the problem to a request) is the responsibility of the user. In a certain search language, the user attempts to express his information need according to the requirements of the system. This is generally the only possible way in which the user is permitted to specify his problem. It is therefore important that the search language is flexible, and is exploited in such a way that the request reflects the problem.
Some text retrieval systems are based on mid-users. The end-user is not allowed to use the terminal, but has to rely on certain users trained in the use of the system. In such cases, the transformation process itself will have two steps - first the formulation of the information need in natural language, and then the translation of this into the retrieval language of the system. In cases where the system permits natural language requests, such a translation does not have to take place even if mid-users are introduced.
The second subprocess is carried out by the retrieval system. Based on the request, the system attempts to find all and only the relevant documents.
Generally the formulation of a search request and the search process have to be repeated several times before the final result is achieved, reflecting the iterative process described above under sect 2.5. The user may, for instance, in browsing through the retrieved documents, get ideas for improving the search request. There have been quite successful experiments in computerization of this feedback process, cfr for instance Salton's experiments on relevance feedback and the experiments within the Responsa project on local metrical feedback.
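One computerized form of this feedback is Rocchio's relevance feedback, developed within Salton's SMART project. The sketch below assumes requests and documents represented as simple term-weight dictionaries; the weighting constants are illustrative choices, not values from the cited experiments:

```python
# Rocchio-style relevance feedback: the request vector is moved towards
# documents judged relevant and away from those judged non-relevant.
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    terms = set(query) | {t for d in relevant + nonrelevant for t in d}
    new_query = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        new_query[t] = max(w, 0.0)  # negative weights are clipped to zero
    return new_query

q = {"custody": 1.0}
rel = [{"custody": 1.0, "morale": 1.0}]
print(rocchio(q, rel, []))  # "morale" now carries weight in the request
```

The reformulated request thus picks up terms the user never specified, which is precisely what makes the iteration productive.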
Fig 4/2 - The retrieval process
Assessing the result generally implies a comparison of the result with the set of relevant documents the user would have identified by reading the total number of documents in the data base. Naming this set S1, and the set of retrieved documents S2, the intersection between the two sets will be a measure of the quality of the computerized search.
Fig 4/3 - Definition of recall and precision
The result is usually measured in recall and precision - recall being a measure of the proportion of relevant documents retrieved, and precision a measure of how many of the retrieved documents are relevant.
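With S1 and S2 as defined above, the two measures may be computed directly (the document sets here are invented for illustration):

```python
# Recall: share of the relevant documents (S1) that were retrieved.
def recall(s1, s2):
    return len(s1 & s2) / len(s1)

# Precision: share of the retrieved documents (S2) that are relevant.
def precision(s1, s2):
    return len(s1 & s2) / len(s2)

S1 = {1, 2, 3, 4}  # relevant documents
S2 = {2, 3, 5}     # retrieved documents
print(recall(S1, S2))     # 2 of 4 relevant documents retrieved: 0.5
print(precision(S1, S2))  # 2 of 3 retrieved documents are relevant
```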
The concept of relevance is referred to several times in this chapter without being specified or defined. In sect 2.6.1 above we have discussed the relevance of legal sources and suggested a definition for legal information retrieval systems. A more general discussion of the concept will follow below as an appendix to this chapter.
4.4 Choosing the data base
In this section, we shall discuss some general aspects related to document design and choice of data base. To some extent we shall repeat some of the arguments offered above under sect 3.2 and 3.3, but this time related specifically to text retrieval systems.
When establishing a computerized system, one will start by defining a documentation area, and within this area choose the sources to be represented in the system as documents. In this book, the term "document" is reserved for the representation in the system (cfr above under sect 3.3.1), though frequently it is used both for the original source and its representation in the system.
The choice of documentation area and the document selection and design are central elements in establishing an information system. An information retrieval system will never yield more information than has been put into it. The data base limits the possible performance of the system.
When discussing the retrieval process, one will perceive both the problem of the user and the data base of the system as determined outside the process. It should, however, be emphasized that the data base and the document design are essential for system performance, and should not be overlooked in the search for factors explaining this performance. In particular, document design surprisingly often escapes notice as a critical element in the system.
There are a number of considerations in selecting sources from the documentation area. The sources should be up to date, the data base to be constructed should be representative of the area selected and - not least important - the system should operate within acceptable economic limits.
From the user's point of view, it is essential that the information need is covered as well as possible by one system only, eliminating the frequent necessity of employing more than one system to solve the problem. This is usually measured by coverage, which is discussed above at sect 3.5.3 (2) in terms of user-constructed information systems, including some critical comments on the general use of the concept.
In sect 3.3 above the problem of document design is also discussed, a discussion concentrating on three typical designs - indexing terms, abstracts and authentic texts. As stated several times, many properties of the retrieval systems are determined by this design rather than by the retrieval software. This design will also largely determine the costs of establishing the data base.
In principle the authentic text requires no editing, and is consequently the least costly solution according to factor (1). Often abstracts, too, are established outside the system (for instance as an introduction or headnote to the authentic text), but indexing terms are more seldom available in this way.
The editing of an abstract or the assignment of indexing terms are rather expensive processes. Each source must be assessed by at least one qualified person, and the document must be designed according to the prevailing norms. In intellectual indexing, complex indexing schemes are frequently employed - the development and maintenance of these (like thesauri or controlled indexing vocabularies) may by themselves be quite expensive.
The costs of factors (3)-(5) are relative to the volume of the material. Because indexing terms will generally yield briefer documents than the other possibilities, they will also be less expensive.
The technological development produces cheaper hardware with increased capacity. The differences in the costs of conversion etc for the three types of documents have been reduced over the last decade, and this will continue in the future. If sources are available in machine readable form from word processing equipment, the authentic form may become the cheapest one to put into the system.
But even if the text is available in machine readable form, considerable effort may be needed to bring it into the form required by the information system. If, for instance, the text is riddled with printer's codes, it will have to be cleaned, and perhaps new codes for the information system will have to be introduced. That the text is machine readable is a minimum condition for reuse in an information system, but it is by no means a sufficient condition.
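A sketch of such a cleaning step; the printer's codes and the end-of-document marker shown are invented for illustration:

```python
import re

def clean(text):
    # Strip hypothetical printer's codes of the form <...>.
    text = re.sub(r"<[^>]*>", "", text)
    # Normalize the whitespace left behind by the removed codes.
    text = re.sub(r"\s+", " ", text).strip()
    # Append a (hypothetical) end-of-document code required by the target system.
    return text + " $$END$$"

raw = "<b>Statute</b> of  1984 <i>on social security</i>"
print(clean(raw))
```

Even this trivial case shows that reuse involves translation between two coding conventions, not merely reading a file.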
The conclusion therefore is that one should exploit existing material from outside the system as widely as possible. The costs of manual resources should be assessed against the costs of processing resources, and in this assessment the probable relative cost development should be taken into account. One should consider both the possible technological development and the possible availability of qualified persons.
4.5 Retrieval strategies
A retrieval strategy is a method on which the retrieval process (cfr fig 4/2) is based. It will include both the search language and the retrieval function.
The retrieval strategies are those aspects of a text retrieval system attracting most attention. The choice of retrieval strategy will be decisive for the way in which the retrieval system functions in practice. It will influence factors like user friendliness, retrieval performance and response time.
A retrieval strategy must be developed to suit the actual user environment. Requirements must often be given priorities, as it is difficult to develop a strategy which satisfies all requirements equally well. The requirement for high user friendliness will, for instance, easily come into conflict with high retrieval performance and short response time. When basing retrieval strategies on requests in natural language, it will be difficult to achieve the same level of retrieval performance as with "artificial" search languages.
4.5.2 The retrieval function
(1) Simulating relevance assessment
Comparing the retrieval process with a manual reading of the data base, the retrieval function may be perceived as a simulation of the relevance assessment. Its objective is to identify all and only relevant documents - ie the same documents as would be identified by reading the data base. A retrieval function must therefore be based on the same criteria for relevance, and should be constructed on the basis of an analysis of relevance assessment.
It is obviously no easy task to design a perfect retrieval function - ie a function retrieving all and only relevant documents. Firstly, one will not always agree on which documents are actually relevant in respect to a certain problem (cfr the appendix to this chapter on the concept of relevance). Secondly, it may pose problems to represent the criteria of relevance in a computerized system. Also, the retrieval criteria must be selected from the characteristics of natural language texts. Traditionally, one has three such characteristics: the occurrence of words, the frequency of words and the distance between words.
The difference in the relevance assessment in fact and interest retrieval must also be appreciated when designing retrieval functions. Generally one makes a distinction between two categories of retrieval functions - identity and nearness functions.
(2) Identity functions
An identity function is characterized by selecting only documents which satisfy all criteria in the search request. It actually bisects the data base, one segment being the retrieved documents, the other the unretrieved documents. Retrieved documents typically are not ranked before being presented to the user. This strategy is well suited for fact retrieval, but less suitable for interest retrieval.
The simplest form of an identity function is to retrieve all documents containing a certain, specified search term. This is typical of retrieval in defined fields, where one makes requirements according to the value of the field (for instance "author" equals "SALTON").
Boolean retrieval. Most text retrieval systems offer more advanced identity functions, making it possible both to define several search terms and to formulate requirements concerning the relations between the terms. The best known is based on Boolean algebra, named after the British mathematician George Boole (who actually based some of his classical examples on Jewish dietary law).
Boolean logic gives the user the possibility of qualifying the relations between search terms by Boolean operators like AND (conjunction), OR (disjunction) or NOT (negation). For instance the search request

R = T1 AND T2 NOT T3

makes the system retrieve all documents containing both the search terms T1 and T2, excluding those which also contain T3. In fig 4/4 this set of documents is indicated:
Fig 4/4 - Illustration of Boolean retrieval
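Evaluated against an inverted file, the request R = T1 AND T2 NOT T3 corresponds directly to set operations (the postings below are illustrative):

```python
# Postings: for each term, the set of documents containing it.
postings = {
    "T1": {1, 2, 3, 5},
    "T2": {2, 3, 4},
    "T3": {3},
}

# AND maps to set intersection, NOT to set difference.
result = (postings["T1"] & postings["T2"]) - postings["T3"]
print(sorted(result))  # documents containing T1 and T2 but not T3
```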
The operator OR may be used to represent synonyms, for instance replacing T2 and T3 in the request above by

(T21 OR T22 OR ... OR T2n)

(T31 OR T32 OR ... OR T3j)
The strength of this retrieval strategy is its flexibility, and the possibility it offers experienced users to construct complicated and well-performing requests. In principle, high retrieval performance is always possible.
The drawback of the function is the high demands posed on the user, both with respect to the search language and to the specification of search terms. An inexperienced user will have difficulties in exploiting the possibilities, and a request may therefore easily have a structure which is different from the one intended by the user (for instance the user mixes up the ORs and ANDs, or is unaware of the sequence in which they are executed).
Since Boolean requests presume that all specified requirements must be satisfied by a document, a strict demand is implied on the user to specify terms. If the user excludes a synonym of T1 (cfr the example above), this will result in the non-retrieval of a document using only this synonym to express the relevant idea - even if the other requirements in the request are met.
Or more generally - a request of the type "T1 AND T2 AND ... AND Tn" will result in the non-retrieval of a document containing all terms except one. This document will be qualified as just as uninteresting as a document not containing any of the specified terms. As we know that specificity (less than exhaustive specification of search terms) is the most important cause of recall failure (60-70 per cent), this is a serious objection to the use of Boolean requests in text retrieval (cfr Fjeldvig 1980:161-163).
Nevertheless, Boolean retrieval is by far the most widely used strategy - also for text retrieval. This does not imply that this retrieval strategy is better than others, but rather that its traditions are long, and that it works with short response times in an on-line system. The latter property is important, as it gives the user a real possibility to reformulate and improve his request, and in that way introduces a dynamic property into the otherwise quite static identity function.
(3) Nearness functions
Nearness functions are designed for interest retrieval. Characteristic for interest retrieval is the difficulty - if not impossibility - of defining unambiguous criteria for relevance. The relevance assessment is often complicated, and based on vague considerations related to the documents. Nearness functions therefore result in a list of documents, ranked according to the probability of relevance. In browsing through the documents ranked highest, one will have the highest probability of identifying relevant documents or getting ideas on how to improve the search request.
A nearness function calculates similarity between the search request and the documents from criteria assumed to indicate relevance. Identity between the document and the search request denotes a high probability of relevance.
A nearness function presumes that the documents may be ranked according to some measure of probable relevance.
Word frequency based ranking algorithms. There are several different nearness functions. Most are based on one of the following ranking criteria: (1) the number of different search terms occurring in the document, or (2) the total frequency of the search terms in the document.
The simplest and most usual forms of ranking are based on frequency (1) or (2). These criteria are based on a hypothesis that the more search terms a document contain, the higher the probability for an idea to be present in the document. The hypothesis is tested in for instance the controlled experiments in
[Page 164 ]
text retrieval at the Norwegian Research Center for Computers and Law, where both frequencies gave positive results when used as ranking criteria (cfr Bing/Harvold 1974:52-55). Later experiments have also demonstrated that the ranking improved when the frequencies were modified with respect to document length (cfr Fjeldvig 1976:57-66). This corresponds to intuitive ideas, as a higher frequency of search terms in one document compared to another might simply be caused by the first document being relatively longer than the second.
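The effect of modifying frequencies for document length can be sketched as below; the documents, the query and the simple normalization (dividing by the number of words) are invented for illustration and do not reproduce the experimental algorithms.

```python
# A sketch of frequency-based ranking with and without modification for
# document length, as discussed above. Documents, query and the simple
# length normalization are invented for illustration.

def rank_by_frequency(query_terms, documents, normalize=True):
    """Rank documents by total frequency of the query terms, optionally
    divided by document length (number of words)."""
    scores = {}
    for doc_id, words in documents.items():
        hits = sum(words.count(t) for t in query_terms)
        scores[doc_id] = hits / len(words) if normalize else hits
    return sorted(scores, key=scores.get, reverse=True)

documents = {
    "short": "pension claim pension".split(),
    "long": ("pension claim pension claim " + "filler " * 20).split(),
}
query = ["pension", "claim"]

raw = rank_by_frequency(query, documents, normalize=False)   # "long" wins
normalized = rank_by_frequency(query, documents)             # "short" wins
```

The long document contains more raw occurrences simply because it is longer; normalization reverses the ranking in favour of the short, dense document.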
There are ranking algorithms combining frequencies (1) and (2), and also using them combined with other criteria. In the IBM text retrieval system STAIRS/VS the user may choose between five different algorithms of this kind, and several of these combine the frequencies with the number of documents containing a certain term (cfr Bing/Harvold/Kjønstad/Stabell 1976:94-101).
Vector retrieval. Best known of the nearness strategies is perhaps the cosine function in vector retrieval, which for instance Salton has analyzed in the SMART project (cfr for instance Salton 1971).
The cosine function is based on a vector representation of the documents. Each document is described in the form of an n-dimensional vector, where n is the number of different terms in the data base.
d = (T1, T2, T3, ..., Tn)
Each component, Ti, represents a certain term in the data base, and its value equals the weight assigned to this term in the document in question. If the term does not occur, its weight equals zero. In the example in fig 4/5, the data base contains only 11 different terms. Document 1, represented by vector d1, contains 6 of these words, and is assigned the weights 1, 10, 0, 3, 0, ..., 0. If the weights equal the frequencies of the words in the document, this implies that words 3, 5, 6, 7 and 11 do not occur in the document - and they are consequently assigned the weight 0.
When retrieving, the search request is also represented by a vector constructed in the same way, s (cfr the example). The similarity between the search
[Page 165 ]
request (search vector s) and each of the documents (document vectors di) is measured by the cosine of the angle between the vectors:

cos(s, di) = (s · di) / (|s| |di|)
The smaller the angle, the greater the similarity between the vectors and the higher the cosine value. A cosine value of 1 means that the search vector is parallel to the document vector - ie that the search request and the document contain exactly the same terms.
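The cosine measure can be sketched directly from the definition above; the 11-component vectors below are invented and do not reproduce the figure.

```python
import math

# A sketch of the cosine measure between a search vector s and document
# vectors, as described above. The 11-component vectors are invented.

def cosine(s, d):
    dot = sum(si * di for si, di in zip(s, d))
    norm = math.sqrt(sum(x * x for x in s)) * math.sqrt(sum(x * x for x in d))
    return dot / norm if norm else 0.0

s  = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # search request uses terms 1 and 2
d1 = [2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # same terms, same proportions
d2 = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # no terms in common
```

The vector d1 yields cosine 1 (parallel to the search vector), while d2, sharing no terms with the request, yields cosine 0.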
Fig 4/5 - Example of vector retrieval
[Page 170 ]
[Page 171 ]
Fig 4/6/1 - Average rp-curves based on different ranking algorithms, project NORIS (8) I
Fig 4/6/2 - Average rp-curves based on different ranking algorithms, project NORIS (8) III
[Page 172 ]
(4) Combinations of identity and nearness functions
Most conventional text retrieval systems are based on Boolean strategies, though several of these are used as much for interest retrieval as for fact retrieval. In recent years, several researchers have been developing methods combining Boolean retrieval with ranking, in order to increase precision and make the users more satisfied with the results.
In typical Boolean systems the terms are considered binary - a term either occurs in a document or it does not. This makes implementing a ranking algorithm difficult, but not impossible. It is easier in systems where the terms are weighted, reflecting the importance of the term in the document (cfr the description of nearness functions above). In the experimental system SIRE and the VEXT text retrieval system of the Norwegian Research Center for Computers and Law, the results of a Boolean retrieval are ranked by calculating the similarity between the retrieved documents and the search request. Similarity is calculated using only the terms contained in the search request - the operators are not taken into account. In both systems, the cosine function is used as a measure of similarity.
This strategy may be profitable to the user if the Boolean request has retrieved a great number of documents. With Boolean requests of a conjunctive structure, however, the requirements for retrieving a document may be so strict that the list of retrieved documents becomes too short for ranking to have any effect.
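The combined strategy can be sketched as below: a Boolean OR retrieval followed by cosine ranking computed over the query terms only. The documents and query are invented, and the sketch is not the actual SIRE or VEXT code.

```python
import math

# A sketch of combined Boolean retrieval and ranking, as described above.
# Documents and query are invented; not the actual SIRE or VEXT code.

def retrieve_and_rank(query_terms, documents):
    # Step 1: Boolean OR - keep any document containing at least one term.
    hits = {d: words for d, words in documents.items()
            if any(t in words for t in query_terms)}

    # Step 2: rank the hits by cosine similarity over the query terms only.
    def score(words):
        d = [words.count(t) for t in query_terms]
        dot = sum(d)                       # query weights are all 1
        norm = math.sqrt(len(query_terms)) * math.sqrt(sum(x * x for x in d))
        return dot / norm if norm else 0.0

    return sorted(hits, key=lambda doc: score(hits[doc]), reverse=True)

documents = {
    "a": "tax deduction income tax".split(),
    "b": "tax filler filler filler".split(),
    "c": "weather forecast".split(),
}
ranked = retrieve_and_rank(["tax", "deduction"], documents)
```

Document "a", containing both query terms, is ranked above "b", which satisfies the OR request with one term only; "c" is not retrieved at all.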
Conceptor based retrieval may actually be considered an example of a combination of Boolean retrieval and ranking in a way favouring interest retrieval. If each conceptor is considered a Boolean argument, the ranking will be determined by how many such arguments are satisfied in the documents. In this way one may express the requirements of each rank set as a Boolean argument. If the search request contains, for instance, three conceptors (classes), the requirements of the first (S1), the second (S2) and the third (S3) rank sets may each be expressed as such an argument (cfr Harvold 1980a:142-145).
[Page 173 ]
Another way of using term weights in Boolean retrieval is to permit weighted terms directly in the Boolean request. This is a considerably more complex procedure, as a normal Boolean retrieval strategy is based on the occurrence or non-occurrence of a term. Introducing weights may imply, for instance, that a term occurs only partially, or that one term is more important than another. It is therefore necessary to arrive at an interpretation of requests composed of weighted terms combined by Boolean operators.
"Fuzzy set retrieval" is an example of such a procedure. It is based on the "fuzzy set" theory which is again based on the presumption that an element may be a partial member of a set. In the model of retrieval, the weight of the term T is a document set equal to the membership degree of the document in the set of documents containing term T. If the weight equals 1, the document is a full member, and if the weight equals 0 the document is not a member. A weight between 0 and 1 will denote partial membership.
The model also takes into account dependency between terms in such a way that several terms may belong to the same concept class with variable membership.
The theory is especially attractive as it satisfies the general rules for Boolean retrieval when the weights equal 1 or 0. It does not, however, provide for assigning weights to the terms of the request. One may therefore characterize fuzzy set retrieval as an extension of normal Boolean retrieval. It has been experimentally tested in different contexts (cfr Salton 1983:421-422), but is not yet implemented in any operative system.
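The fuzzy interpretation of the operators can be sketched with the standard minimum/maximum rules; the document weights below are invented.

```python
# A sketch of fuzzy set retrieval as described above: term weights in
# [0, 1] are membership degrees, AND is interpreted as the minimum and OR
# as the maximum of the degrees. The document weights are invented.

def fuzzy_and(*degrees):
    return min(degrees)

def fuzzy_or(*degrees):
    return max(degrees)

# Membership degrees of one document in the sets for terms T1 and T2.
doc = {"T1": 0.8, "T2": 0.4}

score_and = fuzzy_and(doc["T1"], doc["T2"])   # partial membership: 0.4
score_or  = fuzzy_or(doc["T1"], doc["T2"])    # partial membership: 0.8
```

When the weights are restricted to 0 or 1, min and max reduce to ordinary Boolean AND and OR, which is the attractive property noted above.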
[Page 174 ]
4.5.3 Iterative techniques
In an interactive retrieval system, the retrieval process will often be a process of learning. One may start with rather vague notions of how the problem at hand may be represented in the documents, and only after several searches formulate the request in such a way that it will correspond to the way in which the problem is described in the data base.
Usually the request is reformulated without the assistance of the system. The user will browse through the retrieved documents and discover how the problem is described in these. The user will have some information on how the request has performed, for instance how many documents have been retrieved and which search terms do not occur at all.
During this iterative process the user may change his understanding of the problem. The retrieved documents may for instance disclose aspects of the problem of which the user had not been aware. This form of learning was clearly demonstrated in one of the controlled experiments at the Norwegian Research Center for Computers and Law, where the user, after having read the result of the computerized retrieval, made major adjustments to his predefined list of relevant documents (cfr Bing/Harvold/Kjønstad/Stabell 1976:127-131 and the detailed discussion in Kjønstad 1976).
This experiment in reassessment of predefined relevant documents was conducted within the project NORIS (8) I. The predefined documents were determined by manual reading of a collection of 100 decisions by the Social Security Court in respect of 20 problems. The answer set qualified a total of 162 documents as relevant to these problems. After the computerized retrieval, the same person who qualified the original answer set made a reassessment on the basis of the results. In this reassessment, 16 of the original target documents were eliminated, and as many as 61 new ones were added. The person responsible for this selection was professor Asbjørn Kjønstad, at this time working on his doctorate within the area covered by the 100 documents. In Kjønstad
[Page 175 ]
1976, a unique, detailed discussion of each reassessment is made. The most important cause of the major adjustments is stated to be that the documents originally either were not fully understood or were even misunderstood. Several of the problems underlying the requests were also amended on the strength of the information in the retrieved documents. Due to the qualifications of the experimenter, together with the detailed analysis, we find this experiment of considerable general value.
During the last years a need has been felt for making the information system more active in this feedback process. Several different methods have been proposed, and several of these have also been tested in different experimental systems (for instance SMART, FAKYR, Responsa and MEDUSA - the latter based on MEDLINE). Below, some types of feedback mechanisms are discussed.
(2) Relevance feedback
Relevance feedback is a common term for methods by which the system takes part in governing the feedback process based on an identification of relevant documents. The method was introduced some years ago, and is described for instance by Salton 1968.
In short, the method implies that after a search, the user is asked to make a relevance assessment of the first documents on the list of results - for example the first 10. The user responds by indicating which documents are relevant (or irrelevant). On this basis, the system identifies which properties characterize the relevant as opposed to the irrelevant documents, and amends or proposes amendments to the search request - for instance by increasing the weights of those terms more typical of relevant than of irrelevant documents. The system may propose new terms which occur only in the relevant documents, or delete terms which occur only in the irrelevant documents. The actual reformulation of the request is in some systems a manual process based on suggestions from the system (like CITE), in other systems an automatic process (like SMART).
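The reweighting idea can be sketched roughly as below. This is not the actual formula of any named system: the documents, weights and step factor are invented for illustration.

```python
# A rough sketch of relevance feedback as described above: request weights
# are increased for terms typical of relevant documents and decreased for
# terms typical of irrelevant ones. Data and step factor are invented.

def feedback(request, relevant, irrelevant, step=1.0):
    """Amend a request (term -> weight) from relevance assessments."""
    amended = dict(request)
    terms = set(amended).union(*relevant, *irrelevant)
    for t in terms:
        r = sum(t in doc for doc in relevant) / max(len(relevant), 1)
        n = sum(t in doc for doc in irrelevant) / max(len(irrelevant), 1)
        amended[t] = max(amended.get(t, 0.0) + step * (r - n), 0.0)
    return {t: w for t, w in amended.items() if w > 0}

request = {"pension": 1.0}
relevant = [{"pension", "disability"}, {"pension", "disability", "claim"}]
irrelevant = [{"pension", "weather"}]
amended = feedback(request, relevant, irrelevant)
```

The term "disability", occurring only in the relevant documents, is added with full weight, while "weather", occurring only in the irrelevant one, is excluded.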
[Page 176 ]
Most experiments with relevance feedback have demonstrated improvements in performance (for instance Salton 1983:243). It may, however, be rather difficult to measure this improvement exactly, as it may be caused both by the relevance assessment itself and by the inclusion of new, relevant documents. It is the latter improvement which is desirable from the user's point of view.
A drawback with this technique is that it requires information on the terms contained in the retrieved and assessed documents. This presumes either a data structure which conserves the relation between the individual document and the terms contained in that document (like for instance a vector based system), or that these documents are read sequentially.
One of the few examples - perhaps the only - of operative use of relevance feedback is the system CITE, which is an interface to MEDLINE based on requests in natural language (cfr Doszkocs 1982).
The vector retrieval system VEXT of the Norwegian Research Center for Computers and Law lends itself to a simple example of relevance feedback. The system gives the user the possibility to define a whole document as a search request by a simple command. If a relevant document is retrieved by the usual strategy, one may as a second step use this document as a request. The method is heavy on processing resources, but has proved its usefulness in retrieving bibliographical documents, which are generally rather brief.
(3) Local metrical feedback
Local metrical feedback is an example of another type of feedback technique, which has been tested for instance in the Responsa project with encouraging results (cfr Attar/Frankel 1980). The method attempts to identify synonyms of the search terms ("searchonyms") by mapping the words occurring in proximity to the search terms in the retrieved documents. These words are either listed to the user, who then reformulates the search request; or they cause an
[Page 177 ]
automatic expansion of the request.
The method is qualified as "local" to stress that it exploits only information from documents already retrieved by a request - and not from the whole data base. The method is qualified as "metrical" to characterize the basic algorithm, which takes into account the distance to the search words in the text (for instance the number of words, sentences or paragraphs).
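The "searchonym" mapping can be sketched as below: words near a search term in the retrieved documents are collected, weighted by distance. The window size, the weighting and the texts are invented; the actual Responsa algorithm differs in detail.

```python
from collections import Counter

# A sketch of local metrical feedback: words co-occurring near a search
# term are ranked, weighted by distance. Window, weighting and texts are
# invented; the actual Responsa algorithm differs in detail.

def searchonyms(term, retrieved_texts, window=3):
    """Rank words co-occurring within `window` words of `term`."""
    scores = Counter()
    for text in retrieved_texts:
        words = text.lower().split()
        for i, w in enumerate(words):
            if w != term:
                continue
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i and words[j] != term:
                    scores[words[j]] += 1.0 / abs(j - i)  # closer counts more
    return scores.most_common()

texts = ["old age pension benefit", "pension benefit claim"]
candidates = searchonyms("pension", texts)
```

The highest ranked candidates would either be listed to the user or used to expand the request automatically, as described above.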
The method is also mentioned in chapter 5 under the description of the Responsa project.
(4) Computerized processing of search requests - snowball functions
Snowball functions is the name of another retrieval technique, which differs from those mentioned above in that it allows the system to retrieve new documents similar to those already retrieved. The system loops - ie the results of one search initiate a new search, and so on. In this way the number of retrieved documents increases rapidly, and the process does not halt until a certain number of loops have been made, or until the similarity between retrieved documents and the search request has dropped below a minimum value.
Snowball functions may be especially useful in citation systems. For instance, they may be used to trace all documents citing or cited by a certain target document. This document may be a case, and one might be interested in older cases cited by this one, or more recent cases citing the target case (cfr Loosjes 1973:77-84).
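A snowball function over a citation structure can be sketched as below: starting from a target case, the system repeatedly adds cases citing or cited by anything already retrieved, for a fixed number of loops. The citation data are invented.

```python
# A sketch of a snowball function over citations, as described above:
# repeatedly add cases citing or cited by anything already retrieved.
# The citation data are invented.

def snowball(start, cites, max_loops=2):
    """`cites` maps each case to the set of cases it cites."""
    cited_by = {}
    for case, targets in cites.items():
        for t in targets:
            cited_by.setdefault(t, set()).add(case)
    found = {start}
    for _ in range(max_loops):
        new = set()
        for case in found:
            new |= cites.get(case, set()) | cited_by.get(case, set())
        if new <= found:
            break          # nothing new retrieved: halt early
        found |= new
    return found

cites = {"C2": {"C1"}, "C3": {"C2"}, "C4": {"C3"}}
one_loop = snowball("C2", cites, max_loops=1)
two_loops = snowball("C2", cites, max_loops=2)
```

Each additional loop reaches further along the citation chains, illustrating how the number of retrieved documents grows rapidly.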
(5) A preprocessor to text retrieval systems
As an alternative to the techniques mentioned, one might want to invoke the idea of a more advanced process governed by the system, which could, for instance through a dialogue with the user produce a richer search request than the request specified by the unaided user.
[Page 178 ]
In 1982 a project was initiated at the Norwegian Research Center for Computers and Law - NORIS (58) FORT - which will analyse this possibility. The objective is to develop an "intelligent" preprocessor to a text retrieval system which, based for instance on knowledge of earlier terminal sessions, properties of the documents, thesauri, analysis of the search requests etc, suggests improvements to the search request. These suggestions may be accepted by the user, or they may be supplemented or ignored. The preprocessor may be turned on or off. When turned on, the preprocessor might start, for instance, with an analysis of the request and explain its effect to the user. It may point out that the Boolean structure is not proper, and it may suggest deletion of terms with a high frequency in the data base, improved truncation, splitting of compound words, or supplemental terms presumed to be synonymous with the search terms. The project will investigate techniques from artificial intelligence research for constructing a network of concepts covering sectors of the documentation area of the system. Rather than structuring a universe as a model of real objects and activities, one may structure a universe of the words and phrases used to describe this part of reality. The objective is to make this network available to the user for suggesting synonyms.
As a last possibility, one will look at the usefulness of implementing a model of automatic estimation of recall and precision (Harvold 1980b:172-219). On the basis of such estimated values the user may decide whether a new search should be conducted by a reformulated request, or whether to quit the terminal session.
This project is funded by the Norwegian Research Council for Science and Industry, with a planned termination in 1985.
[Page 179 ]
4.6 Aids in formulating search requests
4.6.1 Formulating the request
The basic problem in formulating the search request concerns the transformation of the semantic content of the problem into the syntactic criteria accepted by the retrieval function. This presumes an ability to analyse the information need and to identify the ideas ("semantic building blocks") from which this need is constructed. Secondly, it presumes that each idea may be described by terms assumably used in the documents representing that idea.
A well formulated search request presumes satisfactory results of both these processes. In practice, only a few users make a distinction between them, analysing the need and specifying the search terms simultaneously. Experience shows, however, that the formulation of the search request may be somewhat easier, and give better results, if a more stepwise approach is adopted.
In the literature great attention has been paid to the specification of search terms, search languages and tools used in the formulation of the request. Many users experience the formulation of a request as the most difficult task, and it is also the specification of search terms which represents the major cause of performance failure (Fjeldvig 1980:161-167).
For a request to be complete, it must contain all terms used in the document collection to represent the ideas - including the grammatical variations. This task is rather difficult, as there may be many words and phrases representing the same idea, and a single word may represent many ideas. The idea 'the moral of the mother' may be represented for instance by "the moral of the mother", "her moral" (implicitly that of the mother), "behaviour of the mother" or "dock girl" (examples from the source material to NORIS (34) STANS). The word "bar" may denote an organization of lawyers, a place serving beverages or it may mean a piece of metal.
Some systems offer aids to the user in specifying synonyms or resolving ambiguities (for instance thesauri), while others leave this to the user
[Page 180 ]
and are restricted to simple aids like right hand truncation. It is important, however, that the retrieval function gives the user the possibility of specifying relations between terms, for instance synonyms. In many cases there will also exist a need to specify the context of the terms, for instance "retrieval" in the sense of 'text retrieval system' or "computation" in the sense of 'computational linguistics'.
4.6.2 Examples of different aids
Truncation is a simple and valuable aid offered by most text retrieval systems. The user is given the possibility of specifying several search terms by specifying only a certain string of characters to be contained by the search term. Any word containing the defined string is qualified as a search term.
The most usual form of truncation is right hand truncation. For instance the request "car*" (where "*" is used as a symbol denoting truncation) will include any word beginning with the three letters "car" - for instance "cars", "carriage", "carload", "caravan", "caricature", etc. A drawback is the inclusion of improper search terms (like "caravan" and "caricature"), and the inability to cover vowel changes (for instance in strong verbs or irregular nouns).
In general, however, the improper terms are rather peripheral in respect to the problem. If the retrieved documents are ranked, the improper terms consequently will not affect the upper part of the ranked list of documents.
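Right hand truncation against a word index can be sketched as below; the vocabulary is invented.

```python
# A sketch of right hand truncation against a word index. The vocabulary
# is invented for illustration.

def expand_truncation(pattern, vocabulary):
    """Expand a right-hand truncated pattern like 'car*'."""
    if pattern.endswith("*"):
        stem = pattern[:-1]
        return sorted(w for w in vocabulary if w.startswith(stem))
    return [pattern] if pattern in vocabulary else []

vocabulary = ["car", "cars", "carriage", "caravan", "boat", "uncaring"]
matches = expand_truncation("car*", vocabulary)
```

Note that "uncaring", which contains "car" but does not begin with it, is not matched - that case would require a mask function of the kind discussed below.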
The Norwegian Research Center for Computers and Law has developed a method for automatic right hand truncation which performs quite as well as manual truncation for Norwegian texts.
Left hand truncation is especially important in languages making frequent use of compound words (for instance German and the Scandinavian languages) and
[Page 181 ]
prefixes (for instance German, Finnish and Hebrew). The request "*boat" would include words like "steamboat", "riverboat", "cargoboat", etc (though English is a language in which adequate examples are difficult to conceive). Left hand truncation presumes that the words are stored in a sequence different from that of right hand truncation (sorted backwards). As most systems give priority to right hand truncation, left hand truncation may require considerable resources. Therefore, rather few systems offer left hand truncation.
(2) Mask functions
There are systems which permit masking of any part of a word, not only truncation (for instance 3RIP and the Responsa project). The user is free to choose any string of characters for specifying the search words, leaving any characters masked. For instance the pattern "*col*rful*" would retrieve any word containing the two strings "col" and "rful" in the specified sequence - solving the problem of different British and American spellings (like "colourful" and "colorful").
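A mask function can be sketched by translating the pattern to a regular expression; this is one possible implementation, not how 3RIP or Responsa actually work.

```python
import re

# A sketch of a mask function: "*col*rful*" matches any word containing
# "col" and "rful" in that sequence. Translating the mask to a regular
# expression is one possible implementation, not the actual 3RIP code.

def mask_match(pattern, word):
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, word) is not None
```

The pattern covers both spellings mentioned above, while rejecting words lacking one of the specified strings.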
(3) Automatic stemming
As an alternative to right hand truncation, some systems offer to supply the request automatically with all grammatical variations of the search terms. This is either solved by having the system consult a dictionary where this type of information is stored, or using a routine for automatic stemming. Such methods are, for instance, developed for English, Norwegian, and Finnish.
A popular aid is the synonym thesaurus - or synonym dictionary, as it is also named. Especially in France and Italy considerable resources have been
[Page 182 ]
used in developing quite complex thesauri with relations in all directions. Attempts have been made to develop automatic synonym thesauri (for instance in the SMART and CONDOR projects), but we are not aware of their implementation in operational systems.
A synonym thesaurus will be able to specify only context independent synonyms. The actual efficiency of synonym thesauri in relation to their development and maintenance cost is therefore a matter of considerable dispute. Experiments have failed to prove large increases in performance. Saracevic (1970b:677) found that the use of a thesaurus did not increase performance in terms of recall and precision, while Salton (Salton/Lesk 1968:330-334) found a small improvement. It should, however, be made clear that both these experiments were based on rather small data bases, and that there is reason to believe that the importance of context independent synonyms increases with the size of the data base.
A large part of the context independent synonyms will be grammatical variations of the same word stem. Experiments show that approximately 75 per cent of the context independent synonyms are resolved by right hand truncation (Harvold 1974). As a supplement to right hand truncation, many systems offer the users the possibility of specifying their own synonym thesauri (for instance the macro command in NOVA*STATUS).
4.6.3 Choosing the level of performance
In spite of well developed aids for specifying search terms, experience shows that it may be difficult - or perhaps impossible - to achieve 100 per cent recall and precision at the same time, ie to retrieve all and only relevant documents. Even if all search terms are specified, there will be a possibility of losing an idea which may be "hidden" in the text, or of retrieving a document containing a specified idea unrelated to the other ideas in the search argument.
In addition, it is difficult to find terms which are sufficiently specific to secure that only relevant documents are retrieved, yet sufficiently general to secure that all relevant documents are retrieved. High
[Page 183 ]
recall generally implies low precision, and vice versa - the two measures tend to vary inversely.
When specifying a search request, one may to a certain degree influence the result in the direction of high recall or high precision by the choice of more general or more specific search terms, and by the logical structure of the request.
When using nearness functions, it may also be decided how far down the list of ranked documents one wants to hunt for possibly relevant documents. The further down, the higher the recall and the lower the precision.
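The trade-off at different cut-off points can be sketched as below; the ranked relevance judgments are invented, with 1 marking a relevant document.

```python
# A sketch of how recall and precision vary with the cut-off point in a
# ranked list, as described above. The relevance judgments are invented;
# 1 marks a relevant document.

def recall_precision(ranked_relevance, cutoff, total_relevant):
    retrieved = ranked_relevance[:cutoff]
    hits = sum(retrieved)
    return hits / total_relevant, hits / cutoff

# Relevance down a ranked list of 10 documents; 4 relevant in all.
ranked = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

top = recall_precision(ranked, 2, 4)      # high precision, low recall
bottom = recall_precision(ranked, 10, 4)  # high recall, low precision
```

Stopping after two documents gives full precision but only half the relevant documents; reading the whole list gives full recall at much lower precision.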
[Page 184 ]
4.7 Examples of file structures
Finally, we shall briefly mention how the data are stored and structured in a text retrieval system - ie which files are created, how the files are interrelated, and how the system retrieves the correct data.
A common misconception among users (and others) is that the text retrieval system searches the text sequentially, hunting for search terms. This is, of course, quite possible, but would demand large resources and is therefore quite unusual. In the text retrieval systems of today, the documents are processed by the system for efficient retrieval. In this section we shall give examples of two different ways of representing the data (file structures) - one based on traditional text retrieval (like Boolean retrieval), and one based on vector retrieval.
4.7.1 Inverted file structure
Most text retrieval systems are based on an inverted file structure with access through specified search terms. The file system consists of an index file (also called concordance, search file or dictionary file) in addition to a document file (also called text file). An example of an inverted file structure is given in fig 4/8.
The index file contains all words which may be used as search terms. With each word is associated an address or reference to each location in the documents where that word occurs. In NOVA*STATUS, the address contains a reference to the document, the sentence within the document and the word within the document. Often the address has more levels (for instance a paragraph of the document, and a field within the document).
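The construction of such an index can be sketched as below, with simplified addresses of the form (document, word position); the documents are invented, and the sketch does not reproduce the NOVA*STATUS address levels.

```python
# A sketch of an inverted file: each word points to the addresses
# (document, word position) where it occurs - a simplified version of the
# addressing described above. The documents are invented.

def build_inverted_index(documents):
    index = {}
    for doc_id, text in documents.items():
        for pos, word in enumerate(text.lower().split()):
            index.setdefault(word, []).append((doc_id, pos))
    return index

documents = {
    "doc1": "the court ruled on the appeal",
    "doc2": "the appeal was dismissed",
}
index = build_inverted_index(documents)
```

A search for "appeal" then reduces to one lookup in the index rather than a sequential scan of the text file.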
In some systems a pointer structure (chains) has also been integrated in the index file, making it possible to specify interrelations between the words as well (like synonyms). In the new Norwegian text retrieval system SIFT (cfr SIFT 1980) it is
[Page 185 ]
possible to associate each word with an unlimited number of pointers.
In order for the system to retrieve a certain word in the index file efficiently, the words are sorted in a certain sequence - usually alphabetically. To increase retrieval performance further, an index to the index file is often established, specifying the position of each word within the file (cfr fig 4/8).
The drawback of the traditional file structure (as illustrated in fig 4/8) is the difficulty of maintaining the files. The words are stored in a certain sequence, and when new words are introduced into the data base, the whole file must be reorganized. It is, of course, possible to create garbage collection (extra areas linked to the main file), but after a certain number of new words the files must nevertheless be reorganized in order to keep response time at an acceptable level.
In the text retrieval system SIFT this problem is solved by opting for a slightly different structure, so-called B-trees. In this structure the words are still stored alphabetically, but rather than in a sequential file, they are organized in blocks. When a new word is introduced, it is placed in the first free slot in the correct block. If this block is full, it is automatically split into two new blocks. In this way a complete reorganization of the file is avoided. Consequently, B-trees make it possible to maintain a short response time while the system may at the same time easily be updated. Fig 4/7 gives a sketch of the structure of B-trees.
[Page 186 ]
Fig 4/7 - B-trees
[Page 187 ]
Fig 4/8 - Example of the file system in NOVA*STATUS
[Page 188 ]
The text file contains all documents as they were input. This makes it possible for the user to list a whole document, or only part of it, on his terminal. Usually the documents are stored in chronological order, and when updating, new documents are added at the end of the file. To make access to the text file more efficient, NOVA*STATUS has for instance chosen to add an index to the text file, giving information on the position of a certain document in the file. Fig 4/8 illustrates the file system of NOVA*STATUS as implemented at the University of Oslo, giving an example of such a structure of the text file as well.
4.7.2 Vector based systems
In a vector based system, the data are centred on the documents rather than on the search terms. Each document is represented as a vector, and these vectors are employed in retrieval. A vector consists of as many components as there are different terms in the data base. Each word is represented by a fixed component in the vector, and the value of the component equals the weight given to the word in the document - for instance the frequency with which the word occurs in the document. If a word does not occur, the weight is set equal to zero. A document vector will therefore generally contain numerous zeroes (since most of the different words in the data base do not occur in any single document), and data compression is therefore necessary when storing vectors (cfr Fjeldvig 1978:24-27).
Document vectors are constructed only once. Even if new documents with new words are introduced at a later date, this has no influence on the existing document vectors. For each search, however, a search vector must be constructed with a structure identical to that of the document vectors. This presumes that the system keeps track of which components represent which words. In the vector retrieval system VEXT of the Norwegian Research Center for Computers and Law, this problem is solved by a file containing all the different words in the data base and their associated component numbers. The file is sorted alphabetically and has an index sequential structure.
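The compression mentioned above can be sketched as below: since most components are zero, only the non-zero (component number, weight) pairs need storing. The example vector is invented and does not reproduce the VEXT storage format.

```python
# A sketch of sparse storage of document vectors, as described above:
# only the non-zero (component number, weight) pairs are kept. The
# example vector is invented.

def compress(vector):
    """Keep only non-zero components as {component_number: weight}."""
    return {i: w for i, w in enumerate(vector) if w != 0}

def expand(sparse, n):
    """Rebuild the full n-component vector."""
    return [sparse.get(i, 0) for i in range(n)]

dense = [1, 1, 0, 0, 3, 0, 0, 0, 0, 0, 2]
sparse = compress(dense)
```

The 11-component vector reduces to four stored pairs, and the full vector can be rebuilt whenever a similarity computation needs it.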
[Page 189 ]
4.8 Appendix I: Some important text retrieval systems
As far as the systems mentioned below are used in a legal information service, a description is included in one or more of the discussions of these systems in chapters 6-7.
This is a Swedish system developed at the end of the 1970s, running on DEC hardware under the operating system TOPS. The system is based on Boolean retrieval, interval retrieval, distance operators and masking. The search may be limited to certain areas of the document. The system includes several output formats and makes it possible to edit on the DEC-10. The search language includes CCL (Common Command Language).
We are not aware of any legal application of the system, though experiments have been made in Dublin (Republic of Ireland). The system is marketed by Paralog AB and is documented in 3RIP manual 1979.
This is a Swiss system constructed in 1968-70. The system was initially named CONTEXT, but was marketed under the name DATA+PLUS (Bing/Harvold 1977:140). It is one of the few - perhaps the only - operational systems based on vector retrieval. The search strategy is designed for natural language requests, and a document in the data base may be used as a request. The system also provides for a synonym thesaurus. A description may be found for instance in DATA+PLUS 100 (1976).
A US system developed by Lockheed Information Systems (Palo Alto, California). The system is very simple, based on Boolean retrieval and right hand truncation. In May 1980 as many as 122 different data bases were accessible under the system (Salton 1983:30-34), some of which are of legal interest.
[Page 190 ]
A Swedish system developed at the end of the 1960s by Industri-Matematik AB, marketed in different versions and sometimes under different names (Factfinder). The system is simple in both design and use. It offers Boolean retrieval, interval retrieval, synonyms (using a thesaurus) and right hand truncation. Left hand truncation is permitted for intra-document search. The system does not include the possibility of searching defined fields of the document.
IMDOC is widely used in Sweden, both by public agencies and private industries. It is used in RAETTSDATA, which is a group of coordinated legal information systems. The program is implemented on various hardware. A brief description of the system is given for instance by Bing/Harvold 1977:135-136, Svoboda 1978 and IR-meddelanden nr 21.
This is an Italian retrieval system used in the national legal information system ITALGIURE. The system is based on Boolean retrieval and permits interval retrieval, right hand truncation, masking, distance operators etc. It is also possible to search defined fields of the document. The system permits the use of synonym thesauri, and will expand the request automatically with synonyms and grammatical variations of the search terms. A description is given by Svoboda 1968.
This is a German system developed by Siemens AG, used for instance in the legal information retrieval system JURIS. The system has roughly the same possibilities as FIND, including the use of synonym thesauri. An accessory program for automatic indexing known as PASSAT has been developed, analyzing the documents and performing some computational linguistics.
GOLEM is released in a new version approximately once a year. At one time there were plans for replacing
[Page 191 ]
the system by a new and more advanced system known as CONDOR. This project was terminated in 1981.
The system runs only on Siemens hardware. It is described by Svoboda 1978 and IR-meddelanden nr 21.
A US system used in the medical information system MEDLINE at the National Library of Medicine. The system is actually not a true text retrieval system, but is well known and should therefore be mentioned. It was developed in 1964 and was the first on-line information retrieval system.
MEDLARS is especially designed for retrieval of bibliographical data, and is based on documents indexed using a controlled vocabulary. It is based on Boolean retrieval, with a retrieval process limited to certain fields. To make the system friendlier to end-users, a general user interface (CITE) has been developed, based on natural language requests and relevance feedback.
The system has been used in many experiments in text retrieval. A description is given by Salton 1983:30-34.
A Finnish system developed by Statens datamaskincentral (Government computer center) in Finland and used in the Finnish legal information service FINLEX. The retrieval is based on Boolean requests and interval retrieval. Synonym retrieval is an accessory function. It is not possible to search within defined fields of the documents. In the basic version, document length is limited. It is implemented for several minicomputers.
The system is documented in a manual from 1982.
This is a French system developed by Honeywell Bull.
[Page 192 ]
In a somewhat modified version it is operated by Telesystems in Paris under the name QUESTEL, a system which (in 1980) included 20 data bases and some 1.8 million bibliographic entries.
The system is designed for bibliographical data bases, and the user must specify which fields of the document are to be included in the search. The retrieval is governed by arithmetical and Boolean operators. Help functions, synonym search, right hand truncation and masking are available. The search language includes CCL (Common Command Language). In France, an accessory system for automatic indexing has also been developed. This system will flag spelling errors and recognize grammatical variants of the same word.
The Norwegian version of the British STATUS system, developed by the Government Institution for Organization and Management in cooperation with the Norwegian Research Center for Computers and Law. The system is based on the British STATUS I, but it is estimated that at present it retains less than 30 per cent of the British coding.
The system permits both Boolean and conceptor based retrieval, right hand truncation, interval retrieval and use of distance operators. It is not possible to create a synonym thesaurus in the system, but macros may be established.
The system is available on several different installations in government agencies, including the Ministry of Justice and LAWDATA.
A US system developed by Mead Data Central (MDC) for retrieving legal and other text data bases. LEXIS is considered the largest text retrieval system in the world (measured in data bases). The system has a large number of users, not only in the US, but also in the United Kingdom, France and elsewhere.
[Page 193 ]
The retrieval system is quite simple, based on traditional use of Boolean and distance operators. The system also has simple routines for recognition of grammatical variants, and will consequently be able to recognize and process similar words with the same basic form. The retrieval system is documented for instance in Salton 1983:46.
A Norwegian system developed by the Norwegian Center for Informatics Ltd, especially designed for retrieving bibliographical data - but it may also be used for text retrieval. In retrieving, the field within the document must be specified, but this field may have the form of a lengthy text. The system is based on Boolean requests and permits right hand truncation.
POLYDOC is characterized by its output format, producing KWAC, KWOC or KWIC lists and a number of other indexes. A microcomputer-oriented version, MICROPOLYDOC, has also been developed. A description may be found in IR-meddelanden nr 21.
An Israeli system developed as part of the famous Responsa project at the Institute for Information Retrieval and Computational Linguistics at the University of Jerusalem. The system is especially constructed for text retrieval and large text bases. Today there are some 180 data bases and 30-35,000 documents.
The system is being continually developed. Today it includes, in addition to the functions found in conventional retrieval systems, several advanced forms of automatic indexing. A linguistic program, KEDMA, has been developed which inflects verbs and nouns and performs a morphological analysis of the words. The language is Hebrew. In the project, great effort has been made to develop methods for retrieval and storage of data which do not require large resources.
The system is used for experiments in information retrieval. A description is given by Choueka 1980.
[Page 194 ]
A Norwegian system under development in cooperation between several institutions, under the direction and inspiration of the Government Institution for Organization and Management.
The system permits Boolean and conceptor based retrieval. It enables users to search defined fields of documents and, if desired, in several data bases simultaneously. The inverted file structure is based on B-trees, and it supports several different relations between words (for instance synonyms). Interval retrieval, right hand truncation, masking and distance operators are available. Operators make possible relational searching in table structures. Historical versions of a document are available in browse mode. The system does not impose any limits on document length, data base size or number of data bases. A description is given in the SIFT specification of 1980.
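The inverted file principle mentioned here can be sketched as follows: each word is mapped to the set of documents in which it occurs, and Boolean retrieval reduces to set operations on those postings. A Python dict stands in for SIFT's B-tree storage, and the documents are invented for illustration.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the set of documents containing it.
    SIFT stores this structure in B-trees; a dict stands in here."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    1: "act relating to courts of law",
    2: "regulation on court procedure",
    3: "act on public administration",
}
index = build_inverted_index(docs)

# Boolean AND: intersect the posting sets of the search terms.
hits = index["act"] & index["courts"]
```

A B-tree gives the same word-to-postings mapping while keeping the vocabulary sorted on disk, which is what makes interval retrieval and truncation over ranges of words efficient.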
(16) SPIRIT (Système syntaxique et probabiliste d'indexation et de recherche d'informations textuelles)
A French system developed by Systex. The system has been released, but still bears the imprint of a research project and is so far used only for experiments (IR-meddelanden nr 4). It has been tested on legal data.
The system differs from traditional text retrieval systems in several ways. The requests are based on natural language. The similarity between the request and the documents is measured on the basis of a morphological, syntactic and statistical analysis, and is supposed to express the semantic distance between the request and the document. The system contains advanced methods for automatic indexing, in which the documents are analysed grammatically, the root words lemmatized, insignificant words deleted, compound words split, homographs identified and resolved, typing errors corrected, etc. Synonym thesauri are included.
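Automatic indexing steps of this kind - deleting insignificant words and reducing the remaining words to a base form - can be sketched as follows. The stop list and lemma table are invented stand-ins; SPIRIT's real dictionaries and morphological analysis are far richer.

```python
# Illustrative stop list and lemma table only; a real system derives
# these from full morphological dictionaries.
STOPWORDS = {"the", "of", "a", "an", "is", "in", "to"}
LEMMAS = {"courts": "court", "decisions": "decision", "ruled": "rule"}

def index_terms(text):
    """Delete insignificant words and reduce the rest to a base form."""
    words = text.lower().split()
    return [LEMMAS.get(w, w) for w in words if w not in STOPWORDS]

terms = index_terms("The decisions of the courts")
```

The surviving base forms, not the raw words, are what get entered into the index, so that a request mentioning "court" will also match documents speaking of "courts".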
The system operates on minicomputers, and is
[Page 195 ]
described for instance in Fluhr 1981.
(17) STAIRS (Storage and information retrieval system)
This is an IBM program product running on IBM hardware and existing in several versions (for instance STAIRS/VS, STAIRS/TLS or STAIRS/DL/1).
The system is based on Boolean retrieval and proximity functions. CCL (Common Command Language) is included. Right hand truncation, left hand truncation (through sequential retrieval) and distance operators are available. Several data bases (up to 16) may be concatenated for simultaneous retrieval. For the French, English and German versions the extension TLS (Thesaurus and Linguistic Integrated System) is available, which permits synonym retrieval and identification of words with the same basic form.
A description of the system is given in IR-meddelanden nr 25 and Svoboda 1978.
(18) STATUS II
A British system used today in several information services, for instance EUROLEX. The system is a new version of STATUS I, and it shares several characteristics with NOVA*STATUS. It is based on Boolean retrieval. The retrieval may be limited to certain fields of the document, and to certain sub-sets of the data base. Right hand and left hand (sequential retrieval) truncation are available, and macros may be defined. The document length, the size of the data base and the number of data bases are unlimited.
The system runs on different hardware. A brief description is found in IR-meddelanden nr 7.
[Page 196 ]
4.9 Appendix II: The concept of relevance
The measure of performance of text retrieval systems in terms of recall and precision may seem simple at first glance, but this impression is something of an illusion. The measures depend entirely on the concept of relevance, and this is probably one of the most difficult concepts in the whole field of information retrieval.
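The recall and precision ratios referred to here are conventionally defined as the proportion of the relevant documents that are retrieved, and the proportion of the retrieved documents that are relevant. A minimal sketch, with invented document sets:

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

retrieved = {1, 2, 3, 4}   # documents returned by the system
relevant = {2, 4, 5}       # documents judged relevant by a juror
# Here recall = 2/3 and precision = 2/4.
```

Both ratios presuppose that the set of relevant documents is known, which is exactly where the concept of relevance enters the measurement.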
There are at least three issues of importance to our understanding of relevance. The first issue concerns the type of relevance. Are we really dealing with formal, content, or subjective relevance (Koenigova 1971)? The second issue concerns the nature of relevance. Is relevance absolute or is it relative to each user? The last issue concerns the grading of relevance. Is relevance a matter of degree or is it an either/or proposition?
4.9.1 Types of relevance: Formal, content, and subjective relevance
Relevance is a relator in the sense that it says something about one thing in relation to another thing. The two "things" might be the syntactics of two texts, in which case we talk of formal relevance. Or they may be a problem and the content of a text, in which case we talk of content or subjective relevance, depending on the type of situation in which the evaluation takes place.
Formal relevance measures the syntactic similarity between two texts. The texts may be a search request and a document, or two documents. Thus the formal relevance of a document is a value assigned to the document by a matching function. Formal relevance is based on the syntactic structure of the document, not on the content of the document, nor on its usefulness to the user. Formal relevance will usually reflect the similarity between two texts as measured, for example, by a matching of words. But it may also reflect general criteria like type and age of document, author, and so on. The nature of formal relevance is absolute. Given two texts and a matching function,
[Page 197 ]
the relevance value is unambiguously defined. Depending on the matching function, the grading may be either/or (binary) or by degrees.
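A matching function of the simplest kind - the share of request words that occur in the document - can be sketched as follows. The texts are invented, and operational systems use more refined functions, but the point stands: the value depends only on the syntax of the two texts.

```python
def formal_relevance(request, document):
    """A simple matching function: the share of request words that
    occur in the document.  The value is fixed by the two texts alone,
    never by their content or their usefulness to the user."""
    req = set(request.lower().split())
    doc = set(document.lower().split())
    return len(req & doc) / len(req)

score = formal_relevance("liability for defective products",
                         "the seller's liability for defective goods")
```

Given the same two texts, the function always returns the same value - this is the absoluteness of formal relevance - and because the value varies continuously between 0 and 1, the grading here is by degrees rather than binary.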
Content relevance is defined as the adequacy of the content of a document as a response to the request. Subjective relevance will depend on a host of factors, including the user's previous knowledge. In the literature subjective relevance has also been characterized as "utility" (Cooper 1971) and "pertinence" (Foskett 1972).
The choice between content and subjective relevance must to a certain extent reflect the type of decision-making situation in which the user finds himself.
In an informal decision-making situation where the value of the decision depends on future consequences, there are no rules that the decision-maker is forced to consider. The only thing that counts is the decision itself. How it is arrived at is irrelevant.
The quite opposite situation exists where the decision process is formalized in such a way that the validity of the decision depends mainly on the premises on which the decision is based. The validity of a decision made by a court of law, for instance, will depend on whether or not certain procedures demand that certain legal sources shall be consulted. If the judge neglects to consult one of the required sources, his decision is invalid, even if it turns out that the same decision would have been arrived at had the source been taken into account.
In legal decision-making, as in other formal situations, it is therefore not appropriate to regard relevance as entirely subjective. The relevance of a document cannot be defined only in terms of its usefulness to the user, since such a definition implies that only documents which in some way cause the decision-maker to change his mind are relevant. Instead we must base relevancy on the content of the document. Whether or not the document is relevant will depend on the adequacy of the content of the document as a response to the request. Part of the content will be features like date of publication, author, and so on.
[Page 198 ]
4.9.2 The nature of relevance: Absolute and relative relevance
We have earlier remarked that formal relevance is absolute in the sense that it does not depend on individual value judgements. Subjective relevance, on the other hand, is relative in the sense that it is particular to each individual user. In fact subjective relevance is not only relative to each user, but to each user situation, since the background knowledge of the user, which changes constantly, is a main factor affecting the utility of additional information.
The nature of content relevance is more difficult to assess. Content relevance measures the adequacy of a document as a response to the request. Content relevance does not depend on whether or not the user himself finds the information useful in the sense that it is new to him.
In a strictly formal system there are definite rules for evaluating relevance. There is little or no room for the personal opinion of the individual user. In such a system content relevance tends to be absolute.
While the legal system has certain characteristics of a formal system, these are not sufficiently prominent to make the assessment of content relevance absolute. This is not only evident in legal theory with its emphasis on human judgement within the often broad limits of the law, but is also borne out by several empirical investigations regarding relevance assessment. This is not to say that relevance assessment in the legal system is completely relative. Clearly it is not. As so often in the law, the solution must be found somewhere in between the extremes, the circumstances of each case deciding to which side the scale will tip.
4.9.3 The grading of relevance: Grading by degrees or binary grading?
Even when we know the type and nature of the relevance assessment, we are still left with the question of
[Page 199 ]
grading. In fact, the question of whether or not the relevance assessment should be graded by degrees or given a binary grading does not follow automatically from our previous choices of type and nature. Yet the grading of relevance is never completely independent of type and nature.
Consider for example formal relevance. We have established that the assessment of formal relevance as performed by a matching function is absolute. Since there is no uncertainty regarding formal relevance, it might seem to be an appropriate candidate for binary grading. However, such a conclusion would be premature. It must be remembered that formal relevance as applied in a retrieval system is only an approximation of content or subjective relevance; the approximation will not be perfect and may even be quite inaccurate. As long as formal relevance is an approximation, the system should grade documents by degrees. This is appropriate even in cases where the user himself would grade documents according to a binary scale. In these situations formal relevance is only a measure of similarity and should not be mistaken for a measure of identity.
The grading of documents by the user according to content or subjective relevance is an entirely different matter. If it is assumed that the user assesses documents according to subjective utility, it is obvious that some documents are going to be more useful than others. It is doubtful, however, whether the user will be able to assign to every document a unique rank according to utility. It seems safe, at the very least, to assume that the user will not make any distinctions among the documents which are clearly irrelevant. And in most cases it also seems safe to assume that the user will in fact be unable to assign a unique rank to each relevant document. Experience has shown that humans are generally incapable of mentally making a complete comparison of more than a few items.
The user may be able to classify the relevant documents under a few headings like:
[Page 200 ]
But again it is doubtful how much use he will have of such a classification. It is also an open empirical question whether users of retrieval systems normally classify documents in this manner, or if the practice is reserved for panelists taking part in relevance experiments.
The behaviour of panelists has often been cited in support of the proposition that relevance is a matter of degree. Gebhardt (1975), for example, refers to the Joint American Bar Foundation and IBM Project (Eldridge 1968) and points out that the panelists seem to disagree as often as they agree on relevance assignments. The typical user situation, however, does not consist of a panel, but of one single user. For a given document in a given situation the user will normally be able to decide whether the document is clearly irrelevant, or whether it might be relevant. Most of the time he will probably not assign a unique rank to the document, and an attempt to do so might prove difficult.
Assuming, however, that legal documents are assessed according to their content, then relevance values corresponding to their hierarchical rank may be assigned. Thus the constitution may be given a higher relevance value than a statute, a statute may be given a higher value than a regulation, and so on. But such a scheme is neither very realistic nor, probably, very useful. The rank of a legal source is not in itself absolute. The respective rank of a supreme court decision, a statute, and a regulation, for example, depends on several factors, including how long ago the respective sources were written, how directly they affect the issue at hand, the reasonableness of the result which each document favours, and so on (cfr the discussion of the weight of arguments above at sect 2.6.1). In fact, in most areas of the law so much depends on human judgement that it does not seem practicable to implement any kind of rigid scheme for assigning relevance values to documents.
A complete ranking of documents on a utility basis, rather than on the basis of content, is not normally performed by users. One of the findings reported by Eldridge (1968) was that each panelist seemed to have his own favourite relevance group in which he tended to place documents. It is likely that the same tendency is found among ordinary users.
[Page 201 ]
If a request is complex, the user may rank documents according to the aspect of the request to which the document refers. In a criminal case, for example, the user may find a document discussing the correct legal reaction once guilt is established. It is probably appropriate, however, to regard this kind of preference as a ranking of the various aspects of the problem rather than as a ranking of retrieved documents.
The user may change his relevance assessment as he gains new insight into the nature of the problem. He may disregard documents he previously overlooked. Cfr the results reported above in sect 4.5.4 (1) on the reassessment of the relevance of documents, causing major changes. A re-evaluation of previous assessment results is probably common and illustrates the relativity of relevance. However, this fact by itself has little bearing on the appropriate relevance grading of the documents.
What we are left with as a conclusion is that the user at any one time disregards irrelevant documents. The remaining documents may be, and sometimes are, classified in a few relevance categories. But they will almost never be assigned unique rank values. Documents which at first glance seem to be of dubious relevance are usually reassessed and either disregarded or accepted as relevant. The user will not normally leave them in an uncertain state, as this is of little value to him.
The grading of content and subjective relevance must therefore, generally speaking, be regarded as binary, that is as an either/or proposition. The user can make use of a few relevance categories, but will not, as a rule, make a full ranking of the documents.
4.9.4 The relevance concept as used in this book
In this book the relevance concept has been developed in sect 1-3, cfr especially sect 2.6.1 for a discussion and a definition.
It should be clearly understood that the selected
[Page 202 ]
definition is designed to meet the special needs of the analysis of the decision and communication process. The definition is based mainly on content relevance (referring to legal meta-norms), but supplemented by subjective relevance. Consequently the concept is relative, though a core of objectivity is to be found to the extent that the legal meta-norms are well defined and not controversial. The concept is binary, based on the arguments presented above - but some of the grading has been pushed into further stages of the decision process, a grading of weights of the arguments derived from the documents.
This is a concept developed for analytical purposes, the main point being its interpretation with respect to the measurement of the performance of retrieval systems. We would like to maintain that a high performance is achieved if such documents are retrieved as the user thinks it worthwhile to consider, even if they are finally put aside as of no use. In order to arrive at such a concept, we have to be rather "generous", including as relevant also documents of dubious utility. The definition is not intended to be descriptive (a description of how relevance assessment is actually carried out by lawyers), but rather prescriptive (determining our own use of the concept in the discussion of legal information retrieval systems).
[Page 203 ]
5 RESEARCH REGARDING THE PERFORMANCE AND DESIGN OF TEXT RETRIEVAL SYSTEMS
The science of information retrieval lacks a comprehensive theoretical foundation. Part of the explanation lies probably in the fact that information retrieval, like other computer-dependent sciences, is a relatively young science. Another explanation may be found in the nature of information retrieval itself, and especially in the problems related to the concept of relevance. The main questions concerning the type, nature, and grading of relevance are central in any attempt to measure the performance of a retrieval system, but so far, there has been no accepted general way of treating them.
An explanation can also be traced back to the relatively slow progress made in the field of artificial intelligence. In a fundamental sense many of the problems of information retrieval depend on the ability to understand text, which machines cannot do - yet. And it is a sobering thought that even if we had in our possession a clever machine capable of text comprehension, we would still be left with many of the problems associated with subjective relevance. For the time being, we are not only left to deal with the problems of relevance, but also with the problems connected with retrieving documents according to purely syntactic criteria.
There are many serious practical limitations connected with empirical investigations of retrieval performance. Most of these limitations are again caused by problems connected with relevance.
In order to measure retrieval performance, the relevance of both the retrieved and the non-retrieved documents must be evaluated. Essentially this means that every document in the test data base must be read and evaluated by a juror. Ideally, only one juror should be used, since experience shows that different jurors arrive at quite divergent results,
[Page 204 ]
depending on their familiarity with the problems and their general background. Differences in the relevance assessment can probably be reduced if the jurors are instructed to look for content relevance and not subjective relevance, and if they are given guidelines for assessing content relevance. However, as long as there is some element of judgement involved, the opinion of jurors will, and should, differ. Even one single juror will normally change his assessment over time. We must remember that the assessment process is also a learning process. By reading the documents the juror gains new insight, which may cause him to reconsider some of his earlier evaluations.
The uncertainty associated with relevance assessment need not, however, be a serious obstacle to the general credibility of the results of retrieval system evaluation. Salton/Lesk (1968) report that in an experiment designed especially to test these questions, no significant relationship was found between differing relevance assessments and the evaluation results. There is a sound reason for such a result. In view of the definitions of the recall and precision ratios, it is likely that uncertainty regarding the relevance of marginal documents will, on the average, affect the numerator and denominator of each ratio proportionately and thus leave the values of the ratios unchanged. If, however, a shift in the relevance assessment is caused by the juror's misunderstanding of the question, the mistake would very likely show up as a corresponding shift in the performance figures.
Another and more serious problem connected with relevance is that, in order to get accurate values for recall, it is necessary to assess manually the relevance of all documents in the data base. This imposes severe practical limitations on the size of the data base - an unfortunate situation, since we have no reason to believe that the effect which the different system factors have on performance can be extrapolated linearly. Two different solutions to the problem suggest themselves. One solution is to estimate recall on the basis of the results from two independent searches. The other solution involves constructing general models of the retrieval system in which the size of the data base is one of the variables. A model of this type has several other advantages as well. It can be used to gain insight into how the
[Page 205 ]
factors which determine performance interact, and the relative importance of each factor. If the model is flexible in the sense that the parameters of the model can be set to reflect varying retrieval situations, it will also be possible to predict performance in extreme situations - situations which may appear in the future, but which it is out of the question to test empirically for the time being (Harvold 1976).
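One common way of estimating recall from two independent searches is a capture-recapture style estimate: if the two searches are truly independent, the size of their overlap indicates the total number of relevant documents in the data base. A sketch under that independence assumption, with invented figures:

```python
def estimated_recall(n1, n2, overlap):
    """Estimate the recall of search 1 from two independent searches.
    If the searches are independent, the total number of relevant
    documents is roughly n1 * n2 / overlap (capture-recapture), so the
    recall of search 1 reduces to overlap / n2."""
    total_relevant = n1 * n2 / overlap
    return n1 / total_relevant

# Search 1 found 40 relevant documents, search 2 found 30,
# and 20 relevant documents were found by both.
r1 = estimated_recall(40, 30, 20)
```

Here the total number of relevant documents is estimated at 60, giving search 1 an estimated recall of 2/3 - without any juror having to read the whole data base. The estimate stands or falls with the independence assumption.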
We now turn our attention to a survey of selected experimental projects in which various aspects of retrieval performance and design of text retrieval systems have been tested. We certainly do not pretend that our survey is complete. Our selection reflects both our familiarity with these projects as well as our special interest in legal retrieval systems.
It will be noted that a number of relevant studies and research efforts will not be discussed in this book. The major studies of user interaction with information systems ("user research") will not be included in the survey. We shall not comment on tools or methods for designing information retrieval systems, for instance for system analysis, user participation, etc.
[Page 206 ]
5.2 General research
5.2.1 The Aslib-Cranfield Projects: 1960-1966
The aim of the first Aslib-Cranfield project was to investigate the operational performance of four different indexing systems. The project is described by Cleverdon (1960, 1962) and by Aitchison/Cleverdon 1963.
The most important result of the project was probably not its findings - the general conclusion being that the four systems, the Universal Decimal Classification, a facet classification, an alphabetical subject catalogue, and the Uniterm system of coordinate indexing, all operated at about the same level of effectiveness. In the course of the project, however, valuable experience was gained in the testing of retrieval systems, and a methodology was developed upon which later investigations have extensively built. For the first time performance was measured by the six criteria coverage, recall, precision, response time, presentation and user effort.
These criteria have later become classical and widely used in a variety of retrieval experiments. Coverage, response time, presentation and user effort measure the quality of the operational features of the system, while recall and precision measure the quality of the individual retrieval results.
On the basis of his tests, Cleverdon suggested that a retrieval system is made up of a basic vocabulary and a number of retrieval devices. The retrieval devices are made up of recall and precision devices. Examples of recall devices are:
Examples of precision devices are:
This conceptual apparatus was used in the second
[Page 207 ]
Cranfield project to investigate retrieval devices in isolation and in all practical combinations in order to measure the effect of each device on performance. The project is documented by Cleverdon/Mills/Keen 1966 and Cleverdon 1967.
During the tests, factors affecting both indexing and search strategy were varied. The test conditions were rigidly controlled, and only one factor was varied at a time. Three main types of index languages were investigated. The first type was based on single terms selected from the authentic text of the source (eg axial, flow, compressor). The second type was based on concepts selected from the authentic text of the source (eg axial flow compressor). The third type was based on various groupings of a set of controlled terms. Search strategies used included:
The test data base consisted of 1,400 research papers mainly in the field of aerodynamics. The number of questions used was 221. The retrieved documents were ranked according to the term co-occurrence level, and a normalized recall figure was calculated for each test.
The results, as it turned out, were unexpected at the time, although they seem more reasonable today. The best indexing language turned out to be the single term authentic language, which consisted of words selected from the text of the papers. Furthermore, the best results were obtained when the endings of these terms were confounded (stemming). Slightly inferior results were obtained when synonyms were grouped or the terms used in their natural language form. Any further reduction in specificity, for example by grouping quasi synonyms or using hierarchies, resulted in reduced performance. The use of all other index languages gave inferior results. Languages based on controlled terms performed, on the average, better than concept languages, but were still inferior to the single term language.
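Stemming in this sense - confounding word endings so that different grammatical forms of a term match - can be sketched with a crude suffix-stripping rule. The suffix list is invented for illustration and is far simpler than any production stemmer:

```python
# Longest suffixes first, so "ing" is tried before "s", etc.
SUFFIXES = ("ations", "ation", "ings", "ing", "ies", "es", "s", "ed")

def stem(word):
    """Confound word endings by stripping the longest matching suffix,
    keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# "indexing" and "indexes" both reduce to "index", so documents using
# either form are retrieved by the same single-term entry.
```

The effect is exactly a reduction in specificity of the single-term language - small enough, on the Cranfield evidence, to gain recall without the loss of precision that broader groupings cause.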
The results seem to indicate that the best results are obtained when preprocessing, in the way of standardizing the documents, is kept to a minimum.
[Page 208 ]
The only preprocessing of the documents which will improve performance is the grouping of context-independent synonyms - the so-called true synonyms. Grouping of context-dependent synonyms (quasi synonyms), or any other form of standardization like the use of controlled vocabularies, hierarchies, or concepts, will lead to inferior results, presumably because standardization at this stage deletes information from the documents rather than adding to them. It is the context-dependent synonyms that are the main cause of trouble. By definition, however, these cannot be grouped once and for all; if information is not to be lost, they can be grouped only in the light of the search request itself.
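The distinction may be illustrated in modern terms by a small Python sketch; the synonym table and the sample sentence are invented for illustration and are not taken from the Cranfield material:

```python
# Minimal sketch of indexing-time synonym conflation, with invented data.
# True (context-independent) synonyms are mapped to one indexing form at
# indexing time; quasi synonyms are left alone, to be grouped (if at all)
# only when a particular search request makes the grouping safe.

TRUE_SYNONYMS = {          # hypothetical context-independent pairs
    "aeroplane": "aircraft",
    "airplane": "aircraft",
}

def index_terms(text):
    """Lowercase, split, and conflate true synonyms only."""
    return [TRUE_SYNONYMS.get(w, w) for w in text.lower().split()]

doc = "The Airplane used an axial flow compressor"
print(index_terms(doc))
# → ['the', 'aircraft', 'used', 'an', 'axial', 'flow', 'compressor']
```

Context-dependent pairs (say, "ship" and "vessel", which coincide only in some contexts) deliberately do not appear in the table; grouping them here would delete information from the documents.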
5.2.2 The SMART project: 1964-1983
The results of the Cranfield tests strengthened the belief in automatic indexing as an alternative to intellectual indexing. Though the tests had been based on intellectual indexing, they had proved that the best indexing language was also the simplest one - selecting the indexing terms from the authentic language of the sources. Application of advanced linguistic tools did not yield the expected results - they actually proved quite disappointing, with the exception of thesauri for context-independent synonyms. The results were somewhat unexpected and very challenging, strengthening the case for automatic indexing.
Professor Gerard Salton launched a major research program in automatic indexing in the early 1960s. At that time he was at Harvard University, but in 1965 both he and the research program moved to Cornell University. The experiments were carried out using the experimental retrieval system SMART under Salton's direction. It is perhaps the best known and most comprehensive research program in text retrieval ever carried out. Its rationale is the shortcomings of the conventional retrieval methodology based on Boolean search requests and inverted file methods.
SMART is known for, among other things, its unique
[Page 209 ]
file structure where the documents are represented by vectors denoting which terms the documents contain and the weight of the term (its "importance") within the document. Similarity between documents and the search request is calculated by a similarity quotient, and the calculated values are used to rank the retrieved documents (cfr above on vector based retrieval at sect 4.5.2 (3)).
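The vector representation and similarity quotient described above can be sketched as follows; the tiny document collection, the term weights, and the choice of the cosine measure as the similarity quotient are illustrative assumptions:

```python
import math

# Hedged sketch of a SMART-style file structure: each document is a
# vector of term weights, and retrieved documents are ranked by a
# similarity quotient (here the cosine measure). Data are invented.

docs = {
    "d1": {"axial": 2.0, "flow": 1.0, "compressor": 3.0},
    "d2": {"boundary": 2.0, "layer": 2.0, "flow": 1.0},
}

def cosine(q, d):
    """Cosine similarity between two sparse weight vectors (dicts)."""
    common = set(q) & set(d)
    num = sum(q[t] * d[t] for t in common)
    den = math.sqrt(sum(w * w for w in q.values())) * \
          math.sqrt(sum(w * w for w in d.values()))
    return num / den if den else 0.0

query = {"flow": 1.0, "compressor": 1.0}
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked)  # → ['d1', 'd2']: d1 shares more weighted terms with the query
```

The point of the structure is that ranking falls out of the representation itself, with no inverted file in the conventional sense.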
The experiments with SMART are documented in a number of publications. Salton/McGill 1983 contains a recent survey of these.
In the first version of SMART, different methods for automatic indexing were tried out. The system was aided by sophisticated linguistic tools like synonym thesauri, hierarchical term structures, statistical and syntactic techniques for the construction of phrases, and a semantic language analysis. Different algorithms for weighting the terms were also included.
The methods were tested in controlled experiments in text retrieval, and the results presented as recall-precision graphs. Altogether four different data bases on three different subjects were used: medicine, computer science and documentation.
The results verified the conclusions of the Cranfield tests. The use of sophisticated tools did not contribute to increased retrieval performance with the exception of the synonym thesaurus. The results may be briefly summarized as follows:
[Page 210 ]
The experiments also proved that abstracts as a document form gave better performance than just the titles. Full summaries (2.000 words) performed better than abstracts (150 words), and abstracts performed considerably better than titles only.
The results were very interesting, and the experiments using SMART continued. The linguistic tools were now discarded, and attention was paid to simple and automatic means which might improve retrieval performance.
This does not imply that the idea of the more sophisticated tools was dismissed as uninteresting for text retrieval, merely that it was difficult to find methods adequate for any kind of search request without excessive resource demands. Even with the progress made since then, it is difficult to find satisfactory methods with respect to the use of resources.
In further experiments, intellectually developed synonym thesauri were discarded as well. It may actually be questioned whether the improved performance obtained with such thesauri in the early experiments opened up a promising avenue for further development. For large, heterogeneous data bases with a high growth rate, the development and maintenance of such thesauri would prove very expensive.
The linguistic tools were replaced by procedures for automatic generation of thesaurus classes and for the construction of phrases. The phrases were constructed around high-frequency terms (at least one of the words of the phrase had to have a high frequency) within the same sentence. The quality of a phrase was determined by the distance between the words and the frequency of the phrase itself. The thesaurus classes were composed of words of rather low frequency occurring in the same document. In both cases the objective was to compose "search units" of average frequency. Experiments with
[Page 211 ]
weighting of terms had demonstrated that the best search terms were those of average document frequency (the number of documents in which the term occurs) or positive discrimination value. Terms with a high document frequency were often too general as search terms, and would consequently reduce precision. On the other hand, terms with a low document frequency were often too specific as search terms, and would have little effect on retrieval performance.
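The phrase-construction heuristic described above - pairs of words co-occurring in a sentence, at least one of high frequency, scored by proximity - might be sketched as follows; the frequency threshold and the sample sentences are invented:

```python
from collections import Counter
from itertools import combinations

# Hedged sketch of phrase construction by sentence co-occurrence.
# A candidate phrase is a pair of words from one sentence where at
# least one member has high collection frequency; closer pairs score
# higher. Threshold and texts are invented for illustration.

sentences = [
    "axial flow compressor design",
    "flow separation in the compressor",
    "boundary layer flow",
]
freq = Counter(w for s in sentences for w in s.split())
HIGH = 2  # hypothetical high-frequency threshold

def candidate_phrases(sentence):
    words = sentence.split()
    pairs = []
    for i, j in combinations(range(len(words)), 2):
        a, b = words[i], words[j]
        if freq[a] >= HIGH or freq[b] >= HIGH:
            pairs.append(((a, b), 1.0 / (j - i)))  # closer => higher score
    return sorted(pairs, key=lambda p: -p[1])

print(candidate_phrases("axial flow compressor design")[0])
# → (('axial', 'flow'), 1.0)
```

A fuller implementation would also weigh the frequency of the resulting phrase itself, as the text indicates.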
Experiments proved that both the automatic generation of phrases and the thesaurus classes had a positive effect on retrieval performance. As a conclusion of the work on automatic indexing, the following procedure was indicated:
Further improvement is possible by utilizing the dialog between user and system. SMART has been employed to reformulate the search request automatically on the basis of the user's relevance assessments. This "relevance feedback" technique is based on the user assessing the relevance of the first documents - for instance the first 10 documents - on the ranked result list. The result is then analysed by the system, which attempts to reformulate the search request in order to make it more "similar" to the relevant documents. For instance the request is
[Page 212 ]
supplemented by terms occurring only in the relevant documents, or the request is modified by deletion of terms occurring only in the irrelevant documents. The weights of the search terms may also be amended, as may the weights of the terms in the document vectors.
Experiments with relevance feedback have shown positive results. On average, precision was improved by 50 per cent for high recall retrieval, and by 20 per cent for low recall retrieval.
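The reformulation step described above is close to what has become known as the Rocchio update associated with SMART: the request vector is moved towards the assessed-relevant documents and away from the irrelevant ones. A hedged sketch, with invented coefficients and data:

```python
# Hedged sketch of relevance feedback in the Rocchio style. The
# coefficients (alpha, beta, gamma) and the vectors are illustrative,
# not SMART's actual parameters.

def feedback(query, relevant, irrelevant, alpha=1.0, beta=0.5, gamma=0.25):
    terms = set(query)
    for d in relevant + irrelevant:
        terms |= set(d)
    new_q = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if irrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in irrelevant) / len(irrelevant)
        if w > 0:                     # drop terms pushed to zero or below
            new_q[t] = w
    return new_q

q = {"compressor": 1.0}
rel = [{"compressor": 1.0, "axial": 1.0}]    # judged relevant by the user
irr = [{"compressor": 0.5, "boundary": 1.0}] # judged irrelevant
print(feedback(q, rel, irr))
```

Note how "axial" enters the request from the relevant document while "boundary", occurring only in the irrelevant one, is excluded - exactly the two modifications the text describes.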
To assess the methods for automatic indexing against intellectual indexing, SMART was compared to the intellectual methods employed by MEDLARS at the National Library of Medicine. In this information system both documents and search requests are manually prepared for retrieval.
The experiment compared retrieval performance in MEDLARS and SMART on the same data. In SMART the documents were automatically indexed and the original search requests were used (Salton/McGill 1983:103-105). The experiments proved that SMART indexing using an intellectually developed thesaurus gave retrieval performance on the same level as intellectual indexing. When the SMART indexing was supplemented with methods like relevance feedback, retrieval performance increased by some 30 per cent (Salton 1981:322).
SMART has also been used in several other experiments, for instance in automatic document classification (clustering). The objective of this technique is to group documents treating the same topic. In this way, retrieval may be limited to one or several of these topics.
Several methods for clustering have been tested. The procedure is similar to that of vector retrieval, but in this case every document is compared to all the other documents. Documents with a sufficient similarity are grouped in the same cluster. A document may be a member of several clusters, and it is possible to establish a hierarchy of clusters by splitting one cluster into smaller clusters, etc. For each cluster a representative is constructed, called the "centroid document". In retrieval, the search request is first compared to these representatives, and afterwards to the documents within the clusters selected on the
[Page 213 ]
basis of the similarity between the request and the centroid document.
Clustering has yet to be tested on really large data bases, as the construction of clusters demands quite large resources. However, experiments have been conducted on smaller data bases, but the indications do not promise improved retrieval performance. Even with respect to extremely small data bases it has proved difficult to achieve high recall values. But the method may lead to shorter response time and lower costs. Clusters are also easily updated, as new documents are treated as search requests.
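A minimal sketch of centroid-based clustering in the spirit described above; the similarity measure, the threshold and the documents are invented, and the single-pass procedure with a crude centroid update is a deliberate simplification of the all-pairs comparison in the text:

```python
# Hedged sketch of document clustering with centroid representatives.
# A document joins the first cluster whose centroid is sufficiently
# similar, else it starts a new cluster. Term-overlap similarity,
# threshold and data are invented for illustration.

def overlap(a, b):
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

def cluster(docs, threshold=0.5):
    clusters = []   # each cluster: {"centroid": set, "members": list}
    for name, terms in docs.items():
        for c in clusters:
            if overlap(terms, c["centroid"]) >= threshold:
                c["members"].append(name)
                c["centroid"] |= terms     # crude centroid update
                break
        else:
            clusters.append({"centroid": set(terms), "members": [name]})
    return clusters

docs = {
    "d1": {"axial", "flow", "compressor"},
    "d2": {"flow", "compressor", "blade"},
    "d3": {"contract", "liability"},
}
for c in cluster(docs):
    print(sorted(c["members"]))
# → ['d1', 'd2'] then ['d3']
```

In retrieval, a request would first be matched against the centroids only, and then against the members of the selected clusters; a new document is clustered by treating it exactly like a request, which is why updating is cheap.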
5.2.3 The MEDLARS evaluation: 1966-1967
The MEDLARS evaluation represented something new in retrieval experimentation. For the first time the analysis of retrieval performance was broadened to include an analysis of the causes of retrieval failure. This is a kind of hindsight analysis involving manual examination of
The analysis is time-consuming, but valuable, as it tells us not only to what extent a search failed, but also why it failed. The MEDLARS evaluation is described by Lancaster 1968b and 1969.
The MEDLARS data base consisted of more than 800.000 citations from biomedical journal articles prior to January 1964 and all subsequent issues of the monthly Index Medicus. The data base was too big for the answer set to be found by manual means. Recall was estimated by obtaining two independent samples of relevant documents. One sample was obtained by MEDLARS, another sample was obtained by other means, for example through a local librarian, through the professional knowledge of the searcher or his colleagues, or in some other way independent of MEDLARS.
[Page 214 ]
A total of 302 searches were completely analysed, and recall and precision ratios were obtained for 299 (3 questions had no answer set). Overall recall and precision ratios were 50.4 and 57.7 per cent, and there were 797 cases of recall failure (relevant documents that were not retrieved).
The failures were analysed in terms of
The terms apply both to the indexing of a source and to the formulation of a search request. Exhaustivity means the extent to which all the concepts in the source (or question) are represented. Specificity means the generic level at which a concept is represented. The entry vocabulary refers to the text form in which the document is formulated. The entry vocabulary may be the title, an abstract, or the authentic text of the source. In the case of a text retrieval system, the entry vocabulary will be identical to the indexing terms.
A detailed account of all the results will not be given here. It is interesting to note, however, that system dependent factors - the index (including the inadequacy of the indexing language) and the user-system interaction - accounted for 74 per cent of total recall failures and 65.6 per cent of total precision failures. Searching factors accounted for 35 per cent of total recall failures and 32 per cent of total precision failures. The totals add up to more than 100 per cent because the same failure may be caused by more than one factor. For a more detailed account of the failures, see the tables in figures 5/1 and 5/2.
The evaluation results are especially interesting in the light of the retrieval systems that are available today. Had MEDLARS been an on-line text retrieval system, most of the failures caused by the system dependent factors would have been avoided. Performance would have been significantly improved. In fact, if we use the average performance figures given for the evaluation, recall would increase from about 50 to about 85 per cent and precision would increase
[Page 215 ]
from about 60 to about 75 per cent - a result which would have been surprisingly good.
Fig 5/1 - Reasons for recall failures in the MEDLARS evaluation (Lancaster 1969)
Fig 5/2 - Reasons for precision failures in the MEDLARS evaluation (Lancaster 1969)
[Page 216 ]
5.2.4 The "Comparative Systems Laboratory Experiments" Project: 1963- 1968
At Case Western Reserve University a rather large project was undertaken in the mid-1960s to investigate the relationship between the variable components of retrieval systems and performance. The project is documented by Saracevic et al (1968) and by Saracevic (1970).
The components of a retrieval system were described in terms of the purpose and function of the system. The purpose of a retrieval system was subdivided into
while the function of the system was subdivided into
The data base used in the experiment consisted of 600 documents selected from the 1960 volume of Tropical Diseases Bulletin (indexed in five languages). On the basis of 124 questions, 4.448 search requests were submitted for searching. Answer sets to the questions were established by asking the users to evaluate the retrieved documents. The non-retrieved documents were evaluated by a separate expert, who tried to interpolate the relevance judgments of the users. It turned out that of the 124 questions, only 63 had relevant answers.
"Sensitivity" and "specificity" were used to measure performance. Sensitivity (Se) was defined as identical to recall, while specificity (Sp) was defined as the ratio of the number of non-relevant unretrieved documents to the total number of non-relevant documents in the data base. Effectiveness (E) was defined as: E = Se + Sp - 1.
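The three measures can be computed directly from the counts of relevant and non-relevant documents; the counts below are invented for illustration and are not taken from the experiment:

```python
# The measures defined above, computed from invented counts:
#   Se (sensitivity) = relevant retrieved / all relevant   (i.e. recall)
#   Sp (specificity) = non-relevant NOT retrieved / all non-relevant
#   E  (effectiveness) = Se + Sp - 1

def effectiveness(rel_ret, rel_total, nonrel_notret, nonrel_total):
    se = rel_ret / rel_total
    sp = nonrel_notret / nonrel_total
    return se, sp, se + sp - 1

# Invented example: a 600-document collection with 20 relevant documents;
# a search retrieves 15 of the relevant and 40 of the 580 non-relevant.
se, sp, e = effectiveness(15, 20, 580 - 40, 580)
print(round(se, 3), round(sp, 3), round(e, 3))
```

E is zero for a search no better than chance and one for a perfect search, which is what makes it usable as a single effectiveness figure.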
[Page 217 ]
A main purpose of the experiment was to investigate the relative effectiveness of various indexing languages. However, the effectiveness of the indexing languages compared to authentic text was not investigated. Of greater interest to us is therefore the analysis of different search strategies. The tests included the use of two types of search requests:
The search request could be expanded by use of a thesaurus or by use of any other available source. Use of the thesaurus did not prove as effective as manual elaboration of the request. One of the most important findings, however, showed that it was nearly impossible by any means to expand the narrow requests to the extent where all relevant documents could be found. It was not until all but one category were dropped (broad search) that most relevant documents were found, but then at the expense of a considerable drop in precision. These results correspond well to the expected behaviour of text retrieval systems (Harvold 1976).
Rather more unexpected was the observation that, when a narrow request was expanded, an almost linear relationship was found to exist between the total output and the numbers of relevant and non-relevant answers.
The experiment included a test on relevance judgment based on different formats of output. The formats used were titles, abstracts and the authentic text. The results are cited in fig 5/3.
Fig 5/3 - Relevance judgments based on different formats
[Page 218 ]
The results for the authentic text must be considered the "correct" values. We note that the judgments based on abstracts or even titles are good approximations of the relevance assessment made on the full text.
It is also interesting to note that the relevance judgments based on titles or abstracts were superior to the performance of the retrieval system. Below we have calculated recall and precision in both situations.
Table 5/4 - Performance of retrieval compared to
[Page 219 ]
5.3 Development and design of text retrieval systems
5.3.1 The CONDOR project: 1973-1981
In 1973, the research department of Siemens (Germany) initiated a major project in information retrieval known by the acronym CONDOR (Communication in Natuerlicher Sprache mit Dialog-Orientierten Retrieval-Systeme). The project was oriented towards basic research, but the objective was also to develop a system that would replace the Siemens text retrieval system GOLEM, used for instance by the JURIS (Germany) system.
The system was never developed beyond an experimental version, and the project was terminated in 1981. At that time, approximately 50 persons (among them 10 linguists) were working within the project organization. The project included a number of different activities, like automatic analysis of natural language, different retrieval techniques, general system development techniques and multiple processor retrieval. During its last years, the project was supported by the Bundesministerium fuer Forschung und Technologie.
The project is documented in, for instance, Banerjee 1977 and Fischer 1981. In this section, we shall give a summary description of the experimental system CONDOR.
CONDOR was a system for retrieval of structured as well as unstructured information. One of the objectives was to develop a linguistic method which would make the system itself identify which part of the search request should be matched against the text of the document, and which part should be matched against the formatted part of the document. The system was based on natural language search requests, but also permitted, for instance, Boolean requests.
The text retrieval module of CONDOR was based on an associative retrieval strategy. This implies that the system, in contrast to traditional text retrieval systems, did not contain an inverted file in the usual sense, but rather a hierarchy of descriptors. The hierarchy was constructed on the basis of a linguistic analysis of the documents (each sentence
[Page 220 ]
was analysed) where the terms (word stems) and their linguistic weights were determined. The linguistic weight was calculated using word length, word class and syntactical function of the word within the sentence.
The terms were clustered in "priority classes" based on their frequency in the data base - high frequency terms constituting the first priority class, etc. Within each priority class, terms occurring in the same document were grouped. In this way a document might be a member of several of the clusters constructed. The clustering was primarily made on the basis of statistical data, but in borderline cases the linguistic weights were also used. In the last phase of the construction of the term hierarchy, relations were established between clusters containing terms associated with identical documents.
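The banding of terms into priority classes by collection frequency might be sketched as follows; the band boundaries and the miniature document set are invented for illustration:

```python
from collections import Counter

# Hedged sketch of the "priority class" idea: terms are banded by
# collection frequency, with high-frequency terms in the first class.
# Band boundaries and the (German-flavoured) sample data are invented.

docs = {
    "d1": ["vertrag", "haftung", "vertrag"],
    "d2": ["vertrag", "kauf"],
    "d3": ["haftung", "delikt"],
}
freq = Counter(t for terms in docs.values() for t in terms)

def priority_classes(freq, bands=(3, 2, 1)):
    """Band terms by frequency; each entry of bands is a class's floor."""
    classes = []
    assigned = set()
    for floor in bands:
        cls = {t for t, n in freq.items() if n >= floor and t not in assigned}
        assigned |= cls
        classes.append(cls)
    return classes

print(priority_classes(freq))
```

Within each class, CONDOR would then group terms occurring in the same documents; that second step is omitted here.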
In retrieval, the search request was subjected to the same analysis as the documents. The words were reduced to their basic forms, word classes were determined, compound words were split, and the different parts of the sentence were identified. The linguistic analysis was based on morphological criteria and on a list of function words and irregular inflections. It was presumed that the parts of the sentence and the word classes were correctly identified in 99 per cent of the cases.
The retrieval proper exploited the term hierarchy. Through a dialog with the system, the user was confronted with the relevant part of the hierarchy. Together, user and system would try to find the most relevant set of documents. For instance, the user might be given a list of "related terms", ie terms occurring in the same clusters as the search terms, and himself decide which of these to include in the request.
Ranking of retrieved documents might be based either on traditional word frequencies or on the linguistic weights of the terms. In addition the user was allowed to allocate weights to the terms himself.
The system also made possible a "fine retrieval" within the retrieved documents, in order to identify the pertinent sections of the text. In this phase one might require, for instance, that the words in the text should have the same syntactic functions as the words
[Page 221 ]
in the search request.
These techniques were implemented in the experimental version of CONDOR. It is, however, uncertain what effect they had on retrieval performance. Very few experiments were conducted, and we are not aware of any documentation of these experiments. With reference to the results already documented for advanced linguistic tools in Cranfield II and SMART (cfr above at sect 5.2.1 and 5.2.2), and the costs associated with analysis of the documents, establishment of the hierarchical network etc, we remain rather sceptical. We also note the lack of a plan for updating the data base (including the hierarchy of terms).
In spite of these points of criticism, CONDOR was a very interesting and even fascinating project, representing an important attempt to innovate the design of text retrieval systems.
5.3.2 The SPIRIT system
In France, considerable attention has been paid to advanced text retrieval employing sophisticated linguistic methods and a data base of documents in authentic language. In the early 1980s, research carried out at the institution Systex resulted in the text retrieval system SPIRIT (Systeme syntaxique et probabiliste d'indexation et de recherche d'informations textuelles). The system is fully developed. We are not, however, aware of any actual operative use of the system, either on its own or - as has been implied - as an enhancement of another standard French retrieval system, MISTRAL. SPIRIT has, however, been used in research environments and has also been tested on a legal data base (Andreewsky/Binquet/Debili/Fluhr/Puderoux 1981:11-12). The system is implemented on the mini computer Avion as well as on mainframes from Cii Honeywell Bull.
The search requests are always formulated in natural language. A request may even be defined as one of the documents of the data base. In contrast to, for instance, CONDOR, SPIRIT excludes the possibility of constructing requests by way of Boolean operators. Natural language requests, it is argued,
[Page 222 ]
require less user effort, as one such request may replace two or three Boolean requests. Retrieval will therefore be faster and the costs per document lower. This argument is not necessarily complete; costs are also related, for instance, to the length of the request and to the processing involved, which will hardly favour natural language requests.
Retrieval is based on the similarity between the search request and the documents. The measure for similarity takes into account word frequencies and weights, but also whether the linguistic relations between the words of the request correspond to the relations of the words in the document.
The documents are ranked according to this similarity measure. As part of the results, the user will have a specification of which Boolean argument is satisfied by the retrieved documents.
Both the documents and the search requests are analysed by the same linguistic procedure. This may be briefly described:
(1) The text is segmented into words and transformed into a normalized typography. Ambiguous operators like "." in numbers and abbreviations are interpreted and given an unambiguous representation.
(2) A morphological analysis is carried out, checking the words against a linguistic thesaurus of some 200.000 entries. The thesaurus is based on 35-40.000 basic forms and gives information on the different inflexions of the words and on words losing their accentuation. The analysis checks for spelling errors, and if such are identified, they are either automatically corrected or flagged for manual correction. The words are also associated with additional syntactic and semantic information to be utilized in the later stages of the analysis.
(3) Using a dictionary, idiomatic phrases are identified. The objective is to preserve important phrases whose content cannot be derived from the individual words of the phrase.
[Page 223 ]
(4) A syntactic analysis is carried out to identify some homonyms and to establish linguistic links between the words. An example of a homonym of this type is the word "still", which may be an adverb as well as a noun.
The first part of the analysis is carried out by a self-instructing program. A short text is first analysed manually, and the results are given to the program. The system establishes simple rules for the identification of word classes and of syntactic relations between the words, which form the basis of the automatic analysis. These are tested on a new text, and possible errors are corrected, resulting in an amendment of the rules. This process is continued until the results are satisfactory. The approach has obvious advantages because the program may be adapted to most languages; it has already been tested on English, Russian, Arabic and possibly Spanish and French.
The morphological and syntactical analysis in SPIRIT is claimed to be exceptionally fast, illustrated by the example of an IBM 370/168 processor using only 10-12 minutes for the analysis of 1 million text words.
(5) On the basis of certain grammatical criteria, words which are not appropriate as search terms (noise words) are deleted, and the lemmata of the remaining words are identified.
(6) Finally, the words (or rather the lemmata) are allocated a weight based on a statistical analysis, using an algorithm which takes the information value of the words into consideration (Bayes' formula). The algorithm distinguishes between separate words and compound words (IR-meddelanden 1981:31).
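The text does not reproduce SPIRIT's Bayes-based weighting formula. As a hedged stand-in, the sketch below uses the familiar inverse document frequency, which likewise assigns rarer lemmata the higher weights; the lemmata and documents are invented:

```python
import math

# Hedged sketch of information-value weighting. This is NOT SPIRIT's
# actual Bayes-based algorithm (not given in the text); inverse
# document frequency is used as a stand-in with the same tendency:
# the rarer a lemma, the higher its weight.

def idf_weights(docs):
    n = len(docs)
    lemmata = {t for d in docs for t in d}
    return {t: math.log(n / sum(1 for d in docs if t in d)) for t in lemmata}

docs = [{"contrat", "vente"}, {"contrat", "bail"}, {"contrat", "vente", "prix"}]
w = idf_weights(docs)
print(w["contrat"] < w["vente"] < w["prix"])  # → True: rarer => heavier
```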
It may be mentioned that the system identifies negations and other restrictions in the text. This implies that the search request may contain words like "not", "except" etc which will cause the exclusion of undesired documents otherwise included in the answer set.
[Page 224 ]
Several experiments have been conducted with SPIRIT. None of these have, to our knowledge, been controlled experiments in text retrieval. It has been observed that the system performs satisfactorily, but on such a basis it is difficult to compare this retrieval strategy with, for instance, a Boolean strategy or a simple word frequency ranking.
[Page 225 ]
5.4 Research regarding legal systems
5.4.1 The joint ABF/IBM project: 1966-1967
The joint American Bar Foundation and International Business Machines project became known not so much for its aim, which was to investigate the degree of satisfaction (as judged by a panel) that could be achieved by the use of a computerized retrieval system, as for its analysis of the differences in the assessments of the panel. The project made use of a vector type retrieval system developed by SF Dennis of IBM. The results of the project are described by Eldridge 1968.
The data base consisted of 5.800 appellate court decisions. The question set consisted of 40 questions taken from the files of practising lawyers. The data base was searched both by the retrieval system and by hand at the American Bar Foundation and in the legal department of IBM. Both answer sets were submitted to a panel of four lawyers for evaluation.
It was found that the retrieval system and the manual search had performed about equally well in terms of recall, and that the manual search was about twice as effective in terms of precision. However, a far more interesting and perhaps surprising result of the investigation was the intensity of disagreement between the four panelists. The panelists were instructed to read the questions and evaluate each retrieved document according to a four-point scale of relevance. The documents were to be assessed according to the contribution they made to the resolution of the issue raised in the question. Thus in effect the panelists were asked to evaluate the documents according to content relevance, not subjective relevance. Even so, the panelists disagreed more often than they agreed. Of a total answer set of 706 documents, the panelists gave only 3 per cent of the documents a unanimous relevant vote (either "on point", "relevant" or "related"), while 31.3 per cent of the documents received a unanimous irrelevant vote. A total of 65.7 per cent of the documents received a mixed vote. The
[Page 226 ]
disagreement, however, turned out to be rather systematic in the sense that each panelist seemed to prefer a certain grade - the academicians in the panel generally preferred the low relevancy grades while the practitioners favoured the high ones.
As an explanation of these results, the report suggests that the disagreement might reflect the fact that the questions were prepared by a practitioner. As a consequence the issues might have been more familiar to the practitioners in the panel than to the academicians. Other explanations are possible. The experiment seems to emphasize the subjective nature of relevance. In addition, the experimental framework was not "life-like", but was distorted both by the fact that the different functions of the retrieval system were performed by different people, and by the fact that a relevance scale of four grades was used. As Eldridge himself points out, humans normally have difficulties in making comparative evaluations involving more than three or four documents. And in a practical retrieval situation the user probably has little need to make relevance distinctions beyond rejecting some documents as irrelevant and accepting others as of some use.
5.4.2 The Oxford experiments: 1963-1965, 1976-1982
The Oxford experiments represented one of the first large-scale attempts at evaluating the performance possibilities of full-text retrieval, and they certainly constituted the first experiment making use of a data base of legal documents. The first phase of the Oxford experiments is described by Tapper (1969, 1973:159-182).
The aim of the experiments was to measure the efficiency of computerized legal information retrieval as compared with the conventional technique of index look up.
Two relatively large data bases were prepared for the purpose of the experiment. The first was a general series of reports of decisions in the High Court, the All England Law Reports. The second was a series of administrative decisions in the field of insurance
[Page 227 ]
claims for industrial injuries, the Commissioner's Decisions. The two data bases consisted of about two million and one million words respectively. The data bases were chosen largely because both of them could be accessed through manually constructed indexes. The High Court decisions (called cases for short) were indexed both for the series and for the individual volumes. The index terms were taken from an introductory telegraphic abstract of each report. The administrative decisions (called decisions for short) were indexed in a general loose-leaf file. The index was detailed and thoroughly cross-referenced, reproducing a high proportion of the original headnotes. The index was oriented towards factual descriptions, in contrast to the case index, which was oriented toward legal terms.
Still another factor concerning the experimental setup should be mentioned. The manual searchers were restricted to the use of the indexes. They were not allowed to browse through or examine the documents themselves, because it was felt that they would then have an unfair advantage over the searchers using the machine.
The results were evaluated in terms of recall and precision. The questions were based on facts selected from representative and recent reports. The answer set was defined by selecting:
Thus no attempt was made to find the complete answer set consisting of all the relevant documents. However, based on the assumption that the computer and conventional techniques were independent, a value for the size of the complete answer set was estimated on the basis of the intersection of the two answer subsets.
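The estimate described above is the classical capture-recapture (Lincoln-Petersen) calculation: under the independence assumption, the total is estimated from the sizes of the two subsets and their overlap. A sketch with invented counts:

```python
# Hedged sketch of the answer-set estimate: given two independently
# obtained subsets of the relevant documents, the size of the complete
# answer set is estimated from their overlap (capture-recapture).
# The example counts are invented.

def estimate_answer_set(found_by_computer, found_by_hand):
    a, b = set(found_by_computer), set(found_by_hand)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("no overlap: independence estimate undefined")
    return len(a) * len(b) / overlap

# Invented example: 12 documents found by machine, 8 by hand, 4 in common.
print(estimate_answer_set(range(12), range(8, 16)))  # → 24.0
```

Intuitively: if the hand search recovered 4 of the machine's 12 documents, it presumably recovered the same third of the whole answer set, so the set holds about 8 / (1/3) = 24 documents.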
The main results of the experiment are summarized below.
[Page 228 ]
Fig 5/5 - Performance of conventional and computerized retrieval (r = recall, p = precision)
We note that with respect to recall the computer techniques performed significantly better than the conventional techniques. In fact the differences in values are quite remarkable. This can be seen as yet another confirmation of the superiority of authentic text representation compared to indexing, even when the indexing is quite elaborate and thorough.
As expected, the computer technique produced inferior precision values, but it is worth noting that the range of the values, from about 20 to 40 per cent, is not at all unmanageable in an on-line environment.
The second phase of the Oxford experiments originates in Tapper's interest in citation patterns. The common law practice of citing precedents creates a citation pattern around each case: the case will cite former cases and in turn be cited by later cases. Tapper's hypothesis was that these patterns were quite distinctive and well suited to retrieving similar cases.
The experiments took place in part at Stanford, California, and in part at the Norwegian Research Center for Computers and Law, Oslo. They are reported for instance in Tapper 1976 and 1982.
The experiments used data bases of English and US case law, of both a specific and a general nature. Each case was described by a string of citations, comprising both the cases cited by and the cases citing the described case.
Comparisons were based on vector retrieval - in Oslo, the VEXT extension to the NOVA*STATUS text retrieval program was used. Part of the interest in the experiments concerned arriving at an appropriate
[Page 229 ]
similarity function. This was done by "fiddling" with the standard cosine function in order to assign special weight to features of the citation patterns. If, for instance, two cases cited another case decided by a lower court, this was taken as an indication of higher similarity than if the two cases cited another case decided by a higher court. The increased performance of what has become known as the "Oxford function" compared to the cosine function is quite remarkable.
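The Oxford function itself is not reproduced here, but the kind of adjustment described can be sketched as a cosine similarity over citation sets in which each shared citation carries a weight, for instance a larger weight when the cited case was decided by a lower court. All data and weights below are invented for illustration:

```python
import math

def weighted_cosine(cites_a, cites_b, weight):
    """Cosine similarity between two citation sets, with a per-citation
    weight function. A citation present in a set contributes
    sqrt(weight) to that vector component, so shared citations
    contribute their full weight to the dot product."""
    shared = cites_a & cites_b
    dot = sum(weight(c) for c in shared)
    norm_a = math.sqrt(sum(weight(c) for c in cites_a))
    norm_b = math.sqrt(sum(weight(c) for c in cites_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical weighting: a citation to a low-court decision counts double.
level = {"X": "low", "Y": "high", "Z": "high"}
w = lambda c: 2.0 if level[c] == "low" else 1.0
print(round(weighted_cosine({"X", "Y"}, {"X", "Z"}, w), 3))  # 0.667
```

The shared low-court citation X dominates the score; with unit weights the same pair would score only 0.5.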
Results were compared in a number of ways, for instance taking text book examples and comparing the cases retrieved with the citations associated with the relevant passage in the text book.
The citation patterns were also used for clustering, which again was compared to the "clustering" represented by the index to the volumes of case reports.
The use of citations and their interpretation has yet to find its place in legal information systems, but the experiments proved that citation patterns were effective in identifying and grouping together common law cases having significant points of similarity.
5.4.3 The Responsa project: 1967-1983
The Responsa project was initiated in 1967 as a cooperative effort between the Weizmann Institute of Science and the Bar-Ilan University in Israel. Over the years activities have been transferred to the Bar-Ilan University, and since 1975 the project has been located at the Institute for Information Retrieval and Computational Linguistics at this university.
The objective of the project is to make available the enormous Jewish responsa literature for text retrieval. The literature consists of questions and answers from the 8th century to our own days, and originates from 50 different countries. It was written by approximately 1.000 different authors and contains information of importance for several aspects of Jewish and Humanities studies. At the end of 1982, 210 collections containing 45.000 documents (approximately 45 million text words) had been converted and made retrievable.
[Page 230 ]
A survey of the project is given for instance by Choueka 1980.
The project has developed its own text retrieval system and most of the research is related to the design of this system. The project has excellent possibilities of applying research results to a real user environment, as it includes research, development of the retrieval software, establishment of the data base and utilizing the resulting system for research in the responsa literature.
A number of the experiments are of a linguistic nature. The language of the documents is mainly Hebrew with some Aramaic, both of which have a complex structure making the text processing extremely difficult. The morphology is extensive and complex. The form of a word is determined by the context, and may contain several suffixes and prefixes. The suffixes determine the inflexion of the word and contain information on gender, number, tense, mode etc, as well as attached pronouns. The factors are numerous, and a substantive may have several thousand different inflexions. The prefixes determine which prepositions are associated with the word. As there are as many as 100 prepositions, the linguistic variants of a word may indeed be numerous. It is illustrative that a modern and comprehensive Hebrew dictionary may have 35.000 entries (based on 3.000-5.000 different roots), while the total number of different words exceeds 100 million (Choueka 1980:162).
In addition, Hebrew is in general written without vowels, and on average a Hebrew word may be interpreted in 4 different ways. This creates a large homograph problem in text retrieval (Choueka 1980:163).
In addition, Hebrew and Aramaic word forms and rules are often mixed in the text. The texts also contain numerous abbreviations and acronyms; in some instances as much as half the text is made up of such word forms.
In the early 1970s the program system KEDMA was developed, intended to take care of the linguistic problems related to text processing, especially text retrieval (Attar/Choueka 1978). KEDMA is a Hebrew acronym for "Grammatical
[Page 231 ]
file in retrieval systems", and is based on a representation of the vocabulary in a tree structure. The tree is established on the basis of a comprehensive Hebrew dictionary where all Hebrew words are represented in a look-up form associated with grammatical information. The structure of the tree reflects the construction of the words (consisting of prefixes, root and suffixes), with the roots of the words at the top of the tree and their inflexions at the bottom. Using the linguistic methods of KEDMA one is able to map the different inflexions of a root by following the tree and combining the root with relevant prefixes and suffixes. Correspondingly, one may identify a root by scaling the tree and shedding suffixes and prefixes as they appear. The method is of the synthesis-analysis type, consisting partly of a grammatical synthesis which maps and eliminates suffixes, and partly of a local grammatical analysis which identifies and eliminates prefixes.
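The synthesis-analysis idea can be illustrated with a toy lookup: a small set of "roots" plus affix lists, where synthesis generates every surface form of a root and analysis sheds prefixes and suffixes until a known root remains. All the word material below is invented, and real Hebrew morphology is far richer than three prefixes and three suffixes:

```python
ROOTS = {"lmd", "ktb"}          # toy "dictionary" of roots
PREFIXES = ["", "ve", "ha"]     # hypothetical prefixes
SUFFIXES = ["", "im", "ot"]     # hypothetical suffixes

def synthesize(root):
    """Generate every surface form of a root under the toy affix lists."""
    return {p + root + s for p in PREFIXES for s in SUFFIXES}

def analyse(word):
    """Identify the roots that could underlie a surface form, shedding
    prefixes and suffixes as they are recognized."""
    found = set()
    for p in PREFIXES:
        if not word.startswith(p):
            continue
        core = word[len(p):]
        for s in SUFFIXES:
            stem = core[: len(core) - len(s)] if s else core
            if (not s or core.endswith(s)) and stem in ROOTS:
                found.add(stem)
    return found

print(sorted(synthesize("lmd")))
print(analyse("velmdim"))  # {'lmd'}
```

Mapping inflexions down the tree corresponds to `synthesize`; scaling the tree upward corresponds to `analyse`.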
The first version of KEDMA was constructed especially to solve the linguistic problems created by testing the feedback method in text retrieval based on a content analysis (see below). The system was later developed into a general linguistic tool, and includes today procedures for syntactic analysis, morphological analysis, development of concordances, and word statistics. The program system may also be used for other languages.
One of the best known experiments in text retrieval within the Responsa project is the test of local metrical feedback (Attar/Frankel 1980). The method uses the result of a search to reformulate the search request for the next search. The reformulation is based on an analysis (clustering) of the retrieved documents, where words frequently occurring together with the search words ("searchonyms") are identified. These words can either be suggested to the user, who then makes the final reformulation of the request (manual feedback), or they may be automatically associated with the request (automatic feedback). The method is described above under sect 4.5.4 (3).
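The identification of searchonyms can be sketched as a simple co-occurrence count over the retrieved documents; the real method uses clustering, so this is a deliberate simplification, and the miniature document collection is invented:

```python
from collections import Counter

def searchonyms(retrieved_docs, search_words, k):
    """Rank the words that most often co-occur with the search words in
    the retrieved documents (a simplification of the clustering used)."""
    counts = Counter()
    for doc in retrieved_docs:
        words = set(doc.lower().split())
        if words & search_words:
            counts.update(words - search_words)
    return [w for w, _ in counts.most_common(k)]

# Invented miniature result list for a search on "pension".
docs = [
    "pension claim for injury at work",
    "pension granted after injury",
    "tax deduction for travel",
]
print(searchonyms(docs, {"pension"}, k=1))  # ['injury']
```

In manual feedback the ranked list would be shown to the user; in automatic feedback the top words would be added to the request directly.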
The method has been tested by both experienced and inexperienced users of text retrieval, and in both cases resulted in an improvement of performance. The experiment with automatic reformulation of the request did not, however, increase performance.
[Page 232 ]
The project included a number of other experiments in information retrieval and computational linguistics, such as automatic generation of concordances, disambiguation by short contexts, and automatic retrieval of frequent collocations and idioms. These we shall not consider in detail in this brief presentation. In the operative Responsa project one will find all the possibilities generally available in a text retrieval system. In addition, the system offers the linguistic tools, a very general mask function (left hand, right hand and middle multiple truncation), highlighting and a number of sorting routines for retrieved documents.
5.4.4 The WIENER SYSTEM
The WIENER SYSTEM research project was based on a contract between the Austrian Chancellery and IBM Austria. The initiative may be traced to the political program of the Austrian Socialist Party in 1970. This stated the necessity of making court decisions, legal literature and statutes available through a computerized system.
The main objective of the project was to develop a software package for legal information retrieval, and to demonstrate the capabilities of this software by means of a test data base.
The test data base consisted of 10 years of decisions by the Constitutional Court, approximately 10.000 documents; 3 years of legal literature, approximately 6.000 documents; and the Constitution with all its amendments (approximately 800 documents).
During the project period, STAIRS became available in Europe. It was decided to make use of this software, and concentrate on the development of "marginal" programs to STAIRS (mainly software for input and user support), and on the documentalistic aspects.
In the field of software development a thesaurus support program for building up and using a thesaurus was constructed. This program was named FAIR, and it
[Page 233 ]
allowed for automatic generation of thesaurus relations and control procedures, browsing, automatic inclusion of descriptors, etc. It is now part of the TLS (Thesaurus and Linguistic System), available from IBM as a supplement to STAIRS, and is used in some legal information retrieval systems.
A linguistic package for the German language was developed. This expands search terms (ie nouns, adjectives and verbs) automatically to all their grammatical variations at the moment of searching, if the user so desires. This program is now the other part of TLS, and in addition to German, the program is available for French and English.
A package was also developed for reconstructing the text of a statute as it stood at any given time, taking into account the original text and later amendments.
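The core of such point-in-time reconstruction is simply the application, in chronological order, of all amendments effective on or before the target date. A minimal sketch in Python; the data representation (a section-to-text mapping plus dated amendment records) is hypothetical, not the WIENER SYSTEM's actual format:

```python
import datetime

def text_as_of(original, amendments, date):
    """Reconstruct a statute's sections as they stood on a given date.
    original: {section: text}; amendments: list of (effective_date,
    section, new_text), where new_text of None means a repeal."""
    text = dict(original)
    for effective, section, new_text in sorted(amendments):
        if effective > date:
            break  # amendments are sorted; the rest are in the future
        if new_text is None:
            text.pop(section, None)
        else:
            text[section] = new_text
    return text

law = {"s1": "Original wording of section 1.", "s2": "Section 2."}
changes = [
    (datetime.date(1980, 1, 1), "s1", "Amended wording of section 1."),
    (datetime.date(1982, 1, 1), "s2", None),  # repeal of section 2
]
print(text_as_of(law, changes, datetime.date(1981, 6, 1)))
```

Querying with a 1981 date yields the amended section 1 and the still-unrepealed section 2.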
Conversion software was developed for converting different forms of input into a format suitable for STAIRS, together with sophisticated control software for the input documents.
A citation network generation program was developed, allowing the user to find in one step all documents citing and cited by a certain document, the documents which in the next cycle were cited by or cited the documents retrieved in the first cycle, etc. This snowball program, however, remained unfinished.
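The snowball expansion described is a breadth-first traversal of the citation network in both directions. A minimal sketch (the miniature network is invented):

```python
def snowball(start, cites, cited_by, cycles):
    """All documents reachable from `start` by following citations in
    either direction for the given number of cycles."""
    frontier, seen = {start}, {start}
    for _ in range(cycles):
        nxt = set()
        for doc in frontier:
            nxt |= cites.get(doc, set()) | cited_by.get(doc, set())
        frontier = nxt - seen   # only newly found documents expand next
        seen |= nxt
    return seen

# Invented miniature citation network: A cites B, C cites A.
cites = {"A": {"B"}, "C": {"A"}}
cited_by = {"B": {"A"}, "A": {"C"}}
print(sorted(snowball("A", cites, cited_by, cycles=1)))  # ['A', 'B', 'C']
```

Each additional cycle widens the neighbourhood by one citation step in either direction.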
In the documentalistic field, a comprehensive table of categories for legal documents was developed. This table also took into account the treatment of citations. The table is known as WAKS - Wiener Allgemeines Kategorienschema.
In addition, a stop-word list for legal documentation was developed, together with a "thesaurus" for legal institutions such as courts, administrative agencies and parliaments, and a classification table. Proposals were made for structuring legal documents for storage and retrieval.
The construction of a legal thesaurus was abandoned during the project. It was realized that this would encounter severe difficulties, and its usefulness with respect to improving performance of text
[Page 234 ]
retrieval was considered doubtful.
The most important result of the project was probably its demonstration that the method of text retrieval was feasible and reasonable. Though the project itself was discontinued, some of the other Austrian projects have made use of the results (cfr below in Part III under the entry for Austria).
The WIENER SYSTEM is discussed for instance in Bock/Lang 1973, Svoboda 1973a, 1973b and 1975.
5.4.5 The MAJUS program: 1974-1977
In 1974, the Institut fuer Datenverarbeitung im Rechtswesen (IDR) - which is part of the independent research institute Gesellschaft fuer Mathematik und Datenverarbeitung (Bonn) - initiated a research program on legal information retrieval. The aim of the research was to support the development of the German legal information retrieval service, JURIS (cfr the entry on Germany in part III). The program was named MAJUS, an acronym for Methoden zur Aufbereitung und Abfrage in juristisch orientierten Suchsystemen (methods for the preparation and searching of legally oriented retrieval systems).
A number of different projects were carried out within the framework of the program, and a number of publications documented the program - apart from the series of reports (MAJUS 1-21) which were published by IDR. These include a summary report (Gebhardt 1977).
The program was mainly concerned with four aspects.
Development of tools. An analysis of user dialog was planned in order to give suggestions for improvements in the text retrieval program an empirical basis. Also, standards for the assessment of dialog systems were to be developed.
Standards for retrieval systems. An early attempt was made to discuss the standardization of interfaces for text retrieval systems. A number of assessments of systems were conducted. As part of this work, a simulation model oriented towards the IBM text retrieval system STAIRS was developed. Within the
[Page 235 ]
framework of the program, studies were also made of the structure and statistics of legal texts, as well as of the structure of search requests. A study of the possibilities of data compression was never completed.
Improvement of the dialog. A concept for computer-assisted creation of thesauri was developed. The performance of ranking algorithms - mainly word-frequency algorithms - was analysed. A number of requirements with respect to search language (as many as 150) were established and analysed. The possibilities of automatic error identification and correction were examined.
Development towards a question-answer system. In parallel to MAJUS, a sub-program known as FRAGANT (Frage-Antwort-Systeme im Recht) was initiated. This project was an early attempt to see how it might be possible to give "dem Benutzer Antworten auf seine Fragen und nicht nur Hinweise auf Literaturstellen" (answers to the user's questions, and not merely references to the literature). This ambitious sub-program was terminated at the same time as MAJUS was discontinued.
In the context of our survey, one will appreciate the depth and scope of the MAJUS program. The work ranged from practical investigations to ambitious projects. The approximately 50 MAJUS publications indicate the richness of the program.
The program was discontinued in the spring of 1977. This was a decision by the GMD taken in order to improve coordination of IDR projects with the overall research policy of the GMD. It is understandable that such decisions have to be taken from time to time within large research establishments. Nevertheless we shall close this section by regretting that such a promising program had to be cut short.
5.4.6 The NORIS research program: 1972-1983
The NORIS research program was initiated in 1972 at the Norwegian Research Center for Computers and Law. It was designed to provide a common superstructure for activities relevant to legal information systems in a broad sense. The objectives of the program are to be attained through a number of small and
[Page 236 ]
well-defined projects - taking, so to speak, one step at a time. The program is still running with a number of parallel projects.
The program includes studies of the design of legal information retrieval systems, deontic systems, user studies, communication processes, etc. In this section, only some of the activities relevant to the design of text retrieval systems shall be mentioned. It should also be stressed that the program has never aimed at developing any operative text retrieval system. Its objectives are concerned with methods and design principles - and some of these have been exploited in other activities, for instance in the development of the Norwegian text retrieval systems NOVA*STATUS and SIFT, activities in which the NRCCL has participated, but which have been directed by the Government Institution for Organization and Management (Statens rasjonaliseringsdirektorat).
There have been two main objectives with respect to the research in text retrieval systems. The first one concerns the properties which would be desirable in a text retrieval system to satisfy the needs of lawyers. The second one concerns the performance of different search strategies with respect to legal problems.
(1) A probabilistic model of information retrieval was developed (Harvold 1980). The model itself is made up of two components.
One component gives the number of retrieved documents as a function of
The second component gives recall and precision values as a function of
In the model, a class is interpreted as a class of searchonyms. The model is based on traditional probability theory coupled with the type-token distribution of terms in natural text. This distribution reflects the fact that terms do not occur independently of each other in natural text.
The model has been used to investigate various aspects of retrieval performance, as for example:
There are other aspects of retrieval performance which may be analysed in terms of the model, but there are also some aspects which cannot be analysed due to the limitations inherent in the model assumptions. For example, since all the documents in a collection are assumed to have the same length, the relative effect of ranking algorithms, designed to compensate for variations in length among the various documents, cannot be investigated. Similarly, since all classes in a request are assumed to contain the same number of terms, the effect of "unbalanced" requests - one or more classes being greatly expanded, while others include only one or a few terms - cannot be analysed.
The model at least has shed some light on the factors determining the limitations inherent in search processes based on the matching of terms. In this connection one of the interesting results of the analysis is that the precision loss caused by lack of context is relatively minor, provided that the documents are not extremely long. Consequently, there is probably little to be gained by further development of the search function in the direction of automatic text analysis with the aim of resolving textual ambiguity in the documents. The improvement in system performance from such a development would be small, and the accompanying development costs would be very high.
[Page 238 ]
According to the results provided by the model, the performance loss due to retrieval failures seems to be a more serious problem. The failure to represent one of the concepts in the question (exhaustivity failure) will normally be a serious failure in terms of both recall and precision. Less serious, but not negligible, is the failure to represent a given concept with the necessary terms (specificity failure).
There is room for considerable improvement in today's systems with respect to reducing the risk of these retrieval failures. Development in this direction must necessarily involve user-system communication to a greater extent than has been the norm in existing systems.
Finally, it should be mentioned that the model differs from "traditional" probabilistic models based on loss minimizing rules (see for example Salton 1979) on the following accounts:
(2) Controlled experiments. Characteristic of the research within the NORIS program has been the development of a process which allows controlled experiments in text retrieval. The design is basically very simple. A document collection (data base) is specified. A number of questions are formulated, preferably identical to the type of questions which a lawyer would try to answer by referring to documents included in the data base. The questioner will also identify an "answer set" as a sub-set of the data base. The answer set includes documents which the questioner deems "relevant". The relevance assessment is, of
[Page 239 ]
course, often a controversial element of an experiment in system performance. In the experimental design, however, recall and precision are used only to measure the relative performance of strategies. It is not maintained that the measured ratios indicate the "true" performance of a strategy, merely that they indicate which strategy performs better.
It should be noted that the identification of an answer set presumes that the questioner is able to use an alternative and independent method with respect to the retrieval system to be tested. In the NORIS experiments, the questioner has usually been a legal expert relying on conventional research tools and subsequent examination of the documents in the data base for identification of answer sets.
At this stage the experimenter has a set of questions and an answer set for each question. Each question is then represented as a search request in the form required by the system to be tested. Identical search requests may be made subject to different strategies. In a similar way, identical strategies may be used on different document designs representing the same sources.
Performance is expressed by comparing results with the predefined answer set. In the NORIS program, the results have been represented as recall and precision ratios plotted onto graphs. The individual recall-precision graphs of the questions included in the experiment are finally summarized in the form of average graphs according to some simple rules.
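The evaluation step can be sketched directly: given a ranked result list and the predefined answer set, one computes a recall-precision point after each retrieved document. The averaging rule shown here (best precision at or above each recall level) is one common convention, not necessarily the NORIS rule; all data are invented:

```python
def recall_precision(ranked, answer_set):
    """Recall and precision measured after each document in the ranked
    list of retrieved documents."""
    relevant_seen, points = 0, []
    for i, doc in enumerate(ranked, start=1):
        if doc in answer_set:
            relevant_seen += 1
        points.append((relevant_seen / len(answer_set), relevant_seen / i))
    return points

def precision_at_recall(points, level):
    """One simple summarizing rule: the best precision achieved at or
    above a given recall level."""
    eligible = [p for r, p in points if r >= level]
    return max(eligible) if eligible else 0.0

# Invented example: four documents retrieved, two of them relevant.
points = recall_precision(["d1", "d3", "d2", "d5"], {"d1", "d2"})
print(points[0])  # (0.5, 1.0)
```

Averaging such curves over all the questions in an experiment yields the summary graphs described.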
When first designing controlled experiments in 1973 (Bing/Harvold 1974), the NRCCL was using auxiliary programs and coded input to obtain the necessary calculations. After the introduction of NOVA*STATUS in 1975-76, the NRCCL developed an extension to this text retrieval system at the computer center at Oslo University. This extension, known as VEXT, allows the experimenter to define sub-sets of the data base as answer sets, and may provide recall-precision tables and average recall tables for any search request (Fjeldvig 1978).
(3) Retrieval strategies. The NORIS program also includes the evaluation of different retrieval
[Page 240 ]
strategies. The strategies were tested empirically in addition to being compared theoretically. Of special interest were the matching functions used to rank documents. The following strategies have been tested and compared:
The results show that through ranking it is possible to improve retrieval performance by placing the relevant documents at or near the top of the list of retrieved documents.
Using word frequency ranking, the performance is improved by adjusting the rank in respect of document length. A number of different algorithms for such adjustments have been examined.
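Word-frequency ranking with length adjustment can be sketched as follows; dividing the hit count by document length is only one of the many possible adjustment algorithms mentioned, and the documents are invented:

```python
def rank_by_frequency(docs, request_words, adjust_for_length):
    """Rank documents by the frequency of request words, optionally
    divided by document length (one possible adjustment)."""
    def score(doc):
        words = doc.lower().split()
        hits = sum(1 for w in words if w in request_words)
        return hits / len(words) if adjust_for_length else hits
    return sorted(docs, key=score, reverse=True)

docs = ["pension pension claim and much other unrelated text here",
        "pension claim"]
print(rank_by_frequency(docs, {"pension", "claim"}, adjust_for_length=True)[0])
# 'pension claim'
```

Without the adjustment the longer document wins on raw hit count; with it, the short document in which every word matches is ranked first.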
Vector based retrieval performs on a level corresponding to word frequency ranking adjusted for document length, which is hardly surprising as these two methods are based on the same criteria. This work has been inspired by the SMART project, and a vector based system has been included in VEXT, the NRCCL extension of NOVA*STATUS mentioned above.
The best results have clearly been achieved by conceptor based retrieval. The strategy takes the question of the user as its point of departure. The question has the form of a "need-to-know" in the mind of the user. It is suggested that the user tries to structure this need in a simple way, regarding this need as an intersection of ideas.
The problem may be whether a housewife with a physical handicap (such as being unable to walk) may qualify for a disablement pension. Such a question may be structured as an intersection of three ideas: 'housewife', 'unable to walk' and 'disablement pension'. Each idea may be expressed as a class of synonyms specified by the user, combined with a Boolean OR. Such a class of synonyms, which
[Page 241 ]
are all related to the same idea, is what is called a "conceptor".
Any document containing at least one word represented in one of the conceptors (ie the union of all documents containing at least one word from the request) will be retrieved. These documents are then ranked according to a double scheme. Primary rank sets are created from conceptor frequency - the documents containing words representing the highest number of conceptors being ranked highest.
Ideally, all the conceptors of the search request will be present in the documents of the highest primary rank set. Such documents have a high probability of fulfilling a minimum condition of relevance. As all conceptors are represented, it is likely that all the original ideas are also represented. The presence of all the original ideas is a minimum condition of relevance, though not a sufficient condition, since the structure of the ideas may be different from those of the question. For instance, the document may describe a housewife applying for disablement pension for her invalid husband.
Within each primary set, the documents are ranked according to the number of words from the request in the document.
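The double ranking scheme can be sketched directly: primary rank by the number of conceptors represented in a document, secondary rank by the number of request words present. The synonym classes and miniature documents below are invented, following the housewife example in the text:

```python
def conceptor_rank(docs, conceptors):
    """Retrieve every document containing at least one request word,
    ranked primarily by the number of conceptors (synonym classes)
    represented, secondarily by the number of request words present."""
    def key(doc):
        words = set(doc.lower().split())
        classes = sum(1 for c in conceptors if words & c)
        hits = sum(1 for c in conceptors for w in c if w in words)
        return (classes, hits)
    return sorted((d for d in docs if key(d)[0] > 0), key=key, reverse=True)

# Invented synonym classes for the three ideas of the example.
conceptors = [{"housewife", "homemaker"},
              {"walk", "walking"},
              {"pension", "benefit"}]
docs = ["housewife unable to walk claims disablement pension",
        "homemaker claims benefit",
        "weather report"]
print(conceptor_rank(docs, conceptors))
```

The document representing all three conceptors is ranked above the one representing only two, and the document matching no request word is not retrieved at all.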
Above, in fig 4/6/1, the average curves for the performance of the conceptor strategy are compared to the results of a straight word frequency ranking and an incidental ranking based on the dates of the documents (Harvold 1980b:144).
An interesting aspect of the conceptor based strategy is the pedagogic help offered by the structure to the user. Users have to structure their search request in a rough way by dividing the problem into ideas, and specify synonyms in classes describing these ideas. The user interface of the retrieval system (as implemented in for instance NOVA*STATUS or SIFT) makes this easy for the user. The user is encouraged to specify synonyms, countering one of the major difficulties in text retrieval - that of specificity. The user does not have to specify explicitly the logic of the request, though the request may be interpreted as a quite complex formulation. And the user seems to find the structuring of the request into ideas not
[Page 242 ]
only easy, but quite helpful. Judging from experience with legal end users, they at least find this a "natural" way of structuring requests.
Conceptor based strategies are also discussed above under sect 4.5.2 (3).
(4) Performance failure. In most controlled experiments, one has also examined the causes of performance failure. This is seen as a necessary analysis in order to understand why one strategy performs better than another.
A recall failure occurs when a relevant document is not retrieved at all. The reason for such a failure can generally be identified as belonging to one of five groups. A specificity failure means that the user did not find the correct terms to represent the concepts of the question. An implicity failure means that a concept was not explicitly expressed in the document. A point-of-view failure occurs when the user and the author approach the problem from different angles - with the result that different vocabularies are used in the request and the document. A system failure is caused by a fault in the retrieval system. The system failures in the experiments were mainly caused by faulty maintenance of the data base.
The results referred to are derived from a series of experiments in project NORIS (8). Three different document collections were used: I - 100 decisions by the Social Security Court (118.000 words); II - 430 decisions by Swedish administrative courts (80.000 words); and III - 379 responsa from the central tax administration (258.000 words). Later experiments have used a data base of 1.298 abstracts of appeal court cases in family law (188.432 words), and 1.010 responsa of the Ministry of Justice in respect of land registration (217.638 words).
Fig 5/6 - Causes for recall failure (documents)
[Page 243 ]
It is encouraging that the major cause of recall failure may be related to the user (specificity), and not to the fact that the documents are in natural language (implicity alone would make feasible a recall ratio of 96-97 per cent).
Precision failure occurs when an irrelevant document is retrieved. There are too many precision failures for all to be analysed. The post mortem was therefore limited to the irrelevant documents in which the relevant ideas had been identified, ie the upper rank sets. The inclusion of irrelevant documents in lower rank sets is a characteristic of the conceptor based strategy, and not a failure.
Fig 5/7 - Causes for precision failure
Specificity may be caused by homonymity. It seems to vary with the data base, and is especially influenced by the data base's lack of homogeneity in topics. Exhaustivity represents the situation in which the document does contain the subject of the question, but the subject is only mentioned as a reference, en passant, or is unrelated to the main content of the document. This type of failure may be said to be caused by the nature of natural language. At least it is reasonable to expect that the failure is less prominent in indexing systems. However, exhaustivity failures are not absent from such systems - Lancaster identifies exhaustive indexing as the cause of 11.5 per cent of the precision failures (see above).
(5) Natural language search requests. Special effort has been made in the development of strategies for natural language search requests. Within this project, several aspects have been studied.
Automatic indexing. Through frequency and distribution statistics, criteria for appropriate search
[Page 244 ]
terms have been developed. The relation between appropriate search words and word classes has been evaluated, confirming that substantives as well as verbs derived from substantives ("dance" and "danced") are the most typical. Also adjectives and adverbs are valuable if they may be associated with the appropriate word. Compounded words are identified and split.
Expansion of the search request. Automatic expansion of words with the same basic form (lemma), expansion by splitting compounded words (in the Norwegian search requests, as much as 30 per cent of the words, stop words excepted, were compounds), and expansion by automatic truncation based on a set of stable rules. Performance tests are run on all strategies.
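Compound splitting of the kind needed for Norwegian can be sketched as a search for a split point at which both parts are known stems, optionally separated by a linking letter. The vocabulary, the example words and the choice of linking letters are all invented for illustration:

```python
def split_compound(word, vocab):
    """Try to split a compounded word into two known stems, allowing a
    hypothetical linking letter ('barnetrygd' ~ 'barn' + 'e' + 'trygd')."""
    for i in range(2, len(word) - 1):
        head, tail = word[:i], word[i:]
        if head not in vocab:
            continue
        for joint in ("", "e", "s"):
            rest = tail[len(joint):]
            if tail.startswith(joint) and rest in vocab:
                return [head, rest]
    return [word]

print(split_compound("barnetrygd", {"barn", "trygd"}))  # ['barn', 'trygd']
print(split_compound("pension", {"barn", "trygd"}))     # ['pension']
```

The parts recovered from a compound can then be added to the request as additional search words.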
In addition, automatic structuring of search requests has been tried, ie expanding each identified search word of the request into a class of synonyms, and ranking the result as a conventional conceptor based request.
(6) A preprocessor to text retrieval systems. There are limits to how far it is possible in practice to increase performance by processing the search requests and documents of conventional text retrieval systems. The analysis of performance failure has shown that the major cause of recall failure is specificity. If performance is to be increased, the input of the user must be improved.
This implies a better communication between the user and the system, with an improved search request as an assumed result. The current activity at the NRCCL is to develop an "intelligent" preprocessor to a text retrieval system, involving the user in a dialog governed by the system. This dialog will include:
Analysis of the search request, which may be formulated in natural language or Boolean logic. Appropriate search terms will be identified, and a syntactic analysis of requests in natural language will be performed in order to identify phrases and compounded words.
Expansion of the search request. Expansion by stemming (the methods developed are based on rules, do not use dictionaries, and give an
[Page 245 ]
efficiency of 97-98 per cent); expansion by automatic truncation and splitting of compounded words. Expansion will also be facilitated by saving and aggregating prior requests and analysing their structure.
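Rule-based stemming without a dictionary can be sketched as ordered suffix stripping with a minimum-stem-length guard. The suffix list below is invented for illustration and is not the actual NRCCL rule set behind the 97-98 per cent figure:

```python
# Hypothetical Norwegian-style suffix rules, longest tried first.
SUFFIX_RULES = ["ingen", "ene", "ing", "er", "en", "e"]

def stem(word):
    """Strip the longest matching suffix, leaving a stem of at least
    three letters; otherwise return the word unchanged."""
    for suffix in sorted(SUFFIX_RULES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("pensjonene"))  # 'pensjon'
print(stem("saker"))       # 'sak'
```

Because no dictionary is consulted, the rules must be conservative; the minimum-length guard prevents short words from being mutilated.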
A norm based thesaurus for further expansion of the request. (Such a thesaurus is in its experimental stages.) This type of thesaurus is related to expert systems, its basis being a representation of the normative structure of a selected (and rather limited) area of law. The norms are described as hierarchical networks, and the nodes of these networks are again described by lists of words. A user may employ the network to identify his legal problem, and the thesaurus will then translate this into a structured search request, exploiting structural information from the network and the word lists associated with the nodes. This work at the NRCCL is supported by parallel projects in the modelling of legal norms.
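The translation from a node in such a norm network into a structured search request can be sketched as collecting the word list of each node on the path from the root, one synonym class per node. The network, node names and word lists below are all invented; the experimental thesaurus described is far more elaborate:

```python
# A toy norm network: each node describes a norm, carries a word list
# and points to its parent norm.
NETWORK = {
    "social_security": {"words": ["pension", "benefit"], "parent": None},
    "disablement": {"words": ["disabled", "invalid"],
                    "parent": "social_security"},
}

def request_for(node, network=NETWORK):
    """Translate a chosen node into a conceptor-style request: one
    synonym class per node on the path from the root."""
    classes = []
    while node is not None:
        classes.append(set(network[node]["words"]))
        node = network[node]["parent"]
    return list(reversed(classes))

print(request_for("disablement"))
```

The resulting list of word classes can be handed directly to a conceptor based strategy of the kind described earlier in this section.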
Automatic estimates of recall and precision. The model of text retrieval presented above allows automatic estimates of recall and precision to be calculated, guiding the user by indicating for instance that "few" or "many" of the relevant documents have been retrieved.
Modification of the request by different feedback methods.
(7) Document design. Document design is important with respect to text retrieval, as the document defines the limits of performance of the system. Some experiments have been conducted to test the performance of different designs. For the identical data base, three different document types were used
As expected, the best results were obtained by documents including the title, and the least satisfactory results by documents including the title only - but the differences were perhaps not as significant as might have been expected.
[Page 246 ]