

III PERFORMANCE OF TEXT RETRIEVAL SYSTEMS

[Page 143 ]


8 Retrieval systems

8.1 INTRODUCTION

In a decision-making process the need for information may vary, ranging from the need for a single fact or a specific document to the need for documents which discuss or treat a certain problem or subject.

When the decision-maker himself is performing a manual search, he will vary his technique according to the type of retrieval problem confronting him. If for instance he is looking for a specific fact like an income figure, he may first use a general index to find the correct volume and the correct table. The desired information may then be looked up through the table index, which may, for example, be an alphabetical namelist. If the user is looking for general information relevant to a given problem, he will usually have to read through all possibly relevant documents, often quite thoroughly in order to grasp the meaning of the arguments presented.

A machine, of course, does not have the human capability of understanding text. It does, however, have two capabilities which are useful with respect to information retrieval. First of all it can match characters, and secondly, it can retrieve information stored in a given field. The last capability can be used to construct lists and networks of related information; see the CODASYL Data Base Task Group report for an example of such a system. What is known in a system of this kind is the label of the field where the information is stored. The information itself is unknown. The situation is the exact opposite where the machine is used for character matching. Here the user knows what he is seeking, but he does not know where the information is stored. It is primarily the character-matching type of retrieval system we shall deal with in this chapter.

Character matching can be retrieval by known characteristics such as, for example, author or title. It can also be retrieval by problem or subject, although a host of difficulties usually have to be tackled in such a process.

[Page 145 ]

Many of the difficulties stem from the fact that character matching operates on a purely syntactic level. A machine capable of matching concepts and even complex subjects would seem to be a highly desirable development. Obviously such an intelligent machine would solve many of the problems now confronting the user. However, it would not solve all of the problems. Retrieval by subject is complex largely because a subject, represented as a problem, is subjective and singular to the user. It can be difficult to communicate the problem accurately to another intellect, whether this be another human being or a machine. The reasons for this are complex, but seem at least partly linked to the fact that the conception of a problem is always determined by what the person already knows - by his so-called background knowledge. No two humans have identical background knowledge, and an intelligent machine can also be expected to have its own unique experience and its own "personality". Tomorrow's computer librarians may turn out to be no easier to communicate with than many of today's have proved to be.

In the meantime a character-matching machine might not be such a bad alternative. These systems provide direct access to the documents, and on-line systems have made instant feedback possible. The main limitation of the systems is that effective use requires user experience in order to bridge the gap between subject and query. The user must be able to anticipate or guess how different authors have chosen to express the various relevant concepts. The machine can help the user in this process. Below we shall discuss miscellaneous functions which are available. However, the basic difficulty of query construction will always rest with the user, since he is the only one who knows, or should know, exactly what his problem is.

Text retrieval systems based on character matching are designed according to relatively simple principles. Roughly speaking a text retrieval system consists of four main components:

  • file generator
  • search file
  • document file
  • user interface.

The file generator reads the documents and creates the search file and document file.

[Page 146 ]


Fig. 8/1
Creation of the search and document files

The document file contains the text of the documents in their original form. It makes the documents available to the user for reading purposes.

The search file is used for the purpose of searching the documents. One way of searching the documents would be to read through the documents in sequential order. For large data bases this is a time-consuming way of performing the task. Normally, therefore, use is made of a file constructed specially for the purpose of searching.

The function of the search file is similar to the indexes found in most books. The search file lists all the words that are used as document attributes. Each word is associated with a list of references identifying the documents (and places within the documents) where the word occurs.
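The structure of such a search file can be illustrated by a minimal sketch. The function name build_search_file, the two sample documents, and the choice of a word position as the "place within the document" are assumptions made for illustration only; an actual file generator would also have to handle punctuation, word forms, and storage on disk.

```python
from collections import defaultdict


def build_search_file(documents):
    """Map each word to a list of (document id, word position) references."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # Lower-case the text and split on whitespace: a deliberately crude tokenizer.
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return dict(index)


# Hypothetical two-document data base.
documents = {
    1: "legal information systems",
    2: "theory of legal decisions",
}
search_file = build_search_file(documents)
print(search_file["legal"])  # [(1, 0), (2, 2)]
```

Looking up a word in the search file yields the references directly, so the documents themselves need not be read during the search.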

The list of words is organized so as to optimize the speed with which any given word may be looked up. The words may for example be listed in alphabetical order. The order can be utilized when searching the list for a given word. The technique is the following.

[Page 147 ]


Bisect the list and compare the word found with the given word. If the words do not match, use the outcome of the comparison to determine the part of the list in which the given word belongs. Bisect this part and again compare the word found with the given word. Repeat the process until either a match is found or the bisection cannot be carried any further.

The described technique is called a binary search. It is remarkably efficient compared to a sequential search, since the number of comparisons that have to be made is relatively small. A sequential search requires at most n comparisons to find a given word in a list which is n words long. A binary search requires at most log2 n + 1 comparisons, where log2 n is the logarithm of n to the base 2. Compared to n, log2 n is a small number. Given for example that n = 100,000, log2 n ≈ 17.
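The bisection procedure can be sketched as follows. The function name binary_search and the practice of counting each three-way comparison of two words as one comparison are assumptions of this sketch; the word list is the seven-word example used later in this chapter.

```python
def binary_search(words, target):
    """Return (index, comparisons); index is None when the target is absent."""
    low, high = 0, len(words) - 1
    comparisons = 0
    while low <= high:
        middle = (low + high) // 2  # bisect the remaining part of the list
        comparisons += 1
        if words[middle] == target:
            return middle, comparisons
        if words[middle] < target:
            low = middle + 1   # the target belongs in the upper part
        else:
            high = middle - 1  # the target belongs in the lower part
    return None, comparisons


words = ["and", "decisions", "information", "legal", "of", "systems", "theory"]
print(binary_search(words, "legal"))     # (3, 1)
print(binary_search(words, "theory"))    # (6, 3)
print(binary_search(words, "computer"))  # (None, 3)
```

For a list of seven words no search needs more than three comparisons, whether or not the word is found.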

The list of words may also be organized in a tree-like structure. In a so-called tree the principle of the binary search is built into the structure of the list itself.

Suppose we have a list of seven words ordered alphabetically. A tree can be constructed from this list by choosing words according to successive bisections, as illustrated in the diagrams below:

bisection
level          word
  3     ->     and
  2     ->     decisions
  3     ->     information
  1     ->     legal
  3     ->     of
  2     ->     systems
  3     ->     theory

A tree is searched by first comparing the given word with the word at the top of the tree. If the words do not match, one of the branches is chosen depending on the outcome of the comparison. The given word is then compared to the next word on the branch, and the process is repeated until a match is made, or the bottom of the tree is reached.

[Page 148 ]

The number of necessary comparisons when using a tree will in principle be the same as when using a binary search. The advantage of the tree is that it is simpler to update than an ordinary list.

Suppose for example that we want to add the word "computer" to our list of words. According to its alphabetical value "computer" should be inserted between "and" and "decisions" in the list. Inserting the word here means moving all the words below "and" downward to make room for the new word.

Adding the word to a tree, however, does not require moving the other words. In a tree structure new words are added on to the bottom of the tree. Since "computer" precedes "legal" and "decisions" but follows "and", it is connected to the word "and" as a new branch below it.

We note that a growing tree may eventually become unbalanced. The tree can be balanced by selecting a new word as the top node and rearranging the other words accordingly.
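The construction of the tree by successive bisections, and the addition of "computer" at the bottom, can be sketched as below. The class Node and the functions build_balanced, insert, and search are hypothetical names chosen for this illustration, and the sketch makes no attempt to rebalance the tree after an insertion.

```python
class Node:
    def __init__(self, word):
        self.word = word
        self.left = None   # branch for words that sort before this word
        self.right = None  # branch for words that sort after this word


def build_balanced(sorted_words):
    """Build a tree by successive bisections of an alphabetically sorted list."""
    if not sorted_words:
        return None
    middle = len(sorted_words) // 2
    node = Node(sorted_words[middle])
    node.left = build_balanced(sorted_words[:middle])
    node.right = build_balanced(sorted_words[middle + 1:])
    return node


def insert(node, word):
    """Attach a new word at the bottom of the tree; existing nodes stay in place."""
    if node is None:
        return Node(word)
    if word < node.word:
        node.left = insert(node.left, word)
    elif word > node.word:
        node.right = insert(node.right, word)
    return node


def search(node, word):
    """Return the number of comparisons used to find the word, or None if absent."""
    comparisons = 0
    while node is not None:
        comparisons += 1
        if word == node.word:
            return comparisons
        node = node.left if word < node.word else node.right
    return None


words = ["and", "decisions", "information", "legal", "of", "systems", "theory"]
root = build_balanced(words)
insert(root, "computer")
print(root.word)                  # legal
print(root.left.left.right.word)  # computer
print(search(root, "computer"))   # 4
```

The new word is reached through "legal", "decisions", and "and", so finding it takes four comparisons instead of the three that sufficed before the insertion. This is the sense in which a growing tree gradually becomes unbalanced.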

The user interface provides the user with the means of communicating with the system.

The user interface will analyze the query submitted by the user, perform the search, and present the results back to the user. The user interface also provides the user with the means of reading the documents stored in the document file.

In the following chapters we shall look in more detail at some of the factors that are relevant to the design and performance of the user interface.

First we shall take a look at retrieval performance. Central to a definition of performance is both the concept of relevance and the process of retrieval itself. We shall discuss both factors and define the criteria by which performance is measured.

[Page 149 ]


Next we shall consider a variety of factors that we have assembled under the heading of search strategies. The factors involve a number of choices confronting the designer and user of a retrieval system. Among the choices are questions relating to the selection of data base, selection of document representation, selection of command language, and formulation of query.

Finally, we shall look at some of the research that has been done regarding the performance of retrieval systems. Our survey of research projects is by no means complete; we have limited ourselves to eight projects - four oriented toward general retrieval problems, and four oriented specially toward problems of legal retrieval.

Fig. 8/2
User communication with the retrieval system

[Page 150 ]


9 Retrieval performance

9.1 INTRODUCTION

A retrieval system is usually only part of a greater information system which, in addition to the retrieval system, will include a host of other data-gathering and data-processing routines. The information system itself is part of a decision-making system, which in turn is part of a social system, for example a court of law, a government office or a corporation. It is the goals of the higher-up systems which determine the goals of the information system and by implication the goals of the retrieval system.

We shall not go into great detail on the subject of goals. If given a lengthy theoretical exposition, the subject may easily seem more difficult than it really is. When we deal with goals at all, we shall keep strictly to the goals of the retrieval system. Furthermore, we shall not discuss how the goals of the retrieval system may compete for scarce resources with other goals. In other words, we shall refrain from entering into a discussion of retrieval efficiency.

What we are interested in is the ease with which the system may be employed by the user and the quality of the retrieval result itself. Corresponding to these two broad problem areas we can speak of operations-oriented and relevance-oriented performance criteria.

9.2 OPERATIONS-ORIENTED CRITERIA

Performance criteria may be defined in different ways and at different levels of generality. Most criteria, however, can be reduced to one of the six criteria in the now famous list first presented by Cleverdon (1964).

Three of Cleverdon's criteria can be classified as operations-oriented criteria. These are:

  • response time
  • user effort
  • form of output

[Page 151]

Response time is usually taken as the time from when the query is submitted to when the response is received. However, in on-line systems the concept of response time becomes more vague than in the typical batch system, and can in fact be given a variety of interpretations; see Lancaster/Fayen (1973: 133-136).

First of all we have the time it takes to make the system, including the data base, available to the user. On-line systems are usually available only at certain hours of the day, and certain data bases may only be available on request.

Secondly we have the time it takes for the system to respond to a command. For an average query the response time should only be a few seconds. A waiting time in excess of four seconds is generally considered as a needlessly irritating factor that may alienate users.

Lastly we have the time it takes to achieve a satisfactory result. This will include the entire time spent in a session. A retrieval problem will normally require several searches with time spent in between on evaluating the answer sets. Sessions lasting in excess of half an hour are not unusual.

User effort relates to the design of the system. A system may have been designed with a special group of users in mind. With respect to effort it is at the very least necessary to distinguish between the efforts of a beginner and the efforts of an experienced user. Thus a powerful and flexible system may not always be easy to learn. However, once such a system is mastered, the user may find that it is relatively easy to conduct searches in this system compared to a "simpler" one. User effort is therefore a complex criterion which depends as much on the abilities of the user as on the design of the system.

Form of output refers to the various formats in which the documents and feedback indicators may be presented to the user.

We shall return to the question of output in addition to the questions of user effort and response time in section 10.5.

9.3 RELEVANCE-ORIENTED CRITERIA

A high score regarding the operations-oriented criteria can be used to offset a low score regarding the relevance-oriented criteria, and vice versa. A system which tends to retrieve many irrelevant documents along with relevant ones may thus still be usable as long as it has a short response time and adequate display facilities.

[Page 152 ]


Nevertheless, the main purpose of a retrieval system is to retrieve all relevant, and only the relevant, documents for a given request. Failure to do so can be ameliorated, but never completely remedied, by purely operative functions. It is therefore quite natural that the literature on retrieval systems has to a large degree concentrated on relevance-oriented questions. These are the questions dealing with the relevance of the systems in general and with the relevance of given retrieval results in particular.

The relevance-oriented criteria are:

  • coverage
  • recall
  • precision

Coverage is a measure of the general relevance of the system. The measure says something about the adequacy of the document collection in relation to the information needs of the user. A collection which includes all the documents that the user may ever want to consult has a coverage of 1. Smaller document collections have a coverage between 0 and 1.

The quality of a particular search is measured by the recall and precision ratios. The recall ratio measures the proportion of all relevant documents in the data base which have been retrieved. The precision ratio measures the proportion of the retrieved documents which are relevant.

If a, b, c and d are defined as below:

                retrieved    not retrieved
relevant            a              b
not relevant        d              c

then:

recall = a / (a + b)
precision = a / (a + d)
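As a small worked illustration of the two ratios, the sketch below computes them from the counts a, b and d. The function name recall_precision and the sample counts are hypothetical, and returning 0.0 when a denominator is zero is merely a convention adopted here.

```python
def recall_precision(a, b, d):
    """a: relevant and retrieved; b: relevant, not retrieved; d: retrieved, not relevant."""
    recall = a / (a + b) if a + b else 0.0
    precision = a / (a + d) if a + d else 0.0
    return recall, precision


# Hypothetical search: 40 relevant documents exist, 50 documents are retrieved,
# and 30 of the retrieved documents are relevant.
print(recall_precision(a=30, b=10, d=20))  # (0.75, 0.6)
```

Note that c, the number of irrelevant documents which are not retrieved, enters into neither ratio.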

9.4 THE CONCEPT OF RELEVANCE

The measures of recall and precision may seem simple at first glance, but this impression is something of an illusion. The measures depend entirely on the concept of relevance, and this is probably one of the most difficult concepts in the whole field of information retrieval.

[Page 153 ]

There are at least three issues of importance to our understanding of relevance. The first issue concerns the type of relevance. Are we dealing with formal, content, or subjective relevance? See Königova (1971). The second issue concerns the nature of relevance. Is relevance absolute or is it relative to each user? The last issue concerns the grading of relevance. Is relevance a matter of degree or is it an either/or proposition?

9.4.1 Types of relevance: formal, content, and subjective relevance

Relevance is a relator in the sense that it says something about one thing in relation to another thing. The two "things" might be the syntactics of two texts, in which case we talk of formal relevance. Or they may be a problem and the content of a text, in which case we talk of content or subjective relevance depending on the type of situation in which the evaluation takes place.

Formal relevance measures the syntactic similarity between two texts. The texts may be query and document or two documents. The formal relevance of a document is thus a value assigned to the document by a matching function. Formal relevance is based on the syntactic structure of the document, not on the content of the document, nor on its usefulness to the user. Formal relevance will usually reflect the similarity between two texts as measured, for example, by a matching of words. But it may also reflect general criteria like type and age of document, author, and so on. The nature of formal relevance is absolute. Given two texts and a matching function, the relevance value is unambiguously defined. Depending on the matching function, the grading may be either/or (binary) or by degrees.
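To make the notion of a matching function concrete, the sketch below gives one possible function graded by degrees and one either/or variant derived from it. Both functions, their names, and the sample texts are assumptions made for illustration; the proportion of matching query words is only one of many conceivable similarity measures.

```python
def formal_relevance(query, document):
    """Fraction of distinct query words that also occur in the document (0.0 to 1.0)."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & document_words) / len(query_words)


def formal_relevance_binary(query, document):
    """Either/or grading: 1 only if every query word occurs in the document."""
    return 1 if formal_relevance(query, document) == 1.0 else 0


document = "exemption of hospital property from taxation"
print(formal_relevance("hospital tax exemption", document))         # 2 of 3 query words match
print(formal_relevance_binary("hospital tax exemption", document))  # 0
print(formal_relevance_binary("hospital exemption", document))      # 1
```

Given the two texts and the function, the value is unambiguously defined. The word "taxation" does not match "tax", which shows how purely syntactic the measure is.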

Content relevance is defined as the adequacy of the content of a document as a response to the request. Subjective relevance is defined as the usefulness of the document to the user. Subjective relevance will depend on a host of factors, including the user's previous knowledge. In the literature subjective relevance has also been characterized as "utility", Cooper (1971), and "pertinence", Foskett (1972).

The choice between content and subjective relevance must to a certain extent reflect the type of decision-making situation in which the user finds himself.

In an informal decision-making situation where the value of the decision depends entirely on future consequences, there are no rules that the decision-maker is forced to consider. The only thing that counts is the decision itself. How it is arrived at is irrelevant.

[Page 154 ]

The quite opposite situation exists where the decision process is formalized in such a way that the validity of the decision depends mainly on the premises on which the decision is based. The validity of a decision made by a legal court, for example, will depend on whether or not certain procedures have been observed and respected. Certain of these procedures demand that certain legal sources shall be consulted. If the judge neglects to consult one of the required sources, his decision is invalid, even if it turns out that the same decision would have been arrived at had the source been taken into account.

In legal decision-making, as in other formal situations, it is thus not appropriate to regard relevance as entirely subjective. The relevance of a document cannot be defined only in terms of its usefulness to the user, since such a definition implies that only documents which in some way cause the decision-maker to change his mind are relevant. Instead we must base relevancy on the content of the document. Whether or not the document is relevant will depend on the adequacy of the content of the document as a response to the request. As part of the content we count things like date of publication, author, and so on.

9.4.2 The nature of relevance: absolute and relative relevance

We have earlier remarked that formal relevance is absolute in the sense that it does not depend on individual value judgements. Subjective relevance, on the other hand, is relative in the sense that it is particular to each individual user. In fact subjective relevance is not only relative to each user, but to each user situation, since the background knowledge of the user, which changes constantly, is a main factor affecting the utility of additional information.

The nature of content relevance is more difficult to establish. Content relevance measures the adequacy of a document as a response to the request. Content relevance does not depend on whether or not the user himself finds the information useful in the sense that it is new to him.

In a strictly formal system there are definite rules for evaluating relevance. There is little or no room for the personal opinion of the individual user. In such a system content relevance tends to be absolute.

While the legal system has certain characteristics of a formal system, these are not sufficiently prominent to make the assessment of content relevance absolute. This is not only evident in legal theory with its emphasis on human judgement within the often broad limits set by the law, but is also borne out by several empirical investigations regarding relevance assessments. We refer to section 11.3.1 for a summary of the results of one of these investigations. This is not to say that relevance assessment in the legal system is completely relative. Clearly it is not. As so often in the law the solution must be found somewhere in between the extremes, the different situations in each case deciding the way the scale tips.

[Page 155 ]

9.4.3 The grading of relevance: grading by degrees or binary grading?

Even when we know the type and nature of the relevance assessment, we are still left with the question of grading. Whether the relevance assessment should be graded by degrees or given a binary grading does not follow automatically from our previous choices of type and nature, yet neither is the grading of relevance completely independent of them.

Consider for example formal relevance. We have established that the assessment of formal relevance as performed by a matching function is absolute. Since there is no uncertainty regarding formal relevance, it might seem to be an appropriate candidate for binary grading. However, such a conclusion would be premature. It must be remembered that formal relevance as applied in a retrieval system is only an approximation of content or subjective relevance, and in most situations that are not characterized by fact retrieval (see section 9.5.1), the approximation will not be perfect and may even be quite inaccurate. As long as formal relevance is an approximation, the system should grade documents by degrees. This is appropriate even in the cases where the user himself would grade documents according to a binary scale. In these situations formal relevance is only a measure of similarity and should not presume to be a measure of identity.

The grading of documents by the user according to content or subjective relevance is an entirely different matter. If it is assumed that the user assesses documents according to subjective utility, it is obvious that some documents are going to be more useful than others. It is doubtful, however, whether the user will be able to assign to every document a unique rank according to utility. It seems, at the very least, safe to assume that the user will not make any distinctions among the documents which are all clearly irrelevant. And in most cases it also seems safe to assume that the user in fact will not be able to assign a unique rank to each relevant document. Experience has shown that humans are generally incapable of mentally making a complete comparison of more than a few items.

[Page 156 ]

The user may be able to classify the relevant documents under a few headings like:

But again it is doubtful how much use he will have of such a classification. It is also an open empirical question whether users of retrieval systems normally classify documents in this manner, or whether the practice is reserved for panelists taking part in relevance experiments.

The behavior of panelists has often been cited in support of the proposition that relevance is a matter of degree. Gebhardt (1975), for example, refers to the Joint American Bar Foundation and IBM Project, see Eldridge (1968), and points out that panelists seem to disagree as often as they agree on relevance assignments. The typical user situation, however, does not consist of a panel, but of a single user. For a given document in a given situation the user will normally be able to decide whether or not the document is clearly irrelevant, or whether it might be relevant. Most of the time he will probably not assign a unique rank to the document, and an attempt to do so might prove difficult.

Assuming, however, that legal documents are assessed according to their content, relevance values corresponding to their hierarchical ranks may be assigned. Thus the constitution may be given a higher relevance value than a statute, a statute may be given a higher value than a regulation, and so on. But such a scheme is neither very realistic nor, probably, very useful. The rank of a legal source is not in itself absolute. The respective ranks of a supreme court decision, a statute, and a regulation, for example, depend on several factors, including how long ago the respective documents were written, how directly they affect the issue at hand, the reasonableness of the result which each document favors, and so on. In fact, in most areas of the law so much depends on human judgement that it does not seem practicable to implement any kind of rigid scheme for assigning relevance values to documents. Cfr. above at section 1.2.9.

A complete ranking of documents on a utility basis, rather than on the basis of content, is not normally performed by users. One of the findings reported by Eldridge (1968) was that each panelist seemed to have his own favorite relevance group in which he tended to place documents. It is likely that the same tendency is found among ordinary users.

[Page 157 ]

If a request is complex, the user may rank documents according to the aspect of the request which the document refers to. In a criminal case, for example, the user may find a document discussing the question of guilt more important than a document discussing the correct legal reaction once guilt is established. It is probably appropriate, however, to regard this kind of preference as a ranking of the various aspects of the problem rather than as a ranking of retrieved documents.

The user may change his relevance assessment as he gains new insight into the nature of the problem. He may disregard documents he previously thought to be relevant and consider as relevant documents he previously overlooked. In one of the experiments of the NORIS project, the evaluator initially assessed the relevance of all documents in the data base with respect to 20 questions. After having considered the result of the machine search, he rejected 16 of the 162 documents originally judged relevant and accepted 61 of the documents originally judged irrelevant. A re-evaluation of previous assessment results is probably common and illustrates the relativity of relevance. However, this fact by itself has little bearing on the appropriate relevance grading of the documents.

What we are left with as a conclusion is that the user at any one time disregards irrelevant documents. The remaining documents may be, and sometimes are, classified in a few relevance categories. But they will almost never be assigned unique rank values. Documents which at first glance seem to be of dubious relevance are usually re-assessed and either disregarded or accepted as relevant. The user will normally not leave them in an uncertain state, as this is of little value to him.

The grading of content and subjective relevance must therefore, generally speaking, be regarded as binary, that is as an either/or proposition. The user can make use of a few relevance categories, but will as a rule not make a full ranking of the documents.

9.5 THE RETRIEVAL PROCESS

The retrieval process consists of both human and machine subprocesses as shown schematically in Fig. 9/1.

[Page 158 ]


Fig. 9/1.
The retrieval process

[Page 159 ]


The first step in a retrieval process is query construction. The user must transform his problem into a query, i.e. he must give the problem a syntactic representation. It should be emphasized that the problem itself is semantic in nature. It is often possible to give the problem several syntactic representations which are all equally adequate from the user's point of view. As queries, however, the different representations may not all be equally adequate.

The query defines the formal properties which a document must have in order to be retrieved. Thus the formal relevance value assigned to a document is determined by the query, while the content (subjective) relevance of a document is, of course, independent of any particular syntactic representation.

One way, and sometimes the only way, to achieve a perfect result, where all and only the relevant documents are retrieved, is to formulate the query as duplicate images of the relevant documents. Of course, normally this is not a practical alternative. The syntactics of the relevant documents will usually vary significantly. And because there is an expressed retrieval need, the relevant documents will, by the very nature of the situation, be largely unknown to the user. The query cannot, and should not, portray each relevant document, but should express the properties which the documents have in common. We can call these properties the necessary conditions of relevancy. The formal relevance of a document should reflect the probability of all these conditions being matched in a document.

In order to illustrate what we mean by necessary conditions of relevancy, let us take as an example the retrieval problem used by Horty in his paper from 1960 on the application of information retrieval techniques to legal research. The legal problem in question concerned the potential loss of a tax exemption which a hospital might suffer because it rented out portions of its first floor for commercial purposes.

As Horty points out, the problem

"..involves three basic concepts. First of all it is a tax problem, secondly, it concerns the exemption of property from taxation, and thirdly it concerns the use of a part or portion of the property in a commercial manner. The object is to retrieve each statute which deals with all three concepts.

Because not every statute will deal with this problem in the same language our inquiry must reflect every possible way each concept has been expressed in the statutes."

Horty chose to give his query the following structure:

c1 AND c2 AND c3

[Page 160 ]


where c1, c2 and c3 are sets of alternative ways of representing the first, second, and third concept respectively, see Horty (1960). The structure reflects the condition that a document, in order to be relevant, must contain all three concepts.

Each concept is represented in the query by a set of words. Each word in a set is an alternative way of representing the concept. In the following we shall call a set of words representing the same concept a class.
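A query of this structure can be sketched as a test that a document contains at least one word from every class. The function name matches, the words placed in each class, and the two sample documents are hypothetical; they are not Horty's actual word lists or statutes.

```python
def matches(document, classes):
    """True if the document contains at least one word from every class."""
    words = set(document.lower().split())
    return all(words & set(cls) for cls in classes)


# Hypothetical word classes; these are not Horty's actual word lists.
c1 = {"tax", "taxation", "taxes"}
c2 = {"exempt", "exemption", "exempted"}
c3 = {"rent", "rented", "lease", "commercial"}

documents = {
    "A": "property rented for commercial use loses its exemption from taxation",
    "B": "hospital property is exempt from taxation",
}
retrieved = [name for name, text in documents.items() if matches(text, [c1, c2, c3])]
print(retrieved)  # ['A']
```

Document B is rejected because none of the words in the third class occurs in it, although it may well be of interest to the user. The quality of the result thus depends on how completely each class anticipates the language of the documents.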

Once the query is specified, the documents can be matched against the query. This process is performed by the machine according to a specified algorithm usually called the matching function.

It is normally difficult to evaluate the result of the search. In principle it can be done by manually assessing the relevance of all the documents in the data base and comparing the result with the retrieval result. In Fig. 9/1 such a hypothetical manual evaluation process is represented by the box labeled "relevance assessment", and the search result is depicted by the degree to which sets S1 and S2 overlap. We note that recall and precision are determined by the ratios n(S1 ∩ S2)/n(S1) and n(S1 ∩ S2)/n(S2) respectively.

In a retrieved document set there will normally be some irrelevant documents along with the relevant ones. As long as the documents are not ranked, the relevant and irrelevant documents will, on the average, be randomly distributed in the retrieved set, and the documents will not be presented to the user in an order favoring the relevant ones. Precision will therefore, on the average, remain constant as the user looks at new documents. In a ranked set, however, precision will initially be high and then deteriorate as new documents are considered by the user.
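The behavior of precision in a ranked set can be illustrated by computing it after each document the user has looked at. The function name precision_at_each_cutoff and the sample relevance judgements are hypothetical, and the values are rounded to two decimals for display only.

```python
def precision_at_each_cutoff(judgements):
    """judgements: list of booleans, True where the document at that rank is relevant."""
    relevant_seen = 0
    result = []
    for rank, is_relevant in enumerate(judgements, start=1):
        relevant_seen += is_relevant
        result.append(round(relevant_seen / rank, 2))
    return result


# Hypothetical ranked set of six documents in which the relevant ones tend to come first.
ranked = [True, True, False, True, False, False]
print(precision_at_each_cutoff(ranked))  # [1.0, 1.0, 0.67, 0.75, 0.6, 0.5]
```

In this example precision starts at 1.0 and falls to 0.5 by the end of the set, although the decline need not be strictly monotonic.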

Before we leave this short introduction to the search process, it should be emphasized that recall and precision alone do not give the complete performance picture. For one thing, recall and precision do not tell us the time it takes for the user to complete the search. It is true that high precision reduces necessary browsing time and thus is a powerful timesaving device. But total search time is also a function of factors like user ability and experience. In fact, recall and precision are never independent of the other performance criteria, but can usually be improved at the expense of search time and user effort. Secondly and perhaps more importantly, recall and precision do not reflect the interactive nature of on-line retrieval systems. This interactivity, which allows the user to modify his query on the basis of achieved results (feedback), is a very powerful system function. However, it would probably be a misconception to think that the interactive capabilities of on-line systems reduce the demands on the other search functions. Interactive capabilities can only complement other functions, not replace them.

[Page 161 ]

9.5.1 Fact retrieval

A point which has often been stressed by authors on the subject of information retrieval (see for example Bar-Hillel 1964) is that it is important to distinguish between different information needs. This seems especially true in a discussion regarding retrieval strategies. There is a whole spectrum of information needs - one end of the spectrum being characterized by fact retrieval, the other end by reference retrieval.

Fact retrieval is, as the name implies, a search for facts. A fact can in this connection be either a specified piece of information, for example a name, figure or amount, or it can be a specified document, for example a letter, invoice, or statute. What distinguishes fact retrieval from reference retrieval is that all types of relevance assessment in fact retrieval are absolute. Hence there can only be one correct answer set irrespective of the user performing the search. An important consequence of this is that the relevance requirements can be precisely and formally specified in the query. It means furthermore that, as long as we have an adequate data base, the retrieval result is always optimal in the sense that all relevant and no irrelevant documents are retrieved.

9.5.2 Reference retrieval

Reference retrieval is a search for references or citations to documents which discuss or in some other way throw light on a given problem. Most of the legal research connected with finding precedents can be characterized as reference retrieval.

Reference retrieval is a much more complicated process than fact retrieval, since user relevance assessment has now become relative. A perfect reference retrieval result is rarely achieved, and even a satisfactory result can often be difficult to achieve for the inexperienced user. A great many retrieval aids and strategies have been developed to meet the difficulties involved, and it is to the most important of these aids and strategies we now turn our attention.

[Page 162 ]


10 Search strategies

10.1 INTRODUCTION

Factors affecting performance can be divided into two groups according to whether or not the factors are in principle subject to user control. We shall define a search strategy as the determination of the variable factors, that is the factors subject to user control.

The fixed factors are the factors over which the user has no control, not only practically speaking, but in principle. As it turns out, there are not many factors which are fixed. In addition to the problem itself the fixed factors include only the documents that provide an adequate coverage with respect to the problem. Whether or not these documents are part of a searchable data base is a question which the user normally cannot control. However, we still choose to regard the composition of the data base as a variable factor. Even if the user is not in a controlling position, at least the system designer is.

In addition to the selection of the data base, the variable factors include selection of document representation, selection of command language, and formulation of query. These are the factors concerning coverage, indexing, user effort, and retrieval performance, and we shall now discuss them in more detail.

10.2 SELECTION OF DATA BASE: THE QUESTION OF COVERAGE

Coverage is a similar measure to recall. But while recall refers to the proportion of relevant documents stored in the data base that is retrieved, coverage refers to the proportion of documents required by the user which is stored in the data base. Recall can thus be said to be a derived measure of coverage, and a high value of recall may not be very significant if coverage is inadequate to begin with.
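The difference between the two measures can be made concrete with a small sketch. All document identifiers are hypothetical, and the function names are ours.

```python
def coverage(required, data_base):
    """Proportion of the documents required by the user that are stored."""
    return len(required & data_base) / len(required) if required else 0.0


def recall(required, data_base, retrieved):
    """Proportion of the stored relevant documents that are retrieved."""
    relevant_stored = required & data_base
    if not relevant_stored:
        return 0.0
    return len(retrieved & relevant_stored) / len(relevant_stored)


required = {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}  # user's needs
data_base = {"a", "b", "x", "y", "z"}                          # stored documents
retrieved = {"a", "b", "x"}                                    # search output

cov = coverage(required, data_base)
rec = recall(required, data_base, retrieved)
print(cov, rec)
```

Recall is perfect in this example, yet the user has obtained only two of the ten documents he needs, because coverage is only 0.2.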

[Page 163 ]


Like recall, coverage is particular to the user and his problem. Normally it is not practical to provide all users with perfect coverage, nor is perfect coverage for everybody in all possible situations necessarily the aim of the system. Not only is universal perfect coverage impossible, but the failure to achieve it is hardly serious. Coverage and recall failures cannot be compared in this respect. Recall failures are serious because normally the user does not know what he is missing. Coverage failures, however, are immediately apparent to the user, provided that the system is transparent in the sense that the user is made aware of the different document types included in the data base. The important thing is not that the system covers all information needs, but that it covers certain well-defined needs which are not easily provided for by other means.

10.2.1 Information needs

Because of the availability of other information sources in addition to a retrieval system, there will rarely or never be a need to implement a system with perfect coverage. However, decisions regarding the data base are hardly less difficult in a system with more modest coverage goals. Since coverage, in the final analysis, depends on user relevance assessment, there is potentially no limit to what should (or could) have been included, even in a limited system.

Obviously the system designer has a need for guidelines or rules regarding data base composition. Not surprisingly such guidelines are difficult to define. Some help, however, can be found in the type of decision-making situation in which the retrieval system functions.

Decision-making situations may be more or less formally structured. In a typical informal situation the value of the decision is determined by the decision effects, and these effects may be more or less difficult to predict at the time of the decision. The effects are always difficult to predict when the cause-effect relationship is unknown or too complex to be thoroughly understood. In informal situations of this type it is common to associate the value of information with the value of the decision made on the basis of that information. Since the value of the decision is determined by its effects, the value of information is usually based on methods for predicting uncertain decision effects. To the extent that forecasting methods or models are used, the field of potentially relevant information is reduced from an unstructured and unlimited mass of data to a few precisely defined items.

[Page 164 ]


This solves the coverage problem regarding the model, but does not really solve the problems regarding total information needs. Methods and models to predict the future, if they exist at all, are as a rule only simplifications of reality and therefore inaccurate. Usually such methods are only used as aids, not as the main instrument in the decision-making process. Consequently, we are still left with difficulties in predicting the potential relevancy of different types of information and with difficulties in relating different information items to possible decision outcomes.

A quite different type of situation exists where decisions are made according to sets of norms (or meta-norms). This is the formal type of situation which exists, broadly speaking, in, for example, legal decision-making. In formal systems there are usually no serious problems connected with predicting decision effects. The effects are still as important as before, but in the formal type of situation there is no uncertainty connected with the effects. The mutual relationships between relevant information, norms, and effects are predefined in the norms themselves.

Legal decision-making tends to be "formal" in the sense the word is used here. But only in very rare cases will legal decision-making be strictly formal, so that given norms are applied mechanically. It is also only in rare cases that legal decision-making is strictly informal so that no norms at all are considered, but only the effects.

In formal systems we no longer have the same relationship between information and decision as we had in informal systems. In informal systems, we remember, the value of information depended on the value of the decisions made on the basis of this information. In formal systems it must necessarily be the other way around. It is now the value of the decision that is dependent on the value of the information on which the decision is based. The value of the information depends on whether or not the information is selected and used according to accepted system practice. Hence the problems of coverage are very different in formal and informal types of systems. In formal systems like the legal system, the relevant sources of information are to a large extent specified in the meta-norms governing the decision process. The sources can easily be identified and are usually available as written text. A designer of a legal retrieval system need generally not worry, as must his colleague who is struggling with the information problems of an informal system, that he has failed to identify a crucial or important source of information.

It is now time to look closer at the structure of the legal sources of

[Page 165 ]


information. As far as written texts are concerned, this structure is relatively simple and helps to ease the problems connected with the coverage of legal data bases.

10.2.2 The hierarchy of legal sources

The concept of "legal source" can take on different meanings. Sometimes it connotes an argument in favor of a particular solution to a legal conflict. At other times it connotes the authority issuing a legal norm, for example parliament, the government or a court.

In this book the concept of legal source is used in yet another sense. By a legal source we understand the text of a document used in support of a particular legal argument. A legal source is in itself not a fact in the case. A legal source says something about the norms which should be used to decide the case. Sometimes the distinction between fact and legal source becomes blurred, as for example when a contract favoring a particular solution to a conflict is submitted as evidence in the case. It is only the document text itself that we refer to as a legal source. Any particular interpretation or understanding of the document we do not take as the source.

The structure of legal sources can be said to have two dimensions. One dimension is made up of the hierarchical structure of source types, the other dimension is made up of different historical versions of the same document. We call the former the type dimension and the latter the time dimension.

10.2.3 The type dimension

Legal sources are, generally speaking, structured as a hierarchy of different types. A hierarchy in this connection can be described as a structure where authority, while formally resting at the top, is delegated downwards in the structure. Communication lines are generally vertical, and very little horizontal communication takes place in the structure.

A legal source is located in the hierarchy according to the authority of its issuing body. Thus a description of the hierarchical structure of legal sources more or less follows the lines of authority found in the social legal structure. Legal sources issued by bodies at the same level of authority are usually regarded as belonging to the same source type. There have been several attempts at classifying Norwegian legal sources under a few general type headings, see for example Andenæs/Kvamme (1969:19), Eckhoff (1971:15), and Bing/Harvold (1973:18). In Fig. 10/1

[Page 166 ]


we have utilized these attempts, but adapted the results according to our text-oriented interpretation of "legal source".

The diagram is, of course, a simplification of reality. While it does reflect the lines of authority in the legal structure, it tells us very little about the relative weights (ranking) the various source types are given in real decision-making situations.

The diagram does, however, tell us something about coverage. Implicit in the hierarchical structure of legal source types is a duty on the part of the decision-maker to examine also the relevant sources of a higher type than the document he is currently considering. If documents of different types are in conflict or differ in important respects, the decision-maker must discuss and resolve the differences. Owing to the appeal system, the decision will usually conform to the source of the higher type, although there is no explicit norm on this point, see Eckhoff (1971:270-306). There is no similar mechanism inducing the decision-maker to consider also documents of lower or parallel source types. It may be considered good practice to do so, but failure in this respect will not normally render the decision invalid.

Fig. 10/1
The hierarchy of legal source types.

[Page 167 ]


10.2.4 The time dimension

In addition to the main hierarchical structure just described, the structure of legal sources has a time dimension.

Laws that have been replaced or amended are often still applicable to acts committed at the time they were in force. The most striking examples are perhaps found in the field of tax law, where the high frequency of amendments has led to many a case being prosecuted according to earlier versions of the tax code, see Føyen/Harboe/Lie (1973: 56).

In these cases the data base must include different versions of what is basically the same document. In terms of Fig. 10/1 we can think of earlier historical versions of a document as extending vertically down from the current one.
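One way to picture the time dimension is a version history from which the text in force on a given date is selected. The following is only a sketch: the dates and texts are invented, and the function name is hypothetical.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical version history of one statute, oldest first:
# (date on which the version came into force, text of the version).
versions = [
    (date(1965, 1, 1), "rate 20 per cent"),
    (date(1971, 7, 1), "rate 25 per cent"),
    (date(1975, 1, 1), "rate 30 per cent"),
]


def version_in_force(versions, on_date):
    """Return the text that applied on on_date, or None if none did."""
    starts = [start for start, _ in versions]
    index = bisect_right(starts, on_date) - 1   # last version starting on or before on_date
    if index < 0:
        return None
    return versions[index][1]


# An act committed in March 1973 is judged by the 1971 version,
# even though a later version now exists.
print(version_in_force(versions, date(1973, 3, 15)))
```

The sketch presupposes that the data base has kept the earlier versions at all, which is exactly the coverage decision discussed here.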

The decision as to what historical versions of a document to include in the data base is part of the coverage problem. The decision will often turn out to be difficult. Since different versions of the same document fall outside the main hierarchical structure of document types, this structure is of no help to the decision-maker. It may also be difficult for him to foresee the conflicts which may require consultation of the old document versions. Practically speaking, however, the problem is not so great as it may seem in principle. There are only a few areas of the law which give rise to a large number of conflicts and which are, at the same time, continually updated. Taxation and social security legislation fall into this category, and the future may, of course, bring further examples.

10.2.5 The need for a segmented data base

A large and heterogeneous data base is not only costly to update and costly to search, but it is also inefficient in terms of precision. The user may retrieve irrelevant documents which would have been avoided if the search had been limited to only a part of the total data base.

The capability of limiting the search to a well-defined subpart of the data base is useful also in another respect. An automated retrieval process is usually not transparent in the same sense as a traditional manual search. An automated retrieval process does not give the user the same immediate sense of what he is looking at as when he is going through index cards, picking books off shelves, and browsing, etc. in a library.

This lack of transparency is not significant so long as the data base consists of documents of a single type, for example supreme court decisions or articles from a given professional journal. But when the

[Page 168 ]


data base consists of different types of documents, there may be a clear need for dividing the data base into separate segments, in order to give the user the option of excluding irrelevant segments.

A data base may be segmented in different ways. One method is to use several different physical search files. Another method is to use just one search file, but impose on the documents a logical structure which can be used for the purpose of carving out pieces of the data base.

One way to impose a logical structure on the data base is to create a linked structure consisting of formatted fields holding data like document identification, author, dates, and so on. Parts of the data base may then be carved out on the basis of the data stored in these fields. If one of the fields contains dates, the user may for example select all cases decided during a given period.
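A minimal sketch of carving out part of the data base by a formatted date field is given below. The record layout, field names and values are assumptions made for illustration only.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    """Hypothetical formatted fields kept alongside each document."""
    doc_id: str
    author: str
    decided: date


records = [
    Record("case-1", "court A", date(1974, 11, 5)),
    Record("case-2", "court B", date(1975, 3, 20)),
    Record("case-3", "court A", date(1976, 1, 9)),
]


def decided_between(records, first, last):
    """Identifiers of all cases decided in the period first..last (inclusive)."""
    return [r.doc_id for r in records if first <= r.decided <= last]


segment = decided_between(records, date(1975, 1, 1), date(1975, 12, 31))
print(segment)
```

The text search proper would then be limited to the documents in the selected segment.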

The use of formatted fields means establishing a new data structure in addition to the so-called inverted file structure used for ordinary text retrieval. Creating a data structure consisting of formatted fields is not always necessary, since much of the same effect can be achieved by adding prefixes to the words which would otherwise be stored in the formatted fields. Words having a common prefix can be uniquely identified as belonging to the same class. Suppose for example that we want to retrieve all documents published in a period defined by two given dates. This is not easily accomplished in a traditional text retrieval system, since a date by itself may look pretty much like any other number. However, if we give all dates a standard format and prefix them with a unique code (e.g. date:761201), all documents published in the year 1976, for example, may be found by use of an appropriate query such as the following: date:*.gt.date:751231.and.date:*.lt.date:770101 where "date:" is the prefix and * signifies truncation.
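The effect of the prefixed date terms can be imitated in a few lines. This is a sketch, not the query processor of any actual system: the inverted file contents are invented, and the comparison relies on the assumption that all dates share the fixed six-digit format and fall within one century, so that string order equals date order.

```python
# Hypothetical inverted file: term -> set of documents containing the term.
inverted_file = {
    "date:751115": {"doc1"},
    "date:760302": {"doc2"},
    "date:761201": {"doc3", "doc4"},
    "date:770410": {"doc5"},
    "tax": {"doc1", "doc3"},
}


def prefixed_range(inverted_file, prefix, lower, upper):
    """Documents holding a term with the given prefix strictly between
    lower and upper, mimicking  prefix*.gt.lower.and.prefix*.lt.upper."""
    hits = set()
    for term, documents in inverted_file.items():
        if term.startswith(prefix) and lower < term < upper:
            hits |= documents
    return hits


docs_1976 = prefixed_range(inverted_file, "date:", "date:751231", "date:770101")
print(sorted(docs_1976))
```

Because the prefix sets the date terms apart from ordinary words and numbers, the term "tax" can never be mistaken for a date, and no separate field structure is needed.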

Data base segmentation is not the only means by which part of the data base may be carved out at the start of a search. Most notable among the alternative approaches is so-called cluster generation, see Salton (1971: 223-304). Through this technique documents are clustered according to their syntactic similarity. The technique is probably most effective in situations characterized by very large data bases consisting of rather homogeneous documents. In situations of this kind manual segmentation may be both difficult and ineffective.
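The general idea of grouping documents by syntactic similarity can be sketched with a very simple single-pass procedure. This is not the method described by Salton; the similarity measure (word overlap), the threshold and the sample texts are all assumptions chosen to keep the example short.

```python
def jaccard(a, b):
    """Word overlap between two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def single_pass_clusters(documents, threshold):
    """Assign each document to the most similar existing cluster,
    or start a new cluster if no similarity reaches the threshold."""
    clusters = []   # each cluster: {"words": set of words, "members": list of ids}
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        best, best_similarity = None, threshold
        for cluster in clusters:
            similarity = jaccard(words, cluster["words"])
            if similarity >= best_similarity:
                best, best_similarity = cluster, similarity
        if best is None:
            clusters.append({"words": set(words), "members": [doc_id]})
        else:
            best["members"].append(doc_id)
            best["words"] |= words
    return [cluster["members"] for cluster in clusters]


documents = {
    "d1": "income tax deduction travel",
    "d2": "tax deduction travel expenses",
    "d3": "custody child parent",
    "d4": "parent child support custody",
}
clusters = single_pass_clusters(documents, threshold=0.3)
print(clusters)
```

A search could then be directed at the cluster whose vocabulary best matches the query, instead of at a manually defined segment.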

[Page 169 ]


10.3 SELECTION OF DOCUMENT REPRESENTATION: THE QUESTION OF INDEXING

Any retrieval system, whether based on manual or automated methods, consists of basically two types of files. There is one file containing the documents themselves, and then there is a file containing the entry points to the documents. It is the last file which is actually used in the matching stage of the retrieval process, and we shall call it the search file. In the literature the file is also known by a variety of other names according to the method of document representation used.

Not surprisingly the file is known as the index file in systems where the documents are represented by index- or keywords.

In full text systems the file is commonly called the inverted file. An inverted file contains the same words as the document file. The words, however, are organized differently in the two files. In the document file the words in a given document are found by looking up the document. In the inverted file the documents in which a given word occurs, are found by looking up the word. In other words the two files are inverted versions of each other and hence the name.
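The relationship between the two files can be shown with a minimal sketch. The sample documents and the function name are hypothetical, and a real inverted file would normally also record word positions.

```python
from collections import defaultdict

# Document file: look up a document to find its words.
document_file = {
    "doc1": "the parent has custody of the child",
    "doc2": "the child receives support",
}


def invert(document_file):
    """Build the inverted file: look up a word to find its documents."""
    inverted_file = defaultdict(set)
    for doc_id, text in document_file.items():
        for word in text.split():
            inverted_file[word].add(doc_id)
    return dict(inverted_file)


inverted_file = invert(document_file)
print(sorted(inverted_file["child"]))
print(sorted(inverted_file["custody"]))
```

Both files hold the same words; only the direction of the lookup differs.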

In so-called vector systems the documents are represented, as might be expected, in a vector file. Each document is given a vector consisting of a number of terms equal to the number of different words in the data base. The order of the terms is the same in all vectors, but if a word does not occur in a document, the corresponding term is given a value of zero in the document vector.
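A vector file of this kind can be sketched as follows. The text only states that absent words receive the value zero; using the number of occurrences as the value for words that are present is our assumption, as are the sample documents.

```python
def build_vector_file(document_file):
    """Return the common term order and one vector per document."""
    # One term per different word in the data base, in a fixed order.
    vocabulary = sorted({word for text in document_file.values()
                         for word in text.split()})
    vectors = {}
    for doc_id, text in document_file.items():
        words = text.split()
        # Occurrence count for words present (assumed weighting), zero otherwise.
        vectors[doc_id] = [words.count(term) for term in vocabulary]
    return vocabulary, vectors


document_file = {
    "doc1": "tax tax income",
    "doc2": "income child",
}
vocabulary, vectors = build_vector_file(document_file)
print(vocabulary)
print(vectors)
```

Every vector has the same length and the same term order, so two documents, or a document and a query, can be compared position by position.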

We thus see that the search file may be constructed according to different principles. But the essential function of the file, namely that of providing entry points to the documents, always remains the same.

The different ways of mapping or representing documents have been widely discussed in recent years and have been the subject of several investigations. Next we shall look at two methods of representing documents. In the first method documents are represented by manually assigned keywords. In the second method documents are represented by words occurring in the documents themselves, a process that can be performed automatically.

10.3.1 Representation by keywords

The process of representing a document by keywords is usually referred to as indexing. There are two main reasons for indexing a document.

The primary reason is one of costs. In many situations indexing is

[Page 170 ]


simply the cheapest way of establishing a retrieval system. In most libraries for example, documents are not available in machine-readable form, and the manual process of indexing is the only practical way of creating a search file. Writing abstracts or retyping documents for this purpose will be prohibitively expensive when the document collection is large.

Cost, however, is not the only reason for indexing. Indexing is sometimes also used in systems where the documents are available in machine-readable form. It is true that the reason for indexing may still be partly economic - an index search file is normally shorter and thus cheaper to store and search than a search file based on abstracts or full texts. However, the reason is probably more motivated by a desire to add information to the documents. An important characteristic of the document may only be implied in the text, but can be explicitly expressed by the use of a keyword. In addition the use of keywords also provides the means of classifying documents according to a systematic and controlled vocabulary, see Lancaster/Fayen (1973: 244-262).

Experimental tests indicate, however, that indexing is, on the average, less effective than text representation based on abstracts or full texts. Salton/Lesk (1968) report that representation by document titles, which is similar to representation by keywords, is less effective than representation by abstract or full text. Cleverdon (1967) had arrived at seemingly corresponding results, although his experimental setup was different. Cleverdon found that single words selected from the texts of the documents provided, on the average, the best document representation. Representation by abstracts or controlled vocabularies proved less effective. See sections 11.2.1 and 11.2.2.

The economic benefits of indexing are thus bought at the expense of a loss in retrieval performance. Whether or not this loss is acceptable depends mainly on the situation in which the retrieval system functions. In some typical library situations, for example, the user wants to retrieve documents already known to him. In these cases there is little need for entry points in addition to bibliographical data. In section 9.5.1 we characterized a search of this kind as fact retrieval. In other typical library situations the user is familiar with the subject on which he is searching, but does not know all the documents containing relevant material. In these situations, which we earlier characterized as reference retrieval, the information loss resulting from indexing may be unacceptable.

Legal retrieval can be characterized as reference retrieval most of the

[Page 171 ]


time. Typically, a case is presented to a lawyer as a set of facts. On the basis of these facts the lawyer has to find the relevant legal sources and prepare his argument on an interpretation of these sources as they apply to the case. It can readily be seen why indexing may prove inadequate in situations of this type. Indexing is a transformation process in which the original words of the document may be kept, deleted, changed, or added. Most of the words are in effect deleted. Words are kept, changed or added according to the subject areas which the indexer considers important at the time. This transformation process introduces an arbitrary element into the representation process. The subject areas uppermost in an indexer's mind at the time of the indexing may not correspond at all to the needs of unknown users, who may be removed from the indexer not only with respect to professional background, but sometimes also by a span of unknown years. We know that sometimes a subject, or even legal norms, will develop in directions which were unpredictable even a few years earlier. The indexer may thus really be faced with the impossible task of classifying a document under a subject heading which has not yet been defined, or which at least has not yet been associated with the content of the document.

10.3.2 Representation by text

By far the simplest way of representing a document is to base the search file exclusively on the words occurring in the document. We shall call this method text representation.

A document may be represented by its entire text or by smaller parts, as for example the conclusion or a summary. Systems where documents are represented by their entire texts are usually called full-text systems.

Even in these systems, however, the search file does not normally include all the words in the texts. The so-called common words, i.e. words like "the", "a", "and", "for", "is", and so on, are more or less evenly distributed in all documents and therefore carry very little information for retrieval purposes. Consequently, they are normally excluded from the search file.
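The exclusion of common words amounts to a simple filter applied before the words enter the search file. The list below contains only the five words named in the text; an actual system would use a longer list, and the sample sentence is invented.

```python
# The common words mentioned above; a real list would be longer.
COMMON_WORDS = {"the", "a", "and", "for", "is"}


def search_file_words(text):
    """Words of a text that would be entered in the search file."""
    return [word for word in text.lower().split() if word not in COMMON_WORDS]


kept = search_file_words("The salesman is liable for the tax and a fine")
print(kept)
```

Six of the ten words in the sample sentence are dropped, which suggests how much shorter the search file can become.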

Text representation does not require any kind of manual processing at all. It does require, however, that the documents are available in machine-readable form. The method is therefore especially well suited to situations where documents are "captured at the source", i.e. where documents are made available in machine-readable form as part of a printing or typing process. Capture at the source is not possible in the case

[Page 172 ]


of historical material, and retyping or punching of the material is then a necessity. Punching of even vast amounts of old material may be justified. In order to create the data base of LEXIS, the legal retrieval system operated by the Mead Data Corporation, three billion characters had been punched at the end of 1973, cfr. above at section 6.4.

The advantages of text representation are generally speaking the same as the disadvantages of keyword representation. A document represented by its text can be retrieved on the basis of any word occurring in the text. Therefore text representation will normally provide more entry points to a text than indexing. And, generally speaking, a lot of entry points is an advantage, since the retrieval process can be described as a game of second-guessing either the author or the indexer of the documents. In order to retrieve a document containing a given concept, the user must include in his query the exact words which the author, or in the case of keyword representation, the indexer, used to represent the concept. Unless the user is familiar with the indexer's way of thinking, it may be easier for him to conjecture the word-use of the author.

However, text representation does have drawbacks of its own. First of all it normally requires a relatively large amount of storage space compared to keyword representation. In addition the method may make retrieval difficult precisely because the documents are represented only by words occurring in the documents.

A concept will not always be expressed explicitly in a text; sometimes it is only implied by the overall meaning of the presented argument. For example "out-of-town living expenses" is an important concept in many tax cases, but, at least in a collection of decisions by the Swedish governmental courts, we found that the concept was not always expressed explicitly in relevant documents. Instead it was implied by the factual descriptions, for example by the description of the hotels and restaurants where the salesman had spent his week away from home.

Even when a concept is explicitly expressed, it can sometimes be extremely difficult to second-guess the author as to the word used to express the concept. Very high or low levels of specificity may be involved. Examples are abundant. "Agricultural inventory" expressed as a refrigerator, and "dangerous article" expressed as a slide projector are just two examples of high specificity. See Bing/Harvold (1974: 108-111).

[Page 173 ]


10.4 FORMULATION OF QUERY: THE QUESTION OF PERFORMANCE

Some of the most important characteristics of retrieval performance can be described by the recall and precision ratios. Similarly, important aspects of query construction and search strategies can be described in terms of so-called recall and precision devices - retrieval devices aimed at improving recall and precision respectively, see for example Lancaster (1968: 85). In the following we shall make use of a somewhat different approach where the emphasis is put on the role of the output quantity as a determining factor of retrieval performance.

Generally speaking, recall can only be improved by increasing the number of retrieved documents, and an increased number of documents normally makes for a drop in precision, even though there is an important exception, which will be discussed in section 10.4.4. Similarly, precision is normally improved by reducing the number of documents, a result that tends to reduce recall. Thus since both recall and precision devices are mainly functions of the size of the retrieved document set, they are, from the point of view of query construction, two sides of the same coin.
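The dependence of both ratios on output size can be illustrated by widening a query one term at a time. The inverted file and the set of relevant documents are invented for the purpose, and the added terms are combined as alternatives (logical OR).

```python
# Hypothetical inverted file and relevance assessment.
inverted_file = {
    "custody": {"d1", "d2"},
    "guardian": {"d3", "d6"},
    "child": {"d1", "d2", "d3", "d4", "d7", "d8"},
}
relevant = {"d1", "d2", "d3", "d4"}


def evaluate(terms):
    """Return (output size, recall, precision) for an OR-query over terms."""
    retrieved = set().union(*(inverted_file[term] for term in terms))
    hits = len(retrieved & relevant)
    return len(retrieved), hits / len(relevant), hits / len(retrieved)


narrow = evaluate(["custody"])
wider = evaluate(["custody", "guardian"])
widest = evaluate(["custody", "guardian", "child"])
print(narrow)
print(wider)
print(widest)
```

As the output grows from two to four to seven documents, recall rises from 0.5 to 1.0 while precision falls from 1.0 to 4/7.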

The process of formulating a query can be described as striking a balance between the semantic demands inherent in the question and the syntactic limits inherent in the matching function, see section 10.5.5. The most basic problems encountered in formulating queries, however, are common to most retrieval systems, regardless of the type of matching function used. These are the problems connected with transforming the information in the question to syntactic criteria which can be interpreted by the matching function.

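Essentially the question transformation process consists of two steps: identifying the conditions necessary to relevance, and specifying the terms which represent each condition.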

It is convenient to put a common label on terms representing the same necessary condition. In the following we shall call such a class of terms either a "conceptor" or simply a "class".
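A query built from conceptors can be sketched as follows. We assume here that a document satisfies a necessary condition when it contains at least one term of the class, and that every class must be satisfied; the class labels, terms and inverted file are hypothetical.

```python
# Hypothetical inverted file: term -> documents containing the term.
inverted_file = {
    "parent": {"d1", "d2"},
    "mother": {"d3"},
    "father": {"d4", "d5"},
    "child": {"d1", "d3", "d6"},
    "minor": {"d4"},
}

# Each conceptor (class) groups the terms representing one necessary condition.
conceptors = {
    "PARENT": ["parent", "mother", "father"],
    "CHILD": ["child", "minor"],
}


def search(conceptors, inverted_file):
    """Documents matching at least one term from every class."""
    result = None
    for terms in conceptors.values():
        class_hits = set()
        for term in terms:                      # alternatives within a class
            class_hits |= inverted_file.get(term, set())
        result = class_hits if result is None else result & class_hits
    return result or set()


answer = search(conceptors, inverted_file)
print(sorted(answer))
```

Adding a term to a class can only enlarge the answer set, whereas adding a further class can only reduce it.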

10.4.1 Identifying conditions necessary to relevance

A question is normally made up of separate ideas or concepts, each of which is necessary in order that the question may retain its unique

[Page 174 ]


character. Identifying the necessary conditions of a given question may superficially seem easy enough, but often turns out to be a tricky business indeed. The process involved is best illustrated by an example.

Suppose that we are interested in the legal norms affecting the relationship between parent and child. A first attempt at identification of the necessary conditions might yield the concepts "legal norms", "affecting", "relationship", "parent", and "child".

A closer look at these concepts may lead us to eliminate the concept of "legal norms" as superfluous, since we are dealing with a data base consisting of legal sources to begin with. The two concepts "affecting" and "relationship" are also very general in this type of data base; probably so general that they are not useful as search criteria either. This leaves us with the concepts of "child" and "parent". The concepts of "child" and "parent" do not by themselves define the question. Obviously they may be part of many other questions too. In this case, as in many others, it is difficult to give the question an exhaustive and, at the same time, meaningful representation.

Suppose we have established that we want to base our search on the concepts of "parent" and "child". It should be emphasized that we are talking about concepts. The specific words "parent" and "child" are only used to suggest the concepts. It should also be emphasized that concepts normally are fuzzy around the edges. Whether or not a given concept is present in the user's mind depends not only on the direct word impulses he receives, but also on his lines of thought and on the associations he makes.

Take for example "parent" and "child". These two concepts can very well be used completely independently of each other. "He behaved like a child" will not usually invoke the association of a parent. In other contexts the two concepts imply each other. If we are dealing with parent-child relationships, any talk of parent necessarily implies a child, and vice versa. Cfr. above at section 1.2.9. This is an important point to make in connection with document retrieval. Whether or not a concept is recognized in a document will often depend on one's lines of thought and on one's point of view. "Parent" might imply "child", or it might not. In some cases both "parent" and "child" may be implied by still another

[Page 175 ]


concept, as for example the concept of "a statute protecting the rights of minors".

The above example illustrates the importance of analyzing a question in terms of concepts before embarking on a search. The conditions essential to a question may shift somewhat, depending on the conceptual level on which the user is thinking, and the success of a search will depend upon the user's ability to see his problem on the same conceptual level as the author's. It is therefore important that the user always strives to adopt different viewpoints or angles from which to analyze the question.

10.4.2 Specifying terms

Once the question has been analyzed and the conditions necessary for relevance identified, the user can start to specify the terms which will represent the conditions. Of course in practice it is difficult to separate the processes of question analysis and term specification. Very often the two processes will be performed simultaneously. This is unimportant as long as the user is able to make a proper analysis of the question.

Most of the literature on search strategies concerns the problems of term specification. System functions designed to aid the user in this task have received particular attention.

A term will normally be an alphabetical word or phrase. In principle it may be any character string which can be matched in the data base. A term used as a search criterion has an entity all of its own. It is ambiguous in the sense that the user does not know all the contexts in which the term may appear in the data base. The user will thus often feel a need of specifying the context in which he is using the term. In index systems limited contexts may be specified by the use of so-called links and roles. A link specifies the combination of two independent concepts into a more complex concept, e.g. testing of cars, testing of bombs. A role defines one of several contextual meanings of a word, e.g. car (sales product), car (source of pollution).

In text systems links can normally be defined by specifying that two words must occur within a certain distance of each other (as measured by words). A so-called positional operator is used for this purpose. Generally speaking, roles can also be handled in text systems by demanding the co-occurrence of both the main object (e.g. car) and its context (e.g. sales product).
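The idea of a positional operator can be illustrated by a minimal sketch. It assumes that a document has already been split into a list of words; the function name `within_distance` and the sample sentence are hypothetical and are not taken from any of the systems discussed here.

```python
def within_distance(words, first, second, max_distance):
    """Return True if `first` and `second` occur at most `max_distance` words apart."""
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    # A link is present if any pair of occurrences is close enough.
    return any(abs(i - j) <= max_distance
               for i in first_positions
               for j in second_positions)


text = "the testing of new cars is regulated by statute".split()
link_found = within_distance(text, "testing", "cars", 3)    # True: 3 words apart
no_link = within_distance(text, "testing", "statute", 3)    # False: 7 words apart
```

A role could be approximated in the same way, by demanding that the main object and a word describing its context occur within some chosen distance of each other.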

[Page 176 ]


10.4.3 Aids in specifying terms

Natural language is immensely varied. A given subject may be treated and expressed in a large number of ways, depending on the author and the context in which he is writing. Specifying terms for a given idea or concept may sometimes seem like an almost impossibly difficult task. In order to assist the user in this task, several different methods and techniques have been developed. We shall now take a closer look at the most important of these.

(1) Browsing through the documents already retrieved is the oldest, simplest, and perhaps the most effective way of getting fresh ideas with respect to new terms to be included in the query.

The effectiveness of browsing may mainly be due to the immediate feeling of vocabulary which it imparts to the user. Browsing is relatively time-consuming, however, and can only be used in situations where on-line systems are available. And, of course, browsing requires that the user already has obtained one search result.

(2) Truncation. The grammatical variations of a word are relatively unimportant to the search techniques in use today. What usually counts is the stem of the word. Prefixes and suffixes can be regarded as arbitrary functions of style. The method of truncation provides the possibility of disregarding word endings for matching purposes. Suppose the user specifies the term "car*". He will then not only retrieve documents containing "car" and "cars", but, in addition, all documents containing words having the leading letters "car". If the data base covers broad subject areas, our truncating user may be in for a surprise. He may, for example, find himself confronted with documents not containing the word "car" but words like "carnivore", "carrier", "carrot", "cartel", and "cartridge" just to mention a few. However, this example should not detract from the fact that truncation, when it is used with care, can be a very powerful and effective search technique.
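The surprise awaiting the truncating user can be reproduced with a few lines of code. The sketch below assumes that a trailing "*" denotes right-hand truncation, as in the example above; the function name and the small vocabulary are hypothetical.

```python
def matches_truncated(term, word):
    """Match `word` against `term`; a trailing '*' means right-hand truncation."""
    if term.endswith("*"):
        return word.startswith(term[:-1])
    return word == term


vocabulary = ["car", "cars", "carrot", "cartel", "scar", "truck"]
matched = [w for w in vocabulary if matches_truncated("car*", w)]
# matched -> ["car", "cars", "carrot", "cartel"]
```

The unwanted matches "carrot" and "cartel" appear because only the leading letters are compared; nothing in the matching rule knows where the word stem ends.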

How effective truncation really is has, as far as we know, never fully been tested. Salton/Lesk (1968) report on a comparison of the suffix "s" dictionary with a word stem dictionary. The use of the former dictionary is tantamount to truncating the terminal "s"; the use of the latter is tantamount to truncating all endings beyond the word stem. The results were inconclusive in the sense that both methods came out on top in the comparison, depending on the data base used. The effect on performance of truncating terms as compared to leaving them in their original form was not tested in this experiment.

[Page 177 ]


Such an experiment does involve several methodological difficulties, but was nevertheless undertaken as part of the NORIS program. One of the principal difficulties is to obtain results that are representative of the data base as a whole. In the NORIS experiment the analysis was not based on sample questions, but on an analysis of the total data base with the aim of grouping together those words which were considered synonyms independently of context. Context-dependent synonyms could not be included in the experiment, since questions that would have defined the contexts were not included in the experimental setup. Once the synonym groups were isolated, the effect of truncation could be measured by selecting and truncating a representative word within each group. Three texts were used. The longest text contained 2 913 words. The second longest text contained 2 273 words. Since the last text was supposed to be very short, average values based on 40 representative short texts were calculated. The average length was 72.8 words.

The main results of the experiment were quite encouraging. The results obtained for the long text showed that an average of 75 per cent of the words in the synonym groups were matched by the truncated terms, while an average of only 20 per cent of the words matched by the truncated terms belonged outside the synonym groups. The corresponding values for the two shorter texts were 94 per cent and 5 per cent for the second text and 96 per cent and 1 per cent for the very short text, see Harvold (1974). The extremely good results in the last case indicate primarily that very short texts contain relatively few synonyms.

The results from this experiment can also be used to illustrate the effect of truncation on performance, as measured by recall and precision. The results obtained are derived by means of a retrieval performance model developed during the NORIS program.

In fig. 10/2 the effect of truncation is shown as a function of both data base size and average document length. The truncation effect is indicated by the average, relative change in recall and precision resulting from the truncation. We note that the percentage increase in recall is much greater than the corresponding decrease in precision. We also note that truncation is relatively more effective the larger the data base and the shorter the average document length. The model predicts, for example, that in a data base consisting of about 30 000 documents, where each document is about 60 words in length, recall can, on the average, be approximately doubled by the use of truncation, while precision will only suffer a decrease of about 25 per cent. The results suggest that truncation can be

[Page 178 ]


Fig. 10/2

a powerful aid to the user, even though the exact values given above should not be taken too seriously. They do not pretend to be more than they are - average values extrapolated from a limited empirical material by the use of a simplified model of reality.

The results also presuppose that the user is able to truncate in a relatively sensible way. In this connection it can be pointed out that truncation mistakes are normally easily corrected in on-line systems. They will usually show up as a sudden increase in the number of retrieved documents, or they will be detected in the browsing stage of the process.

Suffix truncation, or so-called right-hand truncation, is by far the most useful type of truncation in most languages. In certain languages, like German and the Scandinavian languages, prefix or left-hand truncation may also be quite useful. These languages are characterized by the construction of complex words through a concatenation of simpler words. In addition, German has its special prefix problem. A completely general masking function, where any part or parts of a word can be discarded for matching purposes, may be the ultimate solution in the case

[Page 179 ]


of these languages. Systems having such a feature will be letter-oriented in contrast to the present word-oriented systems.

(3) Thesauri. A thesaurus is a synonym dictionary. It defines different synonym groups, each of which may be activated by the specification of any one member of the group. Synonym dictionaries are usually structured as hierarchies, although they may also be given a largely unsystematic network-type structure. Hierarchical dictionaries are manually constructed, but it is possible to construct network dictionaries by automatic methods, for example by linking terms which co-occur often in sample documents assumed to be typical for the given subject area, see Salton (1971:132-141), or by linking terms co-occurring often in queries.
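How a synonym group is activated by any one of its members can be sketched as follows. The sketch assumes the simplest possible structure, a flat list of synonym groups without hierarchy; the function name and the groups themselves are hypothetical.

```python
def expand_with_thesaurus(term, synonym_groups):
    """Return the term together with every member of each group containing it."""
    expanded = {term}
    for group in synonym_groups:
        if term in group:
            expanded |= group
    return expanded


# Hypothetical synonym groups, for illustration only.
groups = [{"car", "automobile", "vehicle"}, {"child", "minor", "infant"}]
car_terms = expand_with_thesaurus("automobile", groups)
other_terms = expand_with_thesaurus("contract", groups)
# car_terms -> {"car", "automobile", "vehicle"}; other_terms -> {"contract"}
```

A term that belongs to no group is simply passed through unchanged, so the expansion can be applied to every term of a query without special handling.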

Opinions are divided concerning the practical use of thesauri. Most empirical investigations have given negative results. Saracevic (1970:677) reports on the usefulness of a thesaurus constructed for a data base consisting of 600 journal papers in the field of tropical diseases. Use of this thesaurus did not result in any significant improvement of performance. The experiences of Salton/Lesk (1968) are only slightly more encouraging. Using a thesaurus constructed for a data base consisting of 780 abstracts of documents in computer literature, they reported that performance was somewhat improved when the query was expanded by higher up concepts (parents), but no improvement was observed in the cases of sons, brothers, or cross-references.

Some of the reasons for this poor track record are not hard to find. Since a thesaurus is a general tool, it must be independent of any one given context. It can only include context-independent synonyms, which in many ways are the least interesting synonyms from a retrieval point of view. Of greater interest to the user are the context-dependent synonyms: words such as "water-well", "refrigerator", and "grain", which can all be synonyms in the context of taxation of agricultural inventory, but which may not be related in any other context.

Many, maybe most, of the context-independent synonyms are grammatical versions of the same form. Some systems, like FAIR developed by IBM Austria (cfr. above at section 7.2.1), limit their thesaurus to a grammatical generator of word forms. In order to handle the context-dependent synonyms they provide the user with system functions for constructing his own private thesaurus. Functions of this kind may not only include the tools for defining and deleting links between terms, but may also include proper synonym lists from which the user can "mark" (for example by the use of a light-pen) the terms he finds interesting and

[Page 180 ]


which he wants to include in his query. Even an ordinary alphabetical list of all the distinct words in the data base may prove a useful source of inspiration to the user.

Before we leave this section we have to mention two other functions, which are characterized both by an elegant simplicity and a large variety of uses. These are the query-saving function and the so-called macro function. The facility for saving queries may be used to construct special synonym queries. These may then later be included in new queries simply by way of reference. The essential feature of a macro is not that it can be used to save terms, but that it can be used to save the logical structure of a query. A macro is specified as an ordinary query with the exception that the terms are given as parameters. Actual terms are filled in later on when the macro is referenced. Macros were first implemented on the STATUS 1 system, see A.E.R.E (1975).
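The distinction between saving terms and saving logical structure may be made concrete by a small sketch. It is not the STATUS implementation; the function `define_macro`, the parameter names, and the use of parentheses and an `.or.` operator are assumptions made for the purpose of illustration.

```python
def define_macro(structure):
    """Save the logical structure of a query; the terms are left as parameters."""
    def reference(**terms):
        # Actual terms are filled in when the macro is referenced.
        return structure.format(**terms)
    return reference


# Hypothetical macro: either of two terms, combined with a third.
either_with = define_macro("({a} .or. {b}) .and. {c}")
query = either_with(a="parent", b="guardian", c="child")
# query -> "(parent .or. guardian) .and. child"
```

The same macro can be referenced again with quite different terms, which is precisely what separates it from a saved query.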

10.4.4 Controlling the quantity of output

Earlier we mentioned that, generally speaking, recall and precision were functions of the number of retrieved documents. Controlling the quantity of output is therefore the key to controlling performance.

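In reference retrieval at least, it is convenient to regard a query as being made up of classes of terms, where each class represents a necessary condition of the question. A query made up of classes can be expanded in three ways: by expanding the terms within a class, by decreasing the specificity of a class, and by relaxing the class co-occurrence requirement.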

A class is expanded by adding terms that are synonyms at the same generic level as the terms already specified in the class. "Synonyms" is here taken in a broad sense and includes not only context-independent synonyms (the true synonyms), but also context-dependent synonyms. The interesting thing about term expansion is that, on the average, it will not affect precision, even though recall is increased. Terms representing a condition at the same generic level are likely to be equally representative of this condition. The probability of retrieving a relevant document is therefore not changed as terms are added to the class. Consequently the ratio of relevant and irrelevant retrieved documents will, on the average, remain constant. See Bing/Harvold (1974: 64-71) for an empirical illustration.
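A query made up of classes can be sketched as a list of term sets, where a document is retrieved only if it contains at least one term from every class. The toy data base, the function name `retrieve`, and the chosen synonyms below are hypothetical.

```python
def retrieve(documents, classes):
    """Return ids of documents containing at least one term from every class."""
    return {doc_id for doc_id, words in documents.items()
            if all(words & terms for terms in classes)}


# Toy data base: each document is reduced to its set of words.
documents = {
    1: {"parent", "child", "custody"},
    2: {"guardian", "minor"},
    3: {"parent", "taxation"},
}
narrow = retrieve(documents, [{"parent"}, {"child"}])
# Term expansion: add synonyms at the same generic level to each class.
expanded = retrieve(documents, [{"parent", "guardian"}, {"child", "minor"}])
# narrow -> {1}; expanded -> {1, 2}
```

Document 3 is retrieved in neither case, since no amount of expansion within the first class can compensate for the absence of the second class.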

[Page 181 ]


Term expansion is thus a very attractive method of increasing recall, since the normal penalty in the way of a loss in precision is avoided. Several retrieval experiments suggest, however, that no amount of term expansion will produce 100 per cent recall as long as the class co-occurrence requirement is not reduced. Saracevic (1970 b:678) considers this as one of his most significant and surprising results. Corresponding results were obtained in the NORIS (8) project, see Bing/Harvold (1974:80) and Bing/Harvold/Kjønstad/Stabell (1976). The same result is also implied by a model of text retrieval developed by Harvold (1976). According to this model a complete expansion of all classes is necessary in order to obtain 100 per cent recall. The only possible way to obtain 100 per cent recall without complete expansion is in fact to relax the class co-occurrence requirement. It goes without saying that complete expansion of a class is not normally practical, since it involves identifying all words used in the data base to express the given concept or idea.

Output can also be increased by adding new terms of a higher generic level. This will decrease the specificity of the class. Decreasing specificity is primarily a useful technique in systems using a controlled and hierarchically structured vocabulary of index terms. However, the technique may also prove useful in text systems. Legal documents, for example, often have a heading or a summary describing the important points or aspects of the document in more general terms than the main text itself. Decreasing the specificity of the terms will normally cause an increase in recall accompanied by a decrease in precision.

The last method by which output can be increased is to relax the class co-occurrence requirement by dropping one or more classes from the query. Dropping a class implies that the query no longer exhaustively describes the question. Hence the method is often referred to as decreasing the exhaustivity of the query. Reducing the class co-occurrence level will normally both increase recall and decrease precision. Recall will increase because at least some of the new documents will contain all the necessary conditions in the correct context and thus be relevant, even though not all the conditions are represented in the query. Of course precision will fall because the query no longer represents all the conditions of the question.

Because of its effect on recall and precision, control of the class co-occurrence level has the potential of being an excellent ranking device. We shall return to this possibility later in our discussion of nearness functions.
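As a rough sketch of such a ranking device, documents can be grouped according to the number of query classes they match, with the highest co-occurrence level presented first. The function name and the sample documents are hypothetical.

```python
from collections import defaultdict


def rank_by_matched_classes(documents, classes):
    """Group document ids by the number of query classes they match."""
    ranksets = defaultdict(set)
    for doc_id, words in documents.items():
        level = sum(1 for terms in classes if words & terms)
        if level > 0:
            ranksets[level].add(doc_id)
    # Highest co-occurrence level first.
    return sorted(ranksets.items(), reverse=True)


documents = {
    1: {"contract", "breach", "fraud"},
    2: {"contract", "breach"},
    3: {"contract"},
    4: {"fraud", "contract"},
    5: {"taxation"},
}
classes = [{"contract"}, {"breach"}, {"fraud"}]
ranking = rank_by_matched_classes(documents, classes)
# ranking -> [(3, {1}), (2, {2, 4}), (1, {3})]
```

Reading the ranking from the top corresponds to starting with the full co-occurrence requirement and relaxing it one class at a time.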

[Page 182 ]


10.5 SELECTION OF COMMAND AND SEARCH LANGUAGE: THE QUESTION OF USER EFFORT

10.5.1 Performance and flexibility

Anyone taking a stand against system flexibility is not likely to attract a huge following. In fact flexibility is a goal usually taken for granted and not in need of any further elaboration or justification. The universal acceptance of flexibility is probably partly due to the vagueness of the concept, and partly due to the demonstrated fact that flexibility in the sense of adaptability is an extremely useful quality in a changing world.

Flexibility can be used in connection with various aspects of a retrieval system. We shall use the concept in two different connections. We shall take flexibility to mean both adaptability to different types of problems and adaptability to users with different backgrounds and experiences.

The simplest type of system for the totally inexperienced user to operate is a system with minimal log-on procedures, with no system command language and with natural language query capabilities. The user sits down, presses a special key on the keyboard, which connects him with the system, and proceeds to specify his question in natural language. User-congenial systems of this type are not only possible but have, with slight modifications, been developed, see for example the description of CONTEXT in Vischer (1971), above at section 7.6.

It is especially the ability to process natural language which makes the systems flexible in the sense that both experienced and inexperienced users can operate them. However, queries written in natural language provide relatively little information to the machine. The queries are normally processed by disregarding common words and conducting the search on the remaining significant words. In a natural language query the user has no means of making distinctions between the significant words. Words that are repeated in the query may be given greater weight by the machine. But normally queries are short compared to most texts, and if a word should appear more than once in a query, it will usually be due to accidental features of style or to common-type words that should, in any case, have been deleted from the query by a stop-list, see Bing/Harvold/Kjønstad/Stabell (1976). Once the user deliberately repeats a word in order to give it greater weight, we are no longer dealing with a true natural language query.

Thus, as long as machines do not have language processors with capabilities approaching those of humans, natural language queries can

[Page 183 ]


only be interpreted as a command to retrieve all documents in which at least one of the query words occurs. In addition, the documents may be ranked according to the number of query words that co-occur in each document. This basic scheme may be varied by assigning different weights to words according to, for example, the frequencies of the words in the data base, or according to the frequencies of the words in the individual documents, or according to a combination of the two.
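The processing of a natural language query described above can be sketched in a few lines. The stop-list is illustrative only, and the particular weighting used, which favours words that are rare in the data base, is an assumption; the text leaves the exact form of the weights open.

```python
import math

STOP_WORDS = {"the", "of", "and", "a", "in", "which", "are"}  # illustrative stop-list


def rank_natural_language(query, documents):
    """Rank documents containing at least one significant query word."""
    query_words = {w for w in query.lower().split() if w not in STOP_WORDS}
    n_docs = len(documents)
    # Weight each word by its rarity in the data base (an assumed weighting).
    weights = {}
    for word in query_words:
        doc_freq = sum(1 for words in documents.values() if word in words)
        if doc_freq:
            weights[word] = math.log(n_docs / doc_freq) + 1.0
    scores = {}
    for doc_id, words in documents.items():
        score = sum(weight for word, weight in weights.items() if word in words)
        if score > 0:
            scores[doc_id] = score
    # Best score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


documents = {
    1: {"rights", "of", "the", "child"},
    2: {"parent", "and", "child"},
    3: {"taxation", "of", "grain"},
}
order = rank_natural_language("the rights of the child", documents)
# order -> [1, 2]
```

Document 3 shares only common words with the query and is therefore not retrieved at all once the stop-list has been applied.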

But in today's systems natural language is by no means the ultimate way in which to specify queries. Experience has shown that performance can be improved by the introduction of operators that explicitly define the relationships between the words in the query, see section 10.5.5.

The introduction of operators into the search language requires certain syntactic rules, which the user must master. Even if these rules are simple, the system has now become a little more difficult to operate for the inexperienced. Performance in the form of problem flexibility has been bought at the expense of performance in the form of user effort and user flexibility. In fact at our present level of knowledge, high performance, as measured by recall and precision, seems incompatible with ease of use, at least if we think in terms of a system which is general and not custom-made for special types of problems. Flexibility is thus somewhat of a paradox. As a general goal it remains an elusive dream, since high flexibility with respect to all aspects of the system seems incompatible with the maintenance of a high performance level.

10.5.2 Imperative and responsive dialogues

The communication process between user and machine can be structured according to at least two different principles. Depending on the principle used, the dialogue will, from the point of view of the user, appear as either imperative or responsive.

The philosophy behind responsive dialogues is that even people with no special system experience or knowledge should be able to use the system. Thus a responsive dialogue requires very little initiative on the part of the user. He is guided along the retrieval process by prompts and questions from the system, and most of the time replies of "yes" or "no" will suffice to keep the system guide satisfied and busy. A responsive dialogue might run as illustrated below. The dialogue is based on commands used by STATUS 1, see A.E.R.E. (1975), although other systems, as for example STAIRS, might have been used.

Statements made by the system are italicized in the dialogue.

[Page 184 ]


.

.

.

question please

contract?

118 documents satisfies the question. Do you wish to list titles?

no

question please

contract .and. breach?

22 documents satisfies the question. Do you wish to list titles?

no

question please

contract .and. breach .and. fraud?

3 documents satisfies the question. Do you wish to list titles?

yes

.

.

.

Do you wish to read any of these documents?

yes

.

.

.

We note that a system using a responsive dialogue is user flexible. The system can be used by the experienced and the inexperienced alike. After a while, however, the typical user may tend to find a responsive dialogue somewhat long-winded, and may indeed welcome the change to an imperative dialogue.

The philosophy behind imperative dialogues is that the user has experience, that he knows what he is doing, and that he does not have to be led down the retrieval path by a somewhat dull system. An imperative system requires that the user has acquired mastery of the system command words. In the Norwegian version of STATUS (NOVA*STATUS) there are commands relating to data base selection, query formulation, browsing of titles and texts, query saving, and macro definitions.

Below our previous dialogue is rewritten as an imperative dialogue based on the commands of the NOVA*STATUS system. The dialogue illustrates a query-saving facility whereby earlier queries are inserted at the places where they are referenced in later queries.

[Page 185 ]


.

.

question

1: contract?

118 documents retrieved

question

2: .1..and. breach?

22 documents retrieved

question

3: .2..and. fraud?

3 documents retrieved

titles

.

.

.

read

.

.

We note that the dialogue has been shortened considerably. In environments where systems are in continuous use, the advantages of a shorter and more tidy dialogue may well exceed any drawbacks in connection with experience requirements.
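The insertion of earlier queries at the places where they are referenced can be sketched as a simple textual substitution. The sketch assumes that a reference has the form `.N.`, as in the dialogue above; wrapping the inserted query in parentheses is our own assumption, made to keep its logical structure intact, and is not claimed to be how NOVA*STATUS behaves.

```python
import re


def expand_references(query, saved_queries):
    """Replace each reference of the form .N. with the text of saved query N."""
    def substitute(match):
        # Parentheses preserve the structure of the inserted query (assumed).
        return "(" + saved_queries[int(match.group(1))] + ")"
    return re.sub(r"\.(\d+)\.", substitute, query)


saved = {}
saved[1] = expand_references("contract", saved)
saved[2] = expand_references(".1..and. breach", saved)
saved[3] = expand_references(".2..and. fraud", saved)
# saved[3] -> "((contract).and. breach).and. fraud"
```

Operators such as `.and.` are left untouched, since the pattern only matches a number enclosed in full stops.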

10.5.3 Feedback indicators

Most of the information passed on to the user by the retrieval system will regard either the quantity or the quality of output. Quantity indicators apply to the number of documents retrieved, and quality indicators apply to the relevance of the retrieved documents. The user engaged in fact retrieval will mostly be interested in quantity indicators. Suppose for example that the problem consists of finding all documents where the word "ombudsmann" occurs. By definition the quality of such a search is not of much interest, at least if we assume a correct data base and a correctly functioning system, see section 9.5.1. Of great interest, however, is the number of documents in which the word "ombudsmann" occurs.

The situation is quite the opposite in the case of reference retrieval. What the user really is after now is information about the quality of his search. It is only when such information is not directly available that quantity figures may be used as indirect indicators of performance quality.

[Page 186 ]


The importance of quantity indicators in situations characterized by reference retrieval depends, furthermore, on the type of matching function used. Indeed, if we are using nearness functions which have the ability to assign a unique rank to each document, it is not even meaningful to talk about the number of retrieved documents. The documents are, for all practical purposes, retrieved one at a time, and the user can stop the process at any point he wants. In the cases where a full order is not induced, but where the documents are grouped in ranksets, information on the number of documents in each rankset may be quite useful, however.

What the user really is after, in situations characterized by reference retrieval, is information on recall and precision. However, such information is not easy to come by. Normally a precision figure can be established by evaluating the retrieved documents. Since recall in part depends on the relevant documents in the unretrieved part of the data base, there is normally no practical way for the ordinary user to establish recall. It is possible, however, to estimate recall on the basis of two independently retrieved samples of relevant documents. If, for example, two searches, one manual and one automatic, are performed separately, the resulting two document sets will be independent. But such a procedure is time-consuming and not very practical in user-oriented situations.
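One common way of turning two independent samples into a recall estimate is sketched below. The estimator is the standard overlap (capture-recapture) argument and is offered as an assumption; the text does not state which formula was actually used. The share of the independent sample that the search in question also found is taken as an estimate of the share of all relevant documents it found.

```python
def estimate_recall(retrieved_relevant, independent_relevant):
    """Estimate the recall of one search from an independent sample of relevant documents."""
    if not independent_relevant:
        raise ValueError("the independent sample must not be empty")
    overlap = retrieved_relevant & independent_relevant
    return len(overlap) / len(independent_relevant)


automatic = {1, 2, 3, 4, 5, 6}   # relevant documents found by the automatic search
manual = {4, 5, 6, 7}            # relevant documents found by the manual search
recall_estimate = estimate_recall(automatic, manual)
# recall_estimate -> 0.75
```

The estimate is only as good as the independence assumption; if the two searches tend to find the same "easy" documents, recall will be overestimated.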

As part of the NORIS program, a method of estimating recall, based on assumed independent document sets obtained during the search itself, was developed. The method allows for recall-estimates to be calculated during the search. In the NORIS (8) II report the method was tested and found to yield statistically significant results. However, the experiment was limited in scale, and further development and testing is necessary in order to establish any potential, practical use of the method.

An important part of the feedback process is the ability to browse through the retrieved documents. In on-line systems, browsing not only has the purpose of informing the user regarding the question, but also provides information which can be used to restate the query in an improved form.

10.5.4 Form of output

The way the results of a search are presented to the user is not the least important part of a retrieval system. The different design choices fall, broadly speaking, into two categories.

[Page 187 ]


The first category concerns the types of information which should be made available. Normally a retrieval system will provide the user with information on:

The second category concerns the ways in which the provided information should be presented. The most important media in this connection are:

The specific design solutions to these choices depend a great deal on the type of situation in which the retrieval system functions. In the following we shall mainly consider the situation of the typical lawyer.

Legal retrieval systems are usually made directly available to lawyers on an on-line basis. The lawyer is linked to the system through a terminal, which he operates himself. Of paramount importance to the lawyer are short response times. Terminals are therefore normally equipped with screens, often referred to as CRTs (cathode ray tubes) or VDUs (video display units). All communication between user and system is displayed on the screen. This information is normally lost once it is taken off the screen, but in some systems it is saved and can be redisplayed on a command from the user. There is usually no great need to save that part of the dialogue which concerns commands and feedback indicators. There may be a need to save queries, either for the duration of the session or permanently. In NOVA*STATUS, queries are saved automatically for the duration of the session, and they may also be saved permanently by use of the macro function, see section 10.4.3. The lawyer may want a permanent copy of the references to the documents retrieved in a particular search. This requires a printer that can be operated from the terminal. The best solution is to have a small printer located close to the terminal. An alternative and less costly solution is to use a printer located at the computer center itself. However, the last solution may be unacceptable because of the resulting time delays.

For feedback purposes the lawyer needs immediate access to the text of the retrieved documents. The texts have to be displayed on the screen for this purpose. In order to quickly assess the potential relevance of

[Page 188 ]


documents, the lawyer is primarily interested in the text surrounding the search word. There exist different methods of drawing the user's attention to this part of the text. Highlighting the search word is one such method. So-called focusing is another. When focusing is employed, the text containing the search word is placed in the center of the screen. Perhaps the best method of quickly identifying the context of the search words is to combine focusing and highlighting (KWIC format).
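A combination of focusing and highlighting can be sketched as follows. Each occurrence of the search word is placed in the middle of a short line of context and marked with asterisks, which here stand in for highlighting on a screen; the function name, the window size, and the marker are hypothetical choices.

```python
def kwic(words, search_word, window=3, marker="*"):
    """Return one context line per occurrence, with the search word marked."""
    lines = []
    for i, word in enumerate(words):
        if word.lower() == search_word.lower():
            left = words[max(0, i - window):i]
            right = words[i + 1:i + 1 + window]
            lines.append(" ".join(left + [marker + word + marker] + right))
    return lines


text = "a parent must support the child until the child is of age".split()
contexts = kwic(text, "child", window=2)
# contexts -> ["support the *child* until the", "until the *child* is of"]
```

A wider window gives the user more context per occurrence at the cost of fewer occurrences per screen.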

Often the user will want to browse through all the documents and select a few for more leisurely study later on. If printed copies of the documents are available, he may perform the latter task at his desk, using a reference list provided by the system. Otherwise he will have to display and read the documents on the screen.

10.5.5 Matching functions

A matching function compares query and documents and, on the basis of the comparison, assigns a formal relevance value to each document. Matching functions can be divided into three categories according to the principle used in the matching process. The three types of functions are identity functions, nearness functions, and snowball functions.

The purpose and usefulness of these function types are complementary. Below we shall look at each one in more detail.

(1) Identity functions

In order to appreciate the difference between identity functions, on one hand, and nearness and snowball functions, on the other, it is useful to briefly refer back to our earlier discussion on fact retrieval and reference retrieval. As we remember, the general problem in fact retrieval was to retrieve well-defined and known pieces of information. These "pieces of information" might be a set of documents, for example "all documents containing the word computer". Or they might be more isolated data, for example "the number of supreme court decisions in 1976". In order to solve the fact retrieval problem an identity type function is needed.

Identity functions select documents having the exact attributes specified in the query. Each word specified in the query defines a set of documents having the common attribute of containing the word. The different document sets defined in this way are usually related to each

[Page 189 ]


other by the use of Boolean algebra. The basic concept of Boolean algebra is the concept of "class of objects". Boolean algebra, applied to document retrieval, describes the relationship between sets of documents on the basis of attributes these documents have, or do not have, in common. The basic operations of the algebra of classes are conjunction, disjunction and negation, which, in search languages based on Boolean algebra, correspond to the AND, OR, and NOT operators. The AND operator defines the set of documents which have both the specified attributes. The OR operator defines the set of documents which have either one or both of the specified attributes. The NOT operator defines the set of documents which do not have the specified attribute. For a further introduction to the properties of Boolean operators, see Becker/Hayes (1967: 335-343).
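The correspondence between the Boolean operators and operations on document sets can be shown directly with sets of document identifiers. The toy data base and the function name below are hypothetical.

```python
def documents_containing(word, documents):
    """Return the set of ids of documents that contain `word`."""
    return {doc_id for doc_id, words in documents.items() if word in words}


documents = {
    1: {"contract", "breach"},
    2: {"contract", "fraud"},
    3: {"tort"},
}
all_ids = set(documents)
contract = documents_containing("contract", documents)
fraud = documents_containing("fraud", documents)

both = contract & fraud          # AND: conjunction
either = contract | fraud        # OR: disjunction
not_fraud = all_ids - fraud      # NOT: negation relative to the whole data base
# both -> {2}; either -> {1, 2}; not_fraud -> {1, 3}
```

Note that negation is only meaningful relative to some universe of documents, here the data base as a whole.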

We note that the Boolean operators only apply to document sets. Sometimes the user may be interested in attributes which concern subunits of the documents - for example he wants to retrieve documents where two given words occur next to each other. This cannot be accomplished by Boolean algebra as long as the user has no control over the definition of the document. The so-called positional operator may be used for this purpose, however. By using the positional operator, terms may be defined not only as attributes of documents, but also as attributes of the word position within the document. This makes it possible, for example, to specify the attribute that two given terms shall occur next to each other.
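A positional operator can be sketched as follows, assuming that word positions are simply counted from the start of the text and that "next to each other" means the second term immediately follows the first. The function names `positions` and `adjacent` are hypothetical.

```python
def positions(text):
    """Map each word to the list of word positions where it occurs."""
    table = {}
    for pos, word in enumerate(text.lower().split()):
        table.setdefault(word, []).append(pos)
    return table


def adjacent(text, first, second):
    """True if `second` occurs immediately after `first` in the text."""
    table = positions(text)
    following = set(table.get(second, []))
    return any(pos + 1 in following for pos in table.get(first, []))


# Both texts contain the two words, but only the first has them adjacent.
hit = adjacent("sale of a dangerous article", "dangerous", "article")
miss = adjacent("the article was dangerous", "dangerous", "article")
```

A plain Boolean AND would retrieve both texts; only the positional test separates them.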

(2) Nearness functions

Nearness functions do not retrieve documents on the basis of identity, but on the basis of similarity to the attributes specified in the query. An identity function divides the data base into two groups, one consisting of retrieved documents, the other of non-retrieved documents. A nearness function may impose a full order on the documents, that is, it may assign a unique rank to each document in the data base, or it may impose a partial order whereby documents that have been assigned identical ranks are grouped together in common rank-sets.

Documents may be ranked according to both bibliographical criteria and criteria based on the texts of the documents. Bibliographical criteria include things like author (source type), date, geographic code, and so on. A collection of retrieved legal documents could, for example, be ranked according to the three criteria:

[Page 190 ]


Syntactic criteria include things like number of matched classes, number of matched terms, the frequency with which the matched terms occur in a document, the distribution of the terms, document length, and so on.

Nearness functions are primarily used in connection with reference retrieval. There are mainly two reasons for this.

The first reason is connected with the situation of the user. The user situation may vary greatly in reference retrieval, ranging from the need for a quick look at one or two relevant documents, to the need for a thorough reference file, including even documents of only minor interest. In addition the situation may depend on other factors, such as the experience of the user and his familiarity with the subject matter. The user may also adopt another view of his problem after reading a few retrieved documents. The relevance assessment may change as a consequence. In short, reference retrieval is characterized by a changing and dynamic environment, and a nearness function reflects this reality better than an identity function.

The second reason is connected with the difficulties of giving a reference type problem an adequate query representation. We discussed some of these difficulties earlier in section 10.4. It was pointed out that both empirical results and theoretical analysis emphasize the virtual impossibility of matching all the necessary conditions in all but a few of the relevant documents. In a practical situation the retrieved documents will have different probabilities of being relevant, depending on the number of conditions which have been matched. Identity functions cannot, by their very nature, rank the documents according to relevance probabilities, while nearness functions, on the other hand, can.

It is necessary at this point to comment upon the fact that Boolean search techniques are so widely used, also in connection with reference retrieval, even though they are based on identity functions. The principal reason for this may be not so much based on the superiority of identity functions, as on the very fast response time of on-line systems, which allows the user to continually change and improve his query. The ability to make rapid alterations in the query gives a dynamic quality to the otherwise static identity functions. A nearness function allows the user to move up and down a list of ranked documents, and thus has in itself a

[Page 191 ]


dynamic quality. The usefulness of this quality rests of course on the premise that the nearness function reflects the user's own relevance preferences.

A nearness function measures the similarity between query and document and can be used to rank documents relative to the query. Some nearness functions also have the ability to measure the similarity between texts in general and thus to create "clusters" of documents which are similar. Different types of techniques may be used to express nearness functions.

The simplest technique is based on ranking the documents according to the number of terms each document has in common with the query. The technique may be modified by taking into account the frequency with which the matched words occur in the documents. The assumption is that the thoroughness with which the referenced concept is treated in a document, and hence the probability of potential relevance, is a function of the frequency with which the word occurs in the document. Of course, word frequency is also a function of document length. In order to be absolutely correct, each frequency should be modified in order to adjust for the fraction of the frequency which, on the average, is due to the length of the document.
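The simple technique and its frequency modification might be sketched as below. Dividing the frequency by the document length is only one possible adjustment for length, chosen here as an assumption; the text does not prescribe a particular formula.

```python
def rank(query_terms, documents):
    """Rank documents by number of matched terms, then by relative frequency."""
    scored = []
    for doc_id, text in documents.items():
        words = text.lower().split()
        matched = [t for t in query_terms if t in words]
        frequency = sum(words.count(t) for t in matched)
        # Assumed length adjustment: frequency per word of document text.
        relative = frequency / len(words) if words else 0.0
        scored.append((len(matched), relative, doc_id))
    # Highest number of matched terms first, ties broken by relative frequency.
    scored.sort(key=lambda s: (-s[0], -s[1], s[2]))
    return [doc_id for _, _, doc_id in scored]


example_docs = {
    "a": "car accident car",
    "b": "car insurance policy text",
    "c": "weather report",
}
ranking = rank(["car", "accident"], example_docs)
```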

A more sophisticated technique is the so-called weighted-term technique, which allows the user both to group the terms in the query and to assign weights to the groups. The technique can be used to rank the documents, not according to the matched terms, but according to the matched groups. If a document thus contains terms from several groups, it is given a rank corresponding to the combined weights of these groups. The same group is never counted more than once, however, even if several of its terms are matched in the document. We note that if all groups are assigned equal weights, documents are ranked according to the number of groups matched in a document. The user can also vary the relative importance of the groups by shifting their weights. Take the following example:
Group 1 (weight 4): car, automobile, motor vehicle
Group 2 (weight 2): accident, collision
Group 3 (weight 1): insurance

[Page 192 ]


Documents containing words from all three groups are ranked on top. But we note that a document containing words from only the first group is ranked before a document containing words from both the second and third groups.
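The example can be reproduced with a short Python sketch. The three groups and their weights come from the example; the function names `matches` and `score` and the whole-word matching rule are assumptions.

```python
import re

# The three term groups of the example, each with its weight.
groups = [
    ({"car", "automobile", "motor vehicle"}, 4),
    ({"accident", "collision"}, 2),
    ({"insurance"}, 1),
]


def matches(term, text):
    """True if the term occurs in the text as a whole word or phrase."""
    return re.search(r"\b" + re.escape(term) + r"\b", text) is not None


def score(text, groups):
    """Sum the weights of the matched groups; each group counts only once."""
    text = text.lower()
    return sum(weight for terms, weight in groups
               if any(matches(term, text) for term in terms))


first_group_only = score("car and automobile sales", groups)
second_and_third = score("collision insurance", groups)
all_three = score("car accident insurance claim", groups)
```

With these weights a document matching only the first group scores 4, while a document matching the second and third groups scores 3, which gives the ordering described above.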

It can be demonstrated that the weighted-term technique is a very flexible tool, which is capable of expressing the same logical relationships as Boolean algebra, see Sommar/Dennis (1969).

In the case where each term is considered as a group by itself, the technique is identical to our simple term-matching technique, with the exception that now the user can assign weights to the terms. This type of matching technique makes it possible to represent the documents as vectors, that is as ordered arrays of terms where each distinct word in the document collection is associated with a given term. Briefly speaking, vectors are constructed in the following way. If there are n distinct words in the document collection, there will be n terms in the vector space. The number assigned to the term in the vector space determines the place of the corresponding term in a vector. The value of a term in a vector is determined by the weight assigned to the word. If the word does not occur in the document, the value of the term is zero. If the word is present, any weight may in principle be assigned to it. The most common method is to assign values corresponding to the word frequencies in the documents.

Several measures of nearness between two vectors may be defined. Perhaps the simplest one sums the products of the vector terms and is given by the scalar product of the vectors a and b. The scalar product is defined as:

<a, b> = a_1 b_1 + a_2 b_2 + ... + a_n b_n

If the weights assigned to the terms are restricted to 0 and 1, the scalar product gives the number of terms the two vectors have in common, and could of course also have been arrived at by other means.

The scalar product does not work very well if the documents are of widely different lengths. A measure which does make adjustments for differences in document length is the cosine function, defined as:

cos(a, b) = <a, b> /(||a|| · ||b||)

where ||a|| is called the length of the vector a:

||a|| = √(a_1^2 + a_2^2 + ... + a_n^2)
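The two measures can be written out in a few lines of Python. This is a minimal sketch in which vectors are plain lists of term weights; the function names are illustrative.

```python
import math


def scalar_product(a, b):
    """Sum of the products of corresponding vector terms."""
    return sum(x * y for x, y in zip(a, b))


def length(a):
    """Length of the vector: square root of the sum of squared terms."""
    return math.sqrt(scalar_product(a, a))


def cosine(a, b):
    """Scalar product divided by the product of the two vector lengths."""
    denominator = length(a) * length(b)
    return scalar_product(a, b) / denominator if denominator else 0.0


# With 0/1 weights the scalar product counts the terms in common.
common = scalar_product([1, 0, 1, 1], [1, 1, 0, 1])

# Doubling every frequency changes the scalar product but not the cosine.
same_direction = cosine([1, 2, 0], [2, 4, 0])
```

The second example shows the length adjustment: a document that repeats the same words twice as often gets a larger scalar product but the same cosine value.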

[Page 193 ]


The cosine function measures the angle between two vectors in vector space, and this figurative interpretation is often used to illustrate its properties as a nearness function. Well-known systems based on vector representation of documents and employing the cosine function are the CONTEXT system, see Vischer (1971) and above at section 7.6, and the SMART system, see Salton (1971). Vector systems have also been developed for experimental purposes at for example Kent University in Canterbury and at the Norwegian Research Center for Computers and Law.

For examples of nearness functions in addition to the cosine function see Soergel (1966).

Even if the vector technique in principle seems well suited to reference retrieval, it suffers, as do almost all techniques, from certain disadvantages. Perhaps the main limitation derives from the fact that different terms within a vector cannot be given class attributes. This means that the user cannot specify in the query which concepts the different words represent. It is not possible to remedy this deficiency by assigning different weights to the terms in the query, because weights alone do not give class attributes to the terms. It can be demonstrated that this inability to group synonyms may cause the vector function to return needlessly inferior results, see Harvold (1976). The only way to solve this particular problem is to do the grouping of synonyms in the document vectors by assigning the same term to words that are synonyms. Ideally the grouping should be done on the basis of the query, since this is the only way in which context-dependent synonyms can also be taken into consideration. But the re-generation of all document vectors for each new query is clearly not a practical alternative. Context-independent synonyms may, however, be grouped once and for all at the time the document vectors are generated. Vectors where synonyms have been grouped are called concept vectors, see Williamson/Williamson/Lesk (1969). In concept vectors, synonyms are represented by a single term, and as a result information about distinct words in the synonym group is lost. It is possible to retain information about distinct words, and at the same time describe the necessary logical relationships which a grouping of synonyms requires, by letting each document be represented by a lattice, where each word is represented by a node in the lattice. See Becker/Hayes (1967: 341-343).

The above discussion applies to the vector method in general. There are also some disadvantages which can be attributed to the cosine function

[Page 194 ]


in particular. The factor in the cosine function which may be a cause of trouble is ||a||, which computes a value for the length of the vector. The value is peculiar to each document, and the smaller it is, the higher a document will be ranked, other things being equal. This means that a document will be favored if it is short. And if two documents are of equal length, the one in which the distinct words have low and equally distributed frequencies will be favored. Thus generally speaking a document dealing with only one subject will be ranked before other documents which also deal with other subjects and presumably are longer. Clearly this is a ranking criterion that might not always reflect the true relevance value of the documents. Soergel (1966: 164) mentions yet another disadvantage related to the vector method. Assume that the user wants to make an exhaustive search on the terms A, B, and C. For the sake of exhaustiveness low weights are assigned to all three terms. In this case documents in which A, B, and C have equally low weights will be ranked before documents in which, for example, A and B have high weights and C a low weight, this in spite of the fact that the latter documents probably have higher interest.

Salton, who was part of the pioneering team developing the vector method, reports favorably on its performance in a long list of publications, many of which are assembled in Salton (1971). The possibility of weighting the vector terms proved especially effective in improving retrieval results, see Salton/Lesk (1968).

It should be emphasized that the general version of the weighted-term technique is not characterized by the same disadvantages as the more special vector method. But even so the weighted-term technique may not always have the same user appeal as the better known Boolean technique. The so-called conceptor technique represents an attempt to combine features from both the Boolean and the weighted-term techniques. This technique introduces a new operator - the COR operator - which can be used in combination with the traditional Boolean and positional operators. The COR operator has an executive priority lower than any of the other operators. It merges the document sets found on the basis of its operands and ranks the documents on the basis of two criteria - the primary criterion being the number of matched classes and the secondary criterion being the frequency of the matched words. The conceptor technique was developed especially for text retrieval, and this is the reason for the secondary ranking on word frequency, a feature which is not part of the weighted-term technique.

[Page 195 ]


We shall illustrate conceptor searching by considering a query consisting of the three following classes of words:

seller
sale
sold

liability
liable

dangerous article
accident

The conceptor technique as implemented in the NOVA*STATUS system offers the user two alternative ways of expressing the query. The query can be expressed by filling out the terms in each class, as follows:

C1: seller, sale, sold
C2: liability, liable
C3: dangerous article, accident

The method corresponds to a weighted-term representation, except that the weights are not explicitly assigned by the user. The query can also be expressed by explicit use of the COR operator:

seller, sale, sold .COR. liability, liable .COR. dangerous article, accident.

The comma is an alternative way of expressing the OR operator. "OR" could have been used if the user had so desired.

A single Boolean statement cannot be used to arrive at the same result. By the use of three statements, however, we can retrieve the document sets where three, two, and one classes have been matched respectively. The three Boolean statements are:
S1: c1 AND c2 AND c3
S2: ((c1 AND c2) OR (c1 AND c3) OR (c2 AND c3)) NOT S1
S3: (c1 OR c2 OR c3) NOT (S1 OR S2)

where, in order to simplify the expressions, the three term classes have been substituted by c1, c2 and c3 respectively. Even so the statements are rather awkward, and would become even more so in the case of a query consisting of four or more classes.
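The ranking performed by the COR operator can be sketched in Python as follows. This is an illustration of the two ranking criteria only, not the NOVA*STATUS implementation; the function name `cor_rank` is hypothetical, and for simplicity the sketch handles single-word terms only, so the two-word term "dangerous article" is left out of the third class.

```python
def cor_rank(classes, documents):
    """Rank by number of matched classes, then by frequency of matched words."""
    ranked = []
    for doc_id, text in documents.items():
        words = text.lower().split()
        # Number of occurrences of each class's terms in the document.
        hits = [sum(words.count(term) for term in cls) for cls in classes]
        matched_classes = sum(1 for h in hits if h > 0)
        if matched_classes:  # documents matching no class are not retrieved
            ranked.append((matched_classes, sum(hits), doc_id))
    ranked.sort(key=lambda r: (-r[0], -r[1], r[2]))
    return [(doc_id, n) for n, _, doc_id in ranked]


classes = [
    {"seller", "sale", "sold"},
    {"liability", "liable"},
    {"accident"},
]
documents = {
    1: "the seller is liable for the accident",
    2: "the seller sold it after the sale",
    3: "liability of the seller",
    4: "weather report",
    5: "accident",
}
result = cor_rank(classes, documents)
```

Each pair in `result` gives a document and the number of classes it matched. Documents with three, two and one matched classes correspond to the sets defined by the three Boolean statements, and within the last set document 2 precedes document 5 because of its higher word frequency.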

[Page 196 ]


(3) Snowball functions

The essential feature of a snowball function is that it will retrieve not only documents which are of varying similarity with respect to the query, but also documents which are of varying similarity with respect to the documents already retrieved.

The process operates in the following way. A search is performed on the basis of the original query. The documents which are retrieved are then used to automatically create a new query (or queries depending on the method used), and the cycle is repeated. A snowball function may well be constructed around the vector method, although any method based on term matching can in principle be used.

While it is perfectly possible to use the snowball technique for text retrieval in the manner just described, the technique can also be used for the purpose of tracing citations - an application of practical importance. There are two possible versions of citation systems.

In one version the system will retrieve the documents which are cited by the source document. In the other version the system will retrieve the documents which themselves cite the source document. The first method is thus used to retrieve documents which are older than the source document, while the second method is used to retrieve documents which are younger, see Loosjes (1973). The snowball function can, of course, be used in both versions.

Snowball functions are by nature explosive in the sense that the number of retrieved documents tends to grow exponentially. However, there are various ways to limit the quantity of output. The most obvious way is to limit the number of cycles. This is a necessity in any case, or else the snowball function would go on forever. Another way, which primarily is applicable to text systems, is to increase the threshold value of the matching function for each cycle. Thus in the first cycle relatively many documents are retrieved while in the second cycle fewer documents are retrieved because of the more exacting matching criteria used.
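The cycle and the two limiting devices can be sketched as follows. The matching function used here, the number of distinct words two texts share, is an assumption chosen for brevity; any term-matching method could take its place. The names `overlap` and `snowball` and the parameters `max_cycles`, `threshold` and `step` are hypothetical.

```python
def overlap(a, b):
    """Number of distinct words two texts have in common."""
    return len(set(a.split()) & set(b.split()))


def snowball(query, documents, max_cycles=2, threshold=1, step=1):
    """Retrieve in cycles; the documents found act as the next queries."""
    retrieved = {}  # document id -> cycle in which it was retrieved
    queries = [query]
    for cycle in range(1, max_cycles + 1):
        found = [doc_id for doc_id, text in documents.items()
                 if doc_id not in retrieved
                 and any(overlap(q, text) >= threshold for q in queries)]
        if not found:
            break
        for doc_id in found:
            retrieved[doc_id] = cycle
        queries = [documents[doc_id] for doc_id in found]
        threshold += step  # more exacting matching criterion in the next cycle
    return retrieved


collection = {
    1: "car accident",
    2: "car accident insurance claim",
    3: "insurance claim form",
    4: "weather",
}
two_cycles = snowball("car", collection)
one_cycle = snowball("car", collection, max_cycles=1)
```

Document 3 shares no word with the original query and is reached only in the second cycle, through document 2, even though the threshold has by then been raised to two common words.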

[Page 197 ]


11 Research regarding the performance of retrieval systems

11.1 INTRODUCTION

The science of information retrieval largely lacks a comprehensive theoretical foundation. Part of the explanation of this state of affairs can probably be found in the fact that information retrieval, like other computer-dependent sciences, is a relatively young science. Part of the explanation can also be found in the nature of information retrieval itself, and especially in the problems related to the concept of relevance. The main questions concerning the type, nature, and grading of relevance are central in any attempt to measure the performance of a retrieval system, but as yet, there has been no accepted general way of treating them.

Part of the explanation can also be traced to the relatively slow progress made in the field of artificial intelligence. Many of the problems of information retrieval depend in a fundamental sense on the ability to understand text, something which machines cannot do - yet. And it is a sobering thought that even if we had in our possession a clever machine capable of text comprehension, we would still be left with many of the problems associated with subjective relevance. For the time being, we are left to deal not only with the problems of relevance, but also with the problems connected with retrieving documents according to purely syntactic criteria.

There are quite serious practical limitations imposed on empirical investigations of retrieval performance. Most of these limitations are again caused by problems connected with relevance.

In order to measure retrieval performance, the relevance of both the retrieved and the non-retrieved documents must be evaluated. Essentially this means that every document in the test data base must be read and evaluated by a juror. Ideally only one juror should be used, since experience shows that different jurors arrive at quite divergent results, depending on their familiarity with the problems and their general

[Page 198 ]


backgrounds, see section 11.3.1. Differences in the relevance assessment can probably be reduced if the jurors are instructed to look for content relevance and not subjective relevance, and if they are given guidelines for assessing content relevance. However, as long as there is some element of judgement involved, the opinions of jurors will, and should, differ. Even a single juror will normally change his assessment over time. We must remember that the assessment process is also a learning process. By reading the documents the juror gains new insight, which may cause him to reconsider some of his earlier evaluations.

The uncertainty associated with relevance assessment need not, however, be a serious obstacle to the general credibility of the results of retrieval system evaluations. Lesk/Salton (1968) report that in an experiment designed especially to test the question no significant relationship was found between differing relevance assessments and the evaluation results. There is a sound reason for such a result. In view of the definitions of the recall and precision ratios, it is likely that uncertainty regarding the relevance of marginal documents will, on the average, affect the numerator and denominator of each ratio proportionately and thus leave the values of the ratios unchanged. Should a shift in the relevance assessment be caused by a juror's misunderstanding of the question, however, the mistake would very likely show up as a corresponding shift in the performance figures.
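For reference, the two ratios can be stated in a few lines of Python, using the standard definitions: recall is the share of the relevant documents that were retrieved, and precision is the share of the retrieved documents that are relevant. The sets in the example are invented for illustration.

```python
def recall(retrieved, relevant):
    """Relevant documents retrieved, divided by all relevant documents."""
    if not relevant:
        return None  # undefined when the question has no answer set
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Relevant documents retrieved, divided by all retrieved documents."""
    if not retrieved:
        return None  # undefined when nothing was retrieved
    return len(retrieved & relevant) / len(retrieved)


retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5}
r = recall(retrieved, relevant)
p = precision(retrieved, relevant)
```

In both ratios the numerator is the set of relevant documents retrieved, which is why a change in the assessment of marginal documents tends to move numerator and denominator together.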

Another and more serious problem connected with relevance is that, in order to get accurate values for recall, it is necessary to manually assess the relevance of all documents in the data base. This imposes severe practical limits on the size of the test data base. This is an unfortunate situation, since we have no reason to believe that the effect which the different system factors have on performance can be extrapolated linearly. Two different solutions to the problem suggest themselves. One solution is to estimate recall on the basis of the results from two independent searches. The other solution involves constructing general models of the retrieval system in which the size of the data base is one of the variables. A model of this type has several other advantages as well. It can be used to gain insight into how the factors which determine performance interact, and into the relative importance of each factor. If the model is flexible in the sense that the parameters of the model can be set to reflect varying retrieval situations, it will also be possible to predict performance in extreme situations - situations which might appear in the future, but which it is out of the question to test empirically for the time being, cf. Harvold (1976).

[Page 199 ]


11.2 GENERAL RESEARCH

We now turn our attention to a survey of eight experimental projects in which various aspects of retrieval performance have been tested. We do not at all pretend that our survey is complete. Our selection reflects both our familiarity with the projects and our special interest in legal retrieval systems.

11.2.1 The Aslib-Cranfield Projects: 1960-1966

The aim of the first Aslib-Cranfield project was to investigate the operational performance of four different indexing systems. The project is described by Cleverdon (1960), (1962) and by Aitchison/Cleverdon (1963).

The most important result of the project was probably not its findings - the general conclusion was that the four systems, the Universal Decimal Classification, a facet classification, an alphabetical subject catalogue, and the Uniterm system of co-ordinate indexing, all operated at about the same level of effectiveness. In the course of the project, however, valuable experience was gained in the testing of retrieval systems, and a methodology was developed which later investigations to a large extent have built upon. Performance was for the first time measured by the following criteria: coverage, recall, precision, response time, presentation and user effort.

These criteria have later become classical and widely used in a variety of retrieval experiments. Coverage, response time, presentation and user effort measure the quality of the operational features of the system, while recall and precision measure the quality of the individual retrieval results.

On the basis of his tests Cleverdon suggested that a retrieval system is made up of a basic vocabulary and a number of retrieval devices. The retrieval devices are made up of recall and precision devices. Examples of recall devices are:

Examples of precision devices are:

This conceptual apparatus was used in the second Cranfield project to investigate retrieval devices in isolation and in all practical combinations

[Page 200 ]


in order to measure the effect of each device on performance. The project is documented by Cleverdon/Mills/Keen (1966) and Cleverdon (1967).

During the tests, factors affecting both indexing and search strategy were varied. The test conditions were stringently controlled, and only one factor was varied at a time. Three main types of index languages were investigated. The first type was based on single terms selected from the natural text of the documents (e.g. axial, flow, compressor). The second type was based on concepts selected from the natural text of the documents (e.g. axial flow compressor). The third type was based on various groupings of a set of controlled terms. Search strategies used included

The test data base consisted of 1400 research papers mainly in the field of aerodynamics. The number of questions used was 221. The retrieved documents were ranked according to term co-occurrence level, and a normalized recall figure was calculated for each test.

The results, as it turned out, were unexpected at the time, although they seem more reasonable today. The best indexing language turned out to be the single term natural language, which consisted of words selected from the texts of the documents. The best results were furthermore obtained when the endings of these terms were confounded. Slightly inferior results were obtained when synonyms were grouped or the terms used in their natural language form. Any further reduction in specificity, for example, by grouping synonyms or using hierarchies, resulted in reduced performance. The use of all other index languages gave inferior results. Languages based on controlled terms performed, on the average, better than concept languages, but were still inferior to the single term language.

The results seem to indicate that the best results are obtained when preprocessing, in the way of standardizing the documents, is kept to a minimum.

The only preprocessing of the documents which will improve performance is the grouping of context-independent synonyms - the so-called true synonyms. Grouping of context-dependent synonyms (quasi synonyms) or any other form of standardization, like the use of controlled vocabularies, hierarchies, or concepts, will result in inferior results, presumably because standardization at this stage implies that information is deleted from, not added to, the documents. It is the

[Page 201 ]


context-dependent synonyms that are the main cause of trouble. By definition, however, these cannot be grouped once and for all, but can, if information is not to be lost, only be grouped in the query itself.

11.2.2 The SMART experiments: 1964-1971

One of the most comprehensive research programs dedicated to investigating the performance of retrieval systems was started in the early 1960s at Cornell University. The heart of the program is the SMART document retrieval system, which has been operating since 1964. The SMART system has been used to test a variety of retrieval procedures, including both automatic indexing techniques and search techniques. The results are documented in a long list of publications, many of which have been collected in Salton (1971). Salton/Lesk (1968) also give a good survey of the main experimental results.

The results of the second Cranfield project seemed to indicate that automatic indexing techniques could perform just as well as, if not better than, manual techniques. The tests showed that the best results were obtained when the indexing terms were selected from the words in the documents and when the terms were normalized by the confounding of word endings.

In the course of the SMART project several additional dictionaries for the purpose of normalizing the vocabulary were studied:

The results indicated that in general the preprocessing of documents may improve performance, but not in any dramatic way.

The use of the word-stem dictionary compared to the

[Page 202 ]


suffix "s" dictionary improved performance slightly for the two collections IRE-3 and ADI, but not for the Cranfield collection.

The use of synonym dictionaries compared to the word-stem dictionary resulted in relatively better and statistically significant improvements. However, separate dictionaries were manually constructed for each collection, and it is hard to say how relevant the results are if we think in terms of large, heterogeneous and rapidly changing data bases. We also note that the dictionaries did not include context-dependent synonyms.

The possibility of constructing synonym dictionaries by fully automatic methods was also investigated. Roughly speaking, such a dictionary can be constructed by computing term similarity coefficients based on the co-occurrence characteristics of the terms, either in the whole document collection or in a sample subset of the collection. Semiautomatic methods may also be used. Here the word relationships are defined by an expert, but use may be made of, for example, word frequency lists and KWIC lists in addition to various other possible aids.

A phrase dictionary may be useful in situations where the individual words of the phrase, taken by themselves, are so common that they have little or no retrieval value. The classical example is the phrase "computer control". In computer literature the words "computer" and "control" are by themselves almost meaningless, while "computer control" has a specific meaning. The use of phrases can therefore possibly improve performance. The addition of the phrase dictionary, however, did not significantly improve the performance of the synonym dictionary. The experiment was based on a dictionary of important phrases, which was preconstructed by manual means. It is also possible to use automatic methods for the purpose of identifying phrases. The basic principle is to identify the phrases on the basis of the number of times the words occur together. The co-occurrence level may be varied by using different cutoff levels. A dictionary based on the statistical association method did not significantly improve performance, however.
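The basic principle of automatic phrase identification can be sketched as follows. The sketch assumes that "occurring together" means standing next to each other in the text, which is only one possible reading; the function name `candidate_phrases` and the parameter `cutoff` are hypothetical, and the sketch is not the procedure used in the SMART experiments.

```python
from collections import Counter


def candidate_phrases(documents, cutoff=2):
    """Return adjacent word pairs occurring together at least `cutoff` times."""
    pairs = Counter()
    for text in documents:
        words = text.lower().split()
        pairs.update(zip(words, words[1:]))  # count each adjacent pair
    return {pair for pair, count in pairs.items() if count >= cutoff}


texts = [
    "computer control systems",
    "computer control of flow",
    "control theory",
]
phrases = candidate_phrases(texts, cutoff=2)
loose = candidate_phrases(texts, cutoff=1)
```

Raising the cutoff level shrinks the dictionary to the pairs that co-occur most often; lowering it admits every adjacent pair in the collection.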

A hierarchical dictionary can be used to expand a query by:

Using a hierarchical dictionary constructed for the IRE collection, both documents and queries were expanded by use of the dictionary. When compared to the use of the synonym dictionary, performance was

[Page 203 ]


inferior as regards expansion by brothers, sons and cross-references. Expansion by parent was generally more successful, but even here performance was not significantly improved.

The cluster technique is an extension of the use of the nearness matching function. The technique can be used to divide the data base into areas, which might correspond more or less to traditional subject areas, and to restrict a search to one or more of these areas. The following overall strategy is normally used.

Clusters are constructed by matching every document with every other document and by grouping those documents which are sufficiently similar. For each cluster, a representative document, sometimes known as the centroid document, is used to represent all the documents in that cluster. The search itself proceeds in two steps. First the query is compared to all centroids of all clusters; then the query is compared to the documents located in clusters with highly matching centroids.
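The two-step search can be sketched in Python as follows. The sketch assumes that the clusters have already been constructed, that documents are represented as term vectors compared with the cosine function, and that the centroid is the average of the vectors in the cluster; all names and the parameter `top_clusters` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Average vector, used here as the representative of a cluster."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster_search(query, clusters, top_clusters=1):
    """Step 1: rank clusters by centroid. Step 2: rank documents in the best ones."""
    ranked = sorted(
        clusters,
        key=lambda c: cosine(query, centroid(list(c.values()))),
        reverse=True,
    )
    candidates = {}
    for cluster in ranked[:top_clusters]:
        candidates.update(cluster)
    return sorted(candidates, key=lambda d: cosine(query, candidates[d]),
                  reverse=True)


clusters = [
    {"d1": [1, 1, 0], "d2": [2, 1, 0]},
    {"d3": [0, 0, 1], "d4": [0, 1, 2]},
]
best = cluster_search([1, 0, 0], clusters)
```

Only the documents of the selected clusters are compared with the query, so a relevant document located in a cluster with a poorly matching centroid is never examined.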

The cluster technique was tested by comparing the results of a cluster search to a full search using two document collections of 82 and 200 documents respectively. It was found that even in the case of such extremely small collections, high recall values could not be attained at all with the cluster technique. The middle recall ranges could be reached, but at lower precision values than were obtained in the corresponding full search. The cluster technique will thus not improve performance, but might nevertheless be of use as a method of reducing search costs. Another factor affecting the possible use of the technique is that document clusters will deteriorate as the data base is updated. This factor may limit the use of clusters to relatively stable data bases.

The SMART system was also used to compare so-called entry vocabularies - which are different methods of representing documents for retrieval purposes. Three different entry vocabularies were compared - namely representation by:

As expected, performance improved as the entry vocabulary was expanded, although the improvement from abstract to full text was not as significant as the improvement from title to abstract. It was concluded that full text representation may not always be superior to the use of abstracts in terms of effectiveness.

[Page 204 ]


11.2.3 The MEDLARS evaluation: 1966-1967

The MEDLARS evaluation represented something new in retrieval experimentation. For the first time the analysis of retrieval performance was broadened to also include an analysis of the causes of retrieval failure. This is a kind of hindsight analysis involving manual examination of:

The analysis is time-consuming, but valuable, since it not only tells us to what degree a search failed, but also why it failed. The MEDLARS evaluation is described by Lancaster (1968b) and (1969).

The MEDLARS data base consisted of more than 800 000 citations from biomedical journal articles published prior to January 1964 and all subsequent issues of the monthly Index Medicus. The data base was too big for the answer sets to be found by manual means. Recall was estimated by obtaining two independent samples of relevant documents. One sample was obtained by MEDLARS, another sample was obtained by other means, for example through a local librarian, through the professional knowledge of the searcher or his colleagues, or in any other way independent of MEDLARS.
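The sampling idea can be made concrete with a small sketch. The text does not spell out the formula, so the code below assumes that recall is estimated as the share of the independently found relevant documents that the system search also retrieved; the function name and the document identifiers are hypothetical.

```python
def estimate_recall(independent_relevant: set, retrieved: set) -> float:
    """Estimate recall from a sample of relevant documents found
    independently of the system under test (assumed estimator)."""
    if not independent_relevant:
        raise ValueError("the independent sample must not be empty")
    found_by_system = independent_relevant & retrieved
    return len(found_by_system) / len(independent_relevant)


# Hypothetical example: four relevant articles known from other sources,
# two of which also appear in the output of the system search.
known_relevant = {"a", "b", "c", "d"}
system_output = {"a", "c", "x", "y"}
recall_estimate = estimate_recall(known_relevant, system_output)
```

The estimate is only as good as the independence of the two samples: if the outside sources tend to find the same documents as the system, recall will be overestimated.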

A total of 302 searches were completely analyzed, and recall and precision ratios were obtained for 299 of these (3 questions had no answer set). Overall recall and precision ratios were 57.7 and 50.4 per cent respectively, and there were 797 cases of recall failure (relevant articles that were not retrieved) and 3 038 cases of precision failure (irrelevant documents that were retrieved).

The failures were analyzed in terms of:

The terms apply to both indexing of a document and the formulation of a query. Exhaustivity means the extent to which all the concepts in the document (request) are covered. Specificity means the generic level at which a concept is represented. The entry vocabulary refers to the document text from which the document is indexed. The entry vocabulary may be the title, an abstract, or the full text of the document. In the

[Page 205 ]


case of a full-text system, the entry vocabulary will be identical to the index terms.

A detailed account of all the results will not be given here. It is interesting to note, however, that system-dependent factors, including the inadequacy of the index language, of the index, and of the user-system interaction, accounted for 74 per cent of total recall failures and 65.6 per cent of total precision failures. Searching factors accounted for 35 per cent of total recall failures and 32 per cent of total precision failures. The totals add up to more than 100 per cent because the same failure may be caused by more than one factor. For a more detailed account of the failures see Tables 11/1 and 11/2.
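The totals quoted above can be checked against the per-cause percentages in Lancaster's two tables, reproduced later in this section. The sketch below is only an arithmetic check; the grouping of index language, indexing, computer processing and user-system interaction as "system-dependent" is an assumption, chosen because it reproduces the quoted figures.

```python
# Percentages copied from the recall-failure and precision-failure tables.
recall_failures = {
    "index language": [10.2],
    "searching": [21.5, 8.4, 2.5, 2.6],
    "indexing": [5.8, 20.3, 9.8, 1.5],
    "computer processing": [1.4],
    "user-system interaction": [25.0],
}
precision_failures = {
    "index language": [17.6, 11.3, 6.8, 0.3],
    "searching": [15.2, 11.7, 4.3, 1.1],
    "indexing": [11.5, 1.4],
    "user-system interaction": [16.6],
    "computer processing": [0.1],
    "value judgements": [2.3],
    "inevitable retrieval": [0.1],
}

# Assumed grouping of the "system-dependent" factors.
SYSTEM_GROUPS = ("index language", "indexing", "computer processing",
                 "user-system interaction")


def group_total(table: dict, groups) -> float:
    """Sum the percentages of the named groups, rounded to one decimal."""
    return round(sum(sum(table[g]) for g in groups), 1)


system_recall = group_total(recall_failures, SYSTEM_GROUPS)
search_recall = group_total(recall_failures, ("searching",))
system_precision = group_total(precision_failures, SYSTEM_GROUPS)
search_precision = group_total(precision_failures, ("searching",))
```

Under this grouping the system-dependent and searching shares for recall sum to 109 per cent, which illustrates the remark that a single failure may be attributed to more than one factor.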

The evaluation results are especially interesting in the light of the retrieval systems that are available today. Had MEDLARS been an on-line, full-text system, most of the retrieval failures caused by the system-dependent factors would have been avoided. Performance would have been significantly improved. In fact if we use the average performance figures given for the evaluation, recall would increase from about 60 per cent to about 85 per cent and precision would increase from about 50 per cent to about 75 per cent - a result which would have been surprisingly good.

Table 11/1

Reasons for recall failures in the MEDLARS evaluation (from Lancaster (1969))

Index language
- lack of appropriate specific terms 10.2%
Searching
- all reasonable approaches not covered 21.5%
- query too exhaustive 8.4%
- query too specific 2.5%
- other 2.6%
Indexing
- insufficiently specific 5.8%
- insufficiently exhaustive (topics) 20.3%
- important concept omitted 9.8%
- other 1.5%
Computer processing 1.4%
Inadequate user-system interaction 25.0%

[Page 206 ]


Table 11/2

Reasons for precision failures in the MEDLARS evaluation (from Lancaster (1969))

Index language
- lack of appropriate specific terms 17.6%
- false co-ordinations 11.3%
- incorrect term relationships 6.8%
- defective hierarchical structure 0.3%
Searching
- not specific 15.2%
- not exhaustive 11.7%
- inappropriate terms 4.3%
- inappropriate logic 1.1%
Indexing
- exhaustive 11.5%
- other 1.4%
Inadequate user-system interaction 16.6%
Computer processing 0.1%
Value judgements 2.3%
"inevitable" retrieval 0.1%

11.2.4 The "Comparative Systems Laboratory Experiments" Project: 1963-1968.

At Case Western Reserve University a rather large project was undertaken in the mid-sixties in order to investigate the relationship between the variable components of retrieval systems and performance. The project is documented by Saracevic et al. (1968) and by Saracevic (1970).

The components of a retrieval system were described in terms of the purpose and the function of the system. The purpose of a retrieval system was subdivided into:

while the function of the system was subdivided into:

[Page 207]


The data base used in the experiment consisted of 600 documents selected from the 1960 volume of Tropical Diseases Bulletin (indexed in five languages). On the basis of 124 questions, 4 448 queries were submitted for searching. Answer sets to the questions were established by asking the users to evaluate the retrieved documents. The non-retrieved documents were evaluated by a separate expert, who tried to interpolate the relevance judgements of the users. It turned out that of the 124 questions only 63 had relevant answers.

"Sensitivity" and "specificity" were used to measure performance. Sensitivity (Se) was defined in an identical way to recall, while specificity (Sp) was defined as the ratio of the number of non-relevant documents not retrieved to the total number of non-relevant documents in the data base. Effectiveness (E) was defined as: E = Se + Sp - 1
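As a worked illustration of these three measures, the following sketch computes Se, Sp and E from sets of document identifiers. The function name and the ten-document collection are hypothetical; only the definitions come from the text.

```python
def sensitivity_specificity(collection: set, relevant: set, retrieved: set):
    """Return (Se, Sp, E) with E = Se + Sp - 1, as defined in the text."""
    non_relevant = collection - relevant
    se = len(relevant & retrieved) / len(relevant)          # same as recall
    sp = len(non_relevant - retrieved) / len(non_relevant)  # non-relevant correctly left out
    return se, sp, se + sp - 1


# Hypothetical collection of ten documents, four of them relevant.
collection = set(range(10))
relevant = {0, 1, 2, 3}
retrieved = {0, 1, 2, 7}
se, sp, e = sensitivity_specificity(collection, relevant, retrieved)

# Retrieving the whole collection gives Se = 1 but Sp = 0, hence E = 0.
_, _, e_all = sensitivity_specificity(collection, relevant, collection)
```

Unlike precision, Sp depends on the number of non-relevant documents in the whole data base, so in a large collection with few relevant documents Sp stays close to 1 even when many irrelevant documents are retrieved.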

A main purpose of the experiment was to investigate the relative effectiveness of various indexing languages. The effectiveness of the index languages compared to full text was not investigated however. Of greater interest to us is therefore the analysis of different search strategies. The tests included the use of two types of queries:

The query could be expanded by use of a thesaurus or by use of any other available source. Use of the thesaurus did not prove as effective as manual elaboration of the query. One of the most important findings, however, was that it was practically impossible by any means to expand the narrow queries to the extent where all relevant documents were found. It was only when all but one category were dropped (broad search) that most relevant documents were found, but then at the expense of a considerable drop in precision. These results correspond well with the expected behavior of full-text retrieval systems, see Harvold (1976).

Considerably more unexpected was the observation that, when a narrow query was expanded, an almost linear relationship was found to

[Page 208 ]


exist between total output and the number of relevant and non-relevant answers.

The experiment included a test on relevance judgements based on different formats of output. The formats used were title, abstract, and full text. The results were:


            judged      judged partially   judged
            relevant    relevant           irrelevant   total
Titles        167           157               762        1086
Abstracts     175           169               742        1086
Full text     207           156               723        1086

The results for full text must be considered the "correct" values. We note that judgements based on abstracts or even titles are good approximations of the relevance assessment made on full text.

It is also interesting to note that the relevance judgements based on titles or abstracts were superior to the performance of the retrieval system. Below we have calculated recall and precision in both situations.
                recall   precision
                  %          %
System
- titles          20         55
- abstracts       59         40
- full text       74         30
Manual
- titles          63         89
- abstracts       77         95
- full text      100        100

11.3 RESEARCH REGARDING LEGAL SYSTEMS

11.3.1 The Joint ABF/IBM Project: 1966-1967.

We shall now consider investigations oriented especially toward the problems of legal, full-text retrieval. We begin with the Joint American Bar Foundation and International Business Machines project, which has

[Page 209 ]


become known not so much for its aim, which was to investigate the degree of satisfaction (as judged by a panel) that could be achieved by the use of a computer-based retrieval system, as for its analysis of the difference in the panel assessments. The project made use of a vector-type retrieval system developed by S. F. Dennis of IBM. The results of the project are described by Eldridge (1968).

The data base consisted of 5 800 appellate court decisions. The question set consisted of 40 questions taken from the files of practising lawyers. The data base was searched both by the retrieval system and by hand at the American Bar Foundation and in the legal department of IBM. Both answer sets were submitted to a panel of four lawyers for evaluation.

It was found that the retrieval system and the manual search had performed about equally well in terms of recall, and that the manual search was about twice as effective in terms of precision. However, a far more interesting and perhaps surprising result of the investigation was the intensity of disagreement between the four panelists. The panelists were instructed to read the questions and evaluate each retrieved document according to a four-point scale of relevance. The documents were to be assessed according to the contribution they made to the resolution of the issue raised in the question. Thus in effect the panelists were asked to evaluate the documents according to content relevance, not subjective relevance. Even so the panelists disagreed more often than they agreed. Of a total answer set of 706 documents, the panelists only gave 3 per cent of the documents a unanimous relevant vote (either "on point", "relevant", or "related"), while 31.3 per cent of the documents received a unanimous irrelevant vote. A total of 65.7 per cent of the documents received a mixed vote. The disagreement, however, turned out to be rather systematic in the sense that each panelist seemed to prefer a certain grade - the academicians on the panel generally preferred the low relevancy grades while the practitioners favored the high ones.
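The three-way split reported above (unanimous relevant, unanimous irrelevant, mixed) can be sketched as a simple classification of the panel votes. The name of the fourth grade ("irrelevant"), the function names and the sample votes are assumptions for illustration; the report's own tabulation method is not described here.

```python
from collections import Counter

RELEVANT_GRADES = {"on point", "relevant", "related"}
IRRELEVANT_GRADE = "irrelevant"  # assumed label for the fourth grade


def classify_votes(votes: list) -> str:
    """Classify the panel's votes on one document."""
    if all(v in RELEVANT_GRADES for v in votes):
        return "unanimous relevant"
    if all(v == IRRELEVANT_GRADE for v in votes):
        return "unanimous irrelevant"
    return "mixed"


def agreement_shares(panel_votes: dict) -> dict:
    """Percentage of documents falling into each agreement class."""
    counts = Counter(classify_votes(v) for v in panel_votes.values())
    return {label: 100 * n / len(panel_votes) for label, n in counts.items()}


# Hypothetical votes by four panelists on four documents.
panel_votes = {
    "doc1": ["on point", "relevant", "related", "relevant"],
    "doc2": ["irrelevant", "irrelevant", "irrelevant", "irrelevant"],
    "doc3": ["relevant", "irrelevant", "related", "irrelevant"],
    "doc4": ["on point", "irrelevant", "irrelevant", "irrelevant"],
}
shares = agreement_shares(panel_votes)
```

Note that a "unanimous relevant" vote in this sense only requires that no panelist rejected the document; the panelists may still have chosen different relevance grades.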

As an explanation of this behavior, the report suggests that the disagreement might reflect the fact that the questions were prepared by a practitioner. As a consequence the issues might have been more familiar to the practitioners on the panel than to the academicians. Other explanations might be possible as well. The experiment does seem to emphasize the subjective nature of relevance. In addition the experimental framework was not "life-like", but distorted both by the fact that the different functions of the retrieval system were performed by different people, and

[Page 210 ]


by the fact that a relevance scale of four grades was used. As Eldridge himself points out, humans normally have difficulties in making comparative evaluations involving more than three or four documents. And in a practical retrieval situation the user probably has little need of making relevance distinctions beyond rejecting some documents as irrelevant and accepting others as of some use.

11.3.2 The Oxford Experiment: 1963-1965

The Oxford experiment represented one of the first large-scale attempts at evaluating the performance possibilities of full-text retrieval, and it was certainly the first experiment making use of a data base consisting of legal documents. The experiment is described by Tapper (1969) and (1973: 159-182). Cfr. also above at section 4.3.4.

The aim of the experiment was to measure the efficiency of computerized legal information retrieval as compared with the conventional technique of index look-up.

Two relatively large data bases were prepared for the purpose of the experiment. The first was a general series of reports of decisions in the High Court, the All England Law Reports. The second was a series of administrative decisions in the field of insurance claims for industrial injuries, the Commissioner's Decisions. The two data bases consisted of about two million and one million words respectively. The data bases were chosen largely because they could both be accessed through manually constructed indexes. The High Court decisions (called cases for short) were indexed both for the series and for the individual volumes. The index terms were taken from an introductory telegraphic abstract to each report. The administrative decisions (called decisions for short) were indexed in a general loose-leaf file. The index was detailed and thoroughly cross-referenced, reproducing a high proportion of the original headnotes. The index was oriented toward factual descriptions, in contrast to the case index, which was oriented toward legal terms.

Still another factor concerning the experimental setup should be mentioned. The manual searchers were limited to the use of the indexes; they were not allowed to browse through or examine the documents themselves. Otherwise, it was felt that they would have had an unfair advantage compared to the searchers using the machine.

The results were evaluated in terms of recall and precision. The questions were based on the facts selected from representative and recent reports. The answer sets were defined by selecting:

[Page 211 ]


Thus no attempt was made to find the complete answer sets consisting of all the relevant documents. However, based on the assumption that the computer and conventional techniques were independent, a value for the size of the complete answer set was estimated on the basis of the intersection of the two answer subsets.

The main results of the experiment are summarized below (from Tapper 1973: 179).
                       Cases                          Decisions
               No. of    recall  precision    No. of    recall  precision
               rel.doc.    %         %        rel.doc.    %         %
Conventional      43       39       100          57       58        85
Computer          67       61        22          77       80        37
Together          84       76        27          91       96        39

We note that the computer technique performed significantly better than the conventional techniques with respect to recall. In fact the differences in the values are quite remarkable. This can be seen as yet another confirmation of the superiority of full-text representation compared to indexing, even when the indexing is quite elaborate and thorough.
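The size estimate mentioned earlier, derived from the intersection of the two answer subsets under an independence assumption, can be sketched with the standard two-sample (capture-recapture) estimator. This is a swapped-in textbook technique, not necessarily the exact computation used in the experiment. The example takes the "Cases" figures from the table and assumes that the "Together" row is the union of the two subsets, so that the overlap is 43 + 67 - 84 = 26.

```python
def estimate_answer_set_size(n_first: int, n_second: int, n_both: int) -> float:
    """Two-sample estimate of the complete answer set size.

    If the two techniques find relevant documents independently, the share of
    the first subset that the second also finds approximates the second
    technique's recall, which gives N ~ n_first * n_second / n_both.
    """
    if n_both == 0:
        raise ValueError("no overlap: the size cannot be estimated")
    return n_first * n_second / n_both


conventional, computer, together = 43, 67, 84  # "Cases" column of the table
overlap = conventional + computer - together   # assumes "Together" is the union
estimated_total = estimate_answer_set_size(conventional, computer, overlap)

recall_conventional = conventional / estimated_total
recall_computer = computer / estimated_total
recall_together = together / estimated_total
```

With these inputs the estimate is about 111 relevant documents, which gives recall values close to, but not exactly matching, those tabulated for the cases; the original computation may therefore have differed in detail.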

As expected, the computer technique produced inferior precision values, but we note that the range of the values, from about 20 to 40 per cent, is not at all unmanageable in a modern on-line system.

11.3.3. The Responsa project: 1967-1969

The Responsa project is an ambitious attempt to make the huge responsa literature available for research through a full-text retrieval system. The responsa span 17 centuries and consist of answers by Jewish authorities to submitted questions. In this short summary only the results of the first experimental phase of the project, which was concluded in 1969, will be discussed. The project is a continuous effort, however, and is by no means

[Page 212 ]


completed. Documentation of the first phase is provided by Choueka/Cohen/Dueck/Fraenkel/Slae (1972).

The responsa are written mainly in Hebrew and Aramaic. The traditional problems which natural language represents from a retrieval point of view are greatly accentuated in these languages. Grammatical variants do not necessarily have the same initial letters; homographs are abundant owing to, among other things, the lack of vowels; Hebrew and Aramaic forms and grammatical rules are mixed; acronyms and abbreviations may make up more than 50 per cent of some texts; and so on.

The method chosen as a means of attacking these difficulties was the so-called synthetical approach. Essentially this is the same method as was adopted by IBM, Austria, in the construction of the FAIR system. The principle of the method is based on the following two-phase preprocessing of the question. In the first phase the user specifies a standard form of the words he wants to search on, together with information on the grammatical variants he is interested in. The standard form is based on the singular masculine form of nouns and on the root of verbs. On the basis of the standard form, the system generates grammatical forms of the specified words and checks which of these actually occur in the data base. In the second phase the words with associated frequencies are presented to the user, who marks off the words he wants to include in the query. If the user is in doubt about the relevance of a word, he can ask to see the word in a limited context consisting of a few words on each side of the word (compact KWIC). After this preprocessing is accomplished, the main retrieval process proceeds as usual.
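The two-phase preprocessing can be outlined in a short sketch. It is emphatically not the Responsa implementation: a toy English suffix rule stands in for the generation of Hebrew and Aramaic grammatical forms, and all names (`generate_variants`, `variants_in_database`, `kwic`) and the sample texts are hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    return re.findall(r"[a-z]+", text.lower())


def generate_variants(standard_form: str, suffixes: list) -> set:
    """Phase 1a: generate candidate grammatical forms from the standard form.
    (Toy rule: append suffixes; real morphology is far more involved.)"""
    return {standard_form + suffix for suffix in suffixes}


def variants_in_database(variants: set, documents: list) -> Counter:
    """Phase 1b: keep only forms that actually occur, with their frequencies."""
    counts = Counter()
    for text in documents:
        counts.update(tok for tok in tokenize(text) if tok in variants)
    return counts


def kwic(word: str, documents: list, width: int = 3) -> list:
    """Phase 2 aid: show the word with a few words of context on each side."""
    lines = []
    for text in documents:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == word:
                lines.append(" ".join(tokens[max(0, i - width): i + width + 1]))
    return lines


documents = [
    "The court awarded damages",
    "He claims damage to the ship",
    "No claim was filed",
]
candidates = generate_variants("claim", ["", "s", "ed", "ing"])
occurring = variants_in_database(candidates, documents)
context = kwic("claims", documents, width=1)
```

In the second phase the user would mark which of the occurring forms to keep, consulting the compact context lines where a form looks doubtful, before the ordinary retrieval process starts.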

This broad outline will give an idea of the philosophy behind the system. The system was tested on an initial data base consisting of the 518 responsa (558 864 words) by Rivash. In all 16 questions were run, and the results were gratifying - 100 per cent recall was achieved for all questions, and the average precision was 34 per cent. However, these performance figures are not directly comparable to the other performance figures we have so far considered. The responsa queries were prepared in an unusually thorough and time-consuming manner. Before the query was constructed, the searcher spent a full day researching the data base in order to acquaint himself with the relevant vocabulary. This of course is not normal procedure in other systems and makes it exceedingly difficult to evaluate the responsa results. However, the main feature of the system, the synthetical approach to the preprocessing of queries, is both impressive and promising.

[Page 213 ]


11.3.4 The NORIS program, 1972-1976

In 1972 the Norwegian Research Center for Computers and Law initiated a research program in the field of legal information retrieval. The aim was both to investigate retrieval system performance, and to analyze the potential impact of computerized retrieval on the legal system itself. In the following only the aspect of retrieval performance will be discussed. Retrieval performance was investigated simultaneously along theoretical and empirical lines. Some of the theoretical results are documented in Harvold (1976); the empirical investigations are documented in a series of publications which include Bing/Harvold (1973), (1974), Fjelvig (1976), and Bing/Harvold/Kjønstad/Stabell (1976).

A model of full-text retrieval was developed along the following lines. On the basis of text statistics a logarithmic relationship was assumed to exist between the number of distinct words and the total number of words in a text. Using this relationship, a model giving the average quantity of output, measured by the number of retrieved documents, was derived as a function of the following factors:

The quantity model was developed into a performance model by introducing the concept of relevance. It was assumed that the grading of relevance is always done on an either/or (binary) basis, even in the cases where relevance clearly is subjective. This assumption seemed to be the best reflection of reality in normal user situations, where it is the same person who is confronted with the question, formulates the query, performs the search, and evaluates the result. It was felt that the alternative to binary grading, grading by degree, is both time-consuming and difficult and is not normally engaged in. Normally the user is primarily interested in the bisection of the data base, where one section consists of documents that can be rejected out-of-hand and the other section consists of documents that should be consulted. Given this concept of relevance, expressions for recall and precision were developed, in which recall and precision were seen as functions of the same factors that determined the quantity of output.

The model was used to investigate the limits to performance under

[Page 214 ]


various conditions and to compare and evaluate different types of matching functions.

The empirical experiments were conducted partly in order to test the theoretical results, and partly to test questions not covered by the theory. The empirical experiments were made up of the following three traditional phases:

The limits to performance are of both a practical and a fundamental nature. The fundamental, or absolute, limits cannot be overcome by any amount of user ingenuity. They are essentially caused by the difference between formal relevance criteria as applied by the matching function and the content or subjective relevance criteria as applied by the user. Formal relevance is based exclusively on syntactic similarities between documents and query, while subjective or content relevance depends on the user's understanding of the texts. It should be noted that since the absolute limits depend on the difference between syntactics and semantics, the limits are only of importance in reference retrieval.

The absolute limits may affect both recall and precision. An absolute recall failure will be the result in the case where the subject of the question is not explicitly represented in a document, but is implied by the text as a whole. Absolute recall failures are most likely to occur in one-concept questions. In questions of two or more concepts a relevant document containing an implied concept may still be retrieved on the basis of the other concepts, but then only at the expense of a trade-off in precision, since the co-occurrence level no longer implies an exhaustive description of the question.
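The trade-off described above can be illustrated with a minimal sketch of a co-occurrence search. The representation of a query as a list of concepts, each given by a set of terms, and all names and sample documents are assumptions for the example rather than the NORIS matching function itself.

```python
def cooccurrence_level(doc_tokens: set, concepts: list) -> int:
    """Number of query concepts represented by at least one term in the document."""
    return sum(1 for terms in concepts if terms & doc_tokens)


def search(documents: dict, concepts: list, min_level: int) -> list:
    """Return (doc id, level) pairs meeting the co-occurrence requirement,
    ranked by the number of concepts matched."""
    hits = []
    for doc_id, text in documents.items():
        level = cooccurrence_level(set(text.lower().split()), concepts)
        if level >= min_level:
            hits.append((doc_id, level))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical two-concept question: pensions for widows.
concepts = [{"pension"}, {"widow", "widower"}]
documents = {
    "d1": "widow pension granted",
    "d2": "the surviving spouse was granted a pension",  # second concept only implied
    "d3": "pension contributions paid by the employer",  # one concept, not relevant
    "d4": "tax on property",
}
exhaustive = search(documents, concepts, min_level=2)
relaxed = search(documents, concepts, min_level=1)
```

With the exhaustive requirement only d1 is retrieved and the relevant d2 is lost, because its second concept is merely implied. Lowering the requirement to one concept recovers d2, but also admits d3, which is the precision cost the paragraph describes.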

An absolute precision failure will be the result in the case where recall cannot be improved without a corresponding loss in precision. An absolute precision failure can be caused, as we have seen, by the need to lower the co-occurrence requirement because of an implied concept. In situations where it is specified in the query that all concepts should co-occur (exhaustive co-occurrence requirement), an absolute precision failure will occur either if one of the terms specified in the query has a homograph in a document containing all the concepts in the question except the one represented by the homographic term, or if the document contains all the concepts in the question, but in a wrong context. We note that absolute precision failures are most likely to occur in searches

[Page 215 ]


consisting of one or a few concepts. In fact when the recall-precision curves arrived at by averaging test results were compared with the corresponding curves predicted by our model, it was found that the absolute limits only cause a precision loss of about 10-20 per cent for queries based on three concepts. For queries based on one concept the precision loss increased to 40-50 per cent.

Above we implicitly assumed that it was possible to specify all terms and phrases used in the documents to represent the concepts of the question. Normally of course, this is quite impossible. Thus a high recall will only be within reach if a drop in precision, caused by relaxation of the co-occurrence requirement, is acceptable. Such a search strategy

Fig. 11/3

Performance curves based on a data base of 430 decisions by Swedish Administrative Courts, Harvold (1976: 73).

[Page 216 ]


leads to the type of curves shown in Fig. 11/3. Of the two recall-precision curves depicted, one is the result of averaging the test results obtained in Bing/Harvold (1974), and the other represents the corresponding curve predicted by our model, given similar searching conditions and under assumptions of no absolute limits and a query expansion of 70 per cent.

Both curves were calculated by averaging individual curves based on queries consisting of from one to three concepts. The individual curves were extrapolated vertically beyond their end points.

The slopes of the curves represent the practical retrieval limits. These limits result primarily from the fact that we did not specify all possible words for each concept. The distance between the curves represents the

Fig. 11/4

Average and maximum predicted performance curves based on a data base of 430 documents, cfr. fig. 11/3.

[Page 217 ]


absolute retrieval limits. We note that this distance represents a precision loss of about 20-30 per cent. This means that, given the type and size of data base used, precision cannot on the average be improved beyond 70-80 per cent. Had all the questions consisted of at least three concepts, the result would have been better. Our questions, however, were a mixture of one, two, and three concept problems. In Fig. 11/4 we thus show both the average and the maximum performance curve, given an absolute precision limit of 25 per cent.

The NORIS project also included evaluations of different search strategies. The strategies were tested empirically in addition to being compared theoretically. Of special interest were matching functions used to rank documents. The following ranking criteria were compared:

We note that the introduction of class frequency ranking improves performance, compared to word frequency. A comparison of the

Fig. 11/5

Performance curves based on different ranking criteria. Harvold (1976:96-98).

[Page 218 ]


Data base of 100 decisions by the Norwegian Social Security Court.

Data base of 374 decisions by the Norwegian Tax Authorities.

methods on a theoretical basis suggests similar improvements, see Harvold (1976: 94-100).

In order to map the causes of retrieval failure all the NORIS experiments included post-mortems on the retrieval results. The experiments in the different NORIS projects varied with respect to type and size of data base, and the results of the post-mortems varied accordingly. The general

[Page 219 ]


pattern remained more or less unchanged however. We shall use the NORIS (8) II to illustrate the results.

A recall failure occurs when a relevant document is not retrieved at all. The reason for such a failure can generally be identified as belonging to one of the four groups listed in Table 11/6. The table gives the relative importance of the causes in the NORIS experiment.

Table 11/6

Causes of recall failure, Bing/Harvold (1974: 102)
Specificity 49%
Implicity 22%
Point-of-view 27%
System failure 18%
  116%

The total adds up to more than 100 per cent because more than one cause may be associated with a given failure. A specificity failure means that the user did not find the correct terms to represent the concepts of the question. An implicity failure means that a concept was not explicitly expressed in a document. A point-of-view failure occurs when the user and the author approach the problem from different angles - with the result that different vocabularies are used in query and document. A system failure is caused by a fault in the retrieval system. The system failures in the experiment were mainly caused by faulty maintenance of the data base: the user mistakenly believed that a document was included in the base when it was not.

The figures speak for themselves. It is interesting to note the relative importance of the point-of-view failures. A different point-of-view will often be the cause of trouble when a document is not found at all.

Before we discuss the precision failures, we have to say something about the partial performance failures. These occur when a retrieved relevant document is assigned a lower rank than it should have had, because of a failure to match some of the concepts in the question. Partial failures can be classified as either recall or precision failures according to choice.

Table 11/7 gives the causes of partial performance failure.

[Page 220 ]


Table 11/7

The causes of partial retrieval failure, Bing/Harvold (1974: 104)  
Specificity 57%
Implicity 12%
Point-of-view 14%
System failure 16%
  99%

Table 11/8

Causes of precision failure, Bing/Harvold (1974: 117)
Point-of-view
Specificity 29%
Exhaustivity 29%
  100%

A precision failure caused by point-of-view is not so much a performance failure as a failure of the experimental method. In this particular experiment, the answer set was not defined by the same person who specified the query and performed the postmortem. The point-of-view failures represent disagreement as to the relevance of certain documents. If the same person had performed the two tasks, the relative importance of the point-of-view failures would not have been so great. Anybody may change his mind though, and when it comes to relevance judgments, fickleness is indeed widespread. In Bing/Harvold/Kjønstad/Stabell (1976) the user, upon a reevaluation of the answer set, dismissed 16 of the 162 documents originally deemed relevant, and included another 61 documents as relevant.

A specificity failure occurs when the cause can be traced back to one of the words in the query. The user may simply have specified the wrong words, or more likely, a homographic word. There is very little one can do to reduce this particular type of failure, since in the final analysis the failure is due to the ambiguity present in natural language itself.

[Page 221 ]


An exhaustivity failure occurs when a document does contain the subject of the question, but the subject is only mentioned as a reference or en passant or in a way not related to the main content of the document. This type of failure can also be said to be caused by the nature of natural language. At least it is reasonable to expect that the failure is less prominent in indexing systems. However, exhaustivity failures are not completely absent from these systems: Lancaster identifies exhaustive indexing as the cause of 11.5 per cent of the precision failures, see Table 11/2.

[Page 222 ]

