| | retrieved | not retrieved |
|---|---|---|
| relevant | a | b |
| not relevant | d | c |
then:
recall = a / (a + b)
precision = a / (a + d)
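As a minimal sketch, the two measures can be computed directly from the cell counts of the table. The function name and the sample counts are illustrative assumptions, not figures from any experiment.

```python
def recall_precision(a, b, d):
    """Return (recall, precision) from the contingency counts.

    a: relevant and retrieved
    b: relevant but not retrieved
    d: not relevant but retrieved
    (c, the not relevant and not retrieved documents, enters neither measure.)
    """
    recall = a / (a + b) if (a + b) else 0.0
    precision = a / (a + d) if (a + d) else 0.0
    return recall, precision


# Hypothetical counts chosen only for illustration.
recall, precision = recall_precision(a=30, b=10, d=20)
print(recall, precision)  # 0.75 0.6
```

Note that the count c does not appear in either ratio, so neither measure says anything about how many irrelevant documents were correctly left behind.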
The measures of recall and precision may seem simple at first glance, but this impression is something of an illusion. The measures depend entirely on the concept of relevance, and this is probably one of the most difficult concepts in the whole field of information retrieval.
There are at least three issues of importance to our understanding of relevance. The first issue concerns the type of relevance. Are we dealing with formal, content, or subjective relevance? See Königova (1971). The second issue concerns the nature of relevance. Is relevance absolute or is it relative to each user? The last issue concerns the grading of relevance. Is relevance a matter of degree or is it an either/or proposition?
Relevance is a relator in the sense that it says something about one thing in relation to another thing. The two "things" might be the syntactics of two texts, in which case we talk of formal relevance. Or they may be a problem and the content of a text, in which case we talk of content or subjective relevance depending on the type of situation in which the evaluation takes place.
Formal relevance measures the syntactic similarity between two texts. The texts may be query and document or two documents. The formal relevance of a document is thus a value assigned to the document by a matching function. Formal relevance is based on the syntactic structure of the document, not on the content of the document, nor on its usefulness to the user. Formal relevance will usually reflect the similarity between two texts as measured, for example, by a matching of words. But it may also reflect general criteria like type and age of document, author, and so on. The nature of formal relevance is absolute. Given two texts and a matching function, the relevance value is unambiguously defined. Depending on the matching function, the grading may be either/or (binary) or by degrees.
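The idea of a matching function can be illustrated with a small sketch. The word-overlap measure below is only one possible choice, assumed here for illustration; the text does not prescribe any particular function. It shows both a graded and a binary (either/or) variant, and in both cases the value is fully determined by the two texts.

```python
def word_overlap(query, document):
    """Graded formal relevance: share of distinct query words found in the document."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & document_words) / len(query_words)


def binary_match(query, document):
    """Either/or formal relevance: True only if every query word occurs."""
    return word_overlap(query, document) == 1.0


# Invented texts for illustration.
query = "tax exemption hospital"
document = "the hospital may lose its tax exemption"
print(word_overlap(query, document))   # 1.0
print(word_overlap(query, "tax law"))  # one of the three query words matches
print(binary_match(query, "tax law"))  # False
```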
Content relevance is defined as the adequacy of the content of a document as a response to the request. Subjective relevance is defined as the usefulness of the document to the user. Subjective relevance will depend on a host of factors, including the user's previous knowledge. In the literature subjective relevance has also been characterized as "utility", Cooper (1971), and "pertinence", Foskett (1972).
The choice between content and subjective relevance must to a certain extent reflect the type of decision-making situation in which the user finds himself.
In an informal decision-making situation where the value of the decision depends entirely on future consequences, there are no rules that the decision-maker is forced to consider. The only thing that counts is the decision itself. How it is arrived at is irrelevant.
The quite opposite situation exists where the decision process is formalized in such a way that the validity of the decision depends mainly on the premises on which the decision is based. The validity of a decision made by a legal court, for example, will depend on whether or not certain procedures have been observed and respected. Certain of these procedures demand that certain legal sources shall be consulted. If the judge neglects to consult one of the required sources, his decision is invalid, even if it turns out that the same decision would have been arrived at had the source been taken into account.
In legal decision-making, as in other formal situations, it is thus not appropriate to regard relevance as entirely subjective. The relevance of a document cannot be defined only in terms of its usefulness to the user, since such a definition implies that only documents which in some way cause the decision-maker to change his mind are relevant. Instead we must base relevancy on the content of the document. Whether or not the document is relevant will depend on the adequacy of the content of the document as a response to the request. As part of the content we count things like date of publication, author, and so on.
We have earlier remarked that formal relevance is absolute in the sense that it does not depend on individual value judgements. Subjective relevance, on the other hand, is relative in the sense that it is particular to each individual user. In fact subjective relevance is not only relative to each user, but to each user situation, since the background knowledge of the user, which changes constantly, is a main factor affecting the utility of additional information.
The nature of content relevance is more difficult to establish. Content relevance measures the adequacy of a document as a response to the request. Content relevance does not depend on whether or not the user himself finds the information useful in the sense that it is new to him.
In a strictly formal system there are definite rules for evaluating relevance. There is little or no room for the personal opinion of the individual user. In such a system content relevance tends to be absolute.
While the legal system has certain characteristics of a formal system, these are not sufficiently prominent to make the assessment of content relevance absolute. This is not only evident in legal theory with its emphasis on human judgement within the often broad limits set by the law, but is also borne out by several empirical investigations regarding relevance assessments. We refer to section 11.3.1 for a summary of the results of one of these investigations. This is not to say that relevance assessment in the legal system is completely relative. Clearly it is not. As so often in the law the solution must be found somewhere in between the extremes, the different situations in each case deciding the way the scale tips.
Even when we know the type and nature of the relevance assessment, we are still left with the question of grading. Whether the relevance assessment should be graded by degrees or given a binary grading does not follow automatically from our previous choices of type and nature, and yet neither is the grading of relevance completely independent of type and nature.
Consider for example formal relevance. We have established that the assessment of formal relevance as performed by a matching function is absolute. Since there is no uncertainty regarding formal relevance, it might seem to be an appropriate candidate for binary grading. However, such a conclusion would be premature. It must be remembered that formal relevance as applied in a retrieval system is only an approximation of content or subjective relevance, and in most situations that are not characterized by fact retrieval (see section 9.5.1), the approximation will not be perfect and may even be quite inaccurate. As long as formal relevance is an approximation, the system should grade documents by degrees. This is appropriate even in the cases where the user himself would grade documents according to a binary scale. In these situations formal relevance is only a measure of similarity and should not presume to be a measure of identity.
The grading of documents by the user according to content or subjective relevance is an entirely different matter. If it is assumed that the user assesses documents according to subjective utility, it is obvious that some documents are going to be more useful than others. It is doubtful, however, whether or not the user will be able to assign to every document a unique rank according to utility. It seems, at the very least, safe to assume that the user will not make any distinctions among the documents which are all clearly irrelevant. And in most cases it also seems safe to assume that the user in fact will not be able to assign a unique rank to each relevant document. Experience has shown that humans are generally incapable of mentally making a complete comparison of more than a few items.
The user may be able to classify the relevant documents under a few headings like:
But again it is doubtful how much use he will have of such a classification. It is also an open empirical question whether users of retrieval systems normally classify documents in this manner, or if the practice is reserved for panelists taking part in relevance experiments.
The behavior of panelists has often been cited in support of the proposition that relevance is a matter of degree. Gebhardt (1975), for example, refers to the Joint American Bar Foundation and IBM Project, see Eldridge (1968), and points out that panelists seem to disagree as often as they agree on relevance assignments. The typical user situation, however, does not consist of a panel, but of a single user. For a given document in a given situation the user will normally be able to decide whether or not the document is clearly irrelevant, or whether it might be relevant. Most of the time he will probably not assign a unique rank to the document, and an attempt to do so might prove difficult.
Assuming, however, that legal documents are assessed according to their content, relevance values corresponding to their hierarchical ranks may be assigned. Thus the constitution may be given a higher relevance value than a statute, a statute may be given a higher value than a regulation, and so on. But such a scheme is neither very realistic nor, probably, very useful. The rank of a legal source is not in itself absolute. The respective ranks of a supreme court decision, a statute, and a regulation, for example, depend on several factors, including how long ago the respective documents were written, how directly they affect the issue at hand, the reasonableness of the result which each document favors, and so on. In fact, in most areas of the law so much depends on human judgement that it does not seem practicable to implement any kind of rigid scheme for assigning relevance values to documents. Cf. section 1.2.9 above.
A complete ranking of documents on a utility basis, rather than on the basis of content, is not normally performed by users. One of the findings reported by Eldridge (1968) was that each panelist seemed to have his own favorite relevance group in which he tended to place documents. It is likely that the same tendency is found among ordinary users.
If a request is complex, the user may rank documents according to the aspect of the request which the document refers to. In a criminal case, for example, the user may find a document discussing the question of guilt more important than a document discussing the correct legal reaction once guilt is established. It is probably appropriate, however, to regard this kind of preference as a ranking of the various aspects of the problem rather than as a ranking of retrieved documents.
The user may change his relevance assessment as he gains new insight into the nature of the problem. He may disregard documents he previously thought to be relevant and consider as relevant documents he previously overlooked. In one of the experiments of the NORIS project, the evaluator initially assessed the relevance of all documents in the data base with respect to 20 questions. After having considered the result of the machine search, he rejected 16 of the 162 documents originally judged relevant and accepted 61 of the documents originally judged irrelevant. A re-evaluation of previous assessment results is probably common and illustrates the relativity of relevance. However, this fact by itself has little bearing on the appropriate relevance grading of the documents.
What we are left with as a conclusion is that the user at any one time disregards irrelevant documents. The remaining documents may be, and sometimes are, classified in a few relevance categories. But they will almost never be assigned unique rank values. Documents which at first glance seem to be of dubious relevance are usually re-assessed and either disregarded or accepted as relevant. The user will normally not leave them in an uncertain state, as this is of little value to him.
The grading of content and subjective relevance must therefore, generally speaking, be regarded as binary, that is as an either/or proposition. The user can make use of a few relevance categories, but will as a rule not make a full ranking of the documents.
The retrieval process consists of both human and machine subprocesses as shown schematically in Fig. 9/1.
Fig. 9/1. The retrieval process.
The first step in a retrieval process is query construction. The user must transform his problem into a query, i.e. he must give the problem a syntactic representation. It should be emphasized that the problem itself is semantic in nature. It is often possible to give the problem several syntactic representations which are all equally adequate from the user's point of view. As queries, however, the different representations may not all be equally adequate.
The query defines the formal properties which a document must have in order to be retrieved. Thus the formal relevance value assigned to a document is determined by the query, while the content (subjective) relevance of a document is, of course, independent of any particular syntactic representation.
One way, and sometimes the only way, to achieve a perfect result, where all and only the relevant documents are retrieved, is to formulate the query as duplicate images of the relevant documents. Of course, normally this is not a practical alternative. The syntactics of the relevant documents will usually vary significantly. And because there is an expressed retrieval need, the relevant documents will, by the very nature of the situation, be largely unknown to the user. The query cannot, and should not, portray each relevant document, but should express the properties which the documents have in common. We can call these properties the necessary conditions of relevancy. The formal relevance of a document should reflect the probability of all these conditions being matched in a document.
In order to illustrate what we mean by necessary conditions of relevancy, let us take as an example the retrieval problem used by Horty in his paper from 1960 on the application of information retrieval techniques to legal research. The legal problem in question concerned the potential loss of a tax exemption which a hospital might suffer because it rented out portions of its first floor for commercial purposes.
As Horty points out, the problem
"..involves three basic concepts. First of all it is a tax problem,
secondly, it concerns the exemption of property from taxation, and
thirdly it concerns the use of a part or portion of the property in a
commercial manner. The object is to retrieve each statute which deals
with all three concepts.
Because not every statute will deal with this problem in the same
language our inquiry must reflect every possible way each concept has
been expressed in the statutes."
Horty chose to give his query the following structure:
c1 AND c2 AND c3
where c1, c2 and c3 are sets of alternative ways of representing the first, second, and third concept respectively, see Horty (1960). The structure reflects the condition that a document, in order to be relevant, must contain all three concepts.
Each concept is represented in the query by a set of words. Each word in a set is an alternative way of representing the concept. In the following we shall call a set of words representing the same concept a class.
Once the query is specified, the documents can be matched against the query. This process is performed by the machine according to a specified algorithm usually called the matching function.
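A query of the form c1 AND c2 AND c3, with each class a set of alternative words, can be sketched as follows. The word lists and the two sample documents are invented for illustration and are not Horty's; the simple whitespace tokenization is likewise an assumption.

```python
# Each class is a set of alternative words for one concept.
# The word lists are invented for illustration; they are not Horty's.
c1 = {"tax", "taxation", "taxes"}
c2 = {"exempt", "exemption", "exempted"}
c3 = {"commercial", "rent", "rented", "lease"}


def matches(document, classes):
    """True if the document contains at least one word from every class."""
    words = set(document.lower().split())
    return all(words & word_class for word_class in classes)


documents = {
    "doc1": "hospital property is exempt from taxation unless rented for commercial use",
    "doc2": "the tax on commercial property is assessed annually",
}
retrieved = [name for name, text in documents.items() if matches(text, [c1, c2, c3])]
print(retrieved)  # ['doc1']
```

The second document is rejected because it contains no word from the exemption class, even though it matches the other two classes.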
It is normally difficult to evaluate the result of the search. In principle it can be done by manually assessing the relevance of all the documents in the data base and comparing the result with the retrieval result. In Fig. 9/1 such a hypothetical manual evaluation process is represented by the box labeled "relevance assessment", and the search result is depicted by the degree to which sets S1 and S2 overlap. We note that recall and precision are determined by the ratios n(S1 ∩ S2)/n(S1) and n(S1 ∩ S2)/n(S2) respectively.
In a retrieved document set there will normally be some irrelevant documents along with the relevant ones. As long as the documents are not ranked, the relevant and irrelevant documents will, on the average, be randomly distributed in the retrieved set, and the documents will not be presented to the user in an order favoring the relevant ones. Precision will therefore, on the average, remain constant as the user looks at new documents. In a ranked set, however, precision will initially be high and then deteriorate as new documents are considered by the user.
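The behavior of precision in a ranked set can be illustrated with a short sketch that computes precision after each document the user looks at. The document identifiers and the idealized ranking, which places all relevant documents first, are assumptions made for illustration.

```python
def precision_at_each_cutoff(ordered_documents, relevant):
    """Precision after the user has looked at the first 1, 2, ... documents."""
    values = []
    hits = 0
    for seen, document in enumerate(ordered_documents, start=1):
        if document in relevant:
            hits += 1
        values.append(hits / seen)
    return values


relevant = {"d1", "d2", "d3"}
# An idealized ranked presentation that places the relevant documents first.
ranked = ["d1", "d2", "d3", "d4", "d5", "d6"]
print(precision_at_each_cutoff(ranked, relevant))
# [1.0, 1.0, 1.0, 0.75, 0.6, 0.5]
```

In an unranked set the same six documents would appear in arbitrary order, so the value at each cutoff would fluctuate around the overall precision of 0.5 rather than start high and decline.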
Before we leave this short introduction to the search process, it should be emphasized that recall and precision alone do not give the complete performance picture. For one thing, recall and precision do not tell us the time it takes for the user to complete the search. It is true that high precision reduces necessary browsing time and thus is a powerful timesaving device. But total search time is also a function of factors like user ability and experience. In fact, recall and precision are never independent of the other performance criteria, but can usually be improved at the expense of search time and user effort. Secondly and perhaps more importantly, recall and precision do not reflect the interactive nature of on-line retrieval systems. This interactivity, which allows the user to modify his query on the basis of achieved results (feedback), is a very powerful system function. However, it would probably be a misconception to think that the interactive capabilities of on-line systems reduce the demands on the other search functions. Interactive capabilities can only complement other functions, not replace them.
A point which has often been stressed by authors on the subject of information retrieval (see for example Bar-Hillel 1964) is that it is important to distinguish between different information needs. This seems especially true in a discussion regarding retrieval strategies. There is a whole spectrum of information needs, one end of the spectrum being characterized by fact retrieval, the other end by reference retrieval.
Fact retrieval is, as the name implies, a search for facts. A fact can in this connection be either a specified piece of information, for example a name, figure or amount, or it can be a specified document, for example a letter, invoice, or statute. What distinguishes fact retrieval from reference retrieval is that all types of relevance assessment in fact retrieval are absolute. Hence there can only be one correct answer set irrespective of the user performing the search. An important consequence of this is that the relevance requirements can be precisely and formally specified in the query. It means furthermore that, as long as we have an adequate data base, the retrieval result is always optimal in the sense that all relevant and no irrelevant documents are retrieved.
Reference retrieval is a search for references or citations to documents which discuss or in some other way throw light on a given problem. Most of the legal research connected with finding precedents can be characterized as reference retrieval.
Reference retrieval is a much more complicated process than fact retrieval, since user relevance assessment has now become relative. A perfect reference retrieval result is rarely achieved, and even a satisfactory result can often be difficult to achieve for the inexperienced user. A great many retrieval aids and strategies have been developed to meet the difficulties involved, and it is to the most important of these aids and strategies we now turn our attention.
Factors affecting performance can be divided into two groups according to whether or not the factors are in principle subject to user control. We shall define a search strategy as the determination of the variable factors, that is the factors subject to user control.
The fixed factors are the factors over which the user has no control, not only practically speaking, but in principle. As it turns out, there are not many factors which are fixed. In addition to the problem itself the fixed factors include only the documents that provide an adequate coverage with respect to the problem. Whether or not these documents are part of a searchable data base is a question which the user normally cannot control. However, we still choose to regard the composition of the data base as a variable factor. Even if the user is not in a controlling position, at least the system designer is.
In addition to the selection of the data base, the variable factors include selection of document representation, selection of command language, and formulation of query. These are the factors concerning coverage, indexing, user effort, and retrieval performance, and we shall now discuss them in more detail.
Coverage is a measure similar to recall. But while recall refers to the proportion of relevant documents stored in the data base that is retrieved, coverage refers to the proportion of documents required by the user that is stored in the data base. Recall can thus be said to be a derived measure of coverage, and a high value of recall may not be very significant if coverage is inadequate to begin with.
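The distinction can be made concrete with a small sketch in which documents are represented as set members. The identifiers and set sizes are invented for illustration; the example shows a search with perfect recall that is nevertheless of limited value because coverage is low.

```python
def coverage(required, data_base):
    """Proportion of the documents required by the user that are stored."""
    if not required:
        return 0.0
    return len(required & data_base) / len(required)


def recall(required, data_base, retrieved):
    """Proportion of the stored relevant documents that are retrieved."""
    stored_relevant = required & data_base
    if not stored_relevant:
        return 0.0
    return len(retrieved & stored_relevant) / len(stored_relevant)


# Hypothetical sets: the user needs ten documents, only four are stored.
required = {"r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9", "r10"}
data_base = {"r1", "r2", "r3", "r4", "x1", "x2"}
retrieved = {"r1", "r2", "r3", "r4", "x1"}

print(coverage(required, data_base))           # 0.4
print(recall(required, data_base, retrieved))  # 1.0
```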
Like recall, coverage is particular to the user and his problem. Normally it is not practical to provide all users with perfect coverage, nor is perfect coverage for everybody in all possible situations necessarily the aim of the system. Not only is universal perfect coverage impossible, but the failure to achieve it is hardly serious. Coverage and recall failures cannot be compared in this respect. Recall failures are serious because normally the user does not know what he is missing. Coverage failures, however, are immediately apparent to the user, provided that the system is transparent in the sense that the user is made aware of the different document types included in the data base. The important thing is not that the system covers all information needs, but that it covers certain well-defined needs which are not easily provided for by other means.
Because of the availability of other information sources in addition to a retrieval system, there will rarely or never be a need to implement a system with perfect coverage. However, decisions regarding the data base are hardly less difficult in a system with more modest coverage goals. Since coverage, in the final analysis, depends on user relevance assessment, there is potentially no limit to what should (or could) have been included, even in a limited system.
Obviously the system designer has a need for guidelines or rules regarding data base composition. Not surprisingly such guidelines are difficult to define. Some help, however, can be found in the type of decision-making situation in which the retrieval system functions.
Decision-making situations may be more or less formally structured. In a typical informal situation the value of the decision is determined by the decision effects, and these effects may be more or less difficult to predict at the time of the decision. The effects are always difficult to predict when the cause-effect relationship is unknown or too complex to be thoroughly understood. In informal situations of this type it is common to associate the value of information with the value of the decision made on the basis of that information. Since the value of the decision is determined by its effects, the value of information is usually based on methods for predicting uncertain decision effects. To the extent that forecasting methods or models are used, the field of potentially relevant information is reduced from an unstructured and unlimited mass of data to a few precisely defined items.
This solves the coverage problem regarding the model, but does not really solve the problems regarding total information needs. Methods and models to predict the future, if they exist at all, are as a rule only simplifications of reality and therefore inaccurate. Usually such methods are only used as aids, not as the main instrument in the decision-making process. Consequently, we are still left with difficulties in predicting the potential relevancy of different types of information and with difficulties in relating different information items to possible decision outcomes.
A quite different type of situation exists where decisions are made according to sets of norms (or meta-norms). This is the formal type of situation which exists, broadly speaking, in, for example, legal decision-making. In formal systems there are usually no serious problems connected with predicting decision effects. The effects are still as important as before, but in the formal type of situation there is no uncertainty connected with the effects. The mutual relationships between relevant information, norms, and effects are predefined in the norms themselves.
Legal decision-making tends to be "formal" in the sense the word is used here. But only in very rare cases will legal decision-making be strictly formal, so that given norms are applied mechanically. It is also only in rare cases that legal decision-making is strictly informal so that no norms at all are considered, but only the effects.
In formal systems we no longer have the same relationship between information and decision as we had in informal systems. In informal systems, we remember, the value of information depended on the value of the decisions made on the basis of this information. In formal systems it must necessarily be the other way around. It is now the value of the decision that is dependent on the value of the information on which the decision is based. The value of the information depends on whether or not the information is selected and used according to accepted system practice. Hence the problems of coverage are very different in formal and informal types of systems. In formal systems like the legal system, the relevant sources of information are to a large extent specified in the meta-norms governing the decision process. The sources can easily be identified and are usually available as written text. A designer of a legal retrieval system need generally not worry, as must his colleague who is struggling with the information problems of an informal system, that he has failed to identify a crucial or important source of information.
It is now time to look closer at the structure of the legal sources of information. As far as written texts are concerned, this structure is relatively simple and helps to reduce the problems connected with the coverage of legal data bases.
The concept of "legal source" can take on different meanings. Sometimes it connotes an argument in favor of a particular solution to a legal conflict. At other times it connotes the authority issuing a legal norm, for example parliament, the government or a court.
In this book the concept of legal source is used in yet another sense. By a legal source we understand the text of a document used in support of a particular legal argument. A legal source is in itself not a fact in the case. A legal source says something about the norms which should be used to decide the case. Sometimes the distinction between fact and legal source becomes blurred, as for example when a contract favoring a particular solution to a conflict is submitted as evidence in the case. It is only the document text itself that we refer to as a legal source. Any particular interpretation or understanding of the document we do not take as the source.
The structure of legal sources can be said to have two dimensions. One dimension is made up of the hierarchical structure of source types, the other dimension is made up of different historical versions of the same document. We call the former the type dimension and the latter the time dimension.
Legal sources are, generally speaking, structured as a hierarchy of different types. A hierarchy in this connection can be described as a structure where authority, while formally resting at the top, is delegated downwards in the structure. Communication lines are generally vertical, and very little horizontal communication takes place in the structure.
A legal source is located in the hierarchy according to the authority of its issuing body. Thus a description of the hierarchical structure of legal sources more or less follows the lines of authority found in the social legal structure. Legal sources issued by bodies at the same level of authority are usually regarded as belonging to the same source type. There have been several attempts at classifying Norwegian legal sources under a few general type headings, see for example Andenæs/Kvamme (1969:19), Eckhoff (1971:15), and Bing/Harvold (1973:18). In Fig. 10/1 we have utilized these attempts, but adapted the results according to our text-oriented interpretation of "legal source".
The diagram is, of course, a simplification of reality. While it does reflect the lines of authority in the legal structure, it tells us very little about the relative weights (ranking) the various source types are given in real decision-making situations.
The diagram does, however, tell us something about coverage. Implicit in the hierarchical structure of legal source types is a duty on the part of the decision-maker to examine also the relevant sources of a higher type than the document he is currently considering. If documents of different types are in conflict or differ in important respects, the decision-maker must discuss and resolve the differences. Owing to the appeal system, the decision will usually conform to the source of the higher type, although there is no explicit norm on this point, see Eckhoff (1971:270-306). There is no similar mechanism inducing the decision-maker to consider also documents of lower or parallel source types. It may be considered good practice to do so, but failure in this respect will not normally render the decision invalid.
Fig. 10/1. The hierarchy of legal source types.
In addition to the main hierarchical structure just described, the structure of legal sources has a time dimension.
Laws that have been replaced or amended are often still applicable to acts committed at the time they were in force. The most striking examples are perhaps found in the field of tax law, where the high rate of amendments has led to many a case being prosecuted according to earlier versions of the tax code, see Føyen/Harboe/Lie (1973: 56).
In these cases the data base must include different versions of what is basically the same document. In terms of Fig. 10/1 we can think of earlier historical versions of a document as extending vertically down from the current one.
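A data base holding several historical versions of the same document must be able to return the version that was in force on a given date. The following is a minimal sketch under assumed field names; the provision, its texts, and its dates are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Version:
    text: str
    in_force_from: date
    in_force_until: Optional[date]  # None means still in force


def version_in_force(versions, on):
    """Return the version that was in force on the given date, if any."""
    for version in versions:
        ended = version.in_force_until
        if version.in_force_from <= on and (ended is None or on <= ended):
            return version
    return None


# Hypothetical versions of one provision; dates and texts are invented.
tax_provision = [
    Version("rate 30 per cent", date(1970, 1, 1), date(1974, 12, 31)),
    Version("rate 35 per cent", date(1975, 1, 1), None),
]
print(version_in_force(tax_provision, date(1973, 6, 1)).text)  # rate 30 per cent
```

An act committed in 1973 is thus matched with the earlier version, even though a later version is the current one.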
The decision as to what historical versions of a document to include in the data base is part of the coverage problem. The decision will often turn out to be difficult. Since different versions of the same document fall outside the main hierarchical structure of document types, this structure is of no help to the decision-maker. It may also be difficult for him to foresee the conflicts which may require consultation of the old document versions. Practically speaking, however, the problem is not so great as it may seem in principle. There are only a few areas of the law which give rise to a large number of conflicts and which are, at the same time, continually updated. Taxation and social security legislation fall into this category, and the future may, of course, bring further examples.
A large and heterogeneous data base is not only costly to update and costly to search, but it is also inefficient in terms of precision. The user may retrieve irrelevant documents which would have been avoided if the search had been limited to only a part of the total data base.
The capability of limiting the search to a well-defined subpart of the data base is useful also in another respect. An automated retrieval process is usually not transparent in the same sense as a traditional manual search. It does not give the user the same immediate sense of what he is looking at as he has when going through index cards, picking books off shelves, and browsing in a library.
This lack of transparency is not significant so long as the data base consists of documents of a single type, for example supreme court decisions or articles from a given professional journal. But when the
[Page 168 ]
data base consists of different types of documents, there may be a clear need for dividing the data base into separate segments, in order to give the user the option of excluding irrelevant segments.
A data base may be segmented in different ways. One method is to use several different physical search files. Another method is to use just one search file, but impose on the documents a logical structure which can be used for the purpose of carving out pieces of the data base.
One way to impose a logical structure on the data base is to create a linked structure consisting of formatted fields holding data like document identification, author, dates, and so on. Parts of the data base may then be carved out on the basis of the data stored in these fields. If one of the fields contains dates, the user may for example select all cases decided during a given period.
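As a minimal sketch of segmentation by formatted fields, assume each document carries a small record of fields alongside its text. The field names (`doc_id`, `author`, `decided`) and the sample records are our own inventions for the purpose of illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Record:
    """Formatted fields stored for one document (hypothetical layout)."""
    doc_id: str
    author: str
    decided: date

records = [
    Record("case-1", "court A", date(1975, 11, 3)),
    Record("case-2", "court A", date(1976, 5, 17)),
    Record("case-3", "court B", date(1977, 2, 1)),
]

def decided_between(recs, start, end):
    """Carve out the documents whose date field lies in [start, end]."""
    return [r.doc_id for r in recs if start <= r.decided <= end]

segment = decided_between(records, date(1976, 1, 1), date(1976, 12, 31))
```

The text search proper would then be limited to the documents in `segment`.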
The use of formatted fields means establishing a new data structure in addition to the so-called inverted file structure used for ordinary text retrieval. Creating a data structure consisting of formatted fields is not always necessary, since much of the same effect can be achieved by adding prefixes to the words which would otherwise be stored in the formatted fields. Words having a common prefix can be uniquely identified as belonging to the same class. Suppose for example that we want to retrieve all documents published in a period defined by two given dates. This is not easily accomplished in a traditional text retrieval system, since a date by itself may look pretty much like any other number. However, if we give all dates a standard format and prefix them with a unique code (e.g. date:761201), all documents published in, for example, the year 1976 may be found by use of an appropriate query, such as the following: date:*.gt.date:751231.and.date:*.lt.date:770101 where "date:" is the prefix and * signifies truncation.
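The prefix technique can be sketched as follows. The toy inverted file and the function `prefixed_range` are assumptions made for illustration; they are not the query processor of any particular system. Because all dates share one fixed-width format (here YYMMDD, which orders correctly only within a single century), a plain string comparison is enough to evaluate the two bounds of the query.

```python
# Toy inverted file: term -> set of documents containing the term.
inverted = {
    "date:751130": {"doc1"},
    "date:761201": {"doc2"},
    "date:760315": {"doc3"},
    "date:770101": {"doc4"},
    "761201": {"doc5"},  # a bare number, not recognisable as a date
}

def prefixed_range(index, prefix, low, high):
    """Documents holding a term prefix+value with low < value < high."""
    hits = set()
    for term, docs in index.items():
        if term.startswith(prefix):
            value = term[len(prefix):]
            if low < value < high:  # fixed-width strings compare like numbers
                hits |= docs
    return hits

# Counterpart of: date:*.gt.date:751231.and.date:*.lt.date:770101
docs_1976 = prefixed_range(inverted, "date:", "751231", "770101")
```

Note that `doc5` is not retrieved, since its number lacks the prefix that marks it as a date.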
Data base segmentation is not the only means by which part of the data base may be carved out at the start of a search. Most notable of the alternative approaches is the so-called cluster generation, see Salton (1971: 223-304). Through this technique documents are clustered according to their syntactic similarity. The technique is probably most effective in situations characterized by very large data bases consisting of rather homogeneous documents. In situations of this kind manual segmentation may be both difficult and ineffective.
[Page 169 ]
Any retrieval system, whether based on manual or automated methods, consists of basically two types of files. There is one file containing the documents themselves, and then there is a file containing the entry points to the documents. It is the last file which is actually used in the matching stage of the retrieval process, and we shall call it the search file. In the literature the file is also known by a variety of other names according to the method of document representation used.
Not surprisingly the file is known as the index file in systems where the documents are represented by index- or keywords.
In full-text systems the file is commonly called the inverted file. An inverted file contains the same words as the document file. The words, however, are organized differently in the two files. In the document file the words in a given document are found by looking up the document. In the inverted file the documents in which a given word occurs are found by looking up the word. In other words the two files are inverted versions of each other, hence the name.
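The relationship between the two files can be sketched in a few lines. The two sample documents are invented, and the word splitting is deliberately naive; a real system would also handle punctuation and word positions.

```python
from collections import defaultdict

# Document file: look up a document to find its words.
document_file = {
    "d1": "the court dismissed the appeal",
    "d2": "the appeal concerned tax",
}

def invert(docs):
    """Build the inverted file: look up a word to find its documents."""
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            inverted[word].add(doc_id)
    return dict(inverted)

inverted_file = invert(document_file)
```

Looking up "appeal" in `inverted_file` yields both documents, whereas "tax" yields only the second.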
In so-called vector systems the documents are represented, as might be expected, in a vector file. Each document is given a vector consisting of a number of terms equal to the number of different words in the data base. The order of the terms is the same in all vectors, but if a word does not occur in a document, the corresponding term is given a value of zero in the document vector.
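A small sketch of a vector file follows. The text above only fixes the value zero for absent words; using the number of occurrences as the value for words that are present is our own assumption, and other weightings are equally possible.

```python
documents = {
    "d1": "tax appeal tax",
    "d2": "appeal dismissed",
}

# One position per distinct word in the data base, in a fixed order.
vocabulary = sorted({w for text in documents.values() for w in text.split()})

def to_vector(text):
    """Occurrence count per vocabulary word; zero if the word is absent."""
    words = text.split()
    return [words.count(term) for term in vocabulary]

vector_file = {doc_id: to_vector(text) for doc_id, text in documents.items()}
```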
We thus see that the search file may be constructed according to different principles. But the essential function of the file, namely that of providing entry points to the documents, always remains the same.
The different ways of mapping or representing documents have been widely discussed in recent years and have been the subject of several investigations. Next we shall look at two methods of representing documents. In the first method documents are represented by manually assigned keywords. In the second method documents are represented by words occurring in the documents themselves, a process that can be performed automatically.
The process of representing a document by keywords is usually referred to as indexing. There are two main reasons for indexing a document.
The primary reason is one of costs. In many situations indexing is
[Page 170 ]
simply the cheapest way of establishing a retrieval system. In most libraries for example, documents are not available in machine-readable form, and the manual process of indexing is the only practical way of creating a search file. Writing abstracts or retyping documents for this purpose will be prohibitively expensive when the document collection is large.
Cost, however, is not the only reason for indexing. Indexing is sometimes also used in systems where the documents are available in machine-readable form. It is true that the reason for indexing may still be partly economic - an index search file is normally shorter and thus cheaper to store and search than a search file based on abstracts or full texts. However, the main motive is probably a desire to add information to the documents. An important characteristic of the document may only be implied in the text, but can be explicitly expressed by the use of a keyword. In addition the use of keywords also provides the means of classifying documents according to a systematic and controlled vocabulary, see Lancaster/Fayen (1973: 244-262).
Experimental tests indicate, however, that indexing is, on the average, less effective than text representation based on abstracts or full texts. Salton/Lesk (1968) report that representation by document titles, which is similar to representation by keywords, is less effective than representation by abstract or full text. Cleverdon (1967) had arrived at seemingly corresponding results, although his experimental setup was different. Cleverdon found that single words selected from the texts of the documents provided, on the average, the best document representation. Representation by abstracts or controlled vocabularies proved less effective. See sections 11.2.1 and 11.2.2.
The economic benefits of indexing are thus bought at the expense of a loss in retrieval performance. Whether or not this loss is acceptable depends mainly on the situation in which the retrieval system functions. In some typical library situations, for example, the user wants to retrieve documents already known to him. In these cases there is little need for entry points in addition to bibliographical data. In section 9.5.1 we characterized a search of this kind as fact retrieval. In other typical library situations the user is familiar with the subject on which he is searching, but does not know all the documents containing relevant material. In these situations, which we earlier characterized as reference retrieval, the information loss resulting from indexing may be unacceptable.
Legal retrieval can be characterized as reference retrieval most of the
[Page 171 ]
time. Typically, a case is presented to a lawyer as a set of facts. On the basis of these facts the lawyer has to find the relevant legal sources and prepare his argument on an interpretation of these sources as they apply to the case. It can readily be seen why indexing may prove inadequate in situations of this type. Indexing is a transformation process in which the original words of the document may be kept, deleted, changed, or added. Most of the words are in effect deleted. Words are kept, changed or added according to the subject areas which the indexer considers important at the time. This transformation process introduces an arbitrary element into the representation process. The subject areas uppermost in an indexer's mind at the time of the indexing may not correspond at all to the needs of unknown users, who may be removed from the indexer not only with respect to professional background, but sometimes also by a span of unknown years. We know that sometimes a subject, or even legal norms, will develop in directions which were unpredictable even a few years earlier. The indexer may thus really be faced with the impossible task of classifying a document under a subject heading which has not yet been defined, or which at least has not yet been associated with the content of the document.
By far the simplest way of representing a document is to base the search file exclusively on the words occurring in the document. We shall call this method text representation.
A document may be represented by its entire text or by smaller parts, as for example the conclusion or a summary. Systems where documents are represented by their entire texts are usually called full-text systems.
Even in these systems, however, the search file does not normally include all the words in the texts. The so-called common words, i.e. words like "the", "a", "and", "for", "is", and so on, are more or less evenly distributed in all documents and therefore carry very little information for retrieval purposes. Consequently, they are normally excluded from the search file.
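The exclusion of common words can be sketched as a simple filter applied before the search file is built. The list below contains only the five words quoted above; an operational list would be considerably longer, and its exact content is a design decision of each system.

```python
# Common words quoted in the text; a real list would be much longer.
COMMON_WORDS = {"the", "a", "and", "for", "is"}

def search_terms(text):
    """Words of a text that would be entered into the search file."""
    return [w for w in text.lower().split() if w not in COMMON_WORDS]

terms = search_terms("The appeal is dismissed and the costs are awarded")
```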
Text representation does not require any kind of manual processing at all. It does require, however, that the documents are available in machine-readable form. The method is therefore especially well suited to situations where documents are "captured at the source", i.e. where documents are made available in machine-readable form as part of a printing or typing process. Capture at the source is not possible in the case
[Page 172 ]
of historical material, and retyping or punching of the material is then a necessity. Punching of even vast amounts of old material may be justified. In order to create the data base of LEXIS, the legal retrieval system operated by the Mead Data Corporation, three billion characters had been punched at the end of 1973, cfr. above at section 6.4.
The advantages of text representation are generally speaking the same as the disadvantages of keyword representation. A document represented by its text can be retrieved on the basis of any word occurring in the text. Therefore text representation will normally provide more entry points to a text than indexing. And, generally speaking, a lot of entry points is an advantage, since the retrieval process can be described as a game of second-guessing either the author or the indexer of the documents. In order to retrieve a document containing a given concept, the user must include in his query the exact words which the author, or in the case of keyword representation, the indexer, used to represent the concept. Unless the user is familiar with the indexer's way of thinking, it may be easier for him to conjecture the word-use of the author.
However, text representation does have drawbacks of its own. First of all it normally requires a relatively large amount of storage space compared to keyword representation. In addition the method may make retrieval difficult precisely because the documents are represented only by words occurring in the documents.
A concept will not always be expressed explicitly in a text; sometimes it is only implied by the overall meaning of the presented argument. For example "out-of-town living expenses" is an important concept in many tax cases, but, at least in a collection of decisions by the Swedish governmental courts, we found that the concept was not always expressed explicitly in relevant documents. Instead it was implied by the factual descriptions, for example by the description of the hotels and restaurants where the salesman had spent his week away from home.
Even when a concept is explicitly expressed, it can sometimes be extremely difficult to second-guess the author as to the word used to express the concept. Very high or low levels of specificity may be involved. Examples are abundant. "Agricultural inventory" expressed as a refrigerator, and "dangerous article" expressed as a slide projector are just two examples of high specificity. See Bing/Harvold (1974: 108-111).
[Page 173 ]
Some of the most important characteristics of retrieval performance can be described by the recall and precision ratios. Similarly, important aspects of query construction and search strategies can be described in terms of so-called recall and precision devices - retrieval devices aimed at improving recall and precision respectively, see for example Lancaster (1968: 85). In the following we shall make use of a somewhat different approach where the emphasis is put on the role of the output quantity as a determining factor of retrieval performance.
Generally speaking, recall can only be improved by increasing the number of retrieved documents, and an increased number of documents normally makes for a drop in precision, even though there is an important exception, which will be discussed in section 10.4.4. Similarly, precision is normally improved by reducing the number of documents, a result that tends to reduce recall. Thus since both recall and precision devices are mainly functions of the size of the retrieved document set, they are, from the point of view of query construction, two sides of the same coin.
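The trade-off can be illustrated with the definitions given earlier, recall = a / (a + b) and precision = a / (a + d), where a is the number of relevant documents retrieved, b the relevant documents not retrieved, and d the retrieved documents that are not relevant. The document sets below are invented for the purpose of the example.

```python
def recall_precision(retrieved, relevant):
    """Recall and precision of a retrieved set, given the relevant set."""
    a = len(retrieved & relevant)   # relevant and retrieved
    b = len(relevant - retrieved)   # relevant but not retrieved
    d = len(retrieved - relevant)   # retrieved but not relevant
    recall = a / (a + b) if a + b else 0.0
    precision = a / (a + d) if a + d else 0.0
    return recall, precision

relevant = {"d1", "d2", "d3", "d4"}
narrow = {"d1", "d2"}                           # small output
broad = {"d1", "d2", "d3", "d5", "d6", "d7"}    # larger output

narrow_result = recall_precision(narrow, relevant)
broad_result = recall_precision(broad, relevant)
```

In this invented case the larger output raises recall from 0.5 to 0.75, while precision drops from 1.0 to 0.5.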
The process of formulating a query can be described as striking a balance between the semantic demands inherent in the question and the syntactic limits inherent in the matching function, see section 10.5.5. The most basic problems encountered in formulating queries, however, are common to most retrieval systems, regardless of the type of matching function used. These are the problems connected with transforming the information in the question to syntactic criteria which can be interpreted by the matching function.
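Essentially the question transformation process consists of two steps:
(1) Question analysis: identifying the conditions necessary for relevance.
(2) Term specification: specifying the terms which will represent each condition.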
It is convenient to put a common label on terms representing the same necessary condition. In the following we shall call such a class of terms either a "conceptor" or simply a "class".
A question is normally made up of separate ideas or concepts, each of which is necessary in order that the question may retain its unique
[Page 174 ]
character. Identifying the necessary conditions of a given question may superficially seem easy enough, but often turns out to be a tricky business indeed. The process involved is best illustrated by an example.
Suppose that we are interested in the legal norms affecting the relationship between parent and child. A first attempt at identification of the necessary conditions might yield: "legal norms", "affecting", "relationship", "parent", and "child".
A closer look at these concepts may lead us to eliminate the concept of "legal norms" as superfluous, since we are dealing with a data base consisting of legal sources to begin with. The two concepts "affecting" and "relationship" are also very general in this type of data base; probably so general that they are not useful as search criteria either. This leaves us with the concepts of "child" and "parent". The concepts of "child" and "parent" do not by themselves define the question. Obviously they may be part of many other questions too. In this case, as in many others, it is difficult to give the question an exhaustive and, at the same time, meaningful representation.
Suppose we have established that we want to base our search on the concepts of "parent" and "child". It should be emphasized that we are talking about concepts. The specific words "parent" and "child" are only used to suggest the concepts. It should also be emphasized that concepts normally are fuzzy around the edges. Whether or not a given concept is present in the user's mind depends not only on the direct word impulses he receives, but also on his lines of thought and on the associations he makes.
Take for example "parent" and "child". These two concepts can very well be used completely independently of each other. "He behaved like a child" will not usually invoke the association of a parent. In other contexts the two concepts imply each other. If we are dealing with parent-child relationships, any talk of parent necessarily implies a child, and vice versa. Cfr. above at section 1.2.9. This is an important point to make in connection with document retrieval. Whether or not a concept is recognized in a document will often depend on one's lines of thought and on one's point of view. "Parent" might imply "child", or it might not. In some cases both "parent" and "child" may be implied by still another
[Page 175 ]
concept, as for example the concept of "a statute protecting the rights of minors".
The above example illustrates the importance of analyzing a question in terms of concepts before embarking on a search. The conditions essential to a question may shift somewhat, depending on the conceptual level on which the user is thinking, and the success of a search will depend upon the user's ability to see his problem on the same conceptual level as the author's. It is therefore important that the user always strives to adopt different viewpoints or angles from which to analyze the question.
Once the question has been analyzed and the conditions necessary for relevance identified, the user can start to specify the terms which will represent the conditions. Of course in practice it is difficult to separate the processes of question analysis and term specification. Very often the two processes will be performed simultaneously. This is unimportant as long as the user is able to make a proper analysis of the question.
Most of the literature on search strategies concerns the problems of term specification. System functions aimed at aiding the user have received particular attention.
A term will normally be an alphabetical word or phrase. In principle it may be any character string which can be matched in the data base. A term used as a search criterion has an entity all of its own. It is ambiguous in the sense that the user does not know all the contexts in which the term may appear in the data base. The user will thus often feel a need of specifying the context in which he is using the term. In index systems limited contexts may be specified by the use of so-called links and roles. A link specifies the combination of two independent concepts into a more complex concept, e.g. testing of cars, testing of bombs. A role defines one of several contextual meanings of a word, e.g. car (sales product), car (source of pollution).
In text systems links can normally be defined by specifying that two words must occur within a certain distance of each other (as measured by words). A so-called positional operator is used for this purpose. Generally speaking, roles can also be handled in text systems by demanding the co-occurrence of both the main object (e.g. car) and its context (e.g. sales product).
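A positional operator can be sketched as a test on word positions. The function name `within` and its argument order are our own; actual systems each have their own operator syntax. Distance is counted in words, as described above.

```python
def within(text, first, second, max_distance):
    """True if the two words occur at most max_distance words apart."""
    words = text.lower().split()
    pos_first = [i for i, w in enumerate(words) if w == first]
    pos_second = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= max_distance
               for i in pos_first for j in pos_second)

linked = within("testing of cars on public roads", "testing", "cars", 2)
unlinked = within("testing of bombs near parked cars", "testing", "cars", 2)
```

The first text satisfies the link "testing of cars"; the second contains both words, but too far apart to suggest that the cars are what is being tested.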
[Page 176 ]
Natural language is immensely varied. A given subject may be treated and expressed in a large number of ways, depending on the author and the context in which he is writing. Specifying terms for a given idea or concept may sometimes seem like an almost impossibly difficult task. In order to assist the user in this task, several different methods and techniques have been developed. We shall now take a closer look at the most important of these.
(1) Browsing through the documents already retrieved is the oldest, simplest, and perhaps the most effective way of getting fresh ideas with respect to new terms to be included in the query.
The effectiveness of browsing may mainly be due to the immediate feeling of vocabulary which it imparts to the user. Browsing is relatively time-consuming, however, and can only be used in situations where on-line systems are available. And, of course, browsing requires that the user already has obtained one search result.
(2) Truncation. The grammatical variations of a word are relatively unimportant to the search techniques in use today. What usually counts is the stem of the word. Prefixes and suffixes can be regarded as arbitrary functions of style. The method of truncation provides the possibility of disregarding word endings for matching purposes. Suppose the user specifies the term "car*". He will then not only retrieve documents containing "car" and "cars", but, in addition, all documents containing words having the leading letters "car". If the data base covers broad subject areas, our truncating user may be in for a surprise. He may, for example, find himself confronted with documents not containing the word "car" but words like "carnivore", "carrier", "carrot", "cartel", and "cartridge" just to mention a few. However, this example should not detract from the fact that truncation, when it is used with care, can be a very powerful and effective search technique.
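The surprise awaiting the truncating user is easily reproduced. The sketch below treats a trailing asterisk as right-hand truncation and matches the term against a small invented word list; the function name `truncate_match` is hypothetical.

```python
vocabulary = ["car", "cars", "carnivore", "carrier", "carrot",
              "cartel", "cartridge", "truck"]

def truncate_match(term, words):
    """Words matched by a term; a trailing '*' disregards word endings."""
    if term.endswith("*"):
        stem = term[:-1]
        return [w for w in words if w.startswith(stem)]
    return [w for w in words if w == term]

exact = truncate_match("car", vocabulary)
truncated = truncate_match("car*", vocabulary)
```

Here `exact` contains only "car", while `truncated` contains every word in the list except "truck".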
How effective truncation really is has, as far as we know, never fully been tested. Salton/Lesk (1968) report on a comparison of the suffix "s" dictionary with a word stem dictionary. The use of the former dictionary is tantamount to truncating the terminal "s"; the use of the latter is tantamount to truncating all endings beyond the word stem. The results were inconclusive in the sense that both methods came out on top in the comparison, depending on the data base used. The effect on performance of truncating terms as compared to leaving them in their original form was not tested in this experiment.
[Page 177 ]
Such an experiment does involve several methodological difficulties, but was nevertheless undertaken as part of the NORIS program. One of the principal difficulties is to obtain results that are representative of the data base as a whole. In the NORIS experiment the analysis was not based on sample questions, but on an analysis of the total data base with the aim of grouping together those words which were considered synonyms independently of context. Context-dependent synonyms could not be included in the experiment, since questions that would have defined the contexts were not included in the experimental setup. Once the synonym groups were isolated, the effect of truncation could be measured by selecting and truncating a representative word within each group. Three texts were used. The longest text contained 2 913 words. The second longest text contained 2 273 words. Since the last text was supposed to be very short, average values based on 40 representative short texts were calculated. The average length was 72.8 words.
The main results of the experiment were quite encouraging. The results obtained for the long text showed that an average of 75 per cent of the words in the synonym groups were matched by the truncated terms, while an average of only 20 per cent of the words matched by the truncated terms belonged outside the synonym groups. The corresponding values for the two shorter texts were 94 per cent and 5 per cent for the second text and 96 per cent and 1 per cent for the very short text, see Harvold (1974). The extremely good results in the last case indicate primarily that very short texts contain relatively few synonyms.
The results from this experiment can also be used to illustrate the effect of truncation on performance, as measured by recall and precision. The results obtained are derived by means of a retrieval performance model developed during the NORIS program.
In fig. 10/2 the effect of truncation is shown as a function of both data base size and average document length. The truncation effect is indicated by the average, relative change in recall and precision resulting from the truncation. We note that the percentage increase in recall is much greater than the corresponding decrease in precision. We also note that truncation is relatively more effective the larger the data base and the shorter the average document length. The model predicts, for example, that in a data base consisting of about 30 000 documents, where each document is about 60 words in length, recall can, on the average, be approximately doubled by the use of truncation, while precision will only suffer a decrease of about 25 per cent. The results suggest that truncation can be
[Page 178 ]
Fig. 10/2
a powerful aid to the user, even though the exact values given above should not be taken too seriously. They do not pretend to be more than they are - average values extrapolated from a limited empirical material by the use of a simplified model of reality.
The results also presuppose that the user is able to truncate in a relatively sensible way. In this connection it can be pointed out that truncation mistakes are normally easily corrected in on-line systems. They will usually show up as a sudden increase in the number of retrieved documents, or they will be detected in the browsing stage of the process.
Suffix truncation, or so-called right-hand truncation, is by far the most useful type of truncation in most languages. In certain languages, like German and the Scandinavian languages, prefix or left-hand truncation may also be quite useful. These languages are characterized by the construction of complex words through a concatenation of simpler words. In addition, German has its special prefix problem. A completely general masking function, where any part or parts of a word can be discarded for matching purposes, may be the ultimate solution in the case
[Page 179 ]
of these languages. Systems having such a feature will be letter-oriented in contrast to the present word-oriented systems.
(3) Thesauri. A thesaurus is a synonym dictionary. It defines different synonym groups, each of which may be activated by the specification of any one member of the group. Synonym dictionaries are usually structured as hierarchies, although they may also be given a largely unsystematic network-type structure. Hierarchical dictionaries are manually constructed, but it is possible to construct network dictionaries by automatic methods, for example by linking terms which co-occur often in sample documents assumed to be typical for the given subject area, see Salton (1971:132-141), or by linking terms co-occurring often in queries.
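The activation of a synonym group can be sketched as follows. The groups are invented and unstructured; the sketch shows neither a hierarchy nor the automatic construction of links from co-occurrence, and the function name `expand` is our own.

```python
# Invented synonym groups; a real thesaurus would be far larger
# and would usually carry a hierarchical or network structure.
synonym_groups = [
    {"car", "automobile", "vehicle"},
    {"child", "minor", "infant"},
]

def expand(term, groups):
    """Return the whole synonym group activated by any one of its members."""
    for group in groups:
        if term in group:
            return set(group)
    return {term}  # terms outside the thesaurus stand alone

activated = expand("minor", synonym_groups)
```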
Opinions are divided concerning the practical use of thesauri. Most empirical investigations have given negative results. Saracevic (1970:677) reports on the usefulness of a thesaurus constructed for a data base consisting of 600 journal papers in the field of tropical diseases. Use of this thesaurus did not result in any significant improvement of performance. The experiences of Salton/Lesk (1968) are only slightly more encouraging. Using a thesaurus constructed for a data base consisting of 780 abstracts of documents in computer literature, they reported that performance was somewhat improved when the query was expanded by higher up concepts (parents), but no improvement was observed in the cases of sons, brothers, or cross-references.
Some of the reasons for this poor track record are not hard to find. Since a thesaurus is a general tool, it must be independent of any one given context. It can only include context-independent synonyms, which in many ways are the least interesting synonyms from a retrieval point of view. Of greater interest to the user are the context-dependent synonyms, words like "water-well", "refrigerator", and "grain", which can all be synonyms in the context of taxation of agricultural inventory, but which may not be related in any other contexts.
Many, maybe most, of the context-independent synonyms are grammatical versions of the same form. Some systems, like FAIR developed by IBM Austria (cfr. above at section 7.2.1), limit their thesaurus to a grammatical generator of word forms. In order to handle the context-dependent synonyms they provide the user with system functions for constructing his own private thesaurus. Functions of this kind may not only include the tools for defining and deleting links between terms, but may also include proper synonym lists from which the user can "mark" (for example by the use of a light-pen) the terms he finds interesting and
[Page 180 ]
which he wants to include in his query. Even an ordinary alphabetical list of all the distinct words in the data base may prove a useful source of inspiration to the user.
Before we leave this section we have to mention two other functions, which are characterized both by an elegant simplicity and a large variety of uses. These are the query-saving function and the so-called macro function. The facility for saving queries may be used to construct special synonym queries. These may then later be included in new queries simply by way of reference. The essential feature of a macro is not that it can be used to save terms, but that it can be used to save the logical structure of a query. A macro is specified as an ordinary query with the exception that the terms are given as parameters. Actual terms are filled in later on when the macro is referenced. Macros were first implemented on the STATUS 1 system, see A.E.R.E (1975).
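The idea of a macro can be sketched as a saved query structure with numbered parameters. This is only an analogy in Python and makes no claim about how macros were written in STATUS or any other system. The ".or." operator and the parentheses are assumed by analogy with the ".and." and ".gt." operators shown earlier, and `make_macro` is a hypothetical helper.

```python
def make_macro(template):
    """Save the logical structure of a query; terms are parameters."""
    def reference(*terms):
        # Actual terms are filled in when the macro is referenced.
        return template.format(*terms)
    return reference

# Structure: either of two synonyms, published after a given date.
either_after = make_macro("({0}.or.{1}).and.date:*.gt.date:{2}")

query = either_after("parent", "guardian", "751231")
```

The same saved structure can be referenced again with other terms, for example `either_after("child", "minor", "731231")`.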
Earlier we mentioned that, generally speaking, recall and precision are functions of the number of retrieved documents. Controlling the quantity of output is therefore the key to controlling performance.
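In reference retrieval at least, it is convenient to regard a query as being made up of classes of terms, where each class represents a necessary condition of the question. A query made up of classes can be expanded in three ways:
(1) by adding terms at the same generic level to a class (term expansion);
(2) by adding terms of a higher generic level to a class (decreasing specificity);
(3) by relaxing the class co-occurrence requirement (decreasing exhaustivity).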
A class is expanded by adding terms that are synonyms at the same generic level as the terms already specified in the class. "Synonyms" is here taken in a broad sense and includes not only context-independent synonyms (the true synonyms), but also context-dependent synonyms. The interesting thing about term expansion is that, on the average, it will not affect precision, even though recall is increased. Terms representing a condition at the same generic level are likely to be equally representative of this condition. The probability of retrieving a relevant document is therefore not changed as terms are added to the class. Consequently the ratio of relevant and irrelevant retrieved documents will, on the average, remain constant. See Bing/Harvold (1974: 64-71) for an empirical illustration.
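The argument can be made explicit with a little arithmetic. The sketch below assumes, purely for illustration, that the terms of a class retrieve disjoint sets of documents and that every retrieved document is relevant with the same probability. The symbols n_i, p, and R are ours and are not taken from the cited study.

```latex
% n_i : number of documents retrieved by term i of the class (assumed disjoint)
% p   : probability that a retrieved document is relevant (same for every term)
% R   : total number of relevant documents in the data base
\[
\text{precision}_k
  = \frac{\sum_{i=1}^{k} p\,n_i}{\sum_{i=1}^{k} n_i}
  = p,
\qquad
\text{recall}_k
  = \frac{p \sum_{i=1}^{k} n_i}{R}.
\]
```

Under these assumptions, adding one more term to the class leaves the expected precision at p, while the expected recall grows in proportion to the number of documents the new term retrieves.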
[Page 181 ]
Term expansion is thus a very attractive method of increasing recall, since the normal penalty in the way of a loss in precision is avoided. Several retrieval experiments suggest, however, that no amount of term expansion will produce 100 per cent recall as long as the class co-occurrence requirement is not reduced. Saracevic (1970 b:678) considers this as one of his most significant and surprising results. Corresponding results were obtained in the NORIS (8) project, see Bing/Harvold (1974:80) and Bing/Harvold/Kjønstad/Stabell (1976). The same result is also implied by a model of text retrieval developed by Harvold (1976). According to this model a complete expansion of all classes is necessary in order to obtain 100 per cent recall. The only possible way to obtain 100 per cent recall without complete expansion is in fact to relax the class co-occurrence requirement. It goes without saying that complete expansion of a class is not normally practical, since it involves identifying all words used in the data base to express the given concept or idea.
Output can also be increased by adding new terms of a higher generic level. This will decrease the specificity of the class. Decreasing specificity is primarily a useful technique in systems using a controlled and hierarchical structured vocabulary of index terms. However, the technique may also prove useful in text systems. Legal documents, for example, often have a heading or a summary describing the important points or aspects of the document in more general terms than the main text itself. Decreasing the specificity of the terms will normally cause an increase in recall accompanied by a decrease in precision.
The last method by which output can be increased is to relax the class co-occurrence requirement by dropping one or more classes from the query. Dropping a class implies that the query no longer exhaustively describes the question. Hence the method is often referred to as decreasing the exhaustivity of the query. Reducing the class co-occurrence level will normally both increase recall and decrease precision. Recall will increase because at least some of the new documents will contain all the necessary conditions in the correct context and thus be relevant, even though not all the conditions are represented in the query. Of course precision will fall because the query no longer represents all the conditions of the question.
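The expansion methods can be made concrete with a small sketch. It assumes, purely for illustration, that a query is held as a list of classes, each class a set of alternative terms, and that a document is retrieved only when every class is represented in it (the class co-occurrence requirement). The toy documents and the function names are hypothetical. Adding a term of a higher generic level would use the same `expand_class` operation as term expansion; only the choice of term differs.

```python
# A query is a list of classes; each class is a set of alternative terms.
# A document matches only when every class has at least one term in it.

def matches(query, doc_words):
    return all(cls & doc_words for cls in query)

def search(query, docs):
    return sorted(doc_id for doc_id, words in docs.items() if matches(query, words))

def expand_class(query, index, new_terms):
    """Term expansion: add further terms to one class of the query."""
    return [cls | set(new_terms) if i == index else cls
            for i, cls in enumerate(query)]

def drop_class(query, index):
    """Decrease exhaustivity: remove one class from the query."""
    return [cls for i, cls in enumerate(query) if i != index]

# Hypothetical toy data base: document id -> set of words.
docs = {
    "d1": {"car", "collision", "insurance"},
    "d2": {"automobile", "collision", "insurance"},
    "d3": {"car", "insurance"},
    "d4": {"vehicle", "accident", "insurance"},
}
query = [{"car"}, {"collision"}, {"insurance"}]

base = search(query, docs)
expanded = search(expand_class(query, 0, ["automobile"]), docs)
relaxed = search(drop_class(query, 1), docs)
```

In this toy case term expansion adds `d2`, while dropping the second class adds `d3`, a document in which one of the original conditions is not represented at all.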
Because of its effect on recall and precision, control of the class co-occurrence level has the potential of being an excellent ranking device. We shall return to this possibility later in our discussion of nearness functions.
[Page 182 ]
Anyone taking a stand against system flexibility is not likely to attract a huge following. In fact flexibility is a goal usually taken for granted and not in need of any further elaboration or justification. The universal acceptance of flexibility is probably partly due to the vagueness of the concept, and partly due to the demonstrated fact that flexibility in the sense of adaptability is an extremely useful quality in a changing world.
Flexibility can be used in connection with various aspects of a retrieval system. We shall use the concept in two different connections. We shall take flexibility to mean both adaptability to different types of problems and adaptability to users with different backgrounds and experiences.
The simplest type of system for the totally inexperienced user to operate is a system with minimal log-on procedures, with no system command language and with natural language query capabilities. The user sits down, presses a special key on the keyboard, which connects him with the system, and proceeds to specify his question in natural language. User-congenial systems of this type are not only possible but have, with slight modifications, been developed, see for example the description of CONTEXT in Vischer (1971), above at section 7.6.
It is especially the ability to process natural language which makes the systems flexible in the sense that both experienced and inexperienced users can operate them. However, queries written in natural language provide relatively little information to the machine. The queries are normally processed by disregarding common words and conducting the search on the remaining significant words. In a natural language query the user has no means of making distinctions between the significant words. Words that are repeated in the query may be given greater weight by the machine. But normally queries are short compared to most texts, and if a word should appear more than once in a query, it will usually be due to accidental features of style or to common-type words that should, in any case, have been deleted from the query by a stop-list, see Bing/Harvold/Kjønstad/Stabell (1976). Once the user deliberately repeats a word in order to give it greater weight, we are no longer dealing with a true natural language query.
Thus, as long as machines do not have language processors with capabilities approaching those of humans, natural language queries can
[Page 183 ]
only be interpreted as a command to retrieve all documents in which at least one of the query words occurs. In addition, the documents may be ranked according to the number of query words that co-occur in each document. This basic scheme may be varied by assigning different weights to words according to, for example, the frequencies of the words in the data base, or according to the frequencies of the words in the individual documents, or according to a combination of the two.
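The basic scheme can be sketched as follows. This is a minimal illustration only: the stop-list, the sample documents and the function names are assumptions, and the weighting variations mentioned above are left out.

```python
# Hypothetical stop-list of common words that are disregarded in the search.
STOP_WORDS = {"the", "a", "of", "for", "in", "is", "what"}

def significant_words(query_text):
    """Drop common words; the remaining words drive the search."""
    return {w for w in query_text.lower().split() if w not in STOP_WORDS}

def rank(query_text, docs):
    """Retrieve documents containing at least one query word, ranked by
    the number of distinct query words that co-occur in each document."""
    words = significant_words(query_text)
    scores = {doc_id: len(words & set(text.lower().split()))
              for doc_id, text in docs.items()}
    hits = [(doc_id, score) for doc_id, score in scores.items() if score > 0]
    # Highest score first; ties are broken by document id.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))

docs = {
    "d1": "breach of contract and fraud",
    "d2": "contract for the sale of land",
    "d3": "liability in tort",
}
result = rank("what is the liability for breach of contract", docs)
```

Note that the query gives the machine no way of telling which of the three significant words matters most to the user; all are treated alike.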
But in today's systems natural language is by no means the ultimate way in which to specify queries. Experience has shown that performance can be improved by the introduction of operators that explicitly define the relationships between the words in the query, see section 10.5.5.
The introduction of operators into the search language requires certain syntactic rules, which the user must master. Even if these rules are simple, the system has now become a little more difficult to operate for the inexperienced. Performance in the form of problem flexibility has been bought at the expense of performance in the form of user effort and user flexibility. In fact at our present level of knowledge, high performance, as measured by recall and precision, seems incompatible with ease of use, at least if we think in terms of a system which is general and not custom-made for special types of problems. Flexibility is thus somewhat of a paradox. As a general goal it remains an elusive dream, since high flexibility with respect to all aspects of the system seems incompatible with the maintenance of a high performance level.
The communication process between user and machine can be structured according to at least two different principles. Depending on the principle used, the dialogue will, from the point of view of the user, appear as either imperative or responsive.
The philosophy behind responsive dialogues is that even people with no special system experience or knowledge should be able to use the system. Thus a responsive dialogue requires very little initiative on the part of the user. He is guided along the retrieval process by prompts and questions from the system, and most of the time replies of "yes" or "no" will suffice to keep the system guide satisfied and busy. A responsive dialogue might run as illustrated below. The dialogue is based on commands used by STATUS 1, see A.E.R.E. (1975), although other systems, as for example STAIRS, might have been used.
Statements made by the system are italicized in the dialogue.
[Page 184 ]
.
.
.
question please
contract?
118 documents satisfies the question. Do you wish to list titles?
no
question please
contract .and. breach?
22 documents satisfies the question. Do you wish to list titles?
no
question please
contract .and. breach .and. fraud?
3 documents satisfies the question. Do you wish to list titles?
yes
.
.
.
Do you wish to read any of these documents?
yes
.
.
.
We note that a system using a responsive dialogue is user flexible. The system can be used by the experienced and the inexperienced alike. After a while, however, the typical user may tend to find a responsive dialogue somewhat long-winded, and may indeed welcome the change to an imperative dialogue.
The philosophy behind imperative dialogues is that the user has experience, that he knows what he is doing, and that he does not have to be led down the retrieval path by a somewhat dull system. An imperative system requires that the user has acquired mastery of the system command words. In the Norwegian version of STATUS (NOVA*STATUS) there are commands relating to data base selection, query formulation, browsing of titles and texts, query saving, and macro definitions.
Below our previous dialogue is rewritten as an imperative dialogue based on the commands of the NOVA*STATUS system. The dialogue illustrates a query-saving facility whereby earlier queries are inserted at the places where they are referenced in later queries.
[Page 185 ]
.
.
question
1: contract?
118 documents retrieved
question
2: .1..and. breach?
22 documents retrieved
question
3: .2..and. fraud?
3 documents retrieved
titles
.
.
.
read
.
.
We note that the dialogue has been shortened considerably. In environments where systems are in continuous use, the advantages of a shorter and more tidy dialogue may well exceed any drawbacks in connection with experience requirements.
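The query-saving facility amounts to textual substitution of earlier queries into later ones. The sketch below is an assumption about how such substitution could work, not a description of how NOVA*STATUS implements it: a reference is taken to be a query number between two periods, the inserted query is wrapped in parentheses, and the terminating question mark is left out.

```python
import re

def expand(query, saved):
    """Replace each reference of the form .N. by saved query N in parentheses."""
    return re.sub(r"\.(\d+)\.",
                  lambda m: "(" + saved[int(m.group(1))] + ")",
                  query)

saved = {}
# The three queries of the imperative dialogue, entered in order.
for number, text in [(1, "contract"),
                     (2, ".1..and. breach"),
                     (3, ".2..and. fraud")]:
    saved[number] = expand(text, saved)
```

After the loop, the third saved query is the fully expanded form of the three-class question that had to be retyped in full in the responsive dialogue.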
Most of the information passed on to the user by the retrieval system will regard either the quantity or the quality of output. Quantity indicators apply to the number of documents retrieved, and quality indicators apply to the relevance of the retrieved documents. The user engaged in fact retrieval will mostly be interested in quantity indicators. Suppose for example that the problem consists of finding all documents where the word "ombudsmann" occurs. By definition the quality of such a search is not of much interest, at least if we assume a correct data base and a correctly functioning system, see section 9.5.1. Of great interest, however, is the number of documents in which the word "ombudsmann" occurs.
The situation is quite the opposite in the case of reference retrieval. What the user really is after now is information about the quality of his search. It is only when such information is not directly available that quantity figures may be used as indirect indicators of performance quality.
[Page 186 ]
The importance of quantity indicators in situations characterized by reference retrieval depends, furthermore, on the type of matching function used. Indeed, if we are using nearness functions which have the ability to assign a unique rank to each document, it is not even meaningful to talk about the number of retrieved documents. The documents are, for all practical purposes, retrieved one at a time, and the user can stop the process at any point he wants. In the cases where a full order is not induced, but where the documents are grouped in ranksets, information on the number of documents in each rankset may be quite useful, however.
What the user really is after, in situations characterized by reference retrieval, is information on recall and precision. However, such information is not easy to come by. Normally a precision figure can be established by evaluating the retrieved documents. Since recall in part depends on the relevant documents in the unretrieved part of the data base, there is normally no practical way for the ordinary user to establish recall. It is possible, however, to estimate recall on the basis of two independently retrieved samples of relevant documents. If, for example, two searches, one manual and one automatic, are performed separately, the resulting two document sets will be independent. But such a procedure is time-consuming and not very practical in user-oriented situations.
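The text does not give the estimator itself. One common way of reasoning, assumed here for illustration, is the capture-recapture argument: if search B is independent of search A, the fraction of B's relevant documents that A also found estimates the recall of A. The document ids and the function name are hypothetical.

```python
def estimate_recall(found_a, found_b):
    """Estimate the recall of search A from its overlap with an
    independent search B. Both arguments are sets of relevant documents."""
    overlap = found_a & found_b
    if not overlap:
        raise ValueError("the two samples must share at least one document")
    recall_a = len(overlap) / len(found_b)
    # Implied estimate of the total number of relevant documents.
    total_relevant = len(found_a) * len(found_b) / len(overlap)
    return recall_a, total_relevant

manual = {"d1", "d2", "d3", "d4"}
automatic = {"d3", "d4", "d5", "d6", "d7", "d8"}
recall_auto, total = estimate_recall(automatic, manual)
```

Here the automatic search found half of the documents located by hand, so its recall is estimated at 0.5, which implies about twelve relevant documents in the data base. The estimate is only as good as the independence assumption.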
As part of the NORIS program, a method of estimating recall, based on assumed independent document sets obtained during the search itself, was developed. The method allows for recall-estimates to be calculated during the search. In the NORIS (8) II report the method was tested and found to yield statistically significant results. However, the experiment was limited in scale, and further development and testing is necessary in order to establish any potential, practical use of the method.
An important part of the feedback process is the ability to browse through the retrieved documents. In on-line systems, browsing not only has the purpose of informing the user regarding the question, but also provides information which can be used to restate the query in an improved form.
The way the results of a search are presented to the user is not the least important part of a retrieval system. The different design choices fall, broadly speaking, into two categories.
[Page 187 ]
The first category concerns the types of information which should be made available. Normally a retrieval system will provide the user with information on:
The second category concerns the ways in which the provided information should be presented. The most important media in this connection are:
The specific design solutions to these choices depend a great deal on the type of situation in which the retrieval system functions. In the following we shall mainly consider the situation of the typical lawyer.
Legal retrieval systems are usually made directly available to lawyers on an on-line basis. The lawyer is linked to the system through a terminal, which he operates himself. Of paramount importance to the lawyer are short response times. Terminals are therefore normally equipped with screens, often referred to as CRTs (cathode ray tubes) or VDUs (video display units). All communication between user and system is displayed on the screen. This information is normally lost once it is taken off the screen, but in some systems it is saved and can be redisplayed on a command from the user. There is usually no great need to save that part of the dialogue which concerns commands and feedback indicators. There may be a need to save queries, either for the duration of the session or permanently. In NOVA * STATUS, queries are saved automatically for the duration of the session, and they may also be saved permanently by use of the macro function, see section 10.4.3. The lawyer may want a permanent copy of the references to the documents retrieved in a particular search. This requires a printer that can be operated from the terminal. The best solution is to have a small printer located close to the terminal. An alternative and less costly solution is to use a printer located at the computer center itself. However, the last solution may be unacceptable because of the resulting time delays.
For feedback purposes the lawyer needs immediate access to the text of the retrieved documents. The texts have to be displayed on the screen for this purpose. In order to quickly assess the potential relevance of
[Page 188 ]
documents, the lawyer is primarily interested in the text surrounding the search word. There exist different methods of drawing the user's attention to this part of the text. Highlighting the search word is one such method. So-called focusing is another. When focusing is employed, the text containing the search word is placed in the center of the screen. Perhaps the best method of quickly identifying the context of the search words is to combine focusing and highlighting (KWIC format).
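A combined display of this kind can be sketched in a few lines. The sketch assumes that highlighting is done by upper-casing the search word and that focusing means placing it at a fixed column with a window of context on each side; the function name, the window width and the sample text are hypothetical.

```python
def kwic(text, word, width=30):
    """Return one line per occurrence of word, with the word placed at a
    fixed column (focusing) and upper-cased (highlighting)."""
    tokens = text.split()
    lines = []
    for i, token in enumerate(tokens):
        if token.strip(".,;:").lower() == word.lower():
            left = " ".join(tokens[:i])[-width:]       # context before
            right = " ".join(tokens[i + 1:])[:width]   # context after
            lines.append(f"{left:>{width}} {token.upper()} {right}")
    return lines

text = "The seller committed a breach of the contract. A second breach followed."
lines = kwic(text, "breach", width=20)
for line in lines:
    print(line)
```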
Often the user will want to browse through all the documents and select a few for more leisurely study later on. If printed copies of the documents are available, he may perform the latter task at his desk, using a reference list provided by the system. Otherwise he will have to display and read the documents on the screen.
A matching function compares query and documents and, on the basis of the comparison, assigns a formal relevance value to each document. Matching functions can be divided into three categories according to the principle used in the matching process. The three types of functions are identity functions, nearness functions, and snowball functions.
The purpose and usefulness of these function types are complementary. Below we shall look at each one in more detail.
In order to appreciate the difference between identity functions, on one hand, and nearness and snowball functions, on the other, it is useful to briefly refer back to our earlier discussion on fact retrieval and reference retrieval. As we remember, the general problem in fact retrieval was to retrieve well-defined and known pieces of information. These "pieces of information" might be a set of documents, for example "all documents containing the word computer". Or they might be more isolated data, for example "the number of supreme court decisions in 1976". In order to solve the fact retrieval problem an identity type function is needed.
Identity functions select documents having the exact attributes specified in the query. Each word specified in the query defines a set of documents having the common attribute of containing the word. The different document sets defined in this way are usually related to each
[Page 189 ]
other by the use of Boolean algebra. The basic concept of Boolean algebra is the concept of "class of objects". Boolean algebra, applied to document retrieval, describes the relationship between sets of documents on the basis of attributes these documents have, or do not have, in common. The basic operations of the algebra of classes are conjunction, disjunction and negation, which, in search languages based on Boolean algebra, correspond to the AND, OR, and NOT operators. The AND operator defines the set of documents which have both the specified attributes. The OR operator defines the set of documents which have either one or both of the specified attributes. The NOT operator defines the set of documents which do not have the specified attribute. For a further introduction to the properties of Boolean operators, see Becker/Hayes (1967: 335-343).
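The three operators map directly onto set operations. The following sketch uses a hypothetical three-document data base; the helper name `having` is an assumption made for the example.

```python
docs = {
    1: "contract breach damages",
    2: "contract fraud",
    3: "tort negligence damages",
}
all_docs = set(docs)

def having(word):
    """The set of documents that have the attribute of containing word."""
    return {doc_id for doc_id, text in docs.items() if word in text.split()}

both = having("contract") & having("damages")     # AND: conjunction
either = having("contract") | having("damages")   # OR: disjunction
without = all_docs - having("contract")           # NOT: negation
```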
We note that the Boolean operators only apply to document sets. Sometimes the user may be interested in attributes which concern subunits of the documents - for example he wants to retrieve documents where two given words occur next to each other. This cannot be accomplished by Boolean algebra as long as the user has no control over the definition of the document. The so-called positional operator may be used for this purpose, however. By using the positional operator, terms may be defined not only as attributes of documents, but also as attributes of the word position within the document. This makes it possible, for example, to specify the attribute that two given terms shall occur next to each other.
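A positional operator requires that word positions are recorded, not merely the presence of a word in a document. A minimal sketch, with hypothetical function names, of an adjacency test built on such positions:

```python
def positions(text):
    """Map each word to the list of positions at which it occurs."""
    index = {}
    for pos, word in enumerate(text.lower().split()):
        index.setdefault(word, []).append(pos)
    return index

def adjacent(text, first, second):
    """True if second occurs immediately after first somewhere in text."""
    idx = positions(text)
    following = set(idx.get(second, []))
    return any(p + 1 in following for p in idx.get(first, []))

next_to = adjacent("the supreme court held", "supreme", "court")
apart = adjacent("the court is supreme", "supreme", "court")
```

Both sample texts contain both words, so a Boolean AND would retrieve both; only the positional test distinguishes them.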
Nearness functions do not retrieve documents on the basis of identity, but on the basis of similarity to the attributes specified in the query. An identity function divides the data base into two groups, one consisting of retrieved documents, the other of non-retrieved documents. A nearness function may impose a full order on the documents, that is, it may assign a unique rank to each document in the data base, or it may impose a partial order whereby documents that have been assigned identical ranks are grouped together in common rank-sets.
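The difference between a full and a partial order can be illustrated by grouping documents with identical scores into rank-sets. The scores below are hypothetical; how they are computed is left open.

```python
from itertools import groupby

def ranksets(scores):
    """Group documents with identical scores; best rank-set first."""
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [[doc for doc, _ in group]
            for _, group in groupby(ordered, key=lambda kv: kv[1])]

scores = {"d1": 3, "d2": 1, "d3": 3, "d4": 2}
sets = ranksets(scores)
```

If every document had received a distinct score, each rank-set would hold a single document and the order would be full.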
Documents may be ranked according to both bibliographical criteria and criteria based on the texts of the documents. Bibliographical criteria include things like author (source type), date, geographic code, and so on. A collection of retrieved legal documents could, for example, be ranked according to the three criteria:
[Page 190 ]
Syntactic criteria include things like number of matched classes, number of matched terms, the frequency with which the matched terms occur in a document, the distribution of the terms, document length, and so on.
Nearness functions are primarily used in connection with reference retrieval. There are mainly two reasons for this.
The first reason is connected with the situation of the user. The user situation may vary greatly in reference retrieval, ranging from the need for a quick look at one or two relevant documents, to the need for a thorough reference file, including even documents of only minor interest. In addition the situation may depend on other factors, such as the experience of the user and his familiarity with the subject matter. The user may also adopt another view of his problem after reading a few retrieved documents. The relevance assessment may change as a consequence. In short, reference retrieval is characterized by a changing and dynamic environment, and a nearness function reflects this reality better than an identity function.
The second reason is connected with the difficulties of giving a reference type problem an adequate query representation. We discussed some of these difficulties earlier in section 10.4. It was pointed out that both empirical results and theoretical analysis emphasize the virtual impossibility of matching all the necessary conditions in all but a few of the relevant documents. In a practical situation the retrieved documents will have different probabilities of being relevant, depending on the number of conditions which have been matched. Identity functions cannot, by their very nature, rank the documents according to relevance probabilities, while nearness functions, on the other hand, can.
It is necessary at this point to comment upon the fact that Boolean search techniques are so widely used, also in connection with reference retrieval, even though they are based on identity functions. The principal reason for this may be not so much based on the superiority of identity functions, as on the very fast response time of on-line systems, which allows the user to continually change and improve his query. The ability to make rapid alterations in the query gives a dynamic quality to the otherwise static identity functions. A nearness function allows the user to move up and down a list of ranked documents, and thus has in itself a
[Page 191 ]
dynamic quality. The usefulness of this quality rests of course on the premise that the nearness function reflects the user's own relevance preferences.
A nearness function measures the similarity between query and document and can be used to rank documents relative to the query. Some nearness functions also have the ability to measure the similarity between texts in general and thus to create "clusters" of documents which are similar. Different types of techniques may be used to express nearness functions.
The simplest technique is based on ranking the documents according to the number of terms each document has in common with the query. The technique may be modified by taking into account the frequency with which the matched words occur in the documents. The assumption is that the thoroughness with which the referenced concept is treated in a document, and hence the probability of potential relevance, is a function of the frequency with which the word occurs in the document. Of course, word frequency is also a function of document length. In order to be absolutely correct, each frequency should be modified in order to adjust for the fraction of the frequency which, on the average, is due to the length of the document.
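The length adjustment can take many forms. The sketch below assumes the simplest one, dividing the number of occurrences by the number of words in the document; this is an illustrative choice, not a method prescribed by the text.

```python
def frequency_score(query_words, text):
    """Occurrences of the query words per word of document text."""
    tokens = text.lower().split()
    hits = sum(tokens.count(w) for w in query_words)
    return hits / len(tokens) if tokens else 0.0

short_doc = "contract breach"
long_doc = "contract contract and many other words here about something else"
short_score = frequency_score({"contract"}, short_doc)
long_score = frequency_score({"contract"}, long_doc)
```

The long document contains the word twice and the short one only once, yet after adjustment the short document scores higher.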
A more sophisticated technique is the so-called weighted-term technique, which allows the user both to group the terms in the query and to assign weights to the groups. The technique can be used to rank the documents, not according to the matched terms, but according to the matched groups. If a document thus contains terms from several groups, it is given a rank corresponding to the combined weights of these groups. The same group is never counted more than once, however, even if several of its terms are matched in the document. We note that if all groups are assigned equal weights, documents are ranked according to the number of groups matched in a document. The user can also vary the relative importance of the groups by shifting their weights. Take the following example:
| Group | Terms | Weight |
|---|---|---|
| 1 | car, automobile, motor vehicle | 4 |
| 2 | accident, collision | 2 |
| 3 | insurance | 1 |
[Page 192 ]
Documents containing words from all three groups are ranked on top. But we note that a document containing words from only the first group is ranked before a document containing words from both the second and third groups.
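The ranking in this example can be reproduced with a short sketch. It assumes a document is represented as a set of terms, so that "motor vehicle" counts as one term; the function name is hypothetical.

```python
# The three groups of the example with their weights.
GROUPS = [
    ({"car", "automobile", "motor vehicle"}, 4),
    ({"accident", "collision"}, 2),
    ({"insurance"}, 1),
]

def weight(doc_terms):
    """Sum the weights of the groups matched; each group counts once."""
    return sum(w for terms, w in GROUPS if terms & doc_terms)

first_group_only = weight({"car", "automobile"})
second_and_third = weight({"accident", "insurance"})
all_three = weight({"car", "collision", "insurance"})
```

A document matching two terms of the first group still receives 4, not 8, and it outranks a document matching the second and third groups, which receives 3.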
It can be demonstrated that the weighted-term technique is a very flexible tool, which is capable of expressing the same logical relationships as Boolean algebra, see Sommar/Dennis (1969).
In the case where each term is considered as a group by itself, the technique is identical to our simple term-matching technique, with the exception that now the user can assign weights to the terms. This type of matching technique makes it possible to represent the documents as vectors, that is as ordered arrays of terms where each distinct word in the document collection is associated with a given term. Briefly speaking, vectors are constructed in the following way. If there are n distinct words in the document collection, there will be n terms in the vector space. The number assigned to the term in the vector space determines the place of the corresponding term in a vector. The value of a term in a vector is determined by the weight assigned to the word. If the word does not occur in the document, the value of the term is zero. If the word is present, any weight may in principle be assigned to it. The most common method is to assign values corresponding to the word frequencies in the documents.
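The construction can be sketched as follows, using word frequencies as values. Ordering the vocabulary alphabetically is an assumption made for the example; any fixed assignment of places would do.

```python
def vocabulary(docs):
    """The n distinct words of the collection, in a fixed order."""
    return sorted({w for text in docs for w in text.lower().split()})

def vector(text, vocab):
    """One value per word of the vocabulary: the frequency of the word
    in the document, or zero if the word does not occur."""
    tokens = text.lower().split()
    return [tokens.count(w) for w in vocab]

docs = ["contract breach contract", "fraud"]
vocab = vocabulary(docs)
vectors = [vector(d, vocab) for d in docs]
```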
| Index language | |
| - lack of appropriate specific terms | 10.2% |
| Searching | |
| - all reasonable approaches not covered | 21.5% |
| - query too exhaustive | 8.4% |
| - query too specific | 2.5% |
| - other | 2.6% |
| Indexing | |
| - insufficiently specific | 5.8% |
| - insufficiently exhaustive (topics) | 20.3% |
| - important concept omitted | 9.8% |
| - other | 1.5% |
| Computer processing | 1.4% |
| Inadequate user-system interaction | 25.0% |
[Page 206 ]
Table 11/2
Reasons for precision failures in the MEDLARS evaluation (from Lancaster (1969))
| Index language | |
| - lack of appropriate specific terms | 17.6% |
| - false co-ordinations | 11.3% |
| - incorrect term relationships | 6.8% |
| - defective hierarchical structure | 0.3% |
| Searching | |
| - not specific | 15.2% |
| - not exhaustive | 11.7% |
| - inappropriate terms | 4.3% |
| - inappropriate logic | 1.1% |
| Indexing | |
| - exhaustive | 11.5% |
| - other | 1.4% |
| Inadequate user-system interaction | 16.6% |
| Computer processing | 0.1% |
| Value judgements | 2.3% |
| "inevitable" retrieval | 0.1% |
At Case Western Reserve University a rather large project was undertaken in the mid-sixties in order to investigate the relationship between the variable components of retrieval systems and performance. The project is documented by Saracevic et al. (1968) and by Saracevic (1970).
The components of a retrieval system were described in terms of the purpose and the function of the system. The purpose of a retrieval system was subdivided into:
while the function of the system was subdivided into:
[Page 207]
The data base used in the experiment consisted of 600 documents selected from the 1960 volume of Tropical Diseases Bulletin (indexed in five languages). On the basis of 124 questions, 4 448 queries were submitted for searching. Answer sets to the questions were established by asking the users to evaluate the retrieved documents. The non-retrieved documents were evaluated by a separate expert, who tried to interpolate the relevance judgements of the users. It turned out that of the 124 questions only 63 had relevant answers.
"Sensitivity" and "specificity" were used to measure performance. Sensitivity (Se) was defined in an identical way to recall, while specificity (Sp) was defined as the ratio of the number of non-relevant documents not retrieved to the total number of non-relevant documents in the data base. Effectiveness (E) was defined as: E = Se + Sp - 1
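The three measures can be computed directly from the four cells of the retrieval table given earlier (a: relevant retrieved, b: relevant not retrieved, c: non-relevant not retrieved, d: non-relevant retrieved). The function name and the sample counts below are hypothetical.

```python
def effectiveness(a, b, c, d):
    """a: relevant retrieved, b: relevant not retrieved,
    c: non-relevant not retrieved, d: non-relevant retrieved."""
    se = a / (a + b)       # sensitivity, defined like recall
    sp = c / (c + d)       # specificity
    return se, sp, se + sp - 1

se, sp, e = effectiveness(a=3, b=1, c=9, d=3)
```

E equals 1 only when all relevant and no non-relevant documents are retrieved, and it falls to 0 when the whole data base is retrieved (Se = 1, Sp = 0).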
A main purpose of the experiment was to investigate the relative effectiveness of various indexing languages. The effectiveness of the index languages compared to full text was not investigated however. Of greater interest to us is therefore the analysis of different search strategies. The tests included the use of two types of queries:
The query could be expanded by use of a thesaurus or by use of any other available source. Use of the thesaurus did not prove as effective as manual elaboration of the query. One of the most important findings, however, was that it was practically impossible by any means to expand the narrow queries to the extent where all relevant documents were found. It was only when all but one category were dropped (broad search) that most relevant documents were found, but then at the expense of a considerable drop in precision. These results correspond well with the expected behavior of full-text retrieval systems, see Harvold (1976).
Considerably more unexpected was the observation that, when a narrow query was expanded, an almost linear relationship was found to
[Page 208 ]
exist between total output and the number of relevant and non-relevant answers.
The experiment included a test on relevance judgements based on different formats of output. The formats used were title, abstract, and full text. The results were:
| | judged relevant | judged partially relevant | judged irrelevant | total |
|---|---|---|---|---|
| titles | 167 | 157 | 762 | 1086 |
| abstracts | 175 | 169 | 742 | 1086 |
| full text | 207 | 156 | 723 | 1086 |
The results for full text must be considered the "correct" values. We note that judgements based on abstracts or even titles are good approximations of the relevance assessment made on full text.
It is also interesting to note that the relevance judgements based on titles or abstracts were superior to the performance of the retrieval system. Below we have calculated recall and precision in both situations.
| | recall (%) | precision (%) |
|---|---|---|
| System | | |
| - titles | 20 | 55 |
| - abstracts | 59 | 40 |
| - full text | 74 | 30 |
| Manual | | |
| - titles | 63 | 89 |
| - abstracts | 77 | 95 |
| - full text | 100 | 100 |
We shall now consider investigations oriented especially toward the problems of legal, full-text retrieval. We begin with the Joint American Bar Foundation and International Business Machine project, which has
[Page 209 ]
become known not so much for its aim, which was to investigate the degree of satisfaction (as judged by a panel) that could be achieved by the use of a computer-based retrieval system, as for its analysis of the difference in the panel assessments. The project made use of a vector-type retrieval system developed by S. F. Dennis of IBM. The results of the project are described by Eldridge (1968).
The data base consisted of 5 800 appellate court decisions. The question set consisted of 40 questions taken from the files of practising lawyers. The data base was searched both by the retrieval system and by hand at the American Bar Foundation and in the legal department of IBM. Both answer sets were submitted to a panel of four lawyers for evaluation.
It was found that the retrieval system and the manual search had performed about equally well in terms of recall, and that the manual search was about twice as effective in terms of precision. However, a far more interesting and perhaps surprising result of the investigation was the intensity of disagreement between the four panelists. The panelists were instructed to read the questions and evaluate each retrieved document according to a four-point scale of relevance. The documents were to be assessed according to the contribution they made to the resolution of the issue raised in the question. Thus in effect the panelists were asked to evaluate the documents according to content relevance, not subjective relevance. Even so the panelists disagreed more often than they agreed. Of a total answer set of 706 documents, the panelists only gave 3 per cent of the documents a unanimous relevant vote (either "on point", "relevant", or "related"), while 31.3 per cent of the documents received a unanimous irrelevant vote. A total of 65.7 per cent of the documents received a mixed vote. The disagreement, however, turned out to be rather systematic in the sense that each panelist seemed to prefer a certain grade - the academicians on the panel generally preferred the low relevancy grades while the practitioners favored the high ones.
As an explanation of this behavior, the report suggests that the disagreement might reflect the fact that the questions were prepared by a practitioner. As a consequence the issues might have been more familiar to the practitioners on the panel than to the academicians. Other explanations might be possible as well. The experiment does seem to emphasize the subjective nature of relevance. In addition the experimental framework was not "life-like", but distorted both by the fact that the different functions of the retrieval system were performed by different people, and by the fact that a relevance scale of four grades was used. As Eldridge himself points out, humans normally have difficulties in making comparative evaluations involving more than three or four documents. And in a practical retrieval situation the user probably has little need of making relevance distinctions beyond rejecting some documents as irrelevant and accepting others as of some use.
The Oxford experiment represented one of the first large-scale attempts at evaluating the performance possibilities of full-text retrieval, and it was certainly the first experiment making use of a data base consisting of legal documents. The experiment is described by Tapper (1969) and (1973: 159-182). Cfr. also above at section 4.3.4.
The aim of the experiment was to measure the efficiency of computerized legal information retrieval as compared with the conventional technique of index look-up.
Two relatively large data bases were prepared for the purpose of the experiment. The first was a general series of reports of decisions in the High Court, the All England Law Reports. The second was a series of administrative decisions in the field of insurance claims for industrial injuries, the Commissioner's Decisions. The two data bases consisted of about two million and one million words respectively. The data bases were chosen largely because they could both be accessed through manually constructed indexes. The High Court decisions (called cases for short) were indexed both for the series and for the individual volumes. The index terms were taken from an introductory telegraphic abstract to each report. The administrative decisions (called decisions for short) were indexed in a general loose-leaf file. The index was detailed and thoroughly cross-referenced, re-producing a high proportion of the original headnotes. The index was oriented toward factual descriptions, in contrast to the case index, which was oriented toward legal terms.
Still another factor concerning the experimental setup should be mentioned. The manual searchers were limited to the use of the indexes; they were not allowed to browse through or examine the documents themselves. Otherwise, it was felt that they would have had an unfair advantage compared to the searchers using the machine.
The results were evaluated in terms of recall and precision. The questions were based on the facts selected from representative and recent reports. The answer sets were defined by selecting:
Thus no attempt was made to find the complete answer sets consisting of all the relevant documents. However, based on the assumption that the computer and conventional techniques were independent, a value for the size of the complete answer set was estimated on the basis of the intersection of the two answer subsets.
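One common way of turning such an independence assumption into a number is the two-sample estimate N ≈ n1 · n2 / m, where n1 and n2 are the relevant documents found by each technique and m is the number found by both. The reasoning is that, if the techniques are independent, the share of one technique's finds that the other also found estimates the other's recall. Whether the Oxford experiment used exactly this form is an assumption; the sketch below only illustrates the reasoning, and the counts are hypothetical.

```python
def estimate_total_relevant(found_a, found_b, found_by_both):
    """Estimate the size of the complete answer set from two independent searches."""
    if found_by_both == 0:
        raise ValueError("no overlap: the estimate is undefined")
    return found_a * found_b / found_by_both


def estimated_recall(found, total_estimate):
    """Recall relative to the estimated size of the complete answer set."""
    return found / total_estimate


# Hypothetical counts, not taken from the experiment.
conventional, computer, both = 40, 60, 24
total = estimate_total_relevant(conventional, computer, both)
together = conventional + computer - both  # documents found by either technique
recalls = {
    "conventional": estimated_recall(conventional, total),
    "computer": estimated_recall(computer, total),
    "together": estimated_recall(together, total),
}
```

If the two techniques tend to find the same "easy" documents, the overlap is inflated and the total is underestimated, so recall values obtained this way are best read as upper estimates.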
The main results of the experiment are summarized below (from Tapper 1973: 179).
| Technique | Cases: no. of rel. doc. | Cases: recall (%) | Cases: precision (%) | Decisions: no. of rel. doc. | Decisions: recall (%) | Decisions: precision (%) |
|---|---|---|---|---|---|---|
| Conventional | 43 | 39 | 100 | 57 | 58 | 85 |
| Computer | 67 | 61 | 22 | 77 | 80 | 37 |
| Together | 84 | 76 | 27 | 91 | 96 | 39 |
We note that the computer technique performed significantly better than the conventional techniques with respect to recall. In fact the differences in the values are quite remarkable. This can be seen as yet another confirmation of the superiority of full-text representation compared to indexing, even when the indexing is quite elaborate and thorough.
As expected, the computer technique produced inferior precision values, but we note that the range of the values, from about 20 to 40 per cent, is not at all unmanageable in a modern on-line system.
The Responsa project is an ambitious attempt to make the huge responsa literature available for research through a full-text retrieval system. The responsa span 17 centuries and consist of answers by Jewish authorities to submitted questions. In this short summary only the results of the first experimental phase of the project, which was concluded in 1969, will be discussed. The project is a continuous effort, however, and is by no means completed. Documentation of the first phase is provided by Choueka/Cohen/Dueck/Fraenkel/Slae (1972).
The responsa are written mainly in Hebrew and Aramaic. The traditional problems which natural language represents from a retrieval point of view are greatly accentuated in these languages. Grammatical variants do not necessarily have the same initial letters; homographs are abundant owing to, among other things, the lack of vowels; Hebrew and Aramaic forms and grammatical rules are mixed; acronyms and abbreviations may make up more than 50 per cent of some texts; and so on.
The method chosen as a means of attacking these difficulties was the so-called synthetical approach. Essentially this is the same method as was adopted by IBM, Austria, in the construction of the FAIR system. The principle of the method is based on the following two-phase preprocessing of the question. In the first phase the user specifies a standard form of the words he wants to search on, together with information on the grammatical variants he is interested in. The standard form is based on the singular masculine form of nouns and on the root of verbs. On the basis of the standard form, the system generates grammatical forms of the specified words and checks which of these actually occur in the data base. In the second phase the words with associated frequencies are presented to the user, who marks off the words he wants to include in the query. If the user is in doubt about the relevance of a word, he can ask to see the word in a limited context consisting of a few words on each side of the word (compact KWIC). After this preprocessing is accomplished, the main retrieval process proceeds as usual.
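The two-phase preprocessing can be sketched as follows. This is a toy illustration only: the English suffix list stands in for real Hebrew and Aramaic morphological generation, and all function names, documents, and the hard-coded user choice are assumptions, not part of the Responsa system.

```python
import re
from collections import Counter

# Toy suffix rules standing in for real morphological generation (assumption).
SUFFIXES = ["", "s", "ed", "ing"]


def tokenize(text):
    """Split a text into lower-case words."""
    return re.findall(r"[a-z]+", text.lower())


def generate_variants(standard_form):
    """Phase 1a: generate candidate grammatical forms from a standard form."""
    return [standard_form + suffix for suffix in SUFFIXES]


def variants_in_data_base(standard_form, documents):
    """Phase 1b: keep only the variants that occur, with their frequencies."""
    frequencies = Counter(word for text in documents for word in tokenize(text))
    return {
        variant: frequencies[variant]
        for variant in generate_variants(standard_form)
        if frequencies[variant] > 0
    }


def compact_kwic(word, documents, width=2):
    """Show each occurrence of a word with a few words of context on each side."""
    lines = []
    for text in documents:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == word:
                lines.append(" ".join(tokens[max(0, i - width): i + width + 1]))
    return lines


documents = [
    "The tenant claimed that the landlord breached the lease.",
    "Two claims were rejected; a third claim was upheld.",
]
candidates = variants_in_data_base("claim", documents)
# Phase 2: the user marks the forms to keep; here the choice is hard-coded.
selected = [word for word in candidates if word != "claimed"]
context = compact_kwic("claims", documents)
```

The point of checking generated forms against the data base is that the user is only shown forms that can actually contribute to the search, together with frequencies that hint at how much output each form will produce.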
This broad outline will give an idea of the philosophy behind the system. The system was tested on an initial data base consisting of the 518 responsa (558 864 words) by Rivash. In all, 16 questions were run, and the results were gratifying - 100 per cent recall was achieved for all questions, and the average precision was 34 per cent. However, these performance figures are not directly comparable to the other performance figures we have so far considered. The responsa queries were prepared in an unusually thorough and time-consuming manner. Before the query was constructed, the searcher spent a full day researching the data base in order to acquaint himself with the relevant vocabulary. This of course is not normal procedure in other systems and makes it exceedingly difficult to evaluate the responsa results. However, the main feature of the system, the synthetical approach to the preprocessing of queries, is both impressive and promising.
In 1972 the Norwegian Research Center for Computers and Law initiated a research program in the field of legal information retrieval. The aim was both to investigate retrieval system performance, and to analyze the potential impact of computerized retrieval on the legal system itself. In the following only the aspect of retrieval performance will be discussed. Retrieval performance was investigated simultaneously along theoretical and empirical lines. Some of the theoretical results are documented in Harvold (1976); the empirical investigations are documented in a series of publications which include Bing/Harvold (1973), (1974), Fjelvig (1976), and Bing/Harvold/Kjønstad/Stabell (1976).
A model of full-text retrieval was developed along the following lines. On the basis of text statistics a logarithmic relationship was assumed to exist between the number of distinct words and the total number of words in a text. Using this relationship, a model giving the average quantity of output, as given by the number of retrieved documents, was derived as a function of the following factors:
The quantity model was developed into a performance model by introducing the concept of relevance. It was assumed that the grading of relevance is always done on an either/or (binary) basis, even in the cases where relevance clearly is subjective. This assumption seemed to be the best reflection of reality in normal user situations, where it is the same person who is confronted with the question, formulates the query, performs the search, and evaluates the result. It was felt that the alternative to binary grading, grading by degree, is both time-consuming and difficult and is not normally engaged in. Normally the user is primarily interested in the bisection of the data base, where one section consists of documents that can be rejected out-of-hand and the other section consists of documents that should be consulted. Given this concept of relevance, expressions for recall and precision were developed, in which recall and precision were seen as functions of the same factors that determined the quantity of output.
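Under binary grading, recall and precision reduce to simple set arithmetic over the retrieved and the relevant documents. The following is a minimal sketch of the definitions only, not of the performance model itself; the document identifiers are hypothetical, and returning 0.0 for an empty set is a convention assumed here.

```python
def recall_precision(retrieved, relevant):
    """Recall and precision for binary (either/or) relevance judgments."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical search result: four documents retrieved, three relevant in all.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
recall, precision = recall_precision(retrieved, relevant)
```

Here two of the three relevant documents are retrieved, and two of the four retrieved documents are relevant.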
The model was used to investigate the limits to performance under various conditions and to compare and evaluate different types of matching functions.
The empirical experiments were conducted partly in order to test the theoretical results, and partly to test questions not covered by the theory. The empirical experiments were made up of the following three traditional phases:
The limits to performance are both of a practical and a principal nature. The principal or absolute limits cannot be overcome by any amount of user ingenuity. They are essentially caused by the difference between formal relevance criteria as applied by the matching function and the content or subjective relevance criteria as applied by the user. Formal relevance is based exclusively on syntactic similarities between documents and query, while subjective or content relevance depends on the user's understanding of the texts. It should be noted that since the absolute limits depend on the difference between syntactics and semantics, the limits are only of importance in reference retrieval.
The absolute limits may affect both recall and precision. An absolute recall failure will be the result in the case where the subject of the question is not explicitly represented in a document, but is implied by the text as a whole. Absolute recall failures are most likely to occur in one-concept questions. In questions of two or more concepts a relevant document containing an implied concept may still be retrieved on the basis of the other concepts, but then only at the expense of a trade-off in precision, since the co-occurrence level no longer implies an exhaustive description of the question.
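The trade-off can be illustrated with a small sketch in which a query is a list of concepts, each concept is a class of interchangeable terms, and a document is retrieved when it matches at least a given number of concepts. The query, the documents, and the relevance judgments are hypothetical.

```python
def cooccurrence_level(document_words, concepts):
    """Number of query concepts represented by at least one word in the document."""
    return sum(1 for terms in concepts if terms & document_words)


def retrieve(documents, concepts, min_level):
    """Return the ids of documents matching at least min_level concepts."""
    return {
        doc_id
        for doc_id, words in documents.items()
        if cooccurrence_level(words, concepts) >= min_level
    }


# Hypothetical query with two concepts, each a class of interchangeable terms.
concepts = [{"dismissal", "discharge"}, {"pregnancy", "pregnant"}]

# Hypothetical documents reduced to word sets. d2 is relevant but only implies
# the second concept; d3 mentions one concept in an unrelated context.
documents = {
    "d1": {"dismissal", "pregnant", "employee"},
    "d2": {"discharge", "maternity", "leave"},
    "d3": {"dismissal", "theft", "employee"},
}
relevant = {"d1", "d2"}

strict = retrieve(documents, concepts, min_level=2)   # exhaustive co-occurrence
relaxed = retrieve(documents, concepts, min_level=1)  # relaxed requirement
```

With the exhaustive requirement only d1 is retrieved, so the implied-concept document d2 is lost. Relaxing the requirement recovers d2 but also admits the irrelevant d3, which is the precision cost described above.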
An absolute precision failure will be the result in the case where recall cannot be improved without a corresponding loss in precision. An absolute precision failure can be caused, as we have seen, by the need to lower the co-occurrence requirement because of an implied concept. In situations where it is specified in the query that all concepts should co-occur (exhaustive co-occurrence requirement), an absolute precision failure will occur either if one of the terms specified in the query has a homograph in a document containing all the concepts in the question except the one represented by the homographic term, or if the document contains all the concepts in the question, but in a wrong context. We note that absolute precision failures are most likely to occur in searches consisting of one or a few concepts. In fact when the recall-precision curves arrived at by averaging test results were compared with the corresponding curves predicted by our model, it was found that the absolute limits only cause a precision loss of about 10-20 per cent for queries based on three concepts. For queries based on one concept the precision loss increased to 40-50 per cent.
Above we implicitly assumed that it was possible to specify all terms and phrases used in the documents to represent the concepts of the question. Normally of course, this is quite impossible. Thus a high recall will only be within reach if a drop in precision, caused by relaxation of the co-occurrence requirement, is acceptable. Such a search strategy leads to the type of curves shown in Fig. 11/3. Of the two recall-precision curves depicted, one is the result of averaging the test results obtained in Bing/Harvold (1974), and the other represents the corresponding curve predicted by our model, given similar searching conditions and under assumptions of no absolute limits and a query expansion of 70 per cent.
Fig. 11/3
Performance curves based on a data base of 430 decisions by Swedish Administrative Courts, Harvold (1976: 73).
Both curves were calculated by averaging individual curves based on queries consisting of from one to three concepts. The individual curves were extrapolated vertically beyond their end points.
The slopes of the curves represent the practical retrieval limits. These limits result primarily from the fact that we did not specify all possible words for each concept. The distance between the curves represents the absolute retrieval limits. We note that this distance represents a precision loss of about 20-30 per cent. This means that, given the type and size of data base used, precision cannot on the average be improved beyond 70-80 per cent. Had all the questions consisted of at least three concepts, the result would have been better. Our questions, however, were a mixture of one-, two-, and three-concept problems. In Fig. 11/4 we thus show both the average and the maximum performance curve, given an absolute precision limit of 25 per cent.
Fig. 11/4
Average and maximum predicted performance curves based on a data base of 430 documents, cfr. Fig. 11/3.
The NORIS project also included evaluations of different search strategies. The strategies were tested empirically in addition to being compared theoretically. Of special interest were matching functions used to rank documents. The following ranking criteria were compared:
We note that the introduction of class frequency ranking improves performance, compared to word frequency. A comparison of the methods on a theoretical basis suggests similar improvements, see Harvold (1976: 94-100).
Fig. 11/5
Performance curves based on different ranking criteria, Harvold (1976: 96-98). Dashed curves: word ranking. Panels: data base of 430 decisions by the Swedish Administrative Courts; data base of 100 decisions by the Norwegian Social Security Court; data base of 374 decisions by the Norwegian Tax Authorities.
In order to map the causes of retrieval failure all the NORIS experiments included post-mortems on the retrieval results. The experiments in the different NORIS projects varied with respect to type and size of data base, and the results of the post-mortems varied accordingly. The general pattern remained more or less unchanged, however. We shall use the NORIS (8) II to illustrate the results.
A recall failure occurs when a relevant document is not retrieved at all. The reason for such a failure can generally be identified as belonging to one of the four groups listed in Table 11/6. The table gives the relative importance of the causes in the NORIS experiment.
Table 11/6
Causes of recall failure, Bing/Harvold (1974: 102)
| Cause | Share of failures |
|---|---|
| Specificity | 49% |
| Implicity | 22% |
| Point-of-view | 27% |
| System failure | 18% |
| Total | 116% |
The total adds up to more than 100 per cent because more than one cause may be associated with a given failure. A specificity failure means that the user did not find the correct terms to represent the concepts of the question. An implicity failure means that a concept was not explicitly expressed in a document. A point-of-view failure occurs when the user and the author approach the problem from different angles - with the result that different vocabularies are used in query and document. A system failure is caused by a fault in the retrieval system. The system failures in the experiment were mainly caused by faulty maintenance of the data base. The user mistakenly thought the document was included in the base.
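A short sketch shows why such a tally can exceed 100 per cent. Each failure is recorded with the set of causes assigned to it in the post-mortem, and the share of each cause is taken relative to the number of failures, not the number of cause assignments. The four failures below are hypothetical.

```python
from collections import Counter


def cause_percentages(failures):
    """Percentage of failures associated with each cause (causes may overlap)."""
    counts = Counter(cause for causes in failures for cause in set(causes))
    return {cause: 100.0 * n / len(failures) for cause, n in counts.items()}


# Hypothetical post-mortem of four recall failures; one failure has two causes.
failures = [
    {"specificity"},
    {"specificity", "point-of-view"},
    {"implicity"},
    {"system failure"},
]
shares = cause_percentages(failures)
total = sum(shares.values())
```

In this example five cause assignments are spread over four failures, so the shares sum to 125 per cent.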
The figures speak for themselves. It is interesting to note the relative importance of the point-of-view failures. A different point-of-view will often be the cause of trouble when a document is not found at all.
Before we discuss the precision failures, we have to say something about the partial performance failures. These occur when a retrieved relevant document is assigned a lower rank than it should have had, because of a failure to match some of the concepts in the question. Partial failures can be classified as either recall or precision failures according to choice.
Table 11/7 gives the causes of partial performance failure.
Table 11/7
The causes of partial retrieval failure, Bing/Harvold (1974: 104)
| Cause | Share of failures |
|---|---|
| Specificity | 57% |
| Implicity | 12% |
| Point-of-view | 14% |
| System failure | 16% |
| Total | 99% |
Table 11/8
Causes of precision failure, Bing/Harvold (1974: 117)
| Cause | Share of failures |
|---|---|
| Point-of-view | |
| Specificity | 29% |
| Exhaustivity | 29% |
| Total | 100% |
A precision failure caused by point-of-view is not so much a performance failure as a failure of the experimental method. In this particular experiment, the answer set was not defined by the same person who specified the query and performed the postmortem. The point-of-view failures represent disagreement as to the relevance of certain documents. If the same person had performed the two tasks, the relative importance of the point-of-view failures would not have been so great. Anybody may change his mind though, and when it comes to relevance judgments, fickleness is indeed widespread. In Bing/Harvold/Kjønstad/Stabell (1976) the user, upon a reevaluation of the answer set, dismissed 16 of the 162 documents originally deemed relevant, and included another 61 documents as relevant.
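The size of this change of mind can be worked out directly from the three figures quoted. The overlap measure in the last line, the share of documents judged relevant on both occasions among those judged relevant at least once, is an illustrative choice made here, not a measure used in the cited report.

```python
originally_relevant = 162  # documents first deemed relevant
dismissed = 16             # of these, rejected on reevaluation
newly_included = 61        # further documents accepted on reevaluation

kept = originally_relevant - dismissed                # judged relevant both times
revised_relevant = kept + newly_included              # relevant after reevaluation
ever_relevant = originally_relevant + newly_included  # relevant at least once
overlap = kept / ever_relevant                        # agreement between the two judgments
```

On these figures the answer set grows from 162 to 207 documents, and only about two thirds of the documents ever judged relevant were judged so on both occasions.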
A specificity failure occurs when the cause can be traced back to one of the words in the query. The user may simply have specified the wrong words, or more likely, a homographic word. There is very little one can do to reduce this particular type of failure, since in the final analysis the failure is due to the ambiguity present in natural language itself.
An exhaustivity failure occurs when a document does contain the subject of the question, but the subject is only mentioned as a reference or en passant or in a way not related to the main content of the document. This type of failure can also be said to be caused by the nature of natural language. At least it is reasonable to expect that the failure is less prominent in indexing systems. However, exhaustivity failures are not completely absent from these systems; Lancaster identifies exhaustive indexing as the cause of 11.5 per cent of the precision failures, see Table 11/2.