|
Description  |
|
|
BACKGROUND OF THE INVENTION
1. Field of the Invention
The invention concerns information retrieval generally. More specifically,
the invention concerns optimization of query plans for retrieving
information from a number of information sources.
2. Description of the Prior Art
Networks now connect computers with information sources located anywhere in
the world. The Internet, for example, provides access to a large and
diverse body of information, such as technical papers, public domain
software, directory services and various databases (e.g., airline
schedules, stock market listings). It is thus now possible to speak of
global information systems.
Being aware that interesting and useful information exists is insufficient
if one cannot find the relevant information sources. The large variety of
information sources, and the disparity of interfaces among them renders
the task of locating and accessing information over the network even more
difficult. In order to address some of these problems, it is important to
understand the characteristics of the available information sources.
Autonomy: The first characteristic is the autonomy of the information
sources. This means that the information sources (i.e., sites) maintain
and update their own data, and they are not willing to change their
operations to suit the needs of the global information system. At best, an
information source is willing to provide a description of its contents.
Dynamic nature: The second characteristic of information sources is their
dynamic nature. Specifically, new information sources are added, while
existing information sources disappear or arc no longer maintained.
Number of sources: The third characteristic is the very large number of
information sources.
Cost of access: The fourth characteristic is that accessing an information
source over the network is expensive (both in time and possibly in money).
The first characteristic distinguishes global information systems from
distributed databases, where the information sources are not autonomous,
but under the control of co-operating database administrators. The second
characteristic sets apart global information systems from enterprise-wide
databases, where the set of information sources are relatively stable
(though the contents may change, of course). The third characteristic
differentiates global information systems from current day multidatabases,
that is, systems in which the information is contained in a number of
different kinds of data base systems.
These characteristics of the information sources necessitate the following
features in an architecture for global information systems.
World-view: A consequence of the very large number of information sources
is that it is unreasonable to expect users to interact separately with
each source. The users need a conceptually uniform view of the information
space, against which they can formulate queries. However, there does not
have to be a single such view of the information, but there can be many
user and domain-specific world-views. In order to relate the contents of
the information sources with the world-view, we need site descriptions.
Expressive site descriptions: A consequence of the large number of
information sources and the high cost of accessing these sources is that
in answering queries, a global information system must minimize the number
of information sources (i.e., sites) that are accessed. Therefore, a key
requirement of the site descriptions is that they be rich enough to
express various constraints that enable the system to prune the sources
accessed.
Extensibility: A consequence of the dynamic nature of the information
sources is that it should be possible to easily extend the world-view to
manage new kinds of information provided by the sources.
Query only: A consequence of the autonomy of information sources is that
while a global information system might be able to support global
querying, it is unreasonable to expect that it will support global
updating.
The parent of the present patent application disclosed an information
retrieval system having some of the above features.
That information retrieval system, shown as system 101 in FIG. 1 of the
present patent application, has a knowledge base system 109 which includes
a domain model 111. Domain model 111 is a model of information from a
specific domain. Domain model 111 has three components: world view 115,
information source descriptions 113, and system-network view 117. All of
these components include concepts belonging to domain model 111. World
view 115 is the part of domain model 111 which is visible to a user of
system 101. World view 115 is a conceptually unified view of the
information space. In a preferred embodiment, world view 115 is
implemented as an expressive object/relational data model. The user can
pose queries in terms of the objects and the relations in the world-view,
unburdened by details of data location and access. World-views are purely
conceptual; all the data required to answer queries is present only in
site relations at external information sources. World view 115 is related
to the information in information sources 123 by means of information
source descriptions 113 which provide descriptions of the contents of
information sources 123 and by means of system-network view 117, which
describes how the information in a particular information source 123 is
accessed.
World view 115, information descriptions 113, and system-network view 117
all include concepts in domain model 111. Knowledge base system 109 is
able to classify new concepts and add them to the hierarchy of concepts
already in domain model 111. Thus, if knowledge base system 109 receives
information about a new information source 123, it automatically
classifies new concepts relating to the information source in domain model
111.
Access plan generation and execution 119 in FIG. 1 poses sub-queries to the
external sources that contain information relevant to answering the query
and combines the answers to these sub-queries to answer the user query.
Accessing information sources over the network is expensive, and so an
important problem in generating access plans is minimizing the number of
external information sources 123. It is an object of the present patent
application to provide improved techniques for such minimization.
SUMMARY OF THE INVENTION
The basis for the improved minimization techniques of the present invention
is an expansion of the data model to include many relations as well as
concepts and roles. The expanded data model in turn permits a site
description language which provides the information needed for the
minimization techniques. A site description language relates the contents
of a site (information source 123) with the world-view. Key aspects of the
site description language that are useful in answering queries efficiently
are the following: (1) Relating the semantic contents of relations in
sites to relations in world-view 105 (note that, in particular, relating
semantic content includes relating schema information), (2) Stating that a
site relation contains complete information about a fragment of the
world-view, and (3) Specifying the query forms that an information source
can answer efficiently.
The site descriptions of the invention, finally permit novel query
optimization techniques that minimize the number of site relations
accessed. The optimization techniques are the following: (1) using
constraints in the site descriptions and the query to prune the set of
site relations that arc irrelevant to the query, and (2) using information
about completeness of site relations to prune redundant site relations.
An important aspect of the optimization techniques is that optimization is
done dynamically. In traditional database query optimization, the query
plan is generated completely at compile-time, and is not modified at
run-time. As pointed out in the parent of the present patent application
and disclosed in detail herein, it is crucial to have dynamic query plans,
where the query plan generation phase interacts with the plan execution
phase. Also disclosed is an algorithm for producing a dynamic query plan.
Other objects and advantages of the apparatus and methods disclosed herein
will be apparent to those of ordinary skill in the art upon perusal of the
following Drawing and Detailed Description, wherein:
BRIEF DESCRIPTION OF THE DRAWING
FIG. 1 is a conceptual overview of the information retrieval system
described in the parent of the present application;
FIG. 2 is a detail of a site description in a preferred embodiment;
FIG. 3 shows the algorithm employed in the preferred embodiment to generate
query subplans;
FIG. 4 shows the algorithm employed in the preferred embodiment for
dynamically generating a query plan; and
FIG. 5 is a detailed block diagram of access plan generation and execution
component 119 of information retrieval system 101 in the preferred
embodiment.
Reference numbers in the Drawing have two parts: the two least-significant
digits are the number of an item in a figure; the remaining digits are the
number of the figure in which the item first appears. Thus, an item with
the reference number 201 first appears in FIG. 2.
DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT
The following Detailed Description contains material from the detailed
description of the parent of the present application through the section
titled Site Description Language; the new material which is being added in
the present application begins at the section titled Improvements in
System 101.
Architecture
Architecture Overview
FIG. 1 presents an overview of an information retrieval apparatus 101 which
incorporates the principles of the invention. A preferred embodiment of
information retrieval apparatus is implemented using a digital computer
system and information sources which arc accessible via the Internet
communications network.
The central component of apparatus 101 is a knowledge base 109 built upon a
description logic based knowledge representation system (CLASSIC in the
preferred embodiment) which is capable of performing inferences of
classification, subsumption, and completion. Knowledge-base systems are
described generally in Jeffery D. Ullman, Principles of Database and
Knowledge-base Systems, Vols. I-II, Computer Science Press, Rockville,
Md., 1989. Descriptions of CLASSIC may be found in Alex Borgida, Ronald
Brachman, Deborah McGuinness, and Lori Resnick, "CLASSIC: A Structural
Data Model for Objects", in Proceedings of the 1989 ACM SIGMOD
International Conference on Management of Data, pp. 59-67, 1989, R. J.
Brachman, et al., "Living with CLASSIC", in: J. Sowa, ed., Principles of
Semantic Networks: Explorations in the Representations of Knowledge,
Morgan-Kaufmann, 1991, pp. 401-456, and L. A. Resnick, et al., CLASSIC:
The CLASSIC User's Manual, AT&T Bell Laboratories Technical Report, 1991.
Knowledge base, 109 is used to construct a domain model 111 which organizes
information accessible via apparatus 101 into a set of concepts which fit
the manner in which the user of system 101 is intending to view and use
the information. In system 101, domain model 111 has three components:
world view 115, which contains concepts corresponding to the way in which
a user of the system looks at the information being retrieved,
system/network view 117, which contains concepts corresponding to the way
in which the information is described in the context of the data bases
which contain it and the communications protocols through which it is
accessed, and information source descriptions 113, which contains concepts
describing the information sources at a conceptual level. System/network
view 117 and information source descriptions 113 are normally not visible
to the user. The concepts in these portions of domain model 111 do,
however, participate fully in the reasoning processes that determine how
to satisfy a query.
An important benefit of using a description logic system like CLASSIC is
that as new information is added to the system, much of the work of
organizing the new information with respect to the concepts already in
knowledge base 109 is done automatically. Only a description of the known
attributes of the information must be specified; CLASSIC's inference
mechanisms then automatically classify these descriptions into appropriate
places in the concept hierarchy.
User interaction with the system is accomplished through browsing and
querying operations in terms of high-level concepts (concepts that are
meaningful to a user unsophisticated in the details for information
location and access). These concepts are intended to reflect the terms in
which the user thinks about the type and content of information being
queried. By working with these high-level concepts, the user is unburdened
with the details of the location and distribution of information across
multiple remote information servers.
Information sources 123 are generally (though not limited to) network-based
information servers that are accessed by standard internet communication
protocols. Sources can also include databases, ordinary files and
directories, and other knowledge bases.
User Interface
The user interacts with the system through a graphical user interface 103.
The two primary modes of interaction supported by this interface are
querying and browsing. In both cases the user expresses both browsing and
querying operations in terms of concepts from "world view" portion 115 of
domain model 111.
A knowledge base browser in CLASSIC 109 allows the user to view and
interactively explore the concept taxonomy. The concept taxonomy is
represented graphically as a directed graph 105, where the nodes
correspond to concepts and edges indicate parent/child relationships among
concepts.
To support extension of the concept taxonomy, the knowledge base browser
also serves as an editor, allowing the user to define new concepts in
terms of existing ones. The classification inferences in knowledge
representation system 109 automatically place new concepts at the correct
place in the taxonomy with respect to existing concepts.
Since both the high-level world concepts 115 and low-level system concepts
117 coexist in a single domain model 111, an important role of user
interface 103 is to filter the system concepts out of the view seen by the
user in query results and in the taxonomy browser.
Query Translator 107
The query language used in system 101 is based on CLASSIC, but has
additional constructors that enable the user to express queries more
easily. The query is formulated in terms of the concepts and objects that
appear in the world view part 115 of the knowledge base. Query translator
107 translates queries expressed in the query language into CLASSIC
description language expressions which are used to consult the knowledge
base. Due to the limited expressive power of the description language and
the need for special purpose query operators, the query language may
contain elements not expressible in the description language of knowledge
representation system 109. After partial translation to a description
language expression, the remaining fragments of the query are translated
to procedural code that is executed as part of the query evaluation.
Knowledge Representation System 109
The knowledge base is a virtual information store in the sense that the
information artifacts themselves remain external to the knowledge base;
the system instead stores detailed information (in terms of domain model
111) about the location of these information artifacts and how to retrieve
them. Retrieval of a particular piece of information is done on demand,
when it is needed to satisfy part of a query. The types of information
managed in this manner include files, directories, indexes, databases,
etc.
The domain model embodied in the knowledge base is logically decomposed
into world view 115, system/network view 117, and information source
descriptions 113 World view 115 is the set of concepts with which the user
interacts and queries are expressed. System/network view 117 concerns low
level details which, though essential for generating successful query
results, are normally of no interest to the user. Information source
descriptions 113 is a collection of concepts for describing information
sources. These information source descriptions are expressed in terms of
both world and system concepts. The purpose of encoding information source
descriptions 113 in the domain model is to make it possible for CLASSIC to
reason about what information sources must be consulted in order to
satisfy a query.
We define system concepts comprising system/network view 117 as those
concepts that describe the low-level details of information access. This
includes concepts related to network communication protocols, location
addressing, storage formats, index types, network topology and
connectivity, etc. Since the knowledge base generally merely retrieves
information instead of storing previously-retrieved information,
system/network view 117 includes all those concepts relevant to
determining attributes like location, retrieval methods, and content
format.
Continuing in more detail, concepts within world view 115 describe things
with which the user is familiar; they are the concepts that describe
characteristics of information artifacts of interest to users. Concepts
within information source descriptions 113 relate the concepts in world
view 115 to concepts concerning the semantic content of information
sources. Thus, given a query which employs concepts in world view 115,
knowledge representation system 109 can employ the concepts in information
source descriptions 113 to relate the concepts used in the query to actual
information sources and can employ system/network view 117 to relate the
concepts used in the query to an access plan which describes how to
retrieve information from the sources as required to answer the query.
Access Plan Generation and Execution
When a user wishes to obtain information, the user inputs a query in system
101's query language at graphical user interface 103. System 101 then
answers the query. There are several steps involved. First, query
translator 107 translates the query into a form to which knowledge
representation system 109 can respond. Then the translated query is
analyzed in knowledge base system 109 to decide which of the external
information sources are relevant to the query, and which subqueries need
to be sent to each information source. This step uses world view 115 and
system/network view 117. The information in system/network view 117 is
expressed in a site description language which will be described in more
detail later.
Knowledge base 109 uses the conceptual information from world view 115 and
system/network view 117 to produce an information access description
describing how to access the information required for the query in
information sources 123. Knowledge base 109 provides the information
access description to access plan generation and execution component 119,
which formulates an access plan including the actual commands needed to
retrieve the information from sources 123.
1. Plan formulation: Given the information access description, planner 119
decides on the order in which to access sources 123 and how the partial
answers will be combined in order to answer the user's query. The key
distinction between this step and traditional database techniques is that
planner 119 can change the plan after partial answers are obtained.
Replanning may of course involve inferences based on concepts from
information source descriptions 113 and/or system/network view 117 and the
results of the search thus far.
2. Plan materialization: The previous step produced a plan at the level of
logical source accesses. This step takes these logical accesses and
translates them to specific network commands. This phase has two aspects:
Format translation: the description of the sites is given at a logical
level. However, to actually access the site, one must conform to a syntax
of a specific query language. In this step, these translations are done.
Specific network commands are generated to access the sites. Here,
information from the system/network view is taken into account. Depending
on the site being accessed, the system will generate the appropriate
commands for performing the access.
The translations to service and site-specific access commands are performed
by Information Access Protocol Modules 121 (0 . . . n), described in the
following section.
Several points should be noted about the above process:
In executing the plan, system 101 uses a work space in the computer system
upon which system 101 is implemented to store its intermediate results.
After executing part of the plan, system 101 may decide to replan for the
rest of the query.
Information Access Protocol Modules 121
Access to information sources is done using a variety of standard
information access protocols. The purpose of these modules is to translate
generic information access operations (retrieval, listing collections,
searching indexes) into corresponding operations of the form expected by
the information source. For many standard Internet access protocols, the
translation is straightforward.
Examples of access protocols supported by these modules include several
network protocols defined by Internet RFC draft standard documents,
including FTP (File Transfer Protocol), Gopher, NNTP (Network News
Transfer Protocol), HTTP (Hypertext Transfer Protocol). In addition, other
modules support access to local (as opposed to network-based) information
repositories, such as local file-systems and databases.
Site Description Language
As previously pointed out, the concepts in information source descriptions
113 relate concepts in world view 115 to information sources 123. These
relationships are expressed using a site description language. CLASSIC and
related knowledge representation systems employ description languages
which can function as site description languages, but such site
description languages do not permit efficient reasoning. In a preferred
embodiment, efficiency has been substantially increased by the use of a
site description language which extends CLASSIC.
The following discussion of the site description language employed in the
preferred embodiment employs the example below:
Consider an application in which we can obtain information about airline
flights from various travel agents. We have access to fares given by
specific travel agents and to telephone directory information to obtain
their phone numbers. In practice, the information about price qu | | |