TECHNICAL FIELD
The present invention relates to retrieving data from heterogeneous data
sources including structured sources and semi-structured sources and, more
particularly, extracting data from World Wide Web pages in response to a
query phrased in a structured query language.
BACKGROUND INFORMATION
The World Wide Web (WWW) is a collection of Hypertext Mark-Up Language
(HTML) documents resident on computers that are distributed over the
Internet. The WWW has become a vast repository for knowledge. Web pages
exist which provide information spanning the realm of human knowledge from
information on foreign countries to information about the community in
which one lives. The number of Web pages providing information over the
Internet has increased exponentially since the World Wide Web's inception
in 1990. Multiple Web pages are sometimes linked together to form a Web
site, which is a collection of Web pages devoted to a particular topic or
theme.
Accordingly, the collection of existing and future World Wide Web pages
represents one of the largest databases in the world. However, access to
the data residing on individual Web pages is hindered by the fact that
World Wide Web pages are not a structured source of data. That is, there
is no defined "structure" for organizing information provided by the Web
page, as there is in traditional, relational databases. For example,
different Web pages may provide the same geographic information about a
particular country, but the information may appear in various locations of
each page and may be organized differently from page to page. One
particular example of this is that one Web site may provide relevant
information on one Web page, i.e. in one HTML document, while another Web
site may provide the same information distributed over multiple,
interrelated Web pages.
A further difficulty associated with retrieving data from the World Wide Web
is that the Web is "document centric" rather than "data centric". This
means that a user is assumed to be looking for a document, rather than an
answer. For example, a user seeking the temperature of the Greek Isles
during the month of March would be directed to documents dealing with the
Greek Isles. Many of those documents might simply contain the words
"March," "Greek," and "temperature" but otherwise be utterly devoid of
temperature information, for example, "the temperature during the day is
pleasant in March, especially if one is visiting the Greek Isles." These
documents are useless to the requesting user; however, current techniques
of accessing the Web cannot distinguish useless "near-hits" from useful
documents. Further, the user is seeking an "answer" (e.g. 65° F.)
to a particular question, and not a list of documents that may or may not
contain the answer the user is seeking.
Another difficulty associated with extracting data from Web pages is that
each Web page potentially provides data in a different format from other
Web pages dealing with the same topic or in a different context from the
request itself. For example, one Web page may provide a particular value
in degrees Centigrade, while another World Wide Web page, or the user
seeking the information, may expect that same information to be in degrees
Fahrenheit. A requesting system or user would be misled or confused by an
answer returned in degrees Centigrade because the requester and the data
source do not share the same assumptions about the provision of data
values.
These problems are not limited to retrieving data from HTML documents
distributed over the Internet. Larger organizations have begun building
"intranets", which are collections of linked HTML documents internal to
the organization. While "intranets" are intended to provide a member of an
organization with easy access to information about the organization, the
problems discussed above with respect to the WWW apply to "intranets".
Requiring members of the organization to learn the data context of each
Web page, or requiring them to learn a specialized query language for
accessing Web pages, would defeat the purpose of the "intranet" and would
be virtually impossible on the Internet.
SUMMARY OF THE INVENTION
The present invention allows semi-structured data sources to be queried
using a structured query language. This allows semi-structured data
sources, such as World Wide Web pages (HTML documents), flat files
containing data (data files containing collections of data that are not
arranged as a relational database), or menu-driven database systems
(sometimes referred to as "legacy" systems) to augment traditional,
structured databases without requiring the requester to learn a new,
separate query language. Structured queries directed to semi-structured
sources are identified, converted into commands the semi-structured data
sources understand, and the commands are issued to the data source. Data
is extracted from the semi-structured data source and returned to the
requester. Thus, semi-structured data sources can be accessed using a
structured query language in a way that is transparent to the requester.
A system according to the invention queries both structured and
semi-structured data sources. The system includes a request translator, a
query converter, a command transmitter, a data retriever, and a data
translator. The request translator receives a data request which has an
associated data context and translates that data request into a query
which has an associated data context which is appropriate for the data
source to be queried. The query converter converts at least a portion of
the query into a command or series of commands that can be used to
interact with a semi-structured data source such as a Web page or a flat
file containing data. The command transmitter issues those commands to the
semi-structured data sources, and a data retriever extracts data from the
data sources. Extracted data is translated by the data translator from the
data context of the data source into the data context associated with the
initial request.
A method according to the invention queries both structured and
semi-structured data sources. The method includes translating a data
request into a query, converting at least a portion of the query into a
stream of commands, issuing the commands to the semi-structured data
sources, extracting data from the data sources, and translating the
retrieved data. The data request, which has an associated data context, is
translated into the query which has a data context that matches the data
source to be queried. At least a portion of that query is converted into
one or more commands which can be used to interact with a semi-structured
data source. Those commands are issued and data is extracted from the data
source. Extracted data is then translated from the data context associated
with the data source into the data context associated with the initial
request.
In other aspects of the invention, a method and system for querying
semi-structured data sources in response to a structured data request
comprise the steps of, and means for, converting the data request into one
or more commands, issuing the commands to a semi-structured data source,
and extracting data from the semi-structured data source. The
semi-structured data source can be a World Wide Web page, a flat file
containing data, or a menu-driven database system. In some embodiments,
the conversion of the data request into one or more commands also includes
determining if the requested data is provided by a Web page and then
determining, for each requested datum provided by the Web page, one or
more commands to issue to the Web page in order to retrieve the data.
These determinations are made by accessing a file which is stored in a
memory element of a computer and which includes information on the data
elements provided by the data source as well as the commands necessary to
access the data.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention is pointed out with particularity in the appended claims. The
above and further advantages of this invention may be better understood by
reference to the following description taken in conjunction with the
accompanying drawings, in which:
FIG. 1A is a diagram of an embodiment of a system according to the
invention which includes data receivers and data sources interconnected by
a network;
FIG. 1B is a simplified functional block diagram of a node as shown in FIG.
1A;
FIG. 2 is a flowchart of the steps taken by an embodiment of the request
translator;
FIG. 3 is a diagram of a data translator according to the invention;
FIG. 4 is a diagram of an embodiment of an ontology as used by the system
of FIG. 1A;
FIG. 5 is a diagram of an embodiment of an ontology showing examples of
data contexts;
FIG. 6 is a block diagram of an embodiment of a system according to the
invention which queries both structured and semi-structured data sources;
FIG. 7 is a set of screen displays showing a data source, its description
file, its export schema, and its specification file; and
FIG. 8 is a block diagram of a state diagram modeling one embodiment of a
specification file.
DESCRIPTION
Referring to FIGS. 1A and 1B, data receivers 102 and data sources 104 are
interconnected via a network 106. Although data receivers 102 are shown
separate from data sources 104, any node connected to the network 106 may
include the functionality of both a data receiver 102 and a data source
104.
Each of the nodes 102, 104 may be, for example, a personal computer, a
workstation, a minicomputer, a mainframe, a supercomputer, or a Web
Server. Each of the nodes 102, 104 typically has at least a central
processing unit 220, a main memory unit 222 for storing programs or data,
and a fixed or hard disk drive unit 226 which are all coupled by a data
bus 232. In some embodiments, nodes 102, 104 include one or more output
devices 224, such as a display or a printer, one or more input devices
230, such as a keyboard, mouse or trackball, and a floppy disk drive 228.
In a preferred embodiment, software programs running on one or more of the
system nodes define the functionality of the system according to the
invention and enable the system to perform as described. The software can
reside on or in a hard disk 226 or the memory 222 of one or more of the
system nodes.
The data sources 104 can be structured databases, semi-structured Web
pages, or other types of structured or semi-structured sources of data
such as files containing delimited data, tagged data or menu-driven
database systems. The network 106 to which the nodes 102, 104 are
connected may be, for example, a local area network within a building, a
wide-area network distributed throughout a geographic region, a corporate
Intranet, or the Internet. In general, any protocol may be used by the
nodes 102, 104 to communicate over the network 106, such as Ethernet or
HTTP (Hypertext Transfer Protocol).
A set of assumptions regarding data is associated with each node 102, 104.
That is, each node 102, 104 has an associated data context 108-118. For
example, a particular data receiver 102 may always expect that when data
is received, time values are in military time, monetary values are in
thousands of U.S. dollars, and date values are returned in month-day-year
format. This set of assumptions is the data context 108 of that particular
data receiver 102. Another data receiver 102 may make a different set of
assumptions about received data which are represented as its own data
context 112. When a data receiver 102 makes a request for data, its data
context 108, 112, 114, 116 is associated with the request. Similarly, each
data source 104 provides the data context 110, 118 associated with its
data.
The data context 108-118 of a node 102, 104 may be a file containing a list
of data formats and associated meanings expected by that node 102, 104.
For example, if a particular node 102, 104 expects to receive or provide
data which it calls "net income" in units of dollars with a scale of
thousands, that set of expectations may be specified in a file which
represents at least a portion of the data context 108-118 associated with
that node 102, 104. The data context 108, 112, 114, 116 of a data receiver
102 may be provided with each new request made by the data receiver 102.
In one embodiment discussed in greater detail below, the data context
108-118 for each node 102, 104 may be stored in a central location through
which all requests are routed for context mediation, or the data contexts
108-118 may be stored in a de-centralized manner. For example, the data
contexts 108-118 may be stored as a directory of URL (Uniform Resource
Locator) addresses which identify the location of each data context
108-118.
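As a minimal sketch of how such a data context file might be represented
in memory, the following Python fragment models each expectation as a
record and flags a conflict when two nodes disagree on unit or scale. The
record layout and the names ContextEntry and conflicts are assumptions
made for illustration only; the description does not prescribe a format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextEntry:
    """One assumption a node makes about a named datum (hypothetical layout)."""
    name: str   # the name the node uses, e.g. "net income"
    unit: str   # currency or other unit, e.g. "USD"
    scale: int  # multiplier applied to reported numbers, e.g. 1000


# Hypothetical data context of a receiver, keyed by its own datum names.
receiver_context = {
    "net income": ContextEntry(name="net income", unit="USD", scale=1000),
}

# Hypothetical entry from a source that reports in hundreds of Japanese Yen.
source_entry = ContextEntry(name="net income", unit="JPY", scale=100)


def conflicts(a: ContextEntry, b: ContextEntry) -> bool:
    """Two entries conflict when their unit or their scale differ."""
    return (a.unit, a.scale) != (b.unit, b.scale)


needs_translation = conflicts(receiver_context["net income"], source_entry)
```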
A request for data made by a data receiver 102 is associated with the data
context 108, 112, 114, 116 of the data receiver 102. Referring to FIG. 2,
one embodiment of a request translator 300 determines if the data context
108, 112, 114, 116 of the data receiver 102 is different from the data
context 110, 118 of the data sources 104 that will be queried to satisfy
the request. The request translator 300 may be resident on the data
receiver 102 making the request, or it may reside on another node 102, 104
attached to the network. In some embodiments, the request translator 300
resides on a special purpose machine 120 which is connected to the network
106 for the sole purpose of comparing data contexts 108-118 and resolving
conflicts between the data contexts 108-118 of data receivers 102 and data
sources 104.
The request translator 300 may be implemented as hardware or software and,
for embodiments in which the request translator is implemented in
software, it may be the software program that generates the request.
Alternatively, the request translator 300 may be a separate functional
unit from the hardware or software used to generate the request, in which
case the request translator receives the request as constructed by that
hardware or software. For example, the request translator may be part of
an SQL-query language application, or the request translator may receive
requests made by an SQL-query language application, for example, a
spreadsheet having embedded queries resulting in ODBC-compliant (Open
Database Connectivity-compliant) commands.
The request translator may be provided with the identity of the data
sources 104 to be queried (step 302). That is, the request may specify one
or more data sources 104 to which the query should be directed. In these
embodiments, the request translator compares the data context 110, 118 of
the data source 104 to the data context 108, 112, 114, 116 of the data
receiver 102. If any conflicts are detected, e.g. the data source 104
expects to provide monetary values in hundreds of Japanese Yen and the
data receiver 102 expects to receive monetary values in thousands of U.S.
Dollars, the request translator translates the request to reflect the data
context 110, 118 of the data source 104. In other embodiments, the request
translator is not provided with the identity of the data sources 104 to be
queried, and the request translator may determine which data sources 104
to query (step 304). These embodiments are discussed in more detail below.
When the request translator is translating the data request made by the
data receiver 102, it must detect conflicts between the names by which
data are requested and provided (step 306), and it must detect the context
of that data (step 308). For example, a data receiver 102 may make a
request for a data value that it calls "net worth". A data source 104 may
be identified as having the data to satisfy the request made by the data
receiver 102; however, the data source 104 may call that same number
"total assets". The request translator must recognize that, although
different names are used, the data source 104 and the data receiver 102
are referring to the same data entity.
Name and context conflicts may be determined through the use of an
ontology, or set of ontologies, in connection with the data context
108-118 mappings. An ontology is an overall set of concepts for which each
data source 104 and data receiver 102 registers its values. Ontologies may
be distributed over the network 106 on multiple nodes 102, 104.
Alternatively, all ontologies may reside on a single node 102, 104
connected to the network 106 for the purpose of providing a library of
ontologies.
Referring to FIG. 4, an example of a financial ontology 200 is shown. Nodes
102, 104 register context values for sales, profit, and stock value. A
data source 104 may be registered by its system administrator or by a
context registration service, or a user may register a particular data
source 104 from which it desires to receive data. As shown in FIG. 4, a
first user 202 and a second user 204 have registered with the financial
ontology 200. The first user 202 registers that it uses the name "profit"
for profit and "stock cost" for stock value. The second user 204 has
registered that its name for the concept of profit in the financial
ontology 200 is "earnings" and it calls stock value by the name "stock
price". Each user 202, 204 is a possible data receiver 102, and is
attached to the network 106. In much the same way, data sources 104
register, or are registered. For example, a first data source 212 has
registered that it can provide data which it calls "sales" and "profit",
which map to ontology 200 values of sales and profit. A second data source
214, in contrast, provides a "turnover" datum which maps to sales in the
financial ontology 200 and a "net income" datum which matches to profit in
the financial ontology 200.
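The registrations described for FIG. 4 can be sketched as a simple
mapping from each node's local names to ontology concepts. The Python
below is an illustration under assumed names (REGISTRY, translate_name,
and the node keys are hypothetical); the local names and concepts are the
ones stated in the preceding paragraph.

```python
# Local name -> ontology concept, per registered node (values from FIG. 4).
REGISTRY = {
    "user_202": {"profit": "profit", "stock cost": "stock value"},
    "user_204": {"earnings": "profit", "stock price": "stock value"},
    "source_212": {"sales": "sales", "profit": "profit"},
    "source_214": {"turnover": "sales", "net income": "profit"},
}


def translate_name(name, requester, source):
    """Map a requester's local name to the source's local name.

    The lookup goes through the shared ontology concept. None is returned
    when either side has not registered the concept.
    """
    concept = REGISTRY[requester].get(name)
    if concept is None:
        return None
    for local_name, registered_concept in REGISTRY[source].items():
        if registered_concept == concept:
            return local_name
    return None


resolved = translate_name("earnings", "user_204", "source_214")
missing = translate_name("stock cost", "user_202", "source_214")
```

Here resolved is the source's own name for the requested concept, and
missing is None because the second data source has not registered a
stock value datum.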
The data contexts 108-118 registered within each ontology may exist as a
file which has entries 502, 504 for each node 102, 104 corresponding to
shared concepts in the ontology 200, shown in FIG. 5. The data contexts
108-118 may be provided as data records in a file or as a list of pointers
which point to the location of each node's data context 108-118.
Attributes may directly link to concepts in an ontology. For example, an
entry may specify that if sales figures are desired from the first data
source 212, request data using the name "sales." Alternatively, entries in
a data context might rely on other attributes for their value. For
example, an attribute may derive its currency context based on the value
of the user's location. For example, source context 502 reports data for
the NET-SALES attribute in the currency corresponding to the value of the
LOC-OF-INCORP attribute. More specifically, NET-SALES for French companies
have a currency context of Francs, while German companies expect or
provide NET-SALES in units of Marks.
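A context entry whose value depends on another attribute, such as the
NET-SALES currency described above, might be sketched as a lookup keyed
on the LOC-OF-INCORP value. The table name, function name, and record
layout below are assumptions for illustration; only the France/Francs and
Germany/Marks pairings come from the description.

```python
# Currency implied by the place of incorporation (pairings from the text).
CURRENCY_BY_LOCATION = {
    "France": "Francs",
    "Germany": "Marks",
}


def net_sales_currency(record):
    """Derive the currency context of NET-SALES from LOC-OF-INCORP."""
    return CURRENCY_BY_LOCATION[record["LOC-OF-INCORP"]]


# Hypothetical record; the company and figure are placeholders.
example = {"COMPANY": "Example SA", "LOC-OF-INCORP": "France", "NET-SALES": 120}
currency = net_sales_currency(example)
```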
A step that the request translator takes before actually querying the data
source 104 is to detect any conflicts in the names used by the data
receiver 102 and the data source 104 (step 306). For example, when the
request translator 300 initially receives a request from the first user
202 for companies having "profit" and "stock cost" in excess of some
value, it must detect any conflicts in the names used by the first user
202 and the data source 104. Assuming that the first user 202 specifies
the second data source 214 as the source 104 from which data should be
retrieved, the request translator must recognize that when the first user
202 requests "profit", user 202 is seeking profit which is represented by
"net income" in the second data source 214. Similarly, when the first user
202 requests data regarding "stock cost", that data maps to stock value in
the financial ontology 200 for which the second data source 214 has not
registered. Thus, the request translator 300 would return a message that
the second data source 214 cannot satisfy the entire data request made by
the first user 202.
Another step that the request translator 300 takes before actually querying
the data source 104 is to detect conflicts in the data context 108
associated with the data receiver 102 and the data source 104 (step 308).
For example, the second user 204 in FIG. 4 expects to receive "earnings"
and "stock price" values in units of tens of pounds, while the first user
202 expects to receive "profit" and "stock cost" data in units of ones of
dollars. Thus, when the second user 204 requests "earnings" and specifies
that the second data source 214 should be used, the request translator
detects the conflict between what the second user 204 calls "earnings" and
what the second data source 214 calls "net income", because both of those
data names map to "profit" in the financial ontology 200. The request
translator 300 also detects the context conflict between the second user
204 and the second data source 214. The second user 204 expects to receive
data in tens of pounds, while the second data source 214 expects to give
data in terms of ones of dollars. The request translator 300 translates
the request made by the second user 204 for "earnings" in units of tens of
pounds to a query directed to the second data source 214 for "net income"
in units of ones of dollars.
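The translation just described can be sketched as a single function that
renames the requested datum and rescales the comparison value. This is an
illustration only: translate_request, the dictionary layout, and the
exchange rate passed in are hypothetical, and a real system would obtain
the rate from a currency source.

```python
def translate_request(field, threshold_tens_gbp, usd_per_gbp):
    """Rewrite a request from the second user 204 for the second data source 214.

    The receiver states its threshold in tens of pounds; the source expects
    ones of dollars. The receiver's "earnings" and the source's "net income"
    both map to the ontology concept "profit".
    """
    name_map = {"earnings": "net income"}
    # tens of pounds -> ones of pounds -> ones of dollars
    threshold_usd = threshold_tens_gbp * 10 * usd_per_gbp
    return {
        "field": name_map[field],
        "op": ">",
        "value": threshold_usd,
        "unit": "USD",
        "scale": 1,
    }


# 1.5 is a placeholder exchange rate, not real market data.
query = translate_request("earnings", 100, usd_per_gbp=1.5)
```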
When the request is translated into a query, the meaning ascribed to data
by separate nodes 102, 104 can be taken into account. For example, data
source 212 may provide "profit" data which excludes extraordinary
expenses. However, the first user 202 may desire "profit" data including
extraordinary expenses. The ontology 200 may provide a default translation
for this difference in meaning, or the first user 202 may provide a
translation which overrides the default translation.
Other translations may be inferred from entries in the ontology 200. For
example, currency values for a given ontology 200 may be inferred from a
location entry. Thus, data receivers 102 located in England may be assumed
to desire financial data in pounds. The ontology 200 may provide for
translations between these units. These assumptions may be overridden by a
particular data receiver 102 or data source 104, as described below.
Another example of inferring translations from entries in the ontology 200
is as follows. A requester may expect "earnings" to be calculated as
"revenue" minus "expenses". A data source, however, may provide "earnings"
as "revenue" minus "expenses" minus "extraordinary expenses". The ontology
can provide the translation from the source to the receiver, which may
include adding the "extraordinary expenses" into the "earnings" number
provided by the data source.
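The arithmetic of this translation can be checked with a short sketch.
The function name and the sample figures are hypothetical; the
definitions of "earnings" on each side are the ones given above.

```python
def to_receiver_earnings(source_earnings, extraordinary_expenses):
    """Translate source earnings into the receiver's definition.

    Source:   revenue - expenses - extraordinary expenses
    Receiver: revenue - expenses
    Adding the extraordinary expenses back yields the receiver's figure.
    """
    return source_earnings + extraordinary_expenses


# Placeholder figures for illustration.
revenue, expenses, extraordinary = 500, 300, 50
source_earnings = revenue - expenses - extraordinary
receiver_earnings = to_receiver_earnings(source_earnings, extraordinary)
```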
Referring to FIG. 3, a data translator 400 receives data from the data
sources 104 that are queried. Since a conflict between the data context
108, 112, 114, 116 of the data receiver 102 and the data context 110, 118
of the data source 104 has already been detected, the data received from
the data source 104 is translated to match the data context 108 that the
data receiver 102 expects. Once translated, the received data is in a form
the data receiver 102 expects, and the request made by the data receiver
102 is satisfied. The data translator 400 may be provided as a separate
unit from the request translator 300, or they may be provided as a unitary
whole. Alternatively, the request translator 300 and the data translator
400 may be programs running on one or multiple computers.
The translations effectuated by the request translator 300 and the data
translator 400 may be accomplished by using pre-defined functions, look-up
tables, or database queries among other well-known techniques. For
example, when the "net income" datum must be translated by the request
translator 300, it may request the exchange rate from dollars to pounds
from an appropriate currency database and then use that exchange rate to
translate the received datum. In some embodiments, the ontology 200
provides a set of default translations for the request translator 300 and
data translator 400 to use. These default translations may, however, be
overridden by a data receiver 102 or data source 104 that prefers a
different translation to be used. For example, an ontology 200 may provide
a default translation between tens of pounds and ones of dollars that uses
a pre-defined function to multiply data in pounds by 6.67. Alternatively,
the conversion could be done as a number of steps. For example, the
ontology may provide a conversion from dollars to pounds and a conversion
from tens to ones which are applied in succession to the data. A
particular data receiver 102 may not desire such a rough estimate,
however, and may therefore provide its own translation in its data context
108-118 which overrides the default translation provided by the ontology
200.
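A multi-step default translation with a receiver-supplied override might
be sketched as follows. The helper names (compose, default_translation,
pick_translation) and the exchange rate are assumptions made for
illustration; the description leaves the mechanism open to functions,
look-up tables, or database queries.

```python
def compose(*steps):
    """Build one translation that applies the given steps in order."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


def default_translation(gbp_per_usd):
    """Ones of dollars -> tens of pounds, as two successive conversions."""
    def dollars_to_pounds(value):
        return value * gbp_per_usd

    def ones_to_tens(value):
        return value / 10

    return compose(dollars_to_pounds, ones_to_tens)


def pick_translation(default, override=None):
    """A translation supplied in a node's data context overrides the default."""
    return override if override is not None else default


# 0.5 is a placeholder rate chosen only to keep the arithmetic exact.
translate = pick_translation(default_translation(gbp_per_usd=0.5))
converted = translate(200)
```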
Multiple conversions may be used if the query accesses multiple data
sources. For example, a data receiver 102 may make a request having two
parts. One part may be satisfied by a first data source 104 and that data
is required to be converted by a look-up function. The second part of the
request may be satisfied by a second data source which requires data to be
converted by a database query.
The request translator 300 may query the data source 104 for the data
receiver 102. In these embodiments, the request translator may optimize
the query (step 310) using any well-known query optimization methods, such
as Selinger query optimization. Alternatively, the request translator 300
may separate a query into several separate sub-queries and direct those
sub-queries to one data source 104 or multiple data sources 104. In
another embodiment, the request translator 300 simply passes the query to
a query transmitter which may also optimize the query or separate the
query into several sub-queries, as described above.
In some embodiments, the data receiver 102 does not specify which data
source 104 to use in order to retrieve the data. For example, in FIG. 4
the second user 204 may simply request a list of all the companies having
"earnings" in excess of some number of pounds, and a "stock price" below a
certain number of pounds. The request translator 300 determines if such a
request may be satisfied (step 304). Since the second user 204 has
registered that "earnings" are equivalent to profit in the financial
ontology 200 and that "stock price" is equivalent to "stock value" in the
financial ontology 200, the request translator 300 may then determine if
any data sources 104 have also registered with the financial ontology 200
as providing those values.
The request translator 300 is able to determine that the first data source
212 and the second data source 214 have both registered with the financial
ontology as providing a "profit" datum while the third data source 216 has
registered with the financial ontology as providing a "stock value" datum.
The request translator may separate the request into two sub-queries, one
for "stock price", which is directed to the third data source 216, and one
for "earnings".
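Determining which registered data sources can serve each part of a
request can be sketched as a lookup from ontology concept to providers.
PROVIDERS and plan are hypothetical names; the registrations mirror those
stated for the first, second, and third data sources.

```python
# Ontology concepts each data source has registered as providing.
PROVIDERS = {
    "source_212": {"sales", "profit"},
    "source_214": {"sales", "profit"},
    "source_216": {"stock value"},
}


def plan(concepts):
    """Return, for each requested concept, the sources able to provide it.

    An empty list means no registered source can satisfy that concept.
    """
    return {
        concept: sorted(
            source for source, provided in PROVIDERS.items() if concept in provided
        )
        for concept in concepts
    }


# The request for "earnings" and "stock price" maps to these concepts.
sub_queries = plan(["profit", "stock value"])
```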
At this point, the request translator 300 may further optimize the query by
selecting to which data source the request for "earnings" should be
directed (step 310). The first data source 212 has registered with the
financial ontology as providing a "profit" datum, called "profit" by the
first data source 212, in tens of pounds, while the second data source 214
has registered with the financial ontology as providing a "profit" datum,
called "net income" by the second data source 214, in ones of dollars.
Since the second user 204 requested earnings in tens of pounds, if the
query is directed to the first data source 212, no context conversion is
necessary. Therefore, the request translator 300 may choose to request the
"profit" datum from the first data source 212 in order to further
optimize the query.
However, the request translator 300 may determine that the first data
source 212 is unavailable for some reason. In such a case the request
translator may direct the request for "profit" data to the second data
source 214 by translating the request made by the second user 204 from
tens of pounds into a query directed to the second data source 214 in ones
of dollars. As described above, this translation may be done by a
predefined function, a look-up table, or a database query.
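One way to sketch this selection is to prefer an available source whose
registered context already matches the receiver's, falling back to any
other available source. The names below are hypothetical; the contexts
are those registered above (first data source 212 in tens of pounds,
second data source 214 in ones of dollars).

```python
# Registered context of each source as (currency, scale).
SOURCE_CONTEXT = {
    "source_212": ("GBP", 10),
    "source_214": ("USD", 1),
}


def choose_source(candidates, wanted_context, available):
    """Pick an available source, preferring one that needs no conversion.

    Returns None when no candidate is available.
    """
    live = [source for source in candidates if source in available]
    if not live:
        return None
    # False sorts before True, so a matching context wins.
    return min(live, key=lambda source: SOURCE_CONTEXT[source] != wanted_context)


candidates = ["source_212", "source_214"]
wanted = ("GBP", 10)  # the second user 204 expects tens of pounds

preferred = choose_source(candidates, wanted, {"source_212", "source_214"})
fallback = choose_source(candidates, wanted, {"source_214"})
```

When the fallback source is chosen, the request must still be translated
into that source's context before the query is issued.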
Once the data sources 104 are chosen and any context translation that is
necessary is done, queries are submitted to the selected data sources 104.
For example, a query could be submitted to the first data source 212,
which requests all companies having a "profit" higher than a certain
number of pounds, while a query is submitted to the third data source 216
requesting a list of all the companies having a "stock value" lower than a
certain number of dollars. This is done by converting the request for
"stock value" in tens of pounds to a query specifying "stock value" in
ones of dollars. This allows the third data source 216 to efficiently
process the request and return data. The returned data, of course, is in
units of ones of dollars, and must be translated into units of tens of
pounds before being presented to the data receiver 102 that makes the
request.
Once the results of both queries are returned, those results must be
"joined", which is a well-known merge routine in the database field.
Joining the query results may be done by the request translator or it may
be done by the data receiver 102 itself.
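The join of the two result sets can be sketched as a merge on a shared
key. The function name, the key, and the sample rows are hypothetical
placeholders; the sketch assumes each key value appears at most once in
the right-hand result.

```python
def join_on(key, left, right):
    """Inner join of two lists of records on a shared key.

    Assumes key values are unique within `right`.
    """
    index = {row[key]: row for row in right}
    return [
        {**row, **index[row[key]]}
        for row in left
        if row[key] in index
    ]


# Placeholder results, already translated into the receiver's context.
earnings_rows = [
    {"company": "A", "earnings": 12},
    {"company": "B", "earnings": 30},
]
price_rows = [
    {"company": "B", "stock price": 4},
    {"company": "C", "stock price": 7},
]

joined = join_on("company", earnings_rows, price_rows)
```

Only companies present in both results survive the join, which matches a
request for companies satisfying both conditions.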
In some embodiments, one or more of the target data sources 104 is a
semi-structured data source, that is, the data source 104 is of a type
that cannot or does not respond to traditional, structured queries as do
relational databases. For example, referring to FIG. 6, the data receiver
102 issues a request that, as described above, is translated by the
request translator 300 into three sub-queries 602, 604, and 606. The data
receiver 102 may specify the target data sources 104 for the request or
the request translator 300 may determine which sources 104 to query as
described above. Sub-queries 604 and 606 may be issued direc