|
Claims  |
|
|
That which we claim is:
1. A system for storing and manipulating information in an information
base, said system comprising
(a) an information storage device;
(b) a plurality of information elements stored in said storage device, each
information element having at least one attribute with an orderable value,
and
(c) a topological map stored in said storage device for each said
attribute, said map comprising
(1) means for representing a predetermined number of ranges of attribute
values, which ranges collectively include the attribute values for all
information elements in the information base; and
(2) means defining a correspondence between each of said information
elements and the ranges to which they map.
2. A system as defined in claim 1 additionally including
(d) an input device cooperating with said information storage device for
receiving a query having specifications based upon specified parameters
related to an attribute of the stored information elements; and
(e) means responsive to receipt of a query for accessing said topological
map based upon said query and for identifying from said map, without
inspection of the information elements, information elements in the
information base which are known to meet the specifications of the query.
3. A system as defined in claim 2 wherein said means (e) also includes
means for identifying from said map, without inspection of the information
elements, information elements in the information base which are known not
to meet the specifications of the query.
4. A system as defined in claim 2 wherein said means (e) also includes
means for identifying from said map, without inspection of the information
elements, information elements in the information base which may meet the
specifications of the query.
5. A system as defined in claim 4 including means for inspecting only those
elements of the information base which may meet the specifications of the
query and for identifying which of those elements do meet the
specifications of the query, whereby all of the information elements of
the information base which meet the specifications of the query are
identified.
6. A system as defined in claim 1 additionally including
(d) an input device cooperating with said information storage device for
receiving a query having specifications based upon specified parameters
related to an attribute of the stored information elements; and
(e) means responsive to receipt of a query for accessing said topological
map based upon said query and for generating therefrom an output map
having elements which correspond to each of the information elements in
the information base and which indicate whether each respective
information element does, does not or may meet the specifications of the
query.
7. A system defined in claim 6 additionally including
(f) means for inspecting only those information elements of the information
base which were indicated in said output map that they may meet the
specifications of the query and for identifying which of those elements do
meet the specifications of the query, whereby all of the information
elements of the information base which meet the specifications of the
query are identified.
8. A system as defined in claim 1 wherein the information elements stored
in said storage device each have a plurality of different attributes, and
wherein said system additionally includes
(d) an input device cooperating with said information storage device for
receiving a query having specifications based upon Boolean logic related
to a plurality of the attributes of the stored information elements and to
specified parameters related to the attributes;
(e) means responsive to receipt of a query for accessing the topological
maps for each attribute specified by said query and for generating
therefrom, for each specified attribute, an intermediate output map having
elements which correspond to each of the information elements in the
information base and which indicate whether each respective information
element does, does not or may meet the specifications of the attribute
specified in the query; and
(f) means for combining the respective intermediate output maps in
accordance with the Boolean logic of the query to produce an output map
having elements which correspond to each of the information elements in
the information base and which indicate whether each respective
information element does, does not or may meet the specifications of the
query.
9. A system defined in claim 8 additionally including
(g) means for inspecting only those information elements of the information
base which were indicated in said output map that they may meet the
specifications of the query and for identifying which of those elements do
meet the specifications of the query, whereby all of the information
elements of the information base which meet the specifications of the
query are identified.
10. A system as defined in claim 1 wherein said means for representing a
predetermined number of ranges comprises a plurality of codes having
distinct and unique values, and wherein said means defining a
correspondence between each of said elements and the ranges to which they
map comprises a series of said codes, equal in number to the number of
information elements in the information base, with each code in the series
corresponding to a respective one of said information elements, and with
the values of the respective codes of said series defining a
correspondence between the respective information element and the range to
which the attribute value of that element maps.
11. A system as defined in claim 10 wherein each said topological map is
represented by an array of said codes.
12. A system as defined in claim 10 wherein each said topological map is
represented by a bit-map, each bit-map corresponding to a given code.
13. A system as defined in claim 12 wherein the bit-map comprises a
multi-level bit-map.
14. A system for storing and manipulating data records in a computer data
base, said system comprising
(a) a data storage device;
(b) a plurality of data records stored in said storage device, each data
record consisting of multiple fields of data, with each field having an
orderable value;
(c) a topological map for each of said fields, each said topological map
being stored in said storage device and including
(1) a plurality of range codes having distinct and unique values
representing a predetermined number of ranges of values for said field,
with the ranges collectively including the field values for all of the
records in the data base; and
(2) an array of elements, equal in number to the number of data records in
the data base, with each element in the array corresponding to a
respective one of said records in the data base, and with each element
possessing a range code to define a correspondence between each record and
the field value to which the range code maps.
15. A system as defined in claim 14 including
(d) an input device cooperating with said data storage device for receiving
a query having specifications based upon specified parameters and logic
related to one or more fields of the stored data records;
(e) means responsive to receipt of a query for selecting the stored
topological maps for the field or fields specified in the query and for
selecting relational operations corresponding to the logic specified in
the query; and
(f) means for performing the selected relational operations employing the
selected topological maps to identify a superset of data records in the
data base in which is included all records of the data base which meet the
specifications of the query.
16. A system as defined in claim 15 wherein said data storage device
includes high speed random access memory into which said topological maps
are loaded when performing the selected relational operations to identify
said superset of data records.
17. A system as defined in claim 15 wherein said data storage device
comprises a plurality of computer systems interconnected to communicate
with one another, the information elements of the information base being
distributed among said computer systems and said systems operating in
parallel to identify said superset of records.
18. A system as defined in claim 14 including means defining a correction
map for storing the range codes for every attribute for information
elements added to the information base after creation of said topological
map.
19. A method for generalized topological mapping of an information base
composed of a plurality of elements, each element having at least one
relatable attribute with an orderable value, comprising
(a) selecting an attribute of the elements;
(b) for the selected attribute, defining a predetermined number of ranges
of attribute values, which ranges collectively include the selected
attribute values for all elements in the information base; and
(c) for the selected attribute, storing in a information storage device a
topological map defining a correspondence between said elements and the
ranges to which they map.
20. A method according to claim 19 wherein each element of the information
base comprises a plurality of relatable attributes, each with an orderable
value, and said method includes repeating steps (a), (b) and (c) to
produce stored topological maps for each of said plurality of relatable
attributes.
21. A method according to claim 19 or 20 wherein the respective elements of
the information base comprise records.
22. A method according to claim 19 or 20 wherein the respective elements of
the information base comprise groups of records.
23. A method for controlled topological manipulation of an information base
composed of a plurality of elements, each element having at least one
relatable attribute with an orderable value, in response to a query having
specifications based upon specified parameters and logic related to at
least one attribute of the elements, said method comprising
(a) selecting an attribute of the elements;
(b) for the selected attribute, defining a predetermined number of ranges
of attribute values, which ranges collectively include the selected
attribute values for all elements in the information base;
(c) for the selected attribute, storing a topological map defining a
correspondence between said elements and the ranges to which they map;
(d) selecting the appropriate topological maps produced in accordance with
step (c) for the attribute or attributes specified in the query,
(e) selecting relational operations corresponding to the logic specified in
the query, and
(f) performing the selected relational operations employing the selected
topological maps to define a superset of elements in the information base
in which is included all elements of the information base meeting the
specifications of the query.
24. A method according to claim 23 wherein the query is based upon
specified parameters and logic related to a plurality of attributes of the
elements, and a plurality of topological maps related to such attributes
are selected in step (d), and wherein said step (f) comprises combining
the plurality of selected topological maps in accordance with the selected
relational operations to produce said superset.
25. A method for storing and manipulating information in an information
base comprising
(a) storing in an information storage device a plurality of information
elements, each element having a plurality of attributes with an orderable
value,
(b) for each said attribute of the elements,
(1) defining a predetermined number of ranges of attribute values, which
ranges collectively include the attribute values for all elements in the
information base, and
(2) storing in the information storage device a topological map defining a
correspondence between said elements and the ranges to which they map,
whereby a stored topological map is produced for each of said plurality of
attributes;
(c) receiving a query having specifications based upon specified parameters
and logic related to one or more attributes of the stored elements;
(d) selecting the stored topological maps for the attribute or attributes
specified in the query;
(e) selecting relational operations corresponding to the logic specified in
the query;
(f) performing the selected relational operations employing the selected
topological maps to define a superset of elements in the information base
in which is included all elements of the information base meeting the
specifications of the query; and
(g) utilizing said superset to selectively access the stored elements of
the information base.
26. A method for storing and manipulating information in an information
base comprising
(a) storing in an information storage device a plurality of information
elements, each element having a plurality of attributes with an orderable
value,
(b) also storing in the information storage device, for each said attribute
of the elements, a topological map which includes a plurality of range
codes representing a predetermined number of ranges of attribute value,
which ranges collectively include the attribute values for all information
elements in the information base, and an array of said range codes which
defines a correspondence between each of said information elements and the
ranges to which they map;
(c) receiving a query having specifications upon specified parameters and
logic related to one or more attributes of the stored information
elements;
(d) accessing the topological maps for each attribute specified in the
query and generating therefrom an output map having elements which
correspond to each of the information elements in the information base and
which indicate whether each respective information element does, does not,
or may meet the specifications of the query.
27. A method as defined in claim 26 including the steps of
adding additional information elements to the information base,
creating a correction map containing, for each such added elements, the
range codes for every attribute of the added information element, and
each time a topological map is accessed in accordance with step (d),
processing the correction map to update the topological map.
28. A method as defined in claim 26 wherein the query is based upon Boolean
logic related to a plurality of the attributes of the stored information
element and to specified parameters related to the attributes, and
wherein said step of accessing the topological maps and generating an
output map comprises
(1) accessing the topological map for each attribute specified in said
query and generating therefrom, for each specified attribute, an
intermediate output map having elements which correspond to each of the
information elements in the information base and which indicate whether
each respective information element does, does not, or may meet the
specifications of the respective attributes specified in the query;
(2) combining the intermediate output maps in accordance with the Boolean
logic of the query to produce said output map. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
MICROFICHE APPENDIX
A microfiche appendix consisting of one fiche of 67 frames is included as
part of this specification. This appendix contains a source code listing
of a computer program which implements a working embodiment of the
invention disclosed herein.
FIELD OF THE INVENTION
This invention relates to the efficient retrieval, manipulation, and
analysis of stored information in an information base.
BACKGROUND OF THE INVENTION
It is frequently desirable to retrieve information elements stored in an
information base on the basis of queries--for example a search for all
information elements in the information base that have certain values of
certain fields or attributes. Data processing systems typically require
the query specification to employ exact values in order to retrieve the
desired information from the information base. Thus, mathematically exact
values of particular attributes (fields) are input, which are then
compared with corresponding attribute (field) values of the information
elements in the information base to select those elements with exactly
equivalent values. This is also true of data manipulations, such as
sorting, where it is desired to output information elements based on an
ordering rule of one or more attributes (first, the record with the
highest attribute value, then the next highest and so on). Such selective
access permits the system to abstract the information base and deal only
with the elements which are pertinent to the specifications of the query.
Methods currently used to handle such selective query specifications fall
into two broad classes. The first is an exhaustive iterative examination
of each of the elements of the information base to find those meeting the
specifications of the query. The second is to store, for all elements,
duplicate values of selected attributes (associated with a corresponding
element address) in a specialized data structure (index) designed for
rapid access to values and corresponding information elements meeting the
specification. Examples of such specialized data structures include
ordered lists, trees, hashed indexes and a number of other variations, of
which, only a few are commercially viable.
Where applicable, such data structures or indexes provide much faster
access than iterative search methods but are subject to the following
limitations:
(1) The index files needed for reference to the attribute or attributes of
the information base may be of substantial size, especially when the
information element contains a large number of attributes which are
indexed for subsequent retrieval. In some instances the storage
requirements for the index files may equal or exceed the storage
requirements for the information base itself.
(2) Indexes provide efficient access only for the specific attribute or
combination of attributes for which the index is designed. They are
inefficient or inapplicable for the flexible inquiries encountered in
commercial practice which include a broad range of logical relations
between varied combinations of numerous attributes, often on the basis of
partial or inexact specifications.
Thus, while these methods satisfy the minimal requirements of data
processing systems, they are far from adequate for the increasingly
critical need for a general approach to efficient processing of complex
multi-attribute specifications.
It should be noted that there are specialized examples of current methods
which superficially deal with inexact specifications, such as partial key
access, occasional use of explicit ranges and some recent systems which
purport to permit the use of "plain english" specifications. However, such
systems are still dependent on the ability of the logic of the system to
translate or cross-relate such input to an exact key structure. Thus,
partial keys will locate a record in a tree index (after operator
inspection of a number of incorrect records) only if the initial
characters of the partial input exactly match the initial characters of
the complete key. Equivalent limitations apply to all other such methods
and performance becomes less efficient and more inaccurate as the
specifications become less precise. This is also true of recent
developments in "artificial intelligence" systems, which employ very
complex (and thus computationally bound) analytical logic, rule logic,
classifier logic and so on to translate incomplete and imprecise input
into the most specific and highest probability output possible, generally
incorporating prompts for additional input to clarify ambiguities.
Conversely, there is no general approach in the prior art which
purposefully utilizes less precise representations of data to enhance the
efficiency and validity of manipulating exact data values. It is an object
of this invention to provide such generalized systems.
SUMMARY OF THE INVENTION
The present invention provides methods and means for manipulating
information in a stored information base predicated on the unique
compactness and ease of processing of coded maps of the attributes of an
information base, referred to herein as topological maps. Each map
comprises compact symbols corresponding to predefined ranges of values of
an attribute. A symbol for the range encompassing the value of that
attribute for each information element is stored in said map in
correspondence to each element. The information elements, as well as the
topological maps, are stored in an information storage device.
A query is processed by accessing the pertinent topological map or maps
based upon the specifications of the query, and identifying from the map
or maps, the information elements in the information base which meet the
specifications of the query. Simple queries concerning a single attribute
are resolved by accessing the pertinent topological map for that
attribute, while more complex queries involving multiple attributes are
resolved by combining the topological maps for the attributes involved in
the query in accordance with the logical operators of the query.
The attribute value or range of values in the specifications of the query
is compared to the predefined ranges represented by the symbols in the
topological map for that attribute. Then, by referring solely to the map,
it is possible to quickly eliminate from consideration those information
elements of the information base which, as shown by their range symbols in
the topological map, could not possibly meet the specifications of the
query since the mapped range for that attribute value is clearly outside
the value or range of values in the specification. In a like manner, it is
possible to quickly identify information elements in the information base
which are known with certainty to meet the specifications of the query
because the mapped range for that attribute value is wholly within the
range specified in the query. Because the symbols used in the map
represent ranges of values rather than exact values, the resolution of the
query may find some information elements which "may" meet the
specifications of the query but which cannot be determined with certainty
solely by reference to the map. Only those information elements which
"may" meet the specifications of the query need be inspected to determine
whether they meet the specifications of the query.
In addition to efficient manipulation of exact values, this approach can
concurrently manipulate and correlate approximate or qualitative values
with equal effectiveness and efficiency. Thus, an attribute corresponding
to colors of clothing might have ranges defined as yellows, greens, blues,
browns, reds, etc. Any variations in colors or terminology derived from
one of these basic colors would be coded as belonging to that color range
(tan to brown, rose to red and so on). Typical records in a clothing store
data base would contain the garment type, color, style, brand and so on in
addition to quantitative data such as cost, price, number on hand, sales,
etc. The present invention will coherently manipulate both types of data,
accurately answering queries such as "what are the relative sales of tall
and short sizes of women's dresses in red and black colors for the fall
and summer seasons of the past three years".
Such extensive specifications can be processed by general purpose computers
with complete accuracy and specificity at rates which are orders of
magnitude faster than prior art because the compactness of such coded
topological maps permits a large number of attributes to be mapped with
less storage than is required for prior art key structures with only one
or two keys. Thus, all maps pertinent to an inquiry can be rapidly loaded
into high speed semiconductor memory simultaneously and the processor not
only has much less key data to evaluate, but can compare the compact codes
with simple, high speed logic and can access this key data at
semiconductor speeds, which are orders of magnitude faster than disk
input-output transfer rates. These increases in rate of processing are
compounding rather than additive and the net improvement is thus much
greater than the sum of these effects.
The use of value ranges to characterize elements of the information base
may, at a glance, appear to significantly reduce the specificity and thus
the efficiency of the system. However, for inquiries relating to two or
more ranges and attributes, which includes the vast majority of inquiries,
the number of uncertain elements is insignificant. Such uncertainty as may
be introduced by this approach is readily eliminated by directly checking
uncertain records, which are identified in the normal course of processing
the maps. Thus, any reduction in efficiency which may be introduced by
such checking is minor or insignificant relative to the enormous
improvement in the speed of isolating the records initially.
For direct access to specific records via primary keys, the present
invention offers no significant improvement in speed relative to the prior
art, since speed of current techniques is already at the limits of human
perception. However, this invention still offers major improvements in
terms of reduced storage space for direct key files, and can offer major
improvements in speed where direct access by a variety of secondary keys
is required. This again relates to the feasibility of storing the complete
compact code maps in high speed memory.
A few prior art techniques have been disclosed which have a superficial
resemblance to elements of this invention. These are exemplified by the
section entitled "Retrieval on Secondary Keys" in Volume 3/Sorting and
Searching/The Art of Computer Programming by Donald E. Knuth, who is
widely recognized as one of the leading authorities in this field. Both
the preamble and summary of this section point out the difficulty and
major limitations of the current art in coping with complex,
multi-attribute queries and define the examples included as highly
specialized techniques of narrow utility.
Knuth provides an example on page 554 of an "orthogonal range query" (two
perpendicular dimensions). He proposes partitioning the two dimensions
into ranges, but only for the limited purpose of defining a combined class
which is equivalent to the product of the two dimensions (i.e. area or
domain). He then proposes forming an inverted list of record numbers
corresponding to these product classes, each corresponding list including
any records whose dimensions would be encompassed by the product class.
Knuth finally proposes processing this list for all product classes which
may encompass any set of values of both dimensions which are included in a
specification defining upper and lower limits for each dimension (i.e.
areas within the domain). This isolates all records which fall within the
specified ranges, but also includes many records which satisfy one of the
two range specifications but not both. Because the ranges of Knuth's
method depend on two attributes, the lack of specificity for each
range/product class is compounded. For example, if the values satisfying a
query were included in a set of elements falling between the midpoints of
two adjacent ranges for each of the two dimensions, both Knuth's proposal
and the present invention would be 25 percent efficient (i.e., 1 value out
of 4 selected would actually meet the specification). However, if the
number of ranges were doubled for both methods, every selection by the
present invention would meet the specification (i.e., 100 percent
efficiency), while Knuth's proposal would still produce 1 invalid record
for each correct one (i.e., 50 percent efficiency). Hence, additional
complex processing is necessary to discard these irrelevant records.
Although this approach emulates a few of the functions of the present
invention for limited and specialized information retrieval requirements,
the structure proposed by Knuth is inherently multidimensional; queries
referencing only one of the attributes described by this multidimensional
structure radically reduce the efficiency of utilizing this structure to
find records satisfying the query. Thus, a dominant distinction is that
Knuth's proposal (as he points out) cannot be straightforwardly extended
to efficient interaction with other such lists to satisfy the general case
of queries where the combinations of attributes referenced in the query
are determined dynamically and not known a priori. Conversely, the ability
to define correspondences in the form of "topological maps" which may be
rapidly searched and which are compatible with the concurrent processing
of any number of dimensions or types of information is an important
feature of this invention. A considerable number of additional
distinctions will be apparent from the detailed specifications, but the
above is clearly sufficient to distinguish this invention from the prior
art.
It will be evident to those skilled in the art that the scope of this
invention encompasses numerous variations and that references to specific
preferred embodiments are illustrative and not limiting. Thus, means
incorporating parallel processors, dedicated logic processors, optical
devices and so on and methods utilizing inverted maps, dual cross-coded
maps and so on, separately or in combinations will be seen as synergistic
embodiments which are clearly within the spirit of this invention.
DESCRIPTION OF THE DRAWINGS
Further features and aspects of the invention will become apparent from the
detailed description and illustrative example which follow, and from the
accompanying drawings, in which:
FIG. 1 is a block schematic representation illustrating the primary
elements of the system of the present invention; and
FIG. 2 is a block schematic representation illustrating an arrangement of
microcomputers suitable for implementation of the present invention in a
parallel processing environment.
DESCRIPTION OF ILLUSTRATIVE EMBODIMENT
The present invention is essentially a contextual reference information
retrieval system. Given any reference to information by content of the
information, this system efficiently retrieves information matching the
specification.
As illustrated schematically in FIG. 1, the system, generally indicated by
the reference character 10, is used to access and manipulate an
information base 12 which is stored in a storage device 14. The
information base 12 is comprised of one or more information elements. Each
information element is comprised of one or more attributes (or fields),
one or more of these attributes having an orderable value. By "orderable
value" is meant that the attribute of the element has a value capable of
being evaluated and being placed in some order in relation to the value of
that attribute for other elements in the information base. This may
include numbers, characters of the alphabet, symbols, codes, etc. The
system 10 consists of two subsystems: an input subsystem 20, and an output
subsystem 30.
The input subsystem accepts input from an input device 22. The input device
22 is capable of receiving information base elements where an information
base element is comprised of one or more attributes and the corresponding
values for these attributes. The input subsystem 10 is used to process the
individual information elements as they are input to the information base,
or as changes or deletions are made and to produce processed
representations or "topological maps" 16 of the attributes of the
information base elements. The topological maps 16 are stored in the
storage device 14 for subsequent use by the output subsystem 30.
The output subsystem 30 is given a query 32 as input, i.e., a reference to
the information on the basis of a specification of the values of one or
more attributes. The query 32 may be entered into the output subsystem 30
by any suitable input device, and may for example, utilize the same input
device 22 as is employed by the input subsystem 20. The output subsystem
30 then utilizes the storage device 14 to retrieve the topological maps 16
of the attributes referenced by the specification. These topological maps
are then manipulated in accord with the query, the end result being one or
more output maps 18 indicating information elements which either:
(1) Do meet the specification, or
(2) May meet the specification, or
(3) Do not meet the specification.
The output map thus defines a "superset" of the information elements in the
information base which meet the specification of the query. It is a
superset because some of the elements "may" satisfy the query.
The output map which is generated indicates which of the elements in the
superset do meet the specification of the query and which of those
elements may meet the specification. Those elements which the map
indicates do meet the specification are known with certainty without ever
having accessed or inspected the stored information elements themselves.
Now, only those elements which the output map indicates may meet the
specification are accessed and inspected to determine which ones do meet
the specification. The results of the query are communicated to the user
by an output device 34. As will become apparent from the illustrative
example which follows, the output subsystem is capable of rapidly
resolving various kinds of queries, including queries as to exact values
of certain attributes, range queries, and complex queries about multiple
attributes using Boolean logic.
The specific configuration of the storage device 14 is not critical, and
may take various forms depending on how the present invention is
implemented. In a microcomputer implementation, for example, it is
desirable that at least a portion of the data storage device comprise high
speed random access memory, such as semiconductor memory for example. The
topological maps would be loaded into the high speed random access memory
during resolution of a query to facilitate manipulation and processing of
the maps. Additional data storage --e.g. for permanent storage of the
information elements of the information base and topological maps--can be
handled by other suitable data storage means such as magnetic media,
bubble memory devices, optical (laser) memory devices, etc.
For illustrative purposes, we will consider an example of applying this
technique to an information base composed of elements represented by fixed
length ASCII records. Each element in the information base is comprised of
attributes; each attribute type can be either "Alpha", meaning that it can
store characters or digits, or "Integer" meaning that it can store an
integer value (represented as an ASCII string of digits). The attribute
list for the elements of this information base is shown below:
______________________________________
##STR1##
##STR2##
______________________________________
Thus, the information base contains 103 records (since we started numbering
with 0), and slot 103 is currently the End of File for this information
base.
Using the above as the information base, a description of the process is
given below.
Input Subsystem
(a) Range Definition
Before the topological maps may be created, a range definition must be
created for each attribute. The range definition comprises one or more
unique ranges of values for the attribute, i.e., a lower bound and upper
bound for this attribute. Collectively all the ranges which make up the
range definition for the attribute must include all possible values of
this attribute for all elements in the information base, i.e., for every
possible attribute value there must be at least one range which includes
that value. One way to determine such a range definition is to arbitrarily
define these ranges. An example of a "range definition" is:
______________________________________
Range Number > <=
______________________________________
0 -.infin.
300
1 300 700
2 4000 7372
3 5700 9300
4 700 4000
5 7372 +.infin.
______________________________________
An information element is said "to map" to a particular range of an
attribute if the value of the attribute for that information element is
included in said range. The attribute value is also said to map to that
range.
Some examples of values and the range(s) to which they map for this
definition are:
______________________________________
Value Range Number(s)
______________________________________
4563 2
5900 2,3
12 0
479 1
______________________________________
Although the ranges may be constructed to overlap, as shown above,
advantageously, these ranges are constructed to be mutually exclusive so
that any given value of an attribute maps to exactly one range in the
range definition. Additionally, it is in general advantageous to have
approximately equal number of information elements map to each range of an
attribute. This can be done by taking a sample of the information elements
in the information base and selecting a range definition such that an
equal number of the information elements in the sample map to each range,
i.e., the range definition "equi-partitions" the sample. Assuming the size
of the sample was large enough to be statistically significant, this range
definition will also serve to approximately equi-partition the information
base as a whole.
Additionally, a unique code/representation is associated with each of the
ranges for an attribute. Thus, if a range definition with 250 ranges were
created, the range including the lowest attribute values would be assigned
a 0, the next lowest range a 1, and so on up to 249 for the highest range.
This advantageous range definition for an attribute can be effected by this
method:
(a) Determine the number of ranges to be in the range definition,
NUM-RANGES. (A typical value of NUM-RANGES is 250.)
(b) Take (NUM-RANGES * SAMPLES-PER-RANGE) samples from the information
base, and array, the SAMPLE-ARRAY. (A typical value of SAMPLES-PER-RANGE
is 30.)
(c) Sort the entries in SAMPLE-ARRAY into ascending order (based on the
ordering rule appropriate for this type of attribute).
(d) Then every SAMPLES-PER-RANGE'th entry is selected from SAMPLE-ARRAY to
serve as the upper-bound for a range (the lower bound being defined by the
previous upper bound). This result is stored in RANGE-DEF-ARRAY. Finally,
the last value stored in RANGE-DEF-ARRAY is the highest value possible for
this attribute.
To take a specific example, consider the case of our information base
described above. Assume that we wish to create a range definition for
SALARY, NUM-RANGES=8, and SAMPLES-PER-RANGE=5. Thus, we must take
(NUM-RANGES * SAMPLES-PER-RANGE)=(8 * 5)=40 samples from the information
base transferring the value of SALARY from the information base element
selected to an entry in the SAMPLE-ARRAY. Graphically, the sampling
operation can be seen as:
______________________________________
##STR3##
______________________________________
Now that we have SAMPLE-ARRAY filled with sample values, the array is
sorted in ascending order. Once it is sorted, every SAMPLE-PER-RANGE'th
value of the sorted SAMPLE-ARRAY is used as the upper bound for a range.
Graphically, this process can be seen as:
______________________________________
##STR4##
______________________________________
The end result of this process is that RANGE-DEF-ARRAY looks like this:
______________________________________
##STR5##
______________________________________
Input Operation
The input procedure comprises the following processing steps:
(A) Inputting one or more information elements each of which has one or
more attributes, each attribute having a value.
(B) For each attribute element
(1) For each information element
(a) Determine the range to which the attribute maps using the range
definition defined for this attribute.
(b) Store the code representing the range to which this attribute value
mapped in a location in this attribute's topological map which corresponds
to this information element's record number.
To determine the range to which a value VALUE maps, the following algorithm
is applied:
(a) FOR i :=0 TO NUM-RANGES
(1) IF RANGE-DEF-ARRAY[i]>=VALUE GOTO EXIT
(b) EXIT: RETURN(i);
So, i is incremented until we find a boundary value which is greater than
or equal to the value. At this point, the index into the RANGE-DEF-ARRAY
is the range number for this value of this attribute.
Graphically, this process can be seen as:
______________________________________
##STR6##
______________________________________
The information base has been stripped in the above illustration to show
just the SALARY attribute, as this is the attribute which we are concerned
with inputting. The method described above is used for each value to
determine the range to which it maps. The corresponding range number for
the range to which the value maps is then stored in the SALARY topological
map in correspondence with the record which gave rise to it. In this case,
the correspondence is kept by using the record number for each information
base element as the offset into the topological map in which to store the
range number for the information element.
For further illustration, consider a new element to be added to the
information base:
______________________________________
NAME SALARY JOB-ID
______________________________________
Jo Jo 2200 8391
______________________________________
This record is stored at the end of file (EOF):
##STR7##
A new topological map entry is made as follows:
##STR8##
One optimization which can be important to the method functioning
efficiently is the concept of a "correction map". If the above method is
literally applied to an information base, then each time a new information
base element is added or an attribute of an existing information element
is changed, an update is required for each attribute's topological map. In
the case of our example information base there are three attributes.
Therefore, each record being added will require three accesses to update
the topological maps. In the case of an information base wherein each
element is described with 12 attributes (not uncommon), this would require
12 accesses being made per element added. For our example information
base, if 100 elements were added to the information base, then (100 *
3)=300 accesses would be required.
However, we can define a correction map, where this "correction map" can
store the range number for every attribute of an information element.
Thus, when a new record is added, only one update need be made instead of
three; this one update will be made to the correction map. The correction
map entry will have the record number of the added information element
along with the range number for each attribute. When the map for an
attribute is retrieved, the correction map is also loaded, and each entry
in the correction map is processed to update the memory image of the
attribute map to get a 100% up-to-date map.
Later, all of the entries in the correction map may be processed together
to update the topological map for each attribute. In the case of our
example, since 100 elements have been added, and since they were all added
at EOF, the elements of the topological map to be updated will all be
located in the same physical block. Thus, we only require 1 access for
each attribute, in which all 100 entries may be added to the topological
map. The total number of accesses required for the "raw" application of
the method was (100 * 3)=300 accesses. With this improvement, however,
only (100 * 1)+(3 * 1)=103 accesses were required. We have thus reduced
the storage device I/O requirements by about 3:1.
Output Subsystem Operation
One general form of query is called a Boolean query. It is defined to be
one or more range queries joined by AND, OR, or NOT. A range q | | |