|
Claims  |
|
|
We claim:
1. A database management system comprising:
a memory for storing data structures, including
a detail table means for storing a plurality of records, each of the
plurality of records being addressable by record pointers and including
dimension fields and summary fields, the dimension fields including
dimension values, and the summary fields including numeric information;
a summary table means including summary nodes for storing summary
information of the summary fields for summary sets of the plurality of
records;
a detail index means including detail index nodes for storing the record
pointers for index sets of the plurality of records, wherein
each index set of the plurality of records includes
a plurality of records having a common combination of dimension values for
the associated dimension fields, and wherein
each summary set of the plurality of records includes
a plurality of index sets having a common combination of dimension values
for the associated dimension fields;
a dimension node means for storing information identifying the summary
table means and the detail index means;
an input means for providing an input set of dimension values for
identifying a desired set of the plurality of records;
a processor means connected to the memory and from the input means for
accessing the memory and performing operations upon data structures; and
a control means connected to the processor means for controlling operations
of the processor means,
the control means including,
a search means connected from the processor means and responsive to the
input set of dimension values for
directing the processor means to read the dimension node means to identify
the summary table means and the detail index means, and
responsive to the identifications of the summary table means and the detail
table means,
directing the processor means to search
the summary table means to find a summary node having a set of dimension
values corresponding to the input set of dimension values or
the detail index means to find an index set having a set of dimension
values corresponding to the input set of dimension values
and to provide as an output
the summary values stored in a summary node having a set of dimension
values corresponding to the input set of dimension values, or
a record pointer for an index set having a set of dimension values
corresponding to the input set of dimension values;
a gather means connected from the processor means and responsive to a
record pointer provided as an output by the processor means for directing
the processor means
to read the index set of records and provide as an output the records of
the index set of records identified by the record pointers contained in
the detail index means found by the processor means;
a summarize means connected from the processor means and responsive to the
records of the index set of records provided by the processor means for
directing the processor means to extract summary information from the
records of the index set of records read by the processor means; and,
a display means
connected from the processor means and responsive to
the summary values stored in a summary node having a set of dimension
values corresponding to the input set of dimension values or
the summary information extracted from the records of the index set of
records
for directing the processor means to generate an output presenting
the summary values stored in a summary node having a set of dimension
values corresponding to the input set of dimension values, or
the summary information extracted from the records of the index set of
records.
2. The database management system of claim 1, wherein:
the memory means further includes
a key value table means for associating an integer value with each
dimension value;
the detail table means represents the dimension values by storing the
associated integer values in the dimension fields of the detail table
means; and
in the control means,
the search means is responsive to the integer values associated with a set
of dimension values for finding the summary table means or detail index
means associated with a desired set of records.
3. The database management system of claim 2 wherein said summary
information in the summary nodes comprises:
statistical information including a count of the number of records
associated with a set of records, and
for each summary field,
a sum of the values,
a sum of the squares of the values,
a minimum value, and
a maximum value
of the values stored in the summary field in the set of records.
4. The database management system of claim 2 wherein:
each summary set of the plurality of records is
defined by a combination of dimension values and associated dimension
fields to contain all the records containing the combination of dimension
values for the associated dimension fields and wherein
the size of each summary set is no less than a predetermined threshold
number, or wherein
the size of a containing summary set formed by removing a dimension value
and associated dimension field from the definition is no less than the
predetermined threshold number.
5. The database management system of claim 4 wherein:
each index set of the plurality of records is
defined by a combination of dimension values and associated dimension
fields to contain all the records containing the combination of dimension
values for the associated dimension field and wherein
the size of the sets are less than the predetermined threshold number, or
wherein
the size of a containing set formed by removing a dimension value and
associated dimension field from the definition is no less than a
predetermined threshold number
or
the set contains varying dimension values for only one dimension field.
6. The database management system of claim 5 wherein the dimension nodes
means comprises:
a search tree including,
dimension nodes for identifying sets of records according to combinations
of dimension values, and
the summary nodes and the detail index nodes arranged in a hierarchical
fashion, wherein
the plurality of summary nodes is divided into
a first plurality of summary nodes for storing the summary information for
the sets whose record pointers are stored by the detail index means, and
a second plurality of summary nodes for storing the summarizing record of
sets whose record pointers are not stored by the detail index means,
each summary node of the first plurality of summary nodes contains a
pointer to a detail index node storing record pointers for the set of
records summarized by the summary node,
each summary node of the second plurality of summary nodes contains
pointers to the dimension nodes, there being one dimension node for each
dimension field not specified in the combination of dimension values for
the summary node and the dimension node being associated with the
dimension field,
each dimension node stores pointers to child summary nodes, there being one
pointer for each dimension value of the dimension field associated with
the dimension node contained in the set of records summarized by a parent
summary node of the dimension node,
the child summary node summarizing the subset of the set of records
containing the dimension value.
7. The database management system of claim 6 wherein:
the dimension nodes, the summary nodes, and the detail index nodes are
arranged in a directed graph wherein each summary node is a child of a
plurality of the dimension nodes, and
each dimension node stores pointers to child summary nodes, there being one
pointer for each dimension value of the dimension field associated with
the dimension node contained in the set of records summarized by a parent
summary node of the dimension node, and the child summary node summarizing
the subset of the set of records containing the dimension value.
8. The database management system of claim 7 wherein:
the summary tree is represented as a plurality of data tables, and wherein
a root summary node is represented as a table with one row containing the
summary information for a set of all the records in the detail table,
the detail index nodes for set of records are represented as a table of
record pointers of the set of records, and
each dimension node is represented as a two-dimensional table, each row of
the table representing a child summary node of the dimension node, each
row containing summary information for the child summary node, and each
row containing pointers to
tables representing child dimension nodes of the child summary node when
the child summary node is one of the second plurality of summary nodes, or
to
a table representing child detail index node of the child summary node when
the child summary node is one of the first plurality of summary nodes.
9. The system of claim 8 wherein:
the memory means includes
the key info table means for storing non-summary fields, each non-summary
field containing information associated with one dimension field and the
non-summary fields including non-summary values;
each non-summary value associated with one of the dimension values of the
dimension field being associated with the non-summary field; and
the control means further includes
an extract means connected from the processor means and responsive to the
input set of dimension values for directing the processor means to read
and provide as an output non-summary information from the key info tables
for a plurality of dimension fields and dimension values defined by the
input set of dimension values, and
the display means further includes
an extract display means connected from the processor means and responsive
to the output of the extract means for directing the processor means to
generate display outputs presenting the extracted non-summary information.
10. The database management system of claim 9 wherein:
the detail table records further include
detail non-summary fields for storing information associated with
individual detail table records, the detail non-summary fields including
non-summary detail values, and
the extract means is further responsive to the input set of dimension
values for reading and providing as an output non-summary information from
the detail table for a plurality of dimension fields and dimension values
defined by the input set of dimension values; and
the extract display means is further responsive to the output of the
extract means for directing the processor means to generate display
outputs for presenting the extracted non-summary information. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
REFERENCE PUBLICATIONS
"Multi-attribute Retrieval with Combined Indexes" , Nov, 1970,
Communications of the ACM, pp 660-665, Vol 13, Number 11.
BACKGROUND OF THE INVENTION
1. Field of the invention
This invention relates to methods of storing and accessing data on digital
computers, and in particular, to an improved data base system for
organizing large amounts of data for fast retrieval and processing.
2. Description of the Prior Art
Databases are used to store large amounts of data in digital computers. To
analyze this data, users need to be able to identify sets of records based
on a combination of attributes and generate summary information, such as
sums, averages, and other statistical functions, for these sets.
Traditional databases may provide support for identifying some of these
sets, but not all of them in an efficient manner. Multidimensional
databases can provide fast access to more sets, for a small number of
attributes. Even so, providing summary information on a set requires
accessing all elements of that set, and is a time-consuming operation for
large sets thereby delaying interactive queries for this information.
The following U.S. Patents disclose typical database management systems.
U.S. Pat. No. 4,554,631, entitled "Keyword Search Automatic Limiting
Method."
U.S. Pat. No. 4,606,002, entitled "B-Tree Structured Data Base Using Spare
Array Bit Maps to Store Inverted Lists."
U.S. Pat. No. 4,611,272, entitled "Key-Accessed File Organization."
U.S. Pat. No. 4,468,728, entitled "Data Structure and Search Method for a
Data Base Management System."
OBJECTS OF THE INVENTION
Accordingly, it is a primary object of the invention to have an improved
database management system.
It is an object of the invention to have a database management system for
providing rapid summary information for large sets of records.
SUMMARY OF THE INVENTION
The above objects and advantages are achieved in a preferred embodiment of
the present invention. According to the preferred embodiment, a data base
management system for storing and accessing data provides fast interactive
access to summary information for different sets of input records, where
the sets are defined by specifying values for multiple attributes. The
method involves calculating a large portion of this summary information
and building it into a data structure. At access time, users specify sets
by giving values for multiple attributes, and the data structure is
searched for the appropriate summary information for that set. If found,
the summary information is displayed; otherwise, the summary information
is calculated from the records themselves.
The data structure consists of the original data in a relational detail
table, a summary tree structure for organizing and summarizing the data
along several dimension fields, and key value tables for encoding
dimension field values as integers.
Key info tables store information associated with dimension field values,
for convenient reference. The summary tree is a search tree where internal
nodes describe and summarize numeric fields in sets of records. Nodes
deeper in the tree describe more specific sets of records, until the size
of the set is smaller than a given threshold, at which point the
individual records of the set are indexed by a detail index node. The tree
can be represented as a numbered set of tables.
At access time, the user specifies a set of dimension values. The summary
tree is walked using these values to find the appropriate node. If the
node is a summary node, the summary information is displayed. If the node
is a detail index node, the set of records is read from the detail table,
and the summary information is calculated from this set.
BRIEF DESCRIPTION OF THE DRAWINGS
The novel features which are believed to be characteristic of the invention
both as to its organization and method of operation, together with further
objects and advantages will be better understood from the following
description when considered in connection with the accompanying drawings.
It is expressly understood, however, that each of the drawings is given
for the purpose of illustration and description only and is not intended
as a definition of the limits of the present invention.
FIG. 1 shows in block diagram form the systems architecture for the
invention, illustrating the general process of building and accessing the
database.
FIG. 2 shows sample data files, suitable as input to this invention, and
the corresponding data dictionary.
FIG. 3 gives a high-level view of the database, showing the major
components.
FIGS. 4 and 5 show the structure of the summary tree, which is used to
access and organize the data along arbitrary dimension fields. FIG. 5
shows the summary tree with sharing subtrees.
FIG. 6 shows the contents of a summary node.
FIG. 7 shows how the summary tree can be represented in a tabular format.
FIG. 8 shows the builder program, which builds the database.
FIG. 9 shows how the summary tree portion of the database is built.
FIG. 10 shows the diver program, and how it accesses the database to
provide summary information to the user.
DESCRIPTION OF THE PREFERRED EMBODIMENT
The present invention is a method of storing data in the memory of a data
processing system and accessing the data in a manner that provides fast
interactive access to summary information for different sets of input
records, where the sets are defined by specifying values for multiple
search keys. The method involves calculating a large portion of this
summary information and building it into a database. When this database is
accessed, this summary information is available without having to
calculate it. Although the process of building the database may be a
time-consuming operation, it can be done when the system is off-line. At
access time, when an operator is waiting for an answer, a database query
can be answered quickly.
FIG. 1 shows the system architecture of the present invention. The
invention generally comprises a processor 10, a main memory 12, and
secondary storage 14. The processor 10 is used to run builder 20 and diver
32 programs, using main memory 12 for temporary storage and intermediate
results, and using secondary storage 14 for permanent storage. Input data
files 16, described by a data dictionary 18 is fed into the builder
program 20. The builder program 20 produces a database 22, which is
comprised of key value tables 24, key info tables 26, a summary tree 28,
and a detail table 30. The diver program 32, given dimension selections 34
from an end user 36, uses the database 22 to produce screen displays and
reports 38 to the end user 36.
FIG. 2. shows sample input data that can be used as input to the builder
program 20. The input to the builder 20 is modelled as a set of flat
files. Each flat file 50, 52 consists of a number of records 48, each of
which describes a single entity. The records 48 are divided up into
different fields 46, which represent different attributes of the entity.
Multiple files are related by using a common field to perform a standard
relational database join operation to produce a single logical file, with
one record for each entity.
The data dictionary 18 of FIG. 1 is used to describe the fields in the
input files, and assigns a type to each of the fields. The data dictionary
contains three pieces of information about each field: a field name 40, a
field type 42, and an associated dimension 44. The field name 40 is used
to identify the information contained in the field. The field type 42
identifies the field as either a dimension, a summary field, or a
non-summary field. The dimension field is a search key along which the
data is organized and summarized. The summary field is a numeric quantity
that provides useful information when summed and averaged. The non-summary
field contains information that is associated with each input record, or
with a value of the dimension field. The non-summary field would be a
field that is not important enough to be a dimension field, or a field
that is directly related to an existing dimension field. The associated
dimension 44 is used for non-summary fields to identify the dimension
field that the information is associated with, or "Detail" if the
non-summary information is unique for each input record.
The example in FIG. 2 illustrates a personnel database, where each record
48 in an employee file 50 represents an employee of a company. The input
files consist of the employee file 50, which contains data about each
employee, and the department file 52, which contains information about the
different departments. The department file 52 can be joined to the
employee file 50 using the common Dept. Id. field. The data dictionary 18
describes the various fields of the input files, and identifies the fields
as dimension fields, summary fields and non-summary fields. The Dept Name
and Manager fields are non-summary fields associated with the Dept. Id
dimension field. The Address and Name fields are non-summary fields
associated with each input record. Alternatively, the address and name
fields could have been associated with the Employee Id dimension field,
since the Employee Id field is unique for each record.
FIG. 3 shows the major components of the database. The database comprises
four parts: a summary tree 28, key value tables 24, a detail table 30, and
key info tables 26.
The key value tables 24 provide mappings between integers 54 and the
possible values 56 for the different dimension fields. For each dimension
field in the input, there is a key value table, with entries for all
values of that dimension field that appear in the input. The key value
tables 24 allow dimension values to be represented as compact integers in
the other parts of the database; the key value table can be indexed to
convert these integers 54 into the actual dimension values 56. In the
preferred embodiment of the invention, the key value tables 24 are sorted
according to their natural sorting order, so that sorting the numbers
associated with a key will result in sorting the key. In this example, the
natural sorting order is alphabetical for alphanumeric dimension fields.
The detail table 30 is a relational table representing the input data in a
tabular format. For each record in the input, there is an entry 68 in the
table, containing dimension fields 58, summary data fields 60, and those
non-summary data fields that are not associated with dimension fields.
Each entry 68 is identified by an associated record address 69 which, as
described further below, is stored in summary tree 28. The dimension
fields 58 are represented numerically as defined by their key value tables
24 to reduce the storage space needed.
The key info tables 26 are used to store information for non-summary fields
associated with dimension fields. For each dimension field that has
non-summary information associated with it, there exists a key info table
64. The key info table 64 is indexed 66 using the numeric ordering of the
key as defined by the appropriate key value table 70. For each value of
the dimension field, there is an entry in the key info table containing
the information associated with that dimension value.
The summary tree 28 is used to summarize and index the detail table 30.
Referring to FIG. 4, a root summary node 72 summarizes the set of all
records in the detail table 30, while lower levels 74, 76 summarize
smaller sets, which are defined by the combination of different dimension
values 78. Leaves 80 of the summary tree 72 are pointers to the detail
table 30. The same detail record may appear several times in the summary
tree 72. A node of the tree represents the set of detail records that are
descended from that node. A summary node 82 will summarize the set of
records, while a dimension node 84 will index that set of records along a
certain dimension field, forming smaller sets. When the set is smaller
than a certain threshold, the dimension nodes are replaced by a single
detail index node 86, which contains record pointers 80 to individual data
records by using record addresses 69, thus reducing the branching of the
summary tree. A detail index node 86 also replaces dimension nodes when
the combination of dimension values 78 defining the set in the summary
node includes all but one dimension value: the final dimension value would
divide the set into very small subsets, which are not worth summarizing in
the summary tree. This replacement should be made higher up in the tree if
ancestor dimension nodes describe the exact same set as this detail index
node; the intermediate dimension and summary nodes do not subdivide the
set.
The summary tree 28 of FIG. 1 consists of three types of nodes: summary
nodes 82, dimension nodes 84, and detail index nodes 86. Summary nodes 82
contain summary information, while dimension nodes 84 and detail index
nodes 86 simply provide structure to the tree.
As shown in FIG. 6, a summary node contains summary information for the set
of records it represents. The summary node contains a count of records in
the set 90, and summary information for each summary field 92, 94. This
summary information would be the sum 96, the sum of squares 98, the
minimum 100 and the maximum 102 of the set of values for the summary
field.
The summary tree 28, as shown in FIG. 4, is structured so that the first
level of the summary tree is a summary node 72, followed by alternating
levels of dimension nodes and summary nodes until a detail index node is
reached. If a summary node 104 represents fewer than a given threshold
number of detail records, or if the summary node represents a set of
records that differ in only one dimension field, the child of a summary
node is a detail index node 106. Otherwise, the children of a summary node
108 are dimension nodes 110, one for each dimension field that has not yet
been specified in the set 112 that the summary node represents. These
dimension nodes 110 represent the same set as the summary node, causing
detail records to be duplicated among these dimension nodes 110. The
children of a dimension node 114 are summary nodes 76, one for each value
of the given dimension in the set represented by the dimension node 114.
These summary nodes 76 represent fewer detail records than the dimension
node 114.
When a summary node 108 has dimension nodes 110 as children, each dimension
node represents the same set of detail records. Each detail record in the
set is a descendent of all the dimension nodes 110, creating duplication
in the tree. If the tree summarizes n dimensions, the first level of
dimension nodes 116 will cause detail records to appear n times. The
second level will cause detail records to appear n(n-1) times. If the tree
were fully developed, the dimension branching would cause each record to
appear in the tree n1 times. The detail index nodes reduce this number by
cutting off the dimension nodes at a certain level.
FIG. 5 illustrates another technique for eliminating redundancy. Subtrees
of the summary tree are shared where possible. In the summary tree, a
summary node representing the same set of records appears in several
places of the tree, depending on the order of dimensions used to access
it. For instance, the summary node 120 representing the set of records
with dimension field SEX having value M and dimension field ZIPCODE having
value 02046 is in a different part of the tree than the summary node
representing records with dimension field ZIPCODE having value 02046 and
dimension field SEX having value M. The corresponding subtree 88 could be
shared among different parts of the tree 122, 124, reducing the
duplication.
In the preferred embodiment of the invention, the summary tree is
represented using tables. This provides a compact representation of the
tree, and provides locality of reference for accessing brother summary
nodes under the same dimension node.
FIG. 7 shows the correspondence between the logical structure of the
summary tree and a tabular representation of the tree. The root summary
node 72 can be represented as a table with a single row The row contains a
count 90 of records in the database, summary information for the summary
fields 92, 94, and dimension pointers 126 to tables representing the child
dimension nodes.
A dimension node 128 and its child summary nodes 136 can be represented
using a two-dimensional table. Each row represents a summary node; each
row contains the dimension value 138 which identifies the summary node,
the count 90 of records in the set, the summary information for the
summary fields, and dimension pointers 126 to tables representing child
dimension nodes or detail index nodes.
A flow chart for the builder program 20 is shown in FIG. 8. The builder
program makes several passes on the input, generating the different parts
of the summary tree database. Step 140 analyzes the input file 16 to
determine the types, sizes and value ranges for the input fields. Step 142
creates the key value tables 24 by reading the input, and maintaining
tables for each dimension field in main memory 12 of unique dimension
values. These tables are sorted, and written out as key value tables 24.
Step 144 creates the key info tables 22, by reading the input file 16, and
storing the non-summary field values into the appropriate key info table.
Step 146 creates the detail table 30 by reading each input record,
translating the dimension values into their corresponding numeric indexes
using the key value tables 24, and storing the dimension values, the
summary fields, and the appropriate non-summary fields into the detail
table 30.
The detail table 30 is sorted to provide some locality of reference, so
that similar entries are located in the same area of secondary storage. In
the preferred embodiment of the invention, this is a multiple-key sort on
all dimension fields, where the dimension fields are ordered by the number
of unique values they have; the dimension field with the least number of
unique values is the primary key, while the dimension field with the most
number of values is the least significant key. This means that detail
records near each other in the detail tables share the most number of
possible dimension values. When reading in from secondary storage all
records that match a certain combination of dimension values, these
related records can be read at one time without performing too many
expensive disk seeks.
Step 148 creates the summary tree 28 itself. This is a recursive process
that creates the tree in a top-down manner and summarizes the sets in a
bottom-up manner. FIG. 9 illustrates this step in more detail. Step 150
selects all records in the detail table 30. Block 152 represents the
recursive process of summarizing a set of records and producing summary
information; it is used recursively at Step 168. Step 154 checks if the
number of records is smaller than the given threshold, or no dimension
fields are left to summarize on; if so, Step 178 builds a detail index
node, using the record numbers of the current set of records. Step 180
reads in the set of records from the detail table, and calculates summary
information from this set.
If the number of records is larger than the threshold and dimension fields
remain, for each remaining dimension field 156 the records are sorted
along that dimension field 160, and the records are divided into subsets
according to different values of that dimension field 162. For each
subset, the current summary tree built so far is examined in Decision Step
166 to see if the same subset has been built using a different order of
dimension fields. If so, a pointer to the appropriate subtree is placed
into this part of the tree in Step 176 to accomplish the sharing described
in FIG. 5. If the subset has not yet been processed, Step 168 is a
recursive call to the current procedure to summarize this subset, and
build a subtree and a summary node in Step 170. The process continues for
all subsets of records in Decision Step 172. The summary information of
these subsets are combined in Step 204 to generate summary information for
the current set of records. Sum values are added together, and minimums
and maximums combined to create the contents of a summary node. This
summary node is returned from the recursive procedure 152 to be placed in
the proper place in the summary tree.
FIG. 10 illustrates the logic of the Diver program 32 and how it uses the
database 22 to present summary information to the End user 36 given a set
of dimension selections 34. The dimension selections 34 specifies
dimension values for a set of dimension fields, selecting a set of detail
records. At Step 182, the root summary node 72 of the summary tree 28 is
read into main memory 12. Step 184 chooses a dimension from the dimension
selections. Step 186 follows the appropriate dimension pointer to its
target node. If this node is a detail index node 188, then this detail
index node contains a superset of the desired set. Step 200 reads the
detail records from the detail table 30, which are then sorted 202
according to their dimension values of the dimension selections 34. The
subset of records that match the dimension selections 34 are selected and
summarized to produce the desired summary information.
If the target node of Step 186 is a not a detail index node, it is a
dimension node. Step 190 finds a child summary node of this dimension node
that has the dimension value that matches the dimension selection. If no
summary node exists 192, no records match 198 the dimension selections,
and the appropriate display is generated 196. If the summary node exists
and more dimension selections remain, Steps 184 through 192 are repeated
until all dimension selections have been used. When all dimension
selections have been used, the summary node contains the summary
information for the desired input set. The appropriate display 196 is
generated to present this summary information and other derived
information, such as averages and standard deviations, to the user.
While the invention has been shown and described with reference to the
preferred embodiment thereof, it will be understood by those skilled in
the art that the above and other changes in form and detail may be made
therein without departing from the spirit and scope of the invention.
* * * * *
|
|
|
|
|
Description  |
|