|
Claims  |
|
|
What is claimed is:
1. In a computer system comprising a CPU, input/output means and memory
containing a file system, said file system comprising at least one source
file comprising text, means for generating a database comprising at least
one database component file derived from the source file, said database
providing the identification of text in the source file, said apparatus
comprising:
collector means for generating a database component file derived from the
source file, said database component file comprising a symbol
identification listing comprising symbol names and line numbers in the
source file at which each of the symbols occurs, and a line identification
listing comprising each line number of the source file and a corresponding
hash code and line length computed according to the contents of each line
of text in the source file;
index file generation means for generating an index file for at least one
database component file, said index file comprising a listing of each
symbol and the name of the database component file in which the symbol
occurs;
browser means for searching database component files in response to a query
specifying at least one symbol, said means comprising:
means for reviewing the index file to determine which database component
files the symbol occurs in;
means for reviewing the database component files to determine the line
numbers in the source file(s) at which the symbol occurs, said database
component files reviewed being determined by the means for reviewing the
index file to have at least one occurrence of the symbol;
means for retrieving the line of text in the source file specified by the
means for reviewing the database component files as having an occurrence
of the symbol;
means for generating a hash value and line length according to the contents
of the line of text retrieved;
means for comparing the generated hash value and line length with the hash
value and line length for the line as specified in the database file;
if the generated hash value and line length is equal to the hash value and
line length specified in the database file, means for providing the line
of text from the source file as a result of the query;
if the generated hash value and line length does not equal the hash value
and line length specified in the database file, means for providing a
message that the line code specified in the database is not the same line
of code in the text file;
whereby the response to the query comprise the lines of text identified as
responsive to the query or error messages generated because the source
file and database component file do not match.
2. The apparatus according to claim 1, wherein if the generated hash value
and line length do not equal the hash value and line length specified in
the database component file, said apparatus further comprising:
means for retrieving from the database component file the hash codes and
the line lengths for the lines of text above and below the line specified
as having an occurrence of the symbol;
means for generating line lengths and hash code values for a series of
sequential lines retrieved from the source file;
means for comparing the generated line lengths and hash values with the
line lengths and hash values retrieved from the database file for the same
number of lines comprising the line specified in the database component
file and a predetermined number of lines above and below the line
specified;
if the generated line lengths and hash values are equal to the line lengths
and hash values retrieved from the database component file, means for
providing the line of text from the source file as a result of the query;
and
if the generated line lengths and hash values do not equal the line lengths
and hash values retrieved from the database component file, means for
controlling the means for generating line lengths and hash code values and
means for comparing the line lengths and hash code values to generate the
line lengths and hash code values for other sequential lines of sequential
text until a match is found whereby the generated line lengths and hash
codes and respective line lengths and hash codes retrieved from the
database component file are equal and the lines of text from the source
file are returned to the user as a result of the query, or until line
lengths and hash codes for all sequences of lines of text are generated
and compared to the line lengths and hash code values retrieved from the
database component file whereby the search fails and an error message that
no match has been found is provided.
3. The apparatus according to claim 2 wherein the length of the sequence of
lines of test from the source file is three and the first sequence being
the line above the line specified as having an occurrence of the symbol,
the line specified and the line below the line specified.
4. The apparatus according to claim 1 wherein the means for generating the
hash value generates the hash value as the sum of the byte values in the
line.
5. The apparatus according to claim 1 wherein in response to a query said
browser means reviews index files located in the current working directory
and index files located in the directories identified in a first listing
provided to the browser means.
6. The apparatus according to claim 5, wherein said first listing is
located in a predetermined file accessed by the browser means.
7. The apparatus according to claim 1 wherein said source file comprises
statements which function to include other source files and said collector
means generates database component files for the included source files.
8. The apparatus according to claim 7 wherein said included source files
are located in directories separate from the directory where the source
file is located.
9. The apparatus according to claim 8 wherein the database component files
corresponding to the included source files are located in the same
directory as the database component file corresponding to the source file.
10. The apparatus according to claim 1 wherein database component files may
be exported to designated directories of the file system by providing the
collector means a second listing of source files and respective
directories where the corresponding database component files are to be
located.
11. The apparatus according to claim 10, wherein said second listing is
located in a predetermined file accessed by the collector means.
12. The apparatus according to claim 1, wherein the database component file
for a source file is located in a sub-directory of the directory where the
source file is located and is identified by a predetermined sub-directory
name.
13. The apparatus according to claim 12, wherein said sub-directory
comprising database component files is named,
[current file directory]/.sb
where [current file directory] is the path name for the directory
containing the source file and .sb is the sub-directory name identifier.
14. The apparatus according to claim 1, wherein said hash value is computed
to be the sum of hash values generated from selected key pieces of
information in the database component file.
15. The apparatus according to claim 1, wherein said database component
file comprises:
database component file information section comprising a source type ID,
version of the database component file, line indicator status, case
indicator, and name of language the source file is created in;
source name section comprising the name of the source file;
referenced file section comprising the names of the referenced files;
symbol table section comprising a symbol tables of symbols contained in the
source file and the line number(s) of the source file where each symbol is
located;
semantic table section comprising a record type ID, line number and
semantic tag for each line of text; and
line identification section comprising the line number, line length, hash
value and inactive indicator of each line of text in the source file.
16. In a computer system comprising a CPU, input/output means and memory
containing a file system, said file system comprising at least one source
file comprising test, a process for generating a database comprising at
least one database component file derived from the source file, said
database providing the identification of text in the source file, said
process comprising the steps of:
generating a database component file derived from the source file, said
database component file comprising a symbol identification listing
comprising symbol names and line numbers in the source file at which each
of the symbols occurs, and a line identification listing comprising each
line number of the source file and a corresponding hash code and line
length computed according to the contents of each line of text in the
source file;
generating an index file for at least one database component file, said
index file comprising a listing of each symbol and the name of the
database component file in which the symbol occurs;
searching database component files in response to a query specifying at
least one symbol comprising the steps of;
reviewing the index file to determine which database component files the
symbol occurs in;
reviewing the database component files to determine the line numbers in the
source file(s) at which the symbol occurs, said database component files
reviewed being determined by reviewing the index file for at least one
occurrence of the symbol;
retrieving the line of text in the source file specified as having an
occurrence of the symbol;
generating a hash value and line length according to the contents of the
line of text retrieved;
comparing the generated hash value and line length with the hash value and
line length for the line as specified in the database file;
if the generated hash value and line length is equal to the hash value and
line length specified in the database file, providing the line of text
from the source file as a result of the query;
if the generated hash value and line length does not equal the hash value
and line length specified in the database file, providing a message that
the line code specified in the database is not the same line of code in
the text file;
whereby the response to the query comprise the lines of test identified as
responsive to the query or error messages generated because the source
file and database component file do not match.
17. The apparatus according to claim 16, wherein if the generated hash
value and line length do not equal the hash value and line length
specified in the database component file, said process further comprising
the steps of:
retrieving from the database component file the hash codes and line lengths
for the lines of text above and below the line specified as having an
occurrence of the symbol;
generating line lengths and hash code values for a series of sequential
lines retrieved from the source file;
comparing the generated line lengths and hash values with the line lengths
and hash values retrieved from the database file for the same number of
lines comprising the line specified in the database component file and a
predetermined number of lines above and below the line specified;
if the generated line lengths and hash values are equal to the line lengths
and hash values retrieved from the database component file, providing the
line of text from the source file as a result of the query; and
if the generated line lengths and hash values do not equal the line lengths
and hash values retrieved from the database component file, generating
line lengths and hash code values for other sequential lines of sequential
text until a match is found with the line lengths and hash values
retrieved from the database component file whereby the generated line
lengths and hash codes and respective line lengths and hash codes
retrieved from the database component file are equal and the lines of text
from the source file are returned to the user as a result of the query, or
until line lengths and hash codes for all sequences of lines of text are
generated and compared to the line lengths and hash code values retrieved
from the database component file whereby the search fails and an error
message that no match has been found is provided. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
BACKGROUND OF THE INVENTION
1. Field of the Invention
The method and apparatus of the present invention relates to the
organization of databases. More specifically, the method and apparatus of
the present invention relates to the organization and identification of
database files derived from textual source files which form the database
and the information contained within the database files for optimum
retrieval and storage efficiency of textual files.
2. Related Applications
This application is related to U.S. patent application Ser. No. 07/499,439,
filed Mar. 27, 1990, entitled "Method and Apparatus for the naming of
Database Component Files to Avoid Duplication of Files", U.S. patent
application Ser. No. 07/500,138, filed Mar. 27, 1990, entitled "User
Extensible, Language Sensitive Database System" and U.S. patent
application Ser. No. 07/500,149, filed Mar. 27, 1990, entitled "Locking
Mechanism for the Prevention of Race Conditions" which are herein
incorporated by reference.
3. Art Background
A database is a collection of information which is organized and stored in
a predetermined manner for subsequent search and retrieval. Typically, the
data is organized in such a manner that the data is indexed according to
certain parameters and can be retrieved according to those parameters.
Data contained in databases vary according to the applications. For
example, a database may contain information to index words in a text file
such that words or strings of words in the text file may be retrieved
quickly.
The data contained in the database may be organized in a single file or
multiplicity of files for access and retrieval. Sometimes the potential
for duplications of files occurs because of the nature of the source
information from duplications of files occurs because of the nature of the
source information from which the database is derived. Thus, if the source
information contains duplicate information the database may similarly
contain duplicate information. One application where this occurs is in the
environment of computer program compilers and processes which assist in
the indexing and retrieval of source file information in text form
according to certain compiler information generated during the process of
compilation of the source file.
For example, software developers frequently need to review specific lines
or sections of a source code program in textual format that contains a
certain variable or symbol (hereinafter referred to collectively as
"symbols") in order to determine where in the program the symbol occurs
and how the value of the symbol changes throughout the execution of the
program. One method to provide this capability of search and retrieval is
to form a database which contains an index of all the symbols in the
source program and the corresponding line numbers in the source files
where these symbols appear. However, a source program may be quite large
and span not one but a multiplicity of separate files, whereby the files
are combined during the compilation process by linking or include
statements (such as the "#include" statement in the C programming
language) located in the main program. Thus, those files which are
frequently used will be included in the database multiple times even
though the information contained therein is the same.
There is also a need to insure that the database component files which
comprise the database match the current version of the source files from
which the database component file is derived. The user may inadvertently
modify the textual source files from which the database is derived without
updating the database component file. Thus, the database may provide
incorrect information for the retrieval of text from the source file.
In a multitasking environment, multiple processes or devices may access or
attempt to access files simultaneously. A race condition occurs when one
process or device attempts to read or write information to a file while
another file is attempting to write information to the same file. This
results in corrupt data being written into the file and/or corrupt data
being read out of the file.
SUMMARY OF THE INVENTION
It is therefore an object of the present invention to provide a means for
minimizing the duplication of files within a database.
It is an object of the present invention to provide a means for searching
for database files and providing the means in certain instances for
providing the corresponding portions of the source file when the integrity
between the database file and source file is lost.
It is an object of the present invention to provide a means for checking
the integrity of the database with the current version of the source file.
It is an object of the present invention to provide a means for preventing
errors which arise due to race conditions which occur in a multitasking
system.
In the method and apparatus of the present invention a database component
file to be added to the database is given a unique name that is dependent
upon the contents of the file such that, when the contents of the source
file changes, the name of the corresponding database component file to be
added to the database also changes. Conversely, if two database component
files have identical information contained therein, the same file name
will be generated and the duplication of information in the database is
prevented by providing a simple test that checks for the existence of the
name of the database component file before the generation and addition of
the file to the database. If the file name exists in the database, the
information is already contained in the database and the file is not
generated and added to the database information.
Preferably the name of the file is generated by computing a hash value from
the sum of the contents of the file and concatenating the hash value to
the name of the file. Because the source file name is used in conjunction
with the hash value to construct the database component file name, the
hash value does not have to be unique for all files but only for those
source files having the same name. Therefore, the likelihood of conflicts
is minimal. In addition, through the selection of heuristic methods for
computing the hash value, a high degree of confidence can be maintained
that the file names are unique. Furthermore, because the database
component file names are unique for each source file, the process of
searching for the correct file is simplified and there is no need to
specify the locations of database component files, e.g. the directory
where the database component file is located, because the file name is
unique for a particular file contents and a query or search program can
safely assume that any file with the same name was generated from the same
source file.
Each database component file contains information regarding the text
contained in one source file which enables the user to quickly determine
the frequency of occurrence of the specified text and the location of the
specified text in the source file. For each textual word (referred to
herein as a "symbol"), an entry in the database component file is provided
containing symbol information. The symbol information comprises the symbol
name, symbol type and line number in the source file where the symbol is
located. Line identification information is also provided which contains
the line numbers of the source file, the length of the line, (i.e., the
number of characters in the line) and corresponding hash values which are
computed from the contents of the line of text in the source life. Before
a line of text identified in a query is displayed, the line identification
information provides the means to verify that the line of text identified
in the symbol information is the same line of text as the one now
contained in the source file. The has value and line length corresponding
to the line of text (referenced in the database) is compared to a hash
value and line length computed for the text retrieved from the current
source file. If the computed hash value and line length does not match the
hash value and line length contained in the line identification
information, the text does not match the database reference because the
source file has been changed subsequent to the generation of the database.
A locking mechanism is also provided which prevents the errors which arise
when race conditions occur in a multi-tasking system by using temporary
file names and file directories in conjunction with atomic commands.
BRIEF DESCRIPTION OF THE DRAWINGS
The objects, features and advantages of the present invention will be
apparent from the following detailed description in which:
FIG. 1 is a block diagram of an exemplary computer employed in the present
invention.
FIG. 2 is illustrative of database component files generated according to
the present invention.
FIGS. 3a, 3b and 3c illustrate a source file, the database component file
generated therefrom according to a preferred embodiment of the present
invention and the contents of the database file.
FIGS. 4a, 4b and 4c illustrate the structure of a preferred embodiment of
the present invention.
FIGS. 5a and 5b and 5c are flow charts of the process of the preferred
embodiment of the present invention.
FIG. 6 illustrates a symbolic link which connects a common database
component file to one or more directories in the system.
FIG. 7 illustrates the implementation of the split function to increase the
speed of performing queries on large database component files.
FIG. 8 is illustrative of a user interface to the system of the present
invention.
FIG. 9 illustrates race conditions which can occur in a multitasking
environment.
FIG. 10 is a flow chart illustrating the process for creating a database
component file.
FIG. 11 is a flowchart illustrating the process for issuing a query and
building an index file according to a preferred embodiment of the present
invention.
DETAILED DESCRIPTION OF THE INVENTION
Notation and Nomenclature
The detailed descriptions which follow are presented largely in terms of
algorithms and symbolic representations of operations on data bits within
a computer memory. These algorithmic descriptions and representations are
the means used by those skilled in the data processing arts to most
effectively convey the substance of their work to others skilled in the
art.
An algorithm is here, and generally, conceived to be a self-consistent
sequence of steps leading to a desired result. These steps are those
requiring physical manipulations of physical quantities. Usually, though
not necessarily, these quantities take the form of electrical or magnetic
signals capable of being stored, transferred, combined, compared, and
otherwise manipulated. It proves convenient at times, principally for
reasons of common usage, to refer to these signals as bits, values,
elements, symbols, characters, terms, numbers, or the like. It should be
borne in mind, however, that all of these and similar terms are to be
associated with the appropriate physical quantities and are merely
convenient labels applied too these quantities.
Further, the manipulations performed are often referred to in terms, such
as adding or comparing, which are commonly associated with mental
operations performed by a human operator. No such capability of a human
operator is necessary, or desirable in most cases, in any of the
operations described herein which form part of the present invention; the
operations are machine operations. Useful machines for performing the
operations of the present invention include general purpose digital
computers or other similar devices. In all cases there should be borne in
mind the distinction between the method of operations in operating a
computer and the method of computation itself. The present invention
relates to method steps for operating a computer in processing electrical
or other (e.g., mechanical, chemical) physical signals to generate other
desired physical signals.
The present invention also relates to apparatus for performing these
operations. This apparatus may be specially constructed for the required
purposes or it may comprise a general purpose computer as selectively
activated or reconfigured by a computer program stored in the computer.
The algorithms presented herein are not inherently related to a particular
computer or other apparatus. In particular, various general purpose
machines may be used with programs written in accordance with the
teachings herein, or it may prove more convenient to construct more
specialized apparatus to perform the required method steps. The required
structure for a variety of these machines will appear from the description
given below.
General System Configuration
FIG. 1 shows a typical computer-based system for databases according to the
present invention. Shown there is a computer 101 which comprises three
major components. The first of these is the input/output (I/O) circuit 102
which is used to communicate information in appropriately structured form
to and from the other parts of the computer 101. Also shown as a part of
computer 101 is the central processing unit (CPU) 103 and memory 104.
These latter two elements are those typically found in most general
purpose computers and almost all special purpose computers. In fact, the
several elements contained within computer 101 are intended to be
representative of this broad category of data processors. Particular
examples of suitable data processors to fill the role of computer 101
include machines manufactured by Sun Microsystems, Inc., Mountain View,
Calif. Other computers having like capabilities may of course be adapted
in a straightforward manner to perform the functions described below.
Also shown in FIG. 1 is an input device 105, shown in typical embodiment as
a keyboard. It should be understood, however, that the input device may
actually be a card reader, magnetic or paper tape reader, or other
well-known input device (including, of course, another computer). A mass
memory device 106 is coupled to the I/O circuit 102 and provides
additional storage capability for the computer 101. The mass memory may
include other programs and the like and may take the form of a magnetic or
paper tape reader or other well known device. It will be appreciated that
the data retained within mass memory 106, may, in appropriate cases, be
incorporated in standard fashion into computer 101 as part of memory 104.
In addition, a display monitor 107 is illustrated which is used to display
messages or other communications to the user. Such a display monitor may
take the form of any of several well-known varieties of CRT displays. A
cursor control 108 is used to select command modes and edit the input
data, such as, for example, the parameter to query the database, and
provides a more convenient means to input information into the system.
Process Description
The following description of a preferred embodiment of the present
invention will describe the source files as source code files of computer
programs. The means for generating the database files, referred to as the
"collector", is described as a part of a compiler which compiles the
source code into object code files. It will be evident to one skilled in
the art that the present invention may be applied to all types of text
files and is not limited to computer program source files. Furthermore,
the collector function may be combined with elements that perform other
functions, such as the compiler herein described, or the collector may
operate as an independent means.
Referring to FIG. 2, the database employed to illustrate the present
invention is shown. The database comprises at least one database component
file (referred to in FIG. 2 as having the suffix ".bd" which represents
the term "browser data") and an index file which is used to locate
information in the database component files. Each database component file
contains the symbol information and line identification information to
provide the capability of browsing or searching one source file in
response to a query. The symbols in the source text file may comprise
every word in the text file or select text which are identified according
to the symbol type. The symbols may be categorized and identified
according to the type of source file by employing an interface which
specifies the identification of the symbols, such as that described in
co-pending patent application U.S. Ser. No. 07/500,138, filed Mar. 27,
1990, entitled "User Extensible, Language Sensitive Database System".
A database component file is created for each source file and is stored in
the current working file directory. This is illustrated in FIG. 2.
Sub-directory Source 1 contains source files a.c. and b.c. A sub-directory
.sb is created which contains database files a.c.*.bd and b.c.*.bd and
index file Index1. Sub-directory .sb which is a sub-directory of directory
Source2, contains database files e.c.*.bd and f.c.*.bd and index file
Index2 which corresponds to source files e.c and f.c. contained in
directory Source 2. As explained in detail below, the "*" in the database
file name represents a hash value which is incorporated into the file name
to provide a unique file name to correspond to the contents of the source
file.
This is further illustrated by the example of FIGS. 3a and 3b. FIG. 3a
shows a text file which is a simple computer program written in the C
language comprising a "printf" statement and an "include" statement which
incorporates the file "stdio.h" into the program. The database generation
means, referred to as the "collector" and in the present example a part of
the compiler which compiles the computer program, generates the database
files shown in FIG. 3b. Shown in FIG. 3b are the database component files
foo.2rBQsT.bd, which is the database component file representative of the
linked executable file foo.c.luoYuw.bd, which is the database component
representative of the source file "foo.c", and the database component file
stdio.h.OyPdOs.bd for the source file "stdio.h", which was incorporated
into the program foo.c through the include statement.
Each database comopnent file name includes a hash value which, when
combined with the file name of the source file results in a unique file
name. The has value is computed as a function of the contents of the
source file wherein if the contents of the source file changes, the hash
code changes. For example the string "2rBQsT" in the database file name
foo.2rBQsT.bd, the string "luoYuw" in file database name foo.c.luoYuw.bd
and the string "OyPdOs" database file name stdio.h.OyPdOs.bd are the hash
values generated and incorporated into the database file names.
The database component file symbol reference comprises symbol
identification information and line identification information. The symbol
information consists of a symbol triple containing the symbol name, line
number in the source file where the symbol is located, and the symbol
type. The line identification information comprises a list of triples,
each triple identifying relative line numbers within the source file,
length of the line and hash value of the line. The hash value is computed
from the contents of the line of text (e.g. the sum of the bytes in the
line); thus, if the contents of the line are modified or the line is moved
because of the insertion or deletion of text, the hash value will
correspondingly change.
An illustration of the contents of a database component file for the
program of FIG. 3a is illustrated in FIG. 3c. The "symbol table section"
400 contains the name of the symbols and the location of the symbol in the
"semantic table section" 410. The semantic table section 410 contains a
triple for each use of each symbol, identifying the symbol name, the line
number in the source file where the symbol is located and the type of the
symbol. The line identification section 420, contains the line number,
length and hash value triples which correspond to the lines of text in the
source file.
The index file provides the means for querying or searching the database
component files for the occurrence of symbols which are the subject of the
query. In the present example, the index file contains a list of all
symbols used and the names of the database component files each of the
symbols is contained in.
A source comprises one or more text files. These text files may or may not,
depending upon the application, be related to one another. For example, a
source may consist of text representative of a document such as a book or
a magazine article. Separate text files may be created for the different
sections of the document, such as the title, introduction, abstract,
bibliography, as well as the body of the document. If the source is a
computer program, the source may comprise a single file containing all the
codes for subsequent compilation and execution of the program or the
source may be spread among a plurality of files wherein one file contains
the code of the main program and other files contain the source code for
sub-programs which are referenced in the main program in the form of
sub-routine calls or include statements, such as the "# include" statement
utilized in the C-programming language.
As each file containing code is compiled, the information to be
incorporated into the database component file (".bd file") is generated.
Prior to generating the database component file, a unique name is
generated for the database component file to be created. The name of the
database component file is derived from the name of the text file and a
hash value. The hash value is computed as a function off the contents of
the file such that if the contents of the text file changes, the has code
changes thereby distinguishing between the database component files for
different versions of the same text file.
In some instances, the same text files may frequently be incorporated into
a multiplicity of different sources. For example, with respect to computer
program source, the same text files containing declarations which
reference sub-programs may be incorporated into the text source files
containing the code of the maim program. To eliminate the duplication of
the same database component file in such instances, prior to generating a
database component file, the name of the database component file is
generated and compared to a list of currently existing database component
files. If the name of the database component file exists, the database
file is not regenerated and duplicated, because the existing database file
can be used for the source file. By eliminating duplicate database files,
processor overhead and memory used to store the database component file is
saved.
The hash value may be generated by any one of many ways which derive the
hash values from the contents of the database component file. For example,
the hash value used to form the database component file name can be
computed according to the sum of all the bytes contained in the file.
Preferably, the has value is a sum of various key pieces of information to
be contained in the database component file. For example, if the
information to be contained in the database component file is the
information shown in FIG. 3c, the hash value would be generated as
follows: a separate hash value is computed for each of the sections in the
file and the hash value incorporated into the file name is the sum of the
hash values for each of the sections in the file.
To generate the has values for each of the sections in the file, certain
information is selected from the section and summed together. For example,
the magic number (the first 2 or 4 bytes in a UNIX.RTM. file), source type
ID, major and minor version numbers of the file (e.g. version 2,1), line
indicator, case indicator (the case indicator is set if the case of
characters is not significant) and each character in the language name
string are summed to compute a hash value for the section. The hash value
for the source name section is generated from the ASCII value of each
character from the file name and the relative field , if the relative
field is set to a value of one (the relative field indicates whether the
file was identified by a relative path or absolute path). The hash value
for the referenced section is generated from the sum of each hash value
for each referenced file and the ASCII value of each character from the
name of each referenced file. The hash value for the symbol table section
is the sum of the ASCII value of each character from each string in the
symbol table section. The record type ID, line number and semantic tag for
each record in the semantic table section are summed together to generate
the hash value. In addition, the line length and hash value (determined
according to the sum of the bytes for the line) for each line in the line
ID section are summed and a value of one is added for each line that has
its inactive indicator flag set (the inactive indicator is used for
debugging tools) to generate the hash value for the line ID section of the
database component file.
This, the file name incorporating the hash value would be: "[source code
file name][has value].bd.". It is preferred for easier identification that
the suffix ".bd" is used to consistently identify those files which are
database component files.
To save memory space, simplify the file name generation process and to
simplify the query or browse process, it is preferred that the name of the
directory in which the database component file resides is not incorporated
into the file name. This is possible because each database component file
name is unique and relates to specific text file of the source. Therefore,
the query or search program simply searches file directories in the file
system until the unique database component file name which corresponds to
the text file name is found. To minimize the number of file directories to
search for database component files, it is preferred that a means is
provided which contains a listing of all directories in which database
component files are located. The query program will then search for
database component files only in those directories listed. Preferably, by
default, the query program will search only the current working directory.
Expansion of a search beyond the working directory is then indicated by a
specific file recognized by the browser to provide a list of the
directories of the file system to search.
Once the database component file(s) is created, an index file is generated
to provide an index into the database component file. The index file
contains the symbols (e.g. text) by which a query may be performed and a
list of the database component files in which symbol is found.
To query (query may be also referred to as search or browse) a database for
a symbol, the index file is reviewed to determine the database component
files of the database, and thus the corresponding text files of the
source, the symbol is located in. The database component files containing
the symbol are reviewed to retrieve the symbol information contained
therein which indicates the location of the symbol in the source text
files. The results of the query are then returned to the user. The
information returned may be in a variety of formats including in the form
of a listing of source text files and line numbers where the symbol is
located, the lines of text from the file in which the symbol is located or
a plurality of lines from the text file surrounding and including the line
in which the symbol is located.
Continuing with the present example from FIG. 1, if a specific symbol is
the subject of a query and is searched for in the Source1 sub-directory,
the index, Index1, will be reviewed to determine which database component
files the symbol is contained in. If the index file states that the symbol
is found in a.c.*.bd, the database component file is reviewed to retrieve
the symbol information containing the symbol name, line number and symbol
type as well as the line length and hash value. The text source file
corresponding to the database component file, that is a.c, is reviewed and
the line of text at the line number designated is retrieved for the user.
If the database component file and index file are generated and the source
file is subsequently modified, search errors will occur unless the
database component files and index file are also subsequently updated. To
alleviate the effect of an inconsistent database, line identification
information is included in the database component files. The line
identification information contains the line number, line length and a
hash value generated according to the contents of the line. Prior to the
retrieval of lines of text from the source text file, a hash value is
computed according to the text at the referenced line number and the line
length and computed hash value are respectively compared to the line
length and hash value stored in the database component file. If the values
are equal, the proper line of text has been found and is provided to the
user as a result of the query. If one or both values do not match, then
the source file has been changed subsequent to the generation of the
database file. An error message may then be generated telling the user
that the database file needs to be updated. Alternatively, if the line of
text has been moved to a different line in the source text file, it may
still be found by comparing the line length and hash value stored in the
database file to line lengths and generated hash values for other lines
from the source text file to find a match. Preferably, to provide a more
accurate match, the line lengths and generated hash values for the lines
of text above and below the line of text to be retrieved are compared to
the line lengths and hash values, representative of three sequential lines
of text, stored in the database component file. Thus, if the line lengths
and hash codes of the sequence of three lines of text match a sequence of
line | | |