|
|
|
| United States Patent | 5583762 |
| Link to this page | http://www.wikipatents.com/5583762.html |
| Inventor(s) | Shafer; Keith E. (Columbus, OH) |
| Abstract | A method for generating a grammar for a collection of sample documents or
records which are marked up under the SGML language. Once completed, the
grammar is reduced to provide a document type definition (DTD). In
constructing the initial or large corpus grammar, SGML tags are extracted
and missing tags are accounted for. The tags then are matched to develop a
tag structure from which a corpus grammar is built. Utilizing a sequence
of reduction procedures, the corpus grammar is reduced. |
|
|
|
Title Information  |
|
|
|
|
|
Drawing from US Patent 5583762 |
|
|
Generation and reduction of an SGML defined grammer |
|
|
|
|
|
| Publication Date |
December 10, 1996 |
|
|
|
|
|
| Filing Date |
August 22, 1994 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Title Information  |
|
|
References  |
|
|
| *references marked with an asterisk below are user-added references |
|
U.S. References |
|
|
|
|
|
|
U.S. References |
|
|
Foreign References |
|
|
|
|
|
|
Foreign References |
|
|
Other References |
|
|
|
|
|
|
Other References |
|
|
|
|
|
References  |
|
|
|
|
|
| Market Size |
|
Estimate the gross annual revenues of the relevant market
sector:
|
| | |
| |
|
|
| Market Share |
|
Estimate the percentage of the relevant market sector this invention will capture:
|
| | |
| |
|
|
| Reasonable Royalty |
|
What percentage of gross sales should the inventor or assignee be paid?
|
| | |
| |
|
|
|
Public's "Guesstimation" of Royalty Value
|
| Market Size | N/A | [No votes] | | x | Market Share | N/A | [No votes] | | x | Reasonable Royalty | N/A | [No votes] |
| | N/A | |
| |
|
|
|
|
|
|
|
|
|
|
|
|
Market Review  |
|
|
Technical Review  |
|
|
Claims  |
|
|
I claim:
1. The method for generating a corpus grammar for a collection of document
records with grammatical structure components identified by start and end
tags, comprising the steps of:
extracting said start and end tags from samples of said records and forming
a tag list therefrom;
matching each start tag of said list with a corresponding end tag with
respect to each of said records to derive matched tags, said matched tags
representing a tag structure of grammar elements paired in a parent-child
defined relationship; and
accumulating said grammar elements represented by said matched tags of said
tag structures as corresponding rules substantially exhibiting
hierarchical tree structures to establish a corpus grammar.
2. The method of claim 1 in which said step for extracting said start and
end tags includes the steps of:
identifying an initial start character of a said tag;
identifying a second character next occurring after said start character,
determining whether said second character is an end tag defining character,
identifying a next occurring end character,
adding a said start tag to said tag list when said second character is not
an end tag character, and
adding a said end tag to said tag list when said second character is an end
tag defining character.
3. The method of claim 2 including the steps of:
identifying a next start character occurring subsequent to said initial
start character,
determining if said next start character occurs leftwardly of said end
character; and
identifying the presence of a stray start character when said next start
character occurs leftwardly of said end character.
4. The method of claim 3 including the step of adding a text identifying
tag to said tag list when said next start character does not occur
leftwardly of said end character.
5. The method of claim 2 including the steps of:
determining whether said second character is an alphabetical character;
then carrying out said step determining whether said second character is an
end tag determining character, and
identifying a next occurring start character as an initial said start
character when said second character is determined not to be a said
alphabetical character and not to be a said end tag defining character.
6. The method of claim 1 in which said step for deriving matched tags
further includes the steps of:
acquiring a given tag from a sequence of tags of said tag list; determining
whether said given tag is a start tag; locating said given tag at the top
of a first-in-first-out stack when it is determined to be a start tag;
acquiring a next tag which is next in sequence to said given tag;
determining whether said next tag is an end tag or a start tag;
determining when said next tag is an end tag, whether said next tag is a
match with said given tag at a said stack in the relationship of end tag
to start tag; and
removing said given tag from said stack in the presence of a said match and
identifying its relation to said next tag.
7. The method of claim 6 including the steps of:
providing a hierarchical root list; and
adding said next tag to said root list when said next tag is determined to
be a start tag and no said given tag is located at the top of said stack.
8. The method of claim 6 including the step of identifying said next tag as
a hierarchical child at said given tag when said given tag is located at
the top of said stack and said next tag is determined to be a start tag.
9. The method of claim 8 including the step of locating said next tag which
is identified as a hierarchical child at the top of said stack.
10. The method of claim 6 including the steps of:
creating an empty end tag when said next tag is determined to be an end tag
which is not a match with said given tag at said stack; and
removing said given tag from said stack and identifying a start tag-end tag
relation between it and said empty end tag.
11. The method of claim 6 including the steps of:
creating an empty end tag when a said next tag is not available for said
step for acquiring a next tag; and
removing said given tag from said stack and identifying a start tag-end tag
relation between it and said empty end tag.
12. The method of claim 6 including the steps of:
creating an end tag which matches said given tag located at the top of said
stack when said next tag is determined to be an end tag which is not a
match with said given tag and a text identifying tag occurs next following
said given tag in said tag list; and
removing said given tag from said stack and identifying a start tag-end tag
between it and said created end tag.
13. The method of claim 6 including the steps of:
providing a hierarchical root list;
adding said next tag to said root list when said next tag is determined to
be a start tag and no said given tag is at the top of said stack;
identifying said next tag as a hierarchical child of said given tag when
said given tag is located at the top of said stack and said next tag is
determined to be a start tag; and
said step of accumulating said grammar elements includes the steps of:
designating the grammar element of a said next tag with in said root list
as a grammar root;
identifying the grammar element of the next occurring start tag of said tag
structure;
identifying the next said hierarchical child of said start tag having said
identified grammar element; and
adding the grammar element of said identified next hierarchical child to an
AND rule for said identified grammar element.
14. The method of claim 13 including the steps of:
creating a master root grammar element when there exists more than one
entry within said root list; and
adding each said grammar element of each said next tag within said root
list to an AND rule for said master root grammar element.
15. The method for reducing the number of rules of a sample grammar present
as an electronic compilation of grammar elements associated as AND and
element rules within a hierarchical document describing tree smacture,
comprising the steps of:
(a) acquiring a said grammar element with a rule, R, of a given type of
said tree structure from said sample grammar,
(b) applying a reduction guide selected recursive reduction procedure to ..
said rule R and subrules thereof of said acquired grammar element;
(c) removing any of said subrules when said applied reduction procedure has
eliminated them from said rule R of said acquired grammar element;
(d) acquiring a next said grammar element and reiterating the aforesaid
steps a-c.
16. The method of claim 15 in which said step for applying a recursive
reduction procedure includes the step of applying a single and empty
reduction effecting the removal of element rules combined as subrules with
said acquired grammar element rule, R.
17. The method of claim 15 in which said step for applying a recursive
reduction procedure includes the step of applying a collapse ANDs
reduction procedure wherein:
a determination is made as to whether the rule R of said acquired grammar
element and the subrules thereof are AND rules and that there are no
repetitions of said AND subrules; and
each subrule of said AND subrule is made to be said AND subrule.
18. The method of claim 17 in which said collapse ANDs reduction procedure
is carried out subsequent to the execution of a single and empty reduction
procedure effecting the removal of element rules combined as subrules with
said acquired grammar element rule, R.
19. The method of claim 15 in which said step for applying a recursive
reduction procedure includes the step of applying a collapse OR's
reduction procedure wherein:
a determination is made as to whether the rule R of said acquired grammar
element and the subrule NS thereof is an OR rule and whether the
repetition characteristic of said OR rule R subsumes the repetition
characteristics of said OR subrule NS; and
in the event that the said rule R and subrules NS are OR rules, and said
subsumption is available, then each subrule of said OR subrule NS is made
said OR subrule.
20. The method of claim 19 in which said collapse OR's reduction procedure
is carried out subsequent to the execution of a collapse AND's reduction
procedure.
21. The method of claim 15 in which said step for applying a recursive
reduction procedure includes the step of applying a repeating reduction
procedure wherein:
a determination is made as to whether the rule, R, of said acquired grammar
element is an AND rule or an OR rule;
when said rule, R, of said acquired grammar element is an AND rule, then
subrule B, the first subrule of R, is acquired with its repetition symbol,
and the subrule NS, the next rule of rule R is acquired with its
repetition symbol;
a determination is made whether the subrule B is the same as the subrule NS
excepting the said repetition symbol of each;
in the event subrule B is the same as subrule NS, said repetition symbol of
subrule B is combined as a peer with said repetition symbol of NS to
derive a resulting repetition symbol which is made the repetition symbol
of subrule B; and
subrule NS is removed.
22. The method of claim 21 wherein the steps set forth therein are iterated
for all subrules of rule R.
23. The method of claim 21 in which:
said rule, R of said acquired grammar element is determined to be an OR
rule;
then subrule C, a subrule of rule R is acquired with its repetition symbol,
and another subrule of rule R, subrule N is acquired with its repetition
symbol;
a determination is made whether the subrule C is the same as the subrule N
excepting the said repetition symbol of each;
in the event subrule C is the same as subrule N, said repetition symbol of
subrule C and said repetition symbol of subrule N are combined as
subordinates to derive a resulting repetition symbol which is made the
repetition symbol of subrule C; and
subrule N is removed.
24. The method of claim 21 in which said step of applying a repeating
reduction procedure is carried out subsequent to the execution of a
collapse OR's reduction procedure, a collapse AND's reduction procedure,
and a single and empty reduction procedure.
25. The method of claim 15 in which said step for applying a recursive
reduction procedure includes the step of applying a redundant reduction
procedure wherein:
a determination is made with respect to whether the rule R of said acquired
gammar element is an OR rule;
where rule R is determined to be an OR rule, then each subrule B thereof,
designated as a base subrule, is compared with each other subrule C
thereof designated as a compare subrule to determine if the subrule B
subsumes the subrule C; and
where subrule B subsumes subrule C, then subrule C is removed from the
subrules of rule R.
26. The method of claim 25 in which said redundant reduction procedure is
the last reduction carried out as part of said step for applying a
reduction procedure.
27. The method of claim 19 or 25 in which said step for applying a
reduction procedure includes the step of applying a PCDATA to OR's
reduction procedure wherein:
a determination is made as to whether the rule R of said acquired grammar
element and a subrule NS thereof at any depth within said tree structure
contain a text identifier tag;
all said subrules NS at any depth within said tree structure which are AND
subrules are changed to OR subrules and any repetition symbol of any said
subrule NS at any depth within said tree structure is changed to one when
said rule R or any said subrule NS contains a said text identifier tag;
then, said step of applying a said collapse OR's reduction is carried out
with respect to said rule R;
then, a repetition symbol of one or more is applied to said rule R; and
then, said step of applying a redundant reduction procedure is carried out
with respect to said rule R.
28. The method of claim 27 including the steps of:
determining whether said rule R is an element rule subsequent to said step
of applying a redundant reduction procedure; and
setting the repetition symbol of rule R to one where rule R is determined
to be an element rule.
29. The method of claim 27 in which said step of applying a PCDATA to OR's
reduction procedure is carried out subsequent to the execution of a
repeating reduction procedure, a collapse OR's reduction procedure, a
collapse AND's reduction procedure, and a single and empty reduction
procedure.
30. The method of claim 15 in which said step for applying a recursive
reduction procedure includes the step of applying an identical bases
reduction procedure which comprises the steps of:
determining whether the rule R of said acquired grammar element is an OR
rule;
when said rule R is determined to be an OR rule, then comparing each
subrule B thereof, designated as a base subrule, with each other subrule
NS thereof to determine if the subrule B has the same subrule bases as the
subrule NS;
where subrule B has the same subrule base as the subrule NS, then combining
the repetition symbols of subrule B and subrule NS as subordinates,
combining the base repetition symbols of subrule B and NS per base as
subordinates, placing all repetition symbols resulting from such
combinations in subrule B, and removing subrule NS.
31. The method of claim 15 in which said step for applying a recursive
reduction procedure includes the step of applying an off-by-one reduction
procedure which comprises the steps of:
determining whether the rule R of said acquired grammar element is an OR
rule;
providing an add rules list;
providing a remove rules list;
when said rule R is determined to be an OR rule, then acquiring each
subrule B thereof which is an AND rule and acquiring each other subrule C
thereof,
comparing each acquired subrule B with each acquired subrule C and
determining from said comparison whether they differ by one subrule M of
subrule B;
identifying said differing one base component when said acquired subrule B
and said acquired subrule C differ by one base component;
generating a new rule formed of acquired subrule B with said differing one
subrule M having a zero-based repetition symbol;
adding said new rule to said add rules list;
adding said acquired said rule B and said acquired subrule C to said remove
rules list;
when all said subrules B have been compared with all said subrules C,
removing said subrules of said remove rules list and adding said rules of
said add rules list as subrules of said rule R;
then emptying said remove rules list and said add rules list; and
repeating the above steps until said step for adding said new rule to said
add rules list does not occur.
32. The method of claim 31 in which said step of comparing each acquired
subrule B with each acquired subrule C includes the steps of:
determining whether subrule M of subrule B subsumes said subrule C when
said subrule C is not an AND rule;
when a said subrule M does not subsume said subrule C, then determining
whether said subrule M has a zero-based repetition symbol;
when said subrule M does not have a zero-based repetition symbol, then
identifying said subrule M;
then, carrying out said step for generating a new rule.
33. The method of claim 31 in which said step of comparing each acquired
subrule B with each acquired subrule C includes the steps of:
determining whether subrule M of subrule B subsumes subrule H of subrule C;
when said subrule M does subsume said subrule H, then the repetition symbol
of subrule M is saved as SM, said symbol of subrule M is set to the
subordinate combination repetition symbol of the repetition symbols of
subrule B and subrule M, the repetition symbols of subrule B and subrule
M, the repetition symbols of subrule H are saved as SH, and said symbols
of subrule H are set to the subordinate combination repetition symbol of
the repetition symbols of subrules C and H;
then, determining whether subrule M subsumes subrule H, subsequently
restoring the respective saved repetition symbols SM and SH; and
then carrying out said step of generating a new rule.
34. The method of claims 16, 17, 19, or 25 including the steps of:
determining whether rule R has only one subrule NS;
when rule R is determined to have only one subrule NS, then combining the
repetition symbols of rule R and subrule NS as subordinates, placing the
resulting repetition symbol on rule R, and setting the type of rule R to
the type of subrule NS; and
when the type of rule R has been set to the type of subrule NS, removing
subrule NS as a subrule of R and moving all subrules of subrule NS to rule
R.
35. The method of claim 25 or 32 or 33 in which a determination of
subsumption of two rules, said base subrule B and said compare rule C,
includes the steps of:
determining whether said base rule B is an element rule;
carrying out a first subsumption analysis with respect to compare rule C
when said base rule B is an element rule;
determining whether the base rule B is an AND rule;
carrying out a second subsumption analysis when said base rule B is an AND
rule and said compare rule C is an element rule;
carrying out a third subsumption analysis when said base rule B is an AND
and rule C and said compare rule is an AND rule;
carrying out a fourth subsumption analysis when said base rule B is an AND
rule and said compare rule C is an OR rule;
determining whether base rule B is an OR rule;
carrying out a fifth subsumption analysis when said base rule B is an OR
rule and said compare rule C is an element rule;
carrying out a fifth subsumption analysis when said base rule B is an OR
rule and said compare rule C is an AND rule; and
carrying out said sixth subsumption analysis when said base rule B is an OR
rule and said compare mile C is an OR rule.
36. The method of claim 35 in which said first subsumption analysis
includes the steps of:
determining whether compare rule C is an element rule;
when compare rule C is an element rule, determining whether base rule B and
compare rule C exhibit a common element; and
when compare rule C and base rule B exhibit a common element, and when said
repetition symbols of base rule B subsumes said repetition symbol of
compare rule C, then indicating the presence of said subsumption.
37. The method of claim 35 in which said second subsumption analysis
includes the steps of:
determining whether each subrule NS of base rule B is an element rule and
whether a said subrule NS and said compare rule C exhibit a common
element;
incrementing a comparison match count value NUM when a said subrule NS and
said compare rule C exhibit a common element;
indicating the presence of a valid subsuming comparison CMP when the said
repetition symbol of a said subrule NS is zero based and said comparison
match value NUM is not incremented;
incrementing a zero-based count value ZEROS when the said repetition symbol
of a said subrule NS is zero-based and said comparison match count value
NUM has been incremented;
incrementing a more-based count value, MORES when the said repetition
symbol of a said subrule NS is more-based and said comparison match count
value NUM has been incremented; and
then redetermining the presence of a valid comparison CMP when the said
comparison match value NUM less the count value, ZEROS is less than or
equal to one.
38. The method of claim 37 including the steps of:
confirming the redetermined presence of said valid comparison CMP when the
repetition symbol of compare rule C is zero or one and the match count
value NUM equals the zero-based count value, ZEROS or when the repetition
symbol of base rule B is zero-based;
confirming the redetermined presence of said valid comparison CMP when the
repetition symbol of compare rule C is one or more and the more-based
count value MORES is at least one, or when the repetition symbol of base
rule B is more-based; and
confirming the redetermined presence of said valid comparison CMP when the
repetition symbol of compare rule C is zero or more and the said
comparison match count value NUM equals the zero-based count value ZERO or
the repetition symbol of the base rule B is zero based, and the said
more-based count value MORES is at least one or the repetition symbol of
base rule B is more-based.
39. The method of claim 35 in which said third subsumption analysis
includes the steps of:
initially indicating the presence of a subsuming comparison CMP when the
number of subrules of said base rule B is greater than the number of
subrules of said compare rule C;
setting an indexible pointer J to sequentially identify the subrules M of
base rule B;
setting an indexible pointer K to sequentially identify the subrules H of
compare rule C;
carrying out a test upon each incrementation of pointers J and K for the
said presence of said comparison CMP and the condition that the value of
pointer J is less than or equal to the number of subrules M of base rule
B, and that the value of pointer K is less than or equal to the number of
subrules H of compare rule C;
when said test fails for the reason that the value of pointer K is not less
than or equal to the number of subrules H of compare rule C, then
indicating the presence of a valid comparison CMP when all remaining
subrules M of base rule B are zero-based;
when said test is true, carrying out a said first, second, third, fourth,
fifth, or sixth subsumption analysis for said subrule M as a base rule and
said subrule H as a compare rule to determine the presence of a subrule
comparison CMP;
in the presence of a said subrule comparison CMP, saving the repetition
symbols SM of subrule M and setting the repetition symbol of subrule M as
a subordinate combination of the repetition symbols of base rule B and
subrule M;
in the presence of said subrule comparison CMP, saving the repetition
symbol SH of subrule H and setting the repetition symbol of subrule H as a
subordinate combination of the repetition symbols of compare rule C and
subrule H;
then, carrying out a said first, second, third, fourth, fifth, or sixth
subsumption analysis for said subrule M as a base rule and said subrule H
as a compare rule;
then, in the presence of a subsuming comparison CMP of said subrule M and
said subrule H, setting the repetition symbol of subrule M to said
repetition symbol SM and the repetition symbol of subrule H to said
repetition symbol SH; and
indicating the presence of a subsuming comparison CMP when the repetition
symbol of subrule M is zero-based, in the presence of a previous failure
of the comparison CMP.
40. The method of claim 35 in which said fourth subsumption analysis
includes the steps of:
determining the presence of the condition wherein any repetition symbol of
said base rule B subsumes the repetition symbol of the compare rule C; and
in the presence of said condition, carrying out a said first, second,
third, fourth, fifth, or sixth subsumption analysis for said base rule B
and each subrule of said compare rule C.
41. The method of claim 35 in which said fifth subsumption analysis
includes the steps of:
saving the repetition symbol of each subrule M of said base rule B as SM,
and setting the repetition symbol of each subrule M to the subordinate
combination of the repetition symbol of base rule B and said subrule M;
carrying out a said first, second, third, fourth, fifth, or sixth
subsumption analysis for a said subrule M as a base rule and said compare
rule C; and
identifying an initial subsuming comparison CMP from said subsumption
analysis as a valid comparison, CMP and setting said repetition symbol SM
as the repetition symbol of corresponding subrule M.
42. The method of claim 35 in which said sixth subsumption analysis
includes the steps of:
saving the repetition symbol of each subrule H of said comparison rule C as
SH, and setting the repetition symbol of each said subrule H to the
subordinate combination of the repetition symbol of compare rule C and
subrule H;
saving the repetition symbol of each subrule M of said base rule B as SM,
and setting the repetition symbol of each said subrule M to a subordinate
combination of the repetition symbol of base rule B and said subrule M;
carrying out a said first, second, third, fourth, fifth or sixth
subsumption analysis for each said subrule M with respect to each said
subrule H until a valid subsuming comparison CMP is determined;
setting said repetition symbol SM as the repetition symbol for the
corresponding subrule M; and
setting said repetition symbol SH as the repetition symbol for the
corresponding subrule H.
43. The method for generating a corpus grammar for a collection of document
records with grammatical structure components identified by start and end
tags, and for reducing the number of rules of said grammar, comprising the
steps of:
(a) extracting said start and end tags from samples of said records and
forming a tag list therefrom;
(b) matching each start tag of said list with a corresponding end tag with
respect to each of said records to derive matched tags, said matched tags
representing a tag structure of grammar elements paired in a parent-child
defined relationship;
(c) accumulating said grammar elements represented by said matched tags of
said tag structures as corresponding AND and element rules substantially
exhibiting hierarchical tree structures to establish a corpus grammar,
(d) acquiring a said grammar element with a rule, R of a given type of said
tree structure from said corpus grammar,
(e) applying a reduction guide selected recursive reduction procedure to
said rule R and subrules thereof of said acquired grammar element;
(f) removing any of said subrules when said applied reduction procedure has
eliminated them from said rule R of said acquired grammar element;
(g) acquiring a next said grammar element and reiterating the aforesaid
steps d-f. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
BACKGROUND
Storing document-based information in the memory of a computer installation
requires a predesigned system of document order or control. Developing
such order is a science of software referred to generally as a database
management system (DBMS). DMBSs store information as records carrying
facts as to document content and structure. Retrieving information from a
collection of records, such as a database, differs among system designers.
Document-based databases traditionally have been stored in a hierarchy
that resembles an inverted family tree with relationships of root, or
parent, child and the like. Alternately, a network database, while
resembling a hierarchical model, establishes links between various levels
of one or more tree structures. A popular system typically employed with
spreadsheet forms of programs is a relational database.
In order to develop a document type hierarchically designed database, it is
necessary to compile and simplify information as to document structure.
This information can be called a grammar or document type definition
(DTD). Creating a DTD is non-trivial, substantial effort and corresponding
cost being invested in their development. For example, one major database
service organization developed over 10,000 DTDs representing over 10,000
different types of documents in building its databases. Their creation
often has been compared to writing a computer program. See generally, the
following publications:
(1) Joan Knoerdel, SGML is more than generic coding, EPSIG News, pp 3-4,
December, 1987
(2) SGML: A usage overview. Electronic Documents, 2(10:23-32
(3) Editor's rebuttal. EPSIG News, p. 4, May 1988
(4) Ludo VanVooren, Implementing SGML: Where do you start? <TAG>, The SGML
Newsletter, pp 5-7, February 1990
Several vendors now offer consulting services to write and maintain DTDs.
The hierarchical DTD, in general, is a condensed assemblage of AND, OR, and
element rules functioning to identify document structure elements
including, for example, title, author, and paragraph. These structural
components typically are identified or "marked-up" utilizing a somewhat
complex meta-language referred to as Standard Generalized Mark-up Language
or "SGML". Utilizing SGML methodology, document publishers insert start
and end tags to identify the structural components of the documents they
publish from electronic media. Text material similarly is identified.
Without such tagging, only human intervention and reading of the "raw"
text will find the structural components such as author, title, and the
like. With SGML techniques, a start tag typically is formed of a < symbol
followed by a tag name which, in turn, is followed by a > symbol.
Correspondingly, the end tag has the same structure with the addition of a
slash following the < symbol. An example of a simple SGML mark-up may be
shown as follows:
______________________________________
<record>
<name>Keith Shafer</name>
<title>Research Scientist</title>
<mailcode>MC 410</mailcode>
<ext>x5049</ext>
</record>
______________________________________
The corresponding grammar for the above tagged material is as follows:
______________________________________
<RECORD> ::= (<NAME><TITLE>
<MAILCODE><EXT>);
<NAME> ::= <#PCDATA>;
<TITLE> ::= <#PCDATA>;
<MAILCODE> ::= <#PCDATA>;
<EXT> ::= <#PCDATA>;
______________________________________
Note that adjacent to the identification of record there are the ANDed
grammar elements, name, title, mailcode, and extension. Components
containing text are tagged, for example, with the notation: "PCDATA". A
large grammar or a compilation, described heroin as a "corpus grammar" may
have thousands of rules and hundreds of grammar elements. Even though the
corpus grammar is so extensive, it is still of value to those needing to
develop a reduced grammar or DTD.
The SGML standard is available as:
(5) Information processing--text and office systems--standard generalized
markup language (SGML). International Organization for Standardization.
Ref. No. ISO 8879-1986., September 1969.
Several overviews of SGML, SGML resources and related textual needs are
available. See, for instance:
(6) Steven J. DeRose, David G. Durand, Elli Mylonas, and Allen H. Renear.
What is text, really? Journal of Computing in Higher Education, 1(2):3-26,
1990.
(7) Michael Farrell. Text markup, SGML, and text databases. EPSIG News,
5(4):19-20, 1993.
(8) Erik Naggum. Sgml Faq.0.0. ftp://ftp.ifi.uio.no/pub/ SGML/FAQ.0.0,
January 1992
(9) Erik Naggum. SGML general information. ftp://ftp.ifi. uio.
no/pub/SMGL/general-info, January 1992
(10) Haviland Wright. SGML frees information. Byte, pp 279-287, June 1992
Several books have been written about SGML, a more popular one of which is:
(11) Eric van Herwijnen. Practical SGML. Kluwer Academic Publishers,
Boston/Dordrecht/London, second edition, 1994
The reader's attention additionally is directed to a bibliography of SGML
papers, books, products and the like, which also include
abstracts/opinions and many of the listed items.
This bibliography is identified as:
(12) Robin Cover. Standard Generalized Markup Language ISO 8879:1986 (SGML)
annotated bibliography and list of resources.
ftp://ftp.ifi/uio.no/pub/SGML/bibliography, January, 1992
A DTD of the corpus grammar shown above may be provided as follows:
______________________________________
<!DOCTYPE RECORD[
<!ELEMENT RECORD
(NAME, TITLE,
MAILCODE, EXT)>
<!ELEMENT NAME
#PCDATA>
<!ELEMENT TITLE
#PCDATA>
<!ELEMENT MAILCODE
#PCDATA>
<!ELEMENT EXT
#PCDATA>
<!ENTITY #DEFAULT "
*** UNDEFINED ENTITY
REFERENCE***">
______________________________________
From the foregoing, substantial relief in terms of labor requirements and
costs can be foreseen in the electronics information industry with the
development of a technique for automatically generating a corpus grammar
and then additionally for reducing the extent of the corpus grammar or an
overly extensive DTD to a grammar of practical size.
SUMMARY
The present invention is addressed to a method for generating a grammar for
a collection of sample document records and to a process for reducing the
number of rules of such grammars with substantial improvement in
efficiency, cost, and speed. In generating a corpus or sample grammar, the
start and end tags, for example, SGML tags, are extracted from samples of
the document records to create what is termed herein as a tag list. A
matching procedure then is carried out wherein each start tag of the list
is matched with a corresponding end tag, such matching serving to identify
a tag structure of grammar elements with a parent-child pairing. A unique
utilization of a computer stack is employed to derive the hierarchical
parent-child relationships of the resultant tag structure. The grammar
elements represented by the matched tags of the tag structure then are
accumulated as corresponding rules exhibiting hierarchical tree structures
representing a sample grammar.
While development of the sample or corpus grammar is a valuable aspect of
the invention, further procedures may be employed to reduce or condense
its size. This reduction acquires and operates discretely upon each
grammar element of the hierarchical tree structure. Then, a determination
is made as to whether each such acquired grammar element is combined with
a rule of a given form of tree structure. Where that is the case, then any
of a selected sequence of reduction procedures from a reduction guide
assemblage of such procedures is applied, each of the sequence of
procedures being designed to aid in the carrying out of a next subsequent
procedure. In general, these procedures remove any subrule of the rule
established with the grammar element where such subrule is not required.
The identification and sequence of these reduction procedures are, for
example, as follows: single and empty reduction; collapse ANDs, collapse
ORs; a repeating reduction; PCDATA to ORs reduction; identical bases
reduction, off-by-one reduction; other reduction; and redundant reduction.
Another feature of the invention is to provide a method for generating a
corpus grammar for a collection of document records with grammatical
structure components identified by start and end tags, which comprises the
steps of:
extracting the start and end tags from samples of the records and forming a
tag list therefrom;
matching each start tag of the list with a corresponding end tag with
respect to each of the records to derive matched tags. These matched tags
represent a tag structure of grammar elements paired in a parent-child
defined relationship; and
accumulating the grammar elements represented by the matched tags of the
tag structures as corresponding rules substantially exhibiting
hierarchical tree structures to establish a corpus or sample grammar.
Another feature of the invention is to provide a method for reducing the
number of rules of a sample grammar present as an electronic compilation
of grammar elements associated as AND and element rules within a
hierarchical document describing tree structure which comprises the steps
of:
(a) acquiring a grammar element with a rule, R, of a given type of the tree
structure from the sample grammar;
(b) applying a reduction guide selected reduction procedure to the rule R
and subrules thereof of the acquired grammar,
(c) removing any subrule when the applied reduction procedure has
eliminated them from the rule R of the acquired grammar element; and
(d) acquiring a next grammar element and reiterating the aforesaid steps
a-c.
The invention further features a method combining the above two procedures
of generating a grammar and then reducing it to derive a DTD of efficient
and practical size.
The invention, accordingly, comprises the method possessing the steps which
are exemplified in the following detailed disclosure.
For a fuller understanding of the nature and objects of the invention,
reference should be had to the following detailed description taken in
connection with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram showing a method for developing a tag list,
forming a tag structure, forming a corpus grammar and reducing the grammar
according to the invention;
FIG. 2 is a block diagrammatic representation of a method for generating
SGML tagged information which is employed with the process of the
invention;
FIG. 3 is a flow chart showing the method according to the invention for
extracting tags from SGML tagged information;
FIG. 4 is a flow chart showing the method for matching tags from a tag list
in accordance with the method of the invention;
FIG. 5 is a flow chart showing a technique for building a grammar from a
tag structure in accordance with the invention;
FIG. 6 is a flow chart showing a general technique for reducing a grammar
in accordance with the invention;
FIG. 7 is a general flow chart showing the method for applying a reduction
guide to a grammar element's rules as described generally in FIG. 6;
FIG. 8 is a flow chart showing the method for carrying out a single and
empty reduction as described generally in FIG. 7;
FIG. 9 is a flow chart showing the reduction method for collapsing AND
rules as described generally in FIG. 7;
FIG. 10 is a flow chart describing the method for collapsing OR rules as
referred to generally in FIG. 7;
FIG. 11 is a flow chart showing the method for carrying out a repeating
reduction as described generally in connection with FIG. 7;
FIG. 12 is a flow chart showing the method for carrying out a PCDATA to OR
rules reduction as described generally in connection with FIG. 7;
FIG. 13 is a flow chart showing the method for carrying out an identical
basis reduction as described generally in FIG. 7;
FIG. 14 is a flow chart showing the method for carrying out an off-by-one
reduction as described generally in connection with FIG. 7;
FIG. 15 is a flow chart of a routine for developing a new rule which is
called from the flow chart of FIG. 14;
FIG. 16 is a flow chart for a routine for developing a new rule which is
called from the flow chart of FIG. 14;
FIG. 17 is a flow chart for a redundant reduction routine described in
connection with FIG. 7;
FIG. 18 is a flow chart for a reduction routine which is called in
connection with the flow charts of FIGS. 8-11, 13, 14, and 17;
FIG. 19 is a flow chart showing a comparison procedure which is called from
the flow charts of FIGS. 15-17;
FIG. 20 is a flow chart of a routine which is called from the flow chart of
FIG. 19;
FIG. 21 is a flow chart of a routine which is called from the flow chart of
FIG. 19;
FIG. 22 is a flow chart of a routine which is called from the flow chart of
FIG. 19;
FIG. 23 is a flow chart of a routine which is called from the flow chart of
FIG. 19;
FIG. 24 is a flow chart of a routine which is called from the flow chart of
FIG. 19; and
FIG. 25 is a flow chart of a routine which is called from the flow chart of
FIG. 19.
DETAILED DESCRIPTION
Referring to FIG. 1, a general overview of the process of the invention is
illustrated in block diagrammatic form. In general, the grammar builder
will receive document or text material in electronic form as represented
at block 100. For the most part, this electronic information will have
been tagged by procedures which may have been essentially carried out by
hand, but usually are developed by a software-based translation system.
This tagging process is represented at block 200 and, preferably, the
result of the tagging process as identified at block 300 is an SGML tagged
body of information which, for the purpose of the invention, will have a
sampling size. That sampling size may vary depending upon the diversity of
documents encountered in any record collection. As is apparent, for most
databases, the number of records involved will be one of essentially
continuously increasing number. Thus, the sample should be of adequate
extent so as to remain valid for the anticipated record collection. At
this juncture, the electronic information is tagged but there is no
grammar or DTD describing the documents involved. Next, as represented at
block 400, the documents are opened electronically and the tags are
extracted into a list That tag list is represented at block 500.
As represented at block 600, the tag list then is examined to assure that
the sequence of tags is adequate to define a structure. For example, an
error may occur where a start tag is | | |