WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
Generation and reduction of an SGML defined grammer    
United States Patent5583762   
Link to this pagehttp://www.wikipatents.com/5583762.html
Inventor(s)Shafer; Keith E. (Columbus, OH)
AbstractA method for generating a grammar for a collection of sample documents or records which are marked up under the SGML language. Once completed, the grammar is reduced to provide a document type definition (DTD). In constructing the initial or large corpus grammar, SGML tags are extracted and missing tags are accounted for. The tags then are matched to develop a tag structure from which a corpus grammar is built. Utilizing a sequence of reduction procedures, the corpus grammar is reduced.
   














 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 5583762
Generation and reduction of an SGML defined grammer - US Patent 5583762 Drawing
Generation and reduction of an SGML defined grammer
Inventor     Shafer; Keith E. (Columbus, OH)
Owner/Assignee     OCLC Online Library Center, Incorporated (Dublin, OH)
Patent assignment
All assignments
Publication Date     December 10, 1996
Application Number     08/295,259
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     August 22, 1994
US Classification     715/532 715/513
Int'l Classification     G06F 017/22 G06F 017/27
Examiner     Kulik; Paul V.
Assistant Examiner    
Attorney/Law Firm     Mueller and Smith, LPA
Address
Parent Case    
Priority Data    
USPTO Field of Search     395/700 395/600 364/419.1 364/419.19
Patent Tags     generation reduction sgml defined grammer
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
5506985
Motoyama
715/523
Apr,1996

[0 after 0 votes]
5487165
Tsay
707/102
Jan,1996

[0 after 0 votes]
5293473
Hesse
715/529
Mar,1994

[0 after 0 votes]
5285526
Bennett, III
715/516
Feb,1994

[0 after 0 votes]
5276793
Borgendale
715/513
Jan,1994

[0 after 0 votes]
5140052
Franklin
521/129
Aug,1992

[0 after 0 votes]
5079700
Kozoll

Jan,1992

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


I claim:

1. The method for generating a corpus grammar for a collection of document records with grammatical structure components identified by start and end tags, comprising the steps of:

extracting said start and end tags from samples of said records and forming a tag list therefrom;

matching each start tag of said list with a corresponding end tag with respect to each of said records to derive matched tags, said matched tags representing a tag structure of grammar elements paired in a parent-child defined relationship; and

accumulating said grammar elements represented by said matched tags of said tag structures as corresponding rules substantially exhibiting hierarchical tree structures to establish a corpus grammar.

2. The method of claim 1 in which said step for extracting said start and end tags includes the steps of:

identifying an initial start character of a said tag;

identifying a second character next occurring after said start character,

determining whether said second character is an end tag defining character,

identifying a next occurring end character,

adding a said start tag to said tag list when said second character is not an end tag character, and

adding a said end tag to said tag list when said second character is an end tag defining character.

3. The method of claim 2 including the steps of:

identifying a next start character occurring subsequent to said initial start character,

determining if said next start character occurs leftwardly of said end character; and

identifying the presence of a stray start character when said next start character occurs leftwardly of said end character.

4. The method of claim 3 including the step of adding a text identifying tag to said tag list when said next start character does not occur leftwardly of said end character.

5. The method of claim 2 including the steps of:

determining whether said second character is an alphabetical character;

then carrying out said step determining whether said second character is an end tag determining character, and

identifying a next occurring start character as an initial said start character when said second character is determined not to be a said alphabetical character and not to be a said end tag defining character.

6. The method of claim 1 in which said step for deriving matched tags further includes the steps of:

acquiring a given tag from a sequence of tags of said tag list; determining whether said given tag is a start tag; locating said given tag at the top of a first-in-first-out stack when it is determined to be a start tag;

acquiring a next tag which is next in sequence to said given tag;

determining whether said next tag is an end tag or a start tag;

determining when said next tag is an end tag, whether said next tag is a match with said given tag at a said stack in the relationship of end tag to start tag; and

removing said given tag from said stack in the presence of a said match and identifying its relation to said next tag.

7. The method of claim 6 including the steps of:

providing a hierarchical root list; and

adding said next tag to said root list when said next tag is determined to be a start tag and no said given tag is located at the top of said stack.

8. The method of claim 6 including the step of identifying said next tag as a hierarchical child at said given tag when said given tag is located at the top of said stack and said next tag is determined to be a start tag.

9. The method of claim 8 including the step of locating said next tag which is identified as a hierarchical child at the top of said stack.

10. The method of claim 6 including the steps of:

creating an empty end tag when said next tag is determined to be an end tag which is not a match with said given tag at said stack; and

removing said given tag from said stack and identifying a start tag-end tag relation between it and said empty end tag.

11. The method of claim 6 including the steps of:

creating an empty end tag when a said next tag is not available for said step for acquiring a next tag; and

removing said given tag from said stack and identifying a start tag-end tag relation between it and said empty end tag.

12. The method of claim 6 including the steps of:

creating an end tag which matches said given tag located at the top of said stack when said next tag is determined to be an end tag which is not a match with said given tag and a text identifying tag occurs next following said given tag in said tag list; and

removing said given tag from said stack and identifying a start tag-end tag between it and said created end tag.

13. The method of claim 6 including the steps of:

providing a hierarchical root list;

adding said next tag to said root list when said next tag is determined to be a start tag and no said given tag is at the top of said stack;

identifying said next tag as a hierarchical child of said given tag when said given tag is located at the top of said stack and said next tag is determined to be a start tag; and

said step of accumulating said grammar elements includes the steps of:

designating the grammar element of a said next tag with in said root list as a grammar root;

identifying the grammar element of the next occurring start tag of said tag structure;

identifying the next said hierarchical child of said start tag having said identified grammar element; and

adding the grammar element of said identified next hierarchical child to an AND rule for said identified grammar element.

14. The method of claim 13 including the steps of:

creating a master root grammar element when there exists more than one entry within said root list; and

adding each said grammar element of each said next tag within said root list to an AND rule for said master root grammar element.

15. The method for reducing the number of rules of a sample grammar present as an electronic compilation of grammar elements associated as AND and element rules within a hierarchical document describing tree smacture, comprising the steps of:

(a) acquiring a said grammar element with a rule, R, of a given type of said tree structure from said sample grammar,

(b) applying a reduction guide selected recursive reduction procedure to .. said rule R and subrules thereof of said acquired grammar element;

(c) removing any of said subrules when said applied reduction procedure has eliminated them from said rule R of said acquired grammar element;

(d) acquiring a next said grammar element and reiterating the aforesaid steps a-c.

16. The method of claim 15 in which said step for applying a recursive reduction procedure includes the step of applying a single and empty reduction effecting the removal of element rules combined as subrules with said acquired grammar element rule, R.

17. The method of claim 15 in which said step for applying a recursive reduction procedure includes the step of applying a collapse ANDs reduction procedure wherein:

a determination is made as to whether the rule R of said acquired grammar element and the subrules thereof are AND rules and that there are no repetitions of said AND subrules; and

each subrule of said AND subrule is made to be said AND subrule.

18. The method of claim 17 in which said collapse ANDs reduction procedure is carried out subsequent to the execution of a single and empty reduction procedure effecting the removal of element rules combined as subrules with said acquired grammar element rule, R.

19. The method of claim 15 in which said step for applying a recursive reduction procedure includes the step of applying a collapse OR's reduction procedure wherein:

a determination is made as to whether the rule R of said acquired grammar element and the subrule NS thereof is an OR rule and whether the repetition characteristic of said OR rule R subsumes the repetition characteristics of said OR subrule NS; and

in the event that the said rule R and subrules NS are OR rules, and said subsumption is available, then each subrule of said OR subrule NS is made said OR subrule.

20. The method of claim 19 in which said collapse OR's reduction procedure is carried out subsequent to the execution of a collapse AND's reduction procedure.

21. The method of claim 15 in which said step for applying a recursive reduction procedure includes the step of applying a repeating reduction procedure wherein:

a determination is made as to whether the rule, R, of said acquired grammar element is an AND rule or an OR rule;

when said rule, R, of said acquired grammar element is an AND rule, then subrule B, the first subrule of R, is acquired with its repetition symbol, and the subrule NS, the next rule of rule R is acquired with its repetition symbol;

a determination is made whether the subrule B is the same as the subrule NS excepting the said repetition symbol of each;

in the event subrule B is the same as subrule NS, said repetition symbol of subrule B is combined as a peer with said repetition symbol of NS to derive a resulting repetition symbol which is made the repetition symbol of subrule B; and

subrule NS is removed.

22. The method of claim 21 wherein the steps set forth therein are iterated for all subrules of rule R.

23. The method of claim 21 in which:

said rule, R of said acquired grammar element is determined to be an OR rule;

then subrule C, a subrule of rule R is acquired with its repetition symbol, and another subrule of rule R, subrule N is acquired with its repetition symbol;

a determination is made whether the subrule C is the same as the subrule N excepting the said repetition symbol of each;

in the event subrule C is the same as subrule N, said repetition symbol of subrule C and said repetition symbol of subrule N are combined as subordinates to derive a resulting repetition symbol which is made the repetition symbol of subrule C; and

subrule N is removed.

24. The method of claim 21 in which said step of applying a repeating reduction procedure is carried out subsequent to the execution of a collapse OR's reduction procedure, a collapse AND's reduction procedure, and a single and empty reduction procedure.

25. The method of claim 15 in which said step for applying a recursive reduction procedure includes the step of applying a redundant reduction procedure wherein:

a determination is made with respect to whether the rule R of said acquired gammar element is an OR rule;

where rule R is determined to be an OR rule, then each subrule B thereof, designated as a base subrule, is compared with each other subrule C thereof designated as a compare subrule to determine if the subrule B subsumes the subrule C; and

where subrule B subsumes subrule C, then subrule C is removed from the subrules of rule R.

26. The method of claim 25 in which said redundant reduction procedure is the last reduction carried out as part of said step for applying a reduction procedure.

27. The method of claim 19 or 25 in which said step for applying a reduction procedure includes the step of applying a PCDATA to OR's reduction procedure wherein:

a determination is made as to whether the rule R of said acquired grammar element and a subrule NS thereof at any depth within said tree structure contain a text identifier tag;

all said subrules NS at any depth within said tree structure which are AND subrules are changed to OR subrules and any repetition symbol of any said subrule NS at any depth within said tree structure is changed to one when said rule R or any said subrule NS contains a said text identifier tag;

then, said step of applying a said collapse OR's reduction is carried out with respect to said rule R;

then, a repetition symbol of one or more is applied to said rule R; and then, said step of applying a redundant reduction procedure is carried out with respect to said rule R.

28. The method of claim 27 including the steps of:

determining whether said rule R is an element rule subsequent to said step of applying a redundant reduction procedure; and

setting the repetition symbol of rule R to one where rule R is determined to be an element rule.

29. The method of claim 27 in which said step of applying a PCDATA to OR's reduction procedure is carried out subsequent to the execution of a repeating reduction procedure, a collapse OR's reduction procedure, a collapse AND's reduction procedure, and a single and empty reduction procedure.

30. The method of claim 15 in which said step for applying a recursive reduction procedure includes the step of applying an identical bases reduction procedure which comprises the steps of:

determining whether the rule R of said acquired grammar element is an OR rule;

when said rule R is determined to be an OR rule, then comparing each subrule B thereof, designated as a base subrule, with each other subrule NS thereof to determine if the subrule B has the same subrule bases as the subrule NS;

where subrule B has the same subrule base as the subrule NS, then combining the repetition symbols of subrule B and subrule NS as subordinates, combining the base repetition symbols of subrule B and NS per base as subordinates, placing all repetition symbols resulting from such combinations in subrule B, and removing subrule NS.

31. The method of claim 15 in which said step for applying a recursive reduction procedure includes the step of applying an off-by-one reduction procedure which comprises the steps of:

determining whether the rule R of said acquired grammar element is an OR rule;

providing an add rules list;

providing a remove rules list;

when said rule R is determined to be an OR rule, then acquiring each subrule B thereof which is an AND rule and acquiring each other subrule C thereof,

comparing each acquired subrule B with each acquired subrule C and determining from said comparison whether they differ by one subrule M of subrule B;

identifying said differing one base component when said acquired subrule B and said acquired subrule C differ by one base component;

generating a new rule formed of acquired subrule B with said differing one subrule M having a zero-based repetition symbol;

adding said new rule to said add rules list;

adding said acquired said rule B and said acquired subrule C to said remove rules list;

when all said subrules B have been compared with all said subrules C, removing said subrules of said remove rules list and adding said rules of said add rules list as subrules of said rule R;

then emptying said remove rules list and said add rules list; and

repeating the above steps until said step for adding said new rule to said add rules list does not occur.

32. The method of claim 31 in which said step of comparing each acquired subrule B with each acquired subrule C includes the steps of:

determining whether subrule M of subrule B subsumes said subrule C when said subrule C is not an AND rule;

when a said subrule M does not subsume said subrule C, then determining whether said subrule M has a zero-based repetition symbol;

when said subrule M does not have a zero-based repetition symbol, then identifying said subrule M;

then, carrying out said step for generating a new rule.

33. The method of claim 31 in which said step of comparing each acquired subrule B with each acquired subrule C includes the steps of:

determining whether subrule M of subrule B subsumes subrule H of subrule C;

when said subrule M does subsume said subrule H, then the repetition symbol of subrule M is saved as SM, said symbol of subrule M is set to the subordinate combination repetition symbol of the repetition symbols of subrule B and subrule M, the repetition symbols of subrule B and subrule M, the repetition symbols of subrule H are saved as SH, and said symbols of subrule H are set to the subordinate combination repetition symbol of the repetition symbols of subrules C and H;

then, determining whether subrule M subsumes subrule H, subsequently restoring the respective saved repetition symbols SM and SH; and

then carrying out said step of generating a new rule.

34. The method of claims 16, 17, 19, or 25 including the steps of:

determining whether rule R has only one subrule NS;

when rule R is determined to have only one subrule NS, then combining the repetition symbols of rule R and subrule NS as subordinates, placing the resulting repetition symbol on rule R, and setting the type of rule R to the type of subrule NS; and

when the type of rule R has been set to the type of subrule NS, removing subrule NS as a subrule of R and moving all subrules of subrule NS to rule R.

35. The method of claim 25 or 32 or 33 in which a determination of subsumption of two rules, said base subrule B and said compare rule C, includes the steps of:

determining whether said base rule B is an element rule;

carrying out a first subsumption analysis with respect to compare rule C when said base rule B is an element rule;

determining whether the base rule B is an AND rule;

carrying out a second subsumption analysis when said base rule B is an AND rule and said compare rule C is an element rule;

carrying out a third subsumption analysis when said base rule B is an AND and rule C and said compare rule is an AND rule;

carrying out a fourth subsumption analysis when said base rule B is an AND rule and said compare rule C is an OR rule;

determining whether base rule B is an OR rule;

carrying out a fifth subsumption analysis when said base rule B is an OR rule and said compare rule C is an element rule;

carrying out a fifth subsumption analysis when said base rule B is an OR rule and said compare rule C is an AND rule; and

carrying out said sixth subsumption analysis when said base rule B is an OR rule and said compare mile C is an OR rule.

36. The method of claim 35 in which said first subsumption analysis includes the steps of:

determining whether compare rule C is an element rule;

when compare rule C is an element rule, determining whether base rule B and compare rule C exhibit a common element; and

when compare rule C and base rule B exhibit a common element, and when said repetition symbols of base rule B subsumes said repetition symbol of compare rule C, then indicating the presence of said subsumption.

37. The method of claim 35 in which said second subsumption analysis includes the steps of:

determining whether each subrule NS of base rule B is an element rule and whether a said subrule NS and said compare rule C exhibit a common element;

incrementing a comparison match count value NUM when a said subrule NS and said compare rule C exhibit a common element;

indicating the presence of a valid subsuming comparison CMP when the said repetition symbol of a said subrule NS is zero based and said comparison match value NUM is not incremented;

incrementing a zero-based count value ZEROS when the said repetition symbol of a said subrule NS is zero-based and said comparison match count value NUM has been incremented;

incrementing a more-based count value, MORES when the said repetition symbol of a said subrule NS is more-based and said comparison match count value NUM has been incremented; and

then redetermining the presence of a valid comparison CMP when the said comparison match value NUM less the count value, ZEROS is less than or equal to one.

38. The method of claim 37 including the steps of:

confirming the redetermined presence of said valid comparison CMP when the repetition symbol of compare rule C is zero or one and the match count value NUM equals the zero-based count value, ZEROS or when the repetition symbol of base rule B is zero-based;

confirming the redetermined presence of said valid comparison CMP when the repetition symbol of compare rule C is one or more and the more-based count value MORES is at least one, or when the repetition symbol of base rule B is more-based; and

confirming the redetermined presence of said valid comparison CMP when the repetition symbol of compare rule C is zero or more and the said comparison match count value NUM equals the zero-based count value ZERO or the repetition symbol of the base rule B is zero based, and the said more-based count value MORES is at least one or the repetition symbol of base rule B is more-based.

39. The method of claim 35 in which said third subsumption analysis includes the steps of:

initially indicating the presence of a subsuming comparison CMP when the number of subrules of said base rule B is greater than the number of subrules of said compare rule C;

setting an indexible pointer J to sequentially identify the subrules M of base rule B;

setting an indexible pointer K to sequentially identify the subrules H of compare rule C;

carrying out a test upon each incrementation of pointers J and K for the said presence of said comparison CMP and the condition that the value of pointer J is less than or equal to the number of subrules M of base rule B, and that the value of pointer K is less than or equal to the number of subrules H of compare rule C;

when said test fails for the reason that the value of pointer K is not less than or equal to the number of subrules H of compare rule C, then indicating the presence of a valid comparison CMP when all remaining subrules M of base rule B are zero-based;

when said test is true, carrying out a said first, second, third, fourth, fifth, or sixth subsumption analysis for said subrule M as a base rule and said subrule H as a compare rule to determine the presence of a subrule comparison CMP;

in the presence of a said subrule comparison CMP, saving the repetition symbols SM of subrule M and setting the repetition symbol of subrule M as a subordinate combination of the repetition symbols of base rule B and subrule M;

in the presence of said subrule comparison CMP, saving the repetition symbol SH of subrule H and setting the repetition symbol of subrule H as a subordinate combination of the repetition symbols of compare rule C and subrule H;

then, carrying out a said first, second, third, fourth, fifth, or sixth subsumption analysis for said subrule M as a base rule and said subrule H as a compare rule;

then, in the presence of a subsuming comparison CMP of said subrule M and said subrule H, setting the repetition symbol of subrule M to said repetition symbol SM and the repetition symbol of subrule H to said repetition symbol SH; and

indicating the presence of a subsuming comparison CMP when the repetition symbol of subrule M is zero-based, in the presence of a previous failure of the comparison CMP.

40. The method of claim 35 in which said fourth subsumption analysis includes the steps of:

determining the presence of the condition wherein any repetition symbol of said base rule B subsumes the repetition symbol of the compare rule C; and

in the presence of said condition, carrying out a said first, second, third, fourth, fifth, or sixth subsumption analysis for said base rule B and each subrule of said compare rule C.

41. The method of claim 35 in which said fifth subsumption analysis includes the steps of:

saving the repetition symbol of each subrule M of said base rule B as SM, and setting the repetition symbol of each subrule M to the subordinate combination of the repetition symbol of base rule B and said subrule M;

carrying out a said first, second, third, fourth, fifth, or sixth subsumption analysis for a said subrule M as a base rule and said compare rule C; and

identifying an initial subsuming comparison CMP from said subsumption analysis as a valid comparison, CMP and setting said repetition symbol SM as the repetition symbol of corresponding subrule M.

42. The method of claim 35 in which said sixth subsumption analysis includes the steps of:

saving the repetition symbol of each subrule H of said comparison rule C as SH, and setting the repetition symbol of each said subrule H to the subordinate combination of the repetition symbol of compare rule C and subrule H;

saving the repetition symbol of each subrule M of said base rule B as SM, and setting the repetition symbol of each said subrule M to a subordinate combination of the repetition symbol of base rule B and said subrule M;

carrying out a said first, second, third, fourth, fifth or sixth subsumption analysis for each said subrule M with respect to each said subrule H until a valid subsuming comparison CMP is determined;

setting said repetition symbol SM as the repetition symbol for the corresponding subrule M; and

setting said repetition symbol SH as the repetition symbol for the corresponding subrule H.

43. The method for generating a corpus grammar for a collection of document records with grammatical structure components identified by start and end tags, and for reducing the number of rules of said grammar, comprising the steps of:

(a) extracting said start and end tags from samples of said records and forming a tag list therefrom;

(b) matching each start tag of said list with a corresponding end tag with respect to each of said records to derive matched tags, said matched tags representing a tag structure of grammar elements paired in a parent-child defined relationship;

(c) accumulating said grammar elements represented by said matched tags of said tag structures as corresponding AND and element rules substantially exhibiting hierarchical tree structures to establish a corpus grammar,

(d) acquiring a said grammar element with a rule, R of a given type of said tree structure from said corpus grammar,

(e) applying a reduction guide selected recursive reduction procedure to said rule R and subrules thereof of said acquired grammar element;

(f) removing any of said subrules when said applied reduction procedure has eliminated them from said rule R of said acquired grammar element;

(g) acquiring a next said grammar element and reiterating the aforesaid steps d-f.
 Description Submit all comments and votes
 


BACKGROUND

Storing document-based information in the memory of a computer installation requires a predesigned system of document order or control. Developing such order is a science of software referred to generally as a database management system (DBMS). DMBSs store information as records carrying facts as to document content and structure. Retrieving information from a collection of records, such as a database, differs among system designers. Document-based databases traditionally have been stored in a hierarchy that resembles an inverted family tree with relationships of root, or parent, child and the like. Alternately, a network database, while resembling a hierarchical model, establishes links between various levels of one or more tree structures. A popular system typically employed with spreadsheet forms of programs is a relational database.

In order to develop a document type hierarchically designed database, it is necessary to compile and simplify information as to document structure. This information can be called a grammar or document type definition (DTD). Creating a DTD is non-trivial, substantial effort and corresponding cost being invested in their development. For example, one major database service organization developed over 10,000 DTDs representing over 10,000 different types of documents in building its databases. Their creation often has been compared to writing a computer program. See generally, the following publications:

(1) Joan Knoerdel, SGML is more than generic coding, EPSIG News, pp 3-4, December, 1987

(2) SGML: A usage overview. Electronic Documents, 2(10:23-32

(3) Editor's rebuttal. EPSIG News, p. 4, May 1988

(4) Ludo VanVooren, Implementing SGML: Where do you start? <TAG>, The SGML Newsletter, pp 5-7, February 1990

Several vendors now offer consulting services to write and maintain DTDs.

The hierarchical DTD, in general, is a condensed assemblage of AND, OR, and element rules functioning to identify document structure elements including, for example, title, author, and paragraph. These structural components typically are identified or "marked-up" utilizing a somewhat complex meta-language referred to as Standard Generalized Mark-up Language or "SGML". Utilizing SGML methodology, document publishers insert start and end tags to identify the structural components of the documents they publish from electronic media. Text material similarly is identified. Without such tagging, only human intervention and reading of the "raw" text will find the structural components such as author, title, and the like. With SGML techniques, a start tag typically is formed of a < symbol followed by a tag name which, in turn, is followed by a > symbol. Correspondingly, the end tag has the same structure with the addition of a slash following the < symbol. An example of a simple SGML mark-up may be shown as follows:

______________________________________ <record> <name>Keith Shafer</name> <title>Research Scientist</title> <mailcode>MC 410</mailcode> <ext>x5049</ext> </record> ______________________________________

The corresponding grammar for the above tagged material is as follows:

______________________________________ <RECORD> ::= (<NAME><TITLE> <MAILCODE><EXT>); <NAME> ::= <#PCDATA>; <TITLE> ::= <#PCDATA>; <MAILCODE> ::= <#PCDATA>; <EXT> ::= <#PCDATA>; ______________________________________

Note that adjacent to the identification of record there are the ANDed grammar elements, name, title, mailcode, and extension. Components containing text are tagged, for example, with the notation: "PCDATA". A large grammar or a compilation, described heroin as a "corpus grammar" may have thousands of rules and hundreds of grammar elements. Even though the corpus grammar is so extensive, it is still of value to those needing to develop a reduced grammar or DTD.

The SGML standard is available as:

(5) Information processing--text and office systems--standard generalized markup language (SGML). International Organization for Standardization. Ref. No. ISO 8879-1986., September 1969.

Several overviews of SGML, SGML resources and related textual needs are available. See, for instance:

(6) Steven J. DeRose, David G. Durand, Elli Mylonas, and Allen H. Renear. What is text, really? Journal of Computing in Higher Education, 1(2):3-26, 1990.

(7) Michael Farrell. Text markup, SGML, and text databases. EPSIG News, 5(4):19-20, 1993.

(8) Erik Naggum. Sgml Faq.0.0. ftp://ftp.ifi.uio.no/pub/ SGML/FAQ.0.0, January 1992

(9) Erik Naggum. SGML general information. ftp://ftp.ifi. uio. no/pub/SMGL/general-info, January 1992

(10) Haviland Wright. SGML frees information. Byte, pp 279-287, June 1992

Several books have been written about SGML, a more popular one of which is:

(11) Eric van Herwijnen. Practical SGML. Kluwer Academic Publishers, Boston/Dordrecht/London, second edition, 1994

The reader's attention additionally is directed to a bibliography of SGML papers, books, products and the like, which also include abstracts/opinions and many of the listed items.

This bibliography is identified as:

(12) Robin Cover. Standard Generalized Markup Language ISO 8879:1986 (SGML) annotated bibliography and list of resources. ftp://ftp.ifi/uio.no/pub/SGML/bibliography, January, 1992

A DTD of the corpus grammar shown above may be provided as follows:

______________________________________ <!DOCTYPE RECORD[ <!ELEMENT RECORD (NAME, TITLE, MAILCODE, EXT)> <!ELEMENT NAME #PCDATA> <!ELEMENT TITLE #PCDATA> <!ELEMENT MAILCODE #PCDATA> <!ELEMENT EXT #PCDATA> <!ENTITY #DEFAULT " *** UNDEFINED ENTITY REFERENCE***"> ______________________________________

From the foregoing, substantial relief in terms of labor requirements and costs can be foreseen in the electronics information industry with the development of a technique for automatically generating a corpus grammar and then additionally for reducing the extent of the corpus grammar or an overly extensive DTD to a grammar of practical size.

SUMMARY

The present invention is addressed to a method for generating a grammar for a collection of sample document records and to a process for reducing the number of rules of such grammars with substantial improvement in efficiency, cost, and speed. In generating a corpus or sample grammar, the start and end tags, for example, SGML tags, are extracted from samples of the document records to create what is termed herein as a tag list. A matching procedure then is carried out wherein each start tag of the list is matched with a corresponding end tag, such matching serving to identify a tag structure of grammar elements with a parent-child pairing. A unique utilization of a computer stack is employed to derive the hierarchical parent-child relationships of the resultant tag structure. The grammar elements represented by the matched tags of the tag structure then are accumulated as corresponding rules exhibiting hierarchical tree structures representing a sample grammar.

While development of the sample or corpus grammar is a valuable aspect of the invention, further procedures may be employed to reduce or condense its size. This reduction acquires and operates discretely upon each grammar element of the hierarchical tree structure. Then, a determination is made as to whether each such acquired grammar element is combined with a rule of a given form of tree structure. Where that is the case, then any of a selected sequence of reduction procedures from a reduction guide assemblage of such procedures is applied, each of the sequence of procedures being designed to aid in the carrying out of a next subsequent procedure. In general, these procedures remove any subrule of the rule established with the grammar element where such subrule is not required. The identification and sequence of these reduction procedures are, for example, as follows: single and empty reduction; collapse ANDs, collapse ORs; a repeating reduction; PCDATA to ORs reduction; identical bases reduction, off-by-one reduction; other reduction; and redundant reduction.

Another feature of the invention is to provide a method for generating a corpus grammar for a collection of document records with grammatical structure components identified by start and end tags, which comprises the steps of:

extracting the start and end tags from samples of the records and forming a tag list therefrom;

matching each start tag of the list with a corresponding end tag with respect to each of the records to derive matched tags. These matched tags represent a tag structure of grammar elements paired in a parent-child defined relationship; and

accumulating the grammar elements represented by the matched tags of the tag structures as corresponding rules substantially exhibiting hierarchical tree structures to establish a corpus or sample grammar.

Another feature of the invention is to provide a method for reducing the number of rules of a sample grammar present as an electronic compilation of grammar elements associated as AND and element rules within a hierarchical document describing tree structure which comprises the steps of:

(a) acquiring a grammar element with a rule, R, of a given type of the tree structure from the sample grammar;

(b) applying a reduction guide selected reduction procedure to the rule R and subrules thereof of the acquired grammar,

(c) removing any subrule when the applied reduction procedure has eliminated them from the rule R of the acquired grammar element; and

(d) acquiring a next grammar element and reiterating the aforesaid steps a-c.

The invention further features a method combining the above two procedures of generating a grammar and then reducing it to derive a DTD of efficient and practical size.

The invention, accordingly, comprises the method possessing the steps which are exemplified in the following detailed disclosure.

For a fuller understanding of the nature and objects of the invention, reference should be had to the following detailed description taken in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing a method for developing a tag list, forming a tag structure, forming a corpus grammar and reducing the grammar according to the invention;

FIG. 2 is a block diagrammatic representation of a method for generating SGML tagged information which is employed with the process of the invention;

FIG. 3 is a flow chart showing the method according to the invention for extracting tags from SGML tagged information;

FIG. 4 is a flow chart showing the method for matching tags from a tag list in accordance with the method of the invention;

FIG. 5 is a flow chart showing a technique for building a grammar from a tag structure in accordance with the invention;

FIG. 6 is a flow chart showing a general technique for reducing a grammar in accordance with the invention;

FIG. 7 is a general flow chart showing the method for applying a reduction guide to a grammar element's rules as described generally in FIG. 6;

FIG. 8 is a flow chart showing the method for carrying out a single and empty reduction as described generally in FIG. 7;

FIG. 9 is a flow chart showing the reduction method for collapsing AND rules as described generally in FIG. 7;

FIG. 10 is a flow chart describing the method for collapsing OR rules as referred to generally in FIG. 7;

FIG. 11 is a flow chart showing the method for carrying out a repeating reduction as described generally in connection with FIG. 7;

FIG. 12 is a flow chart showing the method for carrying out a PCDATA to OR rules reduction as described generally in connection with FIG. 7;

FIG. 13 is a flow chart showing the method for carrying out an identical basis reduction as described generally in FIG. 7;

FIG. 14 is a flow chart showing the method for carrying out an off-by-one reduction as described generally in connection with FIG. 7;

FIG. 15 is a flow chart of a routine for developing a new rule which is called from the flow chart of FIG. 14;

FIG. 16 is a flow chart for a routine for developing a new rule which is called from the flow chart of FIG. 14;

FIG. 17 is a flow chart for a redundant reduction routine described in connection with FIG. 7;

FIG. 18 is a flow chart for a reduction routine which is called in connection with the flow charts of FIGS. 8-11, 13, 14, and 17;

FIG. 19 is a flow chart showing a comparison procedure which is called from the flow charts of FIGS. 15-17;

FIG. 20 is a flow chart of a routine which is called from the flow chart of FIG. 19;

FIG. 21 is a flow chart of a routine which is called from the flow chart of FIG. 19;

FIG. 22 is a flow chart of a routine which is called from the flow chart of FIG. 19;

FIG. 23 is a flow chart of a routine which is called from the flow chart of FIG. 19;

FIG. 24 is a flow chart of a routine which is called from the flow chart of FIG. 19; and

FIG. 25 is a flow chart of a routine which is called from the flow chart of FIG. 19.

DETAILED DESCRIPTION

Referring to FIG. 1, a general overview of the process of the invention is illustrated in block diagrammatic form. In general, the grammar builder will receive document or text material in electronic form as represented at block 100. For the most part, this electronic information will have been tagged by procedures which may have been essentially carried out by hand, but usually are developed by a software-based translation system. This tagging process is represented at block 200 and, preferably, the result of the tagging process as identified at block 300 is an SGML tagged body of information which, for the purpose of the invention, will have a sampling size. That sampling size may vary depending upon the diversity of documents encountered in any record collection. As is apparent, for most databases, the number of records involved will be one of essentially continuously increasing number. Thus, the sample should be of adequate extent so as to remain valid for the anticipated record collection. At this juncture, the electronic information is tagged but there is no grammar or DTD describing the documents involved. Next, as represented at block 400, the documents are opened electronically and the tags are extracted into a list That tag list is represented at block 500.

As represented at block 600, the tag list then is examined to assure that the sequence of tags is adequate to define a structure. For example, an error may occur where a start tag is