WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
Data processing system and method for random access formatting of a portion of a large hierarchical electronically published document with descriptive markup    

Get related patents on CD
United States Patent5644776   
Link to this pagehttp://www.wikipatents.com/5644776.html
Inventor(s)DeRose; Steven (East Providence, RI); Vogel; Jeffrey (Providence, RI)
AbstractA data processing system and method for generating a representation of an electronic document, for indexing the electronic document, for navigating the electronic document using its representation and for displaying the electronic document on an output device. The system and method are used with electronic documents having descriptive markup which describes the content or meaning of the document rather than its appearance. Such documents may be represented by a tree. Each markup element defines a node or element in a tree. The tree is represented by providing a unique identifier for each element and for accessing a descriptor of the element. An element descriptor preferably includes indications of the parent, first child, last child, left sibling, right sibling, type name and text location for the element. The document representation is used to facilitate navigation of the text for constructing navigational aids such as table of contents and full text indexing. A document is also provided with a style sheet for specifying desired formatting characteristics for each type of element in the document. To display the document, a suitable starting point is found on the basis of a selected starting point. The document is displayed beginning with the suitable starting point and the format characteristics for each element displayed are retrieved from the style sheet and applied to the text of the displayed element.
   














 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History Custom Search
Drawing from US Patent 5644776
Data processing system and method for random access formatting of a

     portion of a large hierarchical electronically published document with

     descriptive markup - US Patent 5644776 Drawing
Data processing system and method for random access formatting of a portion of a large hierarchical electronically published document with descriptive markup
Inventor     DeRose; Steven (East Providence, RI); Vogel; Jeffrey (Providence, RI)
Owner/Assignee     INSO Providence Corporation (Providence, RI)
Patent assignment
All assignments
Company News
Publication Date     July 1, 1997
Application Number     08/480,611
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     June 7, 1995
US Classification    
Int'l Classification    
Examiner     Black; Thomas G.
Assistant Examiner     Homere; Jean R.
Attorney/Law Firm     Wolf, Greenfield & Sacks, P.C.
Address
Parent Case     This application is a division of application Ser. No. 08/419,051, filed Apr. 7, 1995, now U.S. Pat. No. 5,557,722 which is a file-wrapper continuation of application Ser. No. 07/733,204, filed Jul. 19, 1991, entitled DATA PROCESSING SYSTEM AND METHOD FOR REPRESENTING, GENERATING A REPRESENTATION OF AND RANDOM ACCESS RENDERING OF ELECTRONIC DOCUMENTS, abandoned.
Priority Data    
USPTO Field of Search    
Patent Tags     data processing random access formatting a portion large hierarchical electronically published document with descriptive markup
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B

[0 market size comments]
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%

[0 market share comments]
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%

[0 reasonable royalty comments]
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

[0 Guesstimation of Royalty Value Comments]
License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
[0 license availability comments]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
[0 owner/assignee comments]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

[0 competitive advantage comments]
Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

[0 commercial alternatives comments]
 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


What is claimed is:

1. A data processing system for random access formatting of a portion of a large hierarchical electronically published document, the electronically published document having descriptive markup defining a plurality of hierarchical elements, wherein each element has a type name, and has at least one of an ancestor element, a child element, a left sibling element, a right sibling element and unformatted text content, the data processing system using a representation of the electronically published document, including, for each element, an indication of any ancestor element, child element, and left and right sibling element and comprising:

means for receiving an indication of a starting point within the electronically published document; and

means for identifying one of the plurality of hierarchical elements within the electronically published document as an initial element containing the indicated starting point using the representation of the electronically published document;

means for traversing the representation of the electronically published document to select elements of the electronically published document beginning with the initial element; and

means for formatting only the text content contained in the selected elements.

2. The data processing system of claim 1, wherein the representation of the electronically published document includes, for each element, an indication of any text content contained within the element.

3. A method for random access formatting of a portion of a large hierarchical electronically published document, the electronically published document having descriptive markup defining a plurality of hierarchical elements, wherein each element has a type name, and has at least one of an ancestor element, a child element, a left sibling element, a right sibling element, and unformatted text content, the method using a representation of the electronically published document, including, for each element, an indication of any ancestor element, child element, and left and right sibling element and comprising the steps of:

receiving an indication of a starting point within the electronically published document;

identifying one of the plurality of hierarchical within the electronically published document as an initial element containing the indicated starting point using the representation of electronically published document;

traversing the representation of the electronically published document to select elements of the electronically published document beginning with the initial element; and

formatting only the text content contained in the selected elements.

4. The method of claim 3, wherein the representation of the electronically published document includes, for each element, an indication of any text content contained within the element.

5. A data processing system for random access formatting of a portion of a large hierarchical electronically published document, the electronically published document having descriptive markup defining a plurality of hierarchical elements, wherein each element has a type name, and has at least one of an ancestor element, a child element, a left sibling element, a right sibling element and unformatted text content, the data processing system using a representation of the electronically published document, including, for each element, an indication of any ancestor element, child element, and left and right sibling element and comprising:

means for requesting a portion of the electronically published document using an indication of a starting point within the electronically published document;

means for receiving only the requested portion of the electronically published document including selected elements containing the indicated starting point; and

means for formatting only the text content of the received portion of the electronically published document by applying a format specification corresponding to the received selected elements.

6. The data processing system of claim 5, wherein the representation of the electronically published document includes, for each element, an indication of any text content contained within the element.

7. A method for random access formatting of a portion of a large hierarchical electronically published document, the electronically published document having descriptive markup defining a plurality of hierarchical elements, wherein each element has a type name, and has at least one of an ancestor element, a child element, a left sibling element, a right sibling element and unformatted text content, the method using a representation of the electronically published document, including, for each element, an indication of any ancestor element, child element, and left and right sibling element and comprising the steps of:

sending a request for a portion of the electronically published document using an indication of a starting point within the electronically published document;

receiving only the requested portion of the electronically published document including selected elements containing the indicated starting point; and

formatting only the text content of the received portion of the electronically published document by applying a format specification corresponding to the received selected elements.

8. The method of claim 7, wherein the representation of the electronically published document includes, for each element, an indication of any text content contained within the element.
 Description Submit all comments and votes
 


FIELD OF THE INVENTION

This invention relates generally to methods and apparatus for formatting documents in electronic or other non-paper media, and more specifically, for generating representations, indexing and rendering on a computer screen of electronic documents. More particularly, this invention relates to indexing and rendering of electronic documents, especially electronic books, having descriptive markup and hierarchical content.

BACKGROUND OF THE INVENTION

Because of the increased availability and use of computers and improved methods of communication between them, it has become common to use non-paper media for transmitting and storing documents. Such media include magnetic and optical disks, tapes and other storage systems. Documents developed and transmitted in such form (hereinafter called electronic documents) are often also viewed on computer display devices and need to be rendered, or displayed, on a computer screen or other output device, in a readable, or formatted form. These systems have become popular for and are particularly useful with very large documents which may be used by many people. Such documents include large system manuals, engineering designs, and the like.

Many currently available computer systems format and display electronic documents, such as word processors having "what-you-see-is-what-you-get" (WYSIWYG) displays, hypertext systems, and desktop publishing systems. These systems permit one view, or display, of a document at a time. However, currently available systems include formatting specifications in the internal, electronic representation of a document and require re-formatting of the whole document if a different format, such as hiding or emphasizing different portions of the text, is desired. Thus, state-of-the-art display processors are not used to their fullest capabilities.

Moreover, most current systems, specifically information-retrieval systems, consider text as a stream of graphic display instructions rather than as a hierarchy of various types of objects which have formatting properties which may be changed. Without the ability to change formatting properties of a document, the document is less useful. For example, the document may not be transferrable between different types of computer systems. Furthermore, even those systems which allow changes to formatting properties of a document require time proportional to the document length for re-formatting. Although this amount of time may be acceptable for small documents, such delays become objectionable during the display of very large documents.

Electronic documents are often developed and viewed with systems having tools for assisting navigation within the document. Such tools include full text indexing and retrieval (i.e. searching) engines, and, particularly for large documents, tables of contents similar to those for printed books.

Full text indexing and retrieval engines normally index every word found in a document and record the number of occurrences of a word and its location(s) within the document. However, most current systems only identify the total number of occurrences of a word at one level, or division of a document. For example, a system may record the total for a book, or the total for each paragraph in a book. Some systems, however, report totals for a few selected levels within a document, but not cumulative totals over all levels of a document. Other systems report whether a word occurs in one level of a document, such as a paragraph (by indicating "yes" or "no"), and cumulate the number of paragraphs in which the word occurs rather than the number of occurrences of the word. These systems fail to take full advantage of more advanced document structures to enable a user to find relevant portions of a document.

It is also common to use a thesaurus, Boolean logic, and context-based retrieval mechanisms along with such indexing and retrieval engines. However, engines with such mechanisms do little to improve the determination of the relevance of portions of a document if separated from document structure. Moreover, such additional searching procedures, especially those which incorporate a thesaurus, require additional setup and time which may be objectionable to a user.

Tables of contents are also used to assist navigation of a document in current systems; however these systems lack more advanced structures which further assist a user in finding relevant portions of a document.

As described above, current systems have failed to provide the fullest capability for a user to navigate readily an electronic document and to manipulate such a document on a variety of output devices in an efficient manner. This failure is due primarily to the conception of text formatting as a sequence of formatting instructions, and to the representation of an electronic document resulting from such a conception. For example, in current systems, format specifications are normally integrated with a document to create a document containing a sequence of display instructions. These format specifications also normally include pagination. However, with electronic and other systems which do not depend on paper, pagination is neither necessary nor desirable. Such systems fail to separate the text content from the text form.

Accordingly, it is an object of the present invention to provide a data processing system and method which permits simultaneously displaying multiple views of various portions of an electronic document, each having its own (possibly distinct) format specification.

It is another object of the present invention to provide a data processing system and method of rendering documents which treats text in a manner separate from formatting properties.

It is a further object of the present invention to provide a data processing system and method for rendering an electronic document which allows changes to the specified format of the document and displays the document with the changed format from a selected viewing location immediately without reformatting the whole document.

It is another object of the present invention to provide a data processing system and method for indexing electronic documents which reports, for selected words, the number of occurrences of that word within each section and subsection of the document.

It is another object of the present invention to provide a data processing system and method for enhancing the ability of a user to determine the relevant portions of a document.

It is another object of the present invention to provide a data processing system and method for generating a representation of an electronic document which enables immediate display and formatting of the document for multiple views, improved determination of relevant portions of the document, simple selection of portions of the document for viewing, and the attachment of private and public annotations.

SUMMARY OF THE INVENTION

In view of the foregoing and other objects of the present invention, there is provided a data processing system and method for generating a representation of an electronic document, for indexing the electronic document to generate the representation, for navigating the electronic document using its representation and for displaying the electronic document, formatted according to a style sheet, on an output device.

The system and method of the present invention is most useful with electronic documents having descriptive markup, such as the Standard Generalized Markup Language (SGML). Descriptive markup is used to denote the function or meaning of portions of the content of the document (such as a "chapter") and normally not the appearance (such as "centered").

Such electronic documents may be understood as a tree-like structure. An element, or node, of the tree is defined by the markup in the electronic document. An element thus may have a parent element, a child element, a left sibling element and/or a right sibling element. An element may also contain text. The text content may be considered to be a child element of its containing element.

Each element in an electronic document is assigned a type name according to its markup. The type name may also include the type names of a parent element and of previous parent elements, thus indicating the context in which an element occurs in the document. Such a type name is called a qualified name. The type name identifying the complete context of an element is the fully-qualified name.

A document is also provided with one or more style sheets for specifying format characteristics for its display. A style sheet includes format characteristics for type names of elements in the document. Not all fully-qualified, qualified or non-qualified names need to have specified format characteristics. Format characteristics include font styles and size, margins and other details relating to appearance and behaviour of the document. The style sheets are normally stored separately from their corresponding document.

In order to display a document on an output device in a manner that enables a user to navigate it readily and to find, as much as possible, its relevant portions, a suitable representation of the document is needed. This representation depends on the capabilities desired to be provided to a user, and should allow quick access to document elements.

One capability provided by the system and method of the present invention is rapid random access to any given point within a document and display of the document from that point. The document tree is traversed starting at the randomly accessed point) from a randomly accessed point in a manner which allows a pre-function to be applied to each element before any of that elements child elements are visited, and which allows a post-function to be applied to each element after its child elements are visited. The pre- and post-functions correspond respectively to the beginning of a markup element (i.e. its start tag) and the end of a markup element (i.e. its end tag). Format specifications are retrieved from the style sheet according to the (possibly qualified) type name of each element visited in the traversal. Format specifications for an element are applied to content of the element. To facilitate traversal of the document from a selected element, there is provided a representation of the document including, for each element, a field for storing the type name of the element and preferably fields for storing an indication of any parent element, any right sibling element, and, for each element having a child element, at least the first child element.

An additional capability provided by the system and method of the present invention is display of a document beginning from a point which occurs after a line break, or other breaking point, in the format. A randomly accessed element may occur in a document at a location which may not be placed after a line break when formatted and displayed. For a more aesthetically pleasing display, the document is traversed backwards to find an element before or after which a breaking point occurs. The breaking point may be found either by examining the format specifications of elements preceding the randomly accessed element or by examining the type of these elements. Preferably, the first suitable element preceding the randomly accessed element is found, with formatting and display of the text of the document beginning with this element. To facilitate the backwards traversal of a document, the representation of a document further includes a field for storing an indication of any left sibling element, and, for each markup element having a child element, a field for storing an indication of the last child element.

A further capability provided by the system and method of the present invention is inheritance of format characteristics. An element selected for display may be displayed according to format specifications of its parent element. Thus, the format specifications for its parent and any preceding parent elements are retrieved and evaluated before traversal of the element. By using a stack, evaluated format specifications for higher levels of a tree may be retained during traversal of the document for display. Caching mechanisms may also be used for retaining evaluated format specifications.

Another capability provided by the system and method of the present invention is scrolling of a displayed document. Scrolling of a displayed document results in the retrieval of the elements which immediately precede or follow the currently displayed element. The document may then be displayed beginning with this preceding or following element. To facilitate the retrieval of these elements, the representation of the document further includes unique element identifiers assigned to each element in the document. The fields of the element directory store the unique identifiers for their corresponding elements. These element identifiers are preferably numbers which are sequentially assigned to the elements in the document in the order that these elements appear in the document.

In another aspect of the present invention, there is provided a system and method for generating a table of contents for a document for assisting a user in finding relevant portions of a document and in selecting a starting point for its display. The document is traversed from the start of the document in a manner similar to the display traversal. During the traversal, for each element of the document which has a title, a record is established. This record includes the element identifier of the element and the element identifier of the next element of the same or higher level in the document. An indication of the level of the element in the tree, or of how many of the elements which contain it have titles, may also be stored. The table of contents may be displayed in a manner similar to the actual document. Preferably, the table of contents may be expanded and contracted by a user to display different levels within a document, such as done by standard outline processors.

In another aspect of the present invention there is provided a system and method for hierarchical full text indexing of a document and means operative in response to a search request for combining results of the index with other document navigation tools, such as a table of contents. Standard full text index and retrieval engines may be used to index text in a document. According to the present invention, the results of the index for a word are stored in a record which includes the element identifiers for text elements in which the word occurs and the number of occurrences of the word in each of those elements. Before results of the index for a word are displayed, in combination with the table of contents, results for all elements in the document are cumulated. Cumulation may be performed at the time of full text index processing or when a search request is made.

In order to provide the above capabilities of the present invention, there is provided an element directory for representing the document, which combines the representations described separately above. The element directory is an array of element descriptors, wherein each descriptor corresponds to an element of the document. The elements are preferably assigned unique element identifiers which are used to access the corresponding element descriptors. An element descriptor for an element preferably includes a field for each of the following types of elements with which it may be associated: a parent element, a first child element, a last child element, a left sibling element and a right sibling element. The element descriptor also preferably includes a field for indicating the fully-qualified type name of the element and a pointer field for any text content or attributes of the element. Furthermore, each element descriptor preferably has the same size.

The fully-qualified type name field is also preferably constant in size for each element descriptor. This constancy may be obtained by generating a fully-qualified name table and indicating in the element descriptor for an element the location in the table of its fully-qualified name. Preferably, the fully-qualified name table is a compressed list and the location of a name in the list is indicated by its offset in the list and the length of the name.

Another aspect of the present invention is the generation of an element directory and a fully-qualified name table from an electronic document having descriptive markup. This process of generating the representation of the document involves parsing the document to identify where markup elements begin and end and which elements contain text. When the document is sequentially parsed, the order that elements appear in a document may be readily determined. Consequently, sequential, numeric, unique identifiers may be assigned to each element of the document. The relationship among elements, such as parent, child and sibling elements, also may be readily determined. Thus the element descriptors may be constructed.

The process of generating the document representation and hierarchical full text indexing of the present invention may be performed by the publisher of the electronic document. A reader of the document may cause the display processes to be performed on this document. When a user selects a document, the table of contents is displayed. The user is also provided with the option of searching for a selected word in a manner which is familiar to those skilled in the art, such as by using "menus".

In another aspect of the present invention a user may also make annotations to a viewed document or maintain lists of selected starting points.

Annotations may be made using standard mechanisms, common in hypertext systems, such as "webs". Such annotations, according to the present invention, include the element identifier of the element of a document to which an annotation is attached. The process of displaying a document searches the web for annotations attached to an element to display an indication to the user that such an annotation exists.

Lists of selected starting points, commonly called "bookmarks," may also be generated by a process which records at least the element identifier of the selected elements. This process may be operative in response to the selection and subsequent display of an element by a user. These lists may also be displayed to allow a user to select a starting point for displaying the document.

All of the annotations and other lists preferably are stored in the same format. This format is preferably readily portable and the same as the format for the document itself. That is, the same descriptive markup language is used to construct these lists.

BRIEF DESCRIPTION OF THE DRAWINGS

The operation and advantages of the present invention will be more fully understood from the detailed description below, which should be read in conjunction with the accompanying drawings, in which:

FIG. 1 is an illustration of a data processing system in which the present invention may be utilized;

FIG. 2 is a block diagram of the data processing system of FIG. 1;

FIG. 3 is a diagrammatic illustration of the hierarchical structure of an example document with which the present invention may be used;

FIG. 4 is an illustration of a sample document with descriptive markup;

FIG. 5 is a diagrammatic illustration of the hierarchical structure of the sample document of FIG. 4;

FIG. 6 is an illustration of an element directory of the present invention, with example values corresponding to the document of FIGS. 4 and 5;

FIG. 7 is an illustration of a fully-qualified name table of the present invention with example values corresponding to the document of FIG. 4;

FIG. 8 is a flowchart describing how a document is parsed to construct the element directory and fully-qualified name table;

FIG. 9 is an illustration of a frequency record of the present invention for full text indexing;

FIG. 10 is a flowchart describing how the frequency record for a word is constructed;

FIG. 11 is a flowchart describing how hierarchical full text indexing is performed for a whole document;

FIGS. 12-14 are example display views as produced by the system and process of the present invention;

FIG. 15 illustrates a preferred embodiment of a style sheet for use with the present invention;

FIG. 16 is a flowchart describing how the table of contents for a document is constructed;

FIG. 17A-C are flowcharts describing how a document is rendered according to the present invention;

FIGS. 18A-C are flowcharts describing how the starting point for rendering is determined;

FIGS. 19A-B are flowcharts describing how a depth-first search is performed on a document;

FIG. 20 is a flowchart describing how an element of a document is traversed, or searched;

FIG. 21 is a flowchart describing how annotations are rendered according to the present invention;

FIG. 22 is a diagrammatic illustration of how annotations, bookmarks, history logs, and directive tasks are attached to a document by the system of the present invention.

DETAILED DESCRIPTION

A data processing system in which the present invention may be used is depicted in FIG. 1. The data processing system 30 includes a main processing unit 32 having a mass storage device 34, such as a disk drive. The mass storage device 34 may be internal (not shown) or external (as shown) to the main unit 32. The data processing system also includes an output device such as a monitor, or graphic display 36 and, optionally, printer 38. The main unit 32, combined with display 36, is preferably programmed to enable multiple simultaneous views, popularly known as "windows", which facilitates providing the user with multiple views of a document. A current embodiment of the invention employs, as the data processing system 30, a Sun-4.TM. workstation running SunOS.TM. Release 4.1 (a trademark of Sun Microsystems, Inc.) or higher. The workstation also includes an X Window Systems.TM. (a trademark of the Massachusetts Institute of Technology) server and a ICCCM Compliant Window Manager as the program or process for enabling multiple views. A Release 4 X Server is recommended for best performance. Further details on installation of an embodiment of the present invention on such a system is provided in the attached Appendix A. Appendix A is a manual entitled "DynaText.TM. System Installation and System Administration Guide Release 1.0". This document should not be printed.

FIG. 2 shows further detail of the structure of the data processing system 30. The main unit 32 includes a processing and arithmetic unit 40 and a memory unit 42 connected to the processing unit via a bus 44. Mass storage 34 is also connected to the memory unit and processing unit via the bus 44 along with the output devices 36 and 38. The memory unit 42 preferably has 8 MB of random-access memory (RAM) and 16 MB of virtual memory. It has also been found that 2 MB RAM is suitable for a data processing system including an IBM-PC compatible machine.

The data processing system may be configured to perform the process of the present invention using a typical programming language such as the "C++" programming language. It should be apparent to those skilled in the art that the present invention is not limited to a specific programming language or data processing system and that other appropriate programming languages and other appropriate data processing systems could also be used.

The system of the present invention receives as its input a document, represented in electronic form, which includes text content, descriptive markup and possibly non-text content. Electronic documents include, but are not limited to, electronic books and operation manuals for large systems, such as for airplane maintenance, etc. The descriptive markup of an input document is interpretable as an ordered hierarchy of content objects, such as illustrated in FIG. 3. That is, the descriptive markup defines a structure including a set of elements which, when taken together, form a tree or similar hierarchical object. A markup element describes the function or meaning, rather than the appearance, of the text which it includes. Elements representing only appearance or format characteristics may be used, but are non-optimal.

In such a document, an element, e.g. element 50 of FIG. 3, may have a parent element (52), a first child element (54), a last child element (56), a left sibling element (58), and a right sibling element (60). In the example just described, a right sibling of element 50 does not exist in the document, and is therefore defined by "nil", or some non-element identifier. Similarly, if an element does not have first or last children elements, left sibling element, or a parent element, the corresponding values are also defined to be `nil` or some other non-element identifier. The text content elements 68 of a document are normally found as the leaves of a tree.

A document may also include other types of elements which do not describe function, meaning or appearance of the text. These types of elements include cross-referencing elements 62 which may be used to link relevant sections of a document or even separate documents. Artwork elements 64 may be used to point to non-text objects, such as graphic raster files, which also may be separate electronic documents.

An example of a descriptive markup language for electronic documents is specified by ISO standard 8879: Standard Generalized Markup Language, or, "SGML". This standard is described in "Information Processing--Text and Office Systems--Standard Generalized Markup Language (SGML)," by the International Organization for Standardization, ISO 8879-1986(E), which is hereby incorporated by reference. Documents in SGML may be created using standard text editors, such as SoftQuad Author/Editor.TM., which is commercially available from SoftQuad, Inc., of Toronto, Ontario, Canada. The "Scribe" word processing language ia a similar document markup language. Other suitable markup languages may also be used.

The preferred embodiment of the present invention provides the capability for rendering documents which comply with the SGML standard. Such documents are preferred because of the acceptance of the standard by publishers and government agencies. SGML compliant documents may be made from other types of documents using commercially available systems. A simple exemplary SGML compliant document is provided in FIG. 4. This example is used to illustrate the process and data structures of the present invention and is not limiting, as the system of the present invention may be used readily with arbitrarily large documents. An SGML document includes markup tags which may be described as start tags, end tags, or empty tags. An empty tag may be understood as being both a start tag and an end tag. In this sample document of FIG. 4, start tag 45 begins a markup element. An end tag, such as end tag 47, ends the corresponding markup element. Elements having start and end tags occurring between the start and end tags of another element (as tags 46 and 48 are between tags 45 and 47) are defined to be children or lower elements of the tree defined by the markup structure. Children at the same level beneath a parent are siblings.

Some of the tags in the descriptive markup of the document may also be empty tags such as tag 49 (FIG. 4). Such empty tags may be used for cross-referencing, referencing other documents, or for referencing graphic or other types of non-text information, etc. These tags often have attributes which are variables, such as "file", to which are assigned values, such as "myfig12". These attributes may be interpreted when the document is rendered to retrieve graphics files, etc. Normal start tags 45 may also include attributes which are often useful for marking text which is to be hidden for security or other reasons, or for attaching a unique identifier for an element for cross-referencing or other uses. For example, when a document is rendered, an attribute for a start tag may be examined, and if the attribute has a predetermined value, display of that material may be prevented or modified, thus providing security for a document.

FIG. 5 is a representation of the tree structure generated from the sample SGML document of FIG. 4. Reference numbers 70-89 have been assigned to the elements defined by the markup structure of the SGML document. It is preferable to assign sequential numbers, or element identifiers, to each element appearing in the document according to the order of appearance of these elements in the document. These element identifiers are used in the generation of the document representation of the present invention, the element directory 91 (FIG. 6), which is used to improve navigation of the document.

The data structure of FIG. 6, the element directory 91, is an array of element descriptors 90. Each element descriptor 90 represents an element of the document. In the preferred embodiment, an element descriptor 90 is easily retrieved from the array on the basis of the element identifier which is assigned to its corresponding element. The element descriptor 90 includes a field 92 for representing the parent of the element, a field 94 for representing the first child, a field 96 for representing the last child, a field 98 for representing a left sibling, a field 100 for representing a right sibling, a field 102 for representing the type of the element, and a field 104 for representing the location of text characters for a text chunk or the location of other data associated with the element such as attributes. Those fields which represent elements, such as parent, child and sibling elements, preferably contain the element identifiers assigned to those elements.

The above-described representation of an element descriptor may be further optimized for documents which are not modified after its element directory is generated. In this case, the element identifier of a first child of an element is always the immediately succeeding element identifier of that element. Thus, this field may be reduced to a one-bit representation, e.g. `1` may indicate that there is a first child and `0` that there are no children.

Another variation for the element directory 91 may include element descriptors 90 of variable size. Since a descriptor 90 may have a few NIL values, the size of the corresponding fields may be reduced. An element descriptor 90 may then be accessed from a file according to the offset or location in the file and length of the descriptor 90. Element identifiers assigned to element descriptors may be mapped to the values of the offset and length of their corresponding element descriptors. Such a modification may reduce the size of the element directors 91, but increases the time it takes to access an element descriptor.

In the example of FIG. 6, element descriptor 90 corresponds to element 70 of FIG. 5. Since element 70 does not have a parent element, parent field 92 includes a non-element value. Similarly, left and right sibling fields 98 and 100 also include non-element values. Field 102 includes a representation that element 70 is of the type, "book".

It is preferable that the size of element type field 102 remain constant across all element descriptors. In the preferred embodiment of the present invention the element type in field 102 is represented by a pointer to another data structure, or data file, called the fully-qualified name table. The fully-qualified name table is a list of element types encountered in the document. The pointer includes a representation of the offset, or location, of the element type in the fully-qualified name table and possibly the length of the type name.

A preferred embodiment of a fully-qualified name table is represented as a compressed list in FIG. 7. The list is compressed by representing as many sequential types as possible in a compressed form. That is, rather than having a list of:

"BOOK"

"BOOK,FRONTMATTER"

The list is compressed to "BOOK,FRONTMATTER". Thus, repeated occurrences of a type name may be eliminated. The table of FIG. 7 corresponds to the example document represented by FIGS. 4-6 and is to be understood as a stream of characters. Thus, as an example, field 102 for element 70 (of type "BOOK") would show an offset of 0 and a length of 4, since the first occurrence of "BOOK" is at the beginning of the table and has a length of four characters. Similarly, the entry for field 102 for element 76, i.e. the element whose parent is 70 and first child is 77, would have an offset of 47 and a length of 9, since the first occurrence of "BOOK, BODY" occurs at the 47th character in the table and is 9 characters long. Likewise, element 71 ("BOOK, FRONTMATTER") has an offset of 0 and a length of 16. Various other methods of representing a fully-qualified name for the element may be used, such as a list of fully-qualified names retrieved according to their placement in the list. However, the preferred embodiment should reduce the size of this table sufficiently to allow the fully-qualified name table to be loaded into RAM.

Referring now to FIG. 8, the process for generating an element directory, such as exemplified by FIG. 6, and a fully-qualified name table, as exemplified by FIG. 7, for a document having descriptive markup, will now be described.

The process of indexing a document, i.e. generating the element directory and other data structures, begins with Step 110 of initialization. In Step 110 of initialization, a variable, e.g. "EID", is set to provide an initial element identifier. In the preferred embodiment, this variable is set to 0. A stack, called the open-element stack, is created and is initially empty or supplied with a default element, e.g. "#ROOT", for the first element identifier. The qualified name for the current element is also held in this stack and is initially empty or "#ROOT". Three file objects are also created, in the step 110 of initialization, on the mass storage device 34. These file objects are called the element directory, the fully-qualified name table and the text content. The element directory, fully-qualified name table and text content of a document are written to these file objects, respectively, during the indexing process. Creating these file objects and writing to them are normally handled by instructions to the the operating system of the data processing system 30, to open a file and write to it.

After the Step 110 of initialization, Step 112 of retrieving a token from a parser is performed next. A suitable parser processes the provided electronic document and, for each markup tag for an element including start and end tags and for each text chunk, returns a token indicating the type of tag and its location in the document. Markup tags include start tags, corresponding end tags, and text chunks. There also may be tags representing empty elements, which are essentially combined start and end tags as described above in connection with FIG. 5. In the preferred embodiment, for documents in SGML, parsing is simplified if the provided document is in normalized, or "minimal", form. This form of an SGML document is defined by the standard mentioned above in section 15.1.2 thereof. Parsers and normalizers for SGML are well known. For example, the XGML.TM. Engine and the XGML.TM. Normalizer, both available from Exoterica Corporation of Ottawa, Ontario, Canada may be used for validating, parsing and normalizing SGML documents.

After a token is retrieved from the parser (in Step 112), it is determined in step 114 whether it is a start tag or an empty tag. The parser indicates the type of the token. If the token is a start or empty tag, a new element descriptor 90 (FIG. 6), is established in step 116, in the element directory 91 with an element identifier of "EID" plus 1. The parent field 92 for the new element descriptor 90 is set to the element identifier of the element on the top of the open-element stack. For the first element descriptor 90 established for a document, parent field is `0`, or other suitable initial value.

Processing continues with step 118 of saving attributes of the current token in the text file and placing the location in the text file of the attributes into the text location field 104 of the newly established element descriptor 90 for this element in the element directory 91.

Next, in step 120, the new element is established as the last child of the element at the top of the open element stack, which is its parent. That is, the element identifier of the new element is written in the last child field 96 of the element descriptor 90 of its parent. If another element is already listed as the last child of the top-of-stack element, this other element is attached as the left sibling (field 98) of the new element. Likewise, the new element is indicated to be the right sibling (field 100) of the old last child of the top-of-stack element. If no last child is indicated in the parent element descriptor 90, the new element is also indicated to be the first child (field 94) of the parent (top-of-stack) element. Also, as part of step 120, the element at the top of the open element stack, i.e. the parent element, is indicated to be the parent (field 92) of the new element. The first and last child fields 94 and 96 of the new element descriptor 90 are set to `NIL` or other non-element identifier value.

After the new element is established in the element directory 91, processing continues with step 122 of appending the tag to the current fully-qualified name, which is retrieved from the open element stack. The current name is attached using a reserved delimiter, e.g. a comma, or other character which is not a character used in any of the descriptive markup tags of a document. The fully-qualified name table (FIG. 7) is then searched (step 124) for the appended name. If the fully-qualified name is not found, as checked in step 126, the new name is added to the fully-qualified name table in step 128. Once it is verified, in steps 126 and 128, that the fully-qualified name for this new element is in the fully-qualified name table (FIG. 7), the pointer field 102 is then set to indicate the offset, or location, in the fully-qualified name table and the length of the name (step 130).

After the new element is established in the element directory 91, and the appropriate entry is made to the fully-qualified name table, processing continues with step 132 of determining whether the current tag is an empty tag. Since an empty tag is considered to include an end tag, if the test of step 132 returns true, the variable "EID" is incremented by 1, but this new element is not pushed onto (i.e. stored in) the open element stack (step 134). If the tag is not an empty tag, the element identifier of the new element and its fully-qualified name is pushed onto the open element stack and the variable "EID" is incremented by 1 in step 136. With some parsers, however, an empty tag produces both the start tag and corresponding end tag for the markup element. In this case, steps 132 and 134 may be omitted. Some descriptive markup systems other than SGML may not have "empty" tags, in which case steps 132 and 134 may also be omitted. If the parser does not produce separate start and end tags for an empty tag, steps 132 and 134 are necessary. After step 134 or step 136 of incrementing and pushing, processing continues with step 112, described above, of retrieving the next token from the parser.

Having described the processing of start tags, the processing of end tags and text chunks will now be described.

After step 112 of retrieving the next token, and if the next token is determined not to be a start or empty tag in step 114, processing continues with step 138 of determining whether the next token is an end tag. Since an end tag ends a descriptive markup element in a document, an end tag results in popping (i.e. removing the top element from) the open element stack in step 140 and returning to step 112 of retrieving the next token from the parser.

If the next token is neither a start tag, as determined in step 114, nor an end tag, as determined in step 138, processing continues with the step 141 of determining whether the next token is a text chunk. If the tag is determined not to be text, the token denotes the end of the file and processing concludes. Otherwise, processing continues with step 142 of establishing a new element descriptor 90 for the text chunk in the element directory 91, in a manner similar to step 116. In the preferred embodiment, the text chunk is attached as the last child of the element from the top of the open element stack, in a manner similar to step 120 of attaching a new element (step 144). The type name for the text chunk is also stored. The type name may be a reserved name, such as "#TEXT", in place of an offset and length into the fully-qualified name table. Thus, a type name for text elements need not be stored in that table. Optimally, one bit may be used to indicate that the element is or is not a