WikiPatents - Community Patent Review
Method and apparatus for identifying textual documents and multi-media files corresponding to a search topic
United States Patent 5742816
Link to this page: http://www.wikipatents.com/5742816.html
Inventor(s): Barr; Thomas (Ft. Wash, MD); Husick; Lawrence A. (Wayne, PA); Krupit; Michael S. (Newtown, PA); Morgan; Howard (Villanova, PA); Weinberger; Marvin I. (Havertown, PA)
Abstract: A method and apparatus for identifying textual documents and multi-media files corresponding to a search topic. A plurality of document records, each of which is representative of at least one textual document, are stored, and a plurality of multi-media records, each of which is representative of at least one multi-media file, are also stored. The document records have text information fields associated therewith, each of the text information fields representing text from one of the plurality of textual documents. The multi-media records have multi-media information fields for representing only digital video or audio information and associated text fields, each of the associated text fields representing text associated with one of the multi-media information fields. A single search query corresponding to the search topic is received. The single search query is preferably in a natural language format. An index database is searched in accordance with the single search query to simultaneously identify document records and multi-media records related to the single search query. The index database has a plurality of search terms corresponding to terms represented by the text information fields and the associated text fields. The index database also includes a table for associating each of the document and multi-media records with one or more of the search terms. A search result list having entries representative of both textual documents and multi-media files related to the single search query is generated in accordance with the document records and the multi-media records identified by the index database search. Text corresponding to the search topic is retrieved by selecting entries from the search result list representing document records to be retrieved, and then retrieving text represented by the text information fields associated with the selected document records. 
Digital video or audio information corresponding to the search topic is retrieved by selecting entries from the search result list representing selected multi-media records to be retrieved, and then retrieving digital video or audio information represented by multi-media information fields associated with the selected multi-media records.
Owner/Assignee: Infonautics Corporation (Wayne, PA)
Publication Date: April 21, 1998
Application Number: 08/529,250
Filing Date: September 15, 1995
US Classification: 707/3 707/104.1 715/501.1 715/515
Int'l Classification: G06F 017/30
Examiner: Lintz; Paul R.
Attorney/Law Firm: Reed Smith Shaw & McClay
USPTO Field of Search: 395/615 395/603 395/762 395/777 395/807 395/805
Claims
 


What is claimed is:

1. A method for identifying textual documents and multi-media files corresponding to a search topic, comprising the steps of:

(A) storing document records each of which is representative of one of a plurality of textual documents, said document records having text information fields associated therewith, each of said text information fields representing text from one of said plurality of textual documents;

(B) storing multi-media records each of which is representative of one of a plurality of multi-media files, said multi-media records having multi-media information fields for representing only digital video or audio information and associated text fields, each of said associated text fields representing text associated with one of said multi-media information fields;

(C) receiving a single search query corresponding to said search topic;

(D) searching an index database in accordance with said single search query to simultaneously identify document records and multi-media records related to said single search query, said index database having a plurality of search terms corresponding to terms represented by said text information fields and said associated text fields, said index database including a table for associating each of said document and multi-media records with one or more of said search terms;

(E) generating a search result list having entries representative of both textual documents and multi-media files related to said single search query in accordance with said document records and said multi-media records identified in step (D);

(F) retrieving text corresponding to said search topic by selecting entries from said search result list representing selected document records to be retrieved, and then retrieving text represented by text information fields associated with said selected document records; and

(G) retrieving digital video or audio information corresponding to said search topic by selecting entries from said search result list representing selected multi-media records to be retrieved, and then retrieving digital video or audio information represented by multi-media information fields associated with said selected multi-media records.

2. The method of claim 1, wherein said document records and said multi-media records are formed from header files stored in a single common format on said database.

3. The method of claim 2, wherein said multi-media records include a plurality of still image records each of which is representative of a still image.

4. The method of claim 3, wherein said multi-media records include a plurality of motion video records each of which is representative of a sequence of motion video frames.

5. The method of claim 4, wherein said multi-media records include a plurality of digital audio records each of which is representative of a sequence of digital audio frames.

6. The method of claim 1, wherein step (E) further comprises the step of relevance ranking said document and multi-media records identified in step (D) by generating a relevance score corresponding to each of said entries in said search result list.

7. The method of claim 6, wherein step (E) further comprises the step of forming a relevance ordered search result list by ordering said entries in said search result list in accordance with said relevance ranking such that an entry with a highest relevance ranking represents a first entry on said relevance ordered search result list.

8. The method of claim 7, wherein entries corresponding to said document records identified in step (D) and entries corresponding to said multi-media records identified in step (D) are interspersed within said relevance ordered search result list.

9. The method of claim 1, wherein said single search query is in a natural language format.

10. An apparatus for identifying textual documents and multi-media files corresponding to a search topic, comprising:

(A) means for storing document records each of which is representative of one of a plurality of textual documents and multi-media records each of which is representative of one of a plurality of multi-media files, said document records having text information fields associated therewith, each of said text information fields representing text from one of said plurality of textual documents, said multi-media records having multi-media information fields for representing only digital video or audio information and associated text fields, each of said associated text fields representing text associated with one of said multi-media information fields;

(B) means for receiving a single search query corresponding to said search topic;

(C) searching means, coupled to an index database and said means for receiving said single query, for searching said database in accordance with said single search query to simultaneously identify document records and multi-media records related to said single search query, said index database having a plurality of search terms corresponding to terms represented by said text information fields and said associated text fields, said index database including a table for associating each of said document and multi-media records with one or more of said search terms;

(D) search result list generation means, coupled to said searching means, for generating a search result list having entries representative of both textual documents and multi-media files related to said single search query in accordance with said document records and said multi-media records identified by said searching means;

(E) means for receiving signals representing selected document records and selected multi-media records identified on said search results list;

(F) first means for retrieving, from said means for storing, text represented by text information fields associated with said selected document records; and

(G) second means for retrieving, from said means for storing, digital video or audio information represented by multi-media information fields associated with said selected multi-media records.

11. The apparatus of claim 10, wherein said document records and said multi-media records are formed from header files stored in a single common format on said database.

12. The apparatus of claim 11, wherein said multi-media records stored on said database include a plurality of still image records each of which is representative of a still image.

13. The apparatus of claim 12, wherein said multi-media records stored on said database include a plurality of motion video records each of which is representative of a sequence of motion video frames.

14. The apparatus of claim 13, wherein said multi-media records stored on said database include a plurality of digital audio records each of which is representative of a sequence of digital audio frames.

15. The apparatus of claim 10, wherein said search result list generating means includes means for relevance ranking said document and multi-media records identified by said searching means by generating a relevance score corresponding to each of said entries in said search result list.

16. The apparatus of claim 15, wherein said result list generating means further comprises means for forming a relevance ordered search result list by ordering said entries in said search result list in accordance with said relevance ranking such that an entry with a highest relevance ranking represents a first entry on said relevance ordered search result list.

17. The apparatus of claim 16, wherein entries corresponding to said document records identified by said searching means and entries corresponding to said multi-media records identified by said searching means are interspersed within said relevance ordered search result list.

18. The apparatus of claim 10, wherein said single search query is in a natural language format.
Description
 


FIELD OF THE INVENTION

The present invention is directed to systems for identifying documents corresponding to a search topic or query. More particularly, the present invention is directed to an automated multi-user system for identifying and retrieving text and multi-media files related to a search topic from a database library composed of information from many various publisher sources.

BACKGROUND OF THE INVENTION

Information retrieval systems are designed to store and retrieve information provided by publishers covering different subjects. Both static information, such as works of literature and reference books, and dynamic information, such as newspapers and periodicals, are stored in these systems. Information retrieval engines are provided within prior art information retrieval systems in order to receive search queries from users and perform searches through the stored information. It is an object of most information retrieval systems to provide the user with all stored information relevant to the query. However, many existing searching/retrieval systems are not adapted to identify the best or most relevant information yielded by the query search. Such systems typically return query results to the user in such a way that the user must retrieve and view every document returned by the query in order to determine which document(s) is/are most relevant. It is therefore desirable to have a document searching system which not only returns a list of relevant information to the user based on a query search, but also returns the list to the user in such a form that the user can readily identify which information returned from the search is most relevant to the query topic.

Existing systems for searching and retrieving files from databases based on user queries are directed primarily to the searching and retrieval of textual documents. However, there is a growing volume of multi-media information being published which is not textual. Such multi-media information corresponds, for example, to still images, motion video sequences and digital audio sequences, which may be stored and retrieved by digital computers. It would be desirable from the point of view of an individual using an information searching/retrieval system to be able to query a library or database and identify not only text documents, but also multi-media files that are relevant to the user's query. Moreover, it would be desirable if the searching system could return to the user not only a single list having both text and multi-media information relevant to the query search, but also a list which enabled the user to readily identify which of the text and multi-media files were most relevant to the query topic.

Each different publisher providing documents that may be retrieved by information retrieval systems typically uses its own information format to store and transmit its information files. Thus, an information searching/retrieval system which has a library database based upon information from many various publishers must be compatible with many different publisher formats. This compatibility requirement can serve to slow the performance of an information searching/retrieval system.

It is well known in the prior art of information retrieval systems to permit a user to specify a single subject out of a number of subjects for searching. For example, a user may wish to search only sports literature, medical literature or art literature. This avoids unnecessary searching through database documents that are not relevant to the subject of interest to the user. In order to provide this capability, information retrieval systems must categorize documents received from publishers according to their subject prior to adding them to the database. Subjecting of incoming documents often requires an individual to read each incoming document and make a determination regarding its subject. This process is very time consuming and expensive, as there is often a large number of incoming documents to be processed. The subjecting process may be further complicated if certain documents should properly be categorized in more than one subject. It would be desirable to have an automated system for processing incoming documents which categorized each incoming document into one or more subjects, and which did not require an individual to read each incoming document and make a separate judgment categorizing the subject of such document.

When a user of an information searching/retrieval system enters a search query into the system, the query must be parsed. Based on the parsed query, a listing of stored documents relevant to the query is provided to the user for review. In the prior art, it is known to use semantic networks when parsing a query. Semantic networks make it possible to identify words not appearing in the query, but which correspond to or are associated with the words used in the query. The number of words used to search the database is then expanded by including the corresponding words or associated words identified by the semantic network in the search instructions. This procedure is used to increase the number of relevant documents located by the information searching/retrieval system. Although semantic networks may be useful for finding additional relevant documents responsive to a query, it is believed that use of such networks also tends to increase the number of irrelevant documents located by the search. In fact, it is generally believed that the number of additional relevant documents identified through the use of semantic networks is roughly equal to the number of irrelevant documents which are also brought into the search results list as a result of the semantic network. It would be desirable to have a system for implementing a semantic network which maximized the number of relevant documents identified during the search, without substantially increasing the number of irrelevant documents found by the search.

Many publishers that provide documents to information retrieval systems require record-keeping in order to ensure accurate royalty payments. Record-keeping permits the publishers to determine the interest level in various documents produced by the publisher, and the demographics of users retrieving such documents. Thus, it would be desirable to have a searching/retrieval system that tracked not only how often each document stored in the system database was retrieved by users, but also the demographics of the users retrieving the documents and the query searches used to identify and retrieve such documents.

It is therefore an object of the present invention to provide a searching/retrieval system which can query a library or database and identify not only text documents, but also multi-media files stored on the library or database that are relevant to the query.

It is a further object of the present invention to provide a searching/retrieval system that accepts a query and returns a single search results list having both text and multi-media information, which list is presented in a format that enables the user to readily identify which of the text and multi-media files are most relevant to the query topic.

It is a still further object of the present invention to provide a scalable computer architecture for implementing a searching/retrieval system which can query a database and identify text documents and multi-media files stored on the database that are relevant to the query.

It is a still further object of the present invention to provide an information searching/retrieval system which has a library database based upon information from many various publishers, and which is compatible with many different publisher formats.

It is a still further object of the present invention to provide an information searching/retrieval system which has a library database based upon information from many various publishers, and wherein such information is stored in a central database in one or more common information formats.

It is a still further object of the present invention to provide an automated system for processing incoming documents to be stored on a library or database, which system categorizes each incoming document into one or more subjects, and which does not require an individual to read each incoming document and make a separate judgment categorizing the subject of such document.

It is a still further object of the present invention to provide a system for implementing a semantic network which maximizes the number of relevant documents identified during the query search, without substantially increasing the number of irrelevant documents found by the search.

It is a still further object of the present invention to provide a system for using a semantic network which maximizes the number of relevant documents identified during a query search by semantically expanding the search in response to the part of speech associated with each query term in the search.

It is a still further object of the present invention to provide a searching system that queries a database to determine text documents and multi-media files relevant to the query, wherein weightings associated with proper nouns and slow words are adjusted prior to searching the database.

It is a further object of the present invention to provide a searching/retrieval system that accepts a query and returns a single search results list including document relevance values, wherein the document relevance values are independent of the number of terms in the query.

It is yet a still further object of the present invention to provide a searching/retrieval system that tracks not only how often each document stored in the system database was retrieved by users, but also the demographics of the users retrieving the documents and the query searches used to identify and retrieve such documents.

These and other objects and advantages of the invention will become more fully apparent from the description and claims which follow or may be learned by the practice of the invention.

SUMMARY OF THE INVENTION

The present invention is directed to a method and apparatus for identifying textual documents and multi-media files corresponding to a search topic. A plurality of document records, each of which is representative of at least one textual document, are stored, and a plurality of multi-media records, each of which is representative of at least one multi-media file, are also stored. The document records have text information fields associated therewith, each of the text information fields representing text from one of the plurality of textual documents. The multi-media records have multi-media information fields for representing only digital video (i.e., still images or motion video image sequences), digital audio or graphics information, and associated text fields, each of the associated text fields representing text associated with one of the multi-media information fields. A single search query corresponding to the search topic is received. The single search query is preferably in a natural language format. An index database is searched in accordance with the single search query to simultaneously identify document records and multi-media records related to the single search query. The index database has a plurality of search terms corresponding to terms represented by the text information fields and the associated text fields. The index database also includes a table for associating each of the document and multi-media records with one or more of the search terms. A search result list having entries representative of both textual documents and multi-media files related to the single search query is generated in accordance with the document records and the multi-media records identified by the index database search. 
Text corresponding to the search topic is retrieved by selecting entries from the search result list representing document records to be retrieved, and then retrieving text represented by the text information fields associated with the selected document records. Digital video, audio or graphics information corresponding to the search topic is retrieved by selecting entries from the search result list representing selected multi-media records to be retrieved, and then retrieving digital video, audio or graphics information represented by multi-media information fields associated with the selected multi-media records.
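
As a rough illustration of the index search summarized above, the following Python sketch stores document records and multi-media records side by side, builds one term table from their text fields, and answers a single query against both kinds of record. The `Record` class, the whitespace tokenizer, and the score (the count of distinct query terms matched) are assumptions made for this example only; the summary does not specify them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical record; `kind` is "document" or "multimedia"."""
    record_id: str
    kind: str
    text: str          # document text, or text associated with a media file
    payload: str = ""  # stand-in for stored video or audio data


def tokenize(text: str) -> list[str]:
    # Assumed tokenizer: split on whitespace, strip punctuation, lowercase.
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return [w for w in words if w]


def build_index(records: list[Record]) -> dict[str, set[str]]:
    """Term table associating each search term with the ids of records that use it."""
    index: dict[str, set[str]] = defaultdict(set)
    for rec in records:
        for term in tokenize(rec.text):
            index[term].add(rec.record_id)
    return index


def search(query: str, index: dict[str, set[str]], records: list[Record]):
    """Return one result list mixing both record kinds, best match first."""
    by_id = {r.record_id: r for r in records}
    scores: dict[str, int] = defaultdict(int)
    for term in set(tokenize(query)):
        for rid in index.get(term, ()):
            scores[rid] += 1  # assumed score: distinct query terms matched
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(by_id[rid], score) for rid, score in ranked]


records = [
    Record("d1", "document", "The Apollo program landed astronauts on the moon."),
    Record("m1", "multimedia", "Audio of the Apollo 11 moon landing", payload="<audio bytes>"),
    Record("d2", "document", "A history of jazz recordings."),
]
index = build_index(records)
results = search("apollo moon landing", index, records)
for rec, score in results:
    print(rec.record_id, rec.kind, score)
```

Because the multi-media record is indexed only through its associated text, the same term table serves both record kinds, and the two kinds can be interspersed in one relevance-ordered list.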

In accordance with a further aspect, the present invention is directed to a computer-implemented method and apparatus for composing a composite document on a selected topic from a plurality of information sources by searching the plurality of information sources and identifying, displaying and copying files corresponding to the selected topic. A plurality of records, each of which is representative of at least one information file, are stored in a database. A single search query corresponding to the search topic is received. The database is searched in accordance with the single search query to identify records related to the single search query. A search result list is then generated having entries representative of information files identified during the database search, and the search result list is displayed in a first display window open on a user display. Signals representative of at least first and second selected entries from the search result list are received from the user, the first and second selected entries respectively corresponding to first and second information files. A second display window for displaying at least a portion of the first information file is opened on the user display, a third display window for displaying at least a portion of the second information file is opened on the user display, and a document composition window for receiving portions of the first and second information files is opened on the user display. The composite document is then composed by copying portions of the first and second information files from the second and third display windows, respectively, to the document composition window.

In accordance with a still further aspect, the present invention is directed to a split-server architecture for processing a search query provided by a user, and identifying and retrieving documents from a database corresponding to the search query. A session server is provided for receiving the search query from the user. The session server has at least a first processor coupled to the user over a communications channel. A query server is coupled to the session server. The query server has at least a second processor coupled to a first database having records representative of the documents to be searched. The query server includes means for receiving the search query from the session server, searching means for searching the first database to identify documents responsive to the search query, and means for sending search results information representative of the documents identified by the searching means from the query server to the session server. The session server includes means for sending the search query to the query server, means for receiving the search results information from the query server, means for sending a search results list representative of the search results information across the communications channel to the user, means for receiving a document retrieval request transmitted from the user over the communications channel, means for retrieving a document in response to the retrieval request and transmitting a file representative of the document to the user over the communications channel, and means for incrementing an accounting record on an accounting database coupled to the session server, the accounting record representing a number of retrievals of the document by the session server.
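
The division of labor between the two servers can be sketched in a few lines of Python. This is a single-process illustration only: the class names, the in-memory dictionaries standing in for the databases, and the keyword match used by the query server are all assumptions, and a real deployment would place the two servers on separate processors connected by a communications channel.

```python
from collections import Counter


class QueryServer:
    """Searches its database and returns the ids of matching documents."""

    def __init__(self, database: dict[str, str]):
        self.database = database  # assumed layout: doc_id -> searchable text

    def search(self, query: str) -> list[str]:
        terms = set(query.lower().split())
        return sorted(
            doc_id
            for doc_id, text in self.database.items()
            if terms & set(text.lower().split())
        )


class SessionServer:
    """Faces the user: forwards queries, serves documents, counts retrievals."""

    def __init__(self, query_server: QueryServer, documents: dict[str, str]):
        self.query_server = query_server
        self.documents = documents
        self.accounting = Counter()  # stand-in for the accounting database

    def handle_query(self, query: str) -> list[str]:
        # The session server does no searching itself; it delegates.
        return self.query_server.search(query)

    def handle_retrieval(self, doc_id: str) -> str:
        document = self.documents[doc_id]
        self.accounting[doc_id] += 1  # one accounting increment per retrieval
        return document


docs = {"a": "solar power basics", "b": "wind power economics", "c": "jazz history"}
session = SessionServer(QueryServer(docs), docs)
hits = session.handle_query("power")
session.handle_retrieval("a")
session.handle_retrieval("a")
print(hits, dict(session.accounting))
```

Keeping the retrieval counter on the session server side means the query server can be replicated or scaled without touching the accounting records.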

In accordance with a still further aspect, the present invention is directed to a method for preparing input information having differing input formats from different information sources for storage in an information retrieval system having a database with a database index for retrieval of the input information from the database. First and second input information having differing input information formats are received. The input information in one format is converted from the input format to an information retrieval system format to provide reformatted information. The information from the other information format is converted into the information retrieval system format to provide further reformatted information, whereby the input information in the differing input formats is converted into a single information retrieval system format. The reformatted information is stored in the database according to the single information retrieval system format and retrieved from the database according to the single information retrieval system format.
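
A minimal sketch of this conversion step follows, assuming two invented publisher formats (a pipe-delimited line and a JSON object) and an invented `SystemRecord` type standing in for the single information retrieval system format. None of these names or formats come from the text above.

```python
import json
from dataclasses import dataclass


@dataclass
class SystemRecord:
    """Assumed single retrieval-system format shared by all stored items."""
    title: str
    body: str
    source: str


def from_publisher_a(raw: str) -> SystemRecord:
    # Hypothetical publisher A delivers "title | body" lines.
    title, body = raw.split("|", 1)
    return SystemRecord(title.strip(), body.strip(), "publisher_a")


def from_publisher_b(raw: str) -> SystemRecord:
    # Hypothetical publisher B delivers JSON with "headline" and "text" keys.
    data = json.loads(raw)
    return SystemRecord(data["headline"], data["text"], "publisher_b")


# One converter per input format; every converter emits the same record type.
CONVERTERS = {"publisher_a": from_publisher_a, "publisher_b": from_publisher_b}


def ingest(source: str, raw: str, database: list[SystemRecord]) -> SystemRecord:
    record = CONVERTERS[source](raw)
    database.append(record)
    return record


database: list[SystemRecord] = []
ingest("publisher_a", "Moon Landing | Astronauts walked on the moon.", database)
ingest("publisher_b", '{"headline": "Jazz", "text": "A short history."}', database)
print([rec.title for rec in database])
```

Once every item is stored in the common format, indexing and retrieval code needs no knowledge of the original publisher formats.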

In accordance with a still further aspect, the present invention is directed to a method for determining a part of speech of words in a sentence or sentence fragment. A hidden Markov model for determining the most likely part of speech for the words in the sentence or sentence fragment is provided, wherein the hidden Markov model has an initial transition matrix and a subsequent transition matrix for storing the probabilities of transitions from one part of speech to another. The initial matrix of the hidden Markov model is effectively removed by making the probabilities therein equal to each other to provide a modified hidden Markov model. The modified hidden Markov model is applied to the sequence of words to determine the most likely part of speech of words within a sentence fragment with increased accuracy.
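
The effect of equalizing the initial probabilities can be illustrated with a small Viterbi decoder in Python. The three-tag tag set and all transition and emission probabilities below are invented toy values, and the smoothing constant for unseen words is an assumption; only the use of a uniform initial distribution reflects the paragraph above.

```python
import math

TAGS = ["DET", "NOUN", "VERB"]

# Toy transition probabilities P(next tag | current tag); values are invented.
TRANSITION = {
    "DET":  {"DET": 0.05, "NOUN": 0.90, "VERB": 0.05},
    "NOUN": {"DET": 0.10, "NOUN": 0.30, "VERB": 0.60},
    "VERB": {"DET": 0.50, "NOUN": 0.40, "VERB": 0.10},
}

# Toy emission probabilities P(word | tag); values are invented.
EMISSION = {
    "DET":  {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.5, "walk": 0.2, "park": 0.3},
    "VERB": {"walk": 0.7, "barks": 0.3},
}

SMOOTH = 1e-6  # assumed probability for a word never seen with a tag


def emit(tag: str, word: str) -> float:
    return EMISSION[tag].get(word, SMOOTH)


def viterbi_uniform_start(words: list[str]) -> list[str]:
    """Most likely tag sequence when every tag is equally likely at the first word."""
    initial = {tag: 1.0 / len(TAGS) for tag in TAGS}  # the equalized initial matrix
    best = [{t: math.log(initial[t]) + math.log(emit(t, words[0])) for t in TAGS}]
    back = []
    for word in words[1:]:
        scores, pointers = {}, {}
        for tag in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + math.log(TRANSITION[p][tag]))
            scores[tag] = (best[-1][prev] + math.log(TRANSITION[prev][tag])
                           + math.log(emit(tag, word)))
            pointers[tag] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the most likely final tag.
    tag = max(TAGS, key=lambda t: best[-1][t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return list(reversed(path))


print(viterbi_uniform_start(["walk", "the", "dog"]))
```

With a uniform start, the tag chosen for the first word depends only on its emission probabilities and on the transitions that follow it, so a fragment that begins mid-sentence is not pulled toward tags that typically open a full sentence.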

In accordance with yet a further aspect, the present invention is directed to a method for storing input information in an information retrieval system database wherein a plurality of information subject categories are provided. A plurality of subject lexicons are provided, each subject lexicon of the plurality of subject lexicons corresponding to an information subject category of the plurality of information subject categories. Each subject lexicon contains information representative of its corresponding information subject category. The input information is compared with the subject lexicons and the input information is stored in a selected information subject category according to the comparing of the input information with the subject lexicons.
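A minimal Python sketch of comparing input information against subject lexicons follows. The lexicon contents, the word-overlap score, and the function name `classify` are assumptions chosen for brevity; the specification does not state that this particular comparison is used.

```python
def classify(text, lexicons):
    """Pick the subject whose lexicon shares the most words with the text."""
    words = set(text.lower().split())
    scores = {subject: len(words & lexicon) for subject, lexicon in lexicons.items()}
    return max(scores, key=scores.get)


# Hypothetical lexicons, one per subject category.
lexicons = {
    "sports": {"team", "score", "season", "coach"},
    "finance": {"stock", "market", "shares", "bond"},
}

# Store the input under the selected subject category.
store = {subject: [] for subject in lexicons}
document = "The coach praised the team after a record season"
store[classify(document, lexicons)].append(document)
```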

In accordance with yet a further aspect, the present invention is directed to a method for storing information in an information retrieval system having a database for retrieval of the input information in response to a query. Text information representative of text is received for storing in the system. Image information representative of an image is also received for storing in the system. Additionally, image text information representative of text associated with the image information is received. The image information is stored in an image information format. The text information and the image text information are stored in a common text information format whereby the format of the stored text information is identical to the format of the stored image text information. The text information and image text information are searched in the common text information format and the text information and image text information are identified in response to a single query. The image information associated with the retrieved image text information is selected and the selected image information is retrieved whereby the text information and the image information are retrieved in accordance with the same query.
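The idea of one query reaching both documents and images through a common text format can be sketched in Python as follows. The record layout (`id`, `kind`, `text`, `image`) and the word-overlap search are illustrative assumptions, and the image bytes are a placeholder.

```python
# Documents and image captions share the same "text" field format;
# only image records carry separately stored image data.
records = [
    {"id": "doc1", "kind": "text", "text": "Hurricane damages coastal towns"},
    {"id": "img1", "kind": "image", "text": "Satellite photo of the hurricane",
     "image": b"<image bytes>"},
    {"id": "doc2", "kind": "text", "text": "Election results announced"},
]


def search(query, records):
    """Return every record whose text shares at least one word with the query."""
    terms = set(query.lower().split())
    return [r for r in records if terms & set(r["text"].lower().split())]


hits = search("hurricane", records)
# The image data is reached through its associated text, using the same query.
images = [r["image"] for r in hits if r["kind"] == "image"]
```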

In accordance with still yet a further aspect, the present invention is directed to a method for searching a database of an information retrieval system in response to a query having at least one query word with a part of speech, for applying the query word to the database and selecting information from the database according to the query word. A semantic network is provided for determining expansion words to expand the search of the database in response to the query word. The part of speech of the selected query word is determined. The selected query word is applied to the semantic network to provide one or more query expansion words in response to the selected query word. The part of speech of the query expansion word is determined. The query expansion word is applied to the database in accordance with the part of speech of the selected query word and the part of speech of the query expansion word.
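One possible reading of part-of-speech-aware expansion is sketched below in Python. The tiny semantic network is invented, and the rule of keeping only expansion words whose part of speech matches that of the query word is an assumption; the passage says only that both parts of speech are taken into account.

```python
# Hypothetical semantic network: word -> related (word, part of speech) pairs.
SEMANTIC_NETWORK = {
    "plant": [("factory", "NOUN"), ("flora", "NOUN"), ("sow", "VERB")],
}


def expand(query_word, query_pos, network):
    """Keep expansion words whose part of speech matches the query word's."""
    return [word for word, pos in network.get(query_word, []) if pos == query_pos]


noun_expansions = expand("plant", "NOUN", SEMANTIC_NETWORK)
verb_expansions = expand("plant", "VERB", SEMANTIC_NETWORK)
```

Under this assumed rule, a query using "plant" as a verb is expanded with "sow" but not with "factory", which avoids pulling in documents about an unrelated sense of the word.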

In accordance with a still further aspect, the present invention is directed to a method for performing a search of a database in an information retrieval system in response to a query having at least one query word with a query word weight and for applying the query word to the database and selecting information from the information retrieval system in accordance with the query word. A query word is selected and assigned a weight. The weight is adjusted depending on whether the query word is a proper noun or stop word. The adjusting can be an increase or a decrease in the weight. Information is selected from the information retrieval system in accordance with the adjusted weight.
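A hedged Python sketch of the weight adjustment follows. It treats the second word category as a short list of very common words, which is an assumption about what the passage intends, and it detects proper nouns by capitalization alone, which is a crude stand-in. The multipliers are arbitrary illustrative values.

```python
# Assumed list of very common, low-information words.
COMMON_WORDS = {"the", "of", "in", "a", "and"}


def adjust_weight(word, weight, proper_boost=2.0, common_penalty=0.1):
    """Raise the weight of proper nouns and lower it for very common words."""
    if word.lower() in COMMON_WORDS:
        return weight * common_penalty
    if word[:1].isupper():  # crude proper-noun test, assumption only
        return weight * proper_boost
    return weight


query = ["the", "museum", "in", "Paris"]
weights = {word: adjust_weight(word, 1.0) for word in query}
```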

In accordance with a still further aspect, the present invention is directed to a method for searching a database of an information retrieval system in response to a query having a query length of at least one word, for applying the query word to the database and selecting information from the database according to the query word. The query is received and the length of the query is determined. Information is selected from the database according to the query. The relevance of the selected information is determined according to matches between the query and the information. The determined relevance of the selected information is adjusted according to the length of the query.
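The two stages, scoring by matches and then adjusting for query length, can be sketched in Python. Dividing the match count by the number of query words is an assumed normalization chosen for simplicity; the specification's actual normalization values are not reproduced here.

```python
def count_matches(query_words, document):
    """Raw relevance: how many query words appear in the document."""
    doc_words = set(document.lower().split())
    return sum(1 for word in query_words if word in doc_words)


def adjusted_relevance(query, document):
    """Adjust the raw match count by the length of the query (assumed rule)."""
    query_words = query.lower().split()
    return count_matches(query_words, document) / len(query_words)


document = "solar power plants"
short_query_score = adjusted_relevance("solar power", document)
long_query_score = adjusted_relevance("cheap solar power storage", document)
```

Both queries match the same two words, but the adjustment lets the shorter query, which is fully satisfied by the document, score higher than the longer one, which is only half satisfied.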

In accordance with a further aspect, the present invention is directed to a method for searching an information retrieval system having a database containing a plurality of documents from a plurality of document sources in response to a query from a user. A document log table is provided for tabulating document information of documents selected by the user in response to a query from the user. The query is received from the user and a document is selected by the user in response to the received query. The document log table is adjusted in response to the selecting of the document. The adjusted log table can be used to determine royalties.
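A minimal Python sketch of a document log table and one way it might feed a royalty calculation is shown below. The table layout, the publisher names, and the pro rata split of a fixed royalty pool are assumptions; the passage states only that the adjusted log table can be used to determine royalties.

```python
from collections import Counter

# Document log table: (document source, document id) -> number of selections.
log_table = Counter()


def record_selection(log_table, source, doc_id):
    """Adjust the log table when a user selects a document."""
    log_table[(source, doc_id)] += 1


def royalties(log_table, pool):
    """Split a royalty pool among sources in proportion to selections (assumed rule)."""
    per_source = Counter()
    for (source, _doc_id), count in log_table.items():
        per_source[source] += count
    total = sum(per_source.values())
    return {source: pool * count / total for source, count in per_source.items()}


for source, doc_id in [("PublisherA", "a1"), ("PublisherA", "a2"),
                       ("PublisherB", "b1"), ("PublisherA", "a1")]:
    record_selection(log_table, source, doc_id)

shares = royalties(log_table, pool=100.0)
```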

BRIEF DESCRIPTION OF THE DRAWINGS

In order that the manner in which the above-recited and other advantages and objects of the invention are obtained and can be appreciated, a more particular description of the invention briefly described above will be rendered by reference to a specific embodiment thereof which is illustrated in the appended drawings. Understanding that these drawings depict only a typical embodiment of the invention and are not therefore to be considered limiting of its scope, the invention and the presently understood best mode thereof will be described and explained with additional specificity and detail through the use of the accompanying drawings.

FIG. 1 is a simplified block diagram showing an information retrieval system in accordance with a preferred embodiment of the present invention.

FIG. 2 is a simplified process flow diagram illustrating a user session which may be performed with the information retrieval system shown in FIG. 1, in accordance with a preferred embodiment of the present invention.

FIG. 3 is a more detailed block diagram showing an information retrieval system in accordance with a preferred embodiment of the present invention.

FIG. 4 is a more detailed process flow diagram illustrating a user session which may be performed with the information retrieval system shown in FIG. 3, in accordance with a preferred embodiment of the present invention.

FIG. 4A is a diagram illustrating an exemplary search results list displayed in an open window on a user's personal computer, in accordance with a preferred embodiment of the present invention.

FIG. 4B is an exemplary diagram illustrating first and second open windows on a user's personal computer which respectively display text and video information corresponding to document and multi-media files selected by the user for retrieval, in accordance with a preferred embodiment of the present invention.

FIG. 4C is an exemplary diagram illustrating first and second open windows on a user's personal computer which respectively display text and video information corresponding to document and multi-media files selected by the user for retrieval, and a composite document window in which the user has built a composite document based on the text and video information in the first and second windows, in accordance with a preferred embodiment of the present invention.

FIG. 5 is a diagram illustrating preferred data structures for storing a document information directory table, a dependent image table, and publisher information table, in accordance with a preferred embodiment of the present invention.

FIG. 5A is a diagram illustrating a preferred data structure for implementing a document index database, in accordance with a preferred embodiment of the present invention.

FIG. 5B is a diagram illustrating a preferred data storage format for implementing an image/text database, in accordance with a preferred embodiment of the present invention.

FIG. 6 is a block diagram illustrating the operation of software systems for implementing the session and query managers shown in FIG. 4, in accordance with a preferred embodiment of the present invention.

FIG. 6A is a state flow diagram showing the operation of a session manager software system, in accordance with a preferred embodiment of the present invention.

FIG. 6B is a flow diagram showing the operation of a search engine software system, in accordance with a preferred embodiment of the present invention.

FIG. 7A is a block diagram of a hidden Markov model suitable for parsing full sentences.

FIG. 7B is a block diagram of a hidden Markov model for parsing sentence fragments, in accordance with a preferred embodiment of the present invention.

FIG. 8A is a table of relevance normalization values for normalizing relevance scores output by a search engine, in accordance with a preferred embodiment of the present invention.

FIG. 8B is a graph illustrating a system for normalizing relevance scores output by a search engine, in accordance with a preferred embodiment of the present invention.

FIG. 9 is a block diagram representation of the data preparation component of the information retrieval system of FIG. 3, in accordance with a preferred embodiment of the present invention.

FIG. 9A is a block diagram representation of data flows within the data preparation component of FIG. 9, in accordance with a preferred embodiment of the present invention.

FIG. 10 is a block diagram representation of an automatic subjecting system for automatically determining the subject category of input documents, in accordance with a preferred embodiment of the present invention.

FIG. 11 is a process flow representation of a method for generating subject lexicons for use in the automatic subjecting system of FIG. 10, in accordance with a preferred embodiment of the present invention.

FIG. 12 is a block diagram of a system for generating subject lexicons for use in the automatic subjecting system of FIG. 10, in accordance with a preferred embodiment of the present invention.

FIG. 13 is a representation of data structures within an accounting database, in accordance with a preferred embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

Referring now to FIG. 1, there is shown a simplified block diagram illustrating an information retrieval system 100, in accordance with a preferred embodiment of the present invention. The information retrieval system 100 includes a user station 102 for searching information files which have been collected from various publisher sources 112 and stored in data center 110. The user station 102 includes a personal computer (PC) 104 and user software 106 which resides on PC 104. User software 106 includes a graphical user interface (shown generally in FIGS. 4A, 4B and 4C). The user station 102 provides search queries by way of a communications channel 108 (such as, for example, a large volume public network or the Internet) coupled to the data center 110. The data center 110 includes session server 114 which includes means for receiving a search query from user station 102, means for sending the search query to a query server 116, means for receiving search results information from the query server 116, means for sending a search results list representative of the search results information across communications channel 108 to the user station 102, means for receiving a document retrieval request transmitted from user station 102 over communications channel 108 to session server 114, and means for retrieving a document from database 118 in response to the retrieval request and transmitting a file representative of the document to user station 102 over communications channel 108. The query server 116 at data center 110 includes means for receiving a search query from the session server 114, searching means for searching a document index database 117 (shown in FIG. 3) to identify documents responsive to the search query, and means for sending search results information representative of the documents identified by the searching means from the query server 116 to the session server 114. 
Data center 110 also includes a library database 118 for storing text, image, audio or other multi-media information representative of files provided by a plurality of publishers 112. As explained more fully below, session server 114 retrieves (from library 118) documents identified by a search query and selected by a user of user station 102 for retrieval, and then transmits the selected documents to the user station 102 over channel 108.

Referring now to FIG. 2, there is shown a simplified process flow diagram illustrating a user session 200 which may be performed with information retrieval system 100 shown in FIG. 1, in accordance with a preferred embodiment of the present invention. In step 202 of user session 200, the user station 102 communicates to data center 110 (via channel 108) a description of the information that a user of user station 102 would like to identify at data center 110. More specifically, in step 202 a user of user station 102 sends a "natural language search query" to data center 110. As described more fully below in connection with FIG. 4, the term "natural language search query" is used to refer to a question, sentence, sentence fragment, single word or term which describes (in natural language form) a particular topic or issue for which a user of user station 102 seeks to identify information. Based on the natural language query provided by user station 102, the query server 116 in data center 110 searches a document index database 117 (shown in FIGS. 3 and 5A) coupled to the query server, and a list of files responsive to the search query is returned to user station 102, as shown in step 204. Next, in step 206, the user of user station 102 may select for retrieval one of the listed files identified by data center 110. In step 208, session server 114 in data center 110 retrieves the full text, i