WikiPatents - Community Patent Review
Automatic tagging of documents and exclusion by content    
United States Patent 6199081
Link to this page: http://www.wikipatents.com/6199081.html
Inventor(s): Meyerzon; Dmitriy (Bellevue, WA), Nichols; William G. (Seattle, WA)
Abstract: A computer-based method and system for processing data obtained from documents retrieved from a computer network during a gathering project is disclosed. Plugging in modular active and consumer plug-ins into the gathering project configures the information processing capability of the gathering process that retrieves the documents. The gathering process retrieves a copy of an electronic document from a server connected to the computer network and returns a document data stream that includes the retrieved document's data and its "properties." One or more active plug-ins plugged-in to the gathering process are used to add, delete or modify the properties in the document data stream based on the document's contents or properties. The modified document data stream is then passed to one or more consumer plug-ins that use the properties in the modified document data stream to process the document in some manner. An active plug-in can prevent any part of the document data stream from being forwarded to subsequent active or consumer plug-ins in the project. An active plug-in can also control the consumer plug-ins by instructing them to abort processing of a particular document after analyzing some of the document's contents while the document is being processed.
   














Owner/Assignee     Microsoft Corporation (Redmond, WA)
Publication Date     March 6, 2001
Application Number     09/107,225
Filing Date     June 30, 1998
US Classification     715/513 707/6 715/516 715/530
Examiner     Feild; Joseph H.
Attorney/Law Firm     Christensen O'Connor Johnson Kindness PLLC
USPTO Field of Search     707/512 707/513 707/1 707/3 707/5 707/10 707/6 707/516 707/530 345/335 709/217 709/218 709/219
U.S. References
6094657     Hailpern et al.     Jul. 2000
6029161     Lang et al.     Feb. 2000
5999940     Ranger     Dec. 1999
5983214     Lang et al.     Nov. 1999
5974412     Hazlehurst et al.     Oct. 1999
5933822     Braden-Harder et al.     Aug. 1999
5899999     De Bonet     May 1999
5875446     Brown et al.     Feb. 1999
5867799     Lang et al.     Feb. 1999
5870559     Leshem et al.     Feb. 1999
5864871     Kitain et al.     Jan. 1999
5855020     Kirsch et al.     Dec. 1998
5835722     Bradshaw et al.     Nov. 1998
5748954     Mauldin     May 1998
5659732     Kirsch     Aug. 1997
Claims
 


The embodiments of the invention in which an exclusive property or privilege is claimed are defined as follows:

1. A computer-based method for processing data retrieved during a crawl of a computer network, comprising:

retrieving a document in a gathering project;

parsing the document into a document data stream including contents and properties;

piping the document data stream to at least one active plug-in, said active plug-in having the capability of modifying the document data stream by adding to, deleting from, or changing the content and/or properties of the document data stream; and

piping the document data stream to at least one consumer plug-in, said consumer plug-in having the ability to perform an action in response to the modification made by said active plug-in.

2. The method of claim 1, wherein the at least one active plug-in:

analyzes the document data stream; and

modifies the document data stream based on the analysis performed by the active plug-in.

3. The method of claim 2, wherein the active plug-in makes a plurality of modifications to the document data stream.

4. The method of claim 2, wherein making the modification to the document data stream comprises adding at least one property to the document data stream.

5. The method of claim 4, wherein the at least one consumer plug-in:

receives the document data stream after at least one modification has been made to the document data stream by the at least one active plug-in; and

performs an action in response to the modification made to the original document data stream by the at least one active plug-in.

6. The method of claim 5, wherein the action performed in the at least one consumer plug-in is to discard the document data stream.

7. The method of claim 5, wherein the action performed in the at least one consumer plug-in is to abort processing of the document data stream.

8. The method of claim 2, wherein making the modification to the document data stream comprises deleting at least one property from the document data stream.

9. The method of claim 8, wherein the at least one consumer plug-in:

receives the document data stream after at least one modification has been made to the document data stream by the at least one active plug-in; and

performs an action in response to the modification made to the original document data stream by the at least one active plug-in.

10. The method of claim 9, wherein the action performed in the at least one consumer plug-in is to discard the document data stream.

11. The method of claim 9, wherein the action performed in the at least one consumer plug-in is to abort processing of the document data stream.

12. The method of claim 2, wherein making the modification to the document data stream comprises modifying at least one existing property in the document data stream by deleting the existing property from the document data stream and inserting a substitute property in the document data stream.

13. The method of claim 12, wherein the at least one consumer plug-in:

receives the document data stream after at least one modification has been made to the document data stream by the at least one active plug-in; and

performs an action in response to the modification made to the original document data stream by the at least one active plug-in.

14. The method of claim 13, wherein the action performed in the at least one consumer plug-in is to discard the document data stream.

15. The method of claim 13, wherein the action performed in the at least one consumer plug-in is to abort processing of the document data stream.

16. The method of claim 2, wherein making the modification to the document data stream comprises deleting at least some of the contents from the document data stream.

17. The method of claim 2, wherein there is a plurality of active plug-ins that have been plugged-in to the gathering project in a sequence, each of said plurality of active plug-ins receiving as an input the document data stream as modified by an immediately preceding active plug-in in the sequence.

18. The method of claim 17, wherein the at least one consumer plug-in:

receives the document data stream after at least one modification has been made to the document data stream by the at least one active plug-in; and

performs an action in response to the modification made to the original document data stream by the at least one active plug-in.

19. The method of claim 18, wherein the action performed in the at least one consumer plug-in is to discard the document data stream.

20. The method of claim 18, wherein the action performed in the at least one consumer plug-in is to abort processing of the document data stream.

21. The method of claim 1, wherein the at least one consumer plug-in:

receives the document data stream after at least one modification has been made to the document data stream by the at least one active plug-in; and

performs an action in response to the modification made to the original document data stream by the at least one active plug-in.

22. The method of claim 21, wherein the action performed in the at least one consumer plug-in is to discard the document data stream.

23. The method of claim 21, wherein the action performed in the at least one consumer plug-in is to abort processing of the document data stream.

24. The method of claim 1, wherein the parsing of the document is accomplished using a filtering process that creates the document data stream, the document data stream comprising a uniform representation of a set of contents and properties contained in the document.

25. The method of claim 24, wherein the filtering process deletes certain contents and properties from the document data stream before piping the document data stream to the active plug-in.

26. The method of claim 25, wherein the certain contents and properties comprise formatting information.

27. The method of claim 26, wherein the filtering process is external to the gatherer process.

28. A computer-based method for processing data retrieved during a crawl of a computer network, comprising:

retrieving a document with a gatherer process; and

parsing the document into a document data stream including contents and properties; and

piping the document data stream to at least one active plug-in, said active plug-in having the capability of modifying the document data stream by adding to, deleting from, or changing the content and/or properties of the document data stream; and

piping the document data stream to at least one consumer plug-in, said consumer plug-in having the ability to perform an action in response to the modification made by said active plug-in; and

not forwarding at least some of the document data stream to a subsequent plug-in based on the analysis performed by the active plug-in.

29. The method of claim 28, wherein the entire document data stream associated with the document is not forwarded to the subsequent plug-in.

30. The method of claim 28, wherein the subsequent plug-in is another active plug-in.

31. The method of claim 28, wherein the subsequent plug-in is a consumer plug-in.

32. The method of claim 28, wherein the parsing of the document is accomplished with a filtering process that creates the document data stream, the document data stream comprising a uniform representation of a set of contents and properties contained in the document.

33. The method of claim 32, wherein the filtering process deletes certain contents and properties from the document data stream before piping the document data stream to the active plug-in.

34. The method of claim 33, wherein the certain contents and properties comprise formatting information.

35. The method of claim 34, wherein the filtering process is external to the gatherer process.

36. A computer-readable medium having computer-executable instructions for retrieving and processing information from a computer network, wherein retrieving and processing information from a computer network includes performing a Web crawl, wherein performing a Web crawl comprises:

retrieving an electronic document copy from the computer network in a gathering project, the electronic document copy having text chunks and properties;

passing the electronic document copy to an active plug-in that analyzes the electronic document copy;

making at least one change to the electronic document copy with the active plug-in, wherein making the change comprises:

adding a property to the electronic document copy if the active plug-in determines that a property should be added;

deleting a property from the electronic document copy if the active plug-in determines that a property should be deleted;

modifying a property of the electronic document copy if the active plug-in determines that a property should be modified;

deleting a text chunk from the electronic document copy if the active plug-in determines that the text chunk should be deleted; and

passing the electronic document copy that has been changed by the active plug-in to a consumer plug-in that processes the electronic document copy responsive to the change made in the electronic document copy by the active plug-in.

37. The computer-readable medium of claim 36, wherein there is a plurality of active plug-ins that have been plugged-in to the gathering project in a sequence, each of said plurality of active plug-ins receiving as an input the electronic document copy including the change made by an immediately preceding active plug-in in the sequence.

38. The computer-readable medium of claim 36, wherein there is a plurality of consumer plug-ins that have been plugged-in to the gathering project, each of said plurality of consumer plug-ins receiving as an input the electronic document copy including the change made by the active plug-in.

39. The computer-readable medium of claim 36, wherein the active plug-in and the consumer plug-in are objects that are visible in a distributed namespace, wherein a distributed namespace is an area in a computer memory that associates an identifier for the object with a location in the computer memory, the distributed namespace being accessible by a plurality of computers that are networked into a distributed system.

40. A system for retrieving and processing information stored on a computer, the system comprising:

an information retrieval component that retrieves information from the computer;

at least one modular active plug-in component, each of the modular active plug-in components being associated with the information retrieval component in a sequence that defines the order that the information is passed to each of the at least one modular active plug-in components; and

each of the at least one modular active plug-in components being capable of analyzing and modifying the information before passing the information to a next active plug-in component in the sequence.

41. The system of claim 40, further comprising:

at least one modular consumer plug-in component, each of the at least one modular consumer plug-in components being associated with the information retrieval component so that each of the at least one modular consumer plug-in components receives the information as it has been modified by each of the at least one active plug-in components.

42. The system of claim 41, wherein:

at least one of the at least one consumer plug-in components processes the information in a manner responsive to information that has been added by the at least one active plug-in component.

43. The system of claim 41, wherein:

at least one of the at least one consumer plug-in components processes the information in a manner responsive to information that has been deleted by the at least one active plug-in component.

44. The system of claim 41, wherein:

at least one of the at least one consumer plug-in components processes the information in a manner responsive to information that has been modified by the at least one active plug-in component.

45. A system for retrieving information from a computer network having a plurality of electronic documents stored thereon, wherein each electronic document corresponds to a corresponding document address specification that provides information for locating the electronic document, the system comprising:

means for retrieving a plurality of electronic documents, each electronic document having content comprising data and meta-tags;

first computer-executable instructions implemented as an active plug-in object for conducting an analysis of the content of each electronic document and means for modifying the meta-tags associated with each electronic document responsive to the analysis; and

second computer-executable instructions implemented as a consumer plug-in object for processing each electronic document responsive to the meta-tags associated with each electronic document.

46. The system of claim 45, wherein:

the first executable instructions implemented as an active plug-in object may be modularly inserted and withdrawn from the means for retrieving a plurality of electronic documents.

47. The system of claim 46, wherein:

the second executable instructions implemented as a consumer plug-in object may be modularly inserted and withdrawn from the means for retrieving a plurality of electronic documents.

48. The system of claim 47, wherein:

there are a plurality of active plug-in objects that are a part of the means for retrieving a plurality of electronic documents.

49. The system of claim 48, wherein:

there are a plurality of consumer plug-in objects that are a part of the means for retrieving a plurality of electronic documents.
Description
 


FIELD OF THE INVENTION

The present invention relates to the field of software and, in particular, to methods and systems for retrieving data from network sites and processing that data according to its content.

BACKGROUND OF THE INVENTION

In recent years, there has been a tremendous proliferation of computers connected to a global network known as the Internet. A "client" computer connected to the Internet can download digital information from "server" computers connected to the Internet. Client application software executing on client computers typically accepts commands from a user and obtains data and services by sending requests to server applications running on server computers connected to the Internet. A number of protocols are used to exchange commands and data between computers connected to the Internet. The protocols include the File Transfer Protocol (FTP), the Hypertext Transfer Protocol (HTTP), the Simple Mail Transfer Protocol (SMTP), and the "Gopher" document protocol.

The HTTP protocol is used to access data on the World Wide Web, often referred to as "the Web." The World Wide Web is an information service on the Internet providing documents and links between documents. The World Wide Web is made up of numerous Web sites around the world that maintain and distribute Web documents. A Web site may use one or more Web server computers that store and distribute documents in one of a number of formats including the Hypertext Markup Language (HTML).

An HTML document contains text and tags. HTML documents may also contain metadata and metatags. Metadata is data about data, and metatags define the metadata. Examples of metatags that identify metadata are "author," "language," and "character set." HTML documents may also include tags that contain embedded "links" or "hyperlinks" that reference other data or documents located on the same or another Web server computer. The HTML documents and the documents referenced in the hyperlinks may include text, graphics, audio, or video in various formats.

A Web browser is a client application that communicates with server computers via HTTP, FTP, and Gopher protocols. Web browsers receive Web documents from the network and present them to a user. Internet Explorer, available from Microsoft Corporation, Redmond, Wash., is an example of a popular Web browser application.

An intranet is a local area network containing Web servers and client computers operating in a manner similar to that of the World Wide Web described above. Typically, all of the computers on an intranet are contained within a company or organization.

Web crawlers are computer programs that automatically retrieve numerous Web documents from one or more Web sites. A Web crawler processes the received data, preparing the data to be subsequently processed by other programs. For example, a Web crawler may use the retrieved data to create an index of documents available over the Internet or an intranet. A "search engine" can later use the index to locate Web documents that satisfy specified search criteria.

It is desirable to have a mechanism in the crawler that allows the crawler to feed to client applications, like an indexing engine, a stream of data not directly present in the "crawled" documents. Preferably, such a mechanism would have the ability to modify data retrieved from Web documents with active components in order to allow the retrieved data to be processed more efficiently and accurately by the client application. The mechanism of the invention would also preferably have the ability to exclude a document from being indexed based on its content and properties. The present invention is directed to providing such a mechanism.

SUMMARY OF THE INVENTION

The present invention discloses a method and system for modifying a document data stream obtained by a gatherer process when an electronic document is retrieved from a computer. The gatherer process retrieves Web documents from Web servers that are connected to a computer network commonly known as the World Wide Web. Preferably, the Web crawler employs a filtering process to retrieve the document and to parse the document into a document data stream comprising contents and properties. For instance, when an HTML document is retrieved, the filtering process converts the document's text and tags to a uniform representation of the document's contents and properties. The document retrieval performed by the present invention is not limited to HTML documents. Many different document formats may be filtered to produce a uniform representation of contents and properties that are processed by the invention in the manner described below.

In accordance with the present invention, the retrieved contents and properties of a document are contained in a document data stream that is sequentially piped through one or more active plug-in components. The active plug-in components modify the document data stream by adding, deleting, or modifying the contents and properties of the document data stream. Active plug-ins are modeled in the invention as modular components, or "plug-ins," that in an actual embodiment of the invention are software objects that can be plugged-in to a configuration entity called a gathering project. After the document data stream has been modified by the active plug-ins, the modified document data stream is piped to one or more consumer plug-ins. A consumer plug-in is an application that processes the modified document data stream. The processing conducted by the consumer plug-in may be influenced by the modifications made to the original document data stream by the active plug-ins.
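The flow described above can be pictured with a minimal Python sketch. It is an illustration only, not the patented implementation: the names DocumentDataStream, run_project, tag_language, and record_language are hypothetical, and the plug-ins are modeled as plain functions rather than the software objects of the actual embodiment.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DocumentDataStream:
    """Hypothetical uniform representation: text contents plus named properties."""
    contents: List[str] = field(default_factory=list)
    properties: Dict[str, str] = field(default_factory=dict)


# Assumption: an active plug-in takes a stream and returns the (possibly modified) stream.
ActivePlugin = Callable[[DocumentDataStream], DocumentDataStream]
# Assumption: a consumer plug-in only reads the stream and returns nothing.
ConsumerPlugin = Callable[[DocumentDataStream], None]


def run_project(stream: DocumentDataStream,
                active_plugins: List[ActivePlugin],
                consumer_plugins: List[ConsumerPlugin]) -> DocumentDataStream:
    """Pipe the stream through the active plug-ins in order, then hand it to each consumer."""
    for plugin in active_plugins:
        stream = plugin(stream)
    for consumer in consumer_plugins:
        consumer(stream)
    return stream


def tag_language(stream: DocumentDataStream) -> DocumentDataStream:
    """Hypothetical active plug-in: adds a property derived from the contents."""
    if any("bonjour" in text.lower() for text in stream.contents):
        stream.properties["language"] = "fr"
    return stream


seen_languages: List[str] = []


def record_language(stream: DocumentDataStream) -> None:
    """Hypothetical consumer plug-in: acts on the property the active plug-in added."""
    seen_languages.append(stream.properties.get("language", "unknown"))


doc = DocumentDataStream(contents=["Bonjour tout le monde"],
                         properties={"author": "A. Writer"})
result = run_project(doc, [tag_language], [record_language])
```

Passing an empty list of active plug-ins sends the stream to the consumers unchanged.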

Both active plug-ins and consumer plug-ins can be mixed and matched and plugged-in to the gathering project according to the goals of the project. Active plug-ins are inserted before any consumer plug-ins so that they may modify the original document data stream in a way that makes the document data stream more useful to the consumer plug-ins that follow the active plug-in in the gathering project. The gathering project can also be configured not to use any active plug-ins, in which case all data contained in the original document data stream will be piped directly to the consumer plug-ins that are plugged-in to the project.

In accordance with other aspects of this invention, the gatherer process is an enhanced Web crawler that has one or more configuration entities called gathering projects. Each gathering project has its own transaction log, history map, plug-in list, and crawl restriction rules that the gatherer process uses to "crawl" Web documents that are stored on a plurality of Web servers connected to the World Wide Web. When the gatherer process retrieves a document, the gatherer process receives a copy of the content of the document, which may include data such as text, images, sound, and embedded properties.
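As a rough illustration, a gathering project could be sketched as a small Python record holding the four items listed above. The field types, the meaning given to each field, and the host-based restriction rule are assumptions made for this example; the text here does not specify them.

```python
from dataclasses import dataclass, field
from typing import Dict, List
from urllib.parse import urlparse


@dataclass
class GatheringProject:
    """Hypothetical configuration entity; field semantics are guesses for illustration."""
    name: str
    transaction_log: List[str] = field(default_factory=list)  # assumed: record of crawl actions
    history_map: Dict[str, str] = field(default_factory=dict)  # assumed: URL -> status of last visit
    plugin_list: List[str] = field(default_factory=list)       # plug-in names, in piping order
    allowed_hosts: List[str] = field(default_factory=list)     # assumed form of a crawl restriction rule

    def may_crawl(self, url: str) -> bool:
        """Allow a URL only if its host is permitted and it has not been visited yet."""
        host = urlparse(url).hostname
        return host in self.allowed_hosts and url not in self.history_map

    def record_visit(self, url: str, status: str) -> None:
        """Remember the visit in both the log and the history map."""
        self.transaction_log.append(f"visited {url}: {status}")
        self.history_map[url] = status


project = GatheringProject(name="demo", allowed_hosts=["example.com"],
                           plugin_list=["tagger", "indexer"])
```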

An example of a client application that makes use of embedded properties is a Web browser that reads HTML tags embedded in a Web document to format the document and to specify hyperlinks to other Web documents. In addition to tags that provide formatting information, the document may also contain meta-tags, which are used to define meta-data in the document. For instance, a meta-tag "Author" may identify meta-data in the document that identifies the author of the document. Tags may either conform to "markup languages" such as HTML, SGML, XML and VRML, which are widely known to those skilled in the art, or tags can be defined as "extensions" to a markup language and embedded in documents for the use of specific client applications. An example of a client application that recognizes an extended set of property definitions is the Internet Explorer, a Web browser available from Microsoft Corporation, Redmond, Wash.

When the gatherer process retrieves a Web document, it first uses a filtering process to retrieve the Web document according to the appropriate protocol. The filtering process then converts the text and tags retrieved from the document into a uniform representation of the document's contents and properties. The filtering process can return contents and properties from many document formats other than HTML, such as Microsoft Word documents, email messages, and SQL database records. The filtering process strips out any extra information stored in the document that does not belong to its contents or properties. For instance, the filtering process discards tags that include formatting information such as paragraphs, fonts, styles, etc., that are used by the Web browser to render and display the document to a user.

After filtering, the gatherer process pipes the document's contents and properties in a document data stream to the plug-ins listed in the gathering project's plug-ins list. The filtered document data stream consists of a series of "chunks" that contain either content or properties. Unless explicitly stated otherwise, reference made to either content or properties herein will be understood to imply the other (i.e., a reference to properties implicitly also refers to content).
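To make the idea of a filtered stream of chunks concrete, here is a small sketch built on the Python standard library's html.parser module. It is not the filter used by the invention: the chunk layout (kind, name, value), the SketchFilter class, and the choice of which tags count as properties are assumptions for illustration. Formatting tags such as p and b are dropped, while meta-tags, the title, and hyperlinks are kept as property chunks.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# Assumed chunk layout: (kind, name, value), where kind is "content" or "property".
Chunk = Tuple[str, str, str]


class SketchFilter(HTMLParser):
    """Hypothetical filter: turns HTML into a flat list of content and property chunks."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[Chunk] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("name"):
            self.chunks.append(("property", attr["name"].lower(), attr.get("content") or ""))
        elif tag == "a" and attr.get("href"):
            self.chunks.append(("property", "link", attr["href"]))
        elif tag == "title":
            self._in_title = True
        # Any other tag (p, b, font, ...) is treated as formatting and discarded.

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.chunks.append(("property", "title", text))
        else:
            self.chunks.append(("content", "", text))


def filter_html(html: str) -> List[Chunk]:
    """Run the sketch filter over an HTML string and return its chunks."""
    parser = SketchFilter()
    parser.feed(html)
    parser.close()
    return parser.chunks


chunks = filter_html(
    '<html><head><title>Hi</title><meta name="Author" content="Jane"></head>'
    '<body><p><b>Hello</b> world</p><a href="http://example.com/x">link</a></body></html>'
)
```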

The gatherer process iterates through a list of one or more active plug-ins that it sequentially pipes the document data stream through. Each active plug-in has the capability to modify the document data stream by, for instance, inserting properties into the document data stream, deleting properties from the document data stream, or modifying properties in the document data stream. Because each active plug-in receives the document data stream as it has been modified by a previous active plug-in in the list, the modifications made by the active plug-ins are cumulative. Thus, the processing by an active plug-in may itself be influenced by changes to the document data stream already made by another active plug-in. To a consumer plug-in, a property that an active plug-in has inserted, deleted, or modified in the document data stream appears as if it were an original part of the retrieved document. This includes the ability of an active plug-in to insert properties into the document data stream that are intended for the use of specific consumer plug-ins.
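The cumulative effect can be sketched in a few lines of Python. Here the properties are modeled as a plain dictionary, and the three plug-ins (detect_topic, assign_priority, drop_keywords) are hypothetical examples: the second one bases its decision on a property that only exists because the first one inserted it.

```python
from typing import Callable, Dict, List

Properties = Dict[str, str]
ActivePlugin = Callable[[Properties], Properties]


def detect_topic(props: Properties) -> Properties:
    """Hypothetical plug-in: inserts a 'topic' property derived from the title."""
    out = dict(props)
    if "patent" in out.get("title", "").lower():
        out["topic"] = "legal"
    return out


def assign_priority(props: Properties) -> Properties:
    """Hypothetical plug-in: relies on the 'topic' property added earlier in the list."""
    out = dict(props)
    out["priority"] = "high" if out.get("topic") == "legal" else "normal"
    return out


def drop_keywords(props: Properties) -> Properties:
    """Hypothetical plug-in: deletes a property from the stream."""
    out = dict(props)
    out.pop("keywords", None)
    return out


def pipe_through(props: Properties, plugins: List[ActivePlugin]) -> Properties:
    """Each plug-in receives the properties as modified by the previous one."""
    for plugin in plugins:
        props = plugin(props)
    return props


original = {"title": "Patent filing guide", "keywords": "free, win"}
modified = pipe_through(original, [detect_topic, assign_priority, drop_keywords])
```

A consumer that receives only `modified` has no way to tell which properties came from the document and which were inserted along the way.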

After the gatherer process has piped the document data stream through the active plug-ins, it pipes the resulting modified document data stream to one or more consumer plug-ins. A consumer plug-in is "read only" in that it cannot modify the document data stream like an active plug-in. Since the consumer plug-in processes the modified document data stream, the active plug-ins can tailor the original document data stream into a form that is more useful to the consumer plug-in. The document data stream is more "useful" to the consumer plug-in if the consumer plug-in can process the document data stream, as modified by the active plug-ins, more effectively or more efficiently than the consumer plug-in would have been able to process the original document data stream as it was retrieved from the document. For instance, an active plug-in can insert a property into the document data stream of a document that a consumer plug-in should process, while removing inaccurate or deceptive properties from the document data stream that could cause the consumer plug-in to process the document in a way that it should not. In the case of a consumer plug-in that builds an index, the modified document data stream improves the quality of the index built by the consumer plug-in by enabling the consumer plug-in to process the information more accurately based on the modified content provided by the active plug-ins. As will become apparent in the discussion below, the active plug-ins provide an automated, customizable way to edit, annotate, and/or censor documents that are provided to consumer plug-ins. Automating these functions is advantageous because of the immense volume of documents contained on Web sites connected to the Internet, an intranet, and other computer networks.
The method and system of the present invention also advantageously provide a way to alter the contents of the retrieved documents without affecting the original documents, which the administrator of the site running the gatherer does not own or have access rights to.
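The indexing example can be illustrated with a short Python sketch. The rule used to decide that a keyword is deceptive (it never occurs in the document text) and the function names are assumptions chosen for this example; the active plug-in works on a copy, so the retrieved record, like the original document on its server, is left untouched.

```python
from collections import defaultdict
from typing import Dict, List, Set


def remove_deceptive_keywords(doc: dict) -> dict:
    """Hypothetical active plug-in: keep only keywords that actually occur in the text."""
    text = doc["content"].lower()
    kept = [k for k in doc["keywords"] if k.lower() in text]
    return {**doc, "keywords": kept}  # a modified copy; the input record is not changed


def build_index(docs: List[dict]) -> Dict[str, Set[str]]:
    """Hypothetical consumer plug-in: maps each keyword to the URLs that carry it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc in docs:
        for keyword in doc["keywords"]:
            index[keyword.lower()].add(doc["url"])
    return dict(index)


retrieved = [{
    "url": "http://example.com/a",
    "content": "Cheap flights to Paris",
    "keywords": ["flights", "lottery"],
}]
cleaned = [remove_deceptive_keywords(d) for d in retrieved]
index = build_index(cleaned)
```

Without the active plug-in, the index would also list the page under "lottery", a term the page never discusses.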

Since the active plug-ins alter the content of the document data stream, the order in which they are piped the information by the gatherer process is important because the active plug-ins each receive the document data stream as it has been modified by all previous active plug-ins through which it has traveled. The invention also makes it possible to delete data from the document data stream so that some or all portions of a retrieved document are not forwarded to subsequent active plug-ins and consumer plug-ins. Those knowledgeable in the art will understand that the description herein of "piping the data" is used metaphorically; the data is actually passed by reference pointers to locations in memory, by COM interfaces, or by some other method well known in the art. The COM interface protocol is available from Microsoft Corporation, Redmond, Wash.
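Both points, that order matters and that a plug-in can withhold part of the stream, can be shown with Python generators. This is a sketch under assumptions: the chunk layout and the two plug-ins are hypothetical, and chaining generators merely stands in for whatever reference-passing mechanism an actual embodiment uses.

```python
from typing import Iterable, Iterator, List, Tuple

Chunk = Tuple[str, str]  # assumed layout: (kind, value)


def drop_after_marker(chunks: Iterable[Chunk]) -> Iterator[Chunk]:
    """Hypothetical active plug-in: stops forwarding once it sees the marker text."""
    for kind, value in chunks:
        if kind == "content" and "CONFIDENTIAL" in value:
            return  # nothing from here on reaches subsequent plug-ins
        yield kind, value


def uppercase_content(chunks: Iterable[Chunk]) -> Iterator[Chunk]:
    """Hypothetical active plug-in: rewrites content chunks, leaves properties alone."""
    for kind, value in chunks:
        yield (kind, value.upper()) if kind == "content" else (kind, value)


stream: List[Chunk] = [
    ("property", "author=Jane"),
    ("content", "public intro"),
    ("content", "confidential notes"),
    ("content", "appendix"),
]

# Uppercase first: the marker becomes visible, so the tail of the document is withheld.
upper_then_drop = list(drop_after_marker(uppercase_content(stream)))
# Drop first: the lowercase text does not match the marker, so everything is forwarded.
drop_then_upper = list(uppercase_content(drop_after_marker(stream)))
```

The same two plug-ins produce different streams depending only on their position in the list.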

In accordance with further aspects of this invention, the gatherer, the active plug-ins, and the consumer plug-ins are modeled as objects. The objects have interfaces that are visible in a global namespace, or a distributed namespace in a distributed system, which permits objects to communicate with each other, to read each other's properties and use each other's methods. This property makes the invention extensible because the modular nature of objects allows the active plug-ins and the consumer plug-ins to be "plugged-in" to the gathering process, as needed, to support the goals of a given Web crawling project. This property also means that the objects comprising the gatherer process, the active plug-ins, and the consumer plug-ins need not be executed on the same computer, but can be stored and linked together across a network from any server computer that has access to the distributed namespace.

In accordance with still further aspects of this invention, the active plug-ins are able to control the gatherer process by sending messages with the requested action to the gatherer object.
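One way to picture this message passing is the following Python sketch. The Gatherer class, its receive method, and the "abort_document" action name are assumptions made for illustration; the patent text here does not specify the message format or the set of actions.

```python
class Gatherer:
    """Hypothetical gatherer object that accepts control messages from plug-ins."""

    def __init__(self):
        self.aborted = set()

    def receive(self, action, url):
        # Only one illustrative action is modeled in this sketch.
        if action == "abort_document":
            self.aborted.add(url)
        else:
            raise ValueError(f"unknown action: {action}")

    def process(self, stream, active_plugins, consumers):
        for plugin in active_plugins:
            stream = plugin(stream, self)
            if stream["url"] in self.aborted:
                return  # a plug-in asked to stop; consumers never see the document
        for consumer in consumers:
            consumer(stream)

def block_marked_private(stream, gatherer):
    """Hypothetical active plug-in: requests an abort based on document content."""
    if "PRIVATE" in stream["body"]:
        gatherer.receive("abort_document", stream["url"])
    return stream

gatherer = Gatherer()
seen_by_consumer = []
for doc in (
    {"url": "http://example.com/1", "body": "PRIVATE notes"},
    {"url": "http://example.com/2", "body": "public notes"},
):
    gatherer.process(doc, [block_marked_private], [seen_by_consumer.append])
```

Here the plug-in never touches the consumers directly. It sends a request to the gatherer object, and the gatherer decides what to do with the document.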

As will readily be appreciated from the foregoing summary, the present invention provides a method and system for retrieving a document data stream from a Web document residing on a Web server, and modifying that document data stream using active plug-ins before passing the document data stream to consumer plug-ins for final processing. Other features of the invention are described below.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing aspects and many of the attendant advantages of this invention will become more readily appreciated as the same becomes better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein:

FIG. 1 is a block diagram of a general-purpose computer system for implementing the present invention;

FIG. 2 is a block diagram illustrating a network architecture, in accordance with the present invention;

FIG. 3 is a block diagram illustrating an architecture of a gatherer project implemented by a gatherer process, in accordance with the present invention;

FIG. 4 is a functional flow diagram illustrating the piping of document data to active and consumer plug-ins, in accordance with the present invention;

FIG. 5 is a functional flow diagram illustrating the use of active plug-ins, in accordance with the present invention;

FIG. 6 is a functional flow diagram illustrating the use of consumer plug-ins, in accordance with the present invention;

FIG. 7 is a functional flow diagram illustrating some of the interfaces and messages used by the present invention; and

FIG. 8 is a functional flow diagram illustrating the method that a plug-in uses to control the gatherer process.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

The present invention is a mechanism for obtaining and processing information pertaining to Web documents that reside on one or more server computers. A server computer is referred to as a Web site, and the process of locating and retrieving digital data from Web sites is referred to as "Web crawling." The computer program performing the Web crawling is called a gatherer process. A gatherer process retrieves Web documents by visiting the Uniform Resource Locators (URLs) associated with Web documents that have been placed in a queue referred to as a transaction log. Before the URL is inserted into the transaction log, the URL is compared to the exclusion rules for the project. These rules define the scope of the crawl and define the range of URLs that are added to the transaction log. While the following discussion describes the invention in terms of crawling the World Wide Web, the present invention is not limited to the Internet, an intranet, or the Web and may be used in any application where electronic information is retrieved from a local computer or a computer network.

Besides retrieving documents originally placed in the transaction log, the gatherer process recursively gathers document URLs referenced in hyperlinks in retrieved documents and inserts those URLs (subject to the gathering rules) into the transaction log so that they also will be retrieved by the gatherer process during the gathering project, or Web crawl. The Web crawl is complete when every URL in the transaction log has been visited. A URL can be thought of as an address on the network where the Web document is located. If the gatherer process is able to retrieve the Web document at the URL listed in the transaction log, the document data is retrieved and processed as discussed in detail below. As used herein, the term "Web document" or "document" refers to all data resources available to the gatherer process during the Web crawl. Examples of data resources are files, HTML documents, database rows, mail messages and meta-documents such as file system directories and mail folders.
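The crawl described in the two preceding paragraphs can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented method: the network is replaced by an in-memory dictionary (FAKE_WEB), the rule check is a single hypothetical function (allowed), the transaction log is modeled as a simple in-memory queue, and a set of already-queued URLs is added so the loop terminates when pages link to each other, a detail this passage does not spell out.

```python
from collections import deque

# Stand-in for the network: URL -> (content, hyperlinks). Purely illustrative.
FAKE_WEB = {
    "http://site.example/": (
        "home",
        ["http://site.example/a", "http://other.example/x"],
    ),
    "http://site.example/a": (
        "page a",
        ["http://site.example/", "http://site.example/missing"],
    ),
}

def allowed(url):
    """Hypothetical crawl rule: only URLs on one host are in scope."""
    return url.startswith("http://site.example/")

def crawl(start_urls):
    # The transaction log is modeled as a FIFO queue of URLs still to visit.
    log = deque(u for u in start_urls if allowed(u))
    queued = set(log)  # assumption: never queue the same URL twice
    retrieved = []
    while log:
        url = log.popleft()
        page = FAKE_WEB.get(url)
        if page is None:
            continue  # the document could not be retrieved
        content, links = page
        retrieved.append(url)
        # Hyperlinks found in the document are checked against the rules
        # before being inserted into the transaction log.
        for link in links:
            if allowed(link) and link not in queued:
                queued.add(link)
                log.append(link)
    return retrieved

visited = crawl(["http://site.example/"])
```

The crawl ends when the queue is empty, which corresponds to every URL in the transaction log having been visited. The out-of-scope URL is never queued, and the URL that cannot be retrieved is visited but yields no document.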

In accordance with the present invention, the gatherer process executes on a computer, preferably a general-purpose personal computer. FIG. 1 and the following discussion are intended to provide a brief, general description of a suitable computing environment in which the invention may be implemented. Although not required, the invention will be described in the general context of computer-executable instructions, such as program modules, being executed by a personal computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. The invention may also be practiced in distributed computing environments where