WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
System and method for associating an extensible set of data with documents downloaded by a web crawler    

Get related patents on CD
United States Patent6351755   
Link to this pagehttp://www.wikipatents.com/6351755.html
Inventor(s)Najork; Marc Alexander (Palo Alto, CA); Heydon; Clark Allan (San Francisco, CA)
AbstractA web crawler downloads documents from among a plurality of host computers. The web crawler enqueues document addresses in a data structure called the Frontier. The Frontier generally includes a set of queues, with all document addresses sharing a respective common host component being stored in a respective common one of the queues. Multiple threads substantially concurrently process the document addresses in the queues. The web crawler includes a set of tools for storing an extensible set of data with each document address in the Frontier. These tools enable the applications to which the web crawler passes downloaded documents to store a record of information associated with each download, where each record of information includes an extensible set of name/value pairs specified by the applications. The applications also determine how many records of information to retain for each document, when to delete records of information, and so on. In another aspect of the present invention, the Frontier include a set of parallel "priority queues," each associated with a distinct priority level. Queue elements for documents to be downloaded are assigned a priority level, and then stored in the corresponding priority queue. Queue elements are then distributed from the priority queues to a set of underlying queues in accordance with their relative priorities. The threads then process the queue elements in the underlying queues.
   














 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History Custom Search
Drawing from US Patent 6351755
System and method for associating an extensible set of data with documents

     downloaded by a web crawler - US Patent 6351755 Drawing
System and method for associating an extensible set of data with documents downloaded by a web crawler
Inventor     Najork; Marc Alexander (Palo Alto, CA); Heydon; Clark Allan (San Francisco, CA)
Owner/Assignee     Alta Vista Company (Palo Alto, CA)
Patent assignment
All assignments
Company News
Publication Date     February 26, 2002
Application Number     09/433,006
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     November 2, 1999
US Classification     715/501.1 715/513
Int'l Classification     G06F 017/21
Examiner     Hong; Stephen S.
Assistant Examiner    
Attorney/Law Firm     Pennie & Edmonds LLP
Address
Parent Case    
Priority Data    
USPTO Field of Search     707/3 707/4 707/5 707/102 707/501 707/513
Patent Tags     associating extensible set data documents downloaded web crawler
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
6145003
Sanu

Nov,2000

[0 after 0 votes]
6094649
Bowen

Jul,2000

[0 after 0 votes]
6038610
Belfiore
719/310
Mar,2000

[0 after 0 votes]
6006217
Lumsden
707/2
Dec,1999

[0 after 0 votes]
5944783
Nieten

Aug,1999

[0 after 0 votes]
5875446
Brown
707/3
Feb,1999

[0 after 0 votes]
5832494
Egger

Nov,1998

[0 after 0 votes]
5748954
Mauldin
707/10
May,1998

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B

[0 market size comments]
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%

[0 market share comments]
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%

[0 reasonable royalty comments]
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

[0 Guesstimation of Royalty Value Comments]
License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
[0 license availability comments]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
[0 owner/assignee comments]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

[0 competitive advantage comments]
Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

[0 commercial alternatives comments]
 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


What is claimed is:

1. A method of performing a continuous crawl for locating and downloading documents from among a plurality of host computers, comprising:

(a) obtaining at least one referring document set that includes addresses of one or more referred documents; each referred document address including a host component;

(b) enqueuing queue elements in a plurality of queues, each queue element denoting one of the referred document addresses; each queue element including a download history comprising zero or more records;

(c) substantially concurrently operating a plurality of threads;

(d) while operating each thread, repeatedly performing steps of:

(d1) identifying a queue element in a selected one of the queues, downloading a referred document corresponding to a referred document address in the identified queue element, and dequeuing the identified queue element;

(d2) adding a record to the queue element;

(d3) executing at least one application program, distinct from a web crawler application that performs the downloading and dequeuing, for processing the downloaded document, the at least one application program including instructions that store name/value pairs in the record added to the queue element, wherein the name of each name/value pair is specified by the at least one application program and the value of each name/value is determined by the at least one application program; and

(d4) storing the queue element, including the added record, in a predefined data structure for further processing.

2. The method of claim 1, wherein the at least one application program includes instructions for deleting a subset of the records in the queue element in accordance with predefined record deletion criteria.

3. The method of claim 2, wherein the at least one application program includes instructions for determining a number corresponding to how many records are in the queue element, and the predefined record deletion criteria include the number of records in the queue element exceeding a threshold value.

4. The method of claim 1, wherein the at least one application program includes instructions for reading the name/value pairs in at least one of the queue element and for conditionally performing an action based on the value in at least one of the name/value pairs read by the at least one application program.

5. The method of claim 1, wherein

the plurality of queues includes a plurality of parallel priority level queues, each having a distinct associated download priority level, the download priority level corresponding to a probability of the queue elements enqueued in the associated priority level queue therein being processed by the threads; and

step d4 includes determining a download priority level for the document associated with the queue element as a function of the download history of the queue element.

6. The method of claim 5, wherein

the name/value pairs stored in each of the records in the queue element include at least one content based value which can be compared with a corresponding content based value in another of the records to determine whether the document's content changed between the downloads of the document corresponding to the records in which the content based values are stored;

step d4 includes determining a download priority level for the document associated with the queue element as a function of whether the content based value in a last one of the records in the queue element is not equal to the corresponding content based value in an earlier one of the records in the queue element.

7. The method of claim 6, wherein the content base value is a checksum of the contents of the document corresponding to the queue element.

8. The method of claim 5, wherein

the name/value pairs stored in each of the records in the queue element include a purported expiration date and time; and

step d4 includes comparing the purported expiration date and time with at least one other date and time value and assigning the queue element a download priority level in accordance with an outcome of the comparison.

9. The method of claim 1, wherein the name/value pairs stored in each of the records by the at least one application program are dynamically extensible by the at least one application program.

10. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising:

an enqueuing module that, when executed by the computer system, obtains at least one referring document that includes addresses of one or more referred documents, each referred document address including a host component corresponding to a host computer, and enqueues queue elements in a plurality of queues, each queue element denoting one of the referred document addresses; each queue element including a download history comprising zero or more records; and

a dequeuing module that is substantially concurrently executed by each of a plurality of threads so as to process the referred document addresses in the queues; the dequeuing module, when executed by a respective one of the threads, repeatedly performs the functions of

(a1) identifying a queue element in a selected one of the queues, downloading a referred document corresponding to a referred document address in the identified queue element, and dequeuing the identified queue element;

(a2) adding a record to the queue element;

(a3) executing at least one application program, distinct from the enqueuing module and dequeuing module, for processing the downloaded document, the at least one application program including instructions that store name/value pairs in the record added to the queue element, wherein the name of each name/value pair is specified by the at least one application program and the value of each name/value is determined by the at least one application program; and

(a4) storing the queue element, including the added record, in a predefined data structure for further processing.

11. The computer program product of claim 10, wherein said enqueuing module is configured to use at least one of the downloaded referred documents as a new referring document.

12. The computer program product of claim 10, wherein the at least one application program includes instructions for deleting a subset of the records in the queue element in accordance with predefined record deletion criteria.

13. The computer program product of claim 10, wherein the at least one application program includes instructions for determining a number corresponding to how many records are in the queue element, and the predefined record deletion criteria include the number of records in the queue element exceeding a threshold value.

14. The computer program product of claim 10, wherein the at least one application program includes instructions for reading the name/value pairs in at least one of the records in the queue element and for conditionally performing an action based on the value in at least one of the name/value pairs read by the at least one application program.

15. The computer program product of claim 10, wherein

the plurality of queues includes a plurality of parallel priority level queues, each having a distinct associated download priority level, the download priority level corresponding to a probability of the queue elements enqueued in the associated priority level queue therein being processed by the threads; and

the computer program product includes instructions for determining a download priority level for the document associated with the queue element as a function of the download history of the queue element.

16. The computer program product of claim 15, wherein

the name/value pairs stored in each of the records in the queue element include at least one content based value which can be compared with a corresponding content based value in another of the records to determine whether the document's content changed between the downloads of the document corresponding to the records in which the content based values are stored; and

the computer program product includes instructions for determining a download priority level for the document associated with the queue element as a function of whether the content based value in a last one of the records in the queue element is not equal to the corresponding content based value in an earlier one of the records in the queue element.

17. The computer program product of claim 16, wherein the content base value is a checksum of the contents of the document corresponding to the queue element.

18. The computer program product of claim 15, wherein

the name/value pairs stored in each of the records in the queue element include a purported expiration date and time; and

the computer program product includes instructions for comparing the purported expiration date and time with at least one other date and time value and assigning the queue element a download priority level in accordance with an outcome of the comparison.

19. The computer program product of claim 10, wherein the name/value pairs stored in each of the records by the at least one application program are dynamically extensible by the at least one application program.

20. A computer system for downloading documents from among a plurality of host computers, comprising:

a plurality of threads of execution;

an enqueuing module that, when executed by the computer system, obtains at least one referring document that includes addresses of one or more referred documents, each referred document address including a host component corresponding to a host computer, and enqueues queue elements in a plurality of queues, each queue element denoting one of the referred document addresses; each queue element including a download history comprising zero or more records; and

a dequeuing module that is substantially concurrently executed by each of the plurality of threads so as to process the referred document addresses in the queues; the dequeuing module, when executed by a respective one of the threads, repeatedly performs the functions of:

(a1) identifying a queue element in a selected one of the queues, downloading a referred document corresponding to a referred document address in the identified queue element, and dequeuing the identified queue element;

(a2) adding a record to the queue element;

(a3) executing at least one application program, distinct from the enqueuing module and dequeuing module, for processing the downloaded document, the at least one application program including instructions that store name/value pairs in the record added to the queue element, wherein the name of each name/value pair is specified by the at least one application program and the value of each name/value is determined by the at least one application program; and

(a4) storing the queue element, including the added record, in a predefined data structure for further processing.

21. The computer system of claim 20, wherein said enqueuing module is configured to use at least one of the downloaded referred documents as a new referring document.

22. The computer system of claim 20, wherein the at least one application program includes instructions for deleting a subset of the records in the queue element in accordance with predefined record deletion criteria.

23. The computer system of claim 20, wherein the at least one application program includes instructions for determining a number corresponding to how many records are in the queue element, and the predefined record deletion criteria include the number of records in the queue element exceeding a threshold value.

24. The computer system of claim 20, wherein the at least one application program includes instructions for reading the name/value pairs in at least one of the records in the queue element and for conditionally performing an action based on the value in at least one of the name/value pairs read by the at least one application program.

25. The computer system of claim 20, wherein

the plurality of queues includes a plurality of parallel priority level queues, each having a distinct associated download priority level, the download priority level corresponding to a probability of the queue elements enqueued in the associated priority level queue therein being processed by the threads; and

the at least one application program includes instructions for determining a download priority level for the document associated with the queue element as a function of the download history of the queue element.

26. The computer system of claim 25, wherein

the name/value pairs stored in each of the records in the queue element include at least one content based value which can be compared with a corresponding content based value in another of the records to determine whether the document's content changed between the downloads of the document corresponding to the records in which the content based values are stored; and

the at least one application program includes instructions for determining a download priority level for the document associated with the queue element as a function of whether the content based value in a last one of the records in the queue element is not equal to the corresponding content based value in an earlier one of the records in the queue element.

27. The computer system of claim 25, wherein the content base value is a checksum of the contents of the document corresponding to the queue element.

28. The computer system of claim 25, wherein

the name/value pairs stored in each of the records in the queue element include a purported expiration date and time; and

the at least one application program includes instructions for comparing the purported expiration date and time with at least one other date and time value and assigning the queue element a download priority level in accordance with an outcome of the comparison.

29. The computer system of claim 20, wherein the name/value pairs stored in each of the records by at the least one application program are dynamically extensible by the at least one application program.
 Description Submit all comments and votes
 


The present invention relates to a system and method for accessing documents, called web pages, on the world wide web (WWW) and, more particularly, to a method for associating an extensible set of data with each document downloaded by a web crawler.

BACKGROUND OF THE INVENTION

Documents on interconnected computer networks are typically stored on numerous host computers that are connected over the networks. For example, so-called "web pages" are stored on the global computer network known as the Internet, which includes the world wide web. Each web page on the world wide web has a distinct address called its uniform resource locator (URL), which identifies the location of the web page. Most of the documents on the world wide web are written in standard document description languages (e.g., HTML, XML). These languages allow an author of a document to create hypertext links to other documents. Hypertext links allow a reader of a web page to quickly move to other web pages by clicking on their respective links. These links are typically highlighted in the original web page. A web page containing hypertext links to other web pages generally refers to those pages by their URL's. Links in a web page may refer to web pages that are stored in the same or different host computers.

A web crawler is a program that automatically finds and downloads documents from host computers in networks such as the world wide web. When a web crawler is given a set of starting URL's, the web crawler downloads the corresponding documents, extracts any URL's contained in those downloaded documents and downloads more documents using the newly discovered URL's. This process repeats indefinitely or until a predetermined stop condition occurs. As of 1999 there were approximately 500 million web pages on the world wide web and the number is continuously growing; thus, web crawlers need efficient data structures to keep track of downloaded documents and any discovered addresses of documents to be downloaded.

Collecting Information About Documents Downloaded by a Web Crawler

After a document is downloaded by the web crawler, the web crawler may extract and store information about the downloaded page. For instance, the web crawler may determine if the downloaded page contains any new URL's not previously known to the web crawler, and may enqueue those URL's for later processing. In addition, pages downloaded by the web crawler may be processed by a sequence of processing modules. For instance, one processing module might determine whether the document has already been included in a web page index, and whether the page has changed by more than a predefined amount since its entry in the web page index was last updated. Another processing module might add or update a document's entry in the web page index. Yet another processing module might look for information of a specific type in the downloaded documents, extract the information and store it in a directory or other data structure.

During the course of processing a downloaded document, various data can be collected about it. Examples include the date and time of the download, how long it took to perform the download, whether the download was successful, the document's size, its MIME type, the date and time it was last modified, its expiration date and time, and a checksum of its contents. These data can be used for a variety of purposes, including, but not limited to:

passing information from one processing module to a later processing module in a processing pipeline;

collecting statistics about the downloaded documents; and

in the context of a continuous web crawler, the collected data can be used as a basis for determining when a document should next be downloaded (refreshed).

After a document has been processed, its associated data can be saved to disk and analyzed off line.

A continuous web crawler is one that automatically refreshes a database of information about the pages it has downloaded. A web page can have an assigned or purported expiration date and time, which indicates when the page should be assumed to be no longer valid. Furthermore, a web crawler can be configured to assume that certain types of pages, such as pages on certain types of web sites, cannot be valid for more that a particular length of time. Thus, pages on a news web site might be assumed to be valid for only a few hours, while pages of an online encyclopedia might be assumed to be valid for a much longer time, such as month.

In the context of a continuous web crawler, it may be advantageous to record not only the data associated with a document's most recent download, but also with its previous downloads. How complete a document download history to keep may vary depending on the user's requirements.

The Scooter (a trademark of AltaVista Company) web crawler saves a fixed set of data for each document it discovers and downloads, namely, the document's URL, the number of attempts that have been made to download it, the date and time of the last download attempt, the HTTP status code of the last download, and the document's last modification date and time.

The Sphinx web crawler developed by Bharat and Miller allows document classifiers to associate name/value pairs with a downloaded page. However, Sphinx discards any name/value pairs associated with a document once the document has been processed. Moreover, the values must be strings, not values of arbitrary types.

It would be desirable to provide a much more flexible mechanism that enables application programs that process downloaded pages to determine what information to save for each document downloaded. In that way the data structure for storing such information would be dynamically determined, and the manner in which that information is used would be dynamically determined, without having to customize the code of the web crawler for each application.

Prioritizing Document Downloads

Every web crawler must maintain a data structure or set of data structures reflecting the set of URL's that still must be downloaded. In this document, that set of data structures is called "the Frontier." The crawler repeatedly selects a URL from the Frontier, downloads the corresponding document, processes the downloaded document, and then either removes the URL from the Frontier or reschedules it for downloading again at a later time. The latter scheme is used for so-called "continuous" web crawlers.

When selecting a URL from the Frontier, the inventors have determined that it would often be desirable for the crawler to preferentially select certain URL's over others so as to maximize the quality of the information processed by the other applications to which the web crawler passes downloaded documents. For instance, the web crawler may pass downloaded pages to a document indexer. An index of documents on an Intranet or the Internet will be more accurate or higher quality if the documents of most interest to the users of the index have been preferentially updated so as to make sure that those documents are accurately represented in the index. To accomplish this, the web crawler might preferentially select URL's on web servers with known high quality content. Alternately, heuristics might be used to gauge page quality. For instance, shorter URL's might be considered to be better candidates than longer URL's.

In the context of a continuous web crawler, it may be desirable to prefer URL's on web servers whose content is known to change rapidly, such as news sites. It may be desirable to prefer newly-discovered URL's over those that have been previously processed. Among the previously processed URL's, it may be advantageous to prefer URL's whose content has changed between the previous two downloads over URL's whose content has not changed, and to prefer URL's with shorter expiration dates over those with longer expiration dates.

Maintaining Freshness of Documents Downloaded by a Continuous Web Crawler

As alluded to earlier, web crawlers are traditionally used to collect documents from the world wide web, as well as from Intranets, for some purpose, the most common of which is to build an index for a search engine. However, since many of the documents on the web and on Intranets change over time, at any given point in time, some fraction of any web index will contain stale content.

There are two obvious approaches to refreshing an index. One is to perform repeated complete or "scratch" crawls to rebuild the index from scratch. The disadvantage of this approach is that many of the documents may not have changed between the two scratch crawls, in which case valuable computer resources will be wasted unnecessarily refetching and processing documents. Another approach is to perform a more targeted crawl, but it is difficult to know a priori which documents need to be refetched, since the web does not include an invalidation mechanism. That is, the only way to discover that a page has changed is to query its web server.

Therefore it would be desirable to have a mechanism for keeping the results of a crawl up to date, using a continuous crawl that is somehow biased toward pages that are most likely to have been changed since the last time the crawler fetched them.

SUMMARY OF THE INVENTION

A web crawler downloads documents from among a plurality of host computers. The web crawler enqueues document addresses in a data structure called the Frontier. The Frontier generally includes a set of queues, with all document addresses sharing a respective common host component being stored in a respective common one of the queues. Multiple threads substantially concurrently process the document addresses in the queues.

The web crawler includes a set of tools for storing an extensible set of data with each document address (URL) in the Frontier. These tools enable the applications to which the web crawler passes downloaded documents to store a record of information associated with each download, where each record of information includes a set of name/value pairs specified by the applications. The applications also determine how many records of information to retain for each URL, when to delete records of information, and so on.

In another aspect of the present invention, the Frontier includes a set of parallel "priority queues," each associated with a distinct priority level. Queue elements for URL's to be downloaded are assigned a priority level, and then stored in the corresponding priority queue. Queue elements are then distributed from the priority queues to a set of underlying queues in accordance with their relative priorities. The threads then process the queue elements in the underlying queues.

In yet another aspect of the present invention, the web crawler performs a continuous crawl. The URL element for each downloaded document is assigned a priority level and then reinserted into the Frontier, in the priority queue corresponding to the assigned priority level. The priority level is determined as a function of the extensible set of data stored with the queue element. Each queue element for a newly found URL is also assigned a priority level. That priority level is based on the fact that it is a newly found URL and may also be based on properties of the URL itself, or the web page on which the URL was found.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a distributed computer system illustrating an exemplary embodiment of the invention.

FIG. 2 is a block diagram illustrating an first exemplary embodiment of the invention.

FIG. 3 is a block diagram of a queue element stored in the Frontier data structures of the first exemplary embodiment.

FIGS. 4 and 5 are flow charts depicting the first exemplary embodiment of the invention.

FIGS. 6 and 7 are block diagrams illustrating the Frontier data structures used in a second exemplary embodiment of the invention.

FIGS. 8A and 8B are flow charts depicting the second exemplary embodiment of the invention.

FIG. 9 is block diagram illustrating the Frontier data structures used in a third exemplary embodiment of the invention.

FIG. 10 illustrates a table used in the third exemplary embodiment.

FIG. 11 is a block diagram of an ordered set data structure and procedures used to access the ordered set in the third exemplary embodiment of the invention.

FIGS. 12, 13, 14, 15 and 16 are flow charts depicting the third exemplary embodiment of the invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

FIG. 1 shows an exemplary embodiment of a distributed computer system 100. The distributed computer system 100 includes a web crawler 102 connected to a network 103 through a network interconnection 110. The network 103 may be a global communication network, such as the Internet, or a private network, sometimes called an Intranet. Examples of the network interconnection 110 include switches, routers, etc.

The network 103 includes web servers 112 and a service known as a domain name system 114. It may also optionally include a web page indexing system 116. The web servers 112 store web pages. The domain name system 114 is a distributed database that provides the mapping between Internet protocol (IP) addresses and host names. The domain name system 114 is a distributed system because no single site on the Internet has the domain name mapping information for all the web servers in the network. Each site participating in the domain name system 114 maintains its own database of information and runs a server program that other systems across the network can query. The domain name system 114 provides the protocol that allows clients and servers to communicate with each other. Any application may look up the IP address (or addresses) corresponding to a given host name or the host name corresponding to a given IP address in the domain name system 114. An application accesses the domain name system 114 through a resolver. The resolver contacts one or more name servers to perform a mapping of a host name to the corresponding IP address, or vice versa. A given host name may be associated with more than one IP address because a host may have multiple interfaces, with each interface of the host having a unique IP address. Also, a host may be replicated on multiple computers, each having its own IP address, but providing access to the same information.

The web page indexing system 116 includes an index of words used on the world wide web and addresses of the web pages that use each word. Such indexing systems are maintained by various search engines, such as the AltaVista search engine. The domain name system 114 and the web page indexing system 116 may be accessed by the web crawler 102 in the process of downloading web pages from the world wide web.

The web crawler 102 includes a communications interface 104, one or more central processing units (CPU's) 106, a clock circuit 107 for keeping track of the current time, an operator interface 108 (which may be remotely located on another computer) and memory 118. In the preferred embodiment, the communications interface 104 is able to handle overlapping communication requests. The memory 118 includes:

a multitasking operating system 120;

an Internet access procedure 122 for fetching web pages as well as communicating with the domain name system 114;

a multiplexer (mux) procedure 124 used by threads 130 for dequeuing URL's from the queues 128;

a demultiplexer (demux) procedure 126 used by the threads for enqueuing URL's on the queues 128;

a set of queues 128, also called the "Frontier," for storing addresses of web pages to be downloaded;

threads 130 for downloading web pages from the servers 112, and processing the downloaded web pages;

a host-to-queue assignment table 132 for recording dynamic assignments of host identifiers to the queues 128;

a heap or other ordered set data structure 134 for storing information about queues waiting to be serviced by threads;

a set of heap procedures 136 for adding a queue to, and for selecting a queue from the ordered set data structure 134;

a set of Queue Element handling procedures 138 for adding and deleting records of information to queue elements, and for adding and deleting name/value pairs to those records of information;

one or more URL priority determination procedures 140 for assigning a priority level to a queue element associated with a URL; and

one or more document processing applications 141, which process documents downloaded by the web crawler.

The document processing applications include instructions 139 for determining the value of various parameters (e.g., metadata sent by the host server from which the documents were downloaded) and storing corresponding name/value pairs in the download history portion of the queue elements corresponding to the downloaded documents.

In the third exemplary embodiment, discussed below, the host-to-queue assignment table 132 is used and updated by the demux and mux procedures 126, 124. In the first and second exemplary embodiments the assignment table 132 is not used.

In some of the exemplary embodiments the number of queues exceeds the number of threads, and in those embodiments the number of queues is preferably at least twice the number of threads; in some embodiments the number of queues exceeds the number of threads by a factor of three to ten. The number of threads is generally determined by the computational resources of the web crawler, while the number of queues is determined by setting a queue-to-thread ratio parameter when the web crawler is configured.

Given a set of URL's, the web crawler 102 enqueues the URL's into appropriate queues 128. Multiple threads 130 are used to dequeue URL's out of the queues 128, to download the corresponding documents or web pages from the world wide web and to extract any new URL's from the downloaded documents. Any new URL's are enqueued into the queues 128. This process repeats indefinitely or until a predetermined stop condition occurs, such as when all URL's in the queues have been processed and thus all the queues are empty. In continuous web crawler embodiments, there is no such stop condition. Multiple threads 130 are used to simultaneously enqueue and dequeue URL's from multiple queues 128. During the described process, the operating system 120 executes an Internet access procedure 122 to access hosts on the network through the communications interface 104.

FIG. 2 illustrates the relationships between a set of "m" first-in-first-out (FIFO) queues 128 and the demux and mux procedures 126, 124 in a first exemplary embodiment of the present invention. When a new URL is discovered, the new URL is passed to the demux 126. The demux 126 enqueues the new URL into an appropriate queue based on a predetermined policy. In the preferred embodiments, URL's having the same associated host component will be enqueued into the same queue. However, other URL to queue assignment policies could also be used. When a thread 130 is ready to dequeue from one of the queues 128, the head URL in the queue assigned to that thread is dequeued from that queue by the mux 124 and is passed to the thread for processing.

Queue Elements with Extensible Set of Download History Data

FIG. 3 illustrates a queue element data structure 142, also called the URL entry data structure, which is the data structure used to represent each URL in the Frontier, represented in this embodiment by queues 128. Each queue element 142 includes a URL value 144, and a list (i.e., an ordered set) of information records 148. Each record 148 includes one or more name/value pairs 149 for a particular download of the document corresponding to the URL 144, where the names identify parameters and the values are the corresponding values for those parameters. In addition to the records 148, the queue element 142 may also include a header 146 for retaining cumulative download history information, such as a count of the number of downloads of the corresponding document by the web crawler, a count of the number of download attempts, and the like. This information could also be kept in the records, with increasing count values being stored in successive records 148. The list of records associated with a URL together comprise the URL's download history.

The set of queue element handling procedures 138 that can be used by the web crawler, and more particularly by document processing applications 141 which process the pages downloaded by the web crawler, include but are not limited to the following:

Size( ) returns the number of records in the list, for the currently selected queue element;

Get(i) returns the record at position i in the list;

Delete(i) removes the record at position i from the list; compacting the list accordingly;

Add(record) inserts the given record at the front of the list; as well as procedures that operate on a particular record, including:

Lookup(name) returns the value from the name/value pair, if a matching pair is found;

Set(name, value) adds a name/value pair to the record consisting of the given name and given value, and replaces any previous pair with the identical name;

Delete(name) removes the name/value pair with the given name from the record, if a matching pair is found; and

Enumerate( ) returns a list of the name/value pairs in the record.

As will be described in more detail below, when a queue element is removed from the Frontier, a new empty record is added to its download history, representing the imminent download attempt. The document identified by the queue element's URL is downloaded and processed. During the course of processing a document, all records of the corresponding queue element's download history may be inspected, and name/value pairs may be set in the element's newly added record.

In the case of a continuous crawl, the queue element is reinserted into the Frontier. Before the queue element is reinserted, one or more of its records maybe removed. If no records are removed, the document's complete download history is kept. Other alternatives include, but are not limited to: keeping the "p" most recent records; keeping a uniform sample of records (e.g., for every third download); keeping a random sample of records (e.g., each record might be kept with a probabi