|
Claims  |
|
|
What is claimed is:
1. A method of performing a continuous crawl for locating and downloading
documents from among a plurality of host computers, comprising:
(a) obtaining at least one referring document set that includes addresses
of one or more referred documents; each referred document address
including a host component;
(b) enqueuing queue elements in a plurality of queues, each queue element
denoting one of the referred document addresses; each queue element
including a download history comprising zero or more records;
(c) substantially concurrently operating a plurality of threads;
(d) while operating each thread, repeatedly performing steps of:
(d1) identifying a queue element in a selected one of the queues,
downloading a referred document corresponding to a referred document
address in the identified queue element, and dequeuing the identified
queue element;
(d2) adding a record to the queue element;
(d3) executing at least one application program, distinct from a web
crawler application that performs the downloading and dequeuing, for
processing the downloaded document, the at least one application program
including instructions that store name/value pairs in the record added to
the queue element, wherein the name of each name/value pair is specified
by the at least one application program and the value of each name/value
is determined by the at least one application program; and
(d4) storing the queue element, including the added record, in a predefined
data structure for further processing.
2. The method of claim 1, wherein the at least one application program
includes instructions for deleting a subset of the records in the queue
element in accordance with predefined record deletion criteria.
3. The method of claim 2, wherein the at least one application program
includes instructions for determining a number corresponding to how many
records are in the queue element, and the predefined record deletion
criteria include the number of records in the queue element exceeding a
threshold value.
4. The method of claim 1, wherein the at least one application program
includes instructions for reading the name/value pairs in at least one of
the queue element and for conditionally performing an action based on the
value in at least one of the name/value pairs read by the at least one
application program.
5. The method of claim 1, wherein
the plurality of queues includes a plurality of parallel priority level
queues, each having a distinct associated download priority level, the
download priority level corresponding to a probability of the queue
elements enqueued in the associated priority level queue therein being
processed by the threads; and
step d4 includes determining a download priority level for the document
associated with the queue element as a function of the download history of
the queue element.
6. The method of claim 5, wherein
the name/value pairs stored in each of the records in the queue element
include at least one content based value which can be compared with a
corresponding content based value in another of the records to determine
whether the document's content changed between the downloads of the
document corresponding to the records in which the content based values
are stored;
step d4 includes determining a download priority level for the document
associated with the queue element as a function of whether the content
based value in a last one of the records in the queue element is not equal
to the corresponding content based value in an earlier one of the records
in the queue element.
7. The method of claim 6, wherein the content base value is a checksum of
the contents of the document corresponding to the queue element.
8. The method of claim 5, wherein
the name/value pairs stored in each of the records in the queue element
include a purported expiration date and time; and
step d4 includes comparing the purported expiration date and time with at
least one other date and time value and assigning the queue element a
download priority level in accordance with an outcome of the comparison.
9. The method of claim 1, wherein the name/value pairs stored in each of
the records by the at least one application program are dynamically
extensible by the at least one application program.
10. A computer program product for use in conjunction with a computer
system, the computer program product comprising a computer readable
storage medium and a computer program mechanism embedded therein, the
computer program mechanism comprising:
an enqueuing module that, when executed by the computer system, obtains at
least one referring document that includes addresses of one or more
referred documents, each referred document address including a host
component corresponding to a host computer, and enqueues queue elements in
a plurality of queues, each queue element denoting one of the referred
document addresses; each queue element including a download history
comprising zero or more records; and
a dequeuing module that is substantially concurrently executed by each of a
plurality of threads so as to process the referred document addresses in
the queues; the dequeuing module, when executed by a respective one of the
threads, repeatedly performs the functions of
(a1) identifying a queue element in a selected one of the queues,
downloading a referred document corresponding to a referred document
address in the identified queue element, and dequeuing the identified
queue element;
(a2) adding a record to the queue element;
(a3) executing at least one application program, distinct from the
enqueuing module and dequeuing module, for processing the downloaded
document, the at least one application program including instructions that
store name/value pairs in the record added to the queue element, wherein
the name of each name/value pair is specified by the at least one
application program and the value of each name/value is determined by the
at least one application program; and
(a4) storing the queue element, including the added record, in a predefined
data structure for further processing.
11. The computer program product of claim 10, wherein said enqueuing module
is configured to use at least one of the downloaded referred documents as
a new referring document.
12. The computer program product of claim 10, wherein the at least one
application program includes instructions for deleting a subset of the
records in the queue element in accordance with predefined record deletion
criteria.
13. The computer program product of claim 10, wherein the at least one
application program includes instructions for determining a number
corresponding to how many records are in the queue element, and the
predefined record deletion criteria include the number of records in the
queue element exceeding a threshold value.
14. The computer program product of claim 10, wherein the at least one
application program includes instructions for reading the name/value pairs
in at least one of the records in the queue element and for conditionally
performing an action based on the value in at least one of the name/value
pairs read by the at least one application program.
15. The computer program product of claim 10, wherein
the plurality of queues includes a plurality of parallel priority level
queues, each having a distinct associated download priority level, the
download priority level corresponding to a probability of the queue
elements enqueued in the associated priority level queue therein being
processed by the threads; and
the computer program product includes instructions for determining a
download priority level for the document associated with the queue element
as a function of the download history of the queue element.
16. The computer program product of claim 15, wherein
the name/value pairs stored in each of the records in the queue element
include at least one content based value which can be compared with a
corresponding content based value in another of the records to determine
whether the document's content changed between the downloads of the
document corresponding to the records in which the content based values
are stored; and
the computer program product includes instructions for determining a
download priority level for the document associated with the queue element
as a function of whether the content based value in a last one of the
records in the queue element is not equal to the corresponding content
based value in an earlier one of the records in the queue element.
17. The computer program product of claim 16, wherein the content base
value is a checksum of the contents of the document corresponding to the
queue element.
18. The computer program product of claim 15, wherein
the name/value pairs stored in each of the records in the queue element
include a purported expiration date and time; and
the computer program product includes instructions for comparing the
purported expiration date and time with at least one other date and time
value and assigning the queue element a download priority level in
accordance with an outcome of the comparison.
19. The computer program product of claim 10, wherein the name/value pairs
stored in each of the records by the at least one application program are
dynamically extensible by the at least one application program.
20. A computer system for downloading documents from among a plurality of
host computers, comprising:
a plurality of threads of execution;
an enqueuing module that, when executed by the computer system, obtains at
least one referring document that includes addresses of one or more
referred documents, each referred document address including a host
component corresponding to a host computer, and enqueues queue elements in
a plurality of queues, each queue element denoting one of the referred
document addresses; each queue element including a download history
comprising zero or more records; and
a dequeuing module that is substantially concurrently executed by each of
the plurality of threads so as to process the referred document addresses
in the queues; the dequeuing module, when executed by a respective one of
the threads, repeatedly performs the functions of:
(a1) identifying a queue element in a selected one of the queues,
downloading a referred document corresponding to a referred document
address in the identified queue element, and dequeuing the identified
queue element;
(a2) adding a record to the queue element;
(a3) executing at least one application program, distinct from the
enqueuing module and dequeuing module, for processing the downloaded
document, the at least one application program including instructions that
store name/value pairs in the record added to the queue element, wherein
the name of each name/value pair is specified by the at least one
application program and the value of each name/value is determined by the
at least one application program; and
(a4) storing the queue element, including the added record, in a predefined
data structure for further processing.
21. The computer system of claim 20, wherein said enqueuing module is
configured to use at least one of the downloaded referred documents as a
new referring document.
22. The computer system of claim 20, wherein the at least one application
program includes instructions for deleting a subset of the records in the
queue element in accordance with predefined record deletion criteria.
23. The computer system of claim 20, wherein the at least one application
program includes instructions for determining a number corresponding to
how many records are in the queue element, and the predefined record
deletion criteria include the number of records in the queue element
exceeding a threshold value.
24. The computer system of claim 20, wherein the at least one application
program includes instructions for reading the name/value pairs in at least
one of the records in the queue element and for conditionally performing
an action based on the value in at least one of the name/value pairs read
by the at least one application program.
25. The computer system of claim 20, wherein
the plurality of queues includes a plurality of parallel priority level
queues, each having a distinct associated download priority level, the
download priority level corresponding to a probability of the queue
elements enqueued in the associated priority level queue therein being
processed by the threads; and
the at least one application program includes instructions for determining
a download priority level for the document associated with the queue
element as a function of the download history of the queue element.
26. The computer system of claim 25, wherein
the name/value pairs stored in each of the records in the queue element
include at least one content based value which can be compared with a
corresponding content based value in another of the records to determine
whether the document's content changed between the downloads of the
document corresponding to the records in which the content based values
are stored; and
the at least one application program includes instructions for determining
a download priority level for the document associated with the queue
element as a function of whether the content based value in a last one of
the records in the queue element is not equal to the corresponding content
based value in an earlier one of the records in the queue element.
27. The computer system of claim 25, wherein the content base value is a
checksum of the contents of the document corresponding to the queue
element.
28. The computer system of claim 25, wherein
the name/value pairs stored in each of the records in the queue element
include a purported expiration date and time; and
the at least one application program includes instructions for comparing
the purported expiration date and time with at least one other date and
time value and assigning the queue element a download priority level in
accordance with an outcome of the comparison.
29. The computer system of claim 20, wherein the name/value pairs stored in
each of the records by at the least one application program are
dynamically extensible by the at least one application program. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
The present invention relates to a system and method for accessing
documents, called web pages, on the world wide web (WWW) and, more
particularly, to a method for associating an extensible set of data with
each document downloaded by a web crawler.
BACKGROUND OF THE INVENTION
Documents on interconnected computer networks are typically stored on
numerous host computers that are connected over the networks. For example,
so-called "web pages" are stored on the global computer network known as
the Internet, which includes the world wide web. Each web page on the
world wide web has a distinct address called its uniform resource locator
(URL), which identifies the location of the web page. Most of the
documents on the world wide web are written in standard document
description languages (e.g., HTML, XML). These languages allow an author
of a document to create hypertext links to other documents. Hypertext
links allow a reader of a web page to quickly move to other web pages by
clicking on their respective links. These links are typically highlighted
in the original web page. A web page containing hypertext links to other
web pages generally refers to those pages by their URL's. Links in a web
page may refer to web pages that are stored in the same or different host
computers.
A web crawler is a program that automatically finds and downloads documents
from host computers in networks such as the world wide web. When a web
crawler is given a set of starting URL's, the web crawler downloads the
corresponding documents, extracts any URL's contained in those downloaded
documents and downloads more documents using the newly discovered URL's.
This process repeats indefinitely or until a predetermined stop condition
occurs. As of 1999 there were approximately 500 million web pages on the
world wide web and the number is continuously growing; thus, web crawlers
need efficient data structures to keep track of downloaded documents and
any discovered addresses of documents to be downloaded.
Collecting Information About Documents Downloaded by a Web Crawler
After a document is downloaded by the web crawler, the web crawler may
extract and store information about the downloaded page. For instance, the
web crawler may determine if the downloaded page contains any new URL's
not previously known to the web crawler, and may enqueue those URL's for
later processing. In addition, pages downloaded by the web crawler may be
processed by a sequence of processing modules. For instance, one
processing module might determine whether the document has already been
included in a web page index, and whether the page has changed by more
than a predefined amount since its entry in the web page index was last
updated. Another processing module might add or update a document's entry
in the web page index. Yet another processing module might look for
information of a specific type in the downloaded documents, extract the
information and store it in a directory or other data structure.
During the course of processing a downloaded document, various data can be
collected about it. Examples include the date and time of the download,
how long it took to perform the download, whether the download was
successful, the document's size, its MIME type, the date and time it was
last modified, its expiration date and time, and a checksum of its
contents. These data can be used for a variety of purposes, including, but
not limited to:
passing information from one processing module to a later processing module
in a processing pipeline;
collecting statistics about the downloaded documents; and
in the context of a continuous web crawler, the collected data can be used
as a basis for determining when a document should next be downloaded
(refreshed).
After a document has been processed, its associated data can be saved to
disk and analyzed off line.
A continuous web crawler is one that automatically refreshes a database of
information about the pages it has downloaded. A web page can have an
assigned or purported expiration date and time, which indicates when the
page should be assumed to be no longer valid. Furthermore, a web crawler
can be configured to assume that certain types of pages, such as pages on
certain types of web sites, cannot be valid for more that a particular
length of time. Thus, pages on a news web site might be assumed to be
valid for only a few hours, while pages of an online encyclopedia might be
assumed to be valid for a much longer time, such as month.
In the context of a continuous web crawler, it may be advantageous to
record not only the data associated with a document's most recent
download, but also with its previous downloads. How complete a document
download history to keep may vary depending on the user's requirements.
The Scooter (a trademark of AltaVista Company) web crawler saves a fixed
set of data for each document it discovers and downloads, namely, the
document's URL, the number of attempts that have been made to download it,
the date and time of the last download attempt, the HTTP status code of
the last download, and the document's last modification date and time.
The Sphinx web crawler developed by Bharat and Miller allows document
classifiers to associate name/value pairs with a downloaded page. However,
Sphinx discards any name/value pairs associated with a document once the
document has been processed. Moreover, the values must be strings, not
values of arbitrary types.
It would be desirable to provide a much more flexible mechanism that
enables application programs that process downloaded pages to determine
what information to save for each document downloaded. In that way the
data structure for storing such information would be dynamically
determined, and the manner in which that information is used would be
dynamically determined, without having to customize the code of the web
crawler for each application.
Prioritizing Document Downloads
Every web crawler must maintain a data structure or set of data structures
reflecting the set of URL's that still must be downloaded. In this
document, that set of data structures is called "the Frontier." The
crawler repeatedly selects a URL from the Frontier, downloads the
corresponding document, processes the downloaded document, and then either
removes the URL from the Frontier or reschedules it for downloading again
at a later time. The latter scheme is used for so-called "continuous" web
crawlers.
When selecting a URL from the Frontier, the inventors have determined that
it would often be desirable for the crawler to preferentially select
certain URL's over others so as to maximize the quality of the information
processed by the other applications to which the web crawler passes
downloaded documents. For instance, the web crawler may pass downloaded
pages to a document indexer. An index of documents on an Intranet or the
Internet will be more accurate or higher quality if the documents of most
interest to the users of the index have been preferentially updated so as
to make sure that those documents are accurately represented in the index.
To accomplish this, the web crawler might preferentially select URL's on
web servers with known high quality content. Alternately, heuristics might
be used to gauge page quality. For instance, shorter URL's might be
considered to be better candidates than longer URL's.
In the context of a continuous web crawler, it may be desirable to prefer
URL's on web servers whose content is known to change rapidly, such as
news sites. It may be desirable to prefer newly-discovered URL's over
those that have been previously processed. Among the previously processed
URL's, it may be advantageous to prefer URL's whose content has changed
between the previous two downloads over URL's whose content has not
changed, and to prefer URL's with shorter expiration dates over those with
longer expiration dates.
Maintaining Freshness of Documents Downloaded by a Continuous Web Crawler
As alluded to earlier, web crawlers are traditionally used to collect
documents from the world wide web, as well as from Intranets, for some
purpose, the most common of which is to build an index for a search
engine. However, since many of the documents on the web and on Intranets
change over time, at any given point in time, some fraction of any web
index will contain stale content.
There are two obvious approaches to refreshing an index. One is to perform
repeated complete or "scratch" crawls to rebuild the index from scratch.
The disadvantage of this approach is that many of the documents may not
have changed between the two scratch crawls, in which case valuable
computer resources will be wasted unnecessarily refetching and processing
documents. Another approach is to perform a more targeted crawl, but it is
difficult to know a priori which documents need to be refetched, since the
web does not include an invalidation mechanism. That is, the only way to
discover that a page has changed is to query its web server.
Therefore it would be desirable to have a mechanism for keeping the results
of a crawl up to date, using a continuous crawl that is somehow biased
toward pages that are most likely to have been changed since the last time
the crawler fetched them.
SUMMARY OF THE INVENTION
A web crawler downloads documents from among a plurality of host computers.
The web crawler enqueues document addresses in a data structure called the
Frontier. The Frontier generally includes a set of queues, with all
document addresses sharing a respective common host component being stored
in a respective common one of the queues. Multiple threads substantially
concurrently process the document addresses in the queues.
The web crawler includes a set of tools for storing an extensible set of
data with each document address (URL) in the Frontier. These tools enable
the applications to which the web crawler passes downloaded documents to
store a record of information associated with each download, where each
record of information includes a set of name/value pairs specified by the
applications. The applications also determine how many records of
information to retain for each URL, when to delete records of information,
and so on.
In another aspect of the present invention, the Frontier includes a set of
parallel "priority queues," each associated with a distinct priority
level. Queue elements for URL's to be downloaded are assigned a priority
level, and then stored in the corresponding priority queue. Queue elements
are then distributed from the priority queues to a set of underlying
queues in accordance with their relative priorities. The threads then
process the queue elements in the underlying queues.
In yet another aspect of the present invention, the web crawler performs a
continuous crawl. The URL element for each downloaded document is assigned
a priority level and then reinserted into the Frontier, in the priority
queue corresponding to the assigned priority level. The priority level is
determined as a function of the extensible set of data stored with the
queue element. Each queue element for a newly found URL is also assigned a
priority level. That priority level is based on the fact that it is a
newly found URL and may also be based on properties of the URL itself, or
the web page on which the URL was found.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of a distributed computer system illustrating an
exemplary embodiment of the invention.
FIG. 2 is a block diagram illustrating an first exemplary embodiment of the
invention.
FIG. 3 is a block diagram of a queue element stored in the Frontier data
structures of the first exemplary embodiment.
FIGS. 4 and 5 are flow charts depicting the first exemplary embodiment of
the invention.
FIGS. 6 and 7 are block diagrams illustrating the Frontier data structures
used in a second exemplary embodiment of the invention.
FIGS. 8A and 8B are flow charts depicting the second exemplary embodiment
of the invention.
FIG. 9 is block diagram illustrating the Frontier data structures used in a
third exemplary embodiment of the invention.
FIG. 10 illustrates a table used in the third exemplary embodiment.
FIG. 11 is a block diagram of an ordered set data structure and procedures
used to access the ordered set in the third exemplary embodiment of the
invention.
FIGS. 12, 13, 14, 15 and 16 are flow charts depicting the third exemplary
embodiment of the invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
FIG. 1 shows an exemplary embodiment of a distributed computer system 100.
The distributed computer system 100 includes a web crawler 102 connected
to a network 103 through a network interconnection 110. The network 103
may be a global communication network, such as the Internet, or a private
network, sometimes called an Intranet. Examples of the network
interconnection 110 include switches, routers, etc.
The network 103 includes web servers 112 and a service known as a domain
name system 114. It may also optionally include a web page indexing system
116. The web servers 112 store web pages. The domain name system 114 is a
distributed database that provides the mapping between Internet protocol
(IP) addresses and host names. The domain name system 114 is a distributed
system because no single site on the Internet has the domain name mapping
information for all the web servers in the network. Each site
participating in the domain name system 114 maintains its own database of
information and runs a server program that other systems across the
network can query. The domain name system 114 provides the protocol that
allows clients and servers to communicate with each other. Any application
may look up the IP address (or addresses) corresponding to a given host
name or the host name corresponding to a given IP address in the domain
name system 114. An application accesses the domain name system 114
through a resolver. The resolver contacts one or more name servers to
perform a mapping of a host name to the corresponding IP address, or vice
versa. A given host name may be associated with more than one IP address
because a host may have multiple interfaces, with each interface of the
host having a unique IP address. Also, a host may be replicated on
multiple computers, each having its own IP address, but providing access
to the same information.
The web page indexing system 116 includes an index of words used on the
world wide web and addresses of the web pages that use each word. Such
indexing systems are maintained by various search engines, such as the
AltaVista search engine. The domain name system 114 and the web page
indexing system 116 may be accessed by the web crawler 102 in the process
of downloading web pages from the world wide web.
The web crawler 102 includes a communications interface 104, one or more
central processing units (CPU's) 106, a clock circuit 107 for keeping
track of the current time, an operator interface 108 (which may be
remotely located on another computer) and memory 118. In the preferred
embodiment, the communications interface 104 is able to handle overlapping
communication requests. The memory 118 includes:
a multitasking operating system 120;
an Internet access procedure 122 for fetching web pages as well as
communicating with the domain name system 114;
a multiplexer (mux) procedure 124 used by threads 130 for dequeuing URL's
from the queues 128;
a demultiplexer (demux) procedure 126 used by the threads for enqueuing
URL's on the queues 128;
a set of queues 128, also called the "Frontier," for storing addresses of
web pages to be downloaded;
threads 130 for downloading web pages from the servers 112, and processing
the downloaded web pages;
a host-to-queue assignment table 132 for recording dynamic assignments of
host identifiers to the queues 128;
a heap or other ordered set data structure 134 for storing information
about queues waiting to be serviced by threads;
a set of heap procedures 136 for adding a queue to, and for selecting a
queue from the ordered set data structure 134;
a set of Queue Element handling procedures 138 for adding and deleting
records of information to queue elements, and for adding and deleting
name/value pairs to those records of information;
one or more URL priority determination procedures 140 for assigning a
priority level to a queue element associated with a URL; and
one or more document processing applications 141, which process documents
downloaded by the web crawler.
The document processing applications include instructions 139 for
determining the value of various parameters (e.g., metadata sent by the
host server from which the documents were downloaded) and storing
corresponding name/value pairs in the download history portion of the
queue elements corresponding to the downloaded documents.
In the third exemplary embodiment, discussed below, the host-to-queue
assignment table 132 is used and updated by the demux and mux procedures
126, 124. In the first and second exemplary embodiments the assignment
table 132 is not used.
In some of the exemplary embodiments the number of queues exceeds the
number of threads, and in those embodiments the number of queues is
preferably at least twice the number of threads; in some embodiments the
number of queues exceeds the number of threads by a factor of three to
ten. The number of threads is generally determined by the computational
resources of the web crawler, while the number of queues is determined by
setting a queue-to-thread ratio parameter when the web crawler is
configured.
Given a set of URL's, the web crawler 102 enqueues the URL's into
appropriate queues 128. Multiple threads 130 are used to dequeue URL's out
of the queues 128, to download the corresponding documents or web pages
from the world wide web and to extract any new URL's from the downloaded
documents. Any new URL's are enqueued into the queues 128. This process
repeats indefinitely or until a predetermined stop condition occurs, such
as when all URL's in the queues have been processed and thus all the
queues are empty. In continuous web crawler embodiments, there is no such
stop condition. Multiple threads 130 are used to simultaneously enqueue
and dequeue URL's from multiple queues 128. During the described process,
the operating system 120 executes an Internet access procedure 122 to
access hosts on the network through the communications interface 104.
FIG. 2 illustrates the relationships between a set of "m"
first-in-first-out (FIFO) queues 128 and the demux and mux procedures 126,
124 in a first exemplary embodiment of the present invention. When a new
URL is discovered, the new URL is passed to the demux 126. The demux 126
enqueues the new URL into an appropriate queue based on a predetermined
policy. In the preferred embodiments, URL's having the same associated
host component will be enqueued into the same queue. However, other URL to
queue assignment policies could also be used. When a thread 130 is ready
to dequeue from one of the queues 128, the head URL in the queue assigned
to that thread is dequeued from that queue by the mux 124 and is passed to
the thread for processing.
Queue Elements with Extensible Set of Download History Data
FIG. 3 illustrates a queue element data structure 142, also called the URL
entry data structure, which is the data structure used to represent each
URL in the Frontier, represented in this embodiment by queues 128. Each
queue element 142 includes a URL value 144, and a list (i.e., an ordered
set) of information records 148. Each record 148 includes one or more
name/value pairs 149 for a particular download of the document
corresponding to the URL 144, where the names identify parameters and the
values are the corresponding values for those parameters. In addition to
the records 148, the queue element 142 may also include a header 146 for
retaining cumulative download history information, such as a count of the
number of downloads of the corresponding document by the web crawler, a
count of the number of download attempts, and the like. This information
could also be kept in the records, with increasing count values being
stored in successive records 148. The list of records associated with a
URL together comprise the URL's download history.
The set of queue element handling procedures 138 that can be used by the
web crawler, and more particularly by document processing applications 141
which process the pages downloaded by the web crawler, include but are not
limited to the following:
Size( ) returns the number of records in the list, for the currently
selected queue element;
Get(i) returns the record at position i in the list;
Delete(i) removes the record at position i from the list; compacting the
list accordingly;
Add(record) inserts the given record at the front of the list; as well as
procedures that operate on a particular record, including:
Lookup(name) returns the value from the name/value pair, if a matching pair
is found;
Set(name, value) adds a name/value pair to the record consisting of the
given name and given value, and replaces any previous pair with the
identical name;
Delete(name) removes the name/value pair with the given name from the
record, if a matching pair is found; and
Enumerate( ) returns a list of the name/value pairs in the record.
As will be described in more detail below, when a queue element is removed
from the Frontier, a new empty record is added to its download history,
representing the imminent download attempt. The document identified by the
queue element's URL is downloaded and processed. During the course of
processing a document, all records of the corresponding queue element's
download history may be inspected, and name/value pairs may be set in the
element's newly added record.
In the case of a continuous crawl, the queue element is reinserted into the
Frontier. Before the queue element is reinserted, one or more of its
records maybe removed. If no records are removed, the document's complete
download history is kept. Other alternatives include, but are not limited
to: keeping the "p" most recent records; keeping a uniform sample of
records (e.g., for every third download); keeping a random sample of
records (e.g., each record might be kept with a probabi | | |