BACKGROUND OF THE INVENTION
The invention relates to capturing hypertext web pages for convenient
viewing.
The World Wide Web ("the web") of the Internet has become in recent years a
popular means of publishing documentary information. In particular, it is
now common for users with access to the web to browse through collections
of linked documents through the use of hypertext browsers, such as
Netscape Navigator™ or Microsoft Internet Explorer™, whereby
selection by the user of certain screen objects in a displayed document
causes the contents of another document to be retrieved and displayed to
the user.
Many of the documents on the web are encoded using a markup language known
as the Hypertext Markup Language (HTML). HTML Version 3.2 with Frame
Extensions is described in Graham, HTML Sourcebook, Third Edition,
published by Wiley Computer Publishing, 1997. A markup language is a set
of codes or tags which can be embedded within a document to describe how
it should be displayed on a display device, such as a video screen or a
printer. HTML is what is known as a "semantic" markup language. This means
that, while it is possible to use HTML to dictate certain physical
characteristics of a document (such as line spacing or font size), many
HTML tags merely identify the logical features of the document, such as
titles, paragraphs, lists, tables, and the like. The precise manner in
which these logical features are displayed is then left to the browser
software to determine at the time the document is displayed.
Because HTML tags often do not specify a fixed physical size of a document
or its components, the precise appearance of a particular document
displayed by a browser will often depend on the size of the browser window
in which it is displayed. For example, FIGS. 1 and 2 show two views of the
home web page of the US Patent and Trademark Office (specified by Uniform
Resource Locator (URL) http://www.uspto.gov/ in September of 1997). In
FIG. 2, the web browser window is significantly smaller than that in FIG.
1 and, as can be seen, the web page as seen through the two windows
differs in its overall appearance, for example with respect to the width
of the title 30 and list element 40.
One important feature of HTML is the ability, within an HTML document, to
refer to external data resources. One way that such references are used
within HTML is to identify auxiliary documents which are sources of
content to be displayed as part of the display of the HTML document. For
example, the HTML tag "IMG" specifies that the contents of a specified
image document should be displayed within a portion of the display of the
HTML document in which the IMG tag is found. Similarly, the tag "FRAME"
within an HTML document specifies that the content of a specified document
should be displayed within a particular frame of a frame set defined by
the HTML document. (The use of frames and frame sets within HTML is
explained in more detail below).
HTML also features the ability to have a hypertext link within an HTML
document. A hypertext link within an HTML document creates an association
between a screen object (e.g., a word or an image) and an external
resource. When the HTML document is displayed by a browser, a user may
select the screen object, and the browser will respond by retrieving and
displaying content from the external resource. A hypertext link may be
specified within an HTML document with, for example, the HTML anchor tag
with an HREF attribute.
The use of such external references within HTML facilitates distributed
document storage on a wide area network (WAN). A large document may be
broken up and stored as a set of smaller documents logically associated by
external references. For example, it is common for the graphical images in
an HTML document to be stored as separate documents (e.g., in the GIF or
JPEG format). It is also common to store sections of a large text as
separate documents, and to facilitate easy movement from one section to
another through the use of hypertext links.
In addition, a set of pre-existing documents may be linked together with
HTML tags to form a coherent whole. For example, an HTML document may be
created containing hypertext links to a set of pre-existing documents
relating to a common subject, thus facilitating the systematic review of
such documents by a user.
A characteristic of HTML documents is that they are not paginated. That is,
the displayed "height" of an HTML document is determined solely by the
amount and arrangement of the screen objects defined within it, as
displayed by the browser used to view it, and not by any fixed page size
associated with the document. (Here "page size" does not necessarily refer
to physical pages printed on paper, for example, but is simply a
characteristic of an electronic document in which the content of the
document is divided into a sequence of regions with fixed dimensions.) If
the displayed document does not fit within the height of the browser
window, the browser permits scrolling of the web page to permit additional
content to be viewed. FIG. 3 shows the home web page of the US Patent and
Trademark Office displayed within the same browser window as in FIG. 2,
except that the page has been scrolled somewhat to reveal additional
material.
A recent extension to HTML permits multiple scrollable and resizable
"frames" to be displayed within a single browser window. A frame is
defined by a special type of HTML document known as a "frame set". A frame
set provides information giving the size and orientation of frames in a
window, and specifies the contents of each frame. The contents of a frame
may be either the contents of an HTML document, or a subsidiary frame set
(i.e., a frame set, the entire contents of which appear within a single
frame of the larger frame set). As with other HTML screen objects, the
height or width of a frame may be specified in absolute or relative terms.
FIGS. 4, 5 and 6 illustrate the operation of frames in HTML. FIG. 4 shows a
browser window displaying a frame set containing two frames. Frame 50 is a
narrow vertical column on the left hand side of the screen. Frame 55 is a
wider column to the right of frame 50. Frame 50 contains an HTML document
which is as long as the browser window is high, while frame 55 contains a
document which is longer than the browser window's height. As can be seen
in FIG. 5, frame 55 can be scrolled independently of frame 50 to display
the remainder of the HTML document contained within it.
In the above example, frame 50 is defined to have a fixed width of 115
pixels, whereas the width of frame 55 is defined relative to the width of
frame 50--its width is set equal to the browser window's width, less the
115 pixels used by frame 50. As can be seen in FIG. 6, when the browser
window is made smaller, frame 55 shrinks accordingly, while frame 50
remains at a fixed width.
As explained above, the ultimate appearance of an HTML document being
displayed by a browser will usually depend on the size of the browser
window (or frame) in which it is to be displayed. In general, a web
browser will extract from an HTML document a series of screen objects
(e.g., words, images, lists, frames or tables), and place them
sequentially in rows on the screen. When a row has been filled, the next
object is placed in a successive row. This process continues until all
screen objects within the HTML document have been placed.
This general principle, however, is limited by the constraint that the
width of the displayed HTML document cannot be narrower than the minimum
width of the widest screen object contained within it. Under this
constraint, if the minimum width of a screen object is wider than the
width of the browser window, parts of the document will remain off screen
(to the left or right) when viewed through the browser window, and a
horizontal scroll bar will typically be displayed to permit the user to
shift views of the document to the left or right.
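By way of illustration only, the row-filling behavior and the minimum-width constraint described above can be sketched in Python as follows. The sketch assumes that every screen object has been reduced to a single fixed width in pixels; the function name layout_rows and the example widths are hypothetical and do not come from any particular browser.

```python
from typing import List, Tuple


def layout_rows(widths: List[int], window_width: int) -> Tuple[List[List[int]], int]:
    """Greedily place objects left to right, starting a new row when full.

    `widths` holds the pixel width of each screen object (an assumption made
    for illustration). Returns the rows and the displayed document width.
    """
    rows: List[List[int]] = []
    current: List[int] = []
    used = 0
    for w in widths:
        # Start a new row when the object does not fit in the space left.
        # An object wider than the window still gets a row of its own.
        if current and used + w > window_width:
            rows.append(current)
            current, used = [], 0
        current.append(w)
        used += w
    if current:
        rows.append(current)
    # The document cannot be narrower than its widest screen object, so an
    # oversized object widens the document beyond the window.
    document_width = max([window_width] + widths)
    return rows, document_width
```

When document_width exceeds window_width, part of the document lies off screen, which corresponds to the case in which a horizontal scroll bar is displayed.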
HTML screen objects may have either a fixed or a variable width. For
example, the width of a single word of text in an HTML document is fixed
(given the font chosen by the browser in which to display it). Its width
is determined by the characters in the word and the size of the font in which
they will be displayed. Similarly, the width of a cell in an HTML table
may be made fixed by explicitly specifying its width as a certain number
of pixels.
By contrast, the width of a variable width screen object will vary,
depending on the width of the browser window in which it appears. However,
even a variable width screen object will have a minimum width. For
example, the width of a paragraph of text will generally vary according to
the size of the browser window; however, it can be no narrower than the
widest word contained within the paragraph. Similarly, a table containing
images may have cells whose widths are defined in relative terms, but the
table nonetheless cannot be narrower than the sum of the widths of the
images within its widest row.
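The two minimum-width rules just described may be sketched as follows. This is an illustrative Python sketch only: the helper names paragraph_min_width and table_min_width are hypothetical, and the caller is assumed to supply a word_width function that measures a word in the font chosen by the browser.

```python
from typing import Callable, List


def paragraph_min_width(text: str, word_width: Callable[[str], int]) -> int:
    """A paragraph can be no narrower than its widest word."""
    return max((word_width(word) for word in text.split()), default=0)


def table_min_width(rows: List[List[int]]) -> int:
    """A table can be no narrower than its widest row, where each row's
    minimum width is the sum of the minimum widths of its cells."""
    return max((sum(cells) for cells in rows), default=0)
```

For example, with an assumed fixed-pitch font of 8 pixels per character, the heading "U.S. Patent and Trademark Office" has a minimum width of 72 pixels, set by the nine-character word "Trademark".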
This constraint is illustrated in FIGS. 7, 8, 9 and 10. In each of FIGS. 7,
8 and 9, an identical HTML document is displayed in a browser window 65.
An excerpt of the underlying HTML code is shown in FIG. 10. Referring to
FIGS. 7 and 10, the document being displayed includes a table 80 having
two cells aligned to the top, one cell 85 containing a client-side image
map and the other cell 90 containing the heading "U.S. Patent and
Trademark Office", a horizontal line, and an unordered list with the
heading "New on the PTO site:". In FIG. 8, the window 65 is narrower than
in FIG. 7, but wider than the minimum width of any object on the screen.
Therefore, each line of the document is adjusted to be as wide as the
window 65 and nothing is hidden from the user to the right of the browser
window. By contrast, in FIG. 9, window 65 is narrower than the minimum
width of table 80, since the fixed width of the image map in cell 85 plus
the width of the widest word in cell 90 (the word "trademark") is greater
than the width of the browser window 65. Therefore, the resulting display
width of the document is wider than window 65, resulting in the rightmost
part of the document being hidden from view.
While collections of visual display data on the web are typically stored as
sets of linked HTML documents, it is also common and convenient for visual
display data to be stored as a single document, having a fixed page size,
using a physical markup language such as the portable document format
(PDF). PDF is described in the publication Adobe Systems, Inc., Portable
Document Format Reference Manual, Addison-Wesley Publishing Co., 1993.
SUMMARY OF THE INVENTION
In general, in one aspect, the invention features a method for converting a
semantic markup representation of a document into a physical markup
representation of the document. The method includes calculating a logical
minimum width equal to the minimum width required to display all screen
objects within the document at their normal size, creating a physical
markup representation of the document, the physical markup representation
having a width at least as wide as the logical minimum width, and
conforming the physical markup representation to a target size, including
a target width, such that conforming the physical markup representation
includes scaling the width of the physical markup representation by a
scaling factor derived from the ratio of an element of the target size to
the logical minimum width. Preferred embodiments of the invention include
one or more of the following features. The physical markup representation
is incorporated into a newly created document. The physical markup
representation is incorporated into an existing document. The element of
the target size is the target width. The physical markup representation is
a paginated representation including pages each having a respective
physical width and a respective physical height. The target size includes
a target height. The target size is a standard paper size. The standard
paper size is one of 8.5×11 inches, 8.5×14 inches, A4, A5, and
11×17 inches. The pages of the physical markup representation have
the same aspect ratio as the target size. The height of the physical
markup representation is scaled by the scaling factor. The page height of
the physical markup representation is scaled by the scaling factor. The
element of the target size is the target height. The pages of the physical
markup representation are rotated by plus or minus 90°. The ratio of the
target width to the logical minimum width is tested to determine whether
it is less than a specified threshold. The document is a frame set specifying a
plurality of frames. The document contains at least one hypertext link,
the physical markup representation is displayed in a viewer, and an
external document is accessed when a hypertext link is selected by a user
from the displayed markup. The hypertext link is a server-side image map.
The semantic markup representation is HTML. The physical markup
representation is PDF. After the physical markup representation is
conformed to the target size, the physical markup representation is scaled
by the inverse of the scaling factor and the result is displayed in a viewer.
In general, in another aspect, the invention features a method for
displaying hypertext data. The method includes displaying in a viewer a
first document represented in a physical markup representation and
containing at least one hypertext link, accessing an external document
when a hypertext link is selected by a user from the displayed first
document, converting the semantic markup representation of the external
document into a physical markup representation, and incorporating the
physical markup representation of the external document into the first
document. Preferred embodiments of the invention include one or more of
the following features. A hypertext link is modified to point to the
physical markup representation of the external document. The original
state of the hypertext link is saved. In response to an action deleting a
portion of the first document, a hypertext link which pointed to the
deleted portion is restored to its original state. The external document
is digested to create a digest of the external document, and the digest of
the external document is tested to determine whether the physical markup
representation of the external document has already been incorporated into
the first document. The external document comprises a primary document and
one or more auxiliary documents. Each auxiliary document is digested to
create a respective auxiliary document digest, and the digital digest of
each auxiliary document is tested to determine whether the physical markup
representation of the external document has already been incorporated into
the first document. The digital digest is a compound digest.
In general, in another aspect, the invention features a method for creating
a distinguishing identifier of a collection of data comprising a primary
document and one or more auxiliary documents. The method includes
digesting each auxiliary document to create a respective auxiliary
document digest and creating a distinguishing identifier by digesting a
concatenation of the primary document with all auxiliary document digests.
Preferred embodiments of the invention include one or more of the
following features. A digital digest algorithm is applied. The digital
digest algorithm is the MD5 Message Digest Algorithm.
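A minimal Python sketch of such a distinguishing identifier is given below, using the MD5 implementation in the standard hashlib module. The function name compound_digest is hypothetical, and the sketch assumes that the auxiliary documents are supplied in a stable order (for example, the order in which the primary document refers to them), since a different order yields a different identifier.

```python
import hashlib
from typing import Iterable


def compound_digest(primary: bytes, auxiliaries: Iterable[bytes]) -> str:
    """Digest the primary document concatenated with all auxiliary digests."""
    # Digest each auxiliary document on its own first.
    aux_digests = [hashlib.md5(aux).digest() for aux in auxiliaries]
    # Then digest the primary document followed by those digests.
    return hashlib.md5(primary + b"".join(aux_digests)).hexdigest()
```

Because each auxiliary document contributes only its 16-byte digest, a change to any image or frame changes the identifier even when the primary document is unchanged.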
In general, in another aspect, the invention features a method for
retrieving documents transitively linked to an initial document on a
hierarchical file system. The method includes retrieving the initial
document and retrieving only those other documents for which there is a
transitive link from the initial document to the other document and for
which the transitive link includes documents which are all within the same
directory path as the initial document. Preferred embodiments of the
invention include one or more of the following features. The hierarchical
file system is distributed on a network. The hierarchical file system is
distributed on an internet.
In general, in another aspect, the invention features a computer program,
residing on a computer-readable medium, for converting a semantic markup
representation of a document into a physical markup representation of the
document, having instructions for causing a computer to calculate a
logical minimum width equal to the minimum width required to display all
screen objects within the document at their normal size, create a physical
markup representation of the document, the physical markup representation
having a width at least as wide as the logical minimum width, and conform
the physical markup representation to a target size, including a target
width, the instructions for causing a computer to conform the physical
markup representation including instructions for causing a computer to
scale the width of the physical markup representation by a scaling factor
derived from the ratio of an element of the target size to the logical
minimum width. Preferred embodiments of the invention include one or more
of the following features. The program includes instructions for causing a
computer to incorporate the physical markup representation into a newly
created document. The program includes instructions for causing a computer
to incorporate the physical markup representation into an existing
document. The element of the target size is the target width. The physical
markup representation is a paginated representation including pages each
having a respective physical width and a respective physical height. The
target size includes a target height. The target size is a standard paper
size. The standard paper size is one of 8.5×11 inches, 8.5×14
inches, A4, A5, and 11×17 inches. The pages of the physical markup
representation have the same aspect ratio as the target size. The program
includes instructions for causing a computer to scale the height of the
physical markup representation by the scaling factor. The program includes
instructions for causing a computer to scale the page height of the
physical markup representation by the scaling factor. The element of the
target size is the target height. The program includes instructions for
causing a computer to rotate the pages of the physical markup
representation by plus or minus 90°. The program includes
instructions for causing a computer to test whether the ratio of the
target width to the logical minimum width is less than a specified
threshold. The document is a frame set specifying a plurality of frames.
The document contains at least one hypertext link and the program includes
instructions for causing a computer to display the physical markup
representation in a viewer and access an external document when a
hypertext link is selected by a user from the displayed markup. The
hypertext link is a server-side image map. The semantic markup
representation is HTML. The physical markup representation is PDF. The
program includes instructions for causing a computer to, after conforming
the physical markup representation to the target size, scale the physical
markup representation by the inverse of the scaling factor and display the
result in a viewer. The program includes instructions for causing a
computer to display in a viewer a first document represented in a physical
markup representation and containing at least one hypertext link, access an
external document when a hypertext link is selected by a user from the
displayed first document, convert the semantic markup representation of the
external document into a physical markup representation, and incorporate
the physical markup representation of the external document into the first
document. The program includes instructions for causing a computer to
modify a hypertext link to point to the physical markup representation of
the external document. The program includes instructions for causing a
computer to save the original state of the hypertext link. The program
includes instructions for causing a computer to, in response to an action
deleting a portion of the first document, restore a hypertext link which
pointed to the deleted portion to its original state. The program includes
instructions for causing a computer to digest the external document to
create a digest of the
external document, and test the digest of the external document to
determine whether the physical markup representation of the external
document has already been incorporated into the first document. The
external document comprises a primary document and one or more auxiliary
documents. The program includes instructions for causing a computer to
digest each auxiliary document to create a respective auxiliary document
digest and test the digital digest of each auxiliary document to determine
whether the physical markup representation of the external document has
already been incorporated into the first document. The digital digest is a
compound digest.
In general, in another aspect, the invention features a computer program,
residing on a computer readable medium, for creating a distinguishing
identifier of a collection of data comprising a primary document and one
or more auxiliary documents having instructions for causing a computer to
digest each auxiliary document to create a respective auxiliary document
digest and create a distinguishing identifier by digesting a concatenation
of the primary document with all auxiliary document digests. Preferred
embodiments of the invention include one or more of the following
features. The program includes instructions for causing a computer to
apply a digital digest algorithm. The digital digest algorithm is the MD5
Message Digest Algorithm.
In general, in another aspect, the invention features a computer program,
residing on a computer readable medium, for retrieving documents
transitively linked to an initial document on a hierarchical file system,
having instructions for causing a computer to retrieve the initial
document and retrieve only those other documents for which there is a
transitive link from the initial document to the other document and for
which the transitive link includes documents which are all within the same
directory path as the initial document. Preferred embodiments of the
invention include one or more of the following features. The hierarchical
file system is distributed on a network. The hierarchical file system is
distributed on an internet.
Among the advantages of the invention are one or more of the following. Web
pages written in a semantic markup language, such as HTML, can be
integrated into a single paginated document described in a physical markup
language, such as PDF. Web pages can be converted to a format having fixed
page dimensions, without losing information because of space constraints.
A virtually unique single identifier can be created for a primary document
and associated auxiliary documents. All of the documents which are linked
to a document and also in the same directory path can be retrieved from a
file system.
Other features and advantages of the invention will become apparent from
the following description and from the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a view of a web page displayed in a conventional web browser.
FIG. 2 is a view of a web page displayed in a conventional web browser.
FIG. 3 is a view of a web page displayed in a conventional web browser.
FIG. 4 is a view of a web page containing frames in a conventional web
browser.
FIG. 5 is a view of a web page containing frames in a conventional web
browser.
FIG. 6 is a view of a web page containing frames in a conventional web
browser.
FIG. 7 is a view of a web page displayed in a conventional web browser.
FIG. 8 is a view of a web page displayed in a conventional web browser.
FIG. 9 is a view of a web page displayed in a conventional web browser.
FIG. 10 shows a portion of the underlying HTML code for the web page
displayed in FIGS. 7-9.
FIG. 11 is a block diagram of a computer system programmed in accordance
with the present invention.
FIGS. 12, 12a and 12b are a flowchart of a method of incorporating web
pages into a single paginated document.
FIG. 13 is a flowchart showing steps of a routine FetchAndIncorporate.
FIG. 14 is a flowchart showing steps of a routine FetchDoc.
FIG. 15 is a flowchart showing steps of a routine ConvertToPDF.
FIG. 16 shows the logical relationship between a LayoutRegion and content
of an associated PDF document.
FIGS. 17, 17a and 17b are a flowchart showing steps taken by a routine
LayoutElement.
FIG. 18 is a view of a web page displayed in a conventional web browser.
FIG. 19 is a view of a web page displayed in a conventional web browser.
FIG. 20 shows a PDF page produced by the present invention.
FIG. 21 shows PDF pages produced by the present invention.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
Referring to FIG. 11, a user computer 100 running client software is
connected over a communications link 102 to web servers, such as web
server 140. Web servers are linked (statically or dynamically) to data
stores, such as data store 142, containing web pages, such as page 144.
The client software (which may include one or more separate programs, as
well as plug-in modules and operating system extensions) typically
displays information on a display device such as a monitor 104 and
receives user input from a keyboard (not shown) and a cursor positioning
device such as a mouse 106. The computer 100 is generally programmed so
that movement by a user of the mouse 106 results in corresponding movement
of a displayed cursor graphic on the display 104.
The programming of computer 100 includes an interface 108 that receives
position information from the mouse 106 and provides it to applications
programs running on computer 100. Among such applications programs are a
web browser 110, and a PDF viewer 120. Also running on computer 100 is a
web page integrator 135, which may be part of the PDF viewer 120. In
response to a request from the user, the PDF viewer may request the web
page integrator 135 to retrieve, from one or more web servers (such as web
server 140), an initial document specified by a URL supplied by the user,
and other documents which are linked, directly or indirectly, to the
initial document. When the requested documents are retrieved, the web page
integrator integrates them into a single PDF document, which is then
displayed by the PDF viewer 120.
The PDF document which is displayed by the PDF viewer may have hypertext
links to web pages, as well as to internal pages within the PDF document.
When the user selects a hypertext link in the PDF document, e.g. with the
mouse, if the link is to a page within the PDF document, that page is
displayed by the PDF viewer. However, if the hypertext link is to a web
page, that page is either displayed by the browser, or integrated into the
PDF document and displayed by the PDF viewer, depending on a mode set by
the user.
FIGS. 12, 12a and 12b are a flowchart of a method of incorporating web
pages into a single paginated document, which will be described as
implemented in a programmed computer system. First, the system queries the
user to provide the name of an existing PDF document, or a URL along with
web traversal criteria (step 200). If the user provides the name of a PDF
document, the document becomes the "target document" (step 210). The
target document is displayed in the PDF viewer and user input is awaited
(step 220). If the user provides a URL with web traversal criteria, then a
new, empty, PDF document is created. This document becomes the target
document. Parameters of the target document are set which specify a target
width and a target height of pages within the document (collectively the
"target size" of the document), according to either a default value or
input from the user. Then, the routine FetchAndIncorporate is called,
which incorporates a starting document specified by the URL, as well as
other documents which are linked to the starting document and which
satisfy the web traversal criteria, into the target document (step 230).
The target document is then displayed by the PDF viewer and the system
waits for user input (step 220).
The pages of the target document are normally displayed in their target
size, i.e. the size of the pages as specified in their PDF encoding. Upon
request of the user, however, the pages may be displayed in their "natural
size." By the "natural size" of a page we mean a size having the same
aspect ratio as the target size, but having a width equal to the greater
of the target width and the minimum width required to display in a browser
the web page from which the page was incorporated.
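The relationship between the target size and the natural size can be illustrated with the following Python sketch. The function names natural_size and scaling_factor are hypothetical. The sketch further assumes that the scaling factor is simply the ratio of the target width to the natural width; the description states only that the factor is derived from such a ratio.

```python
from typing import Tuple


def natural_size(target_w: float, target_h: float, min_w: float) -> Tuple[float, float]:
    """Return (width, height) having the target aspect ratio and a width equal
    to the greater of the target width and the minimum display width."""
    width = max(target_w, min_w)
    height = width * target_h / target_w
    return width, height


def scaling_factor(target_w: float, min_w: float) -> float:
    """Assumed factor that maps the natural width onto the target width.

    It equals 1 when the web page already fits within the target width and is
    less than 1 when the page had to be reduced.
    """
    return target_w / max(target_w, min_w)
```

For example, a web page requiring at least 816 units of width, incorporated into pages with a target size of 612 by 792 units, has a natural size of 816 by 1056 units and a scaling factor of 0.75 under these assumptions.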
If the user selects a hypertext link (step 235), then, and referring now to
FIG. 12a, the link is examined to determine whether it points to a
document which has already been incorporated into the target document
(step 240), and if so, the page of the target document corresponding to
the previously incorporated document is displayed by the PDF viewer (step
250). Otherwise, the value of a user-settable flag Incorporate? is checked
(step 260) and one of the following steps is taken.
If the Incorporate? flag is FALSE, the URL specified by the hypertext link
is provided to a standard web browser program with instructions to display
the document corresponding to the URL (step 270).
If the Incorporate? flag is TRUE, FetchAndIncorporate is called with the
URL, and with web traversal criteria specifying that only the document
associated with the URL be retrieved (step 280). This results in the
creation of one or more pages in the target document corresponding to the
document specified by the URL. The first of these pages is then displayed
by the PDF viewer (step 290).
Referring again to FIG. 12, if the user requests submission of a form
contained within the target document (step 300), then, and referring to
FIG. 12a, the contents of the form are submitted to the appropriate server
(step 310). Any web document received from the server in response to the
form submission is either displayed in the web browser (step 330) or
incorporated into the target document by the procedure ConvertToPDF
(described in more detail below) and displayed by the PDF viewer (step
340), according to the value of the Incorporate? flag (step 320).
Referring again to FIG. 12, the following steps are taken if the user
selects a point on a server-side image map within the target document
(step 350). (A server-side image map is an image displayed in a browser
such that if the user selects any point within the image using a pointing
device such as a mouse, the coordinates of that point within the image are
submitted to a specified server, which responds by transmitting a document
back to the browser.) First, and referring now to FIG. 12b, the
coordinates selected by the user are divided by the value of a variable
ScalingFactor associated with the currently displayed page (step 360).
ScalingFactor indicates the amount, if any, by which the dimensions of the
original server-side image map were reduced in order to fit it on a page
within the target document. The resulting coordinate values are then
transmitted to the server (step 360), and, according to the value of the
Incorporate? flag (step 370), the document transmitted back by the server
is either displayed by the web browser (step 380), or is incorporated into
the target document and displayed by the PDF viewer (step 390).
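The coordinate conversion of step 360 may be sketched in Python as follows. The function name to_original_coords is hypothetical. The sketch assumes that the selected point is expressed relative to the origin of the scaled image and that the server expects integer pixel coordinates, so the results are rounded; neither assumption is stated in the description above.

```python
from typing import Tuple


def to_original_coords(x: float, y: float, scaling_factor: float) -> Tuple[int, int]:
    """Map a point selected on the scaled page back to the coordinates of the
    original server-side image map by dividing by the scaling factor."""
    if scaling_factor <= 0:
        raise ValueError("scaling_factor must be positive")
    # Rounding to whole pixels is an assumption made for this sketch.
    return round(x / scaling_factor), round(y / scaling_factor)
```

With a ScalingFactor of 0.75, a selection at (75, 30) on the displayed page corresponds to (100, 40) in the original image map.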
Referring again to FIG. 12, if the user requests deletion of a page from
the target document (step 400), then, and referring now to FIG. 12b, the
page is deleted (step 410), and all hypertext links within the document
which had pointed to that page are reset to be external links (step 420).
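One possible bookkeeping scheme for steps 410 and 420 is sketched below in Python. The Link class, its field names, and the delete_page function are hypothetical, and pages are assumed to carry stable identifiers rather than positional page numbers. Each link retains its original URL, so resetting it to an external link requires only clearing the internal target.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Link:
    original_url: str                    # saved original (external) target
    internal_page: Optional[int] = None  # page id once the target is incorporated


def delete_page(pages: List[int], links: List[Link], page_id: int) -> None:
    """Delete a page and reset every link that pointed to it."""
    pages.remove(page_id)
    for link in links:
        if link.internal_page == page_id:
            # With no internal target, the link resolves to original_url again.
            link.internal_page = None
```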
When the user request has been processed, control returns to step 220,
where further requests from the user are awaited.
FIG. 13 is a flowchart showing the steps of the routine
FetchAndIncorporate, which retrieves a collection of documents linked from
a given URL into the target document. First, the URL is placed on a list
of pending URLs (step 500). Then, the list is checked to determine whether
any of the URLs on it is valid, according to criteria specified by the
user (step 510).
One web traversal criterion which may be specified by the user is a maximum
depth criterion. This criterion limits the depth of recursive calls to
FetchAndIncorporate, and thus limits the "link distance" between the
initially retrieved document and subsequently retrieved documents to be
incorporated into the target document.
Another criterion which may be specified by the user is a "stay on server"
criterion. When this criterion is set, only documents with URLs indicating
the same server as the initially retrieved document are retrieved.
Another criterion which may be set by the user is a "same path" criterion.
When this criterion is set, only documents with URLs indicating the same
file system directory (or subdirectories of that directory) as the
initially retrieved document are retrieved.
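The three web traversal criteria described above can be combined into a single validity test, sketched below in Python with the standard urllib.parse module. The function name is_valid_url, its parameters, and the example.com URLs are hypothetical. The sketch assumes that the starting document has depth 0 and that the "same path" criterion also implies the same server.

```python
import posixpath
from urllib.parse import urlsplit


def is_valid_url(url: str, start_url: str, depth: int, *, max_depth: int,
                 stay_on_server: bool = False, same_path: bool = False) -> bool:
    """Return True if `url` satisfies the selected traversal criteria."""
    candidate, start = urlsplit(url), urlsplit(start_url)
    same_server = candidate.netloc.lower() == start.netloc.lower()
    # Maximum depth: limit the link distance from the starting document.
    if depth > max_depth:
        return False
    # Stay on server: the candidate must name the same server.
    if stay_on_server and not same_server:
        return False
    # Same path: the candidate must lie in the starting document's directory
    # or in one of its subdirectories.
    if same_path:
        start_dir = posixpath.dirname(start.path).rstrip("/") + "/"
        if not same_server or not (candidate.path or "/").startswith(start_dir):
            return False
    return True
```

For a starting document at http://example.com/docs/index.html with the "same path" criterion set, http://example.com/docs/sub/page.html is valid while http://example.com/other/page.html is not.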
If there are valid URLs on the list, the document identified by the first
valid URL on this list is retrieved by calling the routine FetchDoc (step
520). FetchDoc returns either a set of pages from the target document, or
a document retrieved from a web server with zero or more associated
auxiliary documents. If FetchDoc returns pages from the target document
(step 530), this indicates that the requested document has already been
incorporated into the target document, and the routine continues at step
560.
If FetchDoc returns a document containing PDF pages from a web server,
those pages are appended to the end of the target document (step 540).
If FetchDoc returns a non-PDF document (possibly with associated auxiliary
documents) from a web server, the routine