FIELD OF THE INVENTION
The present invention generally pertains to locating documents via embedded
links on computer networks, and more specifically, to the use of uniform
resource locator (URL) hyperlinks in documents on the Internet and on
other types of networks.
BACKGROUND OF THE INVENTION
An on-line information system typically includes one or more computer
systems (the servers) that make information available so that other
computer systems (the clients) can access the information. Each server
manages access to the information, which can be structured as a set of
independent on-line services. A server and client communicate via messages
conforming to a communication protocol and sent over a communication
channel such as a computer network or a dial-up connection.
Typical uses for on-line services include document viewing, electronic
commerce, directory lookup, on-line classified advertisements, reference
services, electronic bulletin boards, document retrieval, electronic
publishing, keyword searching of documents, technical support for
products, and directories of on-line services. The on-line service may
make the information available free of charge, or for a fee, and may be on
publicly accessible or private computer systems.
Information sources managed by the server may include files, databases, and
applications on the server system or on an external computer system. The
information that the server provides may simply be stored on the server,
may be converted from other formats manually or automatically, may be
computed on the server in response to a client request, may be derived
from data and applications on the server or other machines, or may be
derived by any combination of these techniques.
The user of an on-line service typically uses a specialized computer
program, such as a browser, that is executed on the client system to
access the information managed by an on-line service. Possible user
capabilities include viewing, searching, downloading, printing, editing,
and filing the information managed by the server. The user may also price,
purchase, rent, or reserve services or goods offered through the on-line
service.
An exemplary on-line service for catalog shopping might work as follows. A
user running a program on a client system requests a connection to the
catalog shopping service using a service name that either is well known or
can be found in a directory. The request is received by the server
employed by the catalog shopping service, and the server returns an
introductory document that asks for an identifier and password. The client
program displays this document, the user fills in an identifier and
password that were assigned by the service in a previous visit, and the
information is sent to the server. The server verifies the identifier and
password against an authorization database, and returns a menu document
that is then presented to the user. Each time the user selects a menu
item, the selection is sent to the server, and the server responds with
the appropriate new page of information, possibly including item
descriptions or prices that are retrieved from a catalog database. By
selecting a series of menu items, the user navigates to the desired item
in the catalog and requests that the item be ordered. The server receives
the order request, and returns a form to be completed by the user to
provide information about shipping and billing. The user response is
returned to the server, and the server enters the order information into
an order database.
On-line services are available on the World Wide Web (WWW), which operates
over the global Internet. The Internet is a publicly accessible wide area
network (WAN) comprising a multitude of generally unrelated computer
networks that are interconnected. Similar services are available on
private networks called "Intranets" that may not be connected to the
Internet, and through local area networks (LANs). The WWW and similar
private architectures provide a "web" of interconnected document objects.
On the WWW, these document objects are located at various sites on the
global Internet. A more complete description of the WWW is provided in
"The World-Wide Web," by T. Berners-Lee, R. Cailliau, A. Luotonen, H. F.
Nielsen, and A. Secret, Communications of the ACM, 37 (8), pp. 76-82,
August 1994, and in "World Wide Web: The Information Universe," by
Berners-Lee, T., et al., in Electronic Networking: Research, Applications
and Policy, Vol. 1, No. 2, Meckler, Westport, Conn., Spring 1992.
Among the types of document objects in an on-line service are documents and
scripts. Documents that are published on the WWW are written in the
Hypertext Markup Language (HTML). This language is described in HyperText
Markup Language Specification--2.0, by T. Berners-Lee and D. Connolly, RFC
1866, proposed standard, November 1995, and in "World Wide Web & HTML," by
Douglas C. McArthur, in Dr. Dobb's Journal, December 1994, pp. 18-20, 22,
24, 26 and 86. Many companies also are developing their own enhancements
to HTML. HTML documents are generally static, that is, their contents do
not change over time unless modified by a service or web site developer.
HTML documents can be created using programs specifically designed for
that purpose, such as Microsoft Corporation's FRONTPAGE.TM. Web Page
publishing program, by editing a text file, or by executing a script file.
The HTML language is used for writing hypertext documents, which are more
formally referred to as Standard Generalized Markup Language (SGML)
documents that conform to a particular Document Type Definition (DTD). An
HTML document includes a hierarchical set of markup elements; most
elements have a start tag, followed by content, followed by an end tag.
The content is a combination of text and nested markup elements. Tags,
which are enclosed in angle brackets (`<` and `>`), indicate how the
document is structured and how to display the document, as well as
destinations and labels for hypertext links. There are tags for markup
elements such as titles and headers, text attributes such as bold and
italic, lists, paragraph boundaries, links to other documents or other
parts of the same document, in-line graphic images, and for many other
features.
The following lines of HTML briefly illustrate how the language is used:
Some words are <B>bold</B>, others are <I>italic</I>.
Here we start a new paragraph.<P>Here's a link to the
<A HREF="http://www.microsoft.com">Microsoft Corporation</A> homepage.
This sample document is a hypertext document because it contains a
hypertext "link" (hyperlink) to another document, in the line that
includes "HREF=." The format of this link is described below. A hypertext
document may also have a link to other parts of the same document. Linked
documents may generally be located anywhere on the Internet. When a user
is viewing the document using a client program called a Web browser
(described below), the links are displayed as highlighted words or
phrases. For example, using a Web browser, the sample document above might
be displayed on the user's screen as follows:
Some words are bold, others are italic. Here we start a new paragraph.
Here's a link to the Microsoft Corporation homepage.
In the Web browser, the link may be selected, for example, by clicking on
the highlighted area with a mouse. Typically, the screen cursor noticeably
changes (shape and/or color) when positioned on a hypertext link.
Selecting a link will cause the associated document to be displayed. Thus,
clicking on the highlighted text "Microsoft Corporation" would fetch and
display the associated homepage for that entity.
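By way of illustration only, the following Python sketch shows how a client program might locate the hypertext links in the sample document above, using the standard html.parser module. The class name LinkCollector and the variable SAMPLE are assumptions introduced for this example and do not describe how any particular Web browser is implemented.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (destination URL, label text) pairs from anchor elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None    # destination of the anchor being read, if any
        self._label = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._label = []

    def handle_data(self, data):
        if self._href is not None:
            self._label.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the label text.
            label = " ".join("".join(self._label).split())
            self.links.append((self._href, label))
            self._href = None

# A copy of the sample document shown above.
SAMPLE = (
    "Some words are <B>bold</B>, others are <I>italic</I>. "
    "Here we start a new paragraph.<P>Here's a link to the "
    '<A HREF="http://www.microsoft.com">Microsoft Corporation </A>homepage.'
)

collector = LinkCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.links)
```

The collected pair holds the destination that would be fetched when the link is selected and the words that would be displayed as highlighted text.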
The HTML language also provides a mechanism (the image or "IMG" element)
enabling an HTML document to include an image that is stored as a separate
file. When the end user views the HTML document with a browser program,
the included image is displayed to the user as part of the document, at
the point where the image element occurred in the document.
Another kind of document object in a web is a script. A script is an
executable program or a set of commands stored in a file that can be run
by a server program called a Web server (described below) to produce an
HTML document that is then returned to the Web browser. Typical script
actions include running library routines or other applications to fetch
information from a file or a database, or initiating a request to obtain
information from another machine, or retrieving a document corresponding
to a selected hypertext link. A script may be run on the Web server when,
for example, the end user selects a particular hypertext link in the Web
browser, or submits an HTML form request. Scripts are usually written by a
service developer in an interpreted language such as Basic, Practical
Extraction and Report Language (Perl), Tool Command Language (Tcl), or
one of the Unix operating system shell languages, but they also may be
written in more complex programming languages such as "C" and then
compiled to produce an executable program. Programming in Tcl is described
in more detail in Tcl and the Tk Toolkit, by John K. Ousterhout,
Addison-Wesley, Reading, Mass., USA, 1994. Perl is described in more
detail in Programming Perl, by Larry Wall and Randal L. Schwartz,
O'Reilly & Associates, Inc., Sebastopol, Calif., USA, 1992.
Each document object in a web has an identifier called a Universal Resource
Identifier (URI). These identifiers are described in more detail in T.
Berners-Lee, "Universal Resource Identifiers in WWW: A Unifying Syntax for
the Expression of Names and Addresses of Objects on the Network as used in
the World-Wide Web," RFC 1630, CERN, June 1994; and T. Berners-Lee, L.
Masinter, and M. McCahill, "Uniform Resource Locators (URL)," RFC 1738,
CERN, Xerox PARC, University of Minnesota, December 1994. A URI allows any
object on the Internet to be referred to by name or address, such as in a
link in an HTML document as shown above. There are two types of URIs:
Universal Resource Name (URN) and Uniform Resource Locator (URL). A URN
references an object by name within a given name space. The Internet
community has not yet defined the syntax of URNs. A URL references an
object by defining an access algorithm using network protocols. An example
of a URL is "http://www.microsoft.com". A URL has the syntax
"scheme://host:port/path?search" where
"scheme" identifies the access protocol (such as HTTP, FTP, or GOPHER);
"host" is the Internet domain name of the machine that supports the
protocol, and comprises the fully qualified domain name of a network host,
or its IP address as a set of four decimal digit groups separated by ".".
Fully qualified domain names take the form of a sequence of domain labels
separated by ".", each domain label starting and ending with an
alphanumerical character and possibly also containing "-" characters. The
rightmost domain label will never start with a digit, though, which
syntactically distinguishes all domain names from the IP addresses (See
Section 3.5 of RFC 1034 and Section 2.1 of RFC 1123).
"port" is the transmission control protocol (TCP) port number of the
appropriate server (if different from the default);
"path" is a scheme-specific identification of the object. It supplies the
details of how the specified resource can be accessed. Note that the "/"
between the host (or port) and the path is NOT part of the path; and
"search" contains optional parameters for querying the content of the
object.
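As a non-limiting illustration of the syntax described above, the following Python sketch separates a URL into its scheme, host, port, path, and search components using the standard urllib.parse module. The helper name split_url, the table DEFAULT_PORTS, and the example URL are assumptions made for this sketch only.

```python
from urllib.parse import urlsplit

# Default TCP ports for a few schemes (illustrative, not exhaustive).
DEFAULT_PORTS = {"http": 80, "ftp": 21, "gopher": 70}

def split_url(url):
    """Return the scheme, host, port, path, and search parts of a URL."""
    parts = urlsplit(url)
    # Fall back to the scheme's default port when none is given explicitly.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    # urlsplit keeps the separating "/" at the front of the path; drop it,
    # since that "/" is not considered part of the path.
    path = parts.path[1:] if parts.path.startswith("/") else parts.path
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": port,
        "path": path,
        "search": parts.query,
    }

components = split_url("http://www.example.com:8080/catalog/items?color=red")
print(components)
```

When the port is omitted, as in "http://www.microsoft.com", the sketch reports the default port for the scheme and an empty path.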
URLs are also used by web servers and browsers on private computer systems,
Intranets, or networks, and not just for the WWW.
The HTTP URL scheme is used to designate Internet resources that may be
accessed using HTTP. The HTTP URL has the syntax
"http://<host>:<port>/<path>?<searchpart>", where
<host> and <port> are as described above. If :<port> is
omitted, the port defaults to 80. No user name or password is allowed.
<path> is an HTTP selector, and <searchpart> is a query
string. The <path> is optional, as is the <searchpart> and its
preceding "?". If neither <path> nor <searchpart> is present,
the "/" may also be omitted. Within the <path> and
<searchpart> components, "/", ";", "?" are reserved. The "/"
character may be used within HTTP to designate a hierarchical structure.
There are generally two types of URLs that may be used in the hypertext
link: absolute URLs and relative URLs. An absolute URL includes a protocol
identifier, a machine name, and an optional HTTP port number. A relative
URL does not include a protocol identifier, machine name or port, and must
be interpreted relative to some known absolute URL called the base URL.
The base URL is used to determine the protocol identifier, machine name,
optional port, and base directory for a relative URL. For further
discussion of URL format and usage, see the document "Uniform Resource
Locators," Internet Request for Comments (RFC) 1738, by T. Berners-Lee, L.
Masinter, M. McCahill, University of Minnesota, December 1994. For further
discussions of relative URL format and usage, see "Relative Uniform
Resource Locators," RFC 1808, by R. Fielding, University of California,
Irvine, June 1995.
A hypertext link to an electronic document is specified by one of several
HTML elements. One of the parameters of an HTML element for a hypertext
link is the URL that serves as the identifier for the target of the link.
An HTML document may have a base element defining an absolute URL that
specifies the base URL for that document. If the document has no base
element, then the absolute URL of the document is used as the base URL.
The base element provides a base address for interpreting relative URLs
when the document is read out of context.
For example, FIG. 7A shows text with a document URL 200, a base element
202, a hypertext link with an absolute URL 204, and a hypertext link with
a relative URL 206, which is evaluated with respect to base element 202 to
produce a resulting URL 208. As an additional example, FIG. 7B shows text
with a document URL 210, no base element, a hypertext link with an
absolute URL 212, and a hypertext link with a relative URL 214, which is
evaluated with respect to document URL 210 to produce a resulting URL 216.
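The evaluation of a relative URL against a base URL can be sketched in Python with the standard urllib.parse.urljoin function. The helper name resolve_link and all URLs below are hypothetical; they are not the URLs shown in FIGS. 7A and 7B.

```python
from urllib.parse import urljoin

def resolve_link(document_url, href, base_element=None):
    """Resolve a link against the base element if present, else the document URL."""
    base_url = base_element if base_element is not None else document_url
    # urljoin returns absolute URLs unchanged and resolves relative ones.
    return urljoin(base_url, href)

DOCUMENT_URL = "http://www.example.com/docs/guide/index.html"

# Without a base element, the document's own URL supplies the base directory.
no_base = resolve_link(DOCUMENT_URL, "chapter2.html")

# With a base element, relative links are resolved against it instead.
with_base = resolve_link(DOCUMENT_URL, "chapter2.html",
                         base_element="http://mirror.example.org/pub/")

# An absolute URL is unaffected by the base.
absolute = resolve_link(DOCUMENT_URL, "http://www.microsoft.com")

print(no_base, with_base, absolute, sep="\n")
```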
A site at which documents are made available to network users is called a
"Web site" and must run a "Web server" program to provide access to the
documents. A Web server program is a computer program that allows a
computer on the network to make documents available to the rest of the WWW
or to a private network. The documents are often hypertext documents
written in the HTML language, but may be other types of documents that
include other types of objects as well, such as images, audio, and/or
video data. The information that is managed by the Web server includes
hypertext documents that are stored on the server or are dynamically
generated by scripts on the Web server. Several Web server software
packages exist, such as the Conseil Europeen pour la Recherche Nucleaire
(CERN, the European Laboratory for Particle Physics) server or the
National Center for Supercomputing Applications (NCSA) server. Web servers
have been implemented for several different platforms, including the Sun
SPARC II.TM. workstation running the Unix operating system, and personal
computers with the Intel PENTIUM.TM. processor running the Microsoft
MS-DOS.TM. operating system and the Microsoft WINDOWS.TM. graphical user
interface operating environment.
Web servers also use a standard interface for running external programs,
such as the Common Gateway Interface (CGI) or ISAPI. CGI is described in
more detail in How to Set Up and Maintain a Web Site, by Lincoln D. Stein,
Addison-Wesley, August 1995. A gateway is a program that handles incoming
information requests and returns the appropriate document or generates a
document dynamically. For example, a gateway might receive queries, look
up the answer in a database to provide a response, and translate the
response into a page of HTML so that the server can send the response to
the client. A gateway program may be written in a language such as "C" or
in a scripting language such as Perl or Tcl or one of the Unix operating
system shell languages. The CGI standard specifies how the script or
application receives input and parameters, and specifies how output should
be formatted and returned to the server.
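A gateway of the kind described above might be sketched in Python as follows. Under CGI, the server passes the query portion of the URL to the program in the QUERY_STRING environment variable, and the program writes a header block, an empty line, and then the document. The table PRICES stands in for a database, and the function name handle_request and the "item" parameter are assumptions made for this example.

```python
import html
import os
import sys
from urllib.parse import parse_qs

# Stand-in for a database; a real gateway would query an actual data source.
PRICES = {"tent": "129.00", "stove": "45.50"}

def handle_request(environ):
    """Build a CGI response (header, blank line, HTML body) for one request."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    item = query.get("item", [""])[0]
    price = PRICES.get(item)
    if price is None:
        body = "<P>No such item: %s</P>" % html.escape(item)
    else:
        body = "<P>%s costs $%s</P>" % (html.escape(item), price)
    # The Content-Type header tells the server and client what follows.
    return "Content-Type: text/html\r\n\r\n<HTML><BODY>%s</BODY></HTML>\n" % body

if __name__ == "__main__":
    # When run by a Web server as a CGI program, the response goes to stdout.
    sys.stdout.write(handle_request(os.environ))
```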
For security reasons, a Web server machine may limit access to files. To
control access to files on the Web server, the Web server program running
on the server machine may provide an extra layer of security above and
beyond the normal file system and login security procedures of the
operating system on the server machine. The Web server program may add
further security rules such as: (a) optionally requiring input of a user
name and password, completely independent of the normal user name and
passwords that the operating system may maintain on user accounts; (b)
allowing groups of users to be identified for security purposes,
independent of any user group definitions defined in the security
components of the operating system; (c) access control for each document
object such that only specified users (with optional passwords) or groups
of users are allowed access to an object, or so that access is only
allowed for clients at specific network addresses, or some combination of
these rules; (d) allowing access to the document objects only through a
specified subset of the possible HTTP methods; and (e) allowing some
document objects to be marked as HTML documents, others to be marked as
executable scripts that will generate HTML documents, and others to be
marked as other types of objects such as images. Access to the on-line
service document objects via a network file system would not conform to
the security features of the Web server program and would provide a way to
access documents outside of the security provided by the Web server. The
Web server program also typically maps document object names that are
known to the client to file names on the server file system. This mapping
may be arbitrarily complex, and any author or program that tries to access
documents on the Web server directly would need to understand this name
mapping.
A user (typically using a machine other than the machine used by the Web
server) who wishes to access documents available on the network at a Web
site must run a Web browser program. The combination of the Web server and
Web browser communicating using an HTTP protocol over a computer network
is referred to herein as a "web architecture." The Web browser program
allows the user to retrieve and display documents from Web servers. Some
of the popular Web browser programs are: NAVIGATOR.TM. browser from
Netscape Communications Corp., of Mountain View, Calif.; MOSAIC.TM.
browser from the National Center for Supercomputing Applications (NCSA);
WINWEB.TM. browser, from Microelectronics and Computer Technology Corp. of
Austin, Tex.; and Internet Explorer from Microsoft Corporation of Redmond,
Wash. Web browsers have been developed to run on different platforms,
including personal computers with the Intel Corporation PENTIUM.TM.
processor running Microsoft Corporation's MS-DOS.TM. operating system and
Microsoft Corporation's WINDOWS.TM. graphical user interface environment,
Apple Computer, Inc.'s MACINTOSH.TM. personal computers, and computers
running other operating systems, such as Linux.
The Web server and the Web browser communicate using the Hypertext Transfer
Protocol (HTTP) message protocol and the underlying transmission control
protocol/Internet protocol (TCP/IP) data transport protocol of the
Internet. HTTP is described in Hypertext Transfer Protocol--HTTP/1.0, by
T. Berners-Lee, R. T. Fielding, H. Frystyk Nielsen, Internet Draft
Document, Oct. 14, 1995. In HTTP, the Web browser establishes a connection
to a Web server and sends an HTTP request message to the server. In
response to an HTTP request message, the Web server checks for
authorization, performs any requested action, and returns an HTTP response
message containing an HTML document in accord with the requested action,
or an error message. The returned HTML document may simply be a file
stored on the Web server, or may be created dynamically using a script
called in response to the HTTP request message. For instance, to retrieve
a document, a Web browser may send an HTTP request message to the
indicated Web server, requesting a document by reference to the URL of the
document. The Web server then retrieves the document and returns it in an
HTTP response message to the Web browser. If the document has hypertext
links, then the user may again select one of those links to request that a
new document referenced by the selected link be retrieved and displayed.
As another example, a user may fill in a form requesting a database search.
In response, the Web browser will send an HTTP request message to the Web
server including the name of the database to be searched, the search
parameters, and the URL of the search script. The Web server calls a
search program, passing in the search parameters. The program examines the
parameters and attempts to answer the query, perhaps by sending the query
to a database interface. When the program receives the results of the
query, it constructs an HTML document that is returned to the Web server,
which then sends it to the Web browser in an HTTP response message.
Request messages in HTTP contain a "method name" indicating the type of
action to be performed by the server, a URL indicating a target object
(either document or script) on the Web server, and other control
information. Response messages contain a status line, server information,
and possible data content. The Multipurpose Internet Mail Extensions
(MIME) specification defines a standardized protocol for describing the
content of messages that are passed over a network. HTTP request and
response messages use MIME header lines to indicate the format of the
message. MIME is described in more detail in MIME (Multipurpose Internet
Mail Extensions): Mechanisms for Specifying and Describing the Format of
Internet Message Bodies, Internet RFC 1341, June 1992.
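The general shape of these messages can be illustrated with a short Python sketch that builds a request message and takes apart a response message. No network connection is made; the response text is a canned example, and the function names and the server name "ExampleServer/1.0" are assumptions introduced here.

```python
def build_request(method, path, headers=None):
    """Build an HTTP/1.0 request: request line, header lines, empty line."""
    lines = ["%s %s HTTP/1.0" % (method, path)]
    for name, value in (headers or {}).items():
        lines.append("%s: %s" % (name, value))
    return "\r\n".join(lines) + "\r\n\r\n"

def parse_response(raw):
    """Split a response into status code, reason phrase, headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body

request = build_request("GET", "/index.html", {"Accept": "text/html"})

sample_response = (
    "HTTP/1.0 200 OK\r\n"
    "Server: ExampleServer/1.0\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<HTML><BODY>Hello</BODY></HTML>"
)
code, reason, headers, body = parse_response(sample_response)
print(repr(request), code, reason, headers, body)
```

Here "GET" is the method name, "/index.html" identifies the target object, and the Content-Type header line indicates the format of the returned data.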
Internet users typically access web resources in one of three ways:
(1) by directly entering (e.g., typing in) the URL for the resource, such
as http://www.Microsoft.com; (2) through a reference in another document,
such as a hyperlink; or (3) through a separate storage of the link's URL,
such as a listing under a "Favorites" (or Bookmarked) menu item in a
browser, a folder view of the browser's history, or the results displayed
by an Internet search engine. These methods all work equally well as long
as the URL for the linked document or site does not change. Unfortunately,
changes in web pages and sites are very common, and URLs for sites and
documents are constantly being changed. When a hyperlink's URL no longer
points to its (previously) associated resource (e.g., a web page), the
hyperlink is said to be "broken." In such instances, the URL entry
provided by any of the foregoing methods will not locate the resource it
was previously mapped to unless there is some provision for forwarding the
user to the new URL. For instance, the author of a site can associate some
HTML code with the previous URL that automatically forwards a user
traversing the link to the new URL. Unfortunately, there is no facility
built into the Internet's URL referential addressing scheme that
automatically remaps the locations of web resources. As a result, it is
very common for users to receive a "Document/Page not Found" error when a
web page has been moved, and the prior URL is no longer valid.
Conventional web authoring tools only provide a partial solution to the
foregoing resource relocation problem. For example, Microsoft
Corporation's FRONTPAGE.TM. maintains lists of links within a currently
authored web site, and ensures that when pages are moved, the links to the
moved pages that are located in other pages within the same web site are
updated. For instance, if a FRONTPAGE.TM. user is authoring a web site and
moves one of the documents, all of the hyperlinks within the site are
automatically updated to map to the page's new location. However, this
does not address the other commonly encountered problems concerning broken
hyperlinks discussed above, such as when the web page is linked through an
external reference (i.e., external relative to the web site). In
particular, it would be advantageous to provide a scheme that
automatically updates broken URL references so that the resources
previously associated with the broken URLs can be more easily located.
SUMMARY OF THE INVENTION
The invention addresses many of the problems associated with changes in the
locations of resources stored on a site through a method for dealing with
broken hyperlinks to the resources that have been moved. It should be
noted that the term "moved" as used herein with regard to resources or
documents (both in the specification and in the claims that follow)
includes the renaming of such resources or documents, since renaming a
resource or document has the effect of changing its storage location. The
present invention addresses any change in the full path to a resource that
breaks a hyperlink to that resource and thus addresses a change in the
storage location of a resource or a change in the name of the resource.
The system and method are preferably implemented by a set of program
modules that comprise a Referential Preservation Engine (RPE). The RPE
program modules preferably are part of one or more application programs
that are used in a web page authoring environment.
According to a first aspect of the invention, the RPE implements a method
for maintaining the integrity of hyperlinks within a web site. The
hyperlinks reference the locations of resources such as web page documents
on external (remote) servers that can be accessed over a private wide area
network or a public wide area network such as the Internet. It is common
for resources to be moved within web sites when the sites are being
developed or as part of routine maintenance of the sites. An RPE running
on an external server tracks the movement of resources on that server and
saves changes in the locations of resources as redirection data. The
redirection data preferably include the previous and new location for each
of the moved resources. The RPE also tracks the usage of hyperlinks
employed to retrieve the moved resources, recording the addresses of web
page sites that follow hyperlinks to the external site. The redirection
data and hyperlink usage data are preferably stored as the redirection
data in files that are associated with the moved resources. When a
resource on an external server is moved, or on a periodic basis, the
external server sends the redirection data to the servers that have
referred the links based on the hyperlink usage data. An RPE running on
one of these referring servers can then update the hyperlinks in the
documents on that site to reflect the new location of the moved resources.
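The bookkeeping described in this first aspect might be modeled, in a much simplified form, by the following Python sketch. The class RedirectionTracker, the function update_links, the in-memory dictionaries, and all URLs are hypothetical; the sketch does not describe the actual storage format or message exchange used by the RPE.

```python
from collections import defaultdict

class RedirectionTracker:
    """Toy model of redirection data and hyperlink usage data on one server."""

    def __init__(self):
        self.redirects = {}                # previous location -> new location
        self.referrers = defaultdict(set)  # linked location -> referring sites

    def record_referral(self, location, referring_site):
        # Remember which site followed a hyperlink to this location.
        self.referrers[location].add(referring_site)

    def move(self, old_location, new_location):
        # Repoint earlier redirects that ended at the location being vacated.
        for previous, target in list(self.redirects.items()):
            if target == old_location:
                self.redirects[previous] = new_location
        self.redirects[old_location] = new_location
        # A resource moved back to an earlier location needs no redirect there.
        self.redirects.pop(new_location, None)

    def notifications(self):
        """List (referring_site, previous_location, new_location) messages."""
        return sorted(
            (site, previous, new)
            for previous, new in self.redirects.items()
            for site in self.referrers.get(previous, ())
        )

def update_links(document, redirects):
    """On a referring server: rewrite hyperlinks that use a previous location."""
    for previous, new in redirects.items():
        document = document.replace('HREF="%s"' % previous, 'HREF="%s"' % new)
    return document

tracker = RedirectionTracker()
tracker.record_referral("http://ext.example.com/a.htm", "http://ref.example.org")
tracker.move("http://ext.example.com/a.htm", "http://ext.example.com/docs/a.htm")
messages = tracker.notifications()

page = '<A HREF="http://ext.example.com/a.htm">A</A>'
updated = update_links(page, tracker.redirects)
print(messages, updated)
```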
According to a second aspect of the invention, the RPE provides a method
for updating URL references that are stored in browsers. A browser runs on
a client computer and typically contains a list of web sites or documents
that are marked as favorites by a user. These favorites are typically
stored as URL references that are mapped to the site or document that the
user wishes to mark. When these web sites are initially marked as
favorites, or optionally, when a user uses one of these favorite URLs to
visit a web site or page, the browser sends a message identifying the
client's address to the server where the favorite site or page is located.
Web sites that are running the RPE compile these messages, and store them
in a database. When resources are moved on these web sites, the URLs for
the resources typically must be changed. The RPE for the site tracks the
movement of the resources on that site and the associated changes to the
URLs and sends messages containing the new location of the moved resources
to the browsers in the client computers that have previously sent messages
to that server concerning use or storage of the URL that previously was
mapped to the moved resource. The browser in the client computer can then
update the URL reference for the favorite site or document based on this
information.
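A simplified sketch of this exchange is given below in Python. The class FavoritesRegistry models the server-side record of client messages, and apply_update models the browser updating its stored favorites. These names, the client identifier "client-17", and the URLs are assumptions for illustration, and the transport used to deliver the messages is deliberately left out.

```python
class FavoritesRegistry:
    """Server side: which client addresses reported storing or using each URL."""

    def __init__(self):
        self.clients_by_url = {}

    def register(self, url, client_address):
        self.clients_by_url.setdefault(url, set()).add(client_address)

    def resource_moved(self, old_url, new_url):
        """Return the update messages owed to clients that registered old_url."""
        clients = self.clients_by_url.pop(old_url, set())
        # Later moves of the same resource should reach the same clients.
        self.clients_by_url.setdefault(new_url, set()).update(clients)
        return [(client, old_url, new_url) for client in sorted(clients)]

def apply_update(favorites, old_url, new_url):
    """Browser side: return favorites with old_url replaced by new_url."""
    return {title: (new_url if url == old_url else url)
            for title, url in favorites.items()}

OLD = "http://www.example.com/fares.htm"
NEW = "http://www.example.com/air/fares.htm"

registry = FavoritesRegistry()
registry.register(OLD, "client-17")
messages = registry.resource_moved(OLD, NEW)

favorites = {"Fares": OLD, "Home": "http://www.example.com/"}
for _client, old_url, new_url in messages:
    favorites = apply_update(favorites, old_url, new_url)
print(messages, favorites)
```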
According to another aspect of the invention, the RPE provides a method for
maintaining a web site that comprises multiple web page documents that are
stored on a server. Each document has a content and an original URL
reference that is mapped to a location on the server where the document
is originally stored. As the web site is developed or maintained, various
documents are moved from their original locations to new locations or
deleted altogether. These movements and deletions are tracked by an RPE
running on the server. For each of the moved or deleted documents, the RPE
applies predefined rules to determine if tracking changes in the location
of the document is justified. If the document fails to meet these
predefined rules, and if the document is moved within the site or deleted,
links to the document that are contained in the site's various other
documents are nevertheless updated, but redirection data for the document
are not maintained. Conversely, if the document meets the predefined
rules, a redirection page is created, if possible. The redirection page
preferably contains a URL stub with HTML code that redirects a browser to
the new location for the document when a user tries to access the document
with the document's original (and no longer valid) URL. The redirection
page may optionally display a message for a predetermined amount of time
indicating that a new URL for the link has been provided, and may also
include a hyperlink to the new location for the document. As with the
documents that fail to meet the predefined rules, links to documents that
do meet the predefined rules and have been moved or deleted are updated in
the site's various other documents.
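A redirection page of this general kind might be generated as in the following Python sketch. The text above does not specify the HTML mechanism; this sketch assumes the widely supported META "Refresh" element, which asks the browser to load a new URL after a delay. The function name redirection_page, the default delay, and the example URL are likewise assumptions.

```python
import html

def redirection_page(new_url, delay_seconds=5):
    """Return HTML that shows a notice, then forwards the browser to new_url."""
    safe_url = html.escape(new_url, quote=True)
    return (
        "<HTML><HEAD>\n"
        # ASSUMPTION: META Refresh is used to forward the browser after a delay.
        '<META HTTP-EQUIV="Refresh" CONTENT="%d; URL=%s">\n'
        "<TITLE>Page moved</TITLE>\n"
        "</HEAD><BODY>\n"
        # A visible hyperlink lets the user follow the new location manually.
        '<P>This page has moved to <A HREF="%s">%s</A>.</P>\n'
        "</BODY></HTML>\n"
    ) % (delay_seconds, safe_url, safe_url, safe_url)

page = redirection_page("http://www.example.com/air/fares.htm", delay_seconds=3)
print(page)
```

Such a page would be stored at the document's original location, so that a request using the original URL still returns something useful.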
The predefined rules may specify a minimum predefined number of times that
a page must have been visited, a predetermined minimum rate of users
accessing a document, whether the page has been marked by its author as
requiring redirection data, and whether the page has been marked by a
browser as a favorite.
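One possible way of applying such rules is sketched below in Python. The text above lists the kinds of rules but not how they are combined or what thresholds apply, so this sketch assumes that satisfying any single rule is sufficient and uses arbitrary threshold values. The function and parameter names are hypothetical.

```python
def should_keep_redirection(visits=0, visits_per_day=0.0,
                            author_requires_redirect=False,
                            marked_as_favorite=False,
                            min_visits=100, min_visits_per_day=5.0):
    """Decide whether redirection data should be maintained for a page.

    ASSUMPTION: any one rule being met is enough; the thresholds are
    placeholders chosen only for this example.
    """
    return (
        visits >= min_visits
        or visits_per_day >= min_visits_per_day
        or author_requires_redirect
        or marked_as_favorite
    )

print(should_keep_redirection(visits=250))            # frequently visited page
print(should_keep_redirection(visits=3))              # rarely visited page
print(should_keep_redirection(marked_as_favorite=True))
```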
BRIEF DESCRIPTION OF THE DRAWING FIGURES
The foregoing aspects and many of the attendant advantages of this
invention will become more readily appreciated as the same becomes better
understood by reference to the following detailed description, when taken
in conjunction with the accompanying drawings, wherein:
FIG. 1 is a flow chart illustrating the logical steps implemented by a
Referential Preservation Engine in accord with the present invention, when
a page on a web site is moved or deleted;
FIG. 2 shows a flow diagram for applying predefined rules to determine if
redirection data should be maintained for a document or web page;
FIG. 3A is a flow diagram illustrating the steps that the Referential
Preservation Engine executes when a user marks a URL as a favorite;
FIG. 3B is a flow diagram illustrating the steps the Referential
Preservation Engine executes when a user employs a favorite URL to reach a
web site or page;
FIG. 4 is a flow diagram illustrating the steps that the Referential
Preservation Engine executes when a user browses a URL under various
conditions;
FIG. 5 is a flow diagram illustrating the steps that the Referential
Preservation Engine executes when it fixes broken external hyperlinks;
FIG. 6 is a block diagram of a personal computer system for implementing
the present invention;
FIG. 7A is a sample HTML document with a base URL showing examples of a
hyperlink using a relative URL, and a hyperlink using an absolute URL;
FIG. 7B is a sample HTML document without a base URL, showing examples of a
hyperlink using a relative URL, and a hyperlink using an absolute URL;
FIG. 8 is a schematic diagram illustrating three web pages on an exemplary
web site;
FIG. 9A illustrates the file structure of the web site shown in FIG. 8;
FIG. 9B illustrates the URL structure of the web site shown in FIG. 8; and
FIG. 9C illustrates the file structure of the meta-data files that
correspond to various documents that comprise the web site shown in FIG. 8.
DESCRIPTION OF THE PREFERRED EMBODIMENT
The present invention enables the integrity of URL references on web sites
to be maintained to prevent broken links, where appropriate. The system
and method are preferably implemented by a set of program modules that
comprise a Referential Preservation Engine (RPE). The program modules
preferably are part of one or more application programs executed on a
personal computer and used in providing a web page authoring environment.
The following discussion pertains to the use of the RPE in Microsoft
Corporation's FRONTPAGE.TM. web page authoring program. It should be noted
that this is not meant to be limiting, as the RPE can likely be applied to
other web page authoring programs as well.
As discussed above, web sites on the Internet typically comprise multiple
HTML documents that are stored on a web server. The pages for a web site
are generally organized in a structured hierarchy based on content level.
For example, if a user clicks on a hyperlink to a travel agency site, such
as the "www.traveltickets.com" site shown in FIG. 8, that site's homepage
300 will be displayed. This page includes a company logo 301, and several
picture icons 302, 304, 306, 308 that correspond to various categories of
travel offerings with related pages available at the site. Adjacent to the
picture icons are text blocks 310, 312, 314, and 316 that are respectively
paired with a corresponding picture icon and its associated category. Not
visible are hyperlinks to each of the pages referenced by the picture
icons/text blocks. To simplify the following explanation, the reference
numbers for text blocks 310, 312, 314, and 316 will be assumed to also
refer to their associated hyperlinks.
Homepage 300 is at the top level of the content hierarchy for the web site
referenced by www.traveltickets.com. There is a "nested" page for each of
the travel categories that can be reached by either clicking one of the
picture icons or one of the text blocks, both of which are associated with
one of the hyperlinks. For instance, clicking on either icon 304 or text
block 312 will link the browser to a Cruises page 318, causing the Cruises
page to open in the browser. Cruises page 318, and the pages associated
with the other travel categories (e.g., an Air Travel page, a Trains page,
etc., none of which are shown) are all nested at a second level of the
content hierarchy. As with homepage 300, Cruises page 318 also contains
hyperlinks pointing to pages that are nested below it. These hyperlinks
are associated with picture icons and text block pairs, including an
"Alaska" icon/text block 320, a "Caribbean" icon/text block 322, a "Puerto
Rico" icon/text block 324, and a "Mexico" icon/text block 326. Each of
these icon/text block pairs and their associated hyperlinks can be used to
access specific pages at a third level of the content hierarchy. For
instance, clicking on "Caribbean" icon/text block 322 activates the
associated hyperlink that links the browser to a Caribbean Cruise page
328, which contains detailed information about a Caribbean cruise for
which the user can purchase tickets at the web site. There are similarly
nested detailed information pages for the other cruise destinations
(Alaska, Puerto Rico, Mexico--none shown), which can be accessed by the
user activating the respective hyperlink associated with the icon/text
block for that page.
Clicking on the "I Want to Go!" button 330 activates another hyperlink (not
visible) that loads a ticket reservation page in the browser. The ticket
reservation page (not shown) displays travel dates, accommodation options,
pricing information, payment information, etc. Since the same ticket
reservation page can be accessed from the other third level pages (e.g.,
from a Mexico Cruise page), the ticket reservation page is not nested
below the third level pages, but rather is located below the homepage on
the second level of the content hierarchy.
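The content hierarchy just described can be pictured as a small tree. The sketch below is illustrative only: the nested-dictionary representation, the page labels (some of which, such as "Alaska Cruise", are made up for pages the text says are not shown), and the helper `level_of` are assumptions, not part of the described system.

```python
# Content hierarchy of the example travel site: page name -> child pages.
# Page labels are illustrative; several of these pages are not shown in FIG. 8.
CONTENT_HIERARCHY = {
    "Homepage": {
        "Cruises": {
            "Alaska Cruise": {},
            "Caribbean Cruise": {},
            "Puerto Rico Cruise": {},
            "Mexico Cruise": {},
        },
        "Air Travel": {},
        "Trains": {},
        # Reachable from every third-level page, so it sits on the second level.
        "Ticket Reservation": {},
    }
}


def level_of(page, tree=CONTENT_HIERARCHY, depth=1):
    """Return the 1-based hierarchy level of `page`, or None if it is absent."""
    for name, children in tree.items():
        if name == page:
            return depth
        found = level_of(page, children, depth + 1)
        if found is not None:
            return found
    return None
```

Under these assumptions, `level_of("Caribbean Cruise")` yields the third level, while `level_of("Ticket Reservation")` yields the second, matching the placement explained in the text.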
Each of the pages (documents) on a web site is typically stored as an
individual HTML file on the web site's server. The HTML files are usually
stored in a file hierarchy that is similar in structure to the content
hierarchy. Such a file hierarchy is schematically shown in the block
diagram of FIG. 9A. All of the documents are stored either in a root
directory or folder, or subdirectories or subfolders thereof. For example,
the HTML files for the travel agency site are stored in a root folder 332
having a location on the server represented by the path
"H:\server\travel." The HTML homepage document for a
site is commonly stored on the web server in the root folder, and
generally has a special name such as "index.htm" or "default.htm" so that
the web server can identify it as the homepage. For instance, homepage
document 331 for the travel agency site is stored in root folder 332 as
"index.htm." The HTML documents that correspond to the nested web pages
are typically located in subdirectories (or subfolders) that are nested at
one or more levels below the root directory. For example, an "index.htm"
HTML document 333 used for displaying Cruises page 318 is stored in a
cruises subfolder 334 (i.e., stored on the server as
"H:\server\travel\cruises\index.htm"), as well as a "caribbean.htm" HTML
document 335, which is used to display the Caribbean Cruise page (and
stored on the server as "H:\server\travel\cruises\caribbean.htm"). There
are additional subfolders corresponding to the different
travel categories, including an air travel subfolder 336, a trains
subfolder 338, and a tours subfolder 340. Each of subfolders 334, 336,
338, and 340 contains one or more HTML documents corresponding to the
content hierarchy of the site. By storing the web pages in a hierarchy
that corresponds to the web site content, the web server can more easily
locate and cache web pages, thereby improving web site performance.
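As a small illustration of this file hierarchy, the sketch below builds the server paths named in the text from the root folder `H:\server\travel`. The helper names `document_path` and `folder_depth` are hypothetical, and `PureWindowsPath` from the Python standard library is used only as a convenient way to manipulate Windows-style paths; nothing here is prescribed by the described system.

```python
from pathlib import PureWindowsPath

# Root folder 332 of the example travel agency site, as given in the text.
ROOT = PureWindowsPath(r"H:\server\travel")


def document_path(*parts: str) -> PureWindowsPath:
    """Build the server path of a document from folder/file names below ROOT."""
    return ROOT.joinpath(*parts)


def folder_depth(path: PureWindowsPath) -> int:
    """Return how many subfolder levels below ROOT the document is stored.

    A document stored directly in the root folder has depth 0.
    """
    return len(path.relative_to(ROOT).parts) - 1


homepage = document_path("index.htm")                # homepage document 331
cruises = document_path("cruises", "index.htm")      # document 333
caribbean = document_path("cruises", "caribbean.htm")  # document 335
```

Note that `cruises` and `caribbean` share the same folder depth even though their pages sit on different levels of the content hierarchy, which reflects the text's point that the file hierarchy is similar to, rather than identical with, the content hierarchy.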
There are two primary schemes used for mapping URLs to their corresponding
Internet resources. The first scheme uses an indirection table with
entries that tie or map a URL to each resource. For example, suppose that
the HTML document for Caribbean Cruise page 328 is stored as
H:\server\travel\cruises\caribbean.htm. The indirection table would contain
a URL entry corresponding to this
file on the server, such as
"http://www.traveltickets.com/cruises/caribbean.htm", or alternately,
there might be an entry for a URL "base/cruises/ | | |