Claims
What I claim is:
1. A computer implemented method of building a database which comprises
sets of associated property values wherein each set includes at least two
property values of different types, the property values being any of
classification values, contact values, geographic location values,
hereinafter collectively referred to as CCG-data, the method comprising
the steps of:
a) retrieving successive web pages from a computer network, each web page
being identified by a URL,
b) searching each web page for a CCG phrase that includes a plurality of
different types of CCG-data attributes,
c) extracting a plurality of said attributes from said phrase,
d) from each extracted attribute, deriving an attribute name and a related
attribute value,
e) determining the type of said extracted attribute and said attribute
value by reference to said attribute name,
f) relating said type of attribute value so determined to a corresponding
type of database property value,
g) relating the URL of said web page to an other type of database property
value,
h) writing said derived attribute value to the database property value of
said determined corresponding type in a set of associated property values,
and
i) writing the URL of said web page to a database property value of said
other type in said set of associated property values.
2. A computer implemented method of building a database which comprises
sets of associated property values wherein each set includes at least two
property values of different types, the property values being any of
classification values, contact values, geographic location values,
hereinafter collectively referred to as CCG-data, the method comprising
the steps of:
a) retrieving successive web pages from a computer network, each web page
being identified by a URL,
b) searching each web page for a CCG phrase that includes at least one type
of CCG-data attribute,
c) extracting at least one said attribute from said phrase,
d) from each extracted attribute, deriving an attribute name and a related
attribute value,
e) determining the type of said extracted attribute and said attribute
value by reference to said attribute name,
f) relating said type of attribute value so determined to a corresponding
type of database property value,
g) relating the URL of said web page to an other type of database property
value,
h) writing said derived attribute value to the database property value of
said determined corresponding type in a set of associated property values,
and
i) writing the URL of said web page to a database property value of said
other type in said set of associated property values.
3. A computer implemented method of building a database which comprises
sets of associated property values wherein each set includes at least two
property values of different types, the property values being any of
classification values, contact values, geographic location values,
hereinafter collectively referred to as CCG-data, the method comprising
the steps of:
a) retrieving successive web pages from a computer network,
b) searching each web page for a CCG phrase that includes a plurality of
different types of CCG-data attributes,
c) extracting a plurality of said attributes from said phrase,
d) from each extracted attribute, deriving an attribute name and a related
attribute value,
e) determining the type of said extracted attribute and said attribute
value by reference to said attribute name,
f) relating said type of attribute value so determined to a corresponding
type of database property value, and
g) writing said derived attribute value to the database property value of
said determined corresponding type in a set of associated property values.
4. A computer implemented method of finding references to web pages posted
on a computer network, the method using a database comprising sets of
associated property values, the property values being any of
classification values, contact values, geographic location values,
hereinafter collectively referred to as CCG-data, and URL references, the
method comprising the steps of:
a) receiving a query phrase including query relational expressions from a
computer network,
b) parsing said query phrase and extracting each of said query relational
expressions included therein,
c) from each extracted query relational expression, deriving a query field
name,
d) determining the type of said query relational expression by reference to
its derived query field name,
e) relating said type of query relational expression so determined to one
of the following query relational expression types: CCG-data type, other
type,
f) provided said query relational expression is a CCG-data type, deriving a
query relational operator and query value related to its query field name
from said query relational expression,
g) determining the type of said query value by reference to said query
field name,
h) relating said type of query value so determined to a corresponding type
of database property value,
i) locating database property values of said determined corresponding type
which return a true value when tested against said query value using said
query relational operator,
j) extracting from said database a list of the URL references associated
with the so located database property values.
5. A computer implemented method of finding sets of associated database
property values the method using a database comprising sets of associated
property values wherein each set includes at least two property values of
different types, the property values being any of classification values,
contact values, geographic values, hereinafter collectively referred to as
CCG-data, the method comprising the steps of:
a) receiving a query phrase including query relational expressions from a
computer network,
b) parsing said query phrase and extracting each of said query relational
expressions included therein,
c) from each extracted query relational expression, deriving a query field
name,
d) determining the type of said query relational expression by reference to
its derived query field name,
e) relating said type of query relational expression so determined to one
of the following query relational expression types: CCG-data type, other
type,
f) provided said query relational expression is a CCG-data type, deriving a
query relational operator and query value related to its query field name
from said query relational expression,
g) determining the type of said query value by reference to said query
field name,
h) relating said type of query value so determined to a corresponding type
of database property value,
i) locating database property values of said determined corresponding type
which return a true value when tested against said query value using said
query relational operator,
j) extracting from said database sets of associated database property
values associated with the so located database property values.
Description
FIELD OF INVENTION
This invention relates to network based classified information systems, to
methods of automatically building searchable databases of classified
information derived from web pages posted on a network, and, to web pages
for use in such systems and methods. The information systems and databases
of most relevance to this invention are those which include classified
product and service catalogues similar to the Yellow Pages telephone
books, contact indexes similar to the White Pages telephone books, and/or
subject indexes similar to Library catalogues. Such information systems
and databases typically include sets of associated classification, contact
and/or geographic items of information. For convenience, classification,
contact and/or geographic information will be hereinafter called CCG-data.
The networks with which this invention is concerned are the worldwide
public computer/communications network commonly known as the Internet and
private networks--sometimes called intranets--which allow common access to
markup documents on computers connected to the network. Markup documents
are text files prepared using various markup languages such as HyperText
Markup Language (HTML) and Extensible Markup Language (XML) which are
implementations (or dialects) of the Standard Generalised Markup Language
(SGML). The system of accessible files on the Internet is called the World
Wide Web (WWW) and the markup documents themselves are commonly called
`web pages`. A web page is said to be `posted` on a network when it is
stored on computer-readable media of a host network computer as a file
which is generally accessible to network users. A web page is transported
from the host computer to a requesting computer through intermediate
network computers as a computer-readable signal embodied in a carrier
wave. Though this invention is not limited to Internet based information
systems, these terms are used for convenience.
BACKGROUND TO THE INVENTION
It has been estimated that there are about 100 million web pages on the
Internet and that the number is doubling every two years. Many of these
pages include information concerning commercially offered goods and
services and often include contact details. But the difficulty of locating
such information is increasing faster than the growth in the number of web
pages.
To help network users locate web pages of interest, certain network
service providers create indexes (or databases) of the contents of web
pages posted (stored on computer readable media so as to be generally
accessible) on the network and provide `search engines` to use the
indexes. These indexes are often created automatically by the use of `web
crawlers` which (i) interrogate computer after computer on the network to
locate successive web pages and (ii) index the words in each web page
encountered against the network address (eg Internet Protocol Address or
IPA) and filing system path or uniform resource locator (URL) at which
the web page is accessible. Hereinafter the terms URL and URI (Uniform
Resource Identifier) are taken to be identical in meaning and to signify
network addresses and filing system paths. Usually, the indexes consist of
a list of unique words with each word having an associated list of URLs of
the web pages wherein the word was found to occur during interrogation.
The URL serves as a `hyperlink` which, if selected by a user/searcher,
results in the associated web page being automatically transmitted from
the computer where it is posted on the network to the user/searcher's
computer where it may be displayed or otherwise processed. The sending and
receiving of files in this way is greatly assisted by user interface
programs called `web browsers` (or more simply, `browsers`) such as
Netscape and Microsoft Internet Explorer.
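
The word index described above can be illustrated with a minimal sketch. It assumes the pages have already been retrieved and are held as plain text keyed by URL; the URLs, the page text and the function name build_word_index are illustrative only and are not taken from any particular search engine.

```python
import re
from collections import defaultdict

def build_word_index(pages):
    """Map each unique word to the sorted list of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        # Lower-case the text and split it on non-alphanumeric characters.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return {word: sorted(urls) for word, urls in index.items()}

# Hypothetical pages standing in for documents fetched by a web crawler.
pages = {
    "http://example.com/a.html": "Smith Plumbing, Main Street",
    "http://example.com/b.html": "Main Street Bakery",
}
index = build_word_index(pages)
print(index["main"])
print(index["smith"])
```

Such an index records only that a word occurs in a page. It carries no indication that "Smith" is a surname or that "Main Street" is part of an address.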
The search for web pages of interest using search engines leaves much to be
desired:
simple searches (those using a few keywords in simple combinations) often
yield far too many web page references (URLs) to permit them to be
interrogated one-by-one,
complex searches (those using many keywords and/or complex Boolean
expressions) require considerable expertise to undertake,
even using optimum search criteria, many irrelevant web pages are
referenced because of inconsistent use of terminology by those who author
the original web pages,
even using optimum search criteria, many relevant pages are missed, again
because of inconsistent use of terminology by web page authors, and
because items of information included in the body of web pages cannot be
`understood` or associated in useful ways by web crawlers; that is
recognised as, say, a surname, a street name, a geographic locality, or
type of goods or services and, say, a surname strongly associated with a
street name, a geographic locality, or a type of goods or service.
The result is that information provided by search engines from databases
which are automatically compiled using web crawlers is a very poor
equivalent of the common Yellow Pages and White Pages directories which
serve the telephone industry (though these directories are not, of course,
automatically compiled from web pages).
In an attempt to improve the usefulness of automatically compiled network
databases, some search engine providers make use of information contained
in URLs, such as the country code and top level domain name codes such as
`com`, `edu`, `net` and `org` which are sometimes used to signify the
subject matter of web pages. It has been proposed to add more content
classifying codes to URLs (eg, "chem" to signify chemical subject matter)
to allow specialised databases--national, commercial, chemical, etc--to be
generated. However, this proposal has serious drawbacks:
URLs are Internet addresses and it is in principle undesirable to confuse
the address function of a URL with that of representing a list of web page
classifications or contact details.
A URL is an inappropriate container of multiple web page classification
codes and contact details because the length of the URL would cause it to
become unwieldy as an Internet address.
Including in a URL classification codes drawn from a list of thousands of
codes would compromise the mnemonic quality of Internet addresses such as
"www.yellowpages.com".
There is substantial overlap in the subject matter contained in web pages
having the various top level domain name codes.
There is no consensus on, or standard for, content classification codes in
URLs.
Another proposal to add content classification data to web pages has arisen
from the wish to identify pages containing material that may be offensive
to some viewers, or should not be accessed by minors. The Platform for
Internet Content Selection (PICS) (see http://www.w3.org/pub/WWW/PICS and
other documents at www.w3.org) is a web page ratings standard similar in
principle to the ratings systems for motion pictures. This system allows
page authors to "internally" self classify their pages through use of the
"<meta . . . >" HTML element. Alternatively, "external" PICS ratings
of web pages may be obtained from ratings service providers accessed each
time a URL is selected. In practice, the ratings service providers have
adopted a very limited range of web page classifications. For example,
Ararat Software's Commercial Rating System (see
http://www.ararat.com.ratings/ararat10.html) provides just 5 categories of
web page content; commercial content, technical/customer support, ordering
information, downloading information and contact information. In other
examples, CyberPatrol (http://www.microsys.com/pics/pics_msi.htm) provides
16 categories, the Recreational Software Advisory Council
(http://www.rsac.org/faq.html) provides 4 categories, SafeSurf
(http://www.safesurf.com/ssplan.htm) provides 11 categories and Vancouver
Webpages Rating Service (http://vancouver-webpages.com/VWP1.0/) provides 11
categories. None of the categories provide classification of web pages by
industry, service, product or subject with sufficient specificity to be
useful when searching for web pages. Rather, the categories are intended
to prevent web browsers from displaying web pages unsuitable for
particular types of web browser users. Such rating systems are not
intended to be used for the automated creation of Yellow or White pages
like databases from web pages and are unsuitable for that purpose because
they cannot represent contact details. Further, the ratings data may only
be encoded in the <meta . . . > element in the <head> of an
HTML document drastically limiting the type and usefulness of the data
that can be encoded.
Another proposal for classifying the content of web pages, the "Meta
Content Framework" (MCF--see http://mcf.research.apple.com/mcf.html),
requires the content of web pages to be classified and the classification
data to be held in a separate non-HTML data file with a MIME type of
text/mcf. Storing data in non-HTML encoded documents which describes the
content of HTML encoded documents is a technical and economic barrier to
the adoption by search engine providers of the proposal. The MCF proposal
is thus entirely unsuited to the automated creation of Yellow or White
pages like databases from HTML encoded web pages (MIME type text/html)
because data stored according to the MCF proposal is not stored in HTML
encoded web pages.
The "Electronic Business Card", vCard, (see "vCard The Electronic Business
Card" Version 2.1, versit Consortium Specification, Sep. 18, 1996 or
ftp://ds.interbic.net/internet-drafts/draft-ietf-asid-mime-vcard-01.txt)
uses a non-HTML data file (MIME Content Types of "text/plain" or the
non-standard "text/X-vCard") containing contact information equivalent to
an extended White Pages entry which can be exchanged on a network using
Simple Mail Transfer Protocol (SMTP) or using HTTP. It can be associated
with a web page by use of a URL in the web page which refers to the vCard
information (eg <a href="http://www.thing.com/vCard.vcf">My
vCard</a>). Version 2.1 vCard standard data file format (published
Sep. 18, 1996) provides for the inclusion of many items of contact
information. The vCard specification recommends that, where possible,
there should be consistent mapping of vCard property names to HTML
"<input>" element attribute names (eg vCard property name "TITLE"
maps to HTML "<input name=`title`>"). The intention is to facilitate
the transfer of vCard data into web page input forms by pasting from a
clipboard or by dragging from other computer applications. The vCard
proposal is unsuited to the automated creation of Yellow or White pages
like databases from HTML encoded web pages because data stored according
to the vCard proposal is not stored in HTML encoded web pages.
The inclusion of classified information in separate documents (such as Meta
Content files or vCards) has the disadvantage that there is necessarily
much duplication of data and coordination of modifications between the
separate documents and the web pages. This must be done to allow a person
who has accessed a web page using an HTML compliant browser to determine
whether it is worth calling up the associated file or vice versa. Also, to
allow portions of web pages to be classified, web page contextual
information would have to be duplicated in the separate document. vCards
in particular do not provide this functionality. Another disadvantage is
that non-HTML documents such as vCards contain no details as to how the
data they contain is to be displayed. In the display of HTML documents the
position, font, size, colour of the text and other elements of the
document are of great importance. The restriction of address data in a
vCard to untagged ordinally organised fields is inflexible. For example,
multiple instances of extended parts of the address are not possible. Also
components of names, addresses and telephone numbers and so forth are
insufficiently identified.
The Online Computer Library Center Inc (OCLC, Dublin, Ohio, USA) proposal,
known as the "Dublin Core", proposes classifying scholarly web pages by
subject (topic of the work, or keywords that describe the content of the
work), title, author, publisher, other agent, date, object type (genre of
the object such as home page, novel, poem etc), form, identifier, source,
language, relationship and coverage (spatial and temporal) (see
http://www.oclc.org:5046/~weibel/html-meta.html and other documents
at www.oclc.org). This proposal does not include industry, service,
product or subject classifications. It also does not include contact
details. Names such as that of the author are not specified in sufficient
detail to avoid ambiguities such as which of the author's names is the
first and which is the last. The proposal specifies that the details are encoded using the
<meta . . . > element in the <head> of web pages. The proposal
is unsuited to the automated creation of Yellow or White pages like
databases from web pages because the proposal does not provide for
classification of web pages and does not provide adequate contact details.
Further, the use of keywords for describing the content of the work adds
very little to the effectiveness of indexing of web pages since the web
pages are usually indexed on every word of their content and most often
the key words would simply be a duplication of words already contained in
the document.
It has also been proposed to use the Dewey Decimal System (see
http://orc.rsch.oclc.org:6109/eval_dc.html and
http://orc.rsch.oclc.org:6109/bintro.html) to rank electronic documents
against a Dewey Decimal subject classification. The proposal suggests
automatically assigning Dewey Decimal subject classification codes to
documents during automated indexing and cataloguing but does not specify
the exact nature of the assignment although it is implied that the codes
are stored separately from the documents. The proposal admits that such
automated classification is less satisfactory than human classification.
The proposal is unsuited to the automated creation of Yellow or White
pages like databases from web pages because its accuracy of classification
is inadequate and because it does not provide for inclusion of industry,
service or product classifications or for inclusion of contact
details. Deriving a subject classification code from an analysis of every
word and phrase in a web page is computationally expensive.
The HTML 3.0 standard (see page 23 of the www.w3.org document
"draft-ietf-html-specv3-00.txt") provides "class" as an attribute of
almost all HTML "<body>" elements. The "class" attribute is intended
to be used with style sheets. Style sheets provide a means by which the
display of HTML documents may be altered to suit the needs of different
classes of browser users. For example, <div class="appendix"> could
be used to define a division that acts as an appendix, <h2
class="section"> could be used to define a level 2 header that acts as
a section header, although, of course, any string of characters could be
defined for those purposes. The "class" attribute has never been
suggested for holding goods and services classifications and is not
suited to such a use because it is undesirable to confuse the style
sheet function of the "class" attribute with that of holding such classifications.
The HTML 3.0 and earlier standards provided the HTML elements
"<person>" and "<address>" but do not specify the form of the
content or method of validating the content of those elements. A person's
name may be written as first name followed by last name or last name
followed by first name. Similarly, different conventions exist for writing
addresses. Similar ambiguities arise in the ill defined format of the HTML
elements "<person>" and "<address>". As such they are of
little use in the automatic compilation of searchable databases.
The XML language (see: http://textuality.com/sgml-erb/WD-xml.html) was
developed to extend HTML so that software vendors can add new elements and
new element attributes to HTML which are not specifically defined in any
HTML standard. The intention is to ensure that all new elements and
attributes could be parsed by all XML parsers even if the new elements
held no significance for any particular XML parser. However, like HTML,
XML does not provide a standard for the representation of industry,
service, product or subject classification, contact or geographic location
details within a web page.
Of course, many useful databases of the Yellow Pages or White Pages type
are made available by service providers on networks, but they are not
compiled automatically by using web crawlers to scan HTML web pages posted
on a network. For example, http://www.yellowpages.com.au and
http://www.mcp.com provide classified advertisements of the Yellow Pages
type with links to the web pages of paying advertisers or subscribers.
There are also directories of email addresses which approximate the White
Pages directories, listing the names of individuals and organisations and
contact details, (eg http://www.bigbook.com and
http://query1.whowhere.com). However, these email directories require
listers to manually add their directory entries and enquirers to be aware
of and to find the directory enquiry web page. They cannot be
automatically generated by scanning web pages using web crawlers since
there is no adequate mechanism to relate email addresses to the names of
people and organisations and their other contact details which may also
exist in the same web page.
OBJECTIVES OF THE INVENTION
The general object of the invention is to provide improved methods for
automatically building searchable databases of classification, contact,
and/or geographical information by using web crawlers to interrogate web
pages posted on a network. [For convenience, this information is
collectively referred to as CCG-data].
Other non-essential objectives are to provide methods for including and/or
displaying CCG-data within web pages accessed by browsers, for
automatically extracting CCG-data from web pages posted on a network and
for using the same, and/or to provide methods for searching automatically
compiled databases using such data.
Another subsidiary objective of the invention is to provide a new form of
web page which is better suited to the automatic compilation (using web
crawlers) of databases constructed by the automatic scanning of many such
pages posted on a network.
OUTLINE OF THE INVENTION
The invention is based upon the realisation that highly useful databases
can be automatically built by successively interrogating web pages posted
on a network if one or more HTML encoded CCG phrases are included in the
web pages. A CCG phrase is one containing CCG-data in a form which is
directly accessible and identifiable. CCG phrases may also include one or
more items which provide the web page author with control over how the
CCG-data is applied to the database.
Data duplication can be reduced if some of the CCG-data in the coded CCG
phrases can be displayed by browsers as well as being used to update
databases. Errors due to inexactly duplicated data are also eliminated.
Accordingly, it is envisaged that CCG phrases may include one or more
items which provide the web page author with control over how the CCG-data
is displayed by a browser.
HTML (including version 2 and version 3) and XML are evolving applications
(sub-sets or dialects) of ISO Standard 8879 1986 known as Standard
Generalised Markup Language (SGML). HTML, in large part, is a language
used to describe how text (unstructured data) and graphics are to be
formatted for display. The HTML language consists of a finite number of
"elements" (for example; "<BR>" where "BR" is the element name, also
called the tag name) which may contain "attributes" (for example; "<DL
COMPACT>" where "COMPACT" is an attribute named "COMPACT") and may
contain values associated with attributes (for example; "<FONT
SIZE=+1>" where +1 is the attribute value of the attribute named
"SIZE"). XML is a language used to describe structured data. The XML
language is similarly composed of elements, attributes and values with a
similar syntax to HTML but unlike HTML the element names which may be used
are not restricted and the meaning of the XML data may be interpreted in
any convenient manner. While the XML language is mute about how data
described by XML is to be formatted for display, the data may be used by
computer programs for any purpose including description of how XML coded
data is displayed. However, due to its historic importance in connection
with web pages, the term "HTML" is herein used to refer to all markup
languages which are subsets or complete sets of the SGML language. In
particular, the term "HTML encoded CCG phrase" and the synonymous term
"CCG phrase" are herein used to refer to CCG-data encoded in a subset or
complete set of the SGML language. Herein, a "web page" is a document
adapted to be or actually accessible through a network and encoded in a
subset or complete set of the SGML language.
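
The anatomy of element names, attribute names and attribute values described above can be seen by passing the three example tags through a parser. The following is a minimal sketch using the html.parser module of the Python standard library; the class name TagLister is illustrative only.

```python
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Record (element name, attribute list) for every start tag encountered."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases element and attribute names; an attribute
        # written without a value (eg COMPACT) is reported with value None.
        self.tags.append((tag, attrs))

lister = TagLister()
lister.feed("<BR><DL COMPACT><FONT SIZE=+1>")
lister.close()
for name, attrs in lister.tags:
    print(name, attrs)
```

Each entry pairs an element name with a list of (attribute name, attribute value) tuples, which is the same name/value structure on which CCG attributes rely.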
For convenience, CCG items in HTML encoded CCG phrases, whether they are
syntactically represented as elements or as attributes, will be referred
to hereinafter as CCG attributes.
A CCG phrase includes at least one of the following identifiable types of
CCG-data attributes:
industry, product, service, and/or subject classifications,
contact categories, contact person(s) and/or organisation(s) names, titles
or associations, contact details including physical and postal addresses,
telephone and fax numbers, email and Internet or network addresses or
locations, public keys, and
geographic location details.
A CCG phrase may also include any of the following identifiable types of
CCG control attributes:
database control attributes to indicate which parts of the data are to be
used to update databases, and
display control attributes to indicate how browsers are to display the
data.
By virtue of occurring in the same CCG phrase, a plurality of CCG-data
attributes are associated with each other.
By virtue of their occurrence in the same CCG phrase, CCG-data attributes
are identified as a set of associated attributes. However the degree of
association between attributes can be controlled by the inclusion in the
phrase of database control attributes.
The start and end of CCG phrases should be identifiable to clearly
distinguish these phrases from other data. To identify the beginning and
end of a CCG phrase, at least one HTML element should have a CCG specific
HTML element name or CCG specific attribute name or CCG specific value.
Each CCG attribute may consist, with or without other incidental
characters, of a CCG attribute name and/or a CCG value or values.
Preferably, each CCG phrase is contained in the "<body>" of the web
page.
Examples of a CCG specific HTML element are: "<CCG . . . >" or
"<CCG . . . />" or "<CCG> . . . </CCG>". (Where a CCG
phrase is coded in XML, the elements "<XML>" and "</XML>" may
also be needed at the start and end of the CCG phrase.) A less
satisfactory example is: "<!--CCG . . . -->" where the characters
"CCG" after HTML comment element name "!--" are used to signify that the
comment contains CCG-data. An example of the use of a CCG specific
attribute name is: "<START CCG>" . . . "<END CCG>". An example
of the use of a CCG specific value is: "<START TYPE=`CCG`>". . .
"<END TYPE=`CCG`>". Obviously, other character strings could be
substituted for the element name, element attribute name or element
attribute value "CCG" string of the examples.
The codes "<CCG . . . >" and "<CCG . . . />" are compatible
with most HTML specifications, but being non-standard HTML, most web
browsers do not display any text or attributes (eg PQ="AQD") within the
angle brackets "<" and ">". These codes are preferred where display
of the CCG data is not required and compatibility with older browsers is
required (eg CCG phrases containing only classification values).
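
As a hedged illustration of how a program might locate CCG phrases coded as "<CCG . . . >" or "<CCG . . . />" elements, the following sketch uses the html.parser module of the Python standard library. The attribute PQ="AQD" is the example given above; the PHONE attribute, its value, the sample page and the class name CCGPhraseFinder are assumptions made for illustration only.

```python
from html.parser import HTMLParser

class CCGPhraseFinder(HTMLParser):
    """Collect the attributes of every <CCG ...> or <CCG ... /> element."""

    def __init__(self):
        super().__init__()
        self.phrases = []

    def handle_starttag(self, tag, attrs):
        # Names arrive lower-cased. handle_startendtag() defaults to calling
        # this method, so the self-closing form <CCG ... /> is caught as well.
        if tag == "ccg":
            self.phrases.append(dict(attrs))

# Hypothetical page; the attribute names are placeholders, not a CCG standard.
page = """<html><body>
<p>Welcome to our shop.</p>
<CCG PQ="AQD">
<CCG PQ="XYZ" PHONE="555 0100" />
</body></html>"""

finder = CCGPhraseFinder()
finder.feed(page)
finder.close()
print(finder.phrases)
```

Attributes found in the same element are kept together in one dictionary, reflecting the principle that CCG-data attributes occurring in the same CCG phrase are associated with each other.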
From one aspect, therefore, the invention comprises a web page for posting
on a network, the web page being characterised by the inclusion of at
least one CCG phrase in the "<body>" of the page, the CCG phrase
being such that the CCG attributes contained therein are accessible and
identifiable by (i) HTML compliant editors and/or (ii) HTML compliant web
crawlers for the automatic construction of databases of classified
information, and/or (iii) HTML compliant browsers for display on the
computer screens of network users.
From another aspect, the invention comprises a method of constructing web
pages of the above described type. The web pages may be constructed on
digital computers using simple text editors such as Microsoft Windows
Notepad, or preferably, purpose built human controlled editors or
automated composing programs which embody knowledge of HTML and CCG syntax
and grammar. Whichever process is used, CCG attributes are selected and
inserted, modified, deleted and/or organised to form valid CCG phrases
in HTML encoded documents and the documents are posted on computer
readable storage devices of computers connected to a computer network so
that the documents are generally available to computers on the network.
From another aspect, the invention comprises a method of populating a
database with CCG-data extracted from web pages. Web pages posted on a
network are successively retrieved by a digital computer program (eg: a
web crawler) and CCG phrases contained therein are identified and at least
some of the CCG attributes found within the CCG phrases are extracted. The
CCG attribute names are used to determine the type of data in the
associated values. Generally the CCG attributes of interest are those
relating to classification, contact and geographic data and database
update controls while the attributes of little or no interest in
relation to database updating are those relating to display controls. Of
course, the CCG-data extracted need only be that relevant to the
particular database being updated. For example, one database may have been
designed to index only web page classifications and URLs while another
database may have been designed to index only contact details. Databases
also differ in their internal representation of data and means of
associating data. For example, some use "flat file" tables, others use
pointers to data to create network associations while others use hashing
and buckets.
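By way of illustration only, the extraction step described above might be sketched as follows in Python. The sketch assumes CCG phrases are encoded as "<CCG . . . >" elements, and the attribute names INDUSTRY, PHONE and CITY, together with the mapping of names to data types, are hypothetical and are not taken from this specification. Attributes whose names are not recognised (for example display controls) are simply skipped.

```python
from html.parser import HTMLParser

# Hypothetical mapping from CCG attribute names to data types; these
# names are illustrative assumptions, not part of the specification.
ATTRIBUTE_TYPES = {
    "industry": "classification",
    "phone": "contact",
    "city": "geographic",
}


class CCGExtractor(HTMLParser):
    """Collects typed attribute values from every <CCG ...> element."""

    def __init__(self):
        super().__init__()
        self.phrases = []  # one dict per CCG phrase found

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names and attribute names.
        if tag != "ccg":
            return
        phrase = {}
        for name, value in attrs:
            data_type = ATTRIBUTE_TYPES.get(name)
            if data_type is not None:  # skip display controls and unknowns
                phrase[name] = (data_type, value)
        self.phrases.append(phrase)


def extract_ccg(html_text):
    """Return a list of {attribute name: (type, value)} dicts."""
    parser = CCGExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.phrases


page = '<html><body><CCG BOLD INDUSTRY="Plumbing" CITY="Canberra"></body></html>'
phrases = extract_ccg(page)
```

A real web crawler would additionally retrieve each page from the network and pass its URL along with the extracted phrases; that part is omitted here.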
The conventional nomenclature differs considerably between different types
of database. Depending on the particular database nomenclature, data of
the same type is said to be stored in table columns, fields, attributes
and properties. The terms column and field are somewhat related to the
physical representation of the data in files while attribute and property
are more related to the logical representation of data. To avoid confusion
with the terms "HTML attribute", "CCG attribute" or just "attribute",
hereinafter a database property means both a type of data stored in the
database and a place in the database where data of the same type is
stored. Database properties are referred to by a name ("property name") or
similar reference and contain values. For example, a database property
with the name "City name" and which contains values which are all the
names of cities may be defined as a "City name" type database property.
Whichever style of database is used, it is preferred that the database
update program relate the CCG attributes to corresponding database
properties used by the database update process so that the database
property values are updated with CCG values in a manner which preserves
the distinctness, content and meaning of the CCG values and, preferably,
preserves the CCG value associations expressed in the CCG phrase as sets
of associated database property values of different types.
In some cases, it is desired to know the address of the web page from which
the CCG values were extracted. For example, the purpose of building a
database might be to allow searching of the database by web page
classification to provide a list of URLs of web pages or URLs of portions of
web pages which contain matching CCG classifications. The URLs could then
be inserted in an HTML document and transmitted to a web browser as a list
of references to web pages matching a search expression. In that example,
associating the URL of a web page or the URL of a portion of a web page
with the CCG values extracted from the same web page or web page portion
is important and the URL or means of reconstructing it must be available
and supplied to the database update process. In one style of database, the
values of the same type are held in separate rows in a column (property) of a
database table, and pointers held in another column (property) are
associated with the values by sharing the same table row. The table row
constitutes a set of associated property values. Each pointer points to a
bucket (block of data) containing a list of URLs or pointers to URLs held
in a separate bucket or table. In another style of database, values of
different types are held in different tables together with a set number,
pointer or similar code which is used to indicate which values are
associated as members of the same set. In one variation, the values of set
members are prefixed with a code indicating the type of value and all
values are held in the same column of a table. If the purpose of the
database is to hold contact data, recording the web page URL in the
database might not be required although if the URL is not present in the
database, updating changes in the CCG contact details contained within a
web page is more difficult. Of course, one database may be used to record
all types of CCG values contained in web pages and associate with each
other any and all values extracted from the same web page or even from
other web pages.
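As a minimal sketch of one of the database styles mentioned above, the following Python fragment stores each CCG phrase as one table row, so that the row constitutes a set of associated property values and includes the URL of the source page. The table name, the column names and the mapping from CCG attribute names to database properties are all assumptions made for illustration. An in-memory SQLite database is used only for convenience.

```python
import sqlite3

# Hypothetical mapping from CCG attribute names to database property
# (column) names; none of these names come from the specification.
PROPERTY_FOR_ATTRIBUTE = {
    "industry": "classification",
    "phone": "contact_phone",
    "city": "city_name",
}


def open_database():
    """Create an in-memory database with one row per set of values."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ccg_sets ("
        " url TEXT, classification TEXT, contact_phone TEXT, city_name TEXT)"
    )
    return conn


def store_phrase(conn, url, attributes):
    """Write one CCG phrase as one row, keeping its values associated."""
    row = {"url": url}
    for name, value in attributes.items():
        prop = PROPERTY_FOR_ATTRIBUTE.get(name.lower())
        if prop is not None:  # attributes with no matching property are ignored
            row[prop] = value
    # Column names come only from the fixed mapping above, never from the page.
    columns = ", ".join(row)
    placeholders = ", ".join("?" for _ in row)
    conn.execute(
        f"INSERT INTO ccg_sets ({columns}) VALUES ({placeholders})",
        list(row.values()),
    )


conn = open_database()
store_phrase(
    conn,
    "http://www.example.com/plumber.html",
    {"INDUSTRY": "Plumbing", "CITY": "Canberra", "SIZE": "+1"},
)
rows = conn.execute(
    "SELECT url, classification, contact_phone, city_name FROM ccg_sets"
).fetchall()
```

Because the URL shares a row with the extracted values, a later change to the CCG contact details in the page can be applied by locating the row through its URL.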
From another aspect, the invention comprises a method of searching the
databases constructed as outlined above. These databases may be used for a
variety of searching purposes. For example, to find web page URLs by using
the association of web page URLs with industry, service, product or
subject classification or a person's or organisation's name or address or
geographic location values or any combination thereof. In another example,
the databases may be used to find the contact details for people or
organisations by name or location or industry, service, product or web
page subject type and so forth by using the association between items of
the contact details in the database without having to retrieve web pages
associated with the contact details.
More particularly, the searching method involves finding URL references, or
finding sets of associated database property values, from databases
containing CCG-data. The method includes the steps of parsing a query phrase
received from a computer network to extract query relational expressions
and, from each expression, deriving a query field name, query relational
operator and query value, determining the type of the query field by
reference to its name, relating the query field to a corresponding
database property according to type and locating CCG-data database
property values in the database property which return a true value when
tested against the query value using the query relational operator.
Finally, the URL references or the sets of property values associated with
the so located CCG-data database property values are extracted.
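The searching steps just described could, for example, be sketched in Python as shown below. The query syntax (relational expressions joined by " AND "), the query field names and the database property names are hypothetical and chosen only for illustration. The database is represented as a plain list of dictionaries, each being one set of associated property values.

```python
import operator
import re

# Hypothetical query field names mapped to database property names.
PROPERTY_FOR_FIELD = {"industry": "classification", "city": "city_name"}

OPERATORS = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
}

# A field name, a relational operator, then a quoted or bare value.
EXPRESSION = re.compile(r'\s*(\w+)\s*(!=|=|<|>)\s*(?:"([^"]*)"|(\S+))\s*$')


def parse_expression(text):
    """Derive (database property, operator function, value) from one expression."""
    match = EXPRESSION.match(text)
    if match is None:
        raise ValueError(f"not a relational expression: {text!r}")
    field, op, quoted, bare = match.groups()
    value = quoted if quoted is not None else bare
    # The field is related to a database property by reference to its name.
    return PROPERTY_FOR_FIELD[field.lower()], OPERATORS[op], value


def search(records, query):
    """Return the URLs of the sets that satisfy every AND-ed expression."""
    tests = [parse_expression(part) for part in query.split(" AND ")]
    return [
        record["url"]
        for record in records
        if all(
            record.get(prop) is not None and op(record[prop], value)
            for prop, op, value in tests
        )
    ]


records = [
    {"url": "http://www.example.com/a.html",
     "classification": "Plumbing", "city_name": "Canberra"},
    {"url": "http://www.example.com/b.html",
     "classification": "Plumbing", "city_name": "Sydney"},
]
matches = search(records, 'Industry="Plumbing" AND City=Canberra')
```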
Database queries are usually expressed in a query language in the form of a
phrase or sentence. In query by example style enquiry systems, the user
types values into input fields on a form and a program extracts the input
values and uses the values to automatically compose a query phrase or
sentence. There are many existing examples of query languages used in
connection with databases. Generally, they consist of relational
expressions (eg Field=Value), logical expressions and grouping of
relational and logical expressions by means such as parentheses. They may
also contain sorting and output formatting expressions. Often abbreviated
notation is used in time expressions such as leaving out field names or
relational operators which are then inferred from the value in the
expression or implied by default. In an enquiry the nature and format of
the output may also be implied, such as a list of URLs of web pages or a
list of contact details. Whatever is the mechanism of any particular
database, the query expression needs to be parsed and fields in the query
expression, explicit, default, implied or inferred, need be related to
database properties of similar type. In some styles of database enquiry
the query expression is evaluated against each row of a table or record of
a file to find rows or records (ie a set of associated property values)
which match the query expression. In other styles, sub-sets of the values
of the properties are selected according to the interpretation of
relational expressions in the query expression and the sub-sets are
combined according to logical and grouping expressions in the query to
find the sets of associated property values which match the query
expression. Often, to make logical operations which combine the selected
sub-sets more efficient, it is not the values which are selected but
pointers to the values (eg Table name and table row) or unique keys (eg
URLs or pointers to URLs) associated with the values. For example, the AND
logical operator is often used to combine two lists so that only values or
pointers or keys common to both lists are found in the combined list.
Usually, the query produces a result list which is then provided to other
processes. For example, a list of URLs of web pages is processed to
produce an attractively formatted HTML encoded document containing the
URLs and is sent to a web browser to allow an enquirer to retrieve
interesting web pages. In another example, the contact details associated
in the database with each value or pointer in the result list are
retrieved from the database and presented as a report in the form of an
HTML encoded document which is sent to a web browser for viewing.
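The combination of selected sub-sets by logical operators can be illustrated with a short Python sketch in which each relational expression selects a set of keys (here URLs) rather than the values themselves. The index contents and the function names are assumptions made purely for the example.

```python
# Hypothetical index: (property name, value) -> set of URL keys.
index = {
    ("classification", "Plumbing"): {
        "http://www.example.com/a.html",
        "http://www.example.com/b.html",
    },
    ("city_name", "Canberra"): {
        "http://www.example.com/a.html",
        "http://www.example.com/c.html",
    },
}


def select(prop, value):
    """Return the keys associated with one property value."""
    return index.get((prop, value), set())


def combine_and(first, *others):
    """AND: keep only the keys common to every list."""
    return set(first).intersection(*others)


def combine_or(first, *others):
    """OR: keep the keys found in any list."""
    return set(first).union(*others)


plumbers = select("classification", "Plumbing")
in_canberra = select("city_name", "Canberra")
both = sorted(combine_and(plumbers, in_canberra))
either = sorted(combine_or(plumbers, in_canberra))
```

Working on keys keeps the intermediate lists small; the associated property values need only be fetched for the keys that survive the combination.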
From another aspect, the invention comprises a method of displaying
CCG-data contained in CCG phrases within web pages which are displayed by
a web browser executing on a digital computer. While a web page is loading
or has loaded in a web browser, the web browser parses the web page and
displays the text (or data) of the web page on a display device connected
to the computer. When the web browser parser encounters CCG phrases, the
web browser may display the CCG-data (element and/or attribute names (or
translations of element and/or attribute names) and/or values) in a number
of browser specific ways. For example, the web browser may by default not
display any CCG-data, display all CCG-data, not display any CCG-data until
a CCG display control attribute explicitly states that subsequent data
should be displayed or display all CCG-data until a CCG display control
attribute explicitly states that subsequent data should not be displayed.
The web browser may also use CCG display controls specifying the size,
font, position and so forth to alter the display of the CCG-data.
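The effect of the default display behaviour and of display control attributes may be sketched, by way of illustration only, in Python as follows. The sketch uses the SHOW and HIDE control names that appear in the syntax of Example 1 and assumes that each control applies to all subsequent attributes of the phrase until countermanded. The function name, the attribute names PHONE, INDUSTRY and CITY, and the representation of a phrase as an ordered list of (name, value) pairs are hypothetical. Other display controls such as SIZE are not handled.

```python
def visible_values(attributes, show_by_default=False):
    """Return the (name, value) pairs that would be displayed.

    `attributes` is the ordered list of (name, value) pairs of one CCG
    phrase; valueless controls such as SHOW carry the value None.
    """
    showing = show_by_default
    shown = []
    for name, value in attributes:
        key = name.upper()
        if key == "SHOW":
            showing = True   # applies to everything that follows
        elif key == "HIDE":
            showing = False  # countermands an earlier SHOW
        elif value is not None and showing:
            shown.append((name, value))
    return shown


phrase = [
    ("INDUSTRY", "Plumbing"),
    ("SHOW", None),
    ("PHONE", "555 0100"),
    ("HIDE", None),
    ("CITY", "Canberra"),
]
hidden_by_default = visible_values(phrase)
shown_by_default = visible_values(phrase, show_by_default=True)
```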
DESCRIPTION OF EXAMPLES
Having indicated the nature of the present invention, examples or
embodiments thereof will now be described by way of illustration only.
Example 1
HTML Syntax Suitable for Representing a CCG Phrase
The following is an example of HTML element syntax suitable for
representing CCG phrases in which a control (e.g. "SHOW") may be "good
until countermanded" and thus apply to more than one field:
<CCG HREF="url"
{{NAME="label" | ID="identifier_code"} &|
{LANG="language_code" &
CLASS="Class_name"}
{
{SET_SEPARATOR} &|
{INDEX | NOINDEX} &|
{SHOW | HIDE} &|
{XPOS="horizontal_position_number"} &|
{YPOS="vertical_position_number"} &|
{NEWLINE} &|
{ALIGN=center | left | right | justify} &|
{SIZE=[+/-] 1 | 2 | 3 | 4 | 5 | 6 | 7} &|
{COLOR="#rrggbb" | "color_name"} &|
{FACE="type_face_name"} &|
{BLINK &| BOLD &| UNDERLINE &| ITALIC &| STRIKE} &|
{SUBSCRIPT | SUPERSCRIPT} &|
{CLEAR{=left | right | all}}
{NORMAL} &|
{{{CONTACT &| COPYRIGHT &| DEVELOPER} &|
{PERSONAL &| BUSINESS &| ASSOCIATION} &|
{attribute_name="attribute_value(s)"}
}
...
>
where: the ellipsis " . . . " implies optional repetition of the braced
("{" "}") items; the braces are used to group items and are not CCG
syntactic elements; "&" (and) implies items must occur together;
"|" (or) implies only one item must occur; and
"&|" (and/or) implies any, including none, of the items may appear
together.
Using the syntax of this example, each CCG phrase is represented as an HTML
element, the element name being "CCG" and the CCG-data (eg
attribute_name="attribute_value") and CCG controls (eg SIZE=+1) are
represented as attributes of the HTML element. Some of the attributes (eg
SIZE) have explicit values (eg +1) and some attributes have implied
values depending on the presence or absence in a CCG phrase (eg when the
attribute BUSINESS is present it has the implied value of True and the
implied value of False when absent).
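The distinction between explicit and implied values might be handled as in the following Python sketch, which relies on the standard html.parser module reporting a valueless attribute with the value None. The class name and the PHONE attribute are hypothetical; the flag names are those listed in the syntax above.

```python
from html.parser import HTMLParser

# Valueless attributes from the syntax above whose presence implies True.
FLAG_NAMES = (
    "contact", "copyright", "developer",
    "personal", "business", "association",
)


class FlagReader(HTMLParser):
    """Reads the first <CCG ...> element and separates flags from values."""

    def __init__(self):
        super().__init__()
        self.flags = None     # flag name -> implied True/False
        self.explicit = None  # attribute name -> explicit value

    def handle_starttag(self, tag, attrs):
        if tag != "ccg" or self.flags is not None:
            return
        present = {name for name, _ in attrs}  # names arrive lower-cased
        # Presence implies True, absence implies False.
        self.flags = {flag: flag in present for flag in FLAG_NAMES}
        self.explicit = {name: value for name, value in attrs
                         if value is not None}


reader = FlagReader()
reader.feed('<CCG BUSINESS SIZE="+1" PHONE="555 0100">')
reader.close()
```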
Representation in XML syntax requires, at most, only a simple translation.
All the items, such as "NORMAL" and "attribute_name" may remain unchanged
as attributes of the element named "CCG" (eg <CCG size="+1"/>, XML
requiring that attribute values be quoted).
However, when a CCG phrase is encoded in XML, it is preferred that the
items are represented as XML elements. For example attribute "SIZE=+1" can
be represented as element "<size>+1</size>" or "<size
value="+1"/>" and "NORMAL" can be represented as "<normal/>".
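A minimal sketch of the preferred translation, in which each item of a CCG phrase becomes an XML child element, is given below in Python using the standard xml.etree.ElementTree module. The function name and the representation of the phrase as a list of (name, value) pairs are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def phrase_to_xml(attributes):
    """Translate the (name, value) pairs of one CCG phrase into XML text."""
    root = ET.Element("CCG")
    for name, value in attributes:
        child = ET.SubElement(root, name.lower())
        if value is not None:
            child.text = value  # explicit value becomes element content
        # a valueless item such as NORMAL becomes an empty element
    return ET.tostring(root, encoding="unicode")


xml_text = phrase_to_xml([("SIZE", "+1"), ("NORMAL", None)])
parsed = ET.fromstring(xml_text)  # the result is well-formed XML
```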
In this example, the attributes ID, LANG and CLASS take their meanings
from HTML 3.0. The "url" in HREF="url" may be a link with or without
destination anchor labels. For example the URL http://www.w3.org/docs.html
does not contain a destination anchor label (or identifier) while
http://www.w3.org/docs.html#searching does contain the destination anchor
label "#searching" which is intended to refer to an anchor in docs.html such
as <A NAME="searching"> . . . </A>. There is some confusion in
various HTML standards documentation about the distinction between the
expression NAME="label" and the expr