Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to an information storage and retrieval
system which permits storage, retrieval and display of information such as
documents, drawings, photographs and the like in such a manner in which
common users can easily manipulate the system for the storage and/or
retrieval of information.
2. Description of the Prior Art
Heretofore, management of a data base which permits storage and retrieval
of an enormous amount of information has been left to specialists, so that
the information is available to the end user only through the medium of
experts. However, with the development of small-size, large-capacity
storage devices such as optical disks, document filing systems for office
use which can be directly manipulated by the end users have been realized.
Further, word processors have increasingly come into wide use. Under the
circumstances, there is an increasing tendency for a large number of
documents to be stored in electronic devices.
Heretofore, items, such as documents, are managed in tabular form listing
bibliographic data such as identification names, titles and author's names
attached to the documents, and attempts have been made to facilitate the
retrieval of information by assigning keywords or classification codes
thereto. Nevertheless, there arise problems mentioned below.
In most of the computer file systems, the file management is performed with
the aid of identification names (each composed of ca. 20 characters).
However, difficulty is often encountered in naming the document or file so
that it can be readily recalled. Besides, searching the file on the basis
of the character string which constitutes the name while inferring the
contents from the name is an extremely difficult job even for the user who
has prepared the name himself.
Since the bibliographic data are objective items, registration thereof can
be easily made. However, there scarcely arises the situation in which the
bibliographic data are made use of as means for retrieval. Utilization of
the bibliographic data as the aid for the retrieval is restricted to the
rare case in which the document to be retrieved is clearly known to the
user as the source or reference literature.
In most cases of the retrieval of documents, the title ambiguously memorized
by the user or the contents thereof provides a clue for the retrieval. To this
end, keywords and classification codes are employed. However, difficulty
is encountered in assigning the keywords or classification codes to the
documents upon registration thereof. In other words, it is difficult to
determine the keyword which makes it possible to retrieve properly the
associated document later on. By way of example, it is assumed that many
keywords are attached to a document so that it can be retrieved as viewed
from various perspectives. This, however, means that a number of keywords
which are useless for retrieval are employed. If the number of the
keywords is decreased, uncertainty arises as to the correct selection for
retrieval. In data bases of literature, preparation and allocation of
the keywords have heretofore been left to specialists.
Moreover, difficulty is often encountered in recalling the keyword itself.
By way of example, upon preparation of the retrieval formula composed of
the keywords for the retrieval of a document, literatures having a
resemblance to the desired one are searched out from a general list for
picking up their keywords, which are then referred to for determining the
keywords possibly allocated to the desired document. Such procedure is not
rare and tells how difficult it is to recall the keyword.
In the case of filing documents through classification, ambiguity of the
taxonomic tree (hierarchical tree) as well as confusion of the taxonomic
trees (i.e. multiple classifications of one document) provide problems.
Further, standards for the classification vary as time passes. A span of
several years will make the classification standards useless, giving rise
to another problem.
Under the circumstances, easy management and retrieval of information for
the user provide extremely important problems remaining to be solved in
the hitherto known document filing systems.
As an attempt to cope with the above problems, there has been proposed a
method of diagraming the retrieval conditions and deriving a formal query
formula for the retrieval by using natural language, as disclosed in J. F.
Sowa's "Conceptual Graphs for a Data Base Interface", IBM J. Research and
Development, Vol. 20, 1976, pp. 336-357. Furthermore, a method of
assisting creation of the conditional formula for retrieval by presenting
knowledge concerning the contents of a data base from a computer is known,
as disclosed in F. N. Tou et al's "RABBIT: An Intelligent Database
Assistant", Proceedings of National Conference of AAAI, 1982, pp.
314-318. These methods are intended only for assisting the retrieval from
the data base. No teachings are disclosed as to the assistance of storage
of information for the updating purpose.
In the filing of documents by the end user, registration of new documents
as well as maintenance of the file system (e.g. reexamination as to
pertinency of classification) is important for realizing the facilitated
retrieval. The approaches mentioned above do not meet this requirement.
Finally, the retrieval is accompanied by still another problem. Namely, no
measures are available for re-examining the old information from the
viewpoint of a new concept which had not yet been clearly defined at the
time the old information was stored, or for retrieving from the new point
of view. By way of example, there often occurs a case in which
classification is to be modified from the new viewpoint or in a manner
specific to the user himself after a lapse of several years. In this way,
possibility of rearrangement of information as well as alteration of
retrieval also provide important factors for enhancing the easy usability
of the information storage and retrieval system.
SUMMARY OF THE INVENTION
An object of the present invention is to solve the problems mentioned above
and provide an information storage and retrieval system which allows the
user to retrieve the desired document from ambiguous or vague and
fragmentary (partial) information in a facilitated and simplified manner
while making it easy to enter or register documents and other information.
In view of the above and other objects which will be more apparent as
description proceeds, there is provided according to a general aspect of
the invention an information storage system in which a mechanism of
storing information in the machine is so arranged as to be compatible or
comparable to the user's memorization mechanism and thinking process so
that the end user can easily understand manipulation of the system to
thereby enhance the facilitated usability thereof.
More specifically, the invention aims to facilitate registration of new
information and the inputting of conditions for retrieval, to realize
semantically meaningful retrieval, and to adapt the retrieval to a
diversity of viewpoints.
To this end, the system according to the invention is imparted with the
novel functions mentioned below:
(1) Supporting function for registration.
For registration of new documents, it is necessary to input the subject
matter and the nature or class thereof in addition to the entry of the
bibliographic items (author's name, title, the sources and others).
Further in order to realize semantic retrieval, it is required to
additionally provide more detailed or concrete information. By way of
example, suppose that the subject matter is a computer. Then, there may be
required such information as "what kind of computer it is", "what
characteristics it has", "what company has developed it", "where the
company is located", "which country the location belongs to", and so
forth. When the information mentioned above is stored, it is possible to
retrieve with the aid of inference function "the document concerning a
computer developed by a certain company located in a country A and having
characteristic features B".
According to the teachings of the invention, knowledge about the concepts
"computer", "company" and others is stored in the storage system, wherein
upon addition of new information, the user is given instruction as to what
kind of property data should be inputted through dialogical procedure, so
that he or she can input the data within a short time without being
accompanied with entry of erroneous or false information.
In the case where information of similar property has already been
registered, a function is provided which allows only the properties
differing from those of the registered information to be inputted, without
need for entering all the property data of the information to be newly
inputted, to thereby facilitate the inputting procedure. By way of example, suppose a
case in which a man named "John Smith" has been already registered and his
brother named "George Smith" is to be newly registered. In that case, by
selecting "John Smith" as a similar concept, the system displays a list of
the properties of this concept, for example, in a manner as follows:
(FATHER-IS "Davise Smith")
(MOTHER-IS "Samanser Smith")
(BIRTHDAY-IS "May 4, 1960")
(SEX-IS "male")
(HOBBY-IS "music") (1)
Then, the user can input the properties of the concept "George Smith" that
differ from the above, e.g. (BIRTHDAY-IS "June 7, 1963") and (HOBBY-IS
"sport").
(2) Supporting function for retrieval condition input.
When the end user is going to perform the retrieval of a document, it is
common that he or she has only an ambiguous image or concept of the
document and has difficulty in expressing it in the natural language.
According to the teaching of the present invention, the retrieval is
started from the most important concept and information is sequentially
added through dialogical procedure or interaction. To this end, the
knowledge of the world model concerning the content of the filed documents
is stored in the system as is the case with the registration assistance
function. On the basis of the knowledge, the names of properties which can
be inputted and the concept (class of things) to which the properties may
belong are presented to the user.
By way of example, suppose that what the user wants is "technical paper".
Then, the user inputs "technical paper". The system knows that "technical
paper" has properties such as "author", "title", "subject matter" and
others. Accordingly, the system displays on a terminal CRT sets of names
of such properties and concepts such as (author, name), (title, text), ...
and (subject, concept). The user who observes the display in turn inputs
the selected data which the user memorizes as the relevant information.
For example, "subject" is selected and "computer" is inputted. This
process can be recursively repeated. In the above example, when the
"computer" is inputted as the selected subject, the system in turn
displays (DEVELOPED-BY ORGANIZATION COMPANY), (RUNS COMPUTER-LANGUAGE),
(RUNS-UNDER OS) and others. In response thereto, the user will input (RUNS
LISP) as the additional condition for retrieval.
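The dialogical procedure described above may be sketched, purely as an illustration, in Python. The table FRAMES and the functions menu and add_condition are hypothetical names introduced here; the property lists are those given in the example above, and the tuple representation of a condition is an assumption.

```python
# Hypothetical frames: concept -> list of (property name, class of allowed value).
FRAMES = {
    "TECHNICAL-PAPER": [("author", "name"), ("title", "text"), ("subject", "concept")],
    "COMPUTER": [
        ("DEVELOPED-BY", "ORGANIZATION COMPANY"),
        ("RUNS", "COMPUTER-LANGUAGE"),
        ("RUNS-UNDER", "OS"),
    ],
}


def menu(concept):
    """Return the (property, value class) pairs the system would present."""
    return FRAMES.get(concept, [])


def add_condition(condition, prop, value):
    """Attach (prop value) to a condition; value may itself be a nested condition."""
    concept, constraints = condition
    offered = [name for name, _ in menu(concept)]
    if prop not in offered:
        raise ValueError(f"{prop!r} is not a known property of {concept!r}")
    return (concept, constraints + [(prop, value)])


# The user starts from "technical paper", selects "subject", inputs "computer",
# and then refines the computer with (RUNS LISP).
computer = add_condition(("COMPUTER", []), "RUNS", "LISP")
paper = add_condition(("TECHNICAL-PAPER", []), "subject", computer)
```

Because the value of a property may itself be a condition, the same two functions serve every level of the recursion mentioned above.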
By virtue of the assistance function mentioned above, there can be
established the retrieval condition as follows:
______________________________________
"Technical paper about computer in which LISP
runs and which is written by an employee of company A"
(2)
______________________________________
As will be described in detail hereinafter, the above retrieval condition
is expressed in the formula or expression as follows:
______________________________________
(TECHNICAL-PAPER
 (SUBJECT-IS
  (COMPUTER (RUNS LISP)))
 (AUTHOR-IS
  (EMPLOYEE (WORKS-AT COMPANY A))))
(3)
______________________________________
The above expression is based on symbolic expression (S-expression) in LISP
Language (refer to P. H. Winston "LISP" Addison-Wesley Publishing Co.,
1981, p. 18).
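As an illustration of how such a symbolic expression can be handled by a program, the following Python sketch reads expression (3), written with balanced parentheses, into nested lists. The function name parse_sexpr is hypothetical, and the reader is a minimal one that supports only parentheses and bare symbols.

```python
def parse_sexpr(text):
    """Parse one S-expression made of parentheses and bare symbols into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == ")":
            raise ValueError("unexpected ')'")
        if token == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise ValueError("missing ')'")
            pos += 1  # consume the closing parenthesis
            return items
        return token  # a bare symbol such as LISP or COMPANY

    result = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after expression")
    return result


condition = parse_sexpr("""
(TECHNICAL-PAPER
 (SUBJECT-IS
  (COMPUTER (RUNS LISP)))
 (AUTHOR-IS
  (EMPLOYEE (WORKS-AT COMPANY A))))
""")
```

In the resulting structure, the first element of each list names a concept or a relation and the remaining elements are its constraints, so that condition[1] is the SUBJECT-IS constraint and condition[2] is the AUTHOR-IS constraint.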
(3) Semantic retrieval function.
It is common that a user who wants to retrieve a certain item has only
fragmentary and ambiguous information thereof. On the other hand, the
computer memory (e.g. data base) stores that item in a concrete name. The
gap between the user's fragmentary information and the precise data stored
in the computer memory must be bridged.
In this connection, the ambiguity may be generally classified into five
varieties mentioned below:
(i) Incompleteness of name
Only a part of the name of an item or concept is memorized.
(ii) Synonym
The same thing is often memorized or recalled in terms of different words.
By way of example, words "artificial intelligence", "thinking machine",
and "AI" indicate the same concept.
(iii) Incompleteness of number.
It is rare that a user remembers numerical values precisely, as exemplified
by "during the generation of 1980s", "about 1985", "from 1983 to 1987",
"before 1960" and so on.
(iv) Taxonomic conceptual abstraction -1
Things and concepts are often memorized in terms of concepts of higher rank
with the concrete contents being forgotten. Memorization of a thing is often
based on the classification of concepts, as exemplified by statements that
"although the name of the company is forgotten, the organization is
neither university nor laboratory but a company at any rate", "that was a
certain electric machinery manufacturer" or the like.
In this case, assuming that the electric machinery manufacturer is "ABC
Co., Ltd.", for example, the following relations hold true.
("ABC Co., Ltd." IS-A ELECTRIC-MANUFACTURER)
(ELECTRIC-MANUFACTURER IS-A MANUFACTURER)
Schematically, the concepts "ABC Co., Ltd." and "ELECTRIC-MANUFACTURER" are
coupled by a link "IS-A". Herein, the link "IS-A" represents a relation
defined between the two concepts mentioned above and is referred to as the
subsumption relation which is an ordered relation representing a
superclass relation between two concepts.
In general, it is believed that all the concepts constitute a hierarchical
taxonomy by means of the link "IS-A". The resulting hierarchical tree is
referred to as a concept tree or conceptual tree.
(v) Partonomic conceptual abstraction -2
The abstraction discussed above is a sort of set theoretical abstraction.
It should be pointed out that people often memorize a thing in terms of
the upper rank part in the part-whole relation of a concept. For example, a person says
that "although I can not remember the factory where Mr. A works, I am sure
that he is an employee of ABC Co., Ltd." or "although I can not remember
what the city is called, I am sure that the city is located in the state
of California".
In contrast, the conventional data base stores the corresponding facts in
more definite manner such as "Mr. A works at XYZ factory" or "ABC Co.,
Ltd., is located at Los Angeles". Accordingly, the information stored in
the data base can not be retrieved starting from the ambiguous information
memorized by the user.
In this case, the following relations play an important role.
("ABC Co., Ltd." HAS-PART-OF "XYZ factory")
("California state" HAS-PART-OF "Los Angeles").
What is important to be noted is that
("Los Angeles" IS-A "California state")
is not correct, but should be
("Los Angeles" IS-PART-OF "California state").
These relations "IS-PART-OF" and "HAS-PART-OF" are referred to as
"part-whole" relations which are ordered relations representing a
structural inclusion relationship between two concepts.
This relation should be clearly distinguished from the subsumption relation
described above. Parenthetically, it should be mentioned that the relation
"IS-PART-OF" is a reverse relation of "HAS-PART-OF".
In a stricter sense, a relation having directivity is referred to simply
as the relation, while it is referred to as the relationship when the
direction is not concerned.
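The distinction between the subsumption relation and the part-whole relation may be illustrated by the following Python sketch. It uses only the example facts given above; the table names IS_A and HAS_PART_OF, the function names, and the simplifying assumptions that each concept has a single superclass and a single containing whole, with no cycles, are introduced for this illustration only.

```python
# Subsumption links: concept -> its immediate superclass (single parent assumed).
IS_A = {
    "ABC Co., Ltd.": "ELECTRIC-MANUFACTURER",
    "ELECTRIC-MANUFACTURER": "MANUFACTURER",
}

# Part-whole links stored in the HAS-PART-OF direction: (whole, part).
HAS_PART_OF = [
    ("ABC Co., Ltd.", "XYZ factory"),
    ("California state", "Los Angeles"),
]


def is_a(concept, ancestor):
    """True if ancestor is reached from concept by following IS-A links upward."""
    current = concept
    while current in IS_A:
        current = IS_A[current]
        if current == ancestor:
            return True
    return False


def is_part_of(part, whole):
    """IS-PART-OF is the reverse of HAS-PART-OF, followed transitively."""
    containing = {p: w for (w, p) in HAS_PART_OF}  # part -> whole
    current = part
    while current in containing:
        current = containing[current]
        if current == whole:
            return True
    return False
```

With these tables, is_a("ABC Co., Ltd.", "MANUFACTURER") and is_part_of("Los Angeles", "California state") both hold, whereas is_a("Los Angeles", "California state") does not, which reflects the distinction drawn above.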
As to a person's memorization faculty or characteristic, it may further be
pointed out that relation between the concepts is more susceptible to be
memorized than the concepts themselves. For example, in the case of
retrieval starting from such fragmentary ambiguous information that "the
subject matter of a certain article is an operating system which was
developed by an institute in U.S.A.", the fact "developed" is important,
and this fact represents a "relation" defined between the two concepts
"operating system" and "institute". More concretely, the retrieval
condition may be expressed as follows:
______________________________________
("UX OPERATING SYSTEM"
IS-DEVELOPED-BY
"INSTITUTE B")
______________________________________
wherein "IS-DEVELOPED-BY" represents the relation. In the retrieval based
on the ambiguous information, this "relation" defined among the concepts
is important.
Among the characteristics of a person's memorizing faculty, the
incompleteness of name and numerical values are taken into consideration
in the hitherto known information retrieval. For example, there can be
mentioned the matching function of fragmentary (partial) character string
and designation of numerical range. The semantic retrieval function
according to the invention is characterized above all by the conceptual
abstractions among the classified varieties described above. More
specifically, with the aid of the retrieval condition input supporting
function, the semantically ambiguous retrieval is rendered possible, as
follows:
______________________________________
Retrieval Condition:
"Article concerning a
computer developed by a
certain company located in
California state and in which
an operating system developed
by a certain institute
runs" (4)
______________________________________
In the above conditional statement, the concrete concept is only
"California state". Other words which may possibly be used as keywords are
"computer", "institute", and "operating system". Through the hitherto
known information retrieval system, e.g. keyword retrieval system, no
satisfactory results of retrieval can be obtained. It is however noted
that the conditional statement (4) is considered a "semantically meaningful
retrieval condition" according to the invention, because the statement (4)
contains relations between "California state" and "company", "company"
and "computer", and "operating system" and "computer", respectively, as
the information for retrieval. Further, in the sense that "company",
"computer", "operating system" are generic names (abstract concepts), the
so-called "abstract" retrieval is realized. In contrast, in the case of
the hitherto known retrieval system, since the relations between keywords
are not stated, the above statement (4) may be erroneously interpreted as
"article about computer introduced by an institute located in California
state and in which operating system developed by a certain company runs",
which is of course "semantically meaningless retrieval".
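The difference between retrieval by relations and retrieval by keywords may be illustrated by the following Python sketch. All stored facts, including the second article "ART #019" and the concepts "Y-100" and "ZZ", are hypothetical test data; the relation names other than IS-DEVELOPED-BY, the function name matches, and the restriction to one level of classification are assumptions made to keep the sketch short.

```python
# Hypothetical knowledge: each concrete concept and its immediate class.
IS_A = {
    "ART #018": "ARTICLE", "ART #019": "ARTICLE",
    "X-800": "COMPUTER", "Y-100": "COMPUTER",
    "UX": "OPERATING-SYSTEM", "ZZ": "OPERATING-SYSTEM",
    "ABC Co., Ltd.": "COMPANY", "INSTITUTE B": "INSTITUTE",
}

# Hypothetical facts as (subject, relation, object) triples.
FACTS = {
    ("ART #018", "SUBJECT-IS", "X-800"),
    ("X-800", "IS-DEVELOPED-BY", "ABC Co., Ltd."),
    ("ABC Co., Ltd.", "IS-LOCATED-IN", "California state"),
    ("X-800", "RUNS", "UX"),
    ("UX", "IS-DEVELOPED-BY", "INSTITUTE B"),
    # A decoy with the same keywords but with company and institute swapped.
    ("ART #019", "SUBJECT-IS", "Y-100"),
    ("Y-100", "IS-DEVELOPED-BY", "INSTITUTE B"),
    ("INSTITUTE B", "IS-LOCATED-IN", "California state"),
    ("Y-100", "RUNS", "ZZ"),
    ("ZZ", "IS-DEVELOPED-BY", "ABC Co., Ltd."),
}


def matches(concept, condition):
    """A condition is a concrete name, or (class, [(relation, sub-condition), ...])."""
    if isinstance(condition, str):
        return concept == condition
    wanted_class, constraints = condition
    if IS_A.get(concept) != wanted_class:
        return False
    return all(
        any(s == concept and r == relation and matches(o, sub) for (s, r, o) in FACTS)
        for relation, sub in constraints
    )


# Retrieval condition (4) expressed with abstract classes and relations.
CONDITION_4 = ("ARTICLE", [("SUBJECT-IS", ("COMPUTER", [
    ("IS-DEVELOPED-BY", ("COMPANY", [("IS-LOCATED-IN", "California state")])),
    ("RUNS", ("OPERATING-SYSTEM", [("IS-DEVELOPED-BY", ("INSTITUTE", []))])),
]))])

result = sorted(c for c in IS_A if matches(c, CONDITION_4))
```

Both hypothetical articles are connected to the same set of words (computer, company, institute, operating system, California state), so a keyword search could not separate them; in the sketch only "ART #018" satisfies the relations of condition (4).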
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a view showing a system arrangement according to an embodiment of
the present invention;
FIG. 2 is a view for illustrating a concept network;
FIG. 3 is a view illustrating the concept network in a schematic diagram;
FIG. 4 is a view showing a concept relation model in an Entity-Relation
diagram;
FIGS. 5 to 8 are views illustrating concrete examples of knowledge
representation by the concept relation model;
FIG. 9 is a view illustrating an example of image data management;
FIG. 10 is a functional block diagram showing software employed according
to an embodiment of the invention;
FIG. 11 is a view for illustrating a result of character substring matching
procedure;
FIG. 12 is a view showing a menu;
FIG. 13 is a view for illustrating network traverse procedure based on
selection from the menu;
FIG. 14 is a view showing a concept tree display;
FIG. 15 is a view showing a hierarchical tree based on the part-whole
relationship;
FIG. 16 is a view for illustrating network traverse procedure based on
concept frames;
FIG. 17 is a view for illustrating method for definition and registration
of a new concept;
FIG. 18 is a view for illustrating concept network edition;
FIGS. 19 to 22 are views for illustrating dialogical retrieval formula
creating procedure;
FIG. 23 is a view for illustrating semantic retrieval;
FIG. 24 is a view for illustrating a concept matching procedure; and
FIG. 25 is a view for illustrating functions for displaying concepts in
tabular form.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
In the following, the present invention will be described in detail in
conjunction with the exemplary or preferred embodiments thereof by
referring to the accompanying drawings.
FIG. 1 shows a general arrangement of an image information filing system in
which an information storage and retrieval system according to an
exemplary embodiment of the invention is adopted. Initially, the structure
and operation of the whole system will be outlined below.
Basically, the system is composed of a data processing portion and an image
information processing portion. The data processing portion comprises a
control unit (also referred to as CPU) 100, a main memory 300, magnetic
disk units 400 and a terminal console 200 (which includes a CRT 210, a
keyboard 220 and a mouse 230). On the other hand, the image information
processing portion comprises an image scanner 700, an image printer 750,
an optical disk unit 450, an image
buffer memory 350, a high-speed image processor (also referred to as IP)
600 and a high-resolution image display (also referred to as CRT) 500. The
data processing portion and the image information processing portion are
interconnected through a bus adapter 805.
As main operations to be performed, there can be mentioned registration of
image information from documents, retrieval of desired information for
display or other type of outputting thereof, and inputting and editing of
information or data belonging to the field to be filed. In the
registration of the image knowledge of a document, the latter is scanned
through the image scanner 700, wherein the resulting image information is
loaded in the image buffer memory 350 and stored in the optical disk unit
450 after having been coded in a compressed form by the high-speed image
processor or IP 600. At that time, the image information in the buffer
memory 350 is displayed on the image display or CRT 500 to check whether
the image information has been properly digitized, while bibliographic
data of the document (such as subject or title, author, the source and
others) as well as significance thereof in the world knowledge are
inputted through the terminal console 200. The bibliographic data,
physical addresses (pack address, track address and sector address) of the
image information in concern on the optical disk unit 450 and properties
of the image (size, scan density, type of coding as adopted and the like)
are stored in the magnetic disk unit or file unit 420. On the other hand,
information about the significance of the document in the world knowledge
and the like is stored in the file unit 430.
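The kind of record held in the file unit 420 may be pictured, by way of a hedged sketch only, as the following Python data structure. The field names, the field types, the units, and the sample values are all assumptions; the description above states only which categories of data are stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskAddress:
    """Physical address of an image on the optical disk unit."""
    pack: int
    track: int
    sector: int


@dataclass
class ImageRecord:
    """One entry of the kind kept in the file unit 420 (layout assumed)."""
    title: str
    author: str
    source: str
    address: DiskAddress
    size: tuple          # (width, height); the unit is an assumption
    scan_density: int    # e.g. lines per millimetre; the unit is an assumption
    coding: str          # type of compression coding adopted


file_unit_420 = {}  # document identifier -> ImageRecord


def register(doc_id, record):
    """Store the bibliographic data, address and image properties of a document."""
    file_unit_420[doc_id] = record


def locate(doc_id):
    """Return the physical address needed to read the image from the optical disk."""
    return file_unit_420[doc_id].address


# Entirely hypothetical sample values.
register("ART #018", ImageRecord(
    title="(title)", author="(author)", source="(source)",
    address=DiskAddress(pack=1, track=120, sector=7),
    size=(1728, 2200), scan_density=8, coding="compressed",
))
```

The function locate corresponds to the look-up that precedes reading of the optical disk in the retrieval and display operation.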
In the retrieval and display operation, the desired document is identified
with the aid of the terminal console 200 through dialogical interacting
process described hereinafter to be thereby displayed on the image display
CRT 500. When a hard copy is desired, this can be outputted from the
printer 750. Information about the location of the identified document
(such as the physical address of the optical disk unit) is read out from
the file unit 420 to be subsequently sent to the optical disk control unit
450 as the control command for reading the optical disk by way of the bus
adapter 805. The image information or data thus read out is once stored in
the buffer memory 350 and is sequentially decoded through the IP 600 to
be displayed.
The mouse 230 is capable of designating the display position or location on
both the CRTs 210 and 500. Accordingly, the display position of the image
on the CRT 500 is designated by the mouse 230. By taking advantage of this
function, the document images on a plurality of pages can also be
displayed at given locations or positions on the CRT in overlapping
relation. Furthermore, the document image corresponding to one page can be
displayed in a reduced size through the IP 600, thereby allowing a
number of pages to be simultaneously displayed on a single CRT screen.
Management of images to be displayed on the CRT is performed by the
control unit or CPU 100.
Inputs for editing the world knowledge are performed on the terminal 200 by
displaying the document on the CRT 500, as required. The phrase
"world knowledge" is intended to mean a set of concepts concerning the
world or field described in the document, which document is to be
registered or has already been registered, and the facts described in
terms of relationships among the concepts. Further, the term "world
knowledge" encompasses these concepts, as well as the interconceptual
relationships, in a natural language. Needless to say, the document itself
is included as one of the concepts by the term "world". This knowledge is
stored in the file unit 430.
The three main functions described above can be arbitrarily called in a
modeless manner whenever they are required. By way of example, information
as required can be displayed on the CRT 500 by resorting to the retrieval
function in the course of performing the additional editing of the world
knowledge. It is also possible to additionally file the knowledge of the
contents of a document in the course of performing the registration of the
same document.
Next, discussion will be directed to the representation format of the world
knowledge data. The representation of knowledge is made in terms of two
varieties of elements, i.e. the concepts and the relation(s) between or
among the concepts. FIG. 2 is a schematic diagram illustrating
conceptually these elements in terms of a kind of a semantic network. In
the figure, each node represented by an ellipse represents a concept,
wherein the word written within the ellipse is a typical word representing
that concept. This word is referred to as the name of the concept. Links
interconnecting the ellipses (i.e. solid and broken lines with respective
arrows) represent the relationships among the concepts. For example, the
fact that a "supercomputer 1012" is "one variety of" a "computer 1011" is
represented by a link labelled "IS-A". It should be mentioned that
"UNIVERSAL 1010" is a specific concept defined to subsume all the other
concepts. In other words, all the concepts constitute a concept tree
having a root constituted by the concept "UNIVERSAL", wherein the concept
tree represents a taxonomic hierarchy. The link "IS-A" is one variety of
the relationships. However, this link also serves as a route for
inheriting the property of a concept to the one ranked lower.
Consequently, this link or relationship is considered discriminatively
from the other relationships. To this end, the links "IS-A" are
represented by the arrowed solid lines, while other links or relationships
are represented by broken lines.
By way of example, in considering a generic property that "computer runs
software", it will be noted that this property can also be represented by
the expression "software runs on computer". This kind of relationship will
herein be referred to as the generic relation. The representing format of
the generic relation in the case of the example mentioned above is
(COMPUTER RUNS SOFTWARE)
(SOFTWARE RUNS-ON COMPUTER) (5)
These generic relations can be taken over or inherited by the lower-rank
concepts in such a manner that "supercomputer runs software" and "X-800
computer runs software" or "operating system runs on computer" and "UX
runs on computer", where each of the foregoing is referred to as a generic
relation. These relationships can be derived from the generic relation (5)
and are not directly described in the knowledge base.
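The derivation of inherited generic relations may be illustrated by the following Python sketch. Only the two relations of (5) are stored; the concept tree fragment, the assumption that each concept has a single superclass, and the function names ancestors and holds_generically are introduced for this illustration.

```python
# Fragment of a concept tree (child -> parent), assumed for illustration.
IS_A = {
    "COMPUTER": "UNIVERSAL",
    "SOFTWARE": "UNIVERSAL",
    "SUPERCOMPUTER": "COMPUTER",
    "X-800": "SUPERCOMPUTER",
    "OPERATING-SYSTEM": "SOFTWARE",
    "UX": "OPERATING-SYSTEM",
}

# Only the generic relations (5) are stored in the knowledge base.
GENERIC = [
    ("COMPUTER", "RUNS", "SOFTWARE"),
    ("SOFTWARE", "RUNS-ON", "COMPUTER"),
]


def ancestors(concept):
    """Return the concept itself followed by its IS-A ancestors up to the root."""
    chain = [concept]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def holds_generically(subject, relation, obj):
    """True if (subject relation obj) is derivable from a stored generic relation."""
    return any(
        r == relation and s in ancestors(subject) and o in ancestors(obj)
        for (s, r, o) in GENERIC
    )
```

In this sketch, "supercomputer runs software" and "UX runs on computer" are both derived by climbing the IS-A links to the concepts named in (5), and nothing beyond (5) has to be stored.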
In FIG. 2, the link 1005 interconnecting the concepts "X-800" and "UX"
differs from the aforementioned generic relationship. This link 1005
represents the individual relation defined between the two concepts linked
together. This sort of relation will be referred to as the instance
relation or simply as relation. It should however be noted that the
relation 1005 is an instance relation of the generic relationship 1004.
In this way, the schematic diagram of FIG. 2 expresses the fact that the
subject matter of an article "ART #018" denoted by a numeral 1018 is the
supercomputer X-800 and that an operating system UX runs on the
supercomputer X-800. Further, it will be seen that all the concepts are
interconnected by longitudinal links labelled "IS-A" on one hand and by
transverse links referred to as the generic relations and the instance
relations on the other hand, to thereby constitute the conceptual network.
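A network of this kind may be represented, as one possible sketch, by a list of labelled links. The Python code below is an illustration only: the link labels SUBJECT-IS and RUNS-ON, the classification of "ART #018" under ARTICLE, and all function names are assumptions, and only a fragment of the network is shown.

```python
# Each link is (from concept, label, to concept).
LINKS = [
    # Longitudinal links (taxonomy).
    ("COMPUTER", "IS-A", "UNIVERSAL"),
    ("SUPERCOMPUTER", "IS-A", "COMPUTER"),
    ("X-800", "IS-A", "SUPERCOMPUTER"),
    ("UX", "IS-A", "OPERATING-SYSTEM"),
    ("ART #018", "IS-A", "ARTICLE"),
    # Transverse links (instance relations).
    ("ART #018", "SUBJECT-IS", "X-800"),
    ("UX", "RUNS-ON", "X-800"),
]


def longitudinal():
    """Links forming the concept tree."""
    return [link for link in LINKS if link[1] == "IS-A"]


def transverse():
    """All other links, i.e. relations between concepts."""
    return [link for link in LINKS if link[1] != "IS-A"]


def outgoing(concept, label):
    """Concepts reached from concept over links with the given label."""
    return [o for (s, l, o) in LINKS if s == concept and l == label]


def incoming(concept, label):
    """Concepts from which concept is reached over links with the given label."""
    return [s for (s, l, o) in LINKS if o == concept and l == label]


def articles_about_hosts_of(software):
    """Traverse: software -> machines it runs on -> articles about those machines."""
    return [a for host in outgoing(software, "RUNS-ON")
            for a in incoming(host, "SUBJECT-IS")]
```

Here articles_about_hosts_of("UX") follows two transverse links in succession and arrives at "ART #018", which is the kind of traversal the conceptual network is intended to support.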
In this conjunction, it is important to no