FIELD OF THE INVENTION
This invention relates in general to system administration and in
particular to automated management of a group of computers and its
associated hardware and software.
BACKGROUND ART
The following documents are hereby incorporated by reference in their
entirety:
1. Object Oriented Programming, Coad P., and Nicola J., YourDon Press
Computing Series, 1993., ISBN 0-13-032616-X.
2. The C Programming Language, Kernighan B., and Ritchie D., 1st Edition,
Prentice-Hall Inc., ISBN 0-13-110163-3
3. The Unix Programming Environment, Kernighan and Pike, Prentice-Hall
Inc., ISBN 013-937699-2
4. Unix Network Programming, Stevens, Prentice Hall Software Series, 1990,
ISBN 0-13-949876-1.
5. Internetworking with TCP/IP, Volume I, Principles, Protocols, and
Architecture, 2d Ed, Prentice Hall, 1991, ISBN 0-13-468505-9
6. Solaris 1.1, SMCC VersionA, AnswerBook for SunOS 4.1.3 and Open Windows
Version 3, Sun Microsystems Computer Corporation, Part Number 704-3183-10,
Revision A.
7. Artificial Intelligence, Rich E., McGraw-Hill, 1983, ISBN 0-07-052261-8.
8. Artificial Intelligence, Winston P., 2d Edition, 1984, ISBN
0-201-08259-4.
9. Documentation for the SunOS 4.1.3 operating system from Sun
Microsystems, Inc.
10. SunOS 4.1.3 manual pages ("man pages") from Sun Microsystems, Inc.
As used within this document and its accompanying drawings and figures, the
following terms are to be construed in this manner:
1. "CPU" shall refer to the central processing unit of a computer if that
computer has a single processing unit. If the computer has multiple
processors, the term CPU shall refer to all the processing units of such a
system.
2. "Managing a computer" shall refer to the steps necessary to manage a
computer, for example, gathering and storing information, analyzing
information to detect conditions, and acting upon detected conditions.
The problem of system administration for a computer with a complex
operating system such as the UNIX operating system is a complex one. For
example, in the UNIX workstation market, it is common for an organization
to hire one system administrator for every 20-50 workstations installed,
with each such administrator costing a company (including salary and
overhead) between $60,000 and $100,000. Indeed, some corporations have
discovered that despite freezing or cutting back hardware and software
purchases, the rising cost of retaining system administrators has
nevertheless continued to escalate the cost of maintaining an Information
Services organization at a substantial rate.
In a typical system administration environment, the work cycle consists of
the following. A problem occurs on the computer which prevents the end
user from carrying out some task. The end user detects that problem some
time after it has occurred, and calls the complaint desk. The complaint
desk dispatches a system administrator to diagnose and remedy the problem.
This has three important consequences: First, problems are detected after
they have blocked a user's work. This can be of substantial impact in
organizations which use their computers to run their businesses. Second,
problems which do not necessarily block a user's work, but which may
nonetheless have important consequences, are difficult to detect. For
example, one vendor supplies an electronic mail package which is dependent
upon a functional mail daemon process. This mail daemon process has a
tendency to die on an irregular but frequent basis. In such situations,
the end user typically does not realize that he or she cannot receive
electronic mail until after missing a meeting scheduled
by electronic mail. Third, because problems are not detected until after
they block a user's work, a problem which at an earlier stage might have
been easier to fix cannot be fixed until it has escalated into something
more serious, and more difficult to correct.
Currently, system administrators manage a group of computers by performing
most actions manually. Typically, the system administrator periodically
issues a variety of commands to gather information regarding the state of
the various computers in the group. Based upon the information gathered,
and based upon a variety of non-computer information, the system
administrator detects problems and formulates action plans to deal with
the detected problems.
Automation of a system administrator's tasks is difficult for several
reasons:
1. Data regarding the state of the computer is difficult to obtain.
Typically, the system administrator must issue a variety of commands and
consider several pieces of information from each command in order to
diagnose a problem. If the system administrator is responsible for several
machines, these commands must be repeated on each machine.
2. When the system administrator detects a problem, the appropriate action
plan may vary depending on a variety of external factors. For example,
suppose a particular computer becomes slow and unresponsive when the
system load on that computer crosses a certain threshold. If this problem
occurs during normal business hours under ordinary circumstances, it will
probably be a problem which must be resolved in a timely manner. On the
other hand, suppose this problem occurs in the middle of the night. While
this situation might still be a problem, the resolution need not be as
timely since the organization's work will not be impacted, unless the
problem still exists by the start of the business day. Now suppose the
accounting department, at the end of each month, runs a processor
intensive task to do the end-of-month accounting, which normally forces
the load average above that threshold. If the system load crosses that
same threshold during the time when the accounting department runs its
end-of-month program, that is not a problem. Building a tool to handle
situations like these with current tools would require writing a large
series of interrelated, complex boolean expressions. Unfortunately,
writing and testing such a series of complex boolean expressions is
difficult.
3. Current system administration tools view the universe of computer
problems as a static universe. Computer problems, however, evolve over
time as hardware and software are added, removed, and replaced in a
computer.
4. Furthermore, an automated tool should also flexibly alter its behavior
based on the nature of the commands a system administrator issues to it in
guiding it to resolve problems. Thus, if the system administrator
routinely ignores a particular problem, the automated tool should warn the
system administrator less frequently if the routinely ignored problem
recurs.
What is needed is a tool which will automatically gather the necessary
computer information to manage a group of computers, detect problems based
upon the gathered information, inform the system administrator of detected
problems, and automatically perform corrective actions to resolve detected
problems.
SUMMARY OF THE INVENTION
The shortcomings of the prior art are overcome and additional advantages
are provided in accordance with the principles of the present invention
through the provision of SYSTEMWatch AI-L, which automatically manages at
least one computer by automatically gathering computer information,
storing the gathered information, analyzing the stored information to
identify specific computer conditions, and performing automatic actions
based on the identified computer conditions.
BRIEF DESCRIPTION OF DRAWINGS
The subject matter which is regarded as the invention is particularly
pointed out and distinctly claimed in the claims at the conclusion of the
specification. The foregoing and other objects, features, and advantages
of the invention will be apparent from the following detailed description
taken in conjunction with the accompanying drawings in which:
FIG. 1 illustrates an embodiment of the present invention which comprises
two groups of computers, a group of managed computers and a group of
monitoring computers;
FIG. 2 illustrates one example of the structure of a managed computer,
comprising a processing unit, memory, disk, network interface,
peripherals, and a SYSTEMWatch AI-L client;
FIG. 3 illustrates one embodiment of the structure of a monitoring &
command computer, comprising a processing unit, disk, network interface,
peripherals, and a SYSTEMWatch AI-L console;
FIG. 4 illustrates one embodiment of the structure of a computer which is
both a managed computer and a monitoring computer, comprising a processing
unit, disk, network interface, peripherals, a SYSTEMWatch AI-L console,
and a SYSTEMWatch AI-L client;
FIG. 5 illustrates one embodiment of the SYSTEMWatch AI-L client and the
SYSTEMWatch AI-L console, comprising a core layer plus an application
layer;
FIG. 6 illustrates one embodiment of the logical structure of the core
layer in accordance with the principles of the present invention;
FIG. 7 illustrates one example of an embodiment of data within the database
of the core layer in accordance with the principles of the present
invention;
FIGS. 8a-8b illustrate one embodiment of the operation of the expert
system found in the core layer of SYSTEMWatch AI-L;
FIG. 9 illustrates one embodiment of the SYSTEMWatch AI-L client's "client
loop";
FIG. 10 illustrates one embodiment of the SYSTEMWatch AI-L console's
"console loop";
FIG. 11 illustrates one embodiment of the SYSTEMWatch AI-L request
facility; and
FIG. 12 illustrates one embodiment of the SYSTEMWatch AI-L report facility.
DESCRIPTION OF THE PREFERRED EMBODIMENT
One preferred embodiment of the technique of the present invention of
managing a group of computers is targeted at groups of workstations
running the UNIX operating system. Alternative embodiments of the present
invention can consist of groups of computers running other operating
systems, such as Microsoft's Windows NT and IBM's OS/2. As shown in FIG.
1, the invention comprises, for instance, two groups of computers:
a. A group of managed computers, 1, which includes computers, 2-5,
comprising, for example, (see FIG. 2) a CPU, 9, memory, 10, disks, 14,
communications interface, 16, other peripherals, 15, and a SYSTEMWatch
AI-L client, 13. The size of the managed group of computers can range from
1 to several thousand. Data which is gathered from a managed computer is
stored on the managed computer. From time to time, a managed computer may
send data to a monitoring computer (see below).
b. A group of monitoring computers, 6, which includes computers comprising,
for example, (see FIG. 3) a CPU, 17, memory, 18, disks, 22, communications
interface, 24, other peripherals, 23, and a SYSTEMWatch AI-L console, 21.
The size of the monitoring group of computers can range from 0 to several
hundred. Although data gathered from a managed computer is stored on the
managed computer, from time to time a managed computer may send data to a
monitoring computer. A monitoring computer can also explicitly request
data from a managed computer. Data which is received by the monitoring
computer from a managed computer is stored on the monitoring computer.
Furthermore, since a monitoring computer can receive data from several
managed computers, a monitoring computer may perform post-processing on
data received from several managed computers, and/or perform additional
data gathering itself, in which case that data is stored on the monitoring
computer.
In other embodiments, the two groups of computers may be the same group
(all managed computers are also monitoring computers), two distinct groups
(no managed computers are monitoring computers), or overlapping groups
(some managed computers are monitoring computers). The computers which
form the groups
of computers may be heterogeneous or homogeneous. The only requirement is
that each managed computer have the capability to communicate with at
least one monitoring computer. One preferred embodiment of this invention
is to have all the computers on a computer network, but any other means of
communication, e.g., over a modem using a telecommunications network, is
adequate. Managed and monitoring computers are distinguished by
the SYSTEMWatch AI-L client and the SYSTEMWatch AI-L console, which are
described below:
a. As shown in FIG. 2, a computer is a managed computer if the computer is
running the SYSTEMWatch AI-L client, which provides a means for the
computer to automatically detect and respond to problems. Additionally,
the SYSTEMWatch AI-L client also accepts and responds to commands issued
by a SYSTEMWatch AI-L console described below.
b. As shown in FIG. 3, a computer is a monitoring computer if the computer
is running the SYSTEMWatch AI-L console, which provides a means for the
computer to receive and display notifications of detected problems, and to
display the corrective actions taken. Additionally, the SYSTEMWatch AI-L
console is also able to issue commands to any group of managed computers.
c. As shown in FIG. 4, a computer is both a managed computer and a
monitoring computer if it contains both SYSTEMWatch AI-L client, 13, and
SYSTEMWatch AI-L console, 21.
An Overview of the SYSTEMWatch AI-L Client
The task of the SYSTEMWatch AI-L client is to manage a computer and to
provide notification of management actions to the SYSTEMWatch AI-L
console. Before explaining how the SYSTEMWatch AI-L client operates,
however, it is necessary to consider how the SYSTEMWatch AI-L client is
organized. As previously mentioned, the SYSTEMWatch AI-L client is
bifurcated into a core layer, 33, which provides the SYSTEMWatch AI-L
client with the underlying mechanism for detecting and responding to
problems, and an application layer, 34, which configures the SYSTEMWatch
AI-L client to operate in a useful manner. The SYSTEMWatch AI-L client was
designed this way because the nature of a particular computer's problem is
not static. For example, problems may evolve as changes are made to the
hardware and software of the computer, and if the computer is a multi-user
computer, as users are added and removed from the system. As computer
problems change, only the SYSTEMWatch AI-L client's application layer need
be modified. As shown in FIG. 6, the core layer is composed of four
elements: a database, 41, an expert system, 40, a language interpreter,
39, and a communications mechanism, 42. One example of a preferred
embodiment of the application layer, 34, is a series of programs written
in a language which can be interpreted by the language interpreter of the
core layer.
Core Layer Description--Database
The first element of the core layer is SYSTEMWatch AI-L database, 41. The
database is used for storing gathered data, intermediate results, and
other information. Referring to FIG. 7, in the context of the database,
SYSTEMWatch AI-L uses two concepts: ENTITYs, 43, 53, and PROPERTYs, 44,
47, 49, 54, 56. These two features are now described in greater detail:
1. PROPERTY
Conceptually, PROPERTYs are similar to field descriptions. In one
embodiment, a PROPERTY has the following features:
TABLE 1
__________________________________________________________________________
FEATURE      DESCRIPTION
__________________________________________________________________________
NAME         A PROPERTY must have a name.
TYPE         A PROPERTY must have a type, which corresponds to the type
             of the data to be stored in the field.
FORMAT       A PROPERTY may optionally have a string which describes how
             the data in the field should be formatted. The format string
             is similar to the formatting control of the C language's
             printf().
HEADER       A PROPERTY may optionally contain a string which will be
             displayed as the column header when a report featuring
             records containing the PROPERTY is displayed.
DISPLAYUNIT  A string used by the reporting facility which is appended to
             the data in the field during a report. Thus, if the PROPERTY
             is a description of memory utilization in kilobytes, an
             appropriate DISPLAYUNIT might be "kb".
DISPLAYTYPE  Some display formats are commonly used throughout
             SYSTEMWatch AI-L. DISPLAYTYPEs are keywords which correspond
             to a particular FORMAT. Examples of DISPLAYTYPEs include
             STRING20, for a string limited to 20 characters in width;
             DATESMALL, for displaying a date in mm/dd format; and
             PERCENT, for automatically displaying numbers between 0.0
             and 1.0 as percentages (e.g.: 0.52 is displayed as 52%).
SHORTDESC    A PROPERTY may optionally contain an abbreviated description
             of the PROPERTY.
LONGDESC     A PROPERTY may optionally contain a long description of the
             PROPERTY.
__________________________________________________________________________
2. ENTITY
Conceptually, ENTITYs are similar to database tables. In SYSTEMWatch AI-L,
ENTITYs are used to group related PROPERTYs.
FIG. 7 illustrates the concept that each piece of data in the database is
associated with a given PROPERTY and a given ENTITY. In this document, it
will be necessary to refer to certain combinations of ENTITYs and
PROPERTYs. The construction <entity name>_<property name> (e.g.:
IGNORE_IGNORETIME) will refer to a database entry with an entity
equal to <entity name> and a property equal to <property name>.
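By way of illustration only, the PROPERTY features of Table 1 and the entity/property naming construction can be sketched in Python. The class name Property, the field name value_type, the DISPLAY_FORMATS mapping, the particular format strings, and the helper qualified_name are assumptions made for this sketch; the specification does not give them.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical printf-style formats for two of the DISPLAYTYPE keywords named
# in Table 1; the actual format strings are not given in the specification.
DISPLAY_FORMATS = {
    "STRING20": "%.20s",   # string limited to 20 characters in width
    "PERCENT": "%.0f%%",   # fraction between 0.0 and 1.0 shown as a percentage
}


@dataclass
class Property:
    name: str                          # NAME (required)
    value_type: type                   # TYPE (required)
    format: Optional[str] = None       # FORMAT, printf-like (optional)
    header: Optional[str] = None       # HEADER, report column header (optional)
    displayunit: str = ""              # DISPLAYUNIT, appended in reports
    displaytype: Optional[str] = None  # DISPLAYTYPE keyword (optional)
    shortdesc: Optional[str] = None    # SHORTDESC (optional)
    longdesc: Optional[str] = None     # LONGDESC (optional)

    def render(self, value) -> str:
        """Format a value for a report, then append the DISPLAYUNIT."""
        fmt = self.format or DISPLAY_FORMATS.get(self.displaytype, "%s")
        if self.displaytype == "PERCENT":
            value = value * 100.0      # 0.52 is displayed as 52%
        return (fmt % value) + self.displayunit


def qualified_name(entity: str, prop: str) -> str:
    """Join an entity name and a property name with an underscore."""
    return f"{entity}_{prop}"
```

In this sketch an explicit FORMAT takes precedence over a DISPLAYTYPE keyword; the specification does not state which should win, so that ordering is a design choice of the example.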
In addition to ENTITYs and PROPERTYs, the database, 41, in SYSTEMWatch AI-L
also has these additional features:
1. Host Information
Each piece of data in database, 41, automatically has host information
associated with it. Thus, as data is stored in the database, the database
automatically records the host from which the data originated.
This is because in SYSTEMWatch AI-L, data is "owned" by the host from
which the data originated. Other hosts may request a copy of the data
since SYSTEMWatch AI-L has communications capabilities. Some data may be
stored in a central location (e.g.: a SYSTEMWatch AI-L console) if it is
relevant to multiple computers. Because each piece of data has host
information associated with it, a SYSTEMWatch AI-L console can consolidate
data from multiple hosts.
2. Time Information
Each piece of data in database, 41, has a time field associated with it.
The time field by default holds the last time the data was updated, but
SYSTEMWatch AI-L provides a mechanism for changing the time field so it is
possible to store some other time in the field.
3. Name
Each piece of data in database, 41, has a key field which is called the
name field. A name field must be unique for a given ENTITY, PROPERTY, and
host (the name of a computer). Thus, within an ENTITY and PROPERTY used
for tracking computer processes, the name field might be the process id,
since process ids are unique on each computer. Specifying the ENTITY
name, PROPERTY name, host name, and name field therefore forms a unique
key with which to locate the data.
4. Value
Of course, a database stores data. In SYSTEMWatch AI-L, the term value
refers to the data stored in the database.
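The host, time, name, and value features described above can be illustrated with a minimal in-memory sketch in Python. The class names Database and Datum, the method names store and lookup, and the parameter names are hypothetical and are chosen only for this example.

```python
import time
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple

Key = Tuple[str, str, str, str]  # (ENTITY, PROPERTY, host, name)


@dataclass
class Datum:
    value: Any    # the stored data
    time: float   # seconds since the beginning of UNIX time


class Database:
    def __init__(self, local_host: str) -> None:
        self.local_host = local_host
        self._data: Dict[Key, Datum] = {}

    def store(self, entity: str, prop: str, name: str, value: Any,
              host: Optional[str] = None,
              when: Optional[float] = None) -> None:
        # Host information is attached automatically: data is "owned" by
        # the host from which it originated (the local host by default).
        host = host or self.local_host
        # The time field defaults to the time of this update, but the
        # caller may store some other time instead.
        when = time.time() if when is None else when
        self._data[(entity, prop, host, name)] = Datum(value, when)

    def lookup(self, entity: str, prop: str, host: str,
               name: str) -> Optional[Datum]:
        # ENTITY, PROPERTY, host, and name together form the unique key.
        return self._data.get((entity, prop, host, name))
```

Because the host is part of the key, the same process id stored from two different hosts yields two distinct entries, which is what allows a console to consolidate data from multiple hosts.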
In one example, database, 41, is currently implemented as a relational
database: One table is used for describing ENTITYs. This table is used to
associate ENTITYs with PROPERTYs. Another table is used for describing
PROPERTYs. Finally, another table holds the information, which can be
located by providing an ENTITY name, PROPERTY name, and the name field of
the data. This table also contains the associated host and time
information.
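A three-table relational layout of the kind just described might look like the following sketch, which uses Python's built-in sqlite3 module purely for illustration. The table names, column names, and column types are assumptions; the specification does not name the tables or the database product used.

```python
import sqlite3

# Hypothetical schema: one table associating ENTITYs with PROPERTYs, one
# table describing PROPERTYs, and one table holding the data together with
# its host and time information.
SCHEMA = """
CREATE TABLE entity (
    entity_name   TEXT NOT NULL,
    property_name TEXT NOT NULL,
    PRIMARY KEY (entity_name, property_name)
);
CREATE TABLE property (
    property_name TEXT PRIMARY KEY,
    type          TEXT NOT NULL
);
CREATE TABLE data (
    entity_name   TEXT NOT NULL,
    property_name TEXT NOT NULL,
    host          TEXT NOT NULL,
    name          TEXT NOT NULL,
    time          INTEGER NOT NULL,   -- seconds since the UNIX epoch
    value         TEXT,
    PRIMARY KEY (entity_name, property_name, host, name)
);
"""


def open_database() -> sqlite3.Connection:
    """Create an empty in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```

A datum is then located with a query such as SELECT host, time, value FROM data WHERE entity_name = ? AND property_name = ? AND name = ?, optionally restricted further by host.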
In another embodiment, database, 41, can also be implemented with a
database which is object oriented, i.e., a database which supports the
ability to inherit data and methods from super and sub classes.
An additional requirement of database, 41, used in the core layer is that
the database must support certain query operations and certain set
operations. Specifically, the query operations supported by the database
include:
1. regular expression matching in queries.
2. creation time or update time query, i.e., searching for a data item
based upon the time the data was stored in the database or based on the
time the data was last updated in the database.
3. host of origin in queries, i.e., searching for a data item based on the
host which created the data.
4. time comparison query, i.e., searching for data based upon a time
comparison. Note: SYSTEMWatch AI-L stores its time in a manner similar to
the UNIX operating system. That is to say, all time is converted to
seconds elapsed since the beginning of UNIX time. The advantages of using
this method are that time comparisons are easily made, and that an
interval can be added to a time to obtain a future time.
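The query operations listed above can be sketched over a simple list of records in Python. The Record type, the query function, and its parameter names are hypothetical. For brevity the sketch keeps a single time field per record, so creation-time and update-time queries are not distinguished, and the regular expression is applied only to the name field.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class Record(NamedTuple):
    entity: str
    prop: str
    host: str
    name: str
    time: int       # seconds since the beginning of UNIX time
    value: object


def query(records: Iterable[Record],
          name_regex: Optional[str] = None,
          host: Optional[str] = None,
          not_before: Optional[int] = None,
          not_after: Optional[int] = None) -> List[Record]:
    """Return the records that satisfy every criterion that was supplied."""
    pattern = re.compile(name_regex) if name_regex is not None else None
    result = []
    for r in records:
        if pattern is not None and not pattern.search(r.name):
            continue    # regular expression matching
        if host is not None and r.host != host:
            continue    # host of origin
        if not_before is not None and r.time < not_before:
            continue    # time comparison: too old
        if not_after is not None and r.time > not_after:
            continue    # time comparison: too new
        result.append(r)
    return result
```

Because times are plain counts of seconds, a cutoff such as "500 seconds after time 1000" is simply 1000 + 500, and can be passed directly as not_before or not_after.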
The set operations which database, 41, supports include:
1. set intersection (ANDs)--given 2 or more sets of data, return the
elements present in all sets.
2. set union (ORs)--given 2 or more sets of data, return the elements
present in any of the sets.
3. set exclusion (NOTs)--given a first set and a second set, return
elements in the first set which are not elements of the second set.
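The three set operations correspond directly to operations on Python's built-in set type, as the following sketch shows. The function names intersect, union, and exclude are hypothetical, and each of the first two functions expects at least two sets, as in the description above.

```python
from typing import Set, TypeVar

T = TypeVar("T")


def intersect(*sets: Set[T]) -> Set[T]:
    """AND: elements present in all of the given sets."""
    return set.intersection(*sets)


def union(*sets: Set[T]) -> Set[T]:
    """OR: elements present in any of the given sets."""
    return set.union(*sets)


def exclude(first: Set[T], second: Set[T]) -> Set[T]:
    """NOT: elements of the first set which are not in the second set."""
    return first - second
```

For example, with a = {"p1", "p2", "p3"} and b = {"p2", "p3", "p4"}, intersect(a, b) gives {"p2", "p3"} and exclude(a, b) gives {"p1"}.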
Core Layer Description--The Expert System
The second element of the core layer is an expert system, 40, which is used
for problem detection and action initiation. The expert system, 40, is a
forward-chaining, rule-based expert system using a rule specificity
algorithm. When SYSTEMWatch AI-L client, 13, is started, the expert system
contains no rules. Rules are declared and incorporated into the core
layer. Both IF-THEN rules and IF-THEN-ELSE rules are supported.
The rules used in SYSTEMWatch AI-L permit assignments and function calls
within the condition of the rule. Additionally, SYSTEMWatch AI-L expert
system, 40, also has the following features:
a. Rules can declare variables. All variables declared within a rule are
static variables.
b. Rules can have an initialization section. The initialization section
contains actions which must be performed only once, and before the rule is
ever tested. It can, for example, contain a state declaration and an
interval declaration (states and intervals are described below). It may
contain variable declarations for variables used by the rules, and it may
contain code to do a variety of actions.
c. Rules can have, for instance, an INTERVAL and a LASTCHECK time. In
accordance with the principles of the present invention, in order for a
rule to be eligible for testing by the expert system, at the time of
testing the clock time must be equal to or greater than the LASTCHECK time
plus the INTERVAL time. The LASTCHECK time for each rule is set to the
clock time whenever a rule is actually tested. This way, the INTERVAL
specifies the minimum amount of time which must elapse since the last time
a rule was checked before the rule becomes eligible for