Description
FIELD OF THE INVENTION
This invention relates to systems for management of computer networks and,
more particularly, to methods for isolation of faults in computer
networks.
BACKGROUND OF THE INVENTION
Computer networks are widely used to provide increased computing power,
sharing of resources and communication between users. Computer systems and
computer system components are interconnected to form a network. Networks
may include a number of computer devices within a room, building or site
that are interconnected by a high speed local data link such as a local area
network (LAN), token ring, Ethernet, or the like. Local networks in
different locations may be interconnected by techniques such as packet
switching, microwave links and satellite links to form a world-wide
network. A network may include several hundred or more interconnected
devices.
In computer networks, a number of issues arise, including traffic overload
on parts of the network, optimum placement of network resources, security,
isolation of network faults, and the like. These issues become more
complex and difficult as networks become larger and more complex. For
example, if a network device is not sending messages, it may be difficult
to determine whether the fault is in the network device itself, the data
communication link or an intermediate network device between the sending
and receiving network devices.
Network management systems have been utilized in the past in attempts to
address such issues. Prior art network management systems typically
operated by remote access to and monitoring of information from network
devices. The network management system collected large volumes of
information which required evaluation by a network administrator. Prior
art network management systems place a tremendous burden on the network
administrator. He must be a networking expert in order to understand the
implications of a change in a network device parameter. The administrator
must also understand the topology of each section of the network in order
to understand what may have caused the change. In addition, the
administrator must sift through reams of information and false alarms in
order to determine the cause of a problem.
It is therefore desirable to provide a network management system which can
systematize the knowledge of the networking expert such that common
problems can be detected, isolated and repaired, either automatically or
with the involvement of less skilled personnel. Such a system must have
certain characteristics in order to achieve this goal. The system must
have a complete and precise representation of the network and the
networking technologies involved. It is insufficient to extend prior art
network management systems to include connections between devices. A
network is much more than the devices and the wires which connect them.
The network involves the network devices, the network protocols and the
software running on the devices. Without consideration of these aspects of
the network, a model is incomplete. A system must be flexible and
extendable. It must allow not only for the modeling of new devices, but
must allow for the modeling of new technologies, media applications and
protocols. The system must provide a facility for efficiently encapsulating
the expert's knowledge into the system.
Faults in computer networks are frequently difficult to isolate because the
failure of one network device may cause contact to be lost with one or
more other network devices that are fully operational. Prior art network
management systems typically provided a list of possible sources of a
fault. The network administrator was required to determine the source of
the fault based on his experience and his knowledge of the network. It is
desirable to provide a method of automatically isolating the source of a
network fault so that the job of the network administrator is simplified,
and less skilled persons can respond to network failures.
It is a general object of the present invention to provide improved methods
for isolation of faults in a network.
It is another object of the present invention to provide network management
systems that are capable of isolating faults in complex networks.
It is a further object of the present invention to provide methods for
fault isolation in a computer network wherein the fault status of a
network device is suppressed when all adjacent network devices cannot be
contacted.
It is yet another object of the present invention to provide methods for
fault isolation in a network management system using model-based
intelligence.
SUMMARY OF THE INVENTION
According to the present invention, these and other objects and advantages
are achieved in a method for isolating a network fault using a network
management system. The method comprises the steps of setting a fault
status for a first network device when contact between the network
management system and the first network device is lost, determining
whether the network management system can contact all network devices
adjacent to the first network device, suppressing the fault status of the
first network device when the network management system is unable to
contact all network devices adjacent to the first network device, and
maintaining the fault status of the first network device when the network
management system is able to contact at least one network device adjacent
to the first network device.
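By way of illustration only, the steps recited above might be sketched in C++
as follows. The names DeviceModel, FaultStatus and isolateFault are
hypothetical and do not appear in the specification. The sketch assumes that
the fault status is suppressed only when none of the adjacent devices can be
contacted, and that a device with no adjacent devices keeps its fault status.

```cpp
#include <cassert>
#include <vector>

// All names here are hypothetical; the specification gives no source code.
enum class FaultStatus { None, Fault, Suppressed };

struct DeviceModel {
    bool contactLost = false;                  // true when polling gets no response
    FaultStatus status = FaultStatus::None;
    std::vector<const DeviceModel*> adjacent;  // models of directly connected devices
};

// Applies the suppression rule to the model that lost contact.
inline void isolateFault(DeviceModel& first) {
    if (!first.contactLost) {
        first.status = FaultStatus::None;
        return;
    }
    first.status = FaultStatus::Fault;  // set the fault status on loss of contact

    bool anyAdjacentReachable = false;
    for (const DeviceModel* neighbor : first.adjacent) {
        assert(neighbor != nullptr);
        if (!neighbor->contactLost) {
            anyAdjacentReachable = true;
            break;
        }
    }
    // Assumption: a model with no adjacent models keeps its fault status.
    if (!first.adjacent.empty() && !anyAdjacentReachable) {
        first.status = FaultStatus::Suppressed;
    }
}
```

In this sketch a device whose neighbors are all unreachable ends in the
Suppressed state, while a device with at least one reachable neighbor remains
in the Fault state and is reported as the likely source of the fault.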
In a preferred embodiment, the network management system includes models of
network entities and model relations which define relations between
network entities, and the fault status of each network entity is contained
in the corresponding model. The fault status of each network entity is
obtained by each model regularly polling the corresponding network device.
The step of determining the fault status of all network devices adjacent
to the network device preferably includes a model of the first network
device monitoring the fault status in models of all network devices
adjacent to the first network device. When the model of the first network
device loses contact with the first network device, the fault status of
the first network device is set. The model then determines whether
adjacent models have lost contact with the adjacent network devices.
The method of the present invention is typically used for fault suppression
in a topological representation of the network. The fault isolation
technique can also be applied in a hierarchical representation of the
network. When contact with all network devices at a given level of the
hierarchy is lost, the fault status in the model of the next higher level
of the hierarchy is set.
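The hierarchical case might be sketched as follows, purely as an
illustration. The names DeviceModel, LevelModel and updateLevelFault are
hypothetical, and the treatment of an empty level (no fault is set) is an
assumption not stated in the specification.

```cpp
#include <cassert>
#include <vector>

// Hypothetical names; a sketch only.
struct DeviceModel {
    bool contactLost = false;
};

// Model of the next higher level of the hierarchy, e.g. a room or a building.
struct LevelModel {
    std::vector<const DeviceModel*> members;  // devices at the level below
    bool faultStatus = false;
};

// Sets the fault status of the higher level only when contact has been
// lost with every device at the level below it.
inline void updateLevelFault(LevelModel& level) {
    bool allLost = !level.members.empty();  // assumption: empty level has no fault
    for (const DeviceModel* d : level.members) {
        assert(d != nullptr);
        if (!d->contactLost) {
            allLost = false;
            break;
        }
    }
    level.faultStatus = allLost;
}
```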
BRIEF DESCRIPTION OF THE DRAWINGS
For a better understanding of the present invention, together with other
and further objects, advantages and capabilities thereof, reference is
made to the accompanying drawings which are incorporated herein by
reference and in which:
FIG. 1 is a block diagram of a network management system in accordance with
the invention;
FIG. 2 is a block diagram showing an example of a network;
FIG. 3 is a schematic diagram showing the structure of models and the
relations between models;
FIG. 4 is a block diagram showing a portion of the representation of the
network of FIG. 2 in the virtual network machine;
FIG. 5 is a flow chart illustrating an example of operation of the virtual
network machine;
FIG. 6 is a flow chart of a fault isolation technique in accordance with
the present invention;
FIGS. 7A-7C show examples of location display views provided by the network
management system;
FIGS. 8A and 8B show examples of topological display views provided by the
network management system;
FIG. 9 is a schematic diagram of a multifunction icon employed in the user
display views; and
FIG. 10 shows an example of an alarm log display provided by the network
management system.
DETAILED DESCRIPTION OF THE INVENTION
A block diagram of a network management system in accordance with the
present invention is shown in FIG. 1. The major components of the network
management system are a user interface 10, a virtual network machine 12,
and a device communication manager 14. The user interface 10, which may
include a video display screen, keyboard, mouse and printer, provides all
interaction with the user. The user interface controls the screen,
keyboard, mouse and printer and provides the user with different views of
the network that is being managed. The user interface receives network
information from the virtual network machine 12. The virtual network
machine 12 contains a software representation of the network being
managed, including models that represent the devices and other entities
associated with the network, and relations between the models. The virtual
network machine 12 is associated with a database manager 16 which manages
the storage and retrieval of disk-based data. Such data includes
configuration data, an event log, statistics, history and current state
information. The device communication manager 14 is connected to a network
18 and handles communication between the virtual network machine 12 and
network devices. The data received from the network devices is provided by
the device communication manager to the virtual network machine 12. The
device communication manager 14 converts generic requests from the virtual
network machine 12 to the required network management protocol for
communicating with each network device. Existing network management
protocols include Simple Network Management Protocol (SNMP), Internet
Control Message Protocol (ICMP) and many proprietary network management
protocols. Certain types of network devices are designed to communicate
with a network management system using one of these protocols.
A view personality module 20 connected to the user interface 10 contains a
collection of data modules which permit the user interface to provide
different views of the network. A device personality module 22 connected
to the virtual network machine 12 contains a collection of data modules
which permit devices and other network entities to be configured and
managed with the network management system. A protocol personality module
24 connected to the device communication manager contains a collection of
data modules which permit communication with all devices that communicate
using the network management protocols specified by the module 24. The
personality modules 20, 22 and 24 provide a system that is highly flexible
and user configurable. By altering the personality module 20, the user can
specify customized views or displays. By changing the device personality
module 22, the user can add new types of network devices to the system.
Similarly, by changing the protocol personality module 24, the network
management system can operate with new or different network management
protocols. The personality modules permit the system to be reconfigured
and customized without changing the basic control code of the system.
The overall software architecture of the present invention is shown in FIG.
1. The hardware for supporting the system of FIG. 1 is typically a
workstation such as a Sun Model 3 or 4, or a 386 PC compatible computer
running Unix. A minimum of 8 megabytes of memory is required with a
display device which supports a minimum of 640×680 pixels × 256
color resolution. The basic software includes a Unix release that supports
sockets, X-windows and Open Software Foundation Motif 1.0. The network
management system of the present invention is implemented using the C++
programming language, but could be implemented in other object-oriented
languages such as Eiffel, Smalltalk, ADA, or the like. The virtual network
machine 12 and the device communication manager 14 may be run on a
separate computer from the user interface 10 for increased operating
speed.
An example of a network is shown in FIG. 2. The network includes
workstations 30, 31, 32, 33 and disk units 34 and 35 interconnected by a
data bus 36. Workstations 30 and 31 and disk unit 34 are located in a room
38, and workstations 32 and 33 and disk unit 35 are located in a room 40.
The rooms 38 and 40 are located within a building 42. Network devices 44,
45 and 46 are interconnected by a data bus 47 and are located in a
building 48 at the same site as building 42. The network portions in
buildings 42 and 48 are interconnected by a bridge 50. A building 52
remotely located (in a different city, state or country) from buildings 42
and 48, contains network devices 53, 54, 55 and 56 interconnected by a
data bus 57. The network devices in building 52 are interconnected to the
network in building 48 by interface devices 59 and 60, which may
communicate by a packet switching system, a microwave link or a satellite
link. The network management system shown in FIG. 1 and described above is
connected to the network of FIG. 2 at any convenient point, such as data
bus 36.
In general, the network management system shown in FIG. 1 performs two
major operations during normal operation. It services user requests
entered by the user at user interface 10 and provides network information
such as alarms and events to user interface 10. In addition, the virtual
network machine 12 polls the network to obtain information for updating
the network models as described hereinafter. In some cases, the network
devices send status information to the network management system
automatically without polling. In either case, the information received
from the network is processed so that the operational status, faults and
other information pertaining to the network are presented to the user in a
systematized and organized manner.
As indicated above, the network entities that make up the network that is
being managed by the network management system are represented by software
models in the virtual network machine 12. The models represent network
devices such as printed circuit boards, printed circuit board racks,
bridges, routers, hubs, cables and the like. The models also represent
locations or topologies. Location models represent the parts of a network
geographically associated with a building, country, floor, panel, rack,
region, room, section, sector, site or the world. Topological models
represent the network devices that are topologically associated with a
local area network or subnetwork. Models can also represent components of
network devices such as individual printed circuit boards, ports and the
like. In addition, models can represent software applications such as data
relay, network monitor, terminal server and end point operations. In
general, models can represent any network entity that is of interest in
connection with managing or monitoring the network.
The virtual network machine includes a collection of models which represent
the various network entities. The models themselves are collections of C++
objects. The virtual network machine also includes model relations which
define the interrelationships between the various models. Several types of
relations can be specified. A "connects to" relation is used to specify an
interconnection between network devices. For example, the interconnection
between two workstations is specified by a "connects to" relation. A
"contains" relation is used to specify a network entity that is contained
within another network entity. Thus for example, a workstation model may
be contained in a room, building or local network model. An "executes"
relation is used to specify the relation between a software application
and the network device on which it runs. An "is part of" relation
specifies the relation between a network device and its components. For
example, a port model may be part of a board model or a card rack model.
Relations are specified as pairs of models, known as associations. The
relations can specify peer-to-peer associations and hierarchical
associations.
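As a minimal sketch of how relations stored as pairs of models might be
represented, consider the following C++ fragment. RelationKind, Association
and RelationTable are hypothetical names chosen for illustration, and models
are identified by plain strings for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names; the four kinds mirror the relations named in the text.
enum class RelationKind { ConnectsTo, Contains, Executes, IsPartOf };

// An association is one pair of models joined by a relation.
struct Association {
    RelationKind kind;
    std::string first;   // e.g. the containing model, or one end of a link
    std::string second;  // e.g. the contained model, or the other end
};

class RelationTable {
public:
    void add(RelationKind kind, std::string first, std::string second) {
        assert(!first.empty() && !second.empty());
        associations_.push_back(
            Association{kind, std::move(first), std::move(second)});
    }

    // Number of associations of the given kind.
    std::size_t count(RelationKind kind) const {
        std::size_t n = 0;
        for (const Association& a : associations_) {
            if (a.kind == kind) ++n;
        }
        return n;
    }

    // "Connects to" is peer-to-peer, so a match on either side counts.
    std::vector<std::string> connectedTo(const std::string& model) const {
        std::vector<std::string> peers;
        for (const Association& a : associations_) {
            if (a.kind != RelationKind::ConnectsTo) continue;
            if (a.first == model) peers.push_back(a.second);
            else if (a.second == model) peers.push_back(a.first);
        }
        return peers;
    }

private:
    std::vector<Association> associations_;
};
```

A lookup such as connectedTo is the kind of query a model would need in order
to find the models of adjacent network devices.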
Each model includes a number of attributes and one or more inference
handlers. The attributes are data which define the characteristics and
status of the network entity being modeled. Basic attributes include a
model name, a model type name, a model type handle, a polling interval, a
next-time-to-poll, a retry count, a contact status, an activation status,
a time-of-last-poll and statistics pertaining to the network entity which
is being modeled. Polling of network devices will be described
hereinafter. In addition, attributes that are unique to a particular type
of network device can be defined. For example, a network bridge contains a
table that defines the devices that are located on each side of the
bridge. A model of the network bridge can contain, as one of its
attributes, a copy of the table.
In a preferred embodiment of the invention, each attribute contained in a
model type includes the following:
1. An attribute name that identifies the attribute.
2. An attribute type that defines the kind of attribute. Attribute types
may include Boolean values, integers, counters, dates, text strings, and
the like.
3. Attribute flags indicate how the attribute is to be manipulated. A
memory flag indicates that the attribute is stored in memory. A database
flag indicates that the attribute is maintained in the database of the
virtual network machine. An external flag indicates that the attribute is
maintained in the device being modeled. A polled flag indicates that the
attribute's value should be periodically surveyed or polled from the device
being modeled. The flags also indicate whether the attribute is readable
or writable by the user.
4. Object identifier is the identifier used to access the attribute in the
device. It is defined by the network management protocol used to access
the device.
5. Attribute help string is a text string which contains a description of
the defined attribute. When the user asks for help regarding this
attribute, the text string appears on the user interface screen.
6. Attribute value is the value of the attribute.
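The six items listed above might be grouped into a single record, sketched
here in C++. The names Attribute, AttrType, AttrFlag and writeByUser are
hypothetical, and storing every value as text is a simplification made for
this sketch only.

```cpp
#include <cassert>
#include <string>

// Hypothetical names; a sketch of one attribute record.
enum class AttrType { Boolean, Integer, Counter, Date, Text };

// One bit per flag described in item 3.
enum AttrFlag : unsigned {
    kMemory   = 1u << 0,  // stored in memory
    kDatabase = 1u << 1,  // maintained in the database
    kExternal = 1u << 2,  // maintained in the device being modeled
    kPolled   = 1u << 3,  // value is periodically polled
    kReadable = 1u << 4,  // readable by the user
    kWritable = 1u << 5   // writable by the user
};

struct Attribute {
    std::string name;      // 1. attribute name
    AttrType type;         // 2. attribute type
    unsigned flags;        // 3. attribute flags (bitwise OR of AttrFlag)
    std::string objectId;  // 4. identifier defined by the management protocol
    std::string help;      // 5. help string shown to the user
    std::string value;     // 6. value (kept as text in this sketch)

    bool has(AttrFlag f) const { return (flags & f) != 0u; }
};

// A user write succeeds only when the writable flag is set.
inline bool writeByUser(Attribute& a, const std::string& newValue) {
    assert(!a.name.empty());  // every attribute is expected to be named
    if (!a.has(kWritable)) return false;
    a.value = newValue;
    return true;
}
```

For a device managed over SNMP, the object identifier field would hold an
SNMP object identifier such as 1.3.6.1.2.1.1.3.0 (sysUpTime.0); that example
is supplied here for illustration and is not taken from the specification.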
The models used in the virtual network machine also include one or more
inference handlers. An inference handler is a C++ object which performs a
specified computation, decision, action or inference. The inference
handlers collectively constitute the intelligence of the model. An
individual inference handler is defined by the type of processing
performed, the source or sources of the stimulus and the destination of
the result. The result is an output of an inference handler and may
include attribute changes, creation or destruction of models, alarms or
any other valid output. The operation of the inference handler is
initiated by a trigger, which is an event occurring in the virtual network
machine. Triggers include attribute changes in the same model, attribute
changes in another model, relation changes, events, model creation or
destruction, and the like. Thus, each model includes inference handlers
which perform specified functions upon the occurrence of predetermined
events which trigger the inference handlers.
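The trigger mechanism described above might be sketched as follows. Model,
Trigger, registerHandler and fire are hypothetical names, and the use of
std::function in place of dedicated handler objects is a simplification
assumed for this sketch.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names; a subset of the triggers named in the text.
enum class Trigger {
    AttributeChanged, RelationChanged, ModelCreated, ModelDestroyed, Event
};

class Model {
public:
    using Handler = std::function<void(Model&)>;

    // Registers an inference handler to run when the given trigger occurs.
    void registerHandler(Trigger trigger, Handler handler) {
        assert(static_cast<bool>(handler));  // an empty handler is a bug
        handlers_[trigger].push_back(std::move(handler));
    }

    // Runs every handler registered for the trigger, in registration order.
    void fire(Trigger trigger) {
        auto it = handlers_.find(trigger);
        if (it == handlers_.end()) return;
        for (Handler& h : it->second) h(*this);
    }

    std::map<std::string, long> attributes;  // simplified attribute store

private:
    std::map<Trigger, std::vector<Handler>> handlers_;
};
```

A handler registered for AttributeChanged could, for example, recompute a
derived attribute each time a polled attribute is updated.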
A schematic diagram of a simple model configuration is shown in FIG. 3 to
illustrate the concepts of the present invention. A device model 80
includes attributes 1 to x and inference handlers 1 to y. A device model
82 includes attributes 1 to u and inference handlers 1 to v. A connect
relation 84 indicates that models 80 and 82 are connected in the physical
network. A room model 86 includes attributes 1 to m and inference handlers
1 to n. A relation 88 indicates that model 80 is contained within room
model 86, and a relation 90 indicates that model 82 is contained within
room model 86. Each of the models and the model relations shown in FIG. 3
is implemented as a C++ object. It will be understood that a
representation of an actual network would be much more complex than the
configuration shown in FIG. 3.
As discussed above, the collection of models and model relations in the
virtual network machine form a representation of the physical network
being managed. The models represent not only the configuration of the
network, but also represent its status on a dynamic basis. The status of
the network and other information and data relating to the network is
obtained by the models in a number of different ways. A primary technique
for obtaining information from the network involves polling. At specified
intervals, a model in the virtual network machine 12 requests the device
communication manager 14 to poll the network device which corresponds to
the model. The device communication manager 14 converts the request to the
necessary protocol for communicating with the network device. The network
device returns the requested information to the device communication
manager 14, which extracts the device information and forwards it to the
virtual network machine 12 for updating one or more attributes in the
model of the network device. The polling interval is specified
individually for each model and corresponding network device, depending on
the importance of the attribute, the frequency with which it is likely to
change, and the like. The polling interval, in general, is a compromise
between a desire that the models accurately reflect the present status of
the network device and a desire to minimize network management traffic
which could adversely impact normal network operation. According to
another technique for updating the information contained in the models,
the network devices automatically transmit information to the network
management system upon the occurrence of significant events without
polling. This requires that the network devices be preprogrammed for such
operation.
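The polling bookkeeping might be sketched as follows, using the polling
interval, next-time-to-poll, time-of-last-poll and retry count attributes
named earlier. The names PollState, isDue and recordPoll are hypothetical.
The retry policy shown (contact is declared lost only after the retries are
used up) is an assumption; the specification does not state how the retry
count is applied.

```cpp
#include <cassert>

// Hypothetical names; times are in seconds on an arbitrary clock.
struct PollState {
    long intervalSec;   // polling interval
    long nextPollAt;    // next-time-to-poll
    long lastPollAt;    // time-of-last-poll
    int  retriesLeft;   // remaining retries before contact is declared lost
    int  maxRetries;    // value the retry count is reset to on success
    bool contactLost;   // contact status
};

inline bool isDue(const PollState& s, long now) {
    return now >= s.nextPollAt;
}

// Records the outcome of one poll and schedules the next one.
inline void recordPoll(PollState& s, long now, bool responded) {
    assert(s.intervalSec > 0);
    s.lastPollAt = now;
    s.nextPollAt = now + s.intervalSec;
    if (responded) {
        s.retriesLeft = s.maxRetries;
        s.contactLost = false;
    } else if (s.retriesLeft > 0) {
        --s.retriesLeft;       // assumed policy: tolerate a few missed polls
    } else {
        s.contactLost = true;  // would trigger the fault isolation handler
    }
}
```

A shorter interval keeps the model closer to the true device state at the
cost of more management traffic, which is the compromise noted above.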
It will be understood that communication between a model and its
corresponding network entity is possible only for certain types of devices
such as bridges, card racks, hubs, etc. In other cases, the network entity
being modeled is not capable of communicating its status to the network
management system. For example, models of buildings or rooms containing
network devices and models of cables cannot communicate with the
corresponding network entities. In this case, the status of the network
entity is inferred by the model from information contained in models of
other network devices. Since successful polling of a network device
connected to a cable may indicate that the cable is functioning properly,
the status of the cable can be inferred from information contained in a
model of the attached network device. Similarly, the operational status of
a room can be inferred from the operational status contained in models of
the network devices located within the room. In order for a model to make
such inferences, it is necessary for the model to obtain information from
related models. In a function called a model watch, an attribute in one
model is monitored or watched by one or more other models. A change in the
watched attribute may trigger inference handlers in the watching models.
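A model watch might be sketched as a simple observer arrangement. WatchedBool
is a hypothetical name, and a single Boolean attribute stands in for the
general attribute mechanism.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical name; one Boolean attribute that other models may watch.
class WatchedBool {
public:
    using Watcher = std::function<void(bool)>;

    // Registers a watching model's callback.
    void watch(Watcher watcher) {
        assert(static_cast<bool>(watcher));
        watchers_.push_back(std::move(watcher));
    }

    // Only an actual change in value notifies the watchers.
    void set(bool value) {
        if (value == value_) return;
        value_ = value;
        for (Watcher& w : watchers_) w(value_);
    }

    bool get() const { return value_; }

private:
    bool value_ = true;  // e.g. "device is in contact"
    std::vector<Watcher> watchers_;
};
```

A room model could, for instance, watch the contact attribute of each device
it contains and recompute its own operational status whenever one changes.
How the room combines those values is left to its inference handler.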
The virtual network machine also includes an event log, a statistics log
and an alarm log. These logs permit information contained in the models to
be organized and presented to the user and to be recorded in the database.
The event message provides specific information about events, including
alarms that have occurred in a given model. The events pass from the model
to an event log manager which records the event in the external database.
An event message is also sent to the user interface based on event
filters, as discussed below. The user can request event information from
the database. An event message includes a model handle, a model-type
handle, an event date and time, an event type and subtype, an event
severity, a model name, a model-type name, an event user name, an event
data count and event variable data. The event variable data permits
additional information to be provided about the event.
Event messages sent to the user interface can utilize a filter process that
is specified by the user. The user can specify model types and a minimum
event severity for which events will be displayed on the user screen.
Events from unspecified model types or of less than the minimum severity will
not be displayed. Many other event selection or filtering criteria can be
used. In general, any information contained in the event message can be
used for event filtering.
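The filter described above might be sketched as follows. EventMessage and
EventFilter are hypothetical names, only three of the event message fields
are shown, and severities are assumed to be non-negative integers with larger
values being more severe.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical names; a reduced event message.
struct EventMessage {
    std::string modelTypeName;
    int severity;      // assumption: larger means more severe
    std::string data;  // event variable data
};

struct EventFilter {
    std::set<std::string> modelTypes;  // model types the user selected
    int minSeverity = 0;               // minimum severity to display

    // True when the event should be displayed on the user screen.
    bool accepts(const EventMessage& e) const {
        assert(e.severity >= 0);
        return modelTypes.count(e.modelTypeName) != 0 &&
               e.severity >= minSeverity;
    }
};
```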
Statistics history messages are similar to the event messages described
above. The statistics information includes any model parameters or
functions which the user wishes to monitor. A statistics history message
passes from the model to a statistics log manager and subsequently to the
external database. The statistics message is also sent to the user
interface based on predefined filter parameters. The user can request the
statistics log manager to obtain and display statistics information from
the external database. Statistics messages are compiled whenever a device
read procedure occurs.
When an alarm event occurs in a model, a notice of the alarm event is sent
to an alarm log and to the event log. The alarm log selects the most
severe alarm for each model which is registering an alarm. The alarms are
sent to an alarm window in the user interface. The user can obtain more
information on the alarm message by pressing an appropriate button on the
window display. Alarm log messages include the following parameters: alarm
condition, alarm cause, alarm status, alarm security data, alarm clear
switch and alarm unique ID.
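Selection of the most severe alarm for each model might be sketched as
follows. Alarm and AlarmLog are hypothetical names, the integer severity
field is an assumption, and ties are resolved in favor of the alarm reported
first.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical names; a reduced alarm record.
struct Alarm {
    std::string modelName;
    int severity;        // assumption: larger means more severe
    std::string cause;
};

class AlarmLog {
public:
    // Keeps only the most severe alarm reported so far for each model.
    void report(const Alarm& alarm) {
        assert(!alarm.modelName.empty());
        auto it = mostSevere_.find(alarm.modelName);
        if (it == mostSevere_.end()) {
            mostSevere_.emplace(alarm.modelName, alarm);
        } else if (alarm.severity > it->second.severity) {
            it->second = alarm;
        }
    }

    // Returns nullptr when the model is not registering an alarm.
    const Alarm* mostSevereFor(const std::string& modelName) const {
        auto it = mostSevere_.find(modelName);
        return it == mostSevere_.end() ? nullptr : &it->second;
    }

private:
    std::map<std::string, Alarm> mostSevere_;
};
```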
An example will now be given to illustrate operation of the virtual network
machine 12. A portion of the virtual network machine 12 is shown schematically in
FIG. 4. The models shown in FIG. 4 correspond to network entities shown in
FIG. 2. A flow chart illustrating the example is shown in FIG. 5. Each
network device has a model in the virtual network machine 12. Thus, for
example, model 144 corresponds to network device 44, model 145 corresponds
to network device 45, etc. Models 144 and 145 are related by connection
relation 147 which corresponds to data bus 47. Room model 148 is related
to models 144 and 145 by a contains relation.
In operation, at a specified time model 144 initiates polling of network
device 44 in step 200 in order to obtain an update of the status of
network device 44. The model 144 sends a request to the device
communication manager 14 to poll network device 44. The device
communication manager 14 converts the request to the required protocol for
communication with network device 44 and sends the message. The requested
information may, for example, be the number of packets sent on the network
in a given time and the number of errors that occurred. When the requested
information is returned to model 144, the corresponding attributes in
model 144 are updated in step 206 and an error rate inference handler is
triggered. The error rate inference handler in step 208 calculates the
error rate for network device 44. If the error rate is within prescribed
limits (step 210), an error rate attribute is updated, and the new
information is logged into the database (step 212). If the calculated
error rate is above a predetermined limit, an error alarm inference
handler is triggered. The error alarm inference handler may shut off the
corresponding network device 44 and send an alarm to the user interface in
step 214. The alarm is also logged in the database. If the network device
44 is shut off in response to a high error rate, a condition attribute in
model 144 is updated to reflect the off condition in step 216. If no
response was received from the network device 44 when it was polled (step
218), a fault isolation inference handler is triggered in step 220. The
fault isolation inference handler operates as described below to determine
the network component which caused network device 44 to fail to respond to
the poll. When the cause of the fault is determined, a fault message is
sent to the user interface.
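The decision flow of FIG. 5 might be condensed into a single function, shown
here as a sketch. PollReply, PollOutcome and handlePoll are hypothetical
names, the error rate is taken to be errors divided by packets, and the side
effects (logging, shutting off the device, sending the alarm) are reduced to
a returned outcome.

```cpp
#include <cassert>

// Hypothetical names; outcomes correspond to branches of the flow chart.
enum class PollOutcome { Logged, AlarmRaised, FaultIsolationStarted };

struct PollReply {
    bool responded;  // false when the device did not answer the poll
    long packets;    // packets sent in the sampling period
    long errors;     // errors in the same period
};

// Returns the branch taken and writes the computed rate to errorRateOut.
inline PollOutcome handlePoll(const PollReply& reply, double maxErrorRate,
                              double& errorRateOut) {
    if (!reply.responded) {
        return PollOutcome::FaultIsolationStarted;  // steps 218 and 220
    }
    assert(reply.packets >= 0 && reply.errors >= 0);
    // Step 208; the rate is assumed to be errors per packet.
    errorRateOut = reply.packets > 0
        ? static_cast<double>(reply.errors) / static_cast<double>(reply.packets)
        : 0.0;
    if (errorRateOut > maxErrorRate) {
        return PollOutcome::AlarmRaised;            // step 214
    }
    return PollOutcome::Logged;                     // steps 210 and 212
}
```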
Polling of network device 44 is repeated at intervals specified by an
attribute contained in model 144. In addition, other network devices are
polled at intervals which may be different for each network device. The
information returned to each model is processed by the inference handlers
for that model and by inference handlers in other models that are watching
such information. In general, each model type may include a different set
of inference handlers.
As described above, an attribute change in one model can trigger an
inference handler in one or more other models and thereby produce a chain
of actions or responses to the attribute change. For example, if a fault
occurs in a network device, the condition attribute of that device is
changed in the corresponding model. The condition change may trigger a
condition change in the model of the room which contains the device.
Likewise, the condition change in the room may trigger a condition change
in the building or site model. The condition attribute in each model may
have a different level of significance. For example, failure of a device
may have a high significance in the network device model but a relatively
low significance in the site model.
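One way such a chain might be sketched is shown below. ContainerModel and
raiseCondition are hypothetical names, and the rule that significance drops
by one at each higher level is an assumption made only to illustrate that the
same fault can carry less weight further up the hierarchy.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical names; each model points to the model that contains it.
struct ContainerModel {
    ContainerModel* parent;  // next higher level, or nullptr at the top
    int condition;           // 0 is normal; larger is more significant
};

// Propagates a condition change upward through the "contains" chain.
inline void raiseCondition(ContainerModel& origin, int significance) {
    assert(significance >= 0);
    ContainerModel* current = &origin;
    while (current != nullptr && significance > 0) {
        current->condition = std::max(current->condition, significance);
        significance -= 1;  // assumed attenuation per level
        current = current->parent;
    }
}
```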
The software models and model relations that are representative of a
network as described herein are highly flexible and adaptable to new
network configurations and new management functions. New models and model
relations are easily added to the virtual network machine to accommodate
the needs of the user. The use of the C++ programming language permits new
model types to be derived from existing model types. Thus, the virtual
network machine 12 can be customized for a particular application.
A model type editor is used to modify and control the models in the virtual
network machine 12. The following functions are provided:
1. Describe () describes some aspect of the specified model type.
2. Create () creates a new model for the specified model type.
3. Destroy () removes the specified model from the configuration.
4. Read () reads the value of the specified attribute from a model.
5. Write () writes the given values to the attributes of the model.
6. Action () performs the specified action.
7. Generate event () creates an event message.
Similarly, the model relations can be edited by the user. The following
functions can be performed on model relations.
1. Describe () describes an aspect of the specified relation.
2. Read () reads a set of associations.
3. Add () adds an association.
4. Remove () removes a set of associations.
5. Count () returns the number of associations that match the selection
criteria.
6. Read rule () reads a set of relation rules.
As indicated above, each inference handler is triggered by the occurrence
of a specified event or events. The user must register the inference
handler to receive the trigger. An inference handler can be triggered upon
the creation or destruction of a model, the activation or initializing of
a model, the change of an attribute in the same model, the change of an
attribute in a watched model, the addition or removal of a relation, the
occurrence of a specified event or a user-defined action.
The virtual network machine described above including models and model
relations provides a very general approach to network management. By
customizing the virtual network machine, virtually any network management
function can be implemented. Both data (attributes) and intelligence
(inference handlers) are encapsulated into a model of a network entity.
New models can be generated by combining or modifying existing models. A
model can be identified by a variety of different dimensions or names,
depending on the attributes specified. For example, a particular network
device can be identified as a device, a type of device, or by vendor or
model number. Models are interrelated with each other by different types
of relations. The relations permit stimulus-response chaining. The model
approach provides loosely-coupled intelligent models with interaction
between models according to specified triggers. The system has data
location independence. The data for operation of the virtual network
machine may reside in the database, memory or in the physical network
which is being modeled.
An important function of a network management system is the identification
and isolation of faults. When the network management system loses contact
with a network device, the reason for the loss of contact must be
determined so that appropriate action, such as a service call, can be
taken. In a network environment, loss of contact with a network device may
be due to failure of that network device or to failure of another network
device that is involved in transmission of the message. For example, with
reference to FIG. 2, assume that contact is lost with network device 53.
The loss of contact could be due to the failure of network device 53, but
could also be due to the failure of network devices 50, 60 or 59. In prior
art network management systems, the network administrator was typically
provided with a list of possible causes of a fault and was required to
isolate the fault based on his experience and knowledge of the network.
In accordance with a feature of the present invention, the network
management system isolates network faults using a technique known as
status suppression. When contact between a model and its corresponding
network device is lost, the model sets a fault status and initiates the
fault isolation technique. According to the fault isolation technique, the
model (first model) which lost contact with its corresponding network
device (first network device) determines whether adjacent models have lost
contact with their corresponding network devices. In this context,
adjacent network devices are defined as those which are directly connected
to a specified network device. If adjacent models cannot contact the
corresponding network devices, then the first network device cannot be the
cause of the fault, and its fault status in the first model is suppressed
or overridden. By suppressing the fault status of the network devices
which are determine