|
Description  |
|
|
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention pertains to computer networks. More particularly,
this invention relates to improving the ability of a network to route
around faulty components.
2. Background
Modern computer technology is advancing at a very fast rate and has
resulted in high-performance computing components being made available in
smaller and smaller packages. These small, high-performance components are
finding expanded uses in a wide range of personal, business and academic
fields.
One use of these high-performance components is in network systems. In a
network system, multiple processing units are coupled together to perform
various programmed tasks. For example, the processing units may be
networked together as a local area network (LAN) in an office building to
allow individuals with personal computer systems in the building to
communicate with one another. Such network systems are beneficial to users
because they allow the users to communicate with each other, such as by
electronic mail or transferring data files between one another. Or, by way
of another example, a "supercomputer" may contain multiple processing
units which are coupled together via a high-performance network and which
operate together to perform various programmed tasks. These supercomputers
are beneficial to users because they provide an extremely fast, powerful
and cost-effective system to carry out users' requests.
However, one disadvantage of network systems is that the greater the number
of components in the system, the greater the chances that a component will
become faulty during system operation. A network with thousands of
components has a relatively low mean time between failure for the system
components. That is, there is a relatively high probability that one
component within the network will fail within a given period of time (for
example, one failure per week). In order to be useful to the user(s), the
network should be able to resolve these component failures. A system which
shuts itself down upon detecting a faulty component and cannot re-start
until the component is repaired or replaced reduces the availability of
the system and increases the inconvenience to the users. Thus, it would be
beneficial to provide a system which is able to automatically bypass
faulty network components.
Furthermore, many users have neither the expertise nor the desire to
resolve a component failure in the network by indicating to the network
how to route around the faulty component. Many users do not have the
technical expertise required to perform such a correction. Furthermore,
performing such a correction could be very time-consuming, and distracts
the user from his or her other responsibilities. Thus, it would be
beneficial to provide a system which resolves the failure of a component
in a manner which is transparent to the system user(s).
In addition, depending on the layout of a network, a faulty component could
cut off multiple good components from the remainder of the network.
Depending on the type of network, this could mean that some personal
computers could not communicate with others, or that certain processing
units would not be available to the system user, even though they are in
good working condition. Thus, it would be beneficial to provide a system
which reduces the number of good components which are disconnected from
the remainder of the system by a faulty component.
Additionally, network systems should effectively resolve "deadlock"
situations. A deadlock situation occurs when one or more components within
the network cannot advance in their operation due to resources within the
system which the component(s) requires being unavailable. The occurrence
of a deadlock situation is dependent on the routing technique utilized in
the system. In one routing technique, referred to as "circuit switching,"
a source node sends control information for a packet through its intended
path to a destination node in the network to reserve each link in the
path. Once the entire path is reserved, the source node transfers the data
along the reserved path to the destination node. In another routing
technique, referred to as "wormhole routing," the source node sends the
necessary control information through its intended path to the destination
node, followed immediately by the data. That is, the source node does not
wait for the entire path to be reserved prior to beginning transfer of the
data. In both of these routing techniques, the data packet maintains
reservation of portions of the path already reserved while waiting for
subsequent portions to be reserved. Thus, a deadlock situation may arise
when, for example, two or more source nodes are attempting to transfer
data to one or more destination nodes and none can advance because the
other is blocking a portion of the data path required by the other. Thus,
in order to provide continued performance of a network system, such
deadlock issues need to be resolved.
The present invention provides for these and other advantageous results.
SUMMARY OF THE INVENTION
A method and apparatus for enhancing the fault-tolerance of a network is
described herein. The present invention finds a set of computing nodes
within the network which are available for use in the network upon
detection of a faulty component. The present invention finds this set of
available computing nodes by first determining a set of computing nodes
within the network which are physically connected together. The present
invention then determines a connectivity value for each computing node
within this set. A subset of this set is then generated such that each
computing node in the subset is able to transfer data to and from each
other computing node in the subset. This subset is then utilized as the
set of available computing nodes. In one embodiment, the set of computing
nodes which are physically connected together is the largest set of
physically connected computing nodes in the system.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is illustrated by way of example and not limitation
in the figures of the accompanying drawings, in which like references
indicate similar elements and in which:
FIG. 1A is a network system according to one embodiment of the present
invention;
FIG. 1B is a network system according to an alternate embodiment of the
present invention;
FIG. 2 shows a router coupled to a computing node according to one
embodiment of the present invention;
FIG. 3 shows a packet of data according to one embodiment of the present
invention;
FIG. 4 shows a routing device according to one embodiment of the present
invention;
FIG. 5 is a flowchart showing the steps followed in the present invention
in routing packets of data along paths through a network system; and
FIG. 6 shows the steps followed by one embodiment of the present invention
in establishing a set of nodes which have a deadlock-free path between
each other.
DETAILED DESCRIPTION
In the following detailed description numerous specific details are set
forth in order to provide a thorough understanding of the present
invention. However, it will be understood by those skilled in the art that
the present invention may be practiced without these specific details. In
other instances well known methods, procedures, components, and circuits
have not been described in detail so as not to obscure the present
invention.
Some portions of the detailed descriptions which follow are presented in
terms of algorithms and symbolic representations of operations on data
bits within a computer memory. These algorithmic descriptions and
representations are the means used by those skilled in the data processing
arts to most effectively convey the substance of their work to others
skilled in the art. An algorithm is here, and generally, conceived to be a
self-consistent sequence of steps leading to a desired result. The steps
are those requiring physical manipulations of physical quantities.
Usually, though not necessarily, these quantities take the form of
electrical or magnetic signals capable of being stored, transferred,
combined, compared, and otherwise manipulated. It has proven convenient at
times, principally for reasons of common usage, to refer to these signals
as bits, values, elements, symbols, characters, terms, numbers, or the
like. It should be borne in mind, however, that all of these and similar
terms are to be associated with the appropriate physical quantities and
are merely convenient labels applied to these quantities. Unless
specifically stated otherwise as apparent from the following discussions,
it is appreciated that throughout the present invention, discussions
utilizing terms such as "processing" or "computing" or "calculating" or
"determining" or "displaying" or the like, refer to the action and
processes of a computer system, or similar electronic computing device,
that manipulates and transforms data represented as physical (electronic)
quantities within the computer system's registers and memories into other
data similarly represented as physical quantities within the computer
system memories or registers or other such information storage,
transmission or display devices.
FIG. 1A shows a network system according to one embodiment of the present
invention. A two-dimensional mesh network 100 is shown including multiple
routing devices 102, also referred to as routers. Each router 102 is
coupled to the two, three, or four routers adjacent to the router in the
matrix, depending on its location in the matrix as shown. A network 100
may have any number of routing devices. In one embodiment, network 100
includes 16 such routing devices organized in a four-by-four grid,
creating a 4-ary 2-dimensional network.
Each router 102 is coupled to its adjacent routers 102 via a bi-directional
communication link 105. Communication link 105 can be any of a wide
variety of conventional communication devices. In one embodiment,
communication link 105 is a set of wires or other signal transmission
medium via which signals issued by a source router propagate to a
destination router.
It should be noted that the present invention is not limited to
two-dimensional mesh networks as shown FIG. 1A. Routers may be coupled
together in a k-ary n-dimensional network, where k and n are any number
greater than or equal to one. For example, a three-dimensional mesh-based
network may be utilized in which each router is coupled to three, four,
five or six other routers, depending on their location within the network.
Alternatively, the routers may be connected in a torus network 150, as
shown in FIG. 1B. In a torus network, the routers on the ends of the
network are directly coupled to the routers on the opposing ends of the
network; thus, each router is directly coupled to four other routers. For
example, router 152 is directly coupled to routers 154, 156, 158 and 160.
In an alternate embodiment, the routers may be connected in a partial
torus network, such that the routers on only two of the ends of the
network are directly coupled together, thereby resulting in each router
being directly coupled to three or four other routers depending on its
location within the network.
Furthermore, it will be appreciated that the network systems shown in FIGS.
1A and 1B represent a wide variety of computer networks. For example, the
network could be a mesh-based interprocessor communication network
utilized to couple computing nodes together in a supercomputer.
Alternatively, the network could be a LAN which couples multiple personal
computers together, such as multiple file or video servers.
In one embodiment of the present invention, each router 102 is coupled to a
computing node 200 as shown in FIG. 2. A computing node 200 is shown
comprising a bus or other communication device 210 for communicating
information between one or more processors 215 and 220 for processing
information and instructions. In one implementation, the present invention
includes Intel.RTM. architecture microprocessors as processors 215 and
220; however, the present invention may utilize any type of microprocessor
architecture. In one embodiment, bus 210 includes address, data and
control buses. The system also includes random access memory (RAM) 225
coupled with bus 210 for storing information and instructions for the
processors 215 and 220, a read only memory (ROM) 230 coupled with the bus
210 for storing static information and instructions for the processors 215
and 220, mass storage device 235 such as a magnetic or optical disk and
disk drive coupled with the bus 210 for storing information and
instructions, and input/output (I/O) devices 240 coupled with the bus 210
which input and output data and control information to and from the
processors 215 and 220. I/O devices 240 include, for example, a display
device, an alphanumeric input device including alphanumeric and function
keys, and a cursor control device. A hard copy device such as a plotter or
printer may also be included in I/O devices 240 for providing a visual
representation of computer images.
A network interface unit 205 is also coupled with the bus 210 for allowing
the node 200 to communicate with the router 102. In an alternate
embodiment, network interface unit 205 is coupled to a separate I/O bus,
such as a Peripheral Component Interconnect (PCI) bus, which is coupled to
bus 21 0 via a bus bridge. In another alternate embodiment, network
interface unit 205 is included as part of I/0 devices 240. The network
interface unit 205 operates in a conventional manner to transfer
information to and from a router 102.
In one embodiment, the method of the present invention is implemented as a
series of software routines that are run by the processors 215 and 220 of
the computing nodes in the system. These software routines interact with
the network to establish paths around faulty components. It will be
appreciated by those skilled in the art, however, that in an alternative
embodiment, the present invention may be implemented in discrete hardware
or firmware.
It will be appreciated that certain implementations of the present
invention may include additional processors or other components.
Furthermore, certain implementations of the present invention may not
require nor include all of the above components. For example, processor
220 or a display device may not be coupled to bus 210.
Returning to FIG. 1 A, the present invention routes packets of data from
source nodes to destination nodes in the network 100 utilizing "custom
routing". A packet of data, as referred to herein, is the data which is
being transferred from the source node to the destination node. Each
packet may be of any size, typically ranging from a few bytes to several
megabytes. In custom routing, the source node determines which path
through the network 100 to utilize for transferring packets of data to a
particular destination node. The path of a packet of data refers to the
routers and links the packet travels through between the source and
destination nodes. The path of the packet consists of one or more pathway
segments. For example, assuming every router in FIG. 1A is coupled to a
computing node, a path in the network of FIG. 1A exists where node A is
the source node transferring information to the destination node E. The
path node A selects includes pathway segments between node A and node B,
node B and node C, node C and node D, and node D and node E.
In one embodiment of the present invention, custom routing is implemented
utilizing multiple header blocks as shown in FIG. 3. In this embodiment,
each router in the network is assigned an identifier, such as a unique
identification number. The packet 300 is generated by the source node and
includes multiple header blocks which contain the necessary control
information to indicate to the routers the path of the packet through
network 100. The packet 300 shown includes four header blocks 310, 320,
330 and 340. Header block 310 is the first header block and indicates that
the source node is the computing node coupled to router A and the first
pathway segment is from router A to router B. The subsequent header blocks
320, 330 and 340 indicate pathway segments from router B to router C,
router C to router D, and router D to router E, respectively. It should be
noted that the header blocks in packet 300 indicate where the path through
network 100 should begin, end, and the destination routers for pathway
segments. For example, in FIG. 1A additional routers may exist between
routers A, B, C, D and E. However, since those additional routers are not
indicated as destination routers, the path of the packet continues in the
same direction at these additional routers.
The header blocks 310-340 contain control information which indicate the
proper path through the network 100. Following the header blocks 310-340,
the packet 300 includes the data 350 and the tail 360. The data, as
discussed above, can be bytes or megabytes. The tail 360 indicates the end
of packet 300. Packets of data are transferred through the network 100 in
units referred to as "flits". In one implementation of the present
invention, each flit is 64 bits; however, a flit could contain any number
of bits. In one embodiment of the present invention, each one of the links
105 shown in FIG. 1A is a 16-bit bus or other communication means. In an
alternate embodiment, each one of the links 105 is capable of transferring
an entire flit between two routers concurrently. For example, if a flit is
64 bits then each one of the links is a 64-bit bus or other communication
means. In one embodiment of the present invention, each one of the header
blocks 310-340 and the tail 360 is a single flit. The data portion 350
typically comprises multiple flits and may comprise any number of flits.
In one implementation, data portion 350 typically comprises 4 k bytes, or
512 64-bit flits. In an alternate implementation, data portion 350
typically comprises four bytes, or four 64-bit flits.
In one embodiment of the present invention, custom routing is implemented
using a "wormhole routing" technique. In wormhole routing, when a packet
of data begins transfer along a path, each portion of the path it begins
on is reserved for that packet until the packet tail 360 is received. That
is, when router A begins the transfer to router B, router A determines
whether the link between router A and the next router is available. If the
link is not available, then the router A waits to begin transmission of
the packet until the link is available. Once the link is available, router
A transfers the first flit, header block 310, from router A to the next
router. Note that this next router may be router B, or may be an
additional router between router A and router B. Router A then holds that
portion of the path to the next router for this packet of data and
continues to transfer subsequent data in the packet to the next router
until the tail 360 is transferred. It should be noted, however, that other
packets may also be transferred over this same physical link utilizing
"virtual channels," as discussed in more detail below.
Each time a router receives data for a new packet, the router checks
whether the next link in the path is available, and begins transfer along
that link once it is available. The proper link to transfer the data to is
indicated in the header blocks. In one implementation, when a router
receives a new header block it checks whether it is the destination router
for that block. If it is not the destination router, then the router
transfers the header block and all subsequent flits in the packet in the
same direction within the network. However, if the router is the
destination router indicated by the first header block, it discards the
first header block and checks the subsequent flit (that is, the second
header block) to determine the next pathway segment. Once it determines
the next pathway segment, the router transfers the second header block and
all subsequent flits to the next router in the path.
Once the destination node is reached, the first flit received by the
destination router is the last header block. The second flit received by
the destination router is the first flit of data 350. In one
implementation, this first flit of data 350 is a flit of control
information indicating it is the first flit of data. Alternatively, the
destination router could know the flit is data and not a header block
because the control information included within each header block is not
contained within the data flit, or the last header block may contain
control information indicating it is the last header block. Thus, the
destination router knows that the computing node coupled to the router is
the proper destination for the subsequent flits, and transfers the
subsequent flits in the packet to the computing node rather than to
another router.
FIG. 4 shows a routing device according to one embodiment of the present
invention in more detail. The router 400 has four input links 401,402, 403
and 404, and four output links 411,412, 413 and 414. Each of the four
input links and the four output links represent one of the links 105 shown
in FIG. 1A. Thus, the router 400 shown in FIG. 4 is coupled to four other
routers 102 of FIG. 1A. The router 400 also includes a crossbar 41 6 which
transfers data received on an input link 401,402, 403 or 404 to the
appropriate output link 411,412, 413 or 414. It will be appreciated that
the number of input links and output links which are directly coupled to a
router is dependent on the topology of the network and that router's
location in the network, as shown in FIGS. 1A and 1B.
Each of the four input links 401-404 is coupled to four different inbound
queues 418a, 418b, 418c and 418d. The inbound queues 418a-418d are
temporary storage facilities in router 400 for incoming data. In one
embodiment of the present invention, each one of the inbound queues
418a-418d is a 16-flit first in-first out (FIFO) buffer. Therefore, each
of the inbound queues 418a-418d can be smaller than the size of a packet
being routed through the network.
Each one of the queues 418a-418d represents a different "virtual channel"
of the router. That is, even though the router 400 is coupled to only the
single input link 401, router 400 can temporarily store data before
passing it on to the next router via one of the output links 411-414.
Thus, even though a single physical network exists, the router 400 is able
to temporarily store data from four separate packets. Thus, router 400
supports four "virtual channels," because the router is able to support
four separate data paths utilizing its temporary storage capabilities.
Each one of these four virtual channels corresponds to one of the inbound
queues 418a-418d.
In one embodiment of the present invention, these virtual channels are
configured as static virtual networks. A static virtual network is a
network in which paths between source and destination nodes utilize the
same numbered virtual channels. For example, if two virtual networks are
supported by the system, then a path between source and destination nodes
is entirely in the first network or entirely in the second network
(however, note that two separate paths, one in each virtual network, could
exist). Thus, packets which router 400 receives on a particular virtual
network and transfers to input queue 418a (for example, virtual channel A)
will be transferred to virtual channel A in each of the other routers in
the system which receive the packet.
In an alternate embodiment of the present invention, static virtual
networks are not required. For example, a particular source node may
indicate that the path to a destination node should utilize a first
virtual channel for the first pathway segment and a second virtual channel
for the second pathway segment. In this embodiment, the channel selecting
logic 420 checks the control information in the header blocks received and
transfers the subsequent flits of the packet to the channel indicated by
the header block. Note that in this embodiment the router 400 also checks
whether the second virtual channel of the next link is available (for
example, another packet may currently be utilizing the second network).
Each node within the network system is able to utilize any one of these
four virtual networks. Which virtual network a source node utilizes to
transfer data to a destination node is dependent on several factors,
including the existence of a deadlock-free path within the virtual network
as discussed in more detail below. In the example shown, four virtual
networks are supported. It will be appreciated however, that any number of
virtual networks can be supported by router 400 by utilizing the proper
number of input queues 418.
Data is received by router 400 in flits, as discussed above. Upon receipt
of a flit of a new packet via input link 401, a channel selecting logic
420 coupled to input link 401 checks the control information in the flit.
In one embodiment of the present invention, the first flit of a new packet
is a header block. If the header block indicates that the current router
is not the destination router for the pathway segment, then the channel
selecting logic 420 checks the control information in the header block to
determine which virtual network the packet is using. The channel selecting
logic 420 then transfers the first flit and all subsequent flits in the
packet to the channel of input link 401 corresponding to the virtual
network and asserts a request signal to an arbitration unit to transfer
the flits of the packet to the output link continuing in the same
direction through the network.
However, if the header block indicates that the current router is the
destination router for this pathway segment, then channel selecting logic
420 discards the first header block and checks the second flit. If the
second flit is a header block, then channel selecting logic 420 checks the
control information in the header block to determine which of the virtual
networks the packet is using. The channel selecting logic 420 then
transfers the second and all subsequent flits in the packet to the channel
of input link 401 corresponding to that virtual network and asserts a
request signal to the arbitration unit to transfer the flits of the packet
to the output link indicated by the header block. The channel selecting
logic 420 continues to transfer all subsequent flits to this channel until
the tail of the packet is received. If, however, the second flit is not a
header block, then the packet is transferred to the computing node
connected to the router 400, as discussed in more detail below.
In one embodiment of the present invention, channel selecting logic 420
stores the directions of the input links 401-404 and the output links
411-414. Channel selecting logic 420 utilizes these directions when router
400 is not a destination router for the packet. In this situation, router
400 transfers the packet to the next router along the same direction in
the network as the previous link. Channel selecting logic 420 is able to
determine which output link 411-414 is in the same direction based on
these stored directions.
Data received by router 400 via input links 402-404 is handled analogously
to the discussion above regarding input link 401.
In one embodiment of the present invention, router 400 includes an
additional set of inbound queues 419a-419d which are coupled directly to
the computing node connected to router 400. These inbound queues 419a-419d
are used analogous to inbound queues 418a-418d discussed above, except
that the source of the packets transferred to inbound queues 419a-419d is
the computing node rather than another routing device. Thus, when the
computing node connected to the routing device is the source node for a
particular packet, the flits of that packet are transferred from the
computing node to the inbound queues 419a-419d and channel selecting logic
coupled to the inbound queues 419a-419d asserts a request signal to the
arbitration unit to transfer the flits of the packet to the appropriate
outbound link, as indicated by the header information for the packet.
Crossbar 416 transfers flits from the four input links 401-404 to the four
output links 411-414. The channels of each of the four input links 401-404
are multiplexed onto the output links 411-414 by crossbar 416. In one
embodiment of the present invention, router 400 operates according to an
arbitration policy which ensures that the data being received on each
input link 401-404 is treated fairly. In one implementation, this
arbitration policy is the well-known round-robin scheme. By ensuring that
each input link is treated fairly, packets being transferred on multiple
virtual networks progress through the network independent of the progress
of packets on any other virtual network. Alternatively, other arbitration
policies may be employed which are less fair, such as giving one or more
channels priority over the remaining channels. However, these other
arbitration policies should be such that the progress of packets on one
virtual network does not prevent the progress of packets on another
virtual network indefinitely.
In one embodiment of the present invention, channel selecting logic 420
maintains the current state for each channel of the four input channels.
That is, channel selecting logic 420 keeps track of whether a packet is
currently being transferred via a particular channel and the correct
outbound link for a packet if a packet is being transferred. In this
embodiment, crossbar 416 is a coupling device which connects each input
queue of each input link to each outbound channel of each output link.
When a new packet is received by channel selecting logic 420, it asserts
an access request signal to an arbitration unit for access to the
appropriate output channel. If another input link is currently using the
requested output channel then the request is denied. Once the output
channel is available, the arbitration unit asserts an access granted
signal to channel selecting logic 420. Channel selecting logic 420 does
not begin transfer of the data to the output link until this access
granted signal is received. Thus, conflicts between multiple input links
for the same output channel are resolved by the arbitration unit granting
access to the output channel to only one input link at a time. Note that
if only a single channel has data to be transferred, the crossbar 416
allows that channel to monopolize the crossbar 416 until another channel
has data to be transferred. The arbitration unit grants access to the
output channels according to the arbitration policy utilized by the
router, as discussed above.
It will be appreciated that other implementations of an arbitration unit
may also be utilized within router 400. Any of a wide variety of
arbitration policies and units may be employed by router 400 which allow
an inbound link to obtain access to an outbound link, which allow the
inbound link to maintain access to the outbound link until the transfer of
the packet is completed, and which resolve conflicts between multiple
input links for access to the same output link.
The appropriate output link for a flit can be determined in a wide variety
of manners. In one embodiment, channel selecting logic 420 stores the
appropriate output link for each packet, as discussed above. In an
alternate embodiment, channel selecting logic 420 asserts a signal to
crossbar 416 each time it receives a new header block. This signal
indicates to crossbar 416 the appropriate output link 411-414 for all
flits coming from the appropriate inbound queue until a new signal is
asserted. Alternatively, control logic within crossbar 416 may monitor the
flits as they are transferred through the crossbar 416. Each time the
control logic monitors a flit which is a header block indicating the next
portion of the path for this router, the crossbar stores which output link
all subsequent flits from that queue should be transferred to (until
another appropriate header block is received).
If a flit received by router 400 is the first flit in a new packet, then
router 400 does not transfer the flit to the appropriate output link
411,412, 413 or 414 until that appropriate output link is reserved. For
example, a flit for a new packet may be received on input link 401 which
is to be transferred to channel A of output link 412. However, another
packet may have already been received on input link 403 which is currently
transferring data to channel A of output link 412. Thus, router 400 waits
until the packet being received on input link 403 is finished before
transferring the packet from input link 401 to channel A of output link
412. It should be noted that, under certain circumstances, flits received
via input link 401 may fill the inbound queue for channel A before the
packet being received on input link 403 is completely transferred to
channel A of output link 412. When this occurs, router 400 asserts a
signal to the router it is receiving the packet from on input link 401.
This signal indicates to the prior router that the inbound queue is full
and that transfer of flits to router 400 should be suspended until the
signal is deasserted. Router 400 deasserts the signal once the channel A
of output link 412 is available. It should further be noted that the
suspension of transferring flits by the prior router may result in the
inbound queue of the prior router being filled. In this situation, the
prior router asserts a signal to the router it is receiving flits from to
suspend transfer of flits until the signal is deasserted, analogous to the
discussion above.
In the embodiment shown in FIG. 4, each output link 411-414 is associated
with four outbound channels, each of which corresponds to an inbound | | |