Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a multiprocessor system having a redundant
shared memory configuration, which includes a plurality of processors and
also a plurality of shared system memories that can be used in common by
the plurality of processors and which allows the shared system memories to
have redundancy in writing data in these memories.
More specifically, the present invention relates to a multiprocessor
system, which ensures that the contents of each pair of shared system
memories are equivalent to each other, for example, in the case where the
same data is written in each pair of shared system memories having a dual
shared memory configuration.
2. Description of the Related Art
In recent years, it has become necessary to process a relatively large
amount of data at high speed and with high reliability, especially in the
field of data communication systems using a computer system. To satisfy
this requirement, a multiprocessor system has been developed, which is
constituted by a plurality of processors each including a central
processing unit (usually abbreviated to CPU). By effectively utilizing a
plurality of central processing units, such a multiprocessor system has a
much higher data processing capability than a single processor.
Further, in the above-mentioned multiprocessor system having a plurality of
processors, even if a certain processor has failed during operation, any
other processor can continue to process the data in place of the failing
one. Namely, the above-mentioned multiprocessor system has a redundant
configuration in regard to the processors, which can provide a fault
tolerant computer system.
Further, to make such a fault tolerant computer system more complete and to
ensure a data integrity of the whole multiprocessor system, it appears
indispensable that shared system memories provided for supporting a data
process at high speed also have a redundant shared memory configuration,
e.g., a dual memory configuration.
More specifically, in regard to these dual shared system memories, the
data stored in one memory of each pair must be equivalent to the data
stored in the other one, so as to ensure conformity of the respective
data, especially when the same data is to be written in each pair of dual
shared system memories.
However, in general, conformity of the respective data may fail to be
ensured mainly in the following three cases. Here, to simplify the
explanation of these cases, it is assumed that a multiprocessor system
having a number of processor modules includes only one pair of dual shared
system memory modules.
(1) The first is the case in which a write operation in one module of the
dual shared system memory modules is finished in normal termination, while
a write operation in the other module of the dual shared system memory
modules is finished in abnormal termination, when a write access to these
dual memory modules is carried out by a given processor module. Namely,
data is not completely written yet in the other module of the dual shared
system memory modules.
However, in this case, the above-mentioned processor module by which a
write access was carried out still continues to operate. By means of an
abnormal termination message, the processor module can recognize a
specified address to which a write access has failed, and therefore
assuredly rewrite the data corresponding to the specified address by
executing a data recovery process. Consequently, it can finally be ensured
that the data written in one of the dual shared system memory modules is
equivalent to the data reconstructed by the data recovery process in the
other one, and the problem concerning the above-mentioned first case is
not so serious in practice.
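The recovery in the first case can be sketched as follows. This is a
minimal illustration under stated assumptions, not the implementation of
any actual system: the names MemoryModule, dual_write and fail_once_at
are hypothetical, and the abnormal termination message is modeled as a
Python exception carrying the address to which the write access failed.

```python
class WriteAbnormalTermination(Exception):
    """Hypothetical stand-in for an abnormal termination message."""

    def __init__(self, address):
        super().__init__(f"write failed at address {address:#x}")
        self.address = address


class MemoryModule:
    """Toy model of one shared system memory module."""

    def __init__(self, fail_once_at=None):
        self.cells = {}
        # Assumed fault injection for the demonstration only.
        self.fail_once_at = fail_once_at

    def write(self, address, value):
        if address == self.fail_once_at:
            self.fail_once_at = None  # fail only on the first attempt
            raise WriteAbnormalTermination(address)
        self.cells[address] = value


def dual_write(first, second, address, value):
    """Write the same data to both modules; rewrite a failed address."""
    first.write(address, value)
    try:
        second.write(address, value)
    except WriteAbnormalTermination as message:
        # The processor module is still operating, so it knows which
        # address failed and can rewrite only that address.
        second.write(message.address, value)


ssm0 = MemoryModule()
ssm1 = MemoryModule(fail_once_at=0x10)
dual_write(ssm0, ssm1, 0x10, "A")
```

After the call, both toy modules hold the same value at address 0x10,
because the writing processor itself repaired the failed write.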
(2) The second is the case in which at least one of the dual shared system
memory modules determines that it is impossible to continue normal
operation due to a contradiction which has occurred in the shared system
memory module per se. In this case, since the shared system memory module
can no longer assuredly preserve the data that was once stored therein,
the memory module stops operating from that time and assumes a state of
"HALT" (hereinafter, a state of "HALT" will be simply referred to as
HALT).
Here, the contradiction in the shared system memory module per se means a
logical contradiction which generally occurs when hardware of the shared
system memory module is brought out of control. More concretely, as that
type of contradiction, an abnormality of a sequencer in a system bus
controller which is a connecting unit to a system bus and which will be
described hereinafter, an abnormality of another sequencer in a memory
controller in the shared system memory module, or the like can be
mentioned.
In this case, the data that was stored in the shared system memory module
assuming HALT is not reliable at all. Accordingly, to assuredly carry out
a data recovery process for a shared system memory module assuming HALT,
it is necessary to copy or duplicate all the content of the other shared
system memory module, which is in a normal state, to the shared system
memory module assuming HALT. Such a copy process or duplication process is
usually executed after the shared system memory module assuming HALT has
been brought into a state in which normal operation can be performed.
For example, in the case where the shared system memory module assumes HALT
due to a recoverable, temporary fault caused by an error of a software
type, the normal state thereof can be restored by resetting the memory
module assuming HALT and thereby canceling the state of HALT. On the
contrary, in the case where the shared system memory module assumes HALT
due to a serious, permanent fault caused by an error of a hardware type,
which is usually difficult to remedy, the normal state can be restored
only by replacing the memory module assuming HALT with a new memory
module.
Generally, in carrying out the above-mentioned copy process of all the
content of the normal shared system memory module, the larger the storage
capacity of shared system memory module becomes, the longer it takes to
complete the copy process. Therefore, a system bus of a multiprocessor
system is likely to be occupied by a copy access of a certain processor
module for executing such a copy process. Further, in the case where a
write access is carried out by some other processor module with respect to
the shared system memory module in which such a copy process is being
executed by a certain processor module, the copy access command from a
certain processor module is likely to contend with the write access
command from some other processor module. As a result of such a
contention, when the copy process is completed, a disadvantage may occur
in that all the data stored in a shared system memory module by the copy
process is not always equivalent to that of the other normal shared system
memory module.
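One way in which such a contention can leave the two memory modules
different is sketched below. The source does not specify the exact
interleaving, so the ordering of the three steps is an assumption chosen
for illustration, and the function name copy_with_interleaved_write is
hypothetical. Each memory module is modeled as a plain dictionary.

```python
def copy_with_interleaved_write(source, target, address, new_value):
    """Simulate one copy step that races with a dual write."""
    # Step 1: the copying processor module reads the old value from the
    # normal memory module.
    stale = source[address]
    # Step 2: some other processor module performs a dual write to the
    # same address in both memory modules.
    source[address] = new_value
    target[address] = new_value
    # Step 3: the copying processor module stores the value it read in
    # step 1, overwriting the newer data in the module being restored.
    target[address] = stale
    return target


normal = {0: "old"}
restored = {}
copy_with_interleaved_write(normal, restored, 0, "new")
```

In this assumed ordering the normal module ends with "new" while the
restored module ends with "old", so the copy process completes without
the two modules being equivalent.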
However, in almost every instance of the above-mentioned second case, one
of the dual shared system memory modules stops operating and assumes HALT
due to a fault caused by some error of a hardware type. In such an
instance, it becomes necessary in practice to replace the shared system
memory module in a state of HALT with a new system memory module, and then
to copy all the data of the other one of the dual shared memory modules to
the new system memory module. Namely, to deal with a shared system memory
module in a state of HALT, troublesome work such as the replacement of the
abnormal memory module cannot be avoided.
Fortunately, it should be noted that the probability that a shared system
memory module per se assumes HALT due to some error of a hardware type is
extremely low, so that the problem concerning the above-mentioned second
case is not so serious in practice.
(3) The third is the case in which a certain processor module among a
plurality of processor modules assumes HALT during a write operation of
the dual shared system memory modules; namely, the case in which a write
operation is completed in one of the dual shared system memory modules,
while a write operation is not completed yet in the other one thereof, at
the time when this certain processor module among a plurality of processor
modules assumes HALT.
Heretofore, even when a certain processor module assumes HALT in the third
case in a multiprocessor system having dual shared memory modules, it is
not confirmed whether or not a certain processor module in a state of HALT
was carrying out a write access to the dual shared memory modules.
Therefore, it cannot be known whether or not the respective data written
in the dual shared memory modules is equivalent to each other.
Consequently, even in the case where a processor module, which does not
carry out a write access to the dual shared memory modules, assumes HALT,
the multiprocessor system has been forced to conclude that a conformity of
the respective data in the dual shared system memory modules is uncertain
and cannot be ensured.
If the conformity of the respective data cannot be ensured as mentioned
above, it must be assumed that the respective data in the dual shared
system memory modules is not equivalent to each other. In this case, to
carry out a data recovery process, all the data of one of the dual shared
memory modules (the normal system memory module) is copied to the other
one, in a manner similar to the above-mentioned second case in which one
of the dual shared system memory modules assumes HALT.
In carrying out a copy process of all the data of one of the dual shared
memory modules, the problems described in the second case exist. More
specifically, a first problem is that it takes a relatively long time to
complete the copy process; a second problem is that the system bus of the
multiprocessor system is occupied by the copy access of a certain
processor module; and a third problem is that the copy access command
contends with a write access command in the case where a write access is
carried out by some other processor module. Further, a state of HALT in a
processor module is brought about not only by errors of a hardware type
but also by errors of a software type. Actually, in almost every case, a
processor module assumes HALT due to an error of a software type, unlike
the case of a state of HALT in the shared system memory module per se.
Therefore, the probability that a processor module assumes HALT is much
higher than the probability that a shared system memory module assumes
HALT, and the problems concerning the above-mentioned third case become
very serious.
SUMMARY OF THE INVENTION
In view of the above-described problems existing especially in the third
case, the main object of the present invention is to provide a
multiprocessor system having a redundant shared memory configuration,
which confirms whether or not a certain processor module in a state of
HALT was carrying out a write access to dual shared system memory modules
and whether or not the respective data in the dual shared system memory
modules is equivalent to each other, in the case where a certain processor
module assumes HALT.
A further object of the present invention is to provide a multiprocessor
system having a redundant shared memory configuration, which stops
carrying out a copy process of all the data in one of dual shared system
memory modules to the other one thereof, when it becomes clear that the
respective data in the dual shared system memory modules is equivalent to
each other.
A still further object of the present invention is to provide a
multiprocessor system having a redundant shared memory configuration,
which copies only the data corresponding to a specified address where an
inconformity occurs by executing a data recovery process, even when it
becomes clear that the respective data in the dual shared system memory
modules is not equivalent to each other.
To attain these objects, the multiprocessor system having a redundant
shared memory configuration according to the present invention includes a
plurality of processor modules each of which has a central processing unit
and a first system bus controller for connecting each processor module to
a shared system bus; and a plurality of shared system memory modules each
of which has a shared system memory unit that can be used in common by the
processor modules and also has a second system bus controller for
connecting each shared system memory module to the shared system bus.
In this case, a certain processor module is operative to write data in a
given shared system memory module and subsequently write the data in a
specified other shared system memory module, so as to ensure that the data
respectively written in the two shared system memory modules is equivalent
to each other.
Further, in this case, any other processor module monitors a status of the
certain processor module, and discriminates whether or not a write
operation for the specified other shared system memory module is finished,
in the case where the certain processor module stopped operating.
Preferably, a plurality of processor modules are constituted by two
processor modules, and a plurality of shared system memory modules are
constituted by dual shared system memory modules. Further, a first
processor module is operative to write data in a first shared system
memory module and subsequently write the data in a second shared system
memory module. Further, a second processor module monitors a status of the
first processor module, and discriminates whether or not a write operation
for the second shared system memory module is finished in the case where
the first processor module stopped operating.
Further, preferably, the second processor module discriminates whether the
write operation for the second shared system memory module is finished in
normal termination or in abnormal termination.
Further, preferably, the first system bus controller of each processor
module includes a program-mode controller which controls an access
utilizing a program-mode in accordance with a command from the
corresponding central processing unit; a direct memory access mode
(usually abbreviated to DMA-mode) controller which controls an access
utilizing a direct memory access mode that allows data transfer between
each processor module and one of the dual shared system memory modules; a
transmission/reception circuit which exchanges data with the shared system
bus; a RAM (Random Access Memory) such as a dual-port RAM which functions
as a buffer circuit between the program-mode controller or the direct
memory access mode controller and the transmission/reception circuit; and
a register which indicates a status of each processor module and also
indicates whether or not the write operation for one of the dual shared
system memory modules is being executed.
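The role of such a register can be illustrated with a small sketch. The
bit assignment below is purely hypothetical, since the actual register
layout is not given in this part of the text; the names PmStatusBits,
PM_HALT, WRITE_ACTIVE and describe are likewise assumptions introduced
only for the illustration.

```python
from enum import IntFlag


class PmStatusBits(IntFlag):
    """Hypothetical bit assignment, not the layout of the real register."""

    PM_HALT = 0x01       # the processor module has stopped operating
    WRITE_ACTIVE = 0x02  # a write to one of the dual SSMs is in progress


def describe(raw):
    """Decode a raw register value into the two facts a monitor needs."""
    bits = PmStatusBits(raw)
    return {
        "halted": bool(bits & PmStatusBits.PM_HALT),
        "write_active": bool(bits & PmStatusBits.WRITE_ACTIVE),
    }


snapshot = describe(0x03)
```

Under this assumed encoding, the value 0x03 describes a processor module
that assumed HALT while a write operation was still being executed, which
is the situation another processor module has to detect.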
Further, preferably, the second system bus controller of each shared system
memory module includes a direct memory access controller which controls an
access by a direct memory access mode that allows data transfer between
each shared system memory module and one of the processor modules; a
transmission/reception circuit which exchanges data with the shared system
bus; and a RAM such as a dual-port RAM which functions as a buffer circuit
between the direct memory access controller and the transmission/reception
circuit.
Further, preferably, the second processor module reads out an address
corresponding to the first data in the data which failed to be normally
written, from the first shared system memory module, in the case where it
is confirmed that the write operation for the second shared system memory
module is finished in abnormal termination. Further, the second processor
module copies the data starting from the address corresponding to the
above-mentioned first data to the second shared system memory module.
In a preferred embodiment of the present invention, a monitoring operation,
a discriminating operation and a copying operation of the first processor
module can be performed, in either one of the case of a data transfer
between the first processor module and the first and second shared system
memory modules by an access mode in which a synchronous operation is
executed, or the case of a data transfer by an access mode in which an
asynchronous operation is executed.
In the multiprocessor system according to the present invention, a
specified processor module (second processor module) among a plurality of
processor modules is adapted to have a function of monitoring a status of
a certain processor module (first processor module), and to discriminate
whether a write operation for the second shared system memory module is
finished in normal termination or in abnormal termination, even in the
case where the certain processor module stopped operating and assumed
HALT.
Therefore, it can be easily confirmed whether or not the processor module
assuming HALT was carrying out a write access to the dual shared system
memory modules and also confirmed whether or not the respective data in
the dual shared system memory modules is equivalent to each other. Namely,
when the write operation for the second shared system memory module is
finished in normal termination, it can be made clear that a conformity of
the respective data is ensured, and it becomes unnecessary to carry out a
copy process. On the other hand, when such a write operation is finished
in abnormal termination, it can be made clear that the respective data
written in the dual shared system memory modules is not equivalent to each
other.
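The decision described above can be summarized in a short sketch. The
enumeration SecondWrite and the function copy_required are hypothetical
names. Treating a halted processor module that was not writing as
requiring no copy is an inference from the description of the third case
rather than an explicit statement.

```python
from enum import Enum


class SecondWrite(Enum):
    """Outcome observed for the halted processor module (assumed states)."""

    NOT_WRITING = 0  # it was not accessing the dual SSMs
    NORMAL = 1       # write to the second SSM finished normally
    ABNORMAL = 2     # write to the second SSM finished abnormally


def copy_required(outcome):
    """Return True only when the two SSMs are known to differ."""
    return outcome is SecondWrite.ABNORMAL


decisions = {outcome.name: copy_required(outcome) for outcome in SecondWrite}
```

Only the abnormal termination leads to a copy process in this sketch; in
the other outcomes the conformity of the respective data is taken as
ensured.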
Thus, when the second processor module discriminates that the write
operation of the first processor module finished in abnormal termination,
the second processor module is adapted to read out an address
corresponding to the first data in the data which failed to be normally
written. Further, the second processor module copies the data starting
from the address corresponding to the above-mentioned first data to the
second shared system memory module.
In this case, since the data corresponding to the region where an
abnormality has occurred can be clearly detected, only the data related to
the abnormality needs to be copied. Therefore, the amount of data for
which a copy process is to be executed can be decreased, and the time
required for the data recovery, which ensures that the respective data in
the dual shared system memory modules is equivalent to each other, can be
remarkably reduced.
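The saving can be illustrated with a small sketch, assuming that the
length of the failed region is known to the recovering processor module.
The function recover_from, its length parameter, and the sizes used in
the example are hypothetical; each memory module is modeled as a list of
1024 cells.

```python
def recover_from(first_ssm, second_ssm, start, length):
    """Copy `length` cells beginning at `start` from the first SSM.

    Returns the number of cells copied.
    """
    for address in range(start, start + length):
        second_ssm[address] = first_ssm[address]
    return length


first_ssm = list(range(1024))
second_ssm = list(range(1024))

# Assume a 16-cell write to cells 512..527 reached the first SSM, while
# the write to the second SSM failed from cell 520 onward.
for address in range(512, 528):
    first_ssm[address] = -1
for address in range(512, 520):
    second_ssm[address] = -1

copied = recover_from(first_ssm, second_ssm, start=520, length=8)
```

In this example 8 cells are copied instead of all 1024, and the two toy
modules are equivalent again after the recovery.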
With regard to the methods for carrying out a data transfer between a
given processor module and the dual shared system memory modules, the
following two types can be mentioned: a first method is the data transfer
by a synchronous mode, which is executed by the processor module in
accordance with a program of the computer system; a second method is the
data transfer by an asynchronous mode, which is represented by a direct
memory access mode.
The copying operation, etc., of the first processor module in the
multiprocessor system according to the present invention can be executed
for both methods.
Legend
The following are abbreviations appearing in the descriptions and drawings
wherein:
ADRS DEC represents "Address Decoder";
ADRS GEN represents "Address Generator";
BCT represents "Byte Count";
CONT REG represents "Control Register";
DAT represents "D-Port Active Indicator";
DID represents "Destination Identifier";
EC represents "Execution Command";
EPSSAD represents "External PM SSM Address in DMA-Mode";
EPSSAP represents "External PM SSM Address in Program-Mode";
EPSTS represents "External Processor Module Status";
IBC represents "I/O Bus Controller";
IBC-P represents "I/O Bus Controller for a Processor Module";
IOBH represents "I/O Bus Handler";
LSB represents "Least Significant Bit";
MSU represents "Main Storage Unit";
PAT represents "P-Port Active Indicator";
PM represents "PM Halt Status";
REG represents "Register";
RC represents "Response Command";
SBC represents "System Bus Controller";
SBC-P represents "System Bus Controller for a Processor Module";
SBC-S represents "System Bus Controller for a Shared System Memory Module";
SBHT represents "SBC Halt Status";
SID represents "Source Identifier";
SS BUS represents "Shared System Bus";
SSBH represents "Shared System Bus Handler";
SSM represents "Shared System Memory Module";
SSMAA represents "Shared System Memory Module-Access Address";
SUCD represents "SSM-Unmatch D-Port Indicator";
SUCP represents "SSM-Unmatch P-Port Indicator";
TMG CTR represents "Timing Controller"; and,
UID represents "Unit Identifier".
BRIEF DESCRIPTION OF THE DRAWINGS
The above objects and features of the present invention will be more
apparent from the following description of the preferred embodiments with
reference to the accompanying drawings, wherein:
FIG. 1 is a schematic block diagram showing a tightly coupled
multiprocessor system of a conventional type;
FIG. 2 is a schematic block diagram showing a loosely coupled
multiprocessor system of a conventional type;
FIG. 3 is a schematic block diagram showing a multiprocessor system related
to the present invention;
FIG. 4 is a block diagram showing a multiprocessor system of an embodiment
according to the present invention;
FIG. 5 is a block diagram showing the construction of one of processor
modules in FIG. 4;
FIG. 6 is a block diagram showing the construction of one of shared system
memory modules in FIG. 4;
FIG. 7 is a block diagram showing the construction of a first shared system
bus controller provided in each processor module illustrated in FIG. 5;
FIG. 8 is a block diagram showing the construction of an I/O bus controller
provided in each processor module illustrated in FIG. 5;
FIG. 9 is a block diagram showing the construction of a second shared
system bus controller provided in each shared memory module illustrated in
FIG. 6;
FIG. 10 is a diagram for explaining a format of each pair of an execution
command and a response command;
FIG. 11A is a diagram for explaining the operation of data transfer on the
bus for a read access in an embodiment according to the present invention;
FIG. 11B is a diagram for explaining the operation of data transfer on the
bus for a usual write access in an embodiment according to the present
invention;
FIG. 12 is a diagram for explaining the operation of data transfer on the
bus for a dual write access in an embodiment according to the present
invention;
FIGS. 13A, 13B and 13C are diagrams respectively showing the contents of
various status which three kinds of registers (EPSTS, EPSSAP and EPSSAD)
indicate in FIG. 7;
FIG. 14 is a block diagram showing an interrelationship between a plurality
of processor modules which are connected to each other in an embodiment
according to the present invention;
FIG. 15 is a block diagram for explaining the operation of an embodiment
according to the present invention shown in FIG. 4; and
FIG. 16 is a flowchart for explaining the operation of an embodiment
according to the present invention.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
Before describing an embodiment of the present invention, the related art
and the disadvantages therein will be described with reference to the
related drawings of FIGS. 1 to 3.
FIG. 1 is a schematic block diagram showing a tightly coupled
multiprocessor system of a conventional type; FIG. 2 is a schematic block
diagram showing a loosely coupled multiprocessor system of a conventional
type.
In a tightly coupled multiprocessor system (TCMP) shown in FIG. 1, a
plurality of processor modules 100-1, 100-2, . . . , 100-n (PM #0, #1, . .
. , #n) are coupled to one shared system memory module (SSM) in a shared
memory device 200, through a shared system bus (SS BUS) 300. In this case,
all the programs and data in a plurality of central processing units (CPU
or μP) 101-1, 101-2, . . . , 101-n are stored in the shared system
memory module. Also, data communication between processor modules is
carried out through the shared system memory module.
Such a tightly coupled multiprocessor system achieves, with a simple
construction, a higher data processing capability than a single processor,
by effectively utilizing a plurality of central processing units. Further,
since the data communication between different CPUs can be executed via
only one memory module, it becomes unnecessary to carry out excess data
processing for data transfer between these processor modules and various
kinds of memory modules, in the case where these memory modules are
provided within the multiprocessor system.
However, in this construction, since all the data is processed through a
single shared system memory module, it is likely to take a relatively long
time to transfer a large amount of data. In other words, since all the
processor modules access data in a single shared system memory module, the
data transfer capability of the shared system bus is likely to be limited
to some extent as the number of processor modules connected to the shared
system bus increases.
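As a rough illustration of this limitation, the following sketch assumes
an idealized model in which the bus capacity is divided equally among the
processor modules. The function per_module_share and the figure of 100
units are hypothetical, and real arbitration overhead is ignored.

```python
def per_module_share(bus_bandwidth, modules):
    """Idealized equal share of bus bandwidth per processor module."""
    if modules < 1:
        raise ValueError("at least one processor module is required")
    return bus_bandwidth / modules


# Share available to each module for 1, 2, 4 and 8 processor modules,
# with the total bus capacity fixed at 100 arbitrary units.
shares = [per_module_share(100.0, n) for n in (1, 2, 4, 8)]
```

Under this assumption the share falls from 100 units with one processor
module to 12.5 units with eight.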
On the other hand, in a loosely coupled multiprocessor system (LCMP) shown
in FIG. 2, a plurality of processor modules 110-1, 110-2, . . . , 110-n
(PM #0, #1, . . . , #n) have respectively corresponding central processing
units 120-1, 120-2, . . . , 120-n and a plurality of main storage units
(MSU) 130-1, 130-2, . . . , 130-n which are exclusively used by the
respective central processing units 120-1, 120-2, . . . , 120-n. In this
case, all the programs and data in a plurality of central processing units
120-1, 120-2, . . . , 120-n are stored in the respectively corresponding
main storage units (MSU) 130-1, 130-2, . . . , 130-n. Generally, data
communication between processor modules is carried out through the
respective I/O (input/output) channels 140-1, 140-2, . . . , 140-n, an I/O
bus 150, and other communication ports.
In such a construction, the data processing capability becomes higher than
in the case of a tightly coupled multiprocessor system, since the memory
capacity available to each central processing unit increases. Further,
since all the processor modules individually have local memories such as
main storage units, contention for a single shared system memory module
does not occur between different processor modules, unlike the case of a
tightly coupled multiprocessor system.
However, in this construction, since the data communication between
different processor modules has to be carried out via the respective I/O
channels and other communication ports, the multiprocessor system is
likely to incur a large overhead time for exchanging data between the
CPUs, the I/O channels, etc.
FIG. 3 is a schematic block diagram showing a multiprocessor system related
to the present invention. In FIG. 3, a plurality of processor modules 1-1,
1-2, . . . , 1-n (PM #0, #1, . . . , #n) have respectively corresponding
central processing units 10-1, 10-2, . . . , 10-n and a plurality of main
storage units (MSU) 11-1, 11-2, . . . , 11-n which are exclusively used by
the respective central processing units 10-1, 10-2, . . . , 10-n, similar
to the case of FIG. 2. Further, data communication between processor
modules is carried out through the respective I/O (input/output) channels
14-1, 14-2, . . . , 14-n or other communication ports.
Further, in FIG. 3, one shared system memory module (SSM) 2-1 in a shared
memory device 2-2 is connected to a shared system bus 4-1, to support data
communication with relatively high speed.
Generally, the number of processor modules which can be incorporated into
a loosely coupled multiprocessor system is much greater than in the case
of a tightly coupled multiprocessor system. In view of this, the
multiprocessor system of the present invention is constructed so as to be
related to the loosely coupled multiprocessor system shown in FIG. 2.
Further, by taking the feature of the tightly coupled multiprocessor
system into consideration, the multiprocessor system, which is closely
related to the present invention, has been conceived, as illustrated in
FIG. 3. Further, the present invention is intended to have redundancy also
in shared system memory configuration by providing a plurality of shared
system memory modules.
The above-mentioned multiprocessor system shown in FIG. 3 has a redundant
configuration in regard to both the processors and the shared system
memory modules, which can provide a fault tolerant computer system.
Further, such a multiprocessor system also has extensible characteristics
in which a desired performance can be provided in accordance with an
ability required by a user.
Hereinafter, the description of a preferred embodiment according to the
present invention will be given with reference to the accompanying
drawings.
FIG. 4 is a block diagram showing a multiprocessor system of an embodiment
according to the present invention. In this case, a main part of a
multiprocessor system having a dual shared memory configuration will be
illustrated.
In FIG. 4, the multiprocessor system includes a plurality of processor
modules (PM) 1, e.g., three processor modules, and a pair of shared system
memory modules (SSM) 2. Further, each of three processor modules 1 and a
pair of shared system memory modules 2 are connected to a shared system
bus (SS BUS) 3, through a first system bus controller for a processor
module (SBC-P) and a second system bus controller for a shared system
memory module (SBC-S), respectively.
The shared system bus 3 is controlled by a shared system bus handler (SSBH)
5. Namely, an arbitration on the shared system bus 3 concerning various
data and various commands is concentratively managed by a shared system