FIELD OF THE INVENTION
This invention relates to fault-tolerant computers in general and, in
particular, to memory backup systems for fault-tolerant computers.
BACKGROUND OF THE INVENTION
Many computer systems are designed to be "fault-tolerant". Typically, these
systems can experience one or more temporary or permanent circuit failures
and continue to function without loss of data or without introducing
serious errors into the data. Such systems typically vary as to how many
faults can be "tolerated" and as to how each fault is handled.
In order to be fully fault-tolerant, a computer must be able to survive a
fault which renders one or more portions of its main memory inoperative.
In such a situation, to avoid losing or corrupting data, it is necessary
to have a second, or backup copy, of the data available in a separate
memory location which cannot be disabled by the original fault. Therefore,
in fault-tolerant systems it is common to have at least two main memory
units and to maintain a copy of the system data simultaneously in both
units.
However, maintaining a duplicate copy of data in two separate memories
causes a significant reduction in the computational speed of the computer
since every data storage operation must be performed twice and such
operations are usually supervised by the processing element which cannot
simultaneously perform normal processing operations.
In order to reduce the time penalty associated with maintaining duplicate
data copies, some prior art fault-tolerant systems maintain only one copy
of the data during normal processing operations and, at periodic
intervals, update a data copy maintained in a backup memory. This scheme
works satisfactorily unless a fault occurs in the main memory which
disables portions that contain data which has not yet been copied or
unless a fault occurs during the copying operation itself which disables
either the main memory or the copy memory so that the copy cannot be
completed. Such failures can cause loss of data integrity.
It is an object of the present invention to provide a memory backup system
in which no single failure can cause a loss of data or data integrity.
It is another object of the present invention to provide a memory backup
system in which backup can be carried out quickly and efficiently during a
context switch.
It is yet another object of the present invention to provide a memory
backup system in which data copying can be carried out without consuming
large amounts of processing time thereby slowing processing speed.
It is a further object of the present invention to provide a memory backup
system in which the required circuitry to provide complete backup
capabilities is minimized.
SUMMARY OF THE INVENTION
The foregoing objects are achieved and the foregoing problems are solved in
one illustrative embodiment of the invention in which all data modified
under control of a program is temporarily stored only in a non
write-through cache memory associated with the processor that is running
the program. When a context switch or an overflow situation in which the
cache memory becomes full makes it necessary to write the modified data
into the system main memory, a special data location in the main memory,
containing among other things the identity of the user program currently
being executed, is updated to indicate that the data in the cache memory
is being written to a first area in main memory associated with the
program.
Special circuitry in the processor then writes all of the data which has
been modified by the user's program to the first area. At the end of the
storage operation, the status block in the main memory is again updated to
indicate that the first memory area has been updated and that a second
area is about to be updated. Subsequently, the same data is written to the
second area. The second storage operation is followed by a final
modification of the status block to indicate that the update of the second
area has also been completed.
Therefore, no matter when a fault occurs there remains a consistent set of
data in main memory and an associated address at which each user program
can be reinitiated. For example, if a processing element fails before it
first begins writing modified data blocks to the first memory area, the
data and starting address in the first memory area are exactly what they
were before the processing element began executing the program and the
program can be restarted using the initial data in the first memory area.
Alternatively, if the processing element fails after the writing operation
to the first area begins, but before it is completed, the status block in
main memory indicates that the writing operation has not been completed
and a fault recovery routine need only write the contents of the program's
second area in main memory into its first area and cause reexecution of
the program on another processing element. Similarly, the computer system
can recover from a fault which occurs during modification of the program's
second area in main memory by assigning a new second area to the program
and recopying the contents of the program's first area into the new second
area.
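By way of illustration only, the two-stage update and the corresponding recovery rules described above may be sketched as follows. The sketch is a simplified software model, not the special circuitry of the embodiment; the names `Phase`, `Backup`, `flush`, `recover` and the `fail_at` parameter are hypothetical and are introduced solely to simulate a fault at a chosen point.

```python
from enum import Enum


class Phase(Enum):
    """Hypothetical encoding of the status block contents."""
    IDLE = 0            # both areas hold the same consistent data
    WRITING_FIRST = 1   # update of the first area is in progress
    WRITING_SECOND = 2  # first area is complete, second is in progress


class Backup:
    """Two copies of a program's data plus a status block (simplified model)."""

    def __init__(self, data):
        self.first = dict(data)
        self.second = dict(data)
        self.status = Phase.IDLE

    @staticmethod
    def _partial_write(area, modified):
        # Simulate a fault part-way through: only half of the blocks land.
        keys = sorted(modified)
        for key in keys[: len(keys) // 2]:
            area[key] = modified[key]

    def flush(self, modified, fail_at=None):
        """Write modified blocks to both areas; fail_at simulates a fault."""
        self.status = Phase.WRITING_FIRST
        if fail_at == "first":
            self._partial_write(self.first, modified)
            return False
        self.first.update(modified)

        self.status = Phase.WRITING_SECOND
        if fail_at == "second":
            self._partial_write(self.second, modified)
            return False
        self.second.update(modified)

        self.status = Phase.IDLE
        return True

    def recover(self):
        """Restore two consistent copies according to the status block."""
        if self.status is Phase.WRITING_FIRST:
            # First area is suspect: copy the untouched second area over it,
            # after which the program is re-executed from the old data.
            self.first = dict(self.second)
        elif self.status is Phase.WRITING_SECOND:
            # First area is complete: assign a new second area and recopy.
            self.second = dict(self.first)
        self.status = Phase.IDLE
```

In this model a fault during the first write rolls the program back to its previous data, while a fault during the second write rolls forward, because the first area already holds a complete new copy.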
In accordance with the invention, the first and second memory areas are
located in physically separate memory elements; therefore, a single memory
element failure leaves at least one consistent copy of the program data in
main memory except when the copies are being updated. In this case, the
system can recover by completing the updating of data in either or both
memory areas.
In order to provide reasonable efficiency during program operation while
storing all data modified by the program in an associated cache memory,
the cache memory must be much larger than that typically used in prior art
systems. Such a large cache memory can impose significant time penalties
during a context switch because in accordance with normal cache operation,
each data entry in the cache must be written to the system main memory
before a new user program can be installed.
In order to decrease the time required to write the cache contents during
context switches, a separate block status memory is associated with each
cache memory. The block status memory contains status entries
corresponding to each data block in the associated cache memory. Whenever
information is written to a data block from a location other than main
memory, the associated status entry is modified thus identifying
information which has been changed during program operation.
Special-purpose hardware is also provided which allows the processing
element to write to the system main memory only those data blocks which
have been modified and which also have addresses within any specified
address range thereby greatly reducing the amount of time required to
effect a context switch.
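The role of the block status memory may be illustrated by the following minimal sketch, which models in software what the embodiment performs in special-purpose hardware. The class and method names are hypothetical. The choice to clear a status entry when a block is filled from main memory is an assumption of the sketch; the text above states only that writes from locations other than main memory modify the entry.

```python
class Cache:
    """Non-write-through cache with a per-block status memory (model only)."""

    def __init__(self):
        self.blocks = {}       # block address -> block contents
        self.modified = set()  # status memory: addresses of modified blocks

    def load_from_main(self, addr, data):
        # A fill from main memory does not count as a modification.
        # Assumption: the freshly filled block is marked unmodified.
        self.blocks[addr] = data
        self.modified.discard(addr)

    def write(self, addr, data):
        # A write originating anywhere other than main memory marks the block.
        self.blocks[addr] = data
        self.modified.add(addr)

    def flush_range(self, main_memory, lo, hi):
        """Write back only modified blocks with lo <= address < hi."""
        written = sorted(a for a in self.modified if lo <= a < hi)
        for addr in written:
            main_memory[addr] = self.blocks[addr]
            self.modified.discard(addr)
        return written
```

Because unmodified blocks and blocks outside the specified range are skipped, the number of main memory writes at a context switch depends on how much data the program changed rather than on the size of the cache.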
In addition, to further facilitate context switches, special purpose
hardware is provided in the processing element that can be activated in
parallel with other processing operations and which can be used to
invalidate virtual memory address translation map entries during a context
switch. Therefore, the supervisor program in the processing element need
only issue a command to start the invalidation operation and then proceed
to other functions.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of an illustrative computer system in which the
present invention may be utilized.
FIG. 2 is an expanded block diagram of the processing element shown in FIG.
1 incorporating the illustrative cache and block status memories.
FIG. 3 of the drawing shows an expanded block diagram of the slave
interface shown in FIG. 1.
FIG. 4 of the drawing shows an expanded block diagram of the memory element
shown in FIG. 1.
FIG. 5 of the drawing shows the division of the cache memory into
supervisor and user code and data spaces and the mapping of the supervisor
code and supervisor data spaces into their respective locations in the
cache memory.
FIG. 6 of the drawing shows the mapping of the user data and user code
spaces into their respective locations in the cache memory.
FIG. 7 of the drawing, consisting of two drawing sheets, shows a detailed
block diagram of circuitry used to operate the cache and block status
memories in the processing element.
FIG. 8 of the drawing shows an expanded block diagram of the memory
management circuitry in the illustrative embodiment.
FIG. 9 of the drawing, consisting of three drawing sheets, shows the
arrangement of information on each of the internal buses during the
various internal operations.
DETAILED DESCRIPTION
As shown in FIG. 1, an illustrative fault-tolerant computer system is
comprised of three main elements: processing elements, memory elements and
peripheral elements. All of the elements are connected to common system
buses 130 and 131. Only one system bus is necessary for system operation,
but two buses are preferred to prevent a malfunction in one bus from
stopping operation of the entire system and to increase throughput of the
system. Similarly, for reliability and speed purposes, the interface units
which connect the processing, memory and peripheral elements to the system
bus are also duplicated. Buses 130 and 131, although shown as a single
line, are actually multi-wire buses comprising many separate data and
signal lines as will be described further herein.
Up to sixty-four processing elements (PEs) may be added to the illustrative
system of which only three processing elements, 100, 105 and 110, are
shown for clarity. Each PE is identical and contains a processor that is a
conventional data processing device capable of executing both user
application programs and supervisor programs which control and coordinate
the operation of the associated processor.
In accordance with the invention, each PE also contains a cache memory that
temporarily stores both user and supervisor program code and data. As is
typical, the cache memory is used to reduce the effective memory access
time. However, as will be explained in more detail below, the cache memory
is also used in conjunction with the memory elements to provide a
fault-tolerant data backup mechanism which both ensures data integrity for
any single fault and provides high speed operation.
Also included in each PE is a read-only memory in which is stored
additional frequently-used supervisor code and a bootstrap loading program
which enables the processing element to become operational when it is
reset or upon power-up.
The illustrative computer system also utilizes a virtual memory system in
which the address information produced by the processors is translated
before being provided to the main memory elements. A table of translations
(a map) for translating virtual addresses into physical addresses which
are used by the main memory elements is stored in a random-access memory
located in each processing element.
All processing elements are connected to redundant processor buses, such as
buses 115-116, which are duplicated for reliability purposes and increased
throughput. Access to buses 115 and 116 and the system buses 130 and 131
by the PEs is controlled by master interface units 120 and 125,
respectively, which are duplicated for reliability and throughput
purposes. Each master interface unit contains sequence and control logic
and bus arbitration circuitry which can handle up to sixteen processing
elements. In order to accommodate additional processing elements,
additional processor bus and master interface unit pairs may be added to
system buses 130 and 131. Up to a total of four processor bus pairs may be
included in the illustrative system to accommodate sixty-four processing
elements.
If there are more than 16 processing elements in a particular computer
configuration, the processing elements are divided into groups of up to
sixteen processing elements. Each group of up to 16 processing elements is
connected to a common processor bus, which is, in turn, connected by
dedicated master interface units to the system buses.
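The grouping arithmetic stated above (up to sixteen processing elements per processor bus pair, up to four pairs, hence up to sixty-four processing elements) may be expressed as a short sketch. The constant and function names are hypothetical.

```python
MAX_PES_PER_BUS_PAIR = 16  # handled by one pair of master interface units
MAX_BUS_PAIRS = 4          # limit stated for the illustrative system


def bus_pairs_needed(n_pes):
    """Number of processor bus pairs required for n_pes processing elements."""
    if not 0 < n_pes <= MAX_PES_PER_BUS_PAIR * MAX_BUS_PAIRS:
        raise ValueError("the illustrative system supports 1 to 64 PEs")
    # Ceiling division: a partly filled group still needs its own bus pair.
    return -(-n_pes // MAX_PES_PER_BUS_PAIR)
```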
Within each group of up to sixteen processing elements, supervisory control
is shared among the processors. In particular, at any one time a
supervisory or "executive" processing element is recognized by all the
processing elements in one group and the executive role passes from
processing element to processing element in a well-defined priority
scheme. The exact mechanism of the transfer of control among processors is
disclosed in detail in a copending patent application entitled "Modular
Computer System" by Jack J. Stiffler et al. and filed in the United States
Patent and Trademark Office on Sept. 3, 1982 and assigned Ser. No.
414,961, now U.S. Pat. No. 4,484,273, which disclosure is hereby
incorporated by reference.
All system modules (including memory elements and peripheral units) are
assigned uniquely to one processing element group; however, all processing
elements can communicate with all memory elements and bus adapters, even
those "belonging" to another group. In addition, some common areas in
system main memory are recognized by all processing element groups. Within
each processing element group, system resources are allocated by the
executive processing element as if that group were the only group in the
computer system. Communication between groups is accomplished by means of
the common memory areas which contain hardware "locks" to facilitate such
transfers.
System buses 130 and 131 are connected to memory elements 165-175 and bus
adapters 184-186 by means of slave interfaces 135, 140 and 145, 150,
respectively. Each slave interface is identical and has a redundant
duplicate for reliability purposes and to increase system throughput. More
particularly, one slave interface in a pair can be used to provide an
access path for transfer of data to or from one memory element or bus
adaptor in the associated group while the other slave interface
simultaneously provides an access path to a second memory element or bus
adapter. Slave interfaces 135-150 contain circuitry which converts the
signals produced by memory elements 165-175 and the peripheral buses
196-197 (via peripheral bus adapters 190-192 and bus adapters 184-186)
into signals which are compatible with the signals used on system buses
130 and 131.
In particular, slave interfaces 135 and 140 connect system buses 130 and
131 to memory buses 160 and 161. Although only two memory bus pairs, 160,
161 and 155, 156, are shown for clarity, up to sixteen dual-redundant
memory bus pairs may be added to the illustrative system.
Memory buses 160 and 161 are, in turn, connected to a plurality of memory
elements and bus adapters of which three devices (memory elements 165, 170
and bus adapter 185) are shown. These elements together constitute the
main memory of the system. In the illustrative embodiment, each of the
memory elements contains 2^21 bytes of random access memory and
consists of a conventional random access memory unit. Other well-known
memory units of different sizes may also be used in a well-known manner.
Slave interfaces 145 and 150 connect system buses 130 and 131 to memory
buses 155 and 156 which are identical to buses 160 and 161. Peripheral
buses 196 and 198 are coupled to buses 155 and 156 by interface circuitry
consisting of bus adaptors of which two adaptors, 184 and 186, are shown
and peripheral bus adapters of which units 190 and 192 are shown. Each bus
adaptor contains buffer memories and data processing logic which can
buffer and format data information received over the memory buses and the
peripheral buses and commands received from the processing elements via
the system buses. In particular, each bus adaptor can handle signals on
two independent command channels and two independent input/output data and
command channels.
Each bus adaptor, such as adaptor 184, is connected to a peripheral bus
adaptor 190 over a dedicated bus. The peripheral bus adapters contain a
microprocessor and an associated program memory. Under control of a
program stored in the program memory, the microprocessor unit can perform
format conversions and buffering for information passing between the
processing elements and the peripheral controllers and units. The
formatting functions performed by the peripheral bus adaptor units help to
speed up overall processing time by relieving the processing elements
(PEs) of some routine data transfer tasks. Each peripheral bus adaptor can
be individually programmed to provide an interface to a variety of
standard peripheral buses onto which in the illustrative embodiment can
be attached input/output controllers of various types, including secondary
storage devices such as disks and tapes. Peripheral bus adaptors 190-192
can be programmed to convert between the signals used on internal memory
buses 155 and 156 and the signals used on the peripheral buses 196-198
and, therefore, allow many different peripheral bus formats to be used
with the illustrative system.
When a memory element or bus adaptor is inserted into the system, it
undergoes an initial power-up clear and initialization cycle during which
all of its bus drivers are turned off to prevent the unit from
communicating erroneous information to the system. In addition, the unit's
internal status registers are set to a predetermined state. After
initialization has been completed the unit sends an interrupt to the
current executive processing element thereby informing the executive
processing element that it is available.
In response to this interrupt, the executive processor initializes the
newly inserted unit by testing the unit to verify that its internal fault
monitoring apparatus is operational and records its existence in
appropriate memory tables.
If the unit is a memory element (determined by reading its status) it is
assigned a physical name, thus defining the physical addresses to which it
is to respond. Alternatively, if the unit is a bus adapter/peripheral bus
adapter, a program is loaded into its internal program memory which
program allows its internal microprocessor to query the associated
peripheral devices in order to determine the number and type of peripheral
units on the associated peripheral bus. Peripheral information is reported
back to the executive processing element via the interrupt mechanism which
thereupon responds by loading the appropriate operating programs into the
program memory in the newly inserted unit and again updating system
configuration tables in memory.
A more detailed functional block diagram of a processing element is shown
in FIG. 2. Each processing element contains identical circuitry and
therefore the circuitry in only one processing element will be discussed
in detail to avoid unnecessary repetition. The heart of the processing
element is a microprocessor unit (MPU) 210 which performs most of the
ordinary calculations handled by the computer system. Microprocessor unit
210 may illustratively be a conventional 16-bit microprocessor. Several
microprocessor units with suitable characteristics are available
commercially; a unit suitable for use with the illustrative embodiment is
a model MC68000 microprocessor available from the Motorola Semiconductor
Products Company, Phoenix, Ariz.
Supporting the operation of MPU 210 are several other units which assist
the MPU to decrease its processing time and decrease the effective memory
access time. In particular, these units include memory management unit
200, ROM 205 and cache memory 250.
In particular, MPU 210 operates with a "virtual address" arrangement. In
this well-known memory arrangement, MPU 210 produces "virtual" addresses
which require a translation in order to convert them into the actual
addresses which correspond to memory locations in the computer system main
memory. The translation of virtual addresses to physical addresses is
accomplished by memory management unit 200. Unit 200 utilizes a
translation "table" or "map" retrieved from main memory during a context
switch and stored in an internal random access memory to perform the
translation from virtual to physical addresses. Specifically, virtual
address information produced by MPU 210 is provided to memory management
unit 200 via local address bus 220. Memory management unit 200 translates
the virtual address information into physical addresses used for
addressing the main memory in the computer system. The translated
information is provided to cache/local bus adapter 230 which controls the
flow of information inside the processing element and gates the
appropriate translated cache address onto cache data bus 285.
During a context switch the entries in the translation map must be
invalidated to prevent improper operation. In accordance with one aspect
of the invention, to decrease the time required to perform a context
switch, special purpose hardware is provided in the processing element
which can be activated in parallel with other processing operations. The
special purpose hardware automatically invalidates all map entries during
a context switch. Therefore, the supervisor program in the processing
element need only issue a command to start the invalidation operation and
then proceed to other functions.
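The translation and invalidation behavior described above may be modelled, purely for illustration, as follows. The page size and all names are hypothetical, since the text at this point does not specify the map format. The sketch does not model the parallel operation of the invalidation hardware; `invalidate_all` simply stands in for the single command issued by the supervisor program.

```python
PAGE_SIZE = 4096  # hypothetical page size, chosen only for the example


class TranslationMap:
    """Virtual-to-physical map held in the processing element (model only)."""

    def __init__(self):
        self.entries = {}  # virtual page -> (physical page, valid flag)

    def load(self, vpage, ppage):
        # Entry retrieved from main memory for the incoming program.
        self.entries[vpage] = (ppage, True)

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        ppage, valid = self.entries.get(vpage, (None, False))
        if not valid:
            raise LookupError("no valid map entry for virtual page %d" % vpage)
        return ppage * PAGE_SIZE + offset

    def invalidate_all(self):
        # Mark every entry invalid so stale translations cannot be used
        # by the next program.
        for vpage in list(self.entries):
            ppage, _ = self.entries[vpage]
            self.entries[vpage] = (ppage, False)
```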
Cache memory 250 is a well-known memory element which is used to decrease
the effective memory access time and, in accordance with the invention, to
provide a memory backup arrangement. In particular, a sub-set of the
information stored in the main memory is also temporarily stored in cache
memory 250. Memory 250 responds directly to virtual addresses supplied by
microprocessor unit 210 and, if the requested information is present in
the cache memory (called a "cache hit"), the information becomes available
in a much shorter time interval than a normal access to main system memory
would require. If the requested information is not present in the cache
memory but is present in main memory, the attempted access is called a
"cache miss" and well-known circuitry automatically transfers or "writes"
a parcel of information called a "block" containing the requested
information from main memory into the cache memory. If the requested
information is located only in peripheral secondary storage the access
attempt results in a "page fault" which is handled via procedures to be
hereinafter described.
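The three possible outcomes of a memory access described above (cache hit, cache miss and page fault) may be summarized by the following sketch. The function name, the dictionary-based memories and the block size are hypothetical simplifications.

```python
def access(addr, cache, main_memory, block_size=16):
    """Classify an access and fill the cache on a miss (simplified model).

    block_size is a hypothetical value used only for this example.
    """
    block = addr - addr % block_size  # address of the enclosing block
    if block in cache:
        return "hit"
    if block in main_memory:
        # Cache miss: the block containing the requested information
        # is transferred from main memory into the cache.
        cache[block] = main_memory[block]
        return "miss"
    # Information resides only in peripheral secondary storage.
    return "page fault"
```

A second access to the same block after a miss is a hit, which is the source of the reduction in effective memory access time.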
Cache memory 250 consists of a 2^17-byte random access memory arranged
in a 36-bit by 32,000-word (actually 32 × 1024 word) configuration
(each 36-bit word contains 4 information bytes each associated with a
parity bit). Information retrieved from cache memory 250 is provided, via
cache data bus 285 and cache/local bus adapter 230 to local data bus 225
and thence to MPU 210. Cache/local bus adapter 230 provides interface and
signal conversion circuitry between 32-information-bit cache bus 285 and
16-information-bit local data bus 225. In addition, bus adapter 230 checks
byte parity on data passing from cache memory 250 to local bus 225 and
generates byte parity information for data flowing in the opposite
direction.
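The cache dimensions and the per-byte parity described above can be checked with a short sketch. The use of even parity is an assumption made for the example, since the parity sense is not stated here, and the function names are hypothetical.

```python
WORDS = 32 * 1024                         # "32,000" words is 32 x 1024
BYTES_PER_WORD = 4                        # four information bytes per word
BITS_PER_WORD = BYTES_PER_WORD * (8 + 1)  # each byte carries one parity bit
CACHE_BYTES = WORDS * BYTES_PER_WORD      # information bytes in the cache


def parity_bit(byte):
    """Even parity (assumed): 1 when the byte has an odd number of 1 bits."""
    return bin(byte & 0xFF).count("1") % 2


def add_parity(data):
    """Attach a parity bit to each byte, as for data flowing into the cache."""
    return [(b, parity_bit(b)) for b in data]


def check_parity(word):
    """Verify byte parity, as for data passing from the cache to the local bus."""
    return all(parity_bit(b) == p for b, p in word)
```

With these values, `CACHE_BYTES` equals 2^17 and `BITS_PER_WORD` equals 36, which agrees with the configuration stated above.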
In accordance with the invention cache memory 250 is a non-write through
cache memory. Most conventional cache memories are write-through memories;
that is, when information is written into a conventional cache memory, the
same information is also immediately written into the copy of the data
maintained in the system main memory. Write-through operation allows a
consistent copy of data to be maintained in one location. Unfortunately,
if a fault occurs before processing is finished, the only data copy may
have become modified during processing so that it is impossible to restart
processing on the original data. Contrary to conventional operation, the
inventive cache memory is non write-through so that modified data is
written only in the cache memory during processing. Therefore, if a fault
occurs during processing, the original data in main memory remains intact
so that processing can be restarted on the original data.
In particular, when a context switch or an overflow situation in which
cache memory 250 becomes full makes it necessary to write the modified
data into the system main memory, a special data location in the main
memory, containing among other things the identity of the user program
currently being executed, is updated to indicate that the data in cache
memory 250 is being written into a first area in main memory associated
with the program.
Special circuitry, which will hereinafter be described in detail, in the
processing element then writes all of the data in cache memory 250 which
has been modified by the user's program to the first area. At the end of
the writing operation, the status block in the main memory is again
updated to indicate that the first memory area has been updated and that a
second area is about to be updated. Subsequently, the same data is written
into the second area. The second storage operation is followed by a final
modification of the status block to indicate that the update of the second
area has also been completed.
Therefore, no matter when a fault occurs there remains a consistent set of
data in main memory and an associated address at which each user program
can be reinitiated. For example, if a processing element fails before it
first begins writing modified data blocks to the first memory area, the
data and starting address in the first memory area are exactly what they
were before the processing element began executing the program and the
data processing task can be restarted using the initial data in the first
memory area.
Alternatively, if the processing element fails before the writing operation
to the first area is complete, the status block in main memory indicates
that the writing operation has not been completed and a fault recovery
routine need only write the contents of the program's second area in