FIELD OF THE INVENTION
The present invention relates to a method and apparatus for supporting
error detection and error correction modes in a memory system, and more
particularly to a method and apparatus for supporting error detection and
error correction modes in a memory system that utilizes memory devices
which include at least one error bit, also called a "check bit", for each
byte of data.
DESCRIPTION OF THE RELATED ART
Today, the performance of microprocessors is increasing rapidly due to
rapidly advancing electronics technologies, so that processors require
higher memory bandwidth. The bandwidth of the commonly used dynamic random
access memory (DRAM) device is becoming insufficient to satisfy the needs
of system designers. Cache memory is commonly used to fill the bandwidth
gap, but its additional cost is prohibitive in cost sensitive systems,
such as consumer personal computers (PCs). New types of DRAM devices are
being proposed to help alleviate the memory bottleneck problem. These
proposed devices are intended to deliver high memory bandwidth by running
at very high clock speeds.
The extremely high clock frequency of these devices, together with the high
speeds of today's microprocessors, creates a system environment that is
increasingly noisy. Thus, data in these high speed systems becomes more
vulnerable to errors caused by transient electrical and electromagnetic
phenomena. Although a well-designed memory subsystem is extremely
reliable, even the best memory subsystem, especially one incorporating
fast DRAM devices, has the possibility of a memory device failure.
Operating memory devices at higher frequencies generally increases the
likelihood and frequency of memory failures or faults. Memory device
failures fall generally into two categories. The first category is a soft
error, which refers to those errors where data stored at a given memory
location change, but where subsequent accesses can store the correct data
to the same location with no more likelihood of returning incorrect data
than from any other location. Soft errors of this type are generally
caused by loss of charge in the DRAM cell. The second category of error is
a hard error, which refers to those errors in which data can no longer
reliably be stored at a given memory location. Either of these types of
errors can lead to catastrophic failure of the memory subsystem.
In an effort to minimize failures due to memory subsystem errors, various
error checking schemes have been developed to detect, and in some cases
correct, errors in data read from memory. The simplest of the error
checking schemes is parity. In a byte-wide parity system, one extra parity
bit is appended to every eight bits of data. For "even parity" systems,
the parity bit is set such that the total number of ones in the nine-bit
word is even. For "odd parity" systems, the parity bit is set to make the
total number of ones odd. When data is read from memory, if one of the
nine bits changes from one to zero or vice versa, the parity will be
incorrect and the error will be detected. This system is limited, however,
because there is no way to know which of the nine bits changed. Therefore,
single bit errors can only be detected, not corrected. Also, if two bits
change, the parity will again be correct and no error will be detected.
Parity therefore is capable of detecting only odd numbers of bit errors.
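By way of illustration only, the byte-wide parity scheme described above may
be sketched in Python as follows. The function names `parity_bit` and
`parity_ok`, and the placement of the parity bit in bit 8 of the nine-bit
word, are assumptions made for this sketch and are not taken from any
particular system.

```python
def parity_bit(byte: int, even: bool = True) -> int:
    """Return the parity bit to append to an 8-bit data value."""
    ones = bin(byte & 0xFF).count("1")
    bit = ones % 2                      # 1 if the data holds an odd number of ones
    return bit if even else bit ^ 1


def parity_ok(word9: int, even: bool = True) -> bool:
    """Check a 9-bit word (assumed layout: bit 8 = parity, bits 7..0 = data)."""
    ones = bin(word9 & 0x1FF).count("1")
    return (ones % 2 == 0) if even else (ones % 2 == 1)


data = 0b1011_0010                        # four ones, so the even-parity bit is 0
stored = (parity_bit(data) << 8) | data   # nine-bit word as written to memory

single = stored ^ 0b0_0000_0100           # one bit flipped on read
double = stored ^ 0b0_0001_0100           # two bits flipped on read

print(parity_ok(stored))   # True:  no error
print(parity_ok(single))   # False: single bit error detected, position unknown
print(parity_ok(double))   # True:  double bit error goes undetected
```

The last line shows the limitation noted above: an even number of flipped bits
leaves the parity unchanged.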
Other considerations, such as whether to use error checking and correction
(ECC), error detection code (EDC), such as parity, or byte mask circuitry,
have also influenced the design of systems. For consumer desktops, for
example, memory bandwidth is an important consideration for increased
system performance. Conversely, designers of high-end systems, such as
servers, are more inclined to design systems that include ECC or EDC to
ensure data integrity. While detection of errors is very useful, it also
is desirable to be able to correct certain errors. ECC techniques have
been developed that both detect and correct certain errors. Generally
speaking, the goal of ECC is to correct the largest possible number of
errors with the smallest possible overhead (in terms of extra error bits
and wait states) to the system.
The "Hamming codes" constitute one well-known class of ECCs and are widely
used for error control in digital communications and data storage systems.
The Hamming Codes are described in Lin et al., "Error Control Coding,
Fundamentals and Applications", Chapter 3 (1982). One subclass of the
Hamming codes that is particularly well-suited for memory subsystems
includes the single-error-correcting and double-error detecting (SEC-DED)
codes. In these codes, the check bits are generated prior to data being
written to memory using a parity-check matrix implemented in ECC hardware.
In the Hamming code for 72-bit words, which include 64 data bits, eight
error bits are generated. The error bits, or check bits, are then stored
in memory together with the data. When a memory read occurs, the ECC
hardware retrieves both the data and the corresponding check bits from
memory. The ECC hardware then applies the parity check matrix to the data
and the check bits, producing a code of "syndrome bits." If the syndrome
bits are all zeros, this indicates there are no errors. If the syndrome
bits contain ones, the data are possibly invalid. In the case of a single
bit error, the syndrome bits will indicate which bit is in error, and thus
allow correction by complementing the erroneous bit. In the case of
double bit errors, the error will be detected, but correction is not
possible. A description of the SEC-DED Hamming codes is found in Lin et
al., supra, Chapter 16. Table 1 illustrates the required number of error
bits for a given range of data bits.
TABLE 1
______________________________________
No. of Error Bits     Range of No.
Required              of Data Bits
______________________________________
3                     2-4
4                     5-11
5                     12-26
6                     27-57
7                     58-120
______________________________________
In general, for error correction to be accomplished successfully, the
relationship between the number of data bits, n, to be checked and the
number of error bits, k, associated with those n data bits is as follows:
$2^{k} - 1 - k \geq n$.
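The relationship above can be checked numerically. The following Python
sketch, whose helper names `min_check_bits` and `data_bit_range` are
hypothetical, searches for the smallest k satisfying the inequality and
recomputes the ranges listed in Table 1.

```python
def min_check_bits(n: int) -> int:
    """Smallest k such that 2**k - 1 - k >= n (the relationship stated above)."""
    k = 1
    while 2 ** k - 1 - k < n:
        k += 1
    return k


def data_bit_range(k: int) -> tuple[int, int]:
    """Range of data-bit counts for which k is the minimum number of error bits."""
    highest = 2 ** k - 1 - k
    lowest = (2 ** (k - 1) - 1 - (k - 1)) + 1   # one more than k-1 bits can cover
    return lowest, highest


for k in range(3, 8):
    low, high = data_bit_range(k)
    print(f"{k} error bits: {low}-{high} data bits")

print(min_check_bits(64))   # error bits needed for 64 data bits under this bound
```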
Another well-known ECC is the "Reed-Solomon code", which is widely used for
error correction by the compact disk industry. A detailed description of
this ECC is found in Hove et al., "Error Correction and Concealment in the
Compact Disc System", Philips Technical Review, Vol. 40 (1980), No. 6,
pages 166-172. The Reed-Solomon code is able to correct two errors per
code word. Other conventional ECC techniques include: the b-adjacent error
correction code described in Bossen, "b-Adjacent Error Correction", IBM J.
Res. Develop., pp. 402-408 (July, 1970), and the odd weight column codes
described in Hsiao, "A Class of Optimal Minimal Odd Weight Column SEC-DED
Codes", IBM J. Res. Develop., pp. 395-400 (July, 1970). The Hsiao codes,
like the Hamming codes, are capable of detecting double bit errors and
correcting single bit errors. The Hsiao codes use the same number of check
bits as the Hamming codes (e.g., eight check bits for 64 bits of data),
but are superior in that hardware implementation is simplified and the
speed of error detection is improved.
It is desired to provide a memory system and method that takes advantage of
increased speed and performance of memory devices and memory subsystems
while also providing error detection and/or correction.
SUMMARY OF THE INVENTION
A memory system according to the present invention for performing error
detection and correction includes at least one memory device that stores a
plurality of data words, where each data word has a plurality of data bits
and at least one associated check bit. The memory system further includes
burst circuitry that reads a plurality of data words in multiple cycles
into a block word to include a sufficient number of check bits to perform
detection of double bit errors and correction of single bit errors, and
error logic that receives and performs error detection and correction upon
the block word.
In the embodiments illustrated herein, the block word is 72-bits including
64 bits of data and eight (8) check bits. The 72-bit block word is formed
by grouping smaller data words retrieved from the memory device. For a
9-bit device with eight data bits and one check bit, eight burst cycles
may be used to retrieve a 72-bit data block. Similarly, for 18-bit devices
with 16 data bits and two check bits, four burst cycles may be used to
retrieve the data block and for 36-bit devices with 32 data bits and four
check bits, two burst cycles may be used to retrieve the data block. In
general, each data word includes "n" data bits and "k" check bits, and the
burst circuitry groups "p" data words to form a block word having
$p \cdot n$ data bits and $p \cdot k$ error bits for a total of
$p \cdot n + p \cdot k$ bits, where
$2^{p \cdot k} - 1 - p \cdot k \geq p \cdot n$. In a first illustrated 9-bit
case, n=8, k=1 and p=8. For an 18-bit case, n=16, k=2 and p=4, and for a
36-bit case, n=32, k=4 and p=2.
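As a worked check, offered only as an illustration, all three illustrated
cases yield the same block word dimensions, and eight check bits satisfy the
error correction relationship $2^{k} - 1 - k \geq n$ when it is applied to
the block word as a whole:

```latex
\begin{align*}
p \cdot k &= 8 \cdot 1 = 4 \cdot 2 = 2 \cdot 4 = 8, \\
p \cdot n &= 8 \cdot 8 = 4 \cdot 16 = 2 \cdot 32 = 64, \\
2^{p \cdot k} - 1 - p \cdot k &= 2^{8} - 1 - 8 = 247 \geq 64 = p \cdot n .
\end{align*}
```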
The error logic preferably generates a syndrome code from a parity matrix
and uses the syndrome code to correct single bit errors and to detect
double bit errors. The error logic may use a parity-check matrix and a
corresponding syndrome table for error detection and/or correction. The
memory devices may be one or more burst dynamic random access memory
(DRAM) devices or the like, although any type of memory may be used
depending upon the memory controller.
A memory system according to the present invention takes advantage of
increased speed and performance of newer memory devices and memory
subsystems while also providing error detection and/or correction for data
integrity.
BRIEF DESCRIPTION OF THE DRAWINGS
A better understanding of the present invention can be obtained when the
following detailed description of the preferred embodiment is considered
in conjunction with the following drawings, in which:
FIG. 1 illustrates a computer system incorporating a memory system which
includes a burst DRAM array and ECC logic in accordance with an embodiment
of the present invention;
FIG. 2 is a block diagram of an embodiment of the memory system of FIG. 1
which incorporates a memory controller in accordance with an embodiment of
the present invention;
FIG. 3 illustrates an embodiment of a data distribution representing a
72-bit word including eight 9-bit data words;
FIG. 4 illustrates an embodiment of a 72-bit wide ECC encoded data word in
accordance with the embodiment of FIG. 3;
FIG. 5 is a parity-check matrix in accordance with a 9-bit DRAM embodiment
according to the present invention;
FIG. 6 is a syndrome table in accordance with the embodiment of FIG. 5;
FIG. 7 illustrates an embodiment of a data distribution representing a
72-bit word which includes four 18-bit wide data words;
FIG. 8 is a parity-check matrix in accordance with an 18-bit DRAM
embodiment according to the present invention;
FIG. 9 is a syndrome table in accordance with the embodiment of FIG. 8;
FIG. 10 illustrates an embodiment of a data distribution representing a
72-bit word which includes two 36-bit wide data words;
FIG. 11 is a parity-check matrix in accordance with a 36-bit DRAM
embodiment according to the present invention;
FIG. 12 is a syndrome table in accordance with the embodiment of FIG. 11;
FIG. 13 illustrates an embodiment of a 72-bit wide ECC encoded data word in
accordance with the embodiment of FIG. 7; and
FIG. 14 illustrates an embodiment of a 72-bit wide ECC encoded data word in
accordance with the embodiment of FIG. 10.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
New types of DRAM devices, such as Rambus and synchronous DRAMs (SDRAMs),
are being proposed to help reduce the memory bottleneck faced by system
designers. These "burst DRAM" devices are typically programmable and
deliver high memory bandwidth by running at a very high clock frequency,
such as 200 megahertz (MHz) or greater. An example of such a programmable
burst cycle device is a SLDRAM, which is a DRAM configured according to
the SyncLink standard for burst DRAM communication promulgated in the
publication "Draft Standard for A High-Speed Memory Interface (SyncLink)",
copyright 1996, by the Institute of Electrical and Electronics Engineers,
Inc. (IEEE). A SLDRAM typically operates in a packet oriented protocol
such that it is programmable for a burst of either 4 or 8 clock cycles of
data being transferred in any read or write operation cycle.
FIG. 1 is a block diagram representation of a computer system 20 employing
burst dynamic random access memory (DRAM) devices and error correction
code (ECC) logic in accordance with an embodiment of the present
invention. Each DRAM includes n data bits and k extra error bits
associated with the n data bits. The system comprises a CPU 55, which may
be the P54C processor available from Intel, or may be any one of several
microprocessors and supporting external circuitry typically used in PCs,
such as the 80386, 80486, Pentium.TM., Pentium II.TM., etc.
microprocessors from Intel Corp. The external circuitry preferably
includes an external or level two (L2) cache or the like (not shown). A
companion CPU socket 57 is also provided for dual processor
implementations, and may be the P54CM socket available from Intel. Both
CPU 55 and companion CPU socket 57 communicate with a debug port 59 via a
debug signal line 61 and with an interrupt controller 63 via a
non-maskable interrupt (NMI) line 65.
On-board cache 67 is provided, preferably a 512K L2 cache, that
communicates with CPU 55 via processor data (PD) bus 69 and processor
address (PA) bus 71. The PD bus 69 is 72 bits wide, and is also applied to
data memory control (DMC) 73. Cache and memory control (CMC) 75
communicates with PA bus 71 and also with processor control bus 77. CMC 75
provides DMC control line 79 to DMC 73 and also memory control line 81 to
ECC logic 49. A basecamp processor connector 83 provides a peripheral
interface for both the peripheral component interconnect (PCI) bus 85 and
APIC bus 87. Additional control signals are provided from the CMC 75 to
the connector 83 via sideband signal line 89. A clock 91 provides system
timing. An input device 12 such as a keyboard, a display device 14 such as
a monitor, and an external memory storage device 9 such as a disk drive,
are coupled to PCI bus 85. The keyboard 12, display device 14 and the
memory storage device 9 may be coupled in any desired fashion as known to
those skilled in the art of computer system design and operation.
ECC logic 49 and Burst DRAM application specific integrated circuit (ASIC)
Cell (BDAC) 47 are connected via a 72-bit wide bus. The bus connecting
burst DRAM ASIC cell (BDAC) 47 and burst DRAM array 93 is "n" bits wide,
where n is any positive integer. Preferably, n is a multiple of nine (9)
bits, such as 9, 18, 36, depending on the system configuration. Burst DRAM
array 93 includes one or more burst DRAMs (shown in FIG. 2). Although BDAC
47, ECC logic 49, and DMC 73 are shown as separate functional elements in
the block diagram of FIG. 1, they need not necessarily be separate
physical elements. Each of these elements may be implemented in one or
more ASICs depending upon the size of available ASIC packaging. Various
other alternatives will be apparent to the skilled artisan. Accordingly,
it is clear that the particular configuration of these elements is not
crucial to the invention.
FIG. 2 illustrates a memory subsystem 90 for use in the computer system 20,
where the memory subsystem 90 includes one or more burst DRAM devices 43
implementing the burst DRAM array 93. The DRAM devices 43 communicate with
a memory controller 41 via memory bus 45 of n bits wide, where n equals
the bit width of each of the burst DRAM devices 43. In one embodiment, the
burst DRAM devices 43 are 9-bit devices, where n=9 corresponding to an
8-bit data byte and one error bit, so that the data path to each burst
DRAM device 43 is nine bits wide. The memory controller 41 may be of any
type known in the art, and is generic to the CMC 75 and the DMC 73 of
FIG. 1. In the embodiment shown, the memory controller 41 incorporates the
BDAC 47 and the ECC logic 49. The BDAC 47 groups multiple 9-bit bursts in
a data format conversion operation, converting the 9-bit wide high speed
data output from each burst DRAM 43 into larger block words, each
typically a 72-bit block word which includes 64 data bits and 8 parity
bits. These larger 72-bit block words are read by memory controller 41 and
CPU 55. A simplified interface 56 is shown representing the various buses
and devices between the CPU 55 and the memory controller 41.
The ECC logic block 49 performs ECC manipulation on the 72-bit words, which
include grouped 9-bit data bursts from the burst DRAM devices 43. A single
9-bit word from a DRAM 43 has eight data bits and one error bit, whereas
four error bits would be required to perform an ECC operation on eight data
bits. As explained above and indicated in Table 1, ECC
techniques require a certain number of error bits for a certain number of
associated data bits. Grouping two 9-bit words together would result in 16
bits of data with two error bits, which is still not enough to perform an
ECC operation on 16 bits of data (five error bits needed).
It is convenient to group an eight cycle burst which includes eight 9-bit
words into a larger 72-bit word to take advantage of the fact that an ECC
operation can be performed on 64 data bits using a minimum of seven error
bits. Since each of the eight 9-bit words includes an associated error
bit, the 72-bit word formed includes eight error bits, which is sufficient
to perform an ECC operation. Any number of 9-bit words greater than five
can be bursted to achieve a sufficient number of available error bits for
the corresponding data to perform an ECC operation. For an embodiment
using 9-bit DRAMs 43, a 72-bit block word is formed in eight burst cycles.
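The minimum burst length implied by this reasoning can be found by a short
search. The Python sketch below is illustrative only; the function name
`min_words` is hypothetical, and the test applied is the relationship
between data bits and error bits given with Table 1, evaluated on the
grouped word.

```python
def min_words(n: int, k: int) -> int:
    """Smallest number p of (n + k)-bit words whose p*k error bits cover p*n data bits.

    Uses the bound 2**(p*k) - 1 - p*k >= p*n on the grouped word.
    """
    p = 1
    while 2 ** (p * k) - 1 - p * k < p * n:
        p += 1
    return p


print(min_words(8, 1))    # 9-bit words: more than five words are needed
print(min_words(16, 2))   # 18-bit words
print(min_words(32, 4))   # 36-bit words
```

Under this bound, six 9-bit words would already suffice; grouping eight gives
the convenient 72-bit word with 64 data bits.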
For an embodiment using 18-bit DRAMs 43, a 72-bit block word is formed in
four burst cycles. To accomplish an ECC operation, four 18-bit words are
bursted from the DRAMs 43 and grouped to form a 72-bit block word. Since
each of the four 18-bit words includes two associated error bits, the
72-bit word formed includes 64 data bits and eight error bits, which is
sufficient to perform an ECC operation. Thus, if 18-bit DRAMs are used in
a memory system, four 18-bit words are packaged together by BDAC 47 of
memory controller 41 to form the 72-bit block word processed in the ECC
operation.
Likewise, in another embodiment using 36-bit DRAMs 43, each 36-bit word
includes 32 data bits and four error bits. Two 36-bit words are bursted
and grouped together by BDAC 47 to form a 72-bit block word which is to
undergo the ECC operation in the ECC logic 49. The 72-bit block word would
again include eight error bits, which is sufficient for an ECC operation
on the resulting 64 data bits included in the 72-bit word. If it is
desired to always burst in cycles of at least four, then four 36-bit words
are bursted and grouped together by the BDAC 47 to form two 72-bit block
words or one 144-bit word, where the ECC logic 49 performs ECC operations
on the entire 144-bit word or separately on the two 72-bit words.
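The grouping performed by the BDAC 47 may be pictured with the following
Python sketch. It is not the circuit of the invention: the function names
`group_burst` and `split_block` are hypothetical, and the bit layout (data
in the low n bits of each word, check bits above them, first burst word
least significant) is an assumption, since the actual distribution is
defined by the figures.

```python
def group_burst(words: list[int], n: int, k: int) -> tuple[int, int]:
    """Group p burst words of (n + k) bits into (data, check) integers.

    Assumed layout: low n bits of each word are data, the next k bits are
    check bits, and the first word of the burst is least significant.
    """
    data, check = 0, 0
    for i, word in enumerate(words):
        data |= (word & ((1 << n) - 1)) << (i * n)
        check |= ((word >> n) & ((1 << k) - 1)) << (i * k)
    return data, check


def split_block(data: int, check: int, n: int, k: int, p: int) -> list[int]:
    """Inverse of group_burst: split a block word back into p burst words."""
    words = []
    for i in range(p):
        d = (data >> (i * n)) & ((1 << n) - 1)
        c = (check >> (i * k)) & ((1 << k) - 1)
        words.append((c << n) | d)
    return words


# Eight 9-bit words, each with its check bit set and data byte equal to its index.
burst9 = [0x100 | i for i in range(8)]
data, check = group_burst(burst9, n=8, k=1)
print(hex(data), hex(check))                      # 64 data bits, 8 check bits

# Two 36-bit words group into the same 64 + 8 bit block format.
burst36 = [(0xA << 32) | 0x01234567, (0x5 << 32) | 0x89ABCDEF]
data36, check36 = group_burst(burst36, n=32, k=4)
print(hex(data36), hex(check36))
```

The same two functions serve the 18-bit case with n=16, k=2 and four words.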
ECC logic 49 encodes data sent from the CPU 55 to be stored in burst DRAM
43 with an optimized ECC shown in FIG. 5, and described in more detail
below. The result is a 72-bit block word 51, of which 64 bits are data and
eight bits are error bits, or check bits. BDAC 47 converts this 72-bit
word into a 9-bit wide data stream that is then communicated to one or
more of the burst DRAMs 43. An illustration of the 72-bit block word 51
within the burst data stream from 9-bit DRAMs is provided in FIG. 3. FIG.
4 illustrates re-conversion of the word 51 by BDAC 47 during a read
operation into a 72-bit wide word 53 applied to ECC logic 49. Similarly,
an illustration of a 72-bit word 54 within the burst data stream from
18-bit DRAMs 43 is provided in FIG. 7. FIG. 13 illustrates re-conversion
of the word 54 by BDAC 47 during a read operation into a 72-bit wide word
56 applied to ECC logic 49. An illustration of a 72-bit word 58 within the
burst data stream for 36-bit DRAMs is provided in FIG. 10. FIG. 14
illustrates re-conversion of the word 58 by BDAC 47 during a read
operation into a 72-bit wide word 60 applied to ECC logic 49.
FIG. 5 illustrates a parity-check matrix for an error correction code for
an embodiment using 9-bit DRAMs 43 in accordance with the present
invention. Reliability data for traditional DRAMs indicate that memory
failures are most often caused by a single bit (soft or hard) error, or by
the failure of a single DRAM device. The error correction code
in accordance with the invention has been designed to maximize memory
protection for the data bits, and has been optimized for use with burst
DRAM devices. The memory overhead of having this burst DRAM ECC is the
same as for byte parity and other ECC schemes such as the (72-bit word,
64-bit data) Hsiao codes. Thus, there is no additional memory cost. In
addition, the ECC in accordance with the invention is able to correct
random single bit errors and detect random double bit errors.
To generate the error bits for an associated 64 data bits, an exclusive OR
(XOR) operation is performed by ECC logic 49 on the data bits in the
locations of each row in the matrix of FIG. 5 having ones. Implementation
of the XOR function within an ASIC using Boolean logic is straightforward,
and will not be described herein in detail. Performing this operation for
each of the eight rows produces the eight error bits. The 64 bits of data
and eight error bits are then written to burst DRAM array 93 via BDAC 47.
When a read operation occurs, the 64 data bits and eight error bits are
read from the burst DRAM array 93, and the XOR operation is again
performed on each row, this time including all 72 bits. By performing this
operation on each of the eight rows of the matrix of FIG. 5, eight
syndrome bits S(0-7) are produced, which essentially form the unique error
code for each bit location. Similarly, for embodiments using 18-bit and
36-bit DRAMs 43, FIGS. 8 and 11, respectively, illustrate a parity-check
matrix for an error correction code for the 18-bit and 36-bit DRAMs of the
DRAM array 93, respectively, in accordance with the present invention.
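The row-wise XOR operation described above may be sketched in Python as
follows. The 8-row matrices of FIGS. 5, 8 and 11 are not reproduced in this
text, so a small 4-row matrix protecting four data bits with four check bits
stands in for them; that matrix, its odd-weight columns, and the names
`ROWS`, `check_bits` and `syndrome` are assumptions of this sketch only.

```python
# Data-bit portion of a small stand-in parity-check matrix.
# ROWS[r] has bit j set when data bit j participates in row r.
# Each data bit appears in exactly three rows (odd-weight columns).
ROWS = [0b0111, 0b1011, 0b1101, 0b1110]


def xor_of_bits(value: int) -> int:
    """XOR together all bits of value (1 if an odd number of ones)."""
    return bin(value).count("1") & 1


def check_bits(data: int) -> int:
    """Write path: XOR the data bits selected by each row to get one check bit per row."""
    result = 0
    for r, mask in enumerate(ROWS):
        result |= xor_of_bits(data & mask) << r
    return result


def syndrome(data: int, check: int) -> int:
    """Read path: XOR each row over the data bits and that row's stored check bit."""
    return check_bits(data) ^ check


data = 0b1010
check = check_bits(data)                         # stored alongside the data
print(format(check, "04b"))                      # 0101
print(format(syndrome(data, check), "04b"))      # 0000: no error
print(format(syndrome(data ^ 0b0010, check), "04b"))  # 1011: column of data bit 1
```

A nonzero syndrome equals the matrix column of the flipped bit, which is what
allows a single bit error to be located.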
FIG. 6 illustrates a syndrome table corresponding to the matrix of FIG. 5
for the 9-bit DRAM embodiment according to the present invention. The
left-hand number in each entry is a hexadecimal number (h) of the eight
syndrome bits, and the right-hand value is the type of error and the bit
in error, if known. As indicated in the upper left hand corner of the
table of FIG. 6, if all eight syndrome bits are zero (00h) then no
error is detected. If the syndrome bits contain any ones, either a single
data bit error (DB), a single check bit error (CB) or a multiple bit error
(UNCER) is detected. As is
apparent from FIG. 6, if the error is a single bit error, i.e., either a
single data bit error or a single check bit error, the particular bit in
error is identified by the syndrome table and can be corrected by
complementing the indicated bit using known methods. For multiple bit
errors, an error is detected but the particular bits in error are uncertain
(UNCER) and therefore are not corrected. Preferably, a report of all
uncorrectable errors is
made to the operating system or appropriate system software to provide
notice of the corrupt data so the users or system service personnel can
isolate the faults more easily, thus minimizing system down time. In
accordance with the invention, any error that is detected or corrected can
be reported to the system software or stored in a special archive for
later use during servicing.
FIG. 8 illustrates a parity-check matrix for an error correction code for
an embodiment using 18-bit DRAMs 43, and FIG. 9 illustrates a syndrome
table corresponding to the matrix of FIG. 8. Similarly, FIG. 11
illustrates a parity-check matrix for an error correction code for an
embodiment using 36-bit DRAMs 43, and FIG. 12 illustrates a syndrome table
corresponding to the matrix of FIG. 11.
In general, for error correction to be accomplished successfully, the
relationship between the number of data bits, m, to be checked and the
number of error bits, k, associated with those m data bits is as follows:
$2^{k} - 1 - k \geq m$. If multiple (p) burst data words are grouped to
form a block word on which ECC operations are performed, the block word
has $p \cdot m$ data bits and $p \cdot k$ error bits for a total of
$p \cdot m + p \cdot k$ bits, where the $\cdot$ symbol denotes
multiplication. Thus, the relationship between the data bits and the error
bits for the block word is as follows:
$2^{p \cdot k} - 1 - p \cdot k \geq p \cdot m$. The present invention is
applicable, in general, to any size of burst memory device with any number
of check bits. To illustrate a non-conventional example, an 11-bit memory
with 10 data bits and 1 check bit could be used. In this 11-bit case,
seven burst cycles are used to collect a 77-bit word with 70 data bits and
7 check bits, where p=7, k=1 and m=10. Although the present invention is
illustrated using DRAM devices, it is understood that any type of
programmable memory device may be used since the present invention is not
limited to any particular memory technology.
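As a worked check of the 11-bit example, offered only as an illustration:
with ten data bits and one check bit per word, six burst cycles fall short
of the relationship above, while seven satisfy it.

```latex
\begin{align*}
p = 6:&\quad 2^{6} - 1 - 6 = 57 < 60 = 6 \cdot 10, \\
p = 7:&\quad 2^{7} - 1 - 7 = 120 \geq 70 = 7 \cdot 10 .
\end{align*}
```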
It can now be appreciated that a system and method according to the present
invention takes advantage of increased speed and performance of proposed
burst memory devices and memory subsystems while also providing error
detection and/or correction. The present invention was illustrated in a
memory subsystem embodiment for a computer system. It is understood,
however, that the present invention may be used in any memory system which
uses error detection and error correction operations to ensure data
integrity.
Although a system and method according to the present invention has been
described in connection with the preferred embodiment, it is not intended
to be limited to the specific form set forth herein, but on the contrary,
it is intended to cover such alternatives, modifications, and equivalents,
as can be reasonably included within the spirit and scope of the invention
as defined by the appended claims.
* * * * *