Description
FIELD OF THE INVENTION
The present invention is generally directed to data communication systems
using optical fibers to carry information. More particularly, the present
invention is directed to reducing the timeout penalties under certain
recovery situations.
BACKGROUND OF THE INVENTION
Fiber optics have enabled the transmission of long strings of data in a
serial fashion from a driver to a receiver over long distances (typically
measured in kilometers) at very high data rates (typically specified in
billions of bits per second). This is in contrast to more traditional
communication over electrical wires, which allows data to be transmitted
only over relatively short distances at these high data rates. Distances
for communication over wire are typically in the range of several tens of
meters.
Fiber optic data transmission is, however, inherently noisy, in that bit
errors in the data are frequent. Error rates of one in a trillion or even
one in a billion bits are common. Various checking methods, including
cyclic redundancy codes, are used to detect these errors.
Several methods have been suggested for recovering data that has been
transmitted but received with error indications. One method is to employ a
high level protocol that keeps track of the time from request to response.
Each time a request is sent from one side of the fiber optic link to the
other, the sender starts a timer. If a response is not received within a
specified time, the sender "assumes" that either the request or the
response was lost. The sender then requests status from the other end of
the link to determine if the request should be resent.
Another method of recovering data received in error is to package the data
into frames and to assign a sequence number to each frame. If a receiver
detects a frame with a sequence number that is out of order, it assumes
that one or more frames were lost. Using the sequence number of the last
correctly received frame, the receiver then requests the lost frames to be
retransmitted. This method allows multiple requests and responses to be on
the link at the same time, thus improving link utilization.
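The sequence-number method can be sketched as follows. This is a minimal illustration of the background technique, not part of the invention: the class name `SequencedReceiver`, the modulo-8 numbering, and the choice not to buffer the out-of-order frame are all assumptions. It also shows the weakness discussed below: a lost frame is only noticed when a later frame arrives.

```python
from typing import List


class SequencedReceiver:
    """Hypothetical receiver that detects lost frames from sequence numbers.

    Frames are numbered modulo `modulus`. This sketch does not keep the
    out-of-order frame, so it expects that frame to be sent again as well.
    """

    def __init__(self, modulus: int = 8) -> None:
        self.modulus = modulus
        self.last_good = modulus - 1  # so that frame 0 is expected first

    def receive(self, seq: int) -> List[int]:
        """Accept frame `seq`; return the sequence numbers to be resent."""
        expected = (self.last_good + 1) % self.modulus
        if seq == expected:
            self.last_good = seq
            return []
        # Frames expected .. seq-1 never arrived. The loss is visible only
        # because a later frame (seq) was received.
        gap = (seq - expected) % self.modulus
        return [(expected + i) % self.modulus for i in range(gap)]
```

Here `receive` returns the frames that follow the last correctly received one, which is what the receiver would ask the sender to retransmit.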
Still another method of determining the frame in error is to use separate
checking fields for the header and the information field. Thus if the
information field is in error, the chances are that the header identifying
the frame is still error free. Using the frame header information, the
detecting end of the link can request the frame to be retransmitted.
With either of the last two methods described above, it is still possible that a frame
is lost and that a request never completes. Sequence numbers require
subsequent frames for the detection of the frame in error, and using
separate checking fields for the frame header does not guarantee that the
frame header itself is not in error. Because of these shortcomings, it is
usual practice to time outstanding requests to detect missing responses.
These timers detect both the damaged frames and unusually long response
times in the system. The values used for these timers are rather large so
that normal delays in processing a request do not cause the timers to
expire too often. On the other hand, long timers may further delay
subsequent recovery actions when the frame cannot be resent.
Thus what is needed is a mechanism by which the delays in the higher level
recovery procedures due to long timeouts are avoided when the frame cannot
be resent. This is accomplished by using a much shorter timer that spans
only the low level recovery action.
SUMMARY OF THE INVENTION
The present invention is embodied in a system and method for asynchronously
transmitting data blocks between two information handling systems. Two
carriers are used to interconnect two systems and to provide serial data
transmission in both directions. Together, these carriers and the
supporting hardware are called a link. Information frames are provided as a
mechanism to transmit data serially on the fiber. A frame contains all or
part of the contents of: a request; a response; or a data area.
The protocol for data transmission includes a request sent from one
information handling system to another, followed by optional data areas
transmitted in either direction, followed by a response sent back to the
requesting system. Each sequence of "request, data, and response" is
called an operation, and multiple operations may be interleaved over the
link. For each operation, strict ordering of the request, data, and
response is maintained. The system maintains state information on the
progress of each operation that it has originated or to which it is
responding. To maintain the state information, the frame headers of the
requests and responses include an indicator (the A bit) informing the
receiver that data areas are associated with either the request or
response.
A collection of timers is used to time each individual operation. The
timers are initialized to a predetermined value and count down. A timer is
initialized and started when a request is sent, and it is stopped when the
response is received. If the timer reaches zero before the response is
received, a higher level recovery procedure is invoked.
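A minimal sketch of such a timer collection follows, assuming one timer per buffer set and a simulated clock advanced by explicit `tick` calls. The names `OperationTimers`, `request_sent`, `response_received` and `on_expire` are hypothetical, and `on_expire` merely stands in for the higher level recovery procedure.

```python
from typing import Callable, Dict


class OperationTimers:
    """Hypothetical collection of countdown timers, one per buffer set."""

    def __init__(self, initial_ms: int, on_expire: Callable[[int], None]) -> None:
        self.initial_ms = initial_ms
        self.on_expire = on_expire            # stands in for higher level recovery
        self.remaining: Dict[int, int] = {}   # buffer set -> milliseconds left

    def request_sent(self, buffer_set: int) -> None:
        self.remaining[buffer_set] = self.initial_ms   # initialize and start

    def response_received(self, buffer_set: int) -> None:
        self.remaining.pop(buffer_set, None)           # stop

    def tick(self, elapsed_ms: int) -> None:
        """Count every running timer down; report the ones that reach zero."""
        for buffer_set in list(self.remaining):
            self.remaining[buffer_set] -= elapsed_ms
            if self.remaining[buffer_set] <= 0:
                del self.remaining[buffer_set]
                self.on_expire(buffer_set)
```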
According to an embodiment of the present invention, when a frame that has
been damaged by a transmission error is received, a request for
retransmission is sent to the other end of the link. If the operation is a
read and if the damaged frame contains a data area, the timer for this
operation is reinitialized to a smaller predetermined value before the
request for retransmission is sent. If the data area is received before
the timer expires, the operation completes normally. If the data area is
not received before the timer expires, a higher level recovery procedure
is invoked.
BRIEF DESCRIPTION OF THE DRAWINGS
The subject matter which is regarded as the invention is particularly
pointed out and distinctly claimed in the concluding portion of the
specification. The invention, however, both as to organization and method
of practice, together with the further objects and advantages thereof, may
best be understood by reference to the following description taken in
connection with the accompanying drawings in which:
FIG. 1 is a block diagram illustrating a physical link between two
information handling elements;
FIG. 2 is a block diagram illustrating a multimessage channel buffer
structure;
FIG. 3 illustrates the format of an exemplary frame, particularly showing
the presence of a link control word;
FIG. 4 illustrates the contents of the Link Control word;
FIG. 5 illustrates sequences of command, data and response exchanged
between two computing elements or systems;
FIG. 6 illustrates additional sequences of command, data and response
exchanges where multiple data transfers are performed;
FIG. 7 illustrates an exchange sequence where one of the frames has a
transmission error;
FIG. 8 illustrates the frame reception state table for the originator of an
operation;
FIG. 9 is a list of originator actions;
FIG. 10 is a list of originator events;
FIG. 11 is a list of originator states;
FIG. 12 illustrates the frame reception state table for the recipient of an
operation;
FIG. 13 is a list of recipient actions;
FIG. 14 is a list of recipient events;
FIG. 15 is a list of recipient states.
DETAILED DESCRIPTION OF THE INVENTION
Turning first to FIG. 1, a physical link between two computing elements
102, 104 is illustrated. These elements could be, for example, two
computers or a computer and a shared memory device or in general, any
information handling system. In any event, computing elements 102, 104 are
connected by way of intersystem channel link 106 including fiber optic
cable link 108. Fiber optic link 108 is formed with optical fiber cable
pair 110. A fiber cable pair includes two optical fibers, one for
transmitting information and one for receiving information. Fiber cable
pair 110 is coupled to computing elements 102, 104 by way of transceivers
(TCVR) 112 and 114 located at opposite ends of the link. Each of the
transceivers 112 and 114 includes a transmitter unit and a receiver unit.
All of the data traffic over fiber optic cable link 108 supports message
passing between the computing elements 102 and 104. A typical message is a
request sent from computing element 102 to computing element 104. Data may
be associated with the request and is either sent from computing element
102 to computing element 104 (a write operation) or from computing element
104 to computing element 102 (a read operation). After the data is
transferred, a response is sent from computing element 104 to computing
element 102. The messages, consisting of requests, data and responses, are
stored in buffers located in both computing elements as shown. To transfer
a request, data, or a response, transmit buffers 116 and 122 and receive
buffers 118 and 120 are employed in computing elements 102 and 104. It
should be understood that transmit buffers 116 and 122 may be located
anywhere in the transmitting computing elements 102 and 104, including in
main processor storage. It should be further understood that the receive
buffers 118 and 120 should at all times be immediately accessible by
transceivers 112 and 114. Therefore, receive buffers 118 and 120 are
usually implemented as arrays dedicated to the link, and they are not in
main processor storage where access is shared among many different
elements within computing elements 102 and 104.
Processing a complete message with data requires buffers in both computing
elements 102, 104. The computing element that initiates the message is the
originator, and the computing element that processes the message is the
recipient. FIG. 2 illustrates a typical situation in which there are
multiple buffers on both sides of a link. For example, transmission of a
message with data from Channel A (202) to Channel B (204) requires Channel
A originator buffers shown in block 206 and the Channel B recipient
buffers shown in block 208. Each group of buffers, 206 and 208, is called a
"buffer set." When a message is sent, originator buffer request area 210
is loaded with the request, and the request is sent over the link to
recipient buffer request area 216. If data is to be transferred, it is
either sent from originator buffer data area 214 to recipient buffer data
area 220 for a write operation, or it is sent from recipient buffer data
area 220 to originator buffer data area 214 for a read operation. After
data transfer, if any, the response is loaded into the recipient buffer
response area 218 and sent across the link to originator buffer response
area 212 or vice versa depending on the information flow direction.
The information that is transferred from one side of the link to the other
side is contained in frames. This information is always targeted to a
particular buffer area, and the targeting information is contained in the
link-control word (see below) of the frame. This targeting information
allows the frames to be multiplexed over the link in any order. As an
example, returning to FIG. 2, Channel A could send a request for buffer
set 1 followed by write data for buffer set 0, followed by a response for
buffer set 0, etc. It should be understood that a computing element can
dynamically set up various numbers of originator and/or recipient buffers
depending on the number and type of links to be established.
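Because every frame names its target buffer set and area, a receiver can route interleaved frames without any ordering assumption across buffer sets. The sketch below is illustrative only: the `Frame` record and its string `area` labels are hypothetical stand-ins for the targeting fields of the link-control word.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, Dict, List, Tuple


@dataclass(frozen=True)
class Frame:
    buffer_set: int   # target buffer set, taken from the link-control word
    area: str         # "request", "data" or "response" (hypothetical labels)
    payload: bytes


def demultiplex(frames: List[Frame]) -> Dict[Tuple[int, str], bytes]:
    """Append each frame's payload to the buffer area it targets."""
    areas: DefaultDict[Tuple[int, str], bytearray] = defaultdict(bytearray)
    for frame in frames:
        areas[(frame.buffer_set, frame.area)] += frame.payload
    return {key: bytes(buf) for key, buf in areas.items()}
```

For instance, a request for buffer set 1, write data for buffer set 0, the rest of the request for buffer set 1, and a response for buffer set 0 all land in their own areas regardless of arrival order.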
The format of an exemplary frame is illustrated in FIG. 3. When no frames
are being transmitted, idle words 310 are continuously sent on the link.
When frames are transmitted, they start with a data word which is the
link-control word (LC) 302. Various fields in the link-control word 302
identify the frame format and type, designate a buffer set area, and
control the state of the transceiver and link. These fields are described
in detail below.
A link-control-CRC (cyclical redundancy check) word 304 preferably follows
link-control word 302. Link-control-CRC word 304 is conventionally
generated from values in the link-control word. Link-control-CRC word 304
is checked at the receiver to test the validity of the link-control word
in the incoming frame.
There are two types of frames: control frames and information frames.
Control frames do not have an information field. Control frames consist
only of a link-control word and a link-control-CRC word. An information
frame has link-control word 302, a link-control-CRC word 304 and an
information field 306. Information field 306 contains, for example, from
one to 1,024 words. Information field 306 contains the information sent
from a buffer set area at one end of the link to a buffer set area at the
other end of the link.
An information field is followed by information-field-CRC word 308.
Information-field-CRC word 308 is conventionally generated from the values
in the information field. Information-field-CRC word 308 is checked at the
receiver to test the validity of the information field in the incoming
frame.
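The frame layout can be illustrated with a small builder and parser. This sketch assumes 4-byte words and uses `zlib.crc32` purely as a stand-in; the actual CRC polynomial and word size used on the link are not specified here. It distinguishes control frames from information frames by length, whereas the link itself uses the format bit in the link-control word.

```python
import struct
import zlib
from typing import Optional, Tuple

WORD = 4  # assumed word size in bytes; not specified in the text


def crc_word(data: bytes) -> bytes:
    """Stand-in check word: zlib.crc32, not the link's actual CRC."""
    return struct.pack(">I", zlib.crc32(data))


def build_frame(lc: int, info: Optional[bytes] = None) -> bytes:
    """LC + LC-CRC, then (for information frames) info field + info CRC."""
    lc_bytes = struct.pack(">I", lc)
    frame = lc_bytes + crc_word(lc_bytes)
    if info is not None:
        words, rem = divmod(len(info), WORD)
        if rem or not 1 <= words <= 1024:
            raise ValueError("information field must be 1 to 1,024 whole words")
        frame += info + crc_word(info)
    return frame


def parse_frame(frame: bytes) -> Tuple[int, bool, Optional[bytes], bool]:
    """Return (lc, lc_ok, info, info_ok); info is None for control frames."""
    lc_bytes, lc_crc = frame[:WORD], frame[WORD:2 * WORD]
    lc = struct.unpack(">I", lc_bytes)[0]
    lc_ok = crc_word(lc_bytes) == lc_crc
    if len(frame) == 2 * WORD:  # control frame: no information field
        return lc, lc_ok, None, True
    info, info_crc = frame[2 * WORD:-WORD], frame[-WORD:]
    return lc, lc_ok, info, crc_word(info) == info_crc
```

Because the two CRC words are computed independently, a damaged information field still leaves the link-control word verifiable.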
FIG. 4 shows details of link-control word 302. Format bit 402 indicates
whether or not the frame contains an information field. Requests,
responses, and data frames all have an information field, while
acknowledgements and rejects do not have an information field. Type field
404 specifies that the frame is either a request, a response, or a data
frame. Buffer set number 406 specifies which buffer set is the target. "A
bit" 408 has two uses. In a request frame, "A bit" 408 is used to indicate
that data frame(s) are to follow (a write operation), and in a response
frame, the "A bit" is used to indicate that data frame(s) preceded the
response (a read operation). In a data frame, the "A bit" is used to
indicate that more data frame(s) are to follow.
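The link-control fields can be packed into a single word as sketched below. The bit positions, field widths and the numeric encoding of the type field are hypothetical, chosen only so that the fields do not overlap; the text does not specify them. The helper `a_bit_meaning` restates how the A bit is interpreted for each frame type.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, Tuple


class FrameType(IntEnum):
    """Hypothetical numeric encoding of type field 404."""
    REQUEST = 0
    RESPONSE = 1
    DATA = 2


@dataclass
class LinkControl:
    has_info: bool       # format bit 402
    ftype: FrameType     # type field 404
    buffer_set: int      # buffer set number 406
    a_bit: bool          # "A bit" 408
    start: bool          # start bit 410
    block_count: int     # block count 412 (valid only when start is set)


# Hypothetical layout within a 32-bit word: field -> (shift, width).
FIELDS: Dict[str, Tuple[int, int]] = {
    "has_info": (31, 1), "ftype": (29, 2), "buffer_set": (21, 8),
    "a_bit": (20, 1), "start": (19, 1), "block_count": (0, 16),
}


def pack(lc: LinkControl) -> int:
    word = 0
    for name, (shift, width) in FIELDS.items():
        value = int(getattr(lc, name))
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
    return word


def unpack(word: int) -> LinkControl:
    raw = {name: (word >> shift) & ((1 << width) - 1)
           for name, (shift, width) in FIELDS.items()}
    return LinkControl(bool(raw["has_info"]), FrameType(raw["ftype"]),
                       raw["buffer_set"], bool(raw["a_bit"]),
                       bool(raw["start"]), raw["block_count"])


def a_bit_meaning(lc: LinkControl) -> str:
    if lc.ftype is FrameType.REQUEST:
        return "data areas follow (write)" if lc.a_bit else "no data follows"
    if lc.ftype is FrameType.RESPONSE:
        return "data areas preceded (read)" if lc.a_bit else "no data preceded"
    return "more data areas follow" if lc.a_bit else "last data area"
```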
It should be kept in mind that, as used herein, the terms "data frame" and
"information frame" are not synonymous. Requests, responses and data
frames are all information frames (that is, frames with an information
field). A request is called a Message Command Block (MCB) and a response
is called a Message Response Block (MRB) (see FIGS. 5 and 6).
Information transferred to a particular buffer may be contained in more
than one frame or frame group. The first frame for a buffer area always
has Start bit 410 "ON", and this bit also indicates the validity of Block
Count 412. This count indicates the total number of 256 byte blocks that
are transferred to the buffer. This count does not indicate the length of
the presently transmitted frame. The transmitter can end the frame with
CRC word 308 on any 256-byte boundary of information field 306. When
the transmitter resumes the transfer to the buffer, it starts the new
frame with Start bit 410 in the link-control word reset to zero. The zero
value of the start bit indicates that this frame is a continuation of a
previous frame targeted to the same buffer. The receiver knows that all of
the information has been received when the number of 256-byte blocks
received, summed over all of the frames, equals Block Count 412, which was
transmitted in the link-control word of the first frame. A
buffer area can be transmitted by any number of frames from one to the
total number of 256 byte blocks. For example, a 1024 byte buffer area can
be transmitted in from one to four frames.
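The segmentation rule and the receiver's completion test can be sketched as follows, assuming a buffer area that is a whole number of 256-byte blocks. The function `segment` and the class `Reassembler` are hypothetical names; the block count is carried only with the first frame, as described above.

```python
from typing import List, Optional, Tuple

BLOCK = 256  # bytes per block counted by Block Count 412


def segment(area: bytes, blocks_per_frame: List[int]) -> List[Tuple[bool, int, bytes]]:
    """Split a buffer area into frames that end on 256-byte boundaries.

    Returns (start_bit, block_count, info) per frame. The block count is
    only meaningful in the first frame, where the start bit is set.
    """
    if len(area) % BLOCK:
        raise ValueError("area must be a whole number of 256-byte blocks")
    total = len(area) // BLOCK
    if not blocks_per_frame or min(blocks_per_frame) < 1 or sum(blocks_per_frame) != total:
        raise ValueError("frame sizes must be positive and cover the area exactly")
    frames: List[Tuple[bool, int, bytes]] = []
    offset = 0
    for i, n in enumerate(blocks_per_frame):
        first = i == 0
        frames.append((first, total if first else 0, area[offset:offset + n * BLOCK]))
        offset += n * BLOCK
    return frames


class Reassembler:
    """Receiver side: done when blocks received equal the first frame's count."""

    def __init__(self) -> None:
        self.expected: Optional[int] = None
        self.received = 0
        self.data = bytearray()

    def accept(self, start: bool, block_count: int, info: bytes) -> bool:
        if start:
            self.expected, self.received, self.data = block_count, 0, bytearray()
        elif self.expected is None:
            raise ValueError("continuation frame without a start frame")
        self.data += info
        self.received += len(info) // BLOCK
        return self.received == self.expected
```

For a 1,024-byte area, `segment(area, [1, 3])`, `segment(area, [2, 2])` and `segment(area, [1, 1, 1, 1])` are all valid splits.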
Although not provided with sequence numbers, each frame is interlocked with
subsequent and previous frames. The interlocking is accomplished by the
protocol on the link. For example, each message for a particular buffer
set starts with a request, is followed by data and is subsequently
followed by a response. Each of these types of transmissions has a unique
link-control word since each transmission is targeted to differing buffer
set areas. FIGS. 5 and 6 illustrate the protocols.
FIG. 5 shows three operational examples. The first example shows a request
and response with no data transfer. In this example, originator 502 sends
a request in a Message Command Block (MCB) (step 506). The LC 302 for
this frame has "A bit" 408 set to zero (indicated by the bar over the "A"
in the Figure) since there is no data to follow. After the request has
been processed, recipient 504 sends a response in Message Response Block
(MRB) (508). The LC word for this frame also has the "A bit" set to zero
since there was no data preceding the response.
The second example in FIG. 5 is a write of a single data area. In this
case, MCB (510) has the "A bit" set to "1" since there is at least one
data area to follow. After the MCB, the originator sends the data area
DATA (512). The "A bit" in this DATA frame is set to zero because there
are no more data areas to follow. After recipient 504 processes the
request and its associated data, it sends a response, in the form of MRB
(514). The "A bit" in the MRB is set to "0" since there was no data
preceding the response.
The third and last example in FIG. 5 is a read of a single data area. In
this case, MCB (516) has the "A bit" set to zero since there are no data
areas to follow. Recipient 504 processes the request and returns data area
DATA (518). The "A bit" in this DATA frame is set to "0" because there are
no more data areas to follow. After recipient 504 sends the DATA frame, it
sends response MRB (520). The "A bit" in this MRB is set to "1" since
there was at least one data area preceding the response.
FIG. 6 shows two examples of transferring multiple data areas. In the first
example, a write operation transferring two areas is performed by the
originator. As in the single data area example, MCB 602 and first data
area 604 are sent by the originator. First data area 604 has the "A bit"
set to "1" indicating that more data areas are to follow. The recipient
processes the first data area by moving it to main storage (or elsewhere),
thus freeing the buffer area for the receipt of the next data area. Next,
the recipient sends an acknowledge ACK (606) frame. This frame contains no
information field but the link control word identifies the buffer set. The
originator responds to the ACK frame by sending the next (and last) data
area DATA (608). The "A bit" in this DATA frame is set to "0" because
there are no more data areas to follow. After the recipient processes the
request and its associated data, it sends a response, an MRB 610. The MRB
has the "A bit" off as in the single data area write example.
The second example in FIG. 6 illustrates a read operation transferring two
data areas. The originator starts by sending MCB (612). The recipient
responds by returning data area DATA (614). The "A bit" in this DATA frame
is "ON" indicating that more data areas are to follow. After the
originator receives the data area and moves it to main storage (or
elsewhere), the buffer area is free for the receipt of the next data area.
The originator sends an acknowledge ACK frame (616). This ACK frame is
similar to the ACK frame 606 used in the write case. The recipient
responds to the ACK frame 616 by sending the next data frame DATA (618) to
the originator. The "A bit" in this DATA frame is off, indicating that this
is the last data area. After the recipient sends the DATA frame, it sends
response frame MRB (620). The "A bit" in this MRB is set to "1" since
there was at least one data area preceding the response.
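The frame sequences of FIGS. 5 and 6, including the A-bit values, can be generated mechanically. This is a sketch for illustration only: "O" and "R" denote the originator and recipient, and ACK frames are shown with `None` for the A bit because they carry no information field. The function name `operation_frames` is hypothetical.

```python
from typing import List, Optional, Tuple

FrameInfo = Tuple[str, str, Optional[int]]  # (sender, frame kind, A bit or None)


def operation_frames(kind: str, n_areas: int = 0) -> List[FrameInfo]:
    """Frames exchanged for one operation: "none", "write" or "read"."""
    if kind == "none":
        return [("O", "MCB", 0), ("R", "MRB", 0)]
    if kind not in ("read", "write") or n_areas < 1:
        raise ValueError("read/write operations need at least one data area")
    write = kind == "write"
    data_from, ack_from = ("O", "R") if write else ("R", "O")
    # In the MCB, A = 1 means data areas follow (write only).
    frames: List[FrameInfo] = [("O", "MCB", 1 if write else 0)]
    for i in range(n_areas):
        more = 1 if i < n_areas - 1 else 0   # in DATA, A = 1 means more follow
        frames.append((data_from, "DATA", more))
        if more:
            # The receiver of the data frees its buffer area, then sends ACK.
            frames.append((ack_from, "ACK", None))
    # In the MRB, A = 1 means data areas preceded it (read only).
    frames.append(("R", "MRB", 0 if write else 1))
    return frames
```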
It should be understood that while only one operation for a single buffer
set is shown in the examples of FIGS. 5 and 6, multiple buffer sets may be
using the link at the same time, and that the traffic on the link consists
of interleaved frames sent for the multiple buffer sets.
Returning to FIG. 3, it is noted that link control word 302 and information
field 306 have independent error checking capabilities. This checking is
provided by encoding of the data for serial transmission (preferably by
using the 8 bit/10 bit code described in U.S. Pat. No. 4,486,739), and the
CRC words 304 and 308. Transmission errors usually affect only a few bits
at a time, and it is unlikely that a transmission error would damage both
the LC and the information field in the same frame. If the link control
word is in error, the entire frame is considered lost since the receiver
does not know anything about the frame such as the frame type and the
buffer set number. If the link control word is not in error and only the
information field is in error, the link control word provides the receiver
of the damaged frame with enough information to ask the sender to
retransmit the damaged frame. Since the information field is usually much
longer than the LC, there is a higher chance that a transmission error
will affect the information field and not the LC. This means that most
transmission errors affecting a frame can be retried using the information
supplied by the link control word. This situation is illustrated in FIG. 7
which shows a write operation with an error in MCB (702). The recipient
detects the error and sends a Reject (REJ) 704 frame back to the
originator in which a request is made to retransmit the MCB. The
originator resends MCB (706).
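The receiver's choice between discarding and rejecting a damaged frame can be sketched as below. The enum and function names are hypothetical. The helper `recoverable_fraction` quantifies the argument above under a simplifying assumption: a single error that is equally likely to hit any word of the frame, with a two-word header (LC and LC-CRC) and an information field followed by one CRC word.

```python
from enum import Enum
from typing import Optional


class Disposition(Enum):
    ACCEPT = "accept"   # frame received without error
    LOST = "lost"       # LC damaged: type and buffer set are unknown
    REJECT = "reject"   # only the information field damaged: send REJ


def classify(lc_ok: bool, info_ok: Optional[bool]) -> Disposition:
    """Decide what to do with a received frame.

    `info_ok` is None for control frames, which have no information field.
    """
    if not lc_ok:
        return Disposition.LOST
    if info_ok is False:
        return Disposition.REJECT
    return Disposition.ACCEPT


def recoverable_fraction(info_words: int) -> float:
    """Share of single-word errors that leave the LC and LC-CRC intact."""
    header_words = 2                 # LC + LC-CRC
    tail_words = info_words + 1      # information field + its CRC word
    return tail_words / (header_words + tail_words)
```

With a 256-word information field, this gives 257/259, or roughly 99 percent, of single-word errors recoverable by a reject.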
The state tables used for receiving frames for the originator and recipient
are shown in FIGS. 8 and 12 respectively. Each buffer set maintains its
own individual state. The events are listed along the top (802 and 1202 of
the tables) and the originator and recipient events are described in FIGS.
10 and 14 respectively. These events represent activity on the inbound link. For
example, the receipt of an MRB with the "A bit" set to a "1" and no
transmission errors is called MRB+A (1002). Because data area frames are
much longer than MCB and MRB frames, the receipt of the link control word
of these data area frames (DATA-A START 1004 and DATA+A START 1006) is an
event that allows the channel to start moving the information field before
all of the data area has been received, thus improving performance. Listed
along the left side of the state tables of FIGS. 8 and 12 are the states.
The originator and recipient states for FIGS. 8 and 12 are described in
FIGS. 11 and 15 respectively. For example, IDLE state 1102 indicates that
the originator is done with the previous operation and is ready to receive
data areas and the response to the next operation. The IERR state 1104
indicates that an invalid sequence of frames was received, and the HERR
state 1106 indicates that the hardware must be in error for the event to
occur at this time.
Within each block 804 of the state tables there are two areas. The top of
the block indicates the next state (806), and the bottom of the block
indicates the action(s). The originator actions are described in FIG. 9,
and the recipient actions are described in FIG. 13. For example, originator
action #2 (902 in FIG. 9) starts data transfer as a result of receiving
the link control word of a data area.
All of the sequences shown in FIGS. 5, 6 and 7 can be traced in the state
tables in FIGS. 8 and 12. For example, the operation of the recipient
during the single data area write case shown in FIG. 5 can be traced in
the recipient state table in FIG. 12. The recipient is normally in the
IDLE state (1204), waiting for MCB frame 510. The arrival of this frame
is the MCB+A (1206) event, and the state changes to DS, where the recipient
is waiting for a data area. The DATA-A START event (1216) signals the
beginning of DATA frame 512, and the recipient moves to DE state 1212,
waiting for the end of the data area (DE entry in FIG. 12). The DATA END
event (1214) signals the successful receipt of the entire data area, and
the recipient returns to the IDLE state and stops the timer.
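The trace above can be reproduced with a small table-driven sketch. Only the three transitions named in the text are included; FIG. 12 contains many more states, events and actions. Treating every unlisted state and event pair as IERR is an assumption made here for brevity (the real table distinguishes IERR from HERR). The names `RECIPIENT_EXCERPT` and `trace` are hypothetical.

```python
from typing import Dict, List, Tuple

# Excerpt of the recipient table (FIG. 12): (state, event) -> next state.
RECIPIENT_EXCERPT: Dict[Tuple[str, str], str] = {
    ("IDLE", "MCB+A"): "DS",         # request arrives, data area to follow
    ("DS", "DATA-A START"): "DE",    # LC of the last data area received
    ("DE", "DATA END"): "IDLE",      # whole data area received intact
}


def trace(events: List[str], state: str = "IDLE") -> List[str]:
    """Return the sequence of states visited while processing `events`."""
    states = [state]
    for event in events:
        # Assumption: pairs outside this excerpt are treated as IERR.
        state = RECIPIENT_EXCERPT.get((state, event), "IERR")
        states.append(state)
    return states
```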
As can be seen in the examples showing read operations in FIGS. 5 and 6, a
reject may be sent for the last data area frame (the DATA frame with the A
bit set to zero) any time before the MCB is sent for the next operation,
and the next operation may start much later, if ever. This means that the
time between
sending the last DATA frame and receiving a reject for this frame is
unbounded. Because so much time may pass between sending the DATA frame
and receiving a reject, the recipient may in some cases have changed the
data area and be unable to resend it. Keeping a copy of the
data area for response to a reject would allow the recipient to always
resend the data area, but creating this copy each time a data area is sent
slows system performance. Since the instances when the recipient cannot
resend the data area are relatively rare, it is better to allow the
recipient to not respond and have a higher level of recovery handle the
error rather than create a copy.
Returning to FIG. 8, the originator detects an error in the last data area
of a read operation through the DATA ERR event in the DS, DSM, and DSMRJ
states. In each of these three states, the originator is waiting
to receive the end of a DATA frame that has the A bit set to zero. The
originator reinitializes the timer to a value much smaller than the normal
timer value (action 7; for example, 10 milliseconds) and sends the reject for the
DATA frame (action 5). If the retransmitted DATA frame is received within
the new timeout period, the operation completes normally. If the timer
expires before the retransmitted DATA frame is received, the higher level
recovery procedure is invoked.
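The shortened-timer behavior can be sketched for a single read operation on the originator side. Timer values are illustrative: `SHORT_TIMEOUT_MS` uses the 10 millisecond example above, `NORMAL_TIMEOUT_MS` is an assumed value, and the class and method names are hypothetical. Bookkeeping for the MRB is omitted. The helper `ms_until_recovery` shows the worst case, in which the recipient never resends the DATA frame.

```python
from typing import List, Optional

NORMAL_TIMEOUT_MS = 2_000  # assumed long operation timer value
SHORT_TIMEOUT_MS = 10      # the text's example of a much smaller value


class ReadOperation:
    """Originator side of one read operation (hypothetical sketch)."""

    def __init__(self) -> None:
        self.outbound: List[str] = ["MCB"]                  # request sent...
        self.timer_ms: Optional[int] = NORMAL_TIMEOUT_MS    # ...timer started
        self.completed = False
        self.recovery_invoked = False

    def on_last_data_error(self, state: str) -> None:
        """Last DATA frame (A bit = 0) arrived damaged in DS, DSM or DSMRJ."""
        if state not in ("DS", "DSM", "DSMRJ"):
            raise ValueError(f"unexpected state: {state}")
        self.timer_ms = SHORT_TIMEOUT_MS   # action 7: reinitialize the timer
        self.outbound.append("REJ")        # action 5: reject the DATA frame

    def on_last_data_ok(self) -> None:
        """Retransmitted DATA frame received intact: stop the timer."""
        self.completed = True
        self.timer_ms = None

    def tick(self, ms: int) -> None:
        """Advance the simulated clock; expiry hands off to recovery."""
        if self.timer_ms is None:
            return
        self.timer_ms -= ms
        if self.timer_ms <= 0:
            self.timer_ms = None
            self.recovery_invoked = True


def ms_until_recovery(shorten: bool) -> int:
    """Worst case: the recipient never resends the damaged DATA frame."""
    op = ReadOperation()
    if shorten:
        op.on_last_data_error("DSM")
    else:
        # Reject sent, long timer left running (as if it had just started).
        op.outbound.append("REJ")
    elapsed = 0
    while not op.recovery_invoked:
        op.tick(1)
        elapsed += 1
    return elapsed
```

Under these assumed values, higher level recovery starts after 10 ms instead of 2,000 ms when the data area cannot be resent.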
The shortened timer algorithm described above can be improved by only
reinitializing the timer to the shorter value after the MRB has been
correctly received. The longest delay is usually incurred while the
recipient is processing the request, and the request is finished when the
MRB is received. The time it takes for the recipient to process the reject
and retransmit the DATA frame is relatively short. Reinitializing the
timer only after the MRB has been received changes the state table in FIG.
8 in the following ways. First, in the DS state, when a DATA ERR is
received, the reject is sent for the DATA frame, but the timer is not
reinitialized. Second, in both the DSM and DSMRJ states, the MRB has been
received and the originator knows that the recipient has finished
processing the request. Even though the MRB was received with an error in the DSMRJ
state, the originator can still reinitialize the timer to a shorter value
when an error is detected in the DATA frame. The shorter timer only has to
span the resending of the damaged DATA and/or MRB.
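The improved rule can be expressed as a decision on the originator's state at the time the damaged last DATA frame is detected. This sketch covers only the two changes listed above; the function name and action strings are hypothetical, and what happens to the timer when the MRB arrives after an error detected in the DS state is not modeled.

```python
from typing import List

SHORTEN = "reinitialize timer to short value"   # action 7
REJECT = "send REJ for DATA frame"              # action 5


def actions_on_last_data_error(state: str, improved: bool = True) -> List[str]:
    """Originator actions when the last DATA frame of a read is damaged."""
    if state not in ("DS", "DSM", "DSMRJ"):
        raise ValueError(f"unexpected state: {state}")
    # In DSM and DSMRJ the MRB has already arrived (with an error in DSMRJ),
    # so the recipient has finished processing the request.
    mrb_seen = state in ("DSM", "DSMRJ")
    actions: List[str] = []
    if mrb_seen or not improved:
        actions.append(SHORTEN)
    actions.append(REJECT)
    return actions
```

Passing `improved=False` reproduces the original algorithm, in which all three states shorten the timer.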
While we have described our preferred embodiments of our invention, it will
be understood that those skilled in the art, both now and in the future,
may make various improvements and enhancements which fall within the scope
of the claims which follow. These claims should be construed to maintain
the proper protection for the invention first disclosed.
* * * * *