Claims
What is claimed is:
1. A data processing system comprising:
a. a first data source/sink (DSS) comprising one of a memory unit or a CPU,
having an associated memory wherein is located
first data accessible by memory address over a network;
a first descriptor to said first data; and
a computer program for causing said first DSS to transfer data by
duplicating said first descriptor, thereby creating a duplicate descriptor
and incrementing a counter to said first data; and
transferring said first descriptor or said duplicate descriptor onto said
second DSS without transferring said first data to a second DSS;
b. a second DSS, comprising the other of said memory unit and said CPU,
having an associated memory wherein is located a computer program for
causing said second DSS to transfer data by
receiving said first descriptor from said first DSS; and
transferring said first descriptor to a third DSS;
c. a third DSS, comprising a memory unit or a CPU, having an associated
memory; and
d. a network, coupling said memory unit, said CPU and said third DSS,
wherein the number of times said first data is transferred to said second
DSS is less than the number of times said first descriptor is transferred
from said second DSS.
2. The data processing system of claim 1 wherein said computer program in
said associated memory of said first DSS transfers by further
destroying said duplicate descriptor; and
decrementing said counter.
3. The data processing system of claim 1 wherein said computer program in
said associated memory of said second DSS transfers by further
creating on said second DSS, a second descriptor to second data; and
chaining said first and second descriptors to form a first chained
descriptor.
4. The data processing system of claim 3 wherein said computer program in
said associated memory of said second DSS transfers by further
transferring said first chained descriptor to said third or said first DSS.
5. The data processing system of claim 1 wherein in said associated memory
of said third DSS is located a computer program for causing said third DSS
to transfer data by
retrieving a portion of the first data from said first DSS to said third
DSS by means of said first descriptor.
6. A data processing system comprising:
e. a first data source/sink (DSS), comprising one of a memory unit or a
CPU, having an associated memory wherein is located
first data accessible by memory address over a network;
a descriptor to said first data; and
a computer program for causing said first DSS to transfer data by
duplicating said descriptor, thereby creating a duplicate descriptor and
incrementing a counter to said first data; and
transferring said descriptor or said duplicate descriptor onto said second
DSS without transferring said first data to a second DSS;
f. a second DSS, comprising the other of said memory unit and said CPU,
having an associated memory wherein is located a computer program for
causing said second DSS to transfer data by
receiving said descriptor from said first DSS; and
dividing said descriptor into a plurality of descriptors, each of said
plurality of descriptors describing a portion of the first data; and
transferring one of said plurality of descriptors from said second DSS onto
a third DSS without transferring the portion of the first data described
by said one descriptor;
g. a third DSS, comprising a memory unit or a CPU, having an associated
memory; and
h. a network, coupling said memory unit, said CPU and said third DSS.
7. The data processing system of claim 6 wherein said computer program in
said associated memory of said second DSS transfers by further
destroying said duplicate descriptor; and
decrementing said counter.
8. The data processing system of claim 6 wherein said computer program in
said associated memory of said second DSS transfers by
duplicating said one descriptor, thereby creating a duplicate descriptor
and incrementing a counter to said first data; and
transferring said one descriptor or said duplicate descriptor onto said
third DSS.
9. The data processing system of claim 8 wherein said computer program in
said associated memory of said second DSS transfers by further
destroying said duplicate descriptor; and
decrementing said counter.
10. The data processing system of claim 6 wherein in said associated memory
of said third DSS is located a computer program for causing said third DSS
to transfer data by
retrieving the portion of the first data described by said one descriptor
from said first DSS to said third DSS by means of said one descriptor.
11. The data processing system of claim 6 wherein said computer program in
said associated memory of said second DSS transfers data by further
chaining before each of said plurality of descriptors a respective
descriptor to data.
12. The data processing system of claim 11 wherein said computer program in
said associated memory of said second DSS chains by
chaining before each of said plurality of descriptors a respective
descriptor to data comprising a protocol header.
13. In a data processing system having a distributed memory architecture
that includes a plurality of data source/sinks in the form of CPUs having
associated memories, memory units having associated controllers, and
global memories available to all resources on the network, with the data
source/sinks coupled as nodes to a network and with data locations
accessible by global memory addresses over the network, a method for
transferring a large video data stream, stored in a memory unit, including
data segments scattered among multiple data source/sinks without copying
the scattered data segments to each of the multiple data source/sinks
during scattered data stream building processing, said method comprising
the steps of:
providing an application server process, on a first data source/sink in the
form of a CPU having an associated memory, said application server process
for receiving requests for video files from requestors and directing
transfer of a requested video file to a requestor;
allocating of a pool of global memories for use as a cache, with said pool
of global memories referenced by a first message;
transferring the video file, stored in said memory unit, to said allocated
pool of global memories;
making a copy of said first message at said first data source/sink;
transferring only said copy of said first message and not said video file
data segment from said first data source/sink to a second data source/sink
in the form of a CPU with an associated memory;
generating, at said second data source/sink, a second message including a
second global network address specifying a second storage location where a
second data segment is stored;
processing, at said second data source/sink, said first and second messages
to form a first chained message including said first and second global
addresses specifying said first and second storage locations;
transferring only said first chained message and not said second data
segment to a third data source/sink, with said third data source/sink for
generating protocol headers for transmitting said first and second data
segments to a destination;
determining, at said third data source/sink, the size of said video file,
and based on said size, determining a fragment number of data fragments
into which said first data segment is to be divided prior to transmission;
generating, at said third data source/sink, said fragment number of
protocol headers for concatenation with each of said data fragments prior
to transmission;
generating, at said third data source/sink, said fragment number of
protocol header messages including global addresses of said fragment
number of storage areas in said pool of global memories storing said
fragment number of protocol headers;
generating, at said third data source/sink, said fragment number of data
fragment messages including global addresses specifying where each of said
data fragments is stored within said pool of global memories;
processing, at said third data source/sink, said first chained message,
said fragment number of protocol header messages, and said fragment number
of data fragment messages, to form linked packet messages, with a first
packet message including said first and second messages, a first protocol
header message, and first data fragment message and with subsequent packet
messages including associated protocol header and data fragment messages.
Description
BACKGROUND OF THE INVENTION
The present invention relates to data transfer in a computer system. In
particular, the invention relates to methods and apparatus for
transferring data among various sources and sinks for data.
RELATED APPLICATIONS
The following applications cover related inventions:
U.S. patent application Ser. No. 08/575,533, entitled "COMPUTER SYSTEM DATA
I/O BY REFERENCE AMONG MULTIPLE CPUS," filed Dec. 20, 1995, naming Fishler
and Zargham as inventors, assigned to the Assignee of this invention, with
Attorney Docket No. 010577-036500US/TA 342, now pending;
U.S. patent application Ser. No. 08/578,366, entitled "COMPUTER SYSTEM DATA
I/O BY REFERENCE AMONG I/O DEVICES AND MULTIPLE MEMORY UNITS," filed Dec.
20, 1995, naming Fishler and Zargham as inventors, assigned to the
Assignee of this invention, with Attorney Docket No. 010577-036700US/TA
343, and now pending;
U.S. patent application Ser. No. 08/578,409, entitled "COMPUTER SYSTEM DATA
I/O BY REFERENCE AMONG MULTIPLE DATA SOURCES AND SINKS," filed Dec. 20,
1995, naming Fishler and Zargham as inventors, assigned to the Assignee of
this invention, with Attorney Docket No. 010577-037100US/TA 346, now
pending.
Queued, message-based I/O ("QIO") in a system with shared memory is
discussed fully in U.S. application Ser. No. 08/377,302, filed Jan. 23,
1995 and assigned as well to the Assignee of the instant application, now
abandoned. U.S. application Ser. No. 08/377,302 is incorporated herein by
reference and is loosely summarized below.
FIG. 1 is a block diagram showing a fault-tolerant, parallel data
processing system 100 incorporating a QIO shared memory system. FIG. 1
includes a node 102 and a workstation 104 that communicate over a Local
Area Network (LAN) 105. The node 102 includes processors 106 and 108,
connected by an interprocessor bus (IPB) 109. The IPB 109 is a redundant
bus of a type known by persons of ordinary skill in the art. Although not
shown in FIG. 1, the system 100 is a fault-tolerant, parallel computer
system, where at least one processor checkpoints data from other
processors in the system. In prior art, in such a system, memory is not
shared in order to avoid the memory being a bottleneck or a common point
of failure. Such a fault tolerant system is described generally in, for
example, U.S. Pat. No. 4,817,091 to Katzman et al.
The processor 106 includes a CPU 110 and a memory 112 and is connected via
a disk driver 132 and a disk controller 114 to a disk drive 116. The
memory 112 includes a shared memory segment 124 (including QIO queues
125), an application process 120 and a disk process 122. The application
and disk processes 120, 122 access the shared memory segment 124 through
the QIO library routines 126. As is the nature of QIO, messages sent
between the application process 120 and the disk process 122 using the
shared memory segment 124 and the QIO library 126 are sent without
duplication of data from process to process.
The processor 108 also includes a CPU 142 and a memory 144 and is connected
via a LAN controller 140 to LAN 105. The memory 144 includes a shared
memory segment 150 (including QIO queues 151), a TCP/IP process 146 and an
NFS distributor process 148. The TCP/IP process 146 communicates through
the shared memory segment 150 using the QIO library routines 152 with the
NFS distributor process 148 and the software LAN driver 158. Again,
communications using the QIO shared memory segment 150 do not involve
copying data between processes.
The TCP/IP process 146 and the LAN 105 exchange data by means of the LAN
driver 158 and the LAN controller 140.
The process 120 communicates over the IPB 109 with the TCP/IP process 146
using message systems (MS) 128 and 154 and file systems (FS) 130 and 156.
Unlike QIO communications, communications using message systems and file
systems do require data copying.
Thus, FIG. 1 shows a QIO shared memory system for communicating between
processes located on a single processor. A shared memory queuing system
increases the speed of operation of communication between processes on a
single processor and, thus, increases the overall speed of the system. In
addition, a shared memory queuing system frees programmers to implement
both vertical modularity and horizontal modularity when defining
processes. This increased vertical and horizontal modularity improves the
ease of maintenance of processes while still allowing efficient transfer
of data between processes on a single processor and between processes and
drivers on a single processor.
FIG. 2 illustrates a computer system generally designated as 200. The
computer system 200 contains nodes 210, 211, 212 and 213. The nodes 210,
211, 212 and 213 are interconnected by means of a network 220. The nodes
210, 211, 212 and 213 run a disk process 230, an application server
process 231, an intermediate protocol process 232 and a TCP/IP and ATM
driver 233, respectively.
The application server process 231 receives user requests for data and
directs the transfer of that data to the user over the TNet 220. The data
requested generally resides on disks accessible only via disk controllers
such as the disk controller 240. In fact, access to the data on a disk
controller is mediated by a particular disk process. Here, the disk
process 230 on the node 210 mediates access to the disk controller 240.
The disk process 230 is responsible for transferring data to and from the
disk attached to the disk controller 240.
With regard to the system 200 of FIG. 2, assume that a multimedia
application needs to obtain some large amount of data 260, say, an MPEG
video clip, from a data disk. Assume that the application does not need to
examine or transform any (or at least a majority) of the individual bytes
of that MPEG video clip. The application seeks that data 260 because an
end user somewhere on the net has requested that video clip. A user
interface and the application server process 231 communicate using an
intermediate protocol implemented on TCP/IP. (The user interface may
be an application process or may be a hardware device with minimal
software. In any event, the user interface is not shown here.)
Accordingly, the intermediate protocol information 262 must be added to
messages from the application server process 231, and the intermediate
protocol process 232 has the responsibility for attaching such header
information 262 as the intermediate protocol requires. Likewise, TCP/IP
protocol information 263 must then be layered onto the outbound message,
and the TCP/IP driver process 233 in the node 213 supplies such TCP/IP
headers 263 as the TCP/IP protocol requires. Therefore, to transfer the
data 260 on demand from the disk attached to disk controller 240, the
application server process 231 employs the disk process 230 to retrieve
the data 260 from disk and employs the intermediate protocol and TCP/IP &
ATM driver processes 232, 233 to forward the data 260 to the user
interface.
Further assume that among its functions, the application server process 231
attaches some application-specific data 261 at the beginning of the
outgoing data 260.
When the application server process 231 recognizes that the disk process
230 mediates access to the data 260 for the requesting user's consumption,
the application server process 231 communicates a message to the disk
process 230 via the TNet 220 in order to retrieve that data 260.
The disk process 230 builds a command sequence which the disk controller
240 on receipt will interpret as instructions to recover the data of
interest. The disk process 230 directs the disk controller 240 to transfer
the data 260 into the memory 250 of the sub-processing system 210. The
disk controller 240 informs the disk process 230 on successful completion
of the directed data transfer.
The disk process 230 in turn responds to the application server process 231
that the data transfer has completed successfully and includes a copy of
the data 260 in its response. Thus, the requested data 260 is copied into
the application server node 211. As one of ordinary skill in the art will
appreciate, several copies may be necessary in order to transfer the data
260 from the TNet driver buffers (not shown) of the application server
node 211 into the memory space of the application server process 231. Yet
another copy is typically necessary to make the application-specific data
261 contiguous with the disk data 260. The QIO system related above
may obviate a number of these intra-processor copies but obviates
none of the interprocessor copies.
Indeed, the combined data 261, 260 migrates by means of another
interprocessor copy from the node 211 to the node 212. The node 212 adds
its intermediate protocol header data 262, probably by copies of the data
262, 261 and 260 into a single buffer within the memory of the
intermediate protocol process 232.
Again, the combined data 262, 261, 260 migrates from the node 212 to the
node 213 by means of another interprocessor copy. The TCP/IP process 233
desires to divide the combined data 262, 261, 260 into TCP/IP packet sizes
and insert TCP/IP headers 263a, 263b, . . . , 263n at the appropriate
points. Accordingly, the TCP/IP process 233 copies all or at least
substantially all of the combined data 262, 261, 260 and TCP/IP header
data 263a, 263b, . . . , 263n to fracture and reconstruct the data in the
correct order in the memory 253. The TCP/IP protocol process 233 then
transfers these packets to the ATM controller 270 which sends them out on
the wire.
(A system designer may wish to separate the processing of layered protocols
into separate sub-processing systems for reasons of parallelism, to
increase the throughput of the system 200. Such subprocessing systems do
not share memory in systems of this type in order to achieve greater fault
tolerance and to avoid memory bottlenecks.)
A computer system of this art requires that the disk data 260 be copied
five times among the sub-processing systems--and typically an additional
2-4 times within each sub-processing system not practicing QIO as related
above. The computer system 200 consumes memory bandwidth at (a minimum of)
five times the rate of a system wherein interprocessor copying was not
performed. The copying presents a potential bottleneck in the operation of
the system 200: it wastes I/O bandwidth and memory bandwidth and causes
cache misses in the target CPU, all of which reduce performance.
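As a rough illustration of the bandwidth argument above, the following Python sketch totals the bytes of clip data moved for a single request. The five interprocessor copies and the 2-4 intra-processor copies come from the text. The 100 MB clip size, the count of four sub-processing systems (taken from FIG. 2), and the function name are assumptions made only for this example.

```python
def bytes_moved(clip_bytes, inter_copies=5, intra_copies_per_node=0, nodes=4):
    """Total bytes of clip data moved to satisfy one request.

    inter_copies: copies among sub-processing systems (five in the text).
    intra_copies_per_node: extra copies inside each sub-processing system
        (2-4 in the text when QIO is not practiced).
    nodes: number of sub-processing systems involved (assumed four).
    """
    return clip_bytes * (inter_copies + intra_copies_per_node * nodes)


CLIP = 100 * 10**6  # assumed 100 MB video clip

# With QIO inside each node, only the interprocessor copies remain.
with_qio = bytes_moved(CLIP)
# Without QIO, each node adds its own 2-4 copies on top of those five.
without_qio_low = bytes_moved(CLIP, intra_copies_per_node=2)
without_qio_high = bytes_moved(CLIP, intra_copies_per_node=4)

print(with_qio, without_qio_low, without_qio_high)
```

Under these assumptions, even the best case moves five times the size of the clip, which is the minimum memory-bandwidth multiple cited above.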
Accordingly, there is a need for a system which avoids interprocessor
copying of data, while avoiding shared memory bottlenecks and fault
tolerance problems.
Accordingly, a goal of this invention is a computer system which obviates
unnecessary copying of data, both intra-processor and interprocessor.
This and other goals of the invention will be readily apparent to one of
ordinary skill in the art on reading the background above and the
description below.
SUMMARY OF THE INVENTION
In one embodiment, the invention operates in a data processing system
having a distributed memory architecture that includes a plurality of data
sources/sinks in the form of memory units or CPUs having associated
memories, coupled as nodes to a network and with data accessible over the
network. The embodiment comprises getting a descriptor to a data buffer on
a first of said plurality of data sources/sinks; putting said descriptor
onto a second of said plurality of data sources/sinks without transferring
the data in said data buffer; putting said descriptor from said second data
source/sink onto a third of said plurality of data sources/sinks; and
retrieving a portion of the data in said data buffer from said first data
source/sink to said third data source/sink.
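The get, put and retrieve steps summarized above can be sketched in Python as follows. This is a minimal model, not the patented implementation: the Node and Descriptor classes, their method names, the dictionary standing in for the network, and the address 0x1000 are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    node: str      # name of the data source/sink holding the buffer
    address: int   # address of the buffer in that node's memory
    length: int    # buffer length in bytes


class Node:
    """A data source/sink whose memory is reachable over the network."""

    def __init__(self, name, network):
        self.name = name
        self.memory = {}          # address -> bytes
        self.inbox = []           # descriptors received from other nodes
        self.bytes_received = 0   # data bytes actually pulled to this node
        self.network = network
        network[name] = self

    def get_descriptor(self, address):
        return Descriptor(self.name, address, len(self.memory[address]))

    def put(self, descriptor, dest):
        # Only the descriptor travels; the buffer stays where it is.
        self.network[dest].inbox.append(descriptor)

    def retrieve(self, descriptor, offset=0, size=None):
        # Read a portion of the remote buffer by way of the descriptor.
        size = descriptor.length - offset if size is None else size
        buffer = self.network[descriptor.node].memory[descriptor.address]
        data = buffer[offset:offset + size]
        self.bytes_received += len(data)
        return data


network = {}
first, second, third = (Node(n, network) for n in ("first", "second", "third"))
first.memory[0x1000] = b"MPEG video clip data"

first.put(first.get_descriptor(0x1000), "second")   # first -> second
second.put(second.inbox.pop(), "third")             # second -> third
portion = third.retrieve(third.inbox.pop(), offset=0, size=4)
```

In this sketch the second node forwards the descriptor without ever receiving the buffer, and only the third node pulls data, directly from the first node's memory.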
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram showing a fault tolerant, parallel data
processing system incorporating a QIO shared memory system;
FIG. 2 illustrates a modular networked multiprocessor system;
FIG. 3A illustrates a fault tolerant multiprocessor system;
FIG. 3B illustrates an alternative configuration of the system of FIG. 3A.
FIG. 4 illustrates the interface unit that forms a part of the CPUs of FIG.
3A to interface the processor and memory with the network;
FIG. 5 illustrates a more particularized version of the data processing
system 10 of FIG. 3A;
FIG. 6 is a representation of a global QIO queue.
FIG. 7 shows a format of a message.
FIG. 8 shows a format of a buffer descriptor.
DESCRIPTION OF THE PREFERRED EMBODIMENT
Overview
FIG. 3A illustrates a data processing system 10, constructed according to
the teachings of U.S. patent application Ser. No. 08/485,217, filed Jun.
7, 1995 (Attorney Docket No. 010577-028210) and assigned as well to the
Assignee of the instant invention. (U.S. patent application Ser. No.
08/485,217 is incorporated herein by reference and loosely summarized
herein.) As FIG. 3A shows, the data processing system 10 comprises two
sub-processing systems 10A and 10B, each structured identically to the
other. Each of the sub-processing systems 10 includes a
central processing unit (CPU) 12, a router 14, and a plurality of
input/output (I/O) packet interfaces 16. Each of the I/O packet interfaces
16, in turn, is coupled to a number (n) of I/O devices 17 and a
maintenance processor (MP) 18.
Interconnecting the CPU 12, the router 14, and the I/O packet interfaces 16
are trusted network (TNet) links L. As FIG. 3A further illustrates, TNet
links L also interconnect the sub-processing systems 10A and 10B,
providing each sub-processing system 10 with access to the I/O devices of
the other as well as inter-CPU communication. Any CPU 12 of the processing
system 10 can be given access to the memory of any other CPU 12, although
such access must be validated.
Preferably, the sub-processing systems 10A/10B are paired as illustrated in
FIG. 3A (and FIG. 3B discussed below).
Information is communicated between any element of the processing system 10
(e.g., CPU 12A of sub-processing system 10A) and any other element of the
system (e.g., an I/O device associated with an I/O packet interface 16B of
sub-processing system 10B) via message "packets." Each packet is made up of
symbols which may contain data or a command.
Each router 14 is provided with TNet ports, each of which is substantially
identically structured (except in ways not important to this invention).
In FIG. 3B, one port of each of the routers 14A and 14B is used to connect
the corresponding sub-processing systems 10A and 10B to additional
sub-processing systems 10A' and 10B' to form a processing system 10
comprising a cluster of sub-processing systems 10.
Due to the design of the routers 14, the method used to route message
packets, and the judicious use of the routers 14 when configuring the
topology of the system 10, any CPU 12 of processing system 10 of FIG. 3A
can access any other "end unit" (e.g., a CPU or an I/O device) of any of
the other sub-processing systems. For example, the CPU 12B of the
sub-processing system 10B can access the I/O 16" of sub-processing system
10A"; or CPU 12A of sub-processing system 10A' may access memory contained
in the CPU 12B of sub-processing system 10B to read or write data. This
latter activity requires that CPU 12A (of sub-processing system 10A') have
authorization to perform the desired access. In this regard each CPU 12
maintains a table
containing entries for each element having authorization to access that
CPU's memory, and the type of access permitted.
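The per-CPU authorization table described above can be pictured with the following Python sketch. It is illustrative only: the CpuMemory class, its method names, the string identifiers for requesters, and the choice of "read" and "write" as the access types are assumptions, and the sketch says nothing about how the table is laid out or checked in the actual system.

```python
class CpuMemory:
    """Memory of one CPU, guarded by a table of authorized requesters."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.access_table = {}   # requester id -> permitted access types

    def authorize(self, requester, *access_types):
        self.access_table[requester] = frozenset(access_types)

    def _validate(self, requester, access_type):
        # An element with no table entry is granted no access at all.
        permitted = self.access_table.get(requester, frozenset())
        if access_type not in permitted:
            raise PermissionError(f"{requester} may not {access_type} this memory")

    def read(self, requester, address, length):
        self._validate(requester, "read")
        return bytes(self.data[address:address + length])

    def write(self, requester, address, payload):
        self._validate(requester, "write")
        self.data[address:address + len(payload)] = payload


memory_12b = CpuMemory(64)
memory_12b.authorize("CPU 12A", "read", "write")
memory_12b.authorize("I/O 16", "read")

memory_12b.write("CPU 12A", 0, b"abc")
value = memory_12b.read("I/O 16", 0, 3)

try:
    memory_12b.write("I/O 16", 0, b"x")   # read-only requester
    denied = False
except PermissionError:
    denied = True
```

Here the table records both which elements may access the memory and the type of access each is permitted, so a read-only requester is refused when it attempts a write.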
Data and commands are communicated between the various CPUs 12 and I/O
packet interfaces 16 by packets comprising data and command symbols. A CPU
12 is precluded from communicating directly with any outside entity (e.g.,
another CPU 12 or an I/O device via the I/O packet interface 16). Rather,
the CPU 12 will construct a data structure in the memory 28, turning over
control to an interface unit 24 (see FIG. 4), which contains a block
transfer engine (BTE) with direct memory access (DMA) capability. The BTE
accesses the data structure(s) in memory and transmits them to the
appropriate destination.
The design of the processing system 10 permits a memory 28 of a CPU to be
read or written by outside sources (e.g., CPU 12B or an I/O device). For
this reason, care must be taken to ensure that external use of a memory 28
of a CPU 12 is authorized.
Movie-on-Demand Scenario
FIG. 5 illustrates a more particularized version of the data processing
system 10 of FIG. 3A. In FIG. 5, there is a computer system 500 which contains
sub-processing systems 510, 511, 512 and 513. In the simplified schematic
of FIG. 5, each of these sub-processing systems 510, 511, 512 and 513 may
actually include paired sub-processing systems, as discussed. Although not
illustrated in FIG. 5, each of the sub-processing systems 510, 511, 512
and 513 includes a respective router 14 and interface unit 24, as
discussed above. FIG. 5 represents the TNet links L interconnecting the
sub-processing systems 510, 511, 512 and 513 as links L from a TNet
network 520.
The sub-processing systems 510, 511, 512 and 513 run a disk process 530, an
application server process 531, an intermediate protocol process 532 and a
TCP/IP and ATM driver 533, respectively. Again as discussed above, in a
typical system, some of the processes 530, 531, 532 and 533 will have a
backup process running in a paired sub-processing system. The simplified
FIG. 5 illustrates these paired processes by their respective primary
processes. In the movie-on-demand example scenario described generally
herein, the application server process 531 receives user requests for data
(e.g., clips of movies) and directs the transfer of that data to the user
over the TNet 520. The data requested generally resides on disks
accessible only via disk controllers such as the disk controller 540. In
fact, access to the data on a disk controller is mediated by a particular
disk process. Here, the disk process 530 on the sub-processing system 510
mediates access to the disk controller 540. The disk process 530 is
responsible for transferring data to and from the disk attached to the
disk controller 540. (As system 500 is a fully fault-tolerant system, disk
controller 540 has a pair and the disk of disk controller 540 is typically
mirrored. Again, the fault-tolerant aspects of the system 500 are not
illustrated in the simplified FIG. 5.)
Assume that the user interface and the application server process 531 are
communicating using the RPC protocol implemented on TCP/IP. (The user
interface may be an application process or may be a hardware device with
minimal software. In any event, the user interface is not shown here.)
Accordingly, RPC protocol information 562 must be added to messages from
the application server process 531, and the intermediate protocol process
532 has the responsibility for attaching such header information 562 as
the RPC protocol requires. Likewise, TCP/IP protocol information 563 must
then be layered onto the outbound message, and the TCP/IP driver process
533 in the sub-processing system 513 provides such TCP/IP headers 563 as
the TCP/IP protocol requires. Therefore, to transfer data on demand from
the disk attached to disk controller 540, the application server process
531 employs the disk process 530 to retrieve the data from disk and
employs the intermediate and TCP/IP & ATM driver processes 532, 533 to
forward the data to the user interface.
Further assume that among its functions, the application server process 531
attaches, at the beginning of the outgoing data, some application-specific
data 561. This introductory data can be, for example, movie trailers, the
familiar copyright notices, or command sequences to a video box connected
to a tele | | |