Description
FIELD OF THE INVENTION
The invention relates to multiprocessor systems and, more particularly, to
the efficient and selective ordering of memory reference operations issued
by a processor of a multiprocessor system.
BACKGROUND OF THE INVENTION
Multiprocessor systems, such as symmetric multi-processors, provide a
computer environment wherein software applications may operate on a
plurality of processors using a single address space or shared memory
abstraction. In a shared memory system, each processor can access any data
item without a programmer having to worry about where the data is or how
to obtain its value; this frees the programmer to focus on program
development, e.g., algorithms, rather than managing partitioned data sets
and communicating values. Interprocessor synchronization is typically
accomplished in such a shared memory system between processors performing
read and write operations to "synchronization variables" either before or
after accesses to "data variables".
For instance, consider computer program example #1 wherein a processor P1
updates a data structure and processor P2 reads the updated structure
after synchronization. Typically, this is accomplished by P1 updating data
values and subsequently setting a semaphore or flag variable to indicate
to P2 that the data values have been updated. P2 checks the value of the
flag variable and, if set, subsequently issues read operations (requests)
to retrieve the new data values. Note the significance of the term
"subsequently" used above; if P1 sets the flag before it completes the
data updates or if P2 retrieves the data before it checks the value of the
flag, synchronization is not achieved. The key is that each processor must
individually impose an order on its memory references for such
synchronization techniques to work. The order described above is referred
to as a processor's inter-reference order. Commonly used synchronization
techniques require that each processor be capable of imposing an
inter-reference order on its memory reference operations.
Computer program example #1
P1                          P2
Store Data, New-value       L1: Load Flag
Store Flag, 0                   BNZ L1
                                Load Data
The inter-reference order imposed by a processor is defined by its memory
reference ordering model or, more commonly, its consistency model. The
consistency model for a processor architecture specifies, in part, a means
by which the inter-reference order is specified. Typically, the means is
realized by inserting a special memory reference ordering instruction,
such as a Memory Barrier (MB) or "fence", between sets of memory reference
instructions. Alternatively, the means may be implicit in other opcodes,
such as in "test-and-set". In addition, the model specifies the precise
semantics (meaning) of the means. Two commonly used consistency models
include sequential consistency and weak-ordering, although those skilled
in the art will recognize that there are other models, such as release
consistency, that may be employed.
In a sequentially consistent system, the order in which memory reference
operations appear in an execution path of the program (herein referred to
as the "I-stream order") is the inter-reference order. Additional
instructions are not required to denote the order simply because each load
or store instruction is considered ordered before its succeeding operation
in the I-stream order. Consider computer program example #1 above. The
program performs as expected on a sequentially consistent system because
the system imposes the necessary inter-reference order. That is, P1's
first store instruction is ordered before P1's store-to-flag instruction.
Similarly, P2's load flag instruction is ordered before P2's load data
instruction. Thus, if the system imposes the correct inter-reference
ordering and P2 retrieves the value 0 for the flag, P2 will also retrieve
the new value for data.
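The flag handshake of computer program example #1 may be sketched in Python as follows. The sketch assumes the standard CPython interpreter, in which each thread's stores to a shared dictionary become visible to other threads in program order, so that it behaves like the sequentially consistent system described above. The names `shared`, `result`, `p1` and `p2` are illustrative only; the flag is assumed to start non-zero, with a value of 0 indicating that the data has been updated, consistent with the BNZ loop.

```python
import threading

# Shared memory: flag starts non-zero; 0 means "data is ready" (as in BNZ L1).
shared = {"data": None, "flag": 1}
result = []

def p1():
    shared["data"] = "new-value"   # Store Data, New-value
    shared["flag"] = 0             # Store Flag, 0 (must follow the data store)

def p2():
    while shared["flag"] != 0:     # L1: Load Flag / BNZ L1
        pass
    result.append(shared["data"])  # Load Data (must follow the flag load)

t2 = threading.Thread(target=p2)
t1 = threading.Thread(target=p1)
t2.start()
t1.start()
t1.join()
t2.join()
```

If either processor reordered its two references, P2 could append the stale value; the sketch relies on program order being the inter-reference order.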
In a weakly-ordered system, an order is imposed between selected sets of
memory reference operations, while other operations are considered
unordered. One or more MB instructions are used to indicate the required
order. In the case of an MB instruction defined by the Alpha® 21264
processor instruction set, the MB denotes that all memory reference
instructions above the MB (i.e., pre-MB instructions) are ordered before
all reference instructions after the MB (i.e., post-MB instructions).
However, no order is required between reference instructions that are not
separated by an MB, except in specific circumstances such as when two
references are directed to the same address.
Computer program example #2
P1:                          P2:
Store Data1, New-value1      L1: Load Flag
Store Data2, New-value2          BNZ L1
MB                               MB
Store Flag, 0                    Load Data1
                                 Load Data2
In the above example, the MB instruction implies that each of P1's two
pre-MB store instructions is ordered before P1's store-to-flag instruction.
However, there is no logical order required between the two pre-MB store
instructions. Similarly, P2's two post-MB load instructions are ordered
after the Load flag; yet, there is no order required between the two
post-MB loads. It can thus be appreciated that weak ordering reduces the
constraints on logical ordering of memory references, thereby allowing a
processor to gain higher performance by potentially executing the
unordered sets concurrently.
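The ordering rule just described (references separated by an MB are ordered, as are references to the same address, while all others are unordered) may be illustrated by the following minimal sketch. The function name `required_order` and the tuple representation of instructions are assumptions made for illustration; branch instructions are omitted because they are not memory references.

```python
from itertools import combinations

def required_order(program):
    """Return the set of (earlier, later) index pairs that must stay ordered.

    program is a list of (op, address) tuples; ("MB", None) marks a barrier.
    """
    pairs = set()
    for i, j in combinations(range(len(program)), 2):
        op_i, addr_i = program[i]
        op_j, addr_j = program[j]
        if "MB" in (op_i, op_j):
            continue  # the barrier itself is not a memory reference
        barrier_between = any(program[k][0] == "MB" for k in range(i + 1, j))
        same_address = addr_i == addr_j
        if barrier_between or same_address:
            pairs.add((i, j))
    return pairs

# P1 and P2 of computer program example #2 (memory references and MB only)
p1 = [("Store", "Data1"), ("Store", "Data2"), ("MB", None), ("Store", "Flag")]
p2 = [("Load", "Flag"), ("MB", None), ("Load", "Data1"), ("Load", "Data2")]
p1_order = required_order(p1)
p2_order = required_order(p2)
```

For P1, only the pairs linking each data store to the flag store are returned; the two data stores remain unordered with respect to each other and may therefore be overlapped.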
Most computer systems use virtual memory to effectively manage physical
memory of the systems. In a virtual memory system, programs use virtual
addresses to address memory space allocated to them. The virtual addresses
are translated to physical addresses which denote the actual locations in
physical memory. A common process for managing virtual memory is to divide
the virtual and physical memory into equal-sized pages. A system disk
participates in the implementation of virtual memory by storing pages of
the program not currently in physical memory. The loading of pages from
the disk to physical memory is managed by the operating system.
When a program references an address in virtual memory, the processor
calculates the corresponding main memory physical address in order to
access data at that address. The processor typically includes a memory
management unit that performs the translation of the virtual address to a
physical address. Specifically, for each program there is a page table
containing a list of mapping entries, i.e., page table entries (PTEs),
which, in turn, contain the physical address of each virtual page of the
program. FIG. 1 is a schematic diagram of a prior art page table 100
containing a plurality of PTEs 110. An upper portion, i.e., the virtual
page number (VPN 122), of a virtual address 120 is used to index into the
page table 100 to access a particular PTE 110; the PTE contains a page
frame number (PFN) 112 identifying the location of the page in main
memory. A lower portion, i.e., page offset 124, of the virtual address 120
is concatenated to the PFN 112 to form the physical address 130
corresponding to the virtual address. Because of its large size, the page
table is generally stored in main memory; thus, every program reference to
access data in the system typically requires an additional memory access
to obtain the physical address, which increases the time to perform the
address translation.
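The translation of FIG. 1 may be sketched as follows. The page size (13 offset bits), the list-based page table and the PFN values are assumptions chosen for illustration; the text does not fix any of them.

```python
PAGE_OFFSET_BITS = 13  # assumed 8 KB pages; not specified in the text

def translate(virtual_address, page_table):
    """Map a virtual address to a physical address via a flat page table."""
    vpn = virtual_address >> PAGE_OFFSET_BITS                 # upper portion
    offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)  # lower portion
    pfn = page_table[vpn]                                     # PTE lookup
    return (pfn << PAGE_OFFSET_BITS) | offset                 # PFN ++ offset

page_table = [7, 3, 12]             # hypothetical PFNs indexed by VPN
va = (1 << PAGE_OFFSET_BITS) | 0x2A  # VPN 1, page offset 0x2A
pa = translate(va, page_table)
```

The page table lookup in `translate` corresponds to the additional memory access noted above.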
To reduce address translation time, a translation buffer (TB) is used to
store translation maps of recently accessed virtual addresses. The TB is
similar to a data cache in that the TB contains a plurality of entries,
each of which includes a tag field for holding portions of the virtual
address and a data field for holding a PFN; thus, the TB functions as a
cache for storing most-recently-used PTEs. When the processor requires
data, the virtual address is provided to the TB and, if there is a match
with the contents of the tag field, the virtual address is translated into
a physical address which is used to access the data cache. If there is not
a match between the virtual address and contents of the tag field, a TB
miss occurs. In response to the TB miss, the memory management unit
fetches (from cache or memory) appropriate PTE mapping information and
loads it into the TB. The read operation for the PTE is generally followed
by a subsequent read or write operation for data at the mapped physical
address. Handling of TB misses may be implemented in hardware or software;
in the latter case, the instructions used to handle the TB miss constitute
a TB miss flow routine.
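The behavior of the TB as a cache of PTEs may be sketched as follows. The class name `TranslationBuffer`, the dictionary-based storage and the absence of any capacity limit or replacement policy are simplifying assumptions, not features taken from the text.

```python
class TranslationBuffer:
    def __init__(self, page_table):
        self.page_table = page_table  # backing page table: VPN -> PFN
        self.entries = {}             # cached PTEs: tag (VPN) -> PFN
        self.misses = 0

    def lookup(self, vpn):
        """Return the PFN for vpn, filling the TB from the page table on a miss."""
        if vpn not in self.entries:                   # TB miss
            self.misses += 1
            self.entries[vpn] = self.page_table[vpn]  # fetch PTE and load TB
        return self.entries[vpn]

tb = TranslationBuffer({1: 40, 2: 41})  # hypothetical mappings
pfns = [tb.lookup(2), tb.lookup(2), tb.lookup(1)]
```

Only the first reference to each page incurs a miss; the repeated lookup of page 2 is satisfied from the TB.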
In order to increase performance, modern processors do not execute memory
reference instructions one at a time. In fact, it is desirable that a
processor keep a large number of memory reference operations outstanding
and issue, as well as complete, those operations out-of-order. This is
accomplished by viewing the consistency model as a "logical order", i.e.,
the order in which memory reference operations appear to happen, rather
than the order in which those references are issued or completed. More
precisely, a consistency model defines only a logical order on memory
references; it allows for a variety of optimizations in implementation. It
is thus desired to increase performance by reducing latency and allowing
(on average) a large number of outstanding references, while preserving
the logical order implied by the consistency model.
In prior systems, an MB instruction is typically contingent upon
"completion" of an operation. For example, when a source processor issues
a read operation, the operation is considered complete when data is
received at the source processor. When executing a store instruction, the
source processor issues a memory reference operation to acquire exclusive
ownership of the data; in response to the issued operation, system control
logic generates "probes" to invalidate old copies of the data at other
processors and, possibly, to request forwarding of the data from the owner
processor to the source processor. Here the operation completes only when
all probes reach their destination processors and, when required, the data
is received at the source processor.
Broadly stated, these prior systems rely on completion to impose
inter-reference ordering. For instance, in a weakly-ordered system
employing MB instructions, all pre-MB operations must be complete before
the MB is passed and post-MB operations may be considered. Completion of
an operation essentially requires actual completion of all activity,
including receipt of data and acknowledgments for probes, corresponding to
the operation. The TB miss flow routine described above creates an
inter-reference ordering issue if another processor substantially
simultaneously creates the PTE that will be read during the routine. For
example, consider the computer program example #3 wherein each page
contains one thousand (1000) address locations.
Computer program example #3
P1:                     P2:
Write @2000, data1      Read R1, @1000
Write @2001, data2      Read R2, @1001
MB                      Read R3, @2001
Write PTE (2000)        Read R4, @3001
                        Read R5, @4002
Processor P1 may be performing write data operations to update the data of
a page (e.g., page 2 having addresses 2000-2999) followed by an MB
operation and a write PTE operation to update the mapping information in
the PTE. The MB instruction imposes order such that creation of the page
data is completed before creation of the map of the page so that other
processors can access the page data. This creates an inter-reference
ordering problem if processor P2 substantially simultaneously attempts to
read an address (e.g., address 2001) of the page.
As shown in computer program example #4, P2 issues memory reference
operations to addresses 1000 and 1001 until it executes the read operation
to address 2001. At this point, P2 incurs a TB miss and jumps to a typical
TB miss handler routine. (Note that a TB miss may be handled entirely in
hardware or software and that the hardware implementation implicitly
performs the equivalent of an MB operation.)
##STR1##
In the TB miss flow routine, P2 fetches the PTE corresponding to address
2001, loads it into the TB and executes the MB instruction. P2 then
returns from the routine to read the data at address 2001 and, thereafter,
issues read operations to addresses 3001 and 4002. The MB instruction in
the TB miss flow is used to enforce inter-reference ordering between the
fetch of the PTE and subsequent read/write operations to addresses within
the page corresponding to the PTE. As noted, all pre-MB operations of a
typical weak-ordering system can be overlapped prior to reaching the MB
instruction with the only requirement being that each of the pre-MB
operations complete before passing the MB instruction. Such an arrangement
is inefficient and, in the context of the TB miss flow routine, adversely
affects latency because the MB unnecessarily constrains references to
addresses located in other pages.
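The extent of the unnecessary constraint can be seen by grouping P2's references from computer program example #3 by page, using the stated page size of one thousand address locations. The helper name `page_of` and the list names are illustrative assumptions.

```python
PAGE_SIZE = 1000  # address locations per page, as stated for example #3

def page_of(address):
    """Return the page number containing the given address."""
    return address // PAGE_SIZE

p2_reads = [1000, 1001, 2001, 3001, 4002]
missed_page = page_of(2001)  # page whose PTE is fetched by the TB miss flow

# Only references to the missed page need ordering against the PTE fetch.
same_page = [a for a in p2_reads if page_of(a) == missed_page]
other_pages = [a for a in p2_reads if page_of(a) != missed_page]
```

Of P2's five reads, only the read of address 2001 falls within the page mapped by the fetched PTE, yet an MB in the miss flow also orders the reads of pages 1, 3 and 4.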
SUMMARY OF THE INVENTION
The invention relates to a technique for selectively imposing
inter-reference ordering between memory reference operations issued by a
processor of a multiprocessor system to addresses within a page pertaining
to a page table entry (PTE) that is affected by a translation buffer (TB)
miss flow routine. The TB miss flow is used to retrieve information
contained in the PTE for mapping a virtual address to a physical address
and, subsequently, to allow retrieval of data at the mapped physical
address. According to the technique, the PTE that is retrieved in response
to a memory reference (read) operation is not loaded into the TB until a
commit-signal associated with that operation is returned to the processor.
Once the PTE and associated commit-signal are returned, the processor
loads the PTE into the TB so that it can be used for a subsequent read
operation directed to the data at the physical address. The inventive
technique improves performance significantly over the prior arrangement of
using a memory barrier (MB) instruction in the TB miss flow routine
because the technique does not constrain progress of memory references to
other pages.
In accordance with the invention, the commit-signal is used to order the
PTE read operation with respect to the data read operation. The
commit-signal is generated by control logic of the multiprocessor system
to essentially facilitate inter-reference ordering; specifically, the
commit-signal indicates the apparent completion of the memory reference
operation for memory reference ordering purposes, rather than actual
completion of the operation. It should be noted that the apparent
completion of an operation occurs substantially sooner than the actual
completion of an operation, thereby improving performance of the
multiprocessor system.
As described herein, the multiprocessing system may comprise (i) a
symmetric multiprocessing (SMP) node wherein the processor and shared
memory entities are interconnected by a local switch or (ii) a SMP system
wherein a plurality of nodes are interconnected by a hierarchical switch.
Each processor preferably has a private cache for storing data, and changes
to the data as a result of the memory reference operations are reflected
among the entities via the transmission of probe commands in accordance
with a conventional cache coherence protocol. Notably, associated with the
system is an ordering point. In the SMP node, the ordering point is
associated with the local switch, whereas in the SMP system, the ordering
point is associated with the hierarchical switch.
As an example of a SMP node with an ownership-based, write-invalidate cache
coherence protocol, a requesting processor issues a memory reference
operation to the system requesting particular data. Upon determining which
entity is the owner of the data and which entities have valid copies of
the data, the ordering point totally orders the memory reference operation
with respect to the other issued references using, e.g., a conventional
arbitration or prioritization policy. Thereafter, the ordering point
generates probes to invalidate any copies of the data at appropriate
processors and/or to request forwarding of the data from the owner
processor to the requesting processor, as required by the memory
operation. Significantly, the ordering point also generates the
commit-signal at this time for transmission to the requesting processor.
Probes and commit signals are generated atomically for transmission to the
appropriate processors. The net result is that all memory operations
appear totally ordered.
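The role of the ordering point described above may be sketched, under simplifying assumptions, as follows. The class `OrderingPoint`, the sequence counter, the message tuples and the restriction to a single store-type request that invalidates other copies are all hypothetical; the sketch is intended only to show probes and the commit-signal being generated together at the moment a request is ordered.

```python
from collections import deque

class OrderingPoint:
    def __init__(self):
        self.sequence = 0        # position in the total order
        self.outbound = deque()  # messages queued for transmission

    def order(self, requester, address, sharers):
        """Totally order one request; emit its probes and commit-signal together."""
        self.sequence += 1
        seq = self.sequence
        for cpu in sharers:
            if cpu != requester:
                self.outbound.append(("Inval", cpu, address, seq))
        # The commit-signal is generated at ordering time, not at completion.
        self.outbound.append(("ComSig", requester, address, seq))
        return seq

op = OrderingPoint()
s1 = op.order("P1", "x", ["P2", "P3"])
s2 = op.order("P2", "y", [])
messages = list(op.outbound)
```

Because all messages for one request are queued before the next request is ordered, every processor observes probes and commit-signals in a single, consistent order.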
Ordering of the requested memory reference operation with respect to memory
references issued by other processors of the system constitutes a
commit-event for the requested operation. For the SMP node embodiment, the
commit-event is the point at which the memory reference operation is
ordered at the local switch, whereas for the SMP system the commit-event
occurs when the memory reference operation is ordered at the hierarchical
switch. The commit-signal is preferably transmitted to the requesting
processor upon the occurrence of, or after, such a commit-event.
Advantageously, the invention optimizes a TB miss operation by obviating
the need for an MB operation to impose inter-reference ordering between
the selected memory reference operations of a TB miss flow; rather, the
inventive technique specifies retrieval of the PTE via a read operation
and return of a commit-signal associated with the read operation before
loading the TB with the retrieved PTE. This, in turn, improves the
efficiency and latency of the TB miss flow because references to addresses
located in other pages are not constrained by an MB. In another aspect of
the invention, a mechanism is provided for associating a commit-signal
with a memory reference operation.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and further advantages of the invention may be better understood
by referring to the following description in conjunction with the
accompanying drawings in which like reference numbers indicate identical
or functionally similar elements:
FIG. 1 is a schematic diagram of a prior art page table containing a
plurality of page table entries, each of which contains information for
mapping a virtual address to a physical address;
FIG. 2 is a schematic block diagram of a first multiprocessor node
embodiment comprising a plurality of processor modules coupled to an
input/output processor and a memory by a local switch;
FIG. 3 is a schematic diagram of a processor module comprising at least one
processor coupled to a data cache and a translation buffer;
FIG. 4 is a schematic block diagram of the local switch comprising a
plurality of ports coupled to the respective processor modules of FIG. 2;
FIG. 5 is a schematic diagram of an embodiment of a commit-signal
implemented as a commit-signal packet;
FIG. 6 is a schematic diagram of an illustrative embodiment of a processor
module including a miss address file (MAF) used for implementing a novel
technique for selectively imposing inter-reference ordering in accordance
with the invention;
FIG. 7 is a schematic diagram of the command packet having a MAF field
whose contents are used to implement the novel technique in accordance
with the invention;
FIG. 8 is a schematic block diagram of a second multiprocessing system
embodiment comprising a plurality of multiprocessor nodes interconnected
by a hierarchical switch which may be advantageously used with the present
invention;
FIG. 9 is a schematic block diagram of the hierarchical switch;
FIG. 10 is a schematic block diagram of an augmented multiprocessor node
comprising a plurality of processor modules interconnected with a shared
memory, an IOP and a global port interface via a local switch;
FIG. 11 illustrates an embodiment of a LoopComSig table;
FIG. 12 is a schematic diagram of an incoming command packet embodiment
modified with a multicast-vector; and
FIG. 13 is a schematic block diagram illustrating a total ordering property
of an illustrative embodiment of the hierarchical switch.
DETAILED DESCRIPTION OF THE ILLUSTRATIVE EMBODIMENTS
The novel technique described herein may apply to a large symmetric
multi-processing (SMP) system that includes a number of SMP nodes
interconnected by a switch, or any similar system embodiment having a SMP
node that comprises at least one processor module with a processor capable
of issuing memory reference operations and system control logic for
generating commit-signals in response to those issued operations. As
described herein, each SMP node functions as a building block in the SMP
system. The structure and operation of an SMP node embodiment that may be
advantageously used with the present invention is first described,
followed by a description of the SMP system embodiment.
SMP Node:
FIG. 2 is a schematic block diagram of a first multiprocessing system
embodiment, such as a small SMP node 200, comprising a plurality of
processor modules (P) 302-308 coupled to an input/output (I/O) processor
230 and a memory 250 by a local switch 400. The memory 250 comprises
storage locations that are preferably organized as a single address space
that is shared by the processors for storing application programs and data
associated with the inventive technique described herein. The I/O
processor, or IOP 230, controls the transfer of data between external
devices (not shown) and the system via an I/O bus 240. Data is transferred
between the components of the SMP node in the form of packets. As used
herein, the term "system" refers to all components of the SMP node
excluding the processors and IOP. In an embodiment of the invention, the
I/O bus may operate according to the conventional Peripheral Component
Interconnect (PCI) protocol.
FIG. 3 is a schematic diagram of a processor module 300 comprising at least
one processor 310 coupled to a data cache 350 via a translation buffer
(TB) 370. The processor 310 is a modern processor comprising a central
processing unit (CPU) that preferably incorporates a traditional reduced
instruction set computer (RISC) load/store architecture. In the
illustrative embodiment described herein, the CPUs are Alpha® 21264
processor chips manufactured by Digital Equipment Corporation®,
although other types of processor chips may be advantageously used. For
example, in an alternate embodiment, each processor may comprise a
multi-threaded modern processor capable of executing multiple threads of
instructions.
The cache 350 stores data determined likely to be accessed in the future
and is preferably organized as a write-back cache apportioned into, e.g.,
64-byte cache lines accessible by the processors. It should be noted,
however, that other cache organizations, such as write-through caches, may
be used in connection with the principles of the invention. It should be
further noted that memory reference operations issued by the processors
are preferably directed to a 64-byte cache line granularity. Since each
processor may update data from its cache without updating shared memory
250, a cache coherence protocol is utilized to maintain consistency among
the caches.
The cache coherence protocol of the illustrative embodiment is preferably a
conventional write-invalidate, ownership-based protocol.
"Write-Invalidate" implies that when a processor modifies a cache line, it
invalidates stale copies in other processors' caches rather than updating
them with the new value. The protocol is termed an "ownership protocol"
because there is always an identifiable owner for a cache line, whether it
is shared memory, one of the processors or the IOP entities of the system.
The owner of the cache line is responsible for supplying the up-to-date
value of the cache line when requested. A processor/IOP may own a cache
line in one of two states: "exclusive" or "shared". If a processor has
exclusive ownership of a cache line, it may update it without informing
the system. Otherwise, it must inform the system and potentially
invalidate copies in the other caches.
A shared data structure is provided for capturing and maintaining status
information corresponding to the states of data used by the system. In the
illustrative embodiment, the shared data structure is configured as a
conventional duplicate tag store (DTAG) 260 that cooperates with the
individual caches of the system to define the coherence protocol states of
the data in the system. The protocol states of the DTAG 260 are
administered by a coherence controller 280, which may be implemented as a
plurality of hardware registers and combinational logic configured to
produce a sequential logic circuit, such as a state machine. It should be
noted, however, that other configurations of the controller and shared
data structure may be advantageously used herein.
The DTAG 260, coherence controller 280, IOP 230 and shared memory 250 are
interconnected by a logical bus referred to as Arb bus 270. Memory
reference operations issued by the processors are routed via the local
switch 400 to the Arb bus 270. The order in which the actual memory
reference commands appear on the Arb bus is the order in which processors
perceive the results of those commands. In accordance with this embodiment
of the invention, though, the Arb bus 270 and the coherence controller 280
cooperate to provide an ordering point, as described herein.
Each application program running on the processor 310 executes in a virtual
address space using virtual addresses. When a program references a virtual
address in virtual memory, the processor calculates the corresponding main
memory physical address using TB 370. The TB contains a plurality of page
table entries (PTEs) 375, each of which includes a tag (TAG) field for
holding portions of the virtual address and a data field for holding
mapping information, such as a page frame number (PFN). When the processor
310 executes an instruction (such as a load/store instruction) for data,
the virtual address of the instruction is provided to the TB 370 and, if
there is a match with the contents of the tag field, the virtual address
is translated into a physical address which is used to access the data
cache 350. If there is not a match between the virtual address and
contents of the tag field, a TB miss occurs.
The load/store instruction executed by the processor is issued to the
system as a memory reference, e.g., read, operation in response to the TB
miss. A characteristic of a modern processor is the ability to issue memory
reference operations out-of-order, to have more than one memory reference
outstanding at a time and to accommodate completion of the memory
reference operations in arbitrary order. Each operation may comprise a
series of commands (or command packets) that are exchanged between the
processors and the system. The commands described herein are defined by
the Alpha® memory system interface and may be classified into three
types: requests, probes, and responses. Requests are commands that are
issued by a processor when, as a result of executing a load or store
instruction, it must obtain a copy of data. Requests are also used to gain
exclusive ownership to a data item (cache line) from the system. Requests
include Read (Rd) commands, Read/Modify (RdMod) commands, Change-to-Dirty
(CTD) commands, Victim commands, and Evict commands, the latter of which
specify removal of a cache line from a respective cache.
Probes are commands issued by the system to one or more processors
requesting data and/or cache status updates for a specific cache line
identified in the probe command. Probes include ForwardedRead (FRd)
commands, ForwardedReadModify (FrdMod) commands and Invalidate (Inval)
commands. A FRd probe is sent to a processor to obtain the current value
of a cache line from its owner processor. The FRd probe causes the owner
processor to return data corresponding to the cache line and to change its
state for that cache line to "shared". A FrdMod probe is sent to a
processor to obtain an exclusive copy of a cache line from its owner
processor. The FrdMod probe causes the owner processor to return data
corresponding to the cache line and to change its state for the line to
"invalid". An Inval probe causes a processor to change its state for the
specific cache line to "invalid".
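The effect of each probe on the receiving processor's cache line, as described above, may be summarized by the following sketch. The dictionary representation of a cache line and the function name `apply_probe` are assumptions made for illustration.

```python
# Resulting cache-line state at the probed processor, per the text above.
PROBE_NEXT_STATE = {"FRd": "shared", "FrdMod": "invalid", "Inval": "invalid"}
# Probes that cause the owner processor to return the line's data.
PROBES_RETURNING_DATA = {"FRd", "FrdMod"}

def apply_probe(probe, line):
    """Apply a probe to a cache line dict with 'state' and 'data' keys.

    Returns the data forwarded by the owner, or None if no data is returned.
    """
    returned = line["data"] if probe in PROBES_RETURNING_DATA else None
    line["state"] = PROBE_NEXT_STATE[probe]
    return returned

line_a = {"state": "exclusive", "data": 17}
line_b = {"state": "exclusive", "data": 23}
line_c = {"state": "shared", "data": 5}
out_a = apply_probe("FRd", line_a)
out_b = apply_probe("FrdMod", line_b)
out_c = apply_probe("Inval", line_c)
```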
Responses are commands from the system to processors/IOPs which carry the
data requested by the processor or an acknowledgment corresponding to a
request. For Rd and RdMod requests, the response is a Fill and FillMod
response, respectively, each of which carries the requested data. For a
CTD request, the response is a CTD-Success (Ack) or CTD-Failure (Nack)
response, indicating success or failure of the CTD, whereas for a Victim
request, the response is a Victim-Release response.
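The pairing of requests and responses described above may be captured in a small lookup, sketched below. The table name `RESPONSES`, the function `response_for` and its `ctd_succeeded` parameter are illustrative assumptions; the Evict command is omitted because the text does not describe a response for it.

```python
# Request command -> possible response commands, per the text above.
RESPONSES = {
    "Rd": ("Fill",),
    "RdMod": ("FillMod",),
    "CTD": ("CTD-Success", "CTD-Failure"),
    "Victim": ("Victim-Release",),
}

def response_for(request, ctd_succeeded=True):
    """Return the response command that the system sends for a request."""
    options = RESPONSES[request]
    if request == "CTD":
        return options[0] if ctd_succeeded else options[1]
    return options[0]

replies = [response_for(r) for r in ("Rd", "RdMod", "CTD", "Victim")]
nack = response_for("CTD", ctd_succeeded=False)
```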
FIG. 4 is a schematic block diagram of the local switch 400 comprising a
plurality of ports 402-410, each of which is coupled to a respective
processor module (P) 302-308 and IOP 230 via a full-duplex, bidirectional
clock forwarded data link. Each port includes a respective input queue
412-420 for receiving, e.g., a memory reference request issued by its
processor module and a respective output queue 422-430 for receiving,
e.g., a memory reference probe issued by system control logic associated
with the switch. An arbiter 440 arbitrates among the input queues to grant
access to the Arb bus 270 where the requests are ordered into a memory
reference request stream. In the illustrative embodiment, the arbiter
selects the requests stored in the input queues for access to the bus in
accordance with an arbitration policy, such as a conventional round-robin
algorithm.
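A conventional round-robin arbitration among the input queues may be sketched as follows. The class `RoundRobinArbiter`, its method names and the three-port configuration in the usage lines are assumptions for illustration and are not taken from FIG. 4.

```python
from collections import deque

class RoundRobinArbiter:
    def __init__(self, num_ports):
        self.queues = [deque() for _ in range(num_ports)]  # one input queue per port
        self.next_port = 0                                 # port with highest priority

    def enqueue(self, port, request):
        self.queues[port].append(request)

    def grant(self):
        """Return (port, request) from the next non-empty queue, or None."""
        n = len(self.queues)
        for step in range(n):
            port = (self.next_port + step) % n
            if self.queues[port]:
                self.next_port = (port + 1) % n  # rotate priority past the winner
                return port, self.queues[port].popleft()
        return None

arb = RoundRobinArbiter(3)
arb.enqueue(0, "a1")
arb.enqueue(0, "a2")
arb.enqueue(2, "c1")
stream = [arb.grant() for _ in range(4)]
```

The sequence of grants forms the memory reference request stream; a port with several queued requests cannot starve the others because priority rotates after every grant.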
The following example illustrates the typical operation of a multiprocessing
system including switch 400. A Rd request for data item x is received at
the switch 400 from the processor 310 of, e.g., processor module 302 and
loaded into input queue 412. The arbiter 440 selects the request in
accordance with the arbitration algorithm. Upon gaining access to the Arb
bus 270, the selected request is routed to the ordering point 450 wherein
the states of the corresponding cache lines are interrogated in the DTAG
260. Specifically, the coherence controller 280 examines the DTAG to
determine which entity of the system "owns" the cache line and which
entities have copies of the line. If a processor of module 306 is the
owner of the cache line x and a processor of module 308 has a copy, the
coherence controller generates the necessary probes (e.g., a Fill x and
Inval x) and forwards them to the output queues 426 and 428 for
transmission to the processors.
Because of operational latencies through the switch and data paths of the
system, memory reference requests issued by a processor may complete
out-of-order. In some cases, out-of-order completion may affect the
consistency of data in the system, particularly for updates to a cache
line. Memory consistency models provide formal specifications of how such
updates become visible to the entities of the multiprocessor system. In a
weakly-ordered system, inter-reference ordering is typically imposed by a
memory barrier (MB) instruction inserted between memory reference
instructions of a program executed by a processor. The MB instruction
separates and groups those instructions of a program that need ordering
from the rest of the instructions. The semantics of weak ordering mandate
that all pre-MB memory reference operations are logically ordered before
all post-MB references.
To ensure correct implementation of the consistency model, prior systems
inhibit program execution past the MB instruction until actual completion
of all pre-MB operations has been confirmed to the processor. Maintaining
inter-reference order from all pre-MB operations to all post-MB operations
typically requires acknowledgment responses and/or return data to signal
completion of the pre-MB operations. The acknowledgment responses may be
gathered and sent to the processor issuing the operations. The pre-MB
operations are considered completed only after all responses and data are
received by the requesting processor. It should be noted that the
techniques described herein apply to systems employing variants of the MB
instruction, such as an Instruction-MB and a Write-MB.
In the case of the TB miss described above, the processor 310 issues a read
request operation to the system to acquire the appropriate PTE mapping
information and load it into the TB. The read operation for the PTE is
generally followed by a subsequent read operation for data at the mapped
physical address. This TB miss flow sequence creates an inter-reference
ordering issue if another processor substantially simultaneously creates
the PTE that will be read during the TB miss flow. A prior art solution to
this problem is to insert an MB instruction in the TB miss flow to enforce
inter-reference ordering between the fetch of the PTE and subsequent
read/write operations to addresses within the page corresponding to the
PTE. (See computer program examples #3, #4.) In a typical weak-ordering
system, all pre-MB operations can be overlapped prior to reaching the MB
instruction with the only requirement being that each of the pre-MB
operations complete before passing the MB instruction. Such an arrangement
is inefficient and, in the context of the TB miss flow routine, adversely
affects latency because the MB unnecessarily constrains references to
addresses located in other pages.
The present invention relates to a technique for selectively imposing
inter-reference ordering between memory reference operations issued by a
processor to addresses within a page pertaining to a PTE that is affected
by a TB miss flow routine. The TB miss flow is used to retrieve
information contained in the PTE for mapping a virtual address to a
physical address and, subsequently, to allow retrieval of data at the
mapped physical address. According to the technique, the PTE 375 that is
retrieved in response to a memory reference (read) operation is not loaded
into the TB 370 until a commit-signal associated with that operation is
returned to the processor.
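The deferred TB load described above may be sketched as follows. The class `TbMissFlow`, its method names and the per-page bookkeeping are hypothetical; the sketch assumes that the fill data and the commit-signal for the PTE read may arrive in either order and shows only that the TB is loaded once both have been received.

```python
class TbMissFlow:
    def __init__(self):
        self.tb = {}       # VPN -> PFN entries usable by later references
        self.pending = {}  # VPN -> {"pfn": ..., "committed": bool}

    def issue_pte_read(self, vpn):
        """Record an outstanding PTE read issued in response to a TB miss."""
        self.pending[vpn] = {"pfn": None, "committed": False}

    def on_fill(self, vpn, pfn):
        """PTE data has been returned for the read."""
        self.pending[vpn]["pfn"] = pfn
        self._maybe_load(vpn)

    def on_commit_signal(self, vpn):
        """Commit-signal for the PTE read has been returned."""
        self.pending[vpn]["committed"] = True
        self._maybe_load(vpn)

    def _maybe_load(self, vpn):
        entry = self.pending[vpn]
        if entry["pfn"] is not None and entry["committed"]:
            self.tb[vpn] = entry["pfn"]  # load the TB only now
            del self.pending[vpn]

flow = TbMissFlow()
flow.issue_pte_read(2)
flow.on_fill(2, 41)
loaded_before_commit = 2 in flow.tb
flow.on_commit_signal(2)
loaded_after_commit = flow.tb.get(2)
```

Only references that need the entry for page 2 wait on its commit-signal; references that hit on other TB entries are unaffected, which is the selectivity that an MB in the miss flow lacks.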
FIG. 5 is a schematic diagram of an embodiment of a commit-signal that is
preferably implemented as a special, probe-type commit-signal packet 500.
It will be apparent to those skilled in the art that the commit-signal 500
may be manifested in a variety of forms, including a discrete signal on a
wire, or, in another embodiment, a packet identifying the operation
corresponding to the commit-signal. In the illustrative embodiment, the
commit-signal packet 500 comprises a command field 502, a source
identifier (SRC ID) field 504, a destination identifier (DST ID) field 506
and an address field 508. Additionally, the packet includes a ComSig field
510 which, in one embodiment, is characterized by the assertion of a
single, commit-signal ("C") bit 512 and, in another embodiment, includes a
valid ("V") bit 514 and a ComSig MAF field 516 whose contents identify the
corresponding operation, as described below. An example of a commit-signal
that is suitable for use in the illustrative embodiment of the invention
is described in copending and commonly-assigned U.S. patent application
Ser. No. 08/957,097 titled, Method and Apparatus for Reducing Latency of
Inter-Reference Ordering in a Multiprocessor System, which application is
hereby incorporated by reference as though fully set forth herein.
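The fields of the commit-signal packet 500 may be represented, for illustration, by the following sketch. The Python types, default values and the helper `signals_commit` are assumptions; the text names the fields but does not specify their widths or encodings.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CommitSignalPacket:
    command: str                      # command field 502
    src_id: int                       # SRC ID field 504
    dst_id: int                       # DST ID field 506
    address: int                      # address field 508
    c_bit: bool = False               # ComSig field 510, first embodiment: "C" bit 512
    valid: bool = False               # second embodiment: "V" bit 514
    comsig_maf: Optional[int] = None  # second embodiment: ComSig MAF field 516

    def signals_commit(self):
        """True if either embodiment of the ComSig field marks a commit."""
        return self.c_bit or (self.valid and self.comsig_maf is not None)

pkt_c = CommitSignalPacket("ComSig", src_id=0, dst_id=2, address=0x1000, c_bit=True)
pkt_maf = CommitSignalPacket("ComSig", src_id=0, dst_id=2, address=0x1000,
                             valid=True, comsig_maf=5)
pkt_none = CommitSignalPacket("Inval", src_id=0, dst_id=2, address=0x1000)
```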
Ordering point control logic 450 of the multiprocessor