A method of improving memory latency associated with a read-type operation in a multiprocessor computer system is disclosed. After a value (data or instruction) is loaded from system memory into a cache, the cache is marked as containing an exclusively held, unmodified copy of the value. When a requesting processing unit issues a message indicating that it desires to read the value, the cache snoops the message from an interconnect which is connected to the requesting processing unit and transmits a response indicating that the cache can source the value. The response is detected by system logic and forwarded from the system logic to the requesting processing unit. The cache then sources the value to an interconnect which is connected to the requesting processing unit. The system memory also detects the message and would normally source the value, but the response informs the memory device that the value is to be sourced by the cache instead. Since the cache latency can be much less than the memory latency, the read performance can be improved substantially with this new protocol.
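The snoop-and-respond flow in the abstract above can be illustrated with a minimal sketch. All names here (`State`, `Cache`, `read`) are hypothetical, and the downgrade to a shared state after sourcing is a conventional MESI assumption that the abstract does not state.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    EXCLUSIVE = "E"  # exclusively held, unmodified copy
    MODIFIED = "M"


@dataclass
class Cache:
    name: str
    lines: dict  # address -> (State, value)

    def snoop_read(self, addr):
        """Return True if this cache can source the value for a snooped read."""
        state, _ = self.lines.get(addr, (State.INVALID, None))
        return state is State.EXCLUSIVE


def read(addr, snoopers, memory):
    """Hypothetical system logic: collect snoop responses and pick the source."""
    for cache in snoopers:
        if cache.snoop_read(addr):
            _, value = cache.lines[addr]
            # Assumption: once another unit holds the value, the line is no
            # longer exclusive, so the sourcing cache downgrades it to SHARED.
            cache.lines[addr] = (State.SHARED, value)
            return value, cache.name
    # No cache offered to intervene, so system memory sources the value.
    return memory[addr], "memory"
```

In this sketch the first read of an exclusively held line is served by the cache, and system memory serves the value only when no cache responds that it can source it.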
A processor (300) in a distributed shared memory system (10) has ownership of a cache line. The processor modifies the cache line and wishes to update the home memory (17) of the cache line with the modification. The processor (300) generates a return request for routing by a processor interface (24). Meanwhile, a second processor (400) wishes to obtain ownership of the cache line and sends a read request to a memory directory (22) associated with the home memory (17) of the cache line. The memory directory (22) generates an intervention request towards the processor interface (24) corresponding to the last known location of the cache line. By this point the processor interface (24) has forwarded the return request to the memory directory (22), but after the read request from the second processor (400). Rather than waiting for an acknowledgment from the memory directory (22) that the return request has been processed, the processor interface (24) sends the second processor (400) an intervention response that includes the modified cache line.
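One way to picture the race described above is a processor interface that remembers return requests the directory has not yet acknowledged. This is only a sketch under assumptions: the class, the message tuples, and the `intervention_nack` reply are hypothetical and do not come from the abstract.

```python
class ProcessorInterface:
    """Hypothetical model of the processor interface (24)."""

    def __init__(self):
        # addr -> modified data whose return request awaits a directory ack
        self.pending_returns = {}

    def forward_return_request(self, addr, data):
        """Forward a return request toward the memory directory and track it."""
        self.pending_returns[addr] = data
        return ("return_request", addr, data)

    def on_return_ack(self, addr):
        """The directory has processed the return request; stop tracking it."""
        self.pending_returns.pop(addr, None)

    def on_intervention(self, addr, requester):
        """Answer an intervention request from the memory directory."""
        if addr in self.pending_returns:
            # Do not wait for the ack: reply with the modified line directly.
            return ("intervention_response", requester, self.pending_returns[addr])
        # Assumption: with nothing pending, the interface has no data to offer.
        return ("intervention_nack", requester, None)
```

The point of the sketch is the first branch of `on_intervention`: the modified line reaches the second processor without a round trip for the acknowledgment.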
A non-uniform memory access (NUMA) computer system includes first and second processing nodes that are coupled together. The first processing node includes a system memory and first and second processors that each have a respective associated cache hierarchy. The second processing node includes at least a third processor and a system memory. If the cache hierarchy of the first processor holds an unmodified copy of a cache line and receives a request for the cache line from the third processor, the cache hierarchy of the first processor sources the requested cache line to the third processor and retains a copy of the cache line in a Recent coherency state from which the cache hierarchy of the first processor can source the cache line in response to subsequent requests.
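The retention of a Recent copy by the sourcing cache hierarchy might be modeled as follows. The state letters, the choice of which states count as "unmodified and able to source" (`E` and `R`), and the requester receiving the line in `S` are assumptions made for illustration only.

```python
class CacheHierarchy:
    """Hypothetical cache hierarchy holding addr -> (state, data)."""

    def __init__(self, name):
        self.name = name
        self.lines = {}

    def snoop_read(self, addr):
        """Source the line if held unmodified, keeping a copy in Recent (R)."""
        state, data = self.lines.get(addr, ("I", None))
        if state in ("E", "R"):  # assumption: unmodified states that may source
            self.lines[addr] = ("R", data)
            return data
        return None


def remote_read(requester, holder, addr):
    """Request a line from another hierarchy; returns the data or None."""
    data = holder.snoop_read(addr)
    if data is not None:
        # Assumption: the requester receives a plain shared copy.
        requester.lines[addr] = ("S", data)
    return data
```

Because the holder stays in `R` after sourcing, a later request from yet another processor can be served from the same cache hierarchy.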
A non-uniform memory access (NUMA) computer system includes first and second processing nodes that are each coupled to a node interconnect. The first processing node includes a system memory and first and second processors that each have a respective one of first and second cache hierarchies, which are coupled for communication by a local interconnect. The second processing node includes at least a system memory and a third processor having a third cache hierarchy. The first cache hierarchy and the third cache hierarchy are permitted to concurrently store an unmodified copy of a particular cache line in a Recent coherency state from which the copy of the particular cache line can be sourced by shared intervention. In response to a request for the particular cache line by the second cache hierarchy, the first cache hierarchy sources a copy of the particular cache line to the second cache hierarchy by shared intervention utilizing communication on only the local interconnect and without communication on the node interconnect.
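The benefit of allowing one Recent copy per node can be sketched by counting interconnect traffic. The `Hierarchy` type, the `traffic` counters, and the states assigned to the requester are hypothetical; the sketch only shows that a local Recent copy satisfies the request without touching the node interconnect.

```python
from dataclasses import dataclass


@dataclass
class Hierarchy:
    name: str
    node: int
    state: str = "I"
    data: object = None


def read_line(requester, hierarchies, traffic):
    """Return the name of the hierarchy that sources the line, or None."""
    traffic["local"] += 1  # request is snooped on the requester's local interconnect
    for h in hierarchies:
        if h is not requester and h.node == requester.node and h.state == "R":
            # Shared intervention within the node; assumption: requester gets S.
            requester.state, requester.data = "S", h.data
            return h.name
    traffic["node"] += 1  # no local Recent copy: the request crosses the node interconnect
    for h in hierarchies:
        if h.node != requester.node and h.state == "R":
            # Assumption: the requester becomes its own node's Recent holder.
            requester.state, requester.data = "R", h.data
            return h.name
    return None
```

With Recent copies in both nodes, a request from the second cache hierarchy is answered by the first one and the node-interconnect counter stays at zero.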
A processor (500) issues a read request for data. A processor interface (24) initiates a local search for the requested data and also forwards the read request to a memory directory (22) for processing. While the read request is being processed, the processor interface (24) can determine if the data is available locally. If so, the data is transferred to the processor (500) for its use. The memory directory (22) processes the read request and generates a read response therefrom. The processor interface (24) receives the read response and determines whether the data was available locally. If so, the read response is discarded. If the data was not available locally, the processor interface (24) provides the read response to the processor (500).
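The overlap of the local search with the directory request can be sketched sequentially. The class, its `local_store` dictionary, and the bookkeeping in `found_locally` are hypothetical stand-ins; a real interface would run the two paths concurrently.

```python
class ProcessorInterface:
    """Hypothetical model of the processor interface (24)."""

    def __init__(self, local_store):
        self.local_store = local_store  # addr -> data reachable by the local search
        self.found_locally = {}         # addr -> outcome of the local search

    def read(self, addr):
        """Start the local search; the request is also sent to the directory.

        Returns the data if the local search succeeds, otherwise None.
        """
        hit = addr in self.local_store
        self.found_locally[addr] = hit
        return self.local_store[addr] if hit else None

    def on_read_response(self, addr, data):
        """Handle the directory's read response for an outstanding request."""
        if self.found_locally.pop(addr, False):
            return None  # data was already delivered locally: discard the response
        return data      # otherwise pass the response on to the processor
```

The directory request is issued in either case, so a local miss costs no extra latency, while a local hit delivers the data early and the late response is simply dropped.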
The object of the present invention is to reduce data communication between the LSI chip and external components in multiprocessor machines, chip multiprocessor systems in particular, and to avoid restrictions on communication volume resulting from the LSI pin count. Sets in the tag and data blocks of a shared cache include a shared bit S. When data is replaced on a cache miss, the contents of the shared bit S are checked and the side whose shared bit S is 0 in the tag and data block is selected for data replacement. This allows data shared by a plurality of processors to be left in the shared cache, so that data transfer between the shared cache and the main memory can be reduced.
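The replacement rule above amounts to preferring victims whose shared bit is 0. A minimal sketch follows; the `Way` type, the least-recently-used tie-break, and the fallback when every candidate is shared are assumptions, since the abstract specifies only the shared-bit check.

```python
from dataclasses import dataclass


@dataclass
class Way:
    tag: int
    shared: bool    # shared bit S: True when the data is shared by several processors
    last_used: int  # hypothetical timestamp for an LRU tie-break


def choose_victim(ways):
    """Return the index of the entry to replace within one cache set."""
    unshared = [i for i, w in enumerate(ways) if not w.shared]
    # Assumption: if every entry is shared, fall back to considering all of them.
    candidates = unshared or list(range(len(ways)))
    # Assumption: among the candidates, evict the least recently used entry.
    return min(candidates, key=lambda i: ways[i].last_used)
```

For example, `choose_victim([Way(1, True, 0), Way(2, False, 5)])` selects index 1: the shared entry is kept even though it was used less recently.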