WikiPatents - Community Patent Review
Multiprocessor system with partial broadcast capability of a cache coherent processing request    
United States Patent 6038644
Link to this page: http://www.wikipatents.com/6038644.html
Inventor(s): Irie; Naohiko (Kawasaki, JP), Hamanaka; Naoki (Tokyo, JP), Shibata; Masabumi (Kawasaki, JP)
Abstract: Information indicative of whether each processor unit caches data which belongs to each of the plural areas of the main memory larger than a cache line is stored in the multicast table. The destinations of a coherent processing request which should be sent to other processor units are limited by the information stored in this table. The interconnection network broadcasts the request to the limited destinations. When the processor unit of the destination of this processing request sends back a cache status of the data designated by the request, it also sends back the caching status in the processor unit concerning a specific memory area which includes the data. Depending on this send back, the request source processor unit renews a portion relating to the destination processor unit within the caching status concerning that specific memory area stored in the processor unit.

Owner/Assignee: Hitachi, Ltd. (Tokyo, JP)
Publication Date: March 14, 2000
Application Number: 08/820,831
Filing Date: March 19, 1997
US Classification: 711/141 711/144 711/146 711/148
Examiner: Peikari; B. James
Attorney/Law Firm: Antonelli, Terry, Stout & Kraus, LLP
Priority Data: Mar 19, 1996 [JP] 8-062452
USPTO Field of Search: 711/144 711/146 711/141 711/148 395/311 395/553 395/800.16 710/33 709/204 709/217 709/248
References

U.S. References
Patent No.   Inventor           Date
5829040      Son                Oct. 1998
5826049      Ogata et al.       Oct. 1998
5742766      Takeuchi et al.    Apr. 1998
5598550      Shen et al.        Jan. 1997
5581777      Kim et al.         Dec. 1996
5559987      Foley et al.       Sep. 1996
5386511      Murata et al.      Jan. 1995

Claims
 


What is claimed is:

1. A multiprocessor system comprising:

a plurality of processor units each including at least one processor;

at least one memory unit shared by said plurality of processor units;

a cache memory provided in correspondence to each processor unit;

a logic provided in correspondence to each processor unit and responsive to a memory access request therefrom for generating a first cache coherent processing request related to a first memory address designated by said memory access request;

a caching status memory provided in correspondence to each processor unit for storing a processor unit caching status for discriminating whether each of said plurality of processor units holds data belonging to each of a plurality of memory areas belonging to said memory unit;

a destination information generation logic provided in correspondence to each processor unit for generating destination information designating part of said plurality of processor units which hold at least one data belonging to a first memory area to which said first memory address belongs; and

an interconnection network for connecting said plurality of processor units and said memory unit, said interconnection network comprising a parallel transfer network for transferring said first cache coherent processing request to said part of said plurality of processor units in response to said destination information.

2. A multiprocessor system according to claim 1, wherein each processor unit further comprises:

an update logic for said caching status memory;

a response logic responsive to a second cache coherent processing request transferred from another processor unit by way of said interconnection network for sending to said interconnection network, a portion addressed to said another processor unit, which is related to a second memory area including a second address designated by said second cache coherent processing request and to said each processor unit, within said stored processor unit caching status;

wherein said update logic for said caching status memory updates a portion related to said each processor unit within said stored processor unit caching status, in response to change of memory areas to which a plurality of valid data held in said cache memory belong;

wherein said update logic for said caching status memory further updates portions related to said first memory area and each related to one of said part of said plurality of processor units, within said stored processor unit caching status, in response to portions each of which has been transferred in response to said first cache coherence processing request from one of said part of said plurality of processor units by way of said interconnection network, and each of which is related to said first memory area and one of said part of said plurality of processor units.

3. A multiprocessor system according to claim 1, further comprising:

a logic for generating, as said destination information, destination information designating said plurality of processor units, when said processor unit caching status stored in said caching status memory does not include a portion related to said first memory area to which said first memory address belongs.

4. A multiprocessor system according to claim 1,

wherein said caching status memory includes a plurality of entries;

wherein each entry is provided in correspondence to one of a plurality of memory areas;

wherein each entry has a plurality of fields respectively corresponding to said plurality of said processor units;

wherein each field has information discriminating whether a processor unit corresponding to said each field caches data belonging to a memory area corresponding to said each entry.

5. A multiprocessor system according to claim 4, wherein said information included in each field of each entry of said caching status memory includes information of one bit indicative of whether a processor unit corresponding to said each field caches data belonging to a memory area corresponding to said each entry, thereby each entry has a bitmap indicative of whether said plurality of processor units cache data belonging to a memory area corresponding to said entry.

6. A multiprocessor system according to claim 5, wherein said destination information generation logic comprises a logic which supplies said bitmap stored in an entry corresponding to said first memory area to which said first memory address belongs, as said destination information.

7. A multiprocessor system according to claim 2,

wherein said caching status memory includes a plurality of entries;

wherein each entry is provided in correspondence to one of a plurality of memory areas;

wherein each entry has a plurality of traffic count fields respectively corresponding to said plurality of said processor units;

wherein said update logic includes a logic for controlling whether a traffic account in a field corresponding to each of said part of said plurality of processor units, within an entry corresponding to said first memory area, within said plurality of entries within said caching status memory, depending upon portions each of which has been transferred in response to said first cache coherence processing request from one of said part of said plurality of processor units by way of said interconnection network, and each of which is related to said first memory area and one of said part of said plurality of processor units.

8. A multiprocessor system according to claim 7, wherein said caching status memory is provided in a translation look aside buffer for translating a virtual address designated by said processor within said each processor unit into a real address.

9. A multiprocessor system according to claim 1, wherein said parallel transfer network comprises:

a logic for judging whether a plurality of pieces of destination information attached to a plurality of cache coherent processing requests generated by plural processor units designate a same processor unit; and

a logic for starting transfer of said plurality of cache coherent processing requests at a same timing in parallel, when said plurality of pieces of destination information does not designate a same processor unit.

10. A multiprocessor system according to claim 9, wherein said parallel transfer network comprises a crossbar switch.

11. A multiprocessor system, comprising:

a plurality of processor units each including at least one processor;

at least one memory unit shared by said plurality of processor units;

an interconnection network for connecting said plurality of processor units and said memory unit;

wherein each processor unit comprises:

a cache memory for holding a copy of a plurality of pieces of data held in said memory unit and cache statuses of respective pieces of data,

a caching status memory for storing a processor unit caching status for discriminating whether each of said plurality of processor units holds data belonging to each of a plurality of memory areas belonging to said memory unit,

a request send logic connected to said processor, said caching status memory and said interconnection network and responsive to a memory access request which is issued from said processor and requests first data of a first memory address, for sending a first cache coherent processing request related to said first memory address, by way of said interconnection network, to part of said plurality of processor units as discriminated by said stored processor unit caching status to a first memory area to which said first memory address belongs;

a response send logic connected to said cache memory, said caching status memory and said interconnection network and responsive to a second cache coherent processing request related to a second memory address and transferred from another of said plurality of processor units by way of said interconnection network, for sending said another processor unit by way of said interconnection network, a cache status in said cache memory, of second data of said second memory address and a portion of said stored processor unit caching status, related to caching in said each processor unit, of data belonging to a second memory area to which said second memory address belongs, and

an update logic connected to said cache memory and said interconnection network for updating said caching status memory, in response to change of memory areas to which a plurality of pieces of valid data held in said cache memory belong and in response to portions of processor unit caching statuses as included in a plurality of responses transferred in response to said first cache coherent processing request from said part of said plurality of processor units by way of said interconnection network.

12. A multiprocessor system according to claim 11,

wherein said caching status memory stores said processor unit caching status in correspondence to one of a plurality of memory areas and a number of pieces of valid data belonging to said one memory area, as stored in said cache memory;

wherein said update logic comprises:

a logic for updating numbers of pieces of valid data stored in said caching status memory in response to change of a number of pieces of valid data belonging to each of said plurality of memory areas, as stored in said cache memory;

a logic for invalidating said processor unit caching status corresponding to one of said plurality of memory areas, as stored in said caching status memory, when said number of pieces of valid data corresponding to said one memory area, as stored in said caching status memory has become 0.

13. A multiprocessor system according to claim 12, further comprising:

a logic responsive to storing of another data belonging to another of said plurality of memory areas in said cache memory, for replacing both said invalidated processor unit caching status corresponding to said one memory area and said number of pieces of valid data corresponding to said one memory areas both stored in said caching status memory, by a processor caching status corresponding to said another memory area and a number of pieces of valid data corresponding to said another memory area.

14. A multiprocessor system according to claim 11,

wherein said request send logic comprises:

a destination information generation logic responsive to said memory access request from said processor for generating destination information designating part of said plurality of processor units which hold at least one data belonging to said first memory area to which said first memory address designated by said memory access request belongs, said generating being executed, based upon said processor caching status stored in corresponding to said first memory area; and

a send logic for sending said interconnection network said first coherent processing request and said generated destination information;

wherein said interconnection network comprises a parallel transfer network for transferring said first cache coherent processing request to said part of said plurality of processor units in parallel, in response to said destination information.

15. A multiprocessor system according to claim 14,

wherein said caching status memory includes a plurality of entries;

wherein each entry is provided in correspondence to one of a plurality of memory areas;

wherein each entry has a plurality of fields respectively corresponding to said plurality of said processor units;

wherein each field has information discriminating whether a processor unit corresponding to said each field caches data belonging to a memory area corresponding to said each entry.

16. A multiprocessor system according to claim 15, wherein said information included in each field of each entry of said caching status memory includes information of one bit indicative of whether a processor unit corresponding to said each field caches data belonging to a memory area corresponding to said each entry, thereby each entry has a bitmap indicative of whether said plurality of processor units cache data belonging to a memory area corresponding to said entry.

17. A multiprocessor system according to claim 14, wherein said destination information generation logic comprises a logic which supplies said bitmap stored in an entry corresponding to said first memory area to which said first memory address belongs, as said destination information.

18. A multiprocessor system according to claim 14, wherein said parallel transfer network includes:

a logic for judging whether a plurality of pieces of destination information attached to a plurality of cache coherent processing requests generated by plural processor units designate a same processor unit; and

a logic for starting transfer of said plurality of cache coherent processing requests at a same timing in parallel, when said plurality of pieces of destination information does not designate a same processor unit.

19. A multiprocessor system according to claim 18, wherein said parallel transfer network comprises:

a crossbar switch.

20. A multiprocessor system according to claim 11, wherein said part of said plurality of processor units are ones discriminated by said stored processor unit caching status as ones which cache at least one data belonging to said first memory area.

21. A multiprocessor system according to claim 11, wherein said update logic comprises:

an update logic for updating a portion related to said each processor unit within said stored processor unit caching status, in response to said change of memory areas to which a plurality of valid data held in said cache memory belong, and for updating portions within said stored processor unit caching status, each related to caching in one of said part of said plurality of processor units, of data belonging to said first memory area, in response to said portions of said processor unit caching statuses as included in said plurality of responses.

22. A multiprocessor system according to claim 11, further comprising:

a cache status control logic for controlling a cache status of each of plurality pieces of data held in said cache memory;

wherein said cache status control logic responds to said cache status related to data of said first memory address included in each of said plurality of responses and determines a cache status of said first data in said each processor unit.

23. A multiprocessor system according to claim 11, wherein each of said plurality of memory areas caching statuses of which said caching status memory discriminates is one which can store at least one data which can be stored in said cache memory.

24. A multiprocessor system according to claim 11, wherein each of said plurality of memory areas caching statuses of which said caching status memory discriminates is one which can store plural pieces of data each of which can be stored in said cache memory.

25. A multiprocessor system according to claim 11,

wherein said caching status memory includes a plurality of entries;

wherein each entry is provided in correspondence to one of a plurality of memory areas;

wherein each entry has a plurality of traffic count fields respectively corresponding to said plurality of said processor units;

wherein said update logic includes a logic for controlling whether a traffic account in a field corresponding to each of said part of said plurality of processor units, within an entry corresponding to said first memory area, within said plurality of entries within said caching status memory, depending upon portions each of which has been transferred in response to said first cache coherence processing request from one of said part of said plurality of processor units by way of said interconnection network, and each of which is related to said first memory area and one of said part of said plurality of processor units.

26. A multiprocessor system according to claim 11, wherein said caching status memory is provided in a translation look aside buffer for translating a virtual address designated by said processor within said each processor unit into a real address.
Description
 


BACKGROUND OF THE INVENTION

The present invention relates to a tightly-coupled multiprocessor system which comprises plural processor units sharing a main memory and connected by an interconnection network.

In many prior art tightly-coupled multiprocessor systems, either a shared bus or a network which can transfer plural messages in parallel (a parallel transfer network) is used as the interconnection network which connects the processor units and the main memory shared by them. With a parallel transfer network, the cache directory method is known as one of the methods to maintain cache coherency between the processor units. For instance, see L. Censier and P. Feautrier, "A New Solution to Coherence Problems in Multicache Systems," IEEE Transactions on Computers, Vol. C-27, No. 12, pp. 1112 to 1118 (1978) (hereinafter referred to as the reference document 1).

In this method, a directory is used which collectively holds, for every area of cache line size in the main memory, the cache statuses of the data of that area in all processor units. Cache line transfer requests, invalidate requests, and the like are sent through the parallel transfer network only to the specific processor units designated by the directory. Therefore, there is an advantage that unnecessary coherent read requests are not sent to the caches of the other processor units.

In this method, however, cache miss latency becomes long, because data transfer is executed three times for one coherent read request. Concretely, a memory read request is sent from the processor unit requesting data to the main memory through the parallel transfer network. The main memory inspects the directory. When another processor unit has updated the data, the main memory issues a line transfer request to the cache in that other processor unit. That other processor unit then transfers the data to the request source processor unit according to the line transfer request.

Another cache coherency maintenance method is the snoop cache method. Refer, for instance, to Ben Catanzaro, "Multiprocessor System Architectures," Sun Microsystems, pp. 157 to 170, 1994 (hereinafter referred to as the reference document 2) or Don Anderson/Tom Shanley, "PENTIUM PROCESSOR SYSTEM ARCHITECTURE Second Edition," MINDSHARE, Inc., pp. 61 to 91, 1995 (hereinafter referred to as the reference document 3). In this method, each processor unit controls the cache status of the data held in its own cache. Coherency is maintained by communication between the processor unit requesting data and all other processor units.

There are various snoop methods, but a typical one is as follows. A processor unit which requests data sends a coherent read request to the shared bus. Each processor unit receives this coherent read request on the shared bus, checks the cache status of the data designated by the request, and notifies the request source processor unit of the status. The main memory transfers the data designated by the request to the request source processor unit. However, when one of the processor units has updated the data which the request designates, that processor unit transfers the data to the request source processor unit in place of the main memory.

Therefore, the snoop method is superior to the directory method in that read processing always completes in two transfers, that is, transfer of a coherent read request from the request source processor unit and transfer of the requested data from either the main memory or a processor unit. In the snoop method, however, coherent read requests are sent from all the processor units to the shared bus. Therefore, the busy rate of the shared bus increases when the number of processor units increases. Here, the busy rate is defined as a ratio of the number of requests effectively acceptable per unit time to the maximum number of requests acceptable per unit time. As a result, the wait time for arbitration of the shared bus increases, and so does the time until necessary data arrives at a processor unit, that is, the cache miss latency. Moreover, in this method even a cache which does not hold the shared data must respond to the coherent read request on the shared bus and search its cache tag. Therefore, the busy rate of the cache tag increases as well, and the cache miss latency increases further.
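
The transfer counts contrasted above can be tallied in a short sketch. It is only an illustration: the hop lists restate the sequences described in the text for the case where another processor unit has updated the data, and the names `DIRECTORY_UPDATED_READ`, `SNOOP_READ`, and `snoop_tag_lookups` are hypothetical.

```python
# Each tuple is (sender, receiver, message), one tuple per transfer.

# Directory method, when another processor unit has updated the data.
DIRECTORY_UPDATED_READ = [
    ("requester", "main memory", "memory read request"),
    ("main memory", "updating unit", "line transfer request"),
    ("updating unit", "requester", "data"),
]

# Snoop method: one request on the shared bus, one data transfer back.
SNOOP_READ = [
    ("requester", "all units via shared bus", "coherent read request"),
    ("main memory or updating unit", "requester", "data"),
]


def snoop_tag_lookups(num_units: int) -> int:
    """Cache tag searches caused by one snooped read.

    Every unit other than the requester must search its tag,
    even when it holds no copy of the data.
    """
    return num_units - 1


print(len(DIRECTORY_UPDATED_READ), len(SNOOP_READ), snoop_tag_lookups(8))
```

The snoop method saves a transfer per read, but the number of tag searches per request grows with the number of processor units, which is the cost the text identifies.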

In Japanese Laid Open Patent Application No. HEI 04-328653 or its corresponding U.S. Pat. No. 5,386,511 (hereinafter referred to as the reference document 4), a method is disclosed in which only the coherent read request is transferred by using the shared bus and other information such as the memory data is transferred by using an interconnection network which can transfer messages in parallel.

SUMMARY OF THE INVENTION

The traffic on the shared bus decreases in the prior art described in the reference document 4 compared with the other prior art described in the reference document 2 or 3. However, all the processors still send coherent read requests to the shared bus. Therefore, the problem remains in this prior art that the busy rate of the shared bus is large and the cache miss latency is large.

Therefore, each prior art method suffers from increased cache miss latency due to large traffic on the shared bus. This problem becomes more pronounced as more processors are used.

The object of the present invention is to provide a multiprocessor system which can reduce the traffic for maintenance of cache coherency on the interconnection network.

To achieve that object, in a preferred mode of a multiprocessor system according to the present invention, a caching status memory is provided in correspondence to each of plural processor units. The caching status memory stores a processor unit caching status for discriminating whether that processor unit and each of the other processor units hold data belonging to each of plural memory areas of the memory unit which constitutes the main memory shared by those processor units.

In correspondence to each processor unit, a logic is provided which responds to a memory access request from that processor unit and generates a first cache coherent processing request related to data of a first memory address designated by the memory access request, and a logic is provided which generates information designating the destination processor units of the cache coherent processing request. This destination information generation logic generates, based on the stored processor unit caching status, destination information designating the part of the processor units which hold at least one piece of data belonging to the first memory area to which that first memory address belongs. The interconnection network is composed of a network which responds to that destination information and transfers the first cache coherent processing request to that part of the processor units in parallel.
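
As a rough illustration of destination information generation, the following sketch looks up a per-area bitmap and falls back to all other units when no entry exists (the fallback follows claim 3). It is not the patented circuit: the one-bit-per-unit layout (as in claims 5 and 6), the 4096-byte area size, the exclusion of the requester's own bit, and all names are assumptions.

```python
from typing import Dict

AREA_SIZE = 4096  # assumed memory area size in bytes, larger than a cache line


def area_of(address: int) -> int:
    """Index of the memory area to which an address belongs."""
    return address // AREA_SIZE


def destination_bitmap(table: Dict[int, int], address: int,
                       self_id: int, num_units: int) -> int:
    """Return a bitmap with bit i set when unit i should receive the request.

    `table` maps an area index to a bitmap of units believed to cache
    data in that area (a hypothetical stand-in for the caching status memory).
    """
    own_bit = 1 << self_id
    all_others = ((1 << num_units) - 1) & ~own_bit
    entry = table.get(area_of(address))
    if entry is None:
        # No caching status stored for this area: designate every other unit.
        return all_others
    # Otherwise designate only units recorded as caching data in the area.
    return entry & ~own_bit


table = {1: 0b0110}  # units 1 and 2 are recorded as caching data in area 1
print(bin(destination_bitmap(table, 4096 + 8, self_id=1, num_units=4)))
```

With four units and unit 1 as the requester, only unit 2 is designated for an address in area 1, while an address in an unknown area is sent to units 0, 2, and 3.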

More concretely, in correspondence to each processor unit, a logic is provided which renews the portion of the stored processor unit caching status that concerns that processor unit itself, in accordance with changes in the memory areas to which the data held in the cache memory of the processor unit belong.
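
One way to detect such changes is to count the valid cached lines per memory area, as claim 12 suggests. The sketch below assumes that counting scheme; the class `OwnCachingStatus` and its method names are hypothetical, and it tracks only the unit's own portion of the caching status.

```python
from typing import Dict


class OwnCachingStatus:
    """Tracks whether this processor unit caches any data in each memory area."""

    def __init__(self) -> None:
        # area index -> number of valid cache lines belonging to that area
        self.valid_lines: Dict[int, int] = {}

    def line_filled(self, area: int) -> None:
        """A valid line belonging to `area` was stored in the cache."""
        self.valid_lines[area] = self.valid_lines.get(area, 0) + 1

    def line_invalidated(self, area: int) -> None:
        """A line belonging to `area` was replaced or invalidated."""
        remaining = self.valid_lines.get(area, 0) - 1
        if remaining <= 0:
            # The count reached 0: drop the entry so it can be reused.
            self.valid_lines.pop(area, None)
        else:
            self.valid_lines[area] = remaining

    def caches_area(self, area: int) -> bool:
        """The unit's own bit for `area`, as reported to other units."""
        return self.valid_lines.get(area, 0) > 0


status = OwnCachingStatus()
status.line_filled(5)
status.line_filled(5)
status.line_invalidated(5)
print(status.caches_area(5))
```

The own bit changes only when the count moves between zero and nonzero, so most cache fills and replacements leave the caching status untouched.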

In addition, in correspondence to each processor unit, a logic is provided which receives notification of the processor unit caching statuses from other processor units when that processor unit receives respective cache status notification from those other processor units after the processor unit sends the coherent processing request to those other processor units. The logic renews the processor unit caching status stored in that processor unit, based on the received processor unit caching statuses concerning those other processor units.
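
The renewal from responses can be sketched as a bitmap update. This is a minimal illustration under assumptions: each response is reduced to a boolean saying whether the responder caches any data in the area, the entry is a one-bit-per-unit bitmap, and the name `apply_responses` is hypothetical.

```python
from typing import Dict


def apply_responses(entry_bitmap: int, responses: Dict[int, bool]) -> int:
    """Renew the bits of the units that responded to a coherent request.

    `responses` maps a responding unit's id to whether that unit reports
    caching any data in the memory area. Bits of units that did not
    respond are left unchanged.
    """
    for unit_id, caches_area in responses.items():
        if caches_area:
            entry_bitmap |= 1 << unit_id
        else:
            entry_bitmap &= ~(1 << unit_id)
    return entry_bitmap


# Units 1, 2, and 3 were asked; only unit 2 still caches data in the area.
print(bin(apply_responses(0b1110, {1: False, 2: True, 3: False})))
```

After this update, the next request for the same area would be sent to unit 2 only, which is how the destinations narrow over time.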

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a schematic block diagram of a multiprocessor system according to the present invention.

FIG. 2 is a schematic block diagram of a cache memory used in the apparatus of FIG. 1.

FIG. 3 is a schematic block diagram of a portion of a cache control unit used in the apparatus of FIG. 1.

FIG. 4 is a schematic block diagram of the other portion of the cache control unit used in the apparatus of FIG. 1.

FIG. 5 is a schematic block diagram of a multicast table used in the apparatus of FIG. 1.

FIG. 6 is a schematic block diagram of a multicast table control unit used in the apparatus of FIG. 1.

FIG. 7 is a schematic block diagram of a switch logic used in the apparatus of FIG. 1.

FIG. 8 is a schematic block diagram of a multicast transfer control unit used in the apparatus of FIG. 1.

FIG. 9 is an overall structure of another multiprocessor system according to the present invention.

FIG. 10 is a schematic block diagram of TLB used in the apparatus of FIG. 9.

FIG. 11 is a schematic block diagram of a traffic count control unit used in the apparatus of FIG. 9.

FIG. 12 is a format of a page table entry used in the apparatus of FIG. 9.

FIG. 13 is a format of a process control block used in the apparatus of FIG. 9.

DESCRIPTION OF EMBODIMENTS

Hereafter, the multiprocessor system according to the present invention will be explained in more detail by referring to several embodiments shown in the drawings. The same or like numerals represent the same or like elements. For the second and later embodiments, only the differences from the first embodiment will be explained.

Embodiment 1

(1) Structure of the Apparatus

FIG. 1 shows an overall structure of a multiprocessor system according to the present invention. It comprises plural processor units 10-1 to 10-n (n is an integer larger than two), plural memory units 700-1 to 700-m (m is an integer larger than two), units (not shown) which contain peripheral devices such as input-output devices and so on, and an interconnection network 50 which connects those units.

Each memory unit comprises a main memory module 70-1, ..., or 70-m which constitutes a different portion of the main memory 70, which holds programs and data. Each main memory module 70-1, ..., or 70-m is connected to the interconnection network 50 by way of a corresponding network interface unit 750-1, ..., or 750-m.

Each of the processor units 10-1 to 10-n comprises a CPU core 100 which sequentially reads out program instructions from the main memory 70 and sequentially executes them, a translation look aside buffer (TLB) 200 which translates a virtual address, designated at execution of an instruction accessing the main memory 70, into a real address, a cache memory 300 which stores copies of portions of the main memory 70, and a cache control circuit 30 which controls the cache memory 300. The cache control circuit 30 has a cache control unit 350, a multicast table 400 characteristic of the present embodiment, and a multicast table control unit 450 which controls this table.

The present embodiment adopts the snoop method as the method of maintaining cache coherency, but uses a parallel data transfer network, more concretely, a crossbar switch, as the interconnection network 50. The interconnection network 50 is used for transfer of not only memory data but also coherent read requests, and can transfer these different pieces of information in parallel. In addition, the interconnection network 50 can execute a one-to-one communication mode which transfers a message inputted to one input port to a specific output port which the message designates, and a partial broadcast mode or a multicast mode which transfers the message to plural output ports which the message designates among all the output ports.

When a coherent request concerning data is transferred from a processor unit to other processor units according to the snoop method in the present embodiment, the request is not transferred to all other processor units, but is multicast only to those processor units which are likely to have a copy of the data and to one of the memory units. The multicast table 400 is a memory which stores information identifying processor units which are likely to have cached the data of the address which the coherent request designates. The multicast table control unit 450 uses this information to generate information designating the processor units to which said coherent request is to be transferred. Only those processor units transfer the cache status of the data in the respective processor units to the request source processor unit, as a response to the coherent request.

Therefore, the total number of coherent requests transferred onto the interconnection network 50 is smaller than when each coherent request is transferred to all other processor units. The responses to the coherent request decrease, too, so the traffic on the interconnection network 50 decreases. Because fewer responses arrive at the request source processor unit, the processing in the request source processor unit also decreases. Finally, because processor units to which the coherent request has not been transferred are not requested to process a coherent request that would have been useless to them, useless processing in those processor units decreases.

(2) Outline of Operation

In the following, the system operation for memory read/store requests from CPU core 100 will be outlined separately for the cases of (a) cache hit and (b) cache miss. In the present embodiment, the cache status is one of four statuses: modified (M), exclusive (E), shared (S), and invalid (I). For the concrete transitions among these cache statuses, refer to reference document 3, for instance. The present embodiment, however, differs from that reference document in the following point. When a read request from another processor unit hits a line of the unit's own cache that is in the M status, the next status of the line is set not to S but to I, and no write back to the main memory 70 is executed.
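The status change on a read request from another processor unit can be sketched as follows. This is only an illustrative sketch: the function name and the returned tuple are hypothetical, and the E and S cases follow conventional MESI behavior as an assumption, since the text here defers those transitions to reference document 3. Only the M case is taken from the description above.

```python
def snoop_read_hit(status):
    """Return (next_status, write_back_to_main_memory) for a line that is
    hit by a read request from another processor unit.

    Hypothetical helper for illustration only.
    """
    if status == "M":
        # From the description: M goes to I (not S), and no write back
        # to the main memory 70 is executed.
        return ("I", False)
    if status in ("E", "S"):
        # Assumption: conventional MESI behavior for a plain read.
        return ("S", False)
    # An invalid line does not hit; it stays invalid.
    return ("I", False)
```

For example, snoop_read_hit("M") yields ("I", False), whereas a conventional MESI cache would move the line to S and write it back.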

(a) Cache hit

(a1) Memory load instruction

When the cache control unit 350 judges, based on the condition which will be explained later, that a memory load instruction issued by CPU core 100 in the processor unit 10-i (i is any one of the integers 1 to n, with the same being true below) hits the cache, the cache control unit 350 reads the data designated by this instruction from the cache memory 300 and sends it to CPU core 100 by way of line 1001. The cache status of this data does not change in this case.

(a2) Memory store instruction

When the cache control unit 350 judges a cache hit for a memory store instruction which CPU core 100 in the processor unit 10-i issued, the cache control unit 350 overwrites the original data in the cache memory 300 with the store data sent from CPU core 100 by way of line 1001. However, this rewrite is executed only when the cache status of the data which the instruction designates is the M or E status. In addition, when the original cache status ST of this data is E, the cache control unit 350 changes the cache status from E to M.
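The store-hit handling just described can be sketched as below. The dictionary layout of a cache line and the function name are assumptions made for illustration; they are not taken from the patent.

```python
def store_on_hit(line, store_data):
    """Apply a store to a cache line, modeled as a dict with the
    hypothetical keys "status" and "data".

    Returns True if the line was rewritten, False if the status does not
    permit a local rewrite (the store is then handled like a miss).
    """
    if line["status"] not in ("M", "E"):
        return False
    line["data"] = store_data
    # E becomes M after the rewrite; M stays M.
    line["status"] = "M"
    return True
```

A line in the S status is left untouched by this sketch, which matches the later statement that a store to a line in a status other than M or E is processed in almost the same way as a read miss.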

(b) Cache miss

When the cache control unit 350 judges a cache miss for the memory load instruction issued by CPU core 100 in the processor unit 10-i, the multicast table control unit 450 generates destination information designating the destination units to which the coherent read request is to be sent, by using the multicast table 400. The destination units designated by this destination information are those of the processor units 10-1 to 10-n which are likely to have cached data belonging to the memory area to which the address designated by this memory load instruction belongs, and the one of the memory units 700-1 to 700-m which contains the main memory module 70-k (k is one of 1 to m) to which the memory address designated by this instruction is allocated by the interleave method.
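A minimal sketch of this destination generation is given below. Several details are assumptions, not statements of the patent: the area size (AREA_SIZE), the interleave granularity (one 32 B cache line per memory unit in round-robin order), the representation of the multicast table as a dictionary from area number to a set of processor unit numbers, and the fallback to all other processor units when an area has no entry.

```python
AREA_SIZE = 4096   # assumed size of one memory area (larger than a cache line)
LINE_SIZE = 32     # assumed interleave granularity in bytes

def destinations(table, real_address, source, n, m):
    """Return (processor_units, memory_unit) for a coherent read request.

    table  : dict mapping area number -> set of processor unit numbers
             likely to cache data of that area (hypothetical layout)
    source : number of the requesting processor unit (1..n)
    n, m   : number of processor units and memory units
    """
    area = real_address // AREA_SIZE
    all_others = set(range(1, n + 1)) - {source}
    # Assumption: with no entry for the area, fall back to all other units.
    likely = table.get(area, all_others)
    processor_units = sorted(set(likely) - {source})
    # Assumption: consecutive cache lines are interleaved across memory units.
    memory_unit = (real_address // LINE_SIZE) % m + 1
    return processor_units, memory_unit
```

With table = {0: {2, 5}}, a request from unit 1 for address 0x40 in a system of 8 processor units and 4 memory units is directed to processor units 2 and 5 and to memory unit 3, instead of to all seven other processor units.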

The cache control unit 350 sends the interconnection network 50 a partial broadcast message which contains the coherent read request and the destination information. The interconnection network 50 transfers this message to these destination units in parallel. In the processor unit 10-j (j is one of 1 to n) which has received the coherent read request, the cache control unit 350 reads the cache status of the data designated by this request from the cache memory 300, and transfers, as a response, a coherency response message which contains this cache status, to the source processor unit 10-i by way of the interconnection network 50. When the cache status of the data designated by this request is the M status, the cache control unit 350 reads the requested data from the cache memory 300 and sends the source processor unit 10-i a message which contains this data, by way of the interconnection network 50.

Moreover, the processor unit 10-j changes the cache status according to the kind of the coherent read request under control of the cache control unit 350. A detailed operation of the change of the cache status will be described later. Moreover, the request source processor unit 10-i stores the data corresponding to the coherent read request, transferred from the memory unit 700-k, into the data portion 330 of the cache memory 300. However, when the data corresponding to the coherent read request is sent from a processor unit 10-j, the request source processor unit 10-i stores this data in the cache memory 300, and does not write the data transferred from the above-mentioned memory unit 700-k in the cache memory 300.
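The choice between the data from the memory unit and the data from another processor unit can be sketched as follows. The function name and the shape of the responses list are hypothetical; the sketch only reflects the rule that data supplied by a processor unit, which happens when that unit holds the line in the M status, takes precedence over the data from the memory unit.

```python
def select_fill_data(memory_data, responses):
    """Pick the data to store in the requester's cache memory.

    memory_data : data transferred from the memory unit
    responses   : list of (cache_status, data_or_None) tuples, one per
                  responding processor unit (hypothetical representation)
    """
    for status, data in responses:
        if status == "M" and data is not None:
            # A processor unit supplied the line; the copy from the
            # memory unit is stale and is not written to the cache.
            return data
    return memory_data
```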

When the cache status of the data of the address designated by the memory store instruction issued by CPU core 100 in the processor unit 10-i is other than M or E status, processing is executed which is almost similar to the processing at read miss. Moreover, the data written in the cache memory 300 by the memory store instruction is reflected into the main memory 70 by write back at the time of replacement of the cache line or by compulsory flush of the cache line. The explanation is omitted concerning the data transfer at the write back and at the flush, because it does not differ from the prior art.

(3) Details of the Operations

In the following, details of the above-mentioned operation at cache miss will be explained, focusing on the operation of the multicast table 400 and the multicast table control unit 450.

(3A) Hit check of the cache memory 300

The processing which generates the coherent read request by cache miss will be explained first. On a certain processor unit 10-i, CPU core 100, which has executed a memory access instruction, transfers the virtual address VA designated by this instruction to TLB 200 by way of line 1000. TLB 200 translates the virtual address VA to the real address RA, and sends the obtained real address to the cache control unit 350 through line 2000.

In FIG. 2, the cache memory 300 comprises the data portion 330 which holds data of plural cache lines, the tag portion 310 which holds a tag for the address of each cache line, and the status portion 320 which holds a cache status ST of each cache line. The cache memory 300 is a store-in cache. In addition, the cache memory 300 is a real address cache which holds a real address in the cache tag portion 310. However, the cache memory 300 may be a virtual address cache, if the virtual address can be used in cache coherency control. Moreover, the cache memory 300 is constructed by the direct map method, but other general constructing methods, that is, the set associative method or the fully associative method, can be used.

It is presumed, for instance, that the capacity of the cache memory 300 is 1 MB, the cache line size is 32 B, and the real address RA is 32 bits. In this case, the upper 0th to 11th bits of the real address are stored in the cache tag. The 12th to 26th bits of the real address RA are used as the index portion RA.sub.-- CI to access the cache memory 300. The selector 150 receives, through line 2000, the real address RA obtained by the address translation in TLB 200.
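The field widths in this example can be checked with a short sketch. It assumes that bit 0 denotes the most significant bit of the 32-bit real address, which is what "upper 0th to 11th bits" suggests, and that the remaining 27th to 31st bits select a byte within the 32 B line; the text does not name that last field, so "offset" is a label chosen here. With 1 MB / 32 B = 32768 lines, the index needs 15 bits, which matches bits 12 to 26.

```python
def split_real_address(ra):
    """Split a 32-bit real address into (tag, index, offset).

    Bit numbering assumption: bit 0 is the most significant bit.
      tag    = bits 0..11   (12 bits)
      index  = bits 12..26  (15 bits, 32768 cache lines)
      offset = bits 27..31  (5 bits, byte within a 32 B line)
    """
    if not 0 <= ra < 2**32:
        raise ValueError("real address must fit in 32 bits")
    tag = ra >> 20
    index = (ra >> 5) & 0x7FFF
    offset = ra & 0x1F
    return tag, index, offset
```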

The selector 150 selects the real address on line 2000, and sets it in the register 201, when neither the signal 3565, which is supplied by the receive control logic 356 (FIG. 3) and indicates whether data has been received from other units, nor the signal 4581, which is supplied by the line invalidate control logic 458 (FIG. 6), is asserted. The index portion of the selected address RA, that is, the 12th to 26th bits thereof, is used to access the cache tag portion 310 and the status portion 320, and the tag and the cache status ST of the cache line corresponding to the real address RA are sent to the cache control unit 350 by way of lines 3100 and 3206. Similarly, the data portion 330 is accessed by using the index portion RA.sub.-- CI of the real address RA, and the corresponding cache data is sent to the cache control unit 350 by way of line 3300.

In FIG. 3, the comparator 351 in the cache control unit 350 compares the read out tag and the tag portion RA.sub.-- CT of the real address RA, and sends the result to the hit detect logic 352 by way of line 3510. The kind of the memory access instruction is transferred from CPU core 100 to the hit detect logic 352 by way of line 1002, in addition to the comparison result, and the cache status ST previously read out is further transferred by way of line 3206. The signal 3565 indicative of whether data has been received from other units is further supplied from the receive control logic 356.

The hit detect logic 352 judges hit/miss of the cache, based on the three pieces of information described first. This logic judges as follows, when the signal 3565 from the receive control logic 356 is not asserted. It judges cache hit when the comparison result by the comparator 351 shows agreement and the cache status ST is not I, in case the kind of the memory access instruction is load. It judges cache hit when the comparison result by the comparator 351 shows agreement and the cache status ST is E or M, in case the kind of the memory access instruction is store. When the hit detect logic 352 judges cache hit, it turns on the gate 372 (FIG. 2) by line 3520. Thus, the data read out previously onto line 3300 from the cache memory 300 is sent to CPU core 100 by way of line 1001.
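The hit condition stated above, for the case in which the signal 3565 is not asserted, can be written as a small predicate. The function and parameter names are hypothetical; the conditions themselves are those given in the text.

```python
def is_cache_hit(tag_match, status, kind):
    """Hit judgment of the hit detect logic when signal 3565 is not asserted.

    tag_match : True if the comparator reports that the tags agree
    status    : cache status ST, one of "M", "E", "S", "I"
    kind      : kind of the memory access instruction, "load" or "store"
    """
    if kind == "load":
        # A load hits on any valid line whose tag agrees.
        return tag_match and status != "I"
    if kind == "store":
        # A store hits only on a line held in the E or M status.
        return tag_match and status in ("E", "M")
    raise ValueError("kind must be 'load' or 'store'")
```

Note that a store to a line in the S status is judged a miss even though the tag agrees, because the other copies must first be dealt with through a coherent request.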

Moreover, to the cache status control logic 354, the kind of the memory access instruction is transferred from CPU core 100 by way of line 1002, the cache status ST previously read out is transferred by way of line 3206, and the hit check result is transferred throug