WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
Store buffer apparatus in a multiprocessor system    
United States Patent5202972   
Link to this pagehttp://www.wikipatents.com/5202972.html
Inventor(s)Gusefski; Richard J. (Endicott, NY); Lei; Chon I. (Endwell, NY); Ramirez; Alfonso (Endicott, NY)
AbstractA multiprocessor computer system has a common master storage and data/instruction caches for each processor. Separate buffer storage units for each processor read data from the processor in parallel with the caches, but write data to the master storage sequentially, as directed by a controller for the master storage.



 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 5202972
Store buffer apparatus in a multiprocessor system - US Patent 5202972 Drawing
Store buffer apparatus in a multiprocessor system
Inventor     Gusefski; Richard J. (Endicott, NY); Lei; Chon I. (Endwell, NY); Ramirez; Alfonso (Endicott, NY)
Owner/Assignee     International Business Machines Corporation (Armonk, NY)
Patent assignment
All assignments
Publication Date     April 13, 1993
Application Number     07/725,919
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     July 3, 1991
US Classification     711/123 711/120
Int'l Classification     G06F 013/00 G06F 009/00 G06F 015/16
Examiner     Eng; David Y.
Assistant Examiner    
Attorney/Law Firm     Samodovitz; Arthur J. Garnett; Pryor A. , Romney; David S. ,
Address
Parent Case     This application is a continuation-in-part of application Ser. No. 07/291,805, filed Dec. 29, 1988 now abandoned.
Priority Data    
USPTO Field of Search     395/425 364/DIG. 1 364/DIG. 2
Patent Tags     store buffer multiprocessor
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
3735360
Connor
379/3
Feb,2006

[0 after 0 votes]
4622631
Frank
707/201
Nov,1986

[0 after 0 votes]
4520441
Bandoh
711/203
May,1985

[0 after 0 votes]
4503497
Krygowski
711/124
Mar,1985

[0 after 0 votes]
4484267
Fletcher
711/124
Nov,1984

[0 after 0 votes]
4466059
Bastian
711/122
Aug,1984

[0 after 0 votes]
4464712
Fletcher
711/122
Aug,1984

[0 after 0 votes]
4445174
Fletcher
711/121
Apr,1984

[0 after 0 votes]
4443846
Adcock
710/307
Apr,1984

[0 after 0 votes]
4442487
Fletcher
711/122
Apr,1984

[0 after 0 votes]
4399503
Hawley
711/113
Aug,1983

[0 after 0 votes]
4394731
Flusche
711/145
Jul,1983

[0 after 0 votes]
4392200
Arulpragasam
711/140
Jul,1983

[0 after 0 votes]
4298929
Capozzi
711/201
Nov,1981

[0 after 0 votes]
4164787
Aranguren
713/600
Aug,1979

[0 after 0 votes]
3967247
Andersen
711/136
Jun,1976

[0 after 0 votes]
3938097
Niguette, III
711/128
Feb,1976

[0 after 0 votes]
3896419
Lange
711/129
Jul,1975

[0 after 0 votes]
3845474
Lange
711/119
Oct,1974

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


Having described a preferred embodiment thereof, we claim as our invention:

1. A cache system for transferring data to and from a processor, said system comprising:

a level two (L2) storage unit;

a level one (L1) cache coupled to said processor and to said L2 storage unit;

first means for writing data from said L2 storage unit to said L1 cache, and writing said data from said L1 cache to said processor;

a level one (L1) storage unit coupled to said processor and said L2 storage unit;

second means, responsive to a write request by said processor, for writing data from said processor to said L1 storage unit without first requiring contents of said L2 storage unit from an address associated with said write request be stored in said L1 cache or said L1 storage unit if said address is not currently represented in said L1 cache or said L1 storage unit, and subsequently writing said data from said L1 storage unit to said L2 storage unit, said second mean also writing said data to said L1 cache if said address is currently represented in said L1 cache; and

status means, coupled t said L1 storage unit and responsive to a read request by said processor, for determining if said read request requires data that is stored in said L1 storage unit but has not yet been written from said L1 storage unit to said L2 storage unit, and

if so, delaying said read request until the second writing means writes said data from said L1 storage unit to said L2 storage unit, and then causing said processor to a access said at a from said L1 cache if written in said L1 cache by said second means, without writing said data from said L2 storage unit to said L1 cache,

if not, without said delay, causing said first means to write said data form said L1 cache to said processor if said data resides in said L1 cache or causing said first means to write said data from said L2 storage unit to said L1 cache and then from said L1 cache to said processor if said data did not reside in said L1 cache when said processor made said read request.

2. The system of claim 1 further comprising:

third means, coupled to said L2 storage unit and said status means, for controlling access to said L2 storage unit by other processors and other associated cache systems such that access to an address is granted when said L2 storage unit contains the latest data for said address but access is denied when said L1 storage unit but not said L2 storage unit contains the latest data for said address.

3. The system of claim 1 further comprising:

another L1 cache coupled to another processor and said L2 storage unit; and

third means for writing data from said L2 storage unit to said other L1 cache, and writing said data from said other L1 cache to said other processor; and

another L1 storage unit coupled to said other processor and said L2 storage unit; and

fourth means, responsive to a writ request by said other processor, for writing dat from said other processor to said other L1 storage unit and subsequently writing said data from said other L1 storage unit to said L2 storage unit; and wherein

said second means writes data form the associated L1 storage unit to said L2 storage unit when said first means, said third means or said fourth means is not writing to or reading from said L2 storage unit.

4. The system of claim 1 wherein said L2 storage unit is a second level cache.

5. The system of claim 4 wherein said second level cache is coupled to a main storage.

6. The system of claim 1 wherein said L1 storage unit is part of said L1 cache.

7. The system of claim 1 wherein said L2 storage unit is main storage.
 Description Submit all comments and votes
 


CROSS REFERENCE TO RELATED APPLICATIONS

This application incorporates by reference the disclosures of copending application Ser. No. 291,888 entitled "Cache Bypass Apparatus" filed on Dec. 29, 1988 concurrently now abandoned, herewith, copending and of application Ser. No. 159,016, entitled "Store Queue For A Tightly Coupled Multiple Processor Configuration With A Two-Level Cache Buffer Storage", filed Feb. 22, 1988 now U.S. Pat. No. 5,023,776.

BACKGROUND OF THE INVENTION

The present invention pertains to computing systems, and more particularly to a store buffer apparatus in an improved, high performance, multiprocessor or uniprocessor computer system.

In the design and development of computer systems, increasing emphasis is being placed on performance of such systems. The performance is very often a function of the technology used in manufacturing the integrated circuit chips which comprise the computer system. One such technology, new in the development of computer systems, is Complementary Metal Oxide Semiconductor (CMOS) technology. CMOS technology provides a greater degree of reliability, serviceability, and availability than seen before in prior computer systems, due mostly to a reduction in the physical number of chips which comprise the computer system. Since a scarcity of input/output pins on chips has been a problem with prior computer systems, a reduction in the number of chips reduces the number of interconnections (input/output pins) between chips In addition, performance may also be a function of the number of processors which comprise the computer system.

In such processors, master storage facilities store data needed by the central processor units (CPU). In a multiprocessor system, each processor often included a small cache memory in addition to the master storage facilities. The cache was used by a particular processor to store blocks of instructions or data. If the data or instructions were needed by the processor, a particular item of data or an instruction was retrieved from the cache in lieu of the master storage facility, since the time required to perform a fetch from cache is much smaller than the time required to fetch from master storage. However, if a particular processor retrieved data and/or instructions from cache and executed the instructions and/or operated on such data yielding a set of results which required storage in the master storage facility, it is necessary that the particular processor store the set of results directly to the master storage, not via the cache. However, if a first CPU of the multiprocessor system is using the master storage facility, a second CPU cannot store the set of results in the master storage facility until the first CPU is finished using the facility.

In a uniprocessor mode, the master storage facility can also be tied up by the channel facilities. This way the processor itself can not store the result back into the master storage facility until the master storage is freed up by the channel facilities Thus the processor can not execute the next instruction until the channel facilities finish using the master storage facility. Therefore, the execution of another instruction in the second CPU is delayed until the master storage facility is freed up by the first CPU or by the channel facilities.

Furthermore in a pipeline machine, the processor has the capability to store results back into the master storage facility every machine cycle. Usually the rate of storing data into the master storage facility is much slower than the rate the processor generates the data to be stored away. Hence the processor has to stop executing instructions while the master storage facility is still busy storing the result for the current instruction. In other words, the master storage facility can not keep up with the processor.

SUMMARY OF THE INVENTION

It is an object of the present invention to introduce an improved computer system which provides a high level of performance due to its use of a store buffer for temporarily holding data or instructions from the current CPU when the master storage facility is being utilized by the current CPU, or by another CPU, or by the channel facilities, thereby freeing up the current CPU for execution of other instructions.

In a multiprocessor system including at least two processors and a main memory, each processor includes a cache and a separate and distinct store buffer facility in accordance with the present invention. If one processor is using the master storage facility when the other processor seeks to store data in the master storage facility, in a multiprocessor system without a store buffer, the other processor is forced into a hold condition until such time that the one processor completes its use of master storage. When the one processor completes its use of the memory, the other processor may then begin the store operation. Valuable time is lost during the hold condition. However, in accordance with the present invention, a store buffer is incorporated into each processor in addition to its own cache. Therefore, when the other processor attempts to store data in master storage when master storage is being used by the one processor (or by the other processor or by the channel facilities), the other processor stores the data in its store buffer facility simultaneously with storage of the data in its cache. This store buffer acts to temporarily hold the data during the time when the master storage is being tied up by other resources. When the master storage is free, the data in the store buffer of the other processor is transmitted to master storage. The store buffer of each processor contains eight entries. Each entry includes an address portion comprising an effective address and an absolute address, a status bit portion comprising a plurality of status bits, and a data portion including data and a plurality of write flags. The status bits are used to determine if the data stored in an entry of the store buffer pertains to a current instruction or a previous instruction and whether or not the data has already been stored in main memory. Other status bits indicate whether the data in a particular entry pertains to a sequential store or a non-sequential store and indicate how many data entries pertain to one particular instruction. Whenever data is stored in a store buffer of a processor, all the status bits are set accordingly, that is, one bit is set to indicate if the data is associated with a present or a previous instruction, one bit is set to indicate whether the data has already been stored in master storage, and one bit is set to indicate if the store in the store buffer is made during sequential or non-sequential store mode. When conditions change within the processor, the status bits associated with a particular data entry in a store buffer are changed again. The store buffer is used during sequential and non-sequential store mode, during fetch conflicts for data and instructions, when a merge operation is required, when the contents of a store buffer is dumped into master storage, and during instruction retry situations.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a uniprocessor computer system;

FIG. 2 illustrates a triadic computer system;

FIG. 3 illustrates a detailed construction of the I/D Caches 9 (L1), the I-unit, E-unit, and Control Store (C/S) illustrated in FIGS. 1 and 2;

FIG. 4 illustrates a further configuration of the I-unit, E-unit, and Control Store (C/S) of FIGS. 1-3, further including a store buffer apparatus of the present invention;

FIG. 5 illustrates the contents of the store buffer of FIG. 4;

FIG. 6 illustrates a timing chart of a fetch conflict;

FIG. 7 illustrates a timing chart depicting an L1 cache miss with store buffer data merge; and

FIG. 8 illustrates the data portion of the store buffer.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

FIG. 1 illustrates a prior-art uniprocessor computer system which may serve as an environment for the present invention.

In FIG. 1, the uniprocessor system comprises a main-storage or L3 memory 10 connected to a storage controller (SCL) 12. On one end, the storage controller 12 is connected to integrated I/0 subsystem controls 14, the controls 14 being connected to integrated adapters and single card channels 16. On the other end, the storage controller 12 is connected via storage bus 17 to a cache storage unit 18, which comprises an instruction cache and a data cache, collectively termed the L1 cache. The I/D caches 18 are connected to an instruction unit (I-unit), Execution unit (E-unit), and control store of a processor 20. The processor 20 is also connected to the storage controller 12 through the address/command (AD/CMD or A/C) bus 30. The uniprocessor system of FIG. 1 also includes the multisystem channel communication unit 24.

The L3 memory 10 comprises 2 intelligent memory cards. The cards are intelligent due to the existence of certain specific features: error checking and correction, extended error checking and correction (ECC) refresh address registers and counters, and bit spare capability. The interface to the L3 memory 10 is 8 bytes wide. Memory sizes are 8, 16, 32, and 64 megabytes. The L3 memory is connected to a storage controller (SCL) 12.

The storage controller 12 comprises three conventional bus arbiters for granting access to the L3 memory 10, to the I/0 subsystem controls 14, and to the I/D caches 18. The storage controller further includes a directory which is used to determine whether to invalidate the lines of the instruction and data caches of the L1 cache 18 for data. If the desired data is located in the L1 caches 18 but the data is obsolete, the storage controller 12 invalidates the obsolete data in the L1 caches 18, thereby allowing the I/0 subsystem controls 14 to update the data in the L3 memory 10. Thereafter, instruction and execution units 20 must obtain the updated data from the L3 memory 10. The storage controller 12 further includes a plurality of buffers for buffering data being input to L3 memory 10 from the I/0 subsystem controls 14 and for buffering data being input to L3 memory 10 from instruction/execution units 20. The buffer associated with the instruction/execution units 20 is a 256 byte line buffer which allows the building of entries 8 bytes at a time for certain types of instructions, such as sequential operations. This line buffer, when filled by all the data from one instruction, will cause a block transfer of data to L3 memory to occur. Therefore, memory operations are reduced from a number of individual store operations to a much smaller number of line transfers.

The instruction cache/data caches of unit 18 are each 16K byte caches. The storage bus 17 to the storage controller 12 is 8 bytes wide; thus, an in-page operation from the storage controller 12 takes 8 data transfer cycles. The data cache 18 is a "store through" cache, which means that data from the instruction/execution units 20 are stored in L3 memory and, if the corresponding obsolete data is not present in the L1 caches 18, the data is not brought into and stored in the L1 caches. To assist this operation, a "store buffer" is present with the L1 data cache 18 which is capable of buffering up to 8 store operations.

The integrated I/0 subsystem 14 is connected to the storage controller 12 via an 8-byte bus. The storage controller 12 comprises three 64-byte buffers used to synchronize data coming from the integrated I/0 subsystem 14. That is, the instruction/execution unit 20 and the I/0 subsystem 14 operate on different clocks, the synchronization of the two clocks being achieved by the three 64-byte buffer structure.

The multisystem channel communication unit 24 is a 4-port channel to channel adapter, packaged externally to the system.

The uniprocessor system of FIG. 1 operates as follows.

Normally, instructions are resident in the instruction cache (L1 cache) 18, waiting to be executed. The instruction/execution unit 20 searches a directory within the L1 cache 18 to determine if the next instruction is stored therein. If the instruction is not stored in the L1 cache 18, the instruction/execution unit 20 will generate a storage request to the storage controller 12. Both the absolute address of the instruction and the in-page request will be provided to the storage controller 12 through the AD/CMD bus 30. The storage controller 12 will arbitrate (according to a conventional algorithm) for access to the bus connected to the L3 memory 10. Eventually, the in-page request and the absolute address from the instruction/execution unit 20 will be passed to the L3 memory 10, the request comprising a command indicating that an instruction in L3 memory is to be fetched for transfer to the instruction/execution unit 20. The L3 memory will latch the request and the absolute address, decode the request, and select the location in the memory card wherein the instruction is stored. After a few cycles of delay, the instruction will be delivered to the storage controller 12 from the L3 memory in 8-byte segments. The instruction is then transmitted from the storage controller 12 through storage bus 17 to the instruction cache (L1 cache) 18, wherein it is temporarily stored. The instruction is then transmitted from the instruction cache 18 to the instruction register within the instruction/execution unit 20. The instruction is decoded via a conventional decoder within the instruction unit 20. Quite often, an operand from memory 10 is needed in order to execute the instruction. The instruction/execution unit 20 searches the directory in the data cache 18; if the operand is not found in the directory of the data cache 18, another storage access is issued by the instruction/execution unit 20 to access the L3 memory 10 via AD/CMD bus 30, exactly as described above with respect to the instruction cache. The operand is eventually sent back to the instruction/execution 20 for execution while at the same time it is stored back into the data cache 18. If the instruction/execution unit 20 decodes an I/0 instruction in the instruction cache 18, information is stored in an auxiliary portion of L3 memory 10, sectioned off from instruction execution. The instruction/execution unit 20 informs the integrated I/0 subsystem 14 that such information is stored in L3 memory, and the subsystem 14 processors accesses the L3 memory 10 to fetch the information.

FIG. 2 shows a triadic (multiprocessor) system. Two L3 memories 10a,10b are connected to a bus switching unit (BSU) 26, the BSU including an L2 cache 27. The BSU 26 is connected to the integrated I/0 subsystem 14, to shared channel processors 28, and to three processor units: a first processor unit including instruction/execution units/control store 20a, a second processor including instruction/execution units/control store 20b, and a third processor including instruction/execution units/control store 20c Each processor unit has an associated instruction/data L1 cache 18a, 18b, 18c. The cache in the BSU 26 is termed the L2 cache 27, and the main memory 10a/10b is termed the L3 memory.

The BSU 26 connects the three L1 caches 18a, 18b, 18c of the three processors 20a, 20b, 20c, two L3 memory ports 10a,10b, two shared channel processors 28, and an integrated I/0 subsystem 14. The BSU 26 comprises circuits which decide the priority for requests to be handled, such as storage requests from each of the three processors 20a-20c, requests from the I/0 subsystem 14 and shared channel processors 28, from circuits which operate the interfaces, and from circuits to access the L2 cache 27. The L2 cache 27 is a "store in" cache: when operations occur that access the L2 cache for data, and when that data is modified during the course of the operation, the data which is resident in the L2 cache must also be modified accordingly. (The only exception is that, if the operation originates from the I/0 subsystem 14, and if the data is resident only in L3 memory 10a,10b and not in L2 cache 27, the data is modified only in L3 memory, and not in L2 cache).

The interface between the BSU 26 and L3 memories 10a,10b comprises two 16-byte lines in lieu of the single 8-byte port in FIG. 1. However, the memory 10 of FIG. 1 is identical to the memory cards 10a,10b of FIG. 2. The two memory cards 10a,10b of FIG. 2 are accessed in parallel.

The shared-channel processor 28 is connected to the BSU 26 via two 8-byte ports. The shared channel processor 28 is operated at a frequency which is independent of the BSU 26; the clocks within the BSU are synchronized with the clocks in the shared channel processor 28 in a manner which is similar to the clock synchronization between the storage controller 12 and the integrated I/0 subsystem 14 of FIG. 1.

The multi-processor computer system of FIG. 2 operates in the following manner.

Assume that a particular instruction/execution unit, say 20a requires an instruction and searches its associated L1 cache, 18a for the desired instruction. Assume further that the desired instruction is not resident in the L1 cache. The instruction execution unit will then request access to the BSU 26 via AD/CMD bus 30a to search the directory of the L2 cache 27. The BSU 26 contains a conventional arbiter which receives requests from each of the instruction/execution units 20a, 20b, 20c and from the shared channel processor 28 and from the integrated I/0 subsystem 14, granting access to only one of these units at a time. When the particular instruction/execution unit 20a is granted access to the BSU to search the L2 cache 27, the BSU 26 searches the directory of the L2 cache 27 for the desired instruction. If the instruction is found in the L2 cache, it is returned to the instruction/execution unit 20a through storage bus 17a. If the L2 cache directory indicates that the desired instruction is not located within the L2 cache, a request is made to the L3 memory 10a or 10b for the desired instruction. If the desired instruction is located in the L3 memory, it is immediately transmitted to the BSU 26, 16 bytes at a time, and is bypassed to the requesting instruction/execution unit via storage bus 17a, while simultaneously being stored in the L2 cache 27 of the BSU 26. Additional functions of the BSU relate to storage consistency in a multiprocessor system. For example, when a particular instruction/execution unit such as 20c modifies data, that data must be made visible to all other instruction/execution units 20a, 20b in the system. If unit 20c modifies data presently stored in its L1 cache 18c, a search for that particular data is made in the L2 cache directory 27 of the BSU 26. If found, the particular data is modified to reflect the modification in the L1 cache 18c. Furthermore, the other processors 20a and 20b are permitted to see the modified, correct data now resident in the L2 cache 27 in order to permit such other processors to modify the corresponding data resident in their L1 caches 18a and 18b. The unit 20c cannot re-access the particular data until the other processors 20a and 20b have had a chance to modify their corresponding data accordingly.

FIG. 3 shows the detailed construction of each instruction/execution unit (20 in FIG. 1 or 20a-20c in FIG. 2) and its corresponding L1 cache (18 in FIG. 1 or 18a-18c in FIG. 2).

In FIG. 1, and in FIG. 2, the instruction/execution unit 20, 20a, 20b, and 20c is disposed in a block labelled "I-unit E-unit C/S (92KB)". This block may be termed the "processor", the "instruction processing unit", the "central processing unit (CPU)", or, as indicated above, the "instruction/execution unit". For the sake of simplicity in the description provided below, the block 20, 20a-20c will be called the "processor". In addition, the "I/D caches (L1)" will be called the "L1 cache".

FIG. 3 provides a detailed construction of the processor 20 (or 20a, 20b, 20c) and of the L1 cache 18 (or 18a, 18b, 18c).

The processor 20 comprises several elements. A control store subsystem 20.1 includes a high-speed fixed control store 20.11 of 84k bytes and an 8k byte, 2k word, 4-way associative pageable area 20.12. Machine state controls 20.2 include the global controls 20.21 for the processor, an op branch table 20.22 connected to the CSAR via the control store origin address bus and used to generate the initial address for microcoded instructions. An address generation unit 20.3 comprises 3 chips: an instruction cache DLAT and directory chip 20.31, a data cache DLAT and directory chip 20.32, and an address generation chip 20.33 connected to the L1 cache 18 via the address bus 19.1. The instruction DLAT and directory chip 20.31 is connected to the instruction cache portion 18.11 of the L1 cache 18 via four "hit" lines 19.2 which indicate that the requested instruction will be found in the instruction cache portion 18.11 of the L1 cache. Likewise, four "hit" lines 19.3 connect the data DLAT and directory 20.32 to the data cache 18.22 portion of the L1 cache 18 indicating that the requested data will be found in the data cache 18.22 portion of the L1 cache.

Furthermore the instruction DLAT/Directory chip 20.31 contains the effective address portion and some status bits of the store buffer EA 20.31b. The data DLAT and Directory chip 30.32 contains the absolute address portion and all of the status bits of the store buffer AA 20.32b in accordance with the present invention. The address generation chip 20.3 contains copies 20.34 of the 16 general purpose registers used to generate addresses, and also includes three storage address registers (SARS) 20.35 used to provide addresses to the microcode for instruction execution. A fixed point instruction execution unit 20.4 is connected to the data cache 18.2 via the data bus (D-bus) 19.4. A local store stack (local store) 20.41 contains the 16 general-purpose registers mentioned above as well as a number of working registers used exclusively by the microcode. Condition registers 20.42 contain the results of a number of arithmetic and shift operations, and a condition code. Numeral 20.43 denotes a conventional four-byte arithmetic logic unit (ALU); 20.44 is an 8-byte rotate merge unit. Branch bit select hardware 20.45 allows the selection of bits from various registers to determine the direction of a branch operation; the bits are selected from general purpose registers, working registers, and the condition registers. A floating-point unit 20.5 executes floating-point instructions. The floating point processor 20.5 is disclosed in pending patent application Ser. No. 102,985, entitled "Dynamic Multiple Instruction Stream Multiple Data Multiple Pipeline Apparatus for Floating Point Single Instruction Stream Single Data Architectures", filed on Sep. 30, 1987. The ALU 20.43 contains an adder as disclosed in pending patent application Ser. No. 066,580, filed Jun. 26, 1987, entitled " A High Performance Parallel Binary Byte Adder". An externals chip 20.6 includes timers and interrupt hardware for interrupts from the I/0 subsystem 14 and other units.

In an L1 cache 18, an instruction cache 18.1 comprises a 16k byte/4-way cache 18.11, a 16-byte instruction buffer 18.12 at the output thereof, and an 8-byte in-page register 18.13 at the input 17.1 from storage. The eight-bytes storage bus 17.1 is connected to the instruction cache 18.1 via the in-page register 18.13. The in-page register 18.13 is connected to the control store subsystem 20.1. It provides data to the subsystem when a pageable control store miss requires that new data be brought into the control store. A data cache 18.2 comprises an in-page buffer 18.21 also connected to the storage bus 17.1. The data cache 18.22 is a 16k-byte 4-way cache. A cache-dataflow unit 18.23 comprises a series of input and output registers connected to the processor via an 8-byte data bus (D-bus) 19.4, and an 8-element store buffer data 18.24 in accordance with the present invention.

The following describes the functional operation of one of the processors 20a-20c operating in conjunction with its associated L1 cache, 18a, 18b, or 18c.

Assume that an instruction to be executed is located in the instruction cache 18.11. The instruction is fetched from the instruction cache 18.11 and is stored in the instruction buffer 18.12. Every attempt is made to keep the instruction buffer full at all times. The instruction is fetched from the instruction buffer 18.12 and is stored in the instruction registers (INSTR REG) 20.36 of the address-generation unit 20.3, registers 20.46 of the fixed-point execution unit 20.4, and registers 20.23 of the machine-state controls 20.2. At this point, the instruction decoding begins. If one or more operands are required, they are fetched from the GPR COPY 20.34 in the address generation unit 20.3. (Normally, GPR COPY 20.34 is accessed if operands are required for the base and index registers for an RX instruction.)

The address generation process begins in the next cycle. The base and index register contents are added to a displacement field from the instruction, and the effective address is generated and sent to the data cache 18.2 and/or to the instruction cache 18.1. In this example, an operand is sought, so the effective address will be sent to the data cache 18.2. The address is also sent to the data DLAT and directory chip 20.32, since, in this example, an operand is sought. Access to the cache, DLAT, and the directories begins in the third cycle. The DLAT 20.32a determines if the address is translatable from an effective address to an absolute address. Assuming that this translation has been previously performed, it will have been recorded. The translated address from the DLAT output is compared with the output of the cache directory 20.32a. If they are equal, the data DLAT and directory chip 20.32 raises one of the four "hit" lines 19.3. The hit lines are connected to the data cache 18.22; a "hit" line indicates which of the four associativity classes contains the data that the processor 20 wishes to retrieve. On the next cycle, the data-cache 18.22 output is gated through a fetch alignment shifter, in the cache dataflow 18.23. It is shifted appropriately, transmitted along the D-BUS 19.4 to the fixed point execution unit 20.4, and latched into the ALU 20.43. In parallel with the shifting process, the first operand is accessed from the general-purpose registers in local store 20.41. As a result, two operands are latched in the input of the ALU 20.43, if necessary. In the fifth cycle, the ALU 20.43 will process (add, subtract, divide, etc.) the two operands according to the instruction opcode. The output of the ALU 20.43 and the condition registers 20.42 are latched at the end of the fifth cycle, to indicate an overflow or zero condition. In the sixth cycle, the output of the ALU 20.43 is written back into the local store 20.41 and into the GPR copy 20.34 of the address generation unit 20.3 in order to keep the GPR copy 20.34 synchronized with the content of the local store 20.41. When the decode cycle of this instruction is complete, the decode cycle of the next instruction may begin. Up to six instructions may be either decoding or executing at any one time. Certain instructions require the use of microcode to complete execution. Therefore, during the decode cycle, the op