WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
Ensuring data integrity by locked-load and conditional-store operations in a multiprocessor system    
United States Patent5193167   
Link to this pagehttp://www.wikipatents.com/5193167.html
Inventor(s)Sites; Richard L. (Boylston, MA); Witek; Richard T. (Littleton, MA)
AbstractA high performance CPU of the RISC (reduced instruction set) type employs a standardized, fixed instruction size, and permits only simplified memory access data width and addressing modes. The instruction set is limited to register-to-register operations and register load/store operations. Byte manipulation instructions, included to permit use of previously-established data structures, include the facility for doing in-register byte extract, insert and masking, along with non-aligned load and store instructions. The provision of load/locked and store/conditional instructions permits the implementation of atomic byte writes.



 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Inventor     Sites; Richard L. (Boylston, MA); Witek; Richard T. (Littleton, MA)
Owner/Assignee     Digital Equipment Corporation (Maynard, MA)
Patent assignment
All assignments
Publication Date     March 9, 1993
Application Number     07/547,618
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     June 29, 1990
US Classification    
Int'l Classification    
Examiner     Dixon; Joseph L.
Assistant Examiner     Asta; Frank J.
Attorney/Law Firm     Arnold, White & Durkee
Address
Parent Case    
Priority Data    
USPTO Field of Search    
Patent Tags     ensuring data integrity locked-load conditional-store operations multiprocessor
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
5025366
Baror
711/128
Jun,1991

[0 after 0 votes]
5023776
Gregor
711/122
Jun,1991

[0 after 0 votes]
4561051
Rodman
711/152
Dec,1985

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


What is claimed is:

1. A method of performing an atomic byte write in an operation of a processor, said atomic byte write being performed in a memory by way of a system bus external to said processor, said system bus connecting said processor to said memory, comprising the steps of:

loading a content of a selected aligned multibyte external memory location from said memory to internal register means in said processor, said content including a non-aligned byte location to be written, and thereafter detecting on said system bus any other store to said selected aligned-multibyte external memory location;

providing in said processor an indication if any other store is made to said selected aligned-multibyte external memory location;

replacing in said internal register means a value in said non-aligned byte location with a new value to be written, while leaving remainder of said content intact; and thereafter

conditionally storing by said processor said content of said internal register means to said selected aligned multibyte external memory location, depending upon whether said indication is present.

2. A method according to claim 1 including the step of providing an indicator in said processor of whether or not said step of conditionally storing executed a store.

3. A method according to claim 2 wherein said steps of conditionally storing and providing said indicator are performed by one instruction executed by said processor.

4. A method according to claim 2 including, after said step of conditionally storing, the step of executing by said processor a branch which is conditional upon said indicator.

5. A method according to claim 1 wherein said step of providing an indication resets a status bit, said status bit being set upon said step of loading.

6. A method according to claim 5 wherein said status bit is reset upon occurrence of an interrupt or exception in said processor.

7. A method according to claim 1 wherein said selected aligned-multibyte memory location is contained in a memory and said memory is aligned on quadword (8-byte) boundaries.

8. A method according to claim 1 wherein said stop of loading and providing said indication are performed by one instruction executed by said processor.

9. A method according to claim 1 wherein said selected aligned-multibyte memory location is contained in a memory and said memory is aligned on longword (4-byte) boundaries.

10. A method according to claim 1 wherein said processor is a single-chip integrated circuit device.

11. A method of writing by a processor to a selected non-aligned byte location in external memory, the external memory being coupled to said processor by a bus, comprising the steps of:

loading a selected aligned external memory location including said selected non-aligned byte location via said bus to internal register means in said processor, and thereafter detecting by said processor any other store via said bus to said selected aligned external memory location;

providing an indication in said processor if any other store is made to said selected aligned external memory location;

replacing at least one selected byte in said internal register means with a new value to be written to provide a modified content; and thereafter

conditionally storing said modified content from said internal register means in said processor to said selected aligned external memory location via said bus, depending upon whether said indication is present.

12. A method according to claim 11 wherein said steps of loading and providing said indication are performed by one instruction executed by said processor, and said step of conditionally storing is performed by another one instruction executed by said processor.

13. A method according to claim 11 wherein said selected aligned-multibyte memory location is contained in a memory and said memory is aligned on longword (4-byte) boundaries.

14. A method according to claim 11 wherein said step of providing an indication resets a status bit in the processor, said status bit being set upon said step of loading.

15. A method according to claim 11 wherein said processor is a signal-chip integrated circuit device.

16. A method according to claim 11 including the step of providing an indicator in said processor if said step of conditionally storing finds said indication to be present.

17. A method according to claim 16 including, after said step of conditionally storing, the step of executing by said processor a branch which is conditional upon said indicator.

18. A method according to claim 11 wherein said selected aligned-multibyte memory location is contained in a memory and said memory is aligned on quadword (8-byte) boundaries.

19. A computer system having a capability of performing an atomic byte write, comprising:

a CPU and a shared memory, said CPU accessing said shared memory by a system bus; said memory having an access path for aligned memory locations of multi-byte width;

means for generating in said CPU a read request for a non-aligned data item and for sending to said memory via said bus an aligned read request for a content of a location in said memory which includes said non-aligned data item, said non-aligned data item being of at least one byte in width;

register means in said CPU for receiving said content from said memory via said bus and storing said content;

means for detecting in said CPU another request on said bus for said location and for providing a recorded indication of detection of said another request;

means in said CPU for altering only said data item in said register means to produce an altered content;

means in said CPU for generating a write request for storing said altered content in said location in said memory, and for conditionally sending said write request to said memory via said bus depending upon said recorded indication.

20. A system according to claim 19 including means in said CPU for indicating whether said recorded indication was present when conditionally sending said write request to said memory, and further including means for executing a conditional branch after said conditional sending, said conditional branch being responsive to said means for indicating.

21. A system according to claim 20 wherein said means for indicating includes a register of a register set accessible within said CPU, and wherein said register means is another register of said register set.

22. A system according to claim 21 wherein said means for generating and said means for detecting employ a signal instruction executed by said CPU, and wherein said means for generating a write request and conditionally sending and said means for indicating employ another single instruction executed by said CPU.

23. A system according to claim 22 wherein said memory is aligned on quadword (8-byte) boundaries.

24. A system according to claim 22 wherein said CPU is a single-chip integrated circuit device.
 Description Submit all comments and votes
 


RELATED CASES

This application discloses subject matter also disclosed in the following copending applications, filed herewith and assigned to Digital Equipment Corporation, the assignee of this invention:

Ser. No. 547,589, filed Jun. 29, 1990, entitled BRANCH PREDICTION IN HIGH-PERFORMANCE PROCESSOR, by Richard L. Sites and Richard T. Witek, inventors:

Ser. No. 547,630, filed Jun. 29, 1990, entitled IMPROVING PERFORMANCE IN REDUCED INSTRUCTION SET PROCESSOR, by Richard L. Sites and Richard T. Witek, inventors:

Ser. No. 547,629, filed Jun. 29, 1990, entitled IMPROVING BRANCH PERFORMANCE IN HIGH SPEED PROCESSOR, by Richard L. Sites and Richard T. Witek, inventors:

Ser. No. 547,600, filed Jun. 29, 1990, entitled GRANULARITY HINT FOR TRANSLATION BUFFER IN HIGH PERFORMANCE PROCESSOR, by Richard L. Sites and Richard T. Witek, inventors:

Ser. No. 547,619, filed Jun. 29, 1990, entitled IN-REGISTER DATA MANIPULATION IN REDUCED INSTRUCTION SET PROCESSOR, by Richard L. Sites and Richard T. Witek, inventors:

Ser. No. 07/547684, filed Jun. 29, 1990, entitled IMPROVING COMPUTER PERFORMANCE BY ELIMINATING BRANCHES, by Richard L. Sites and Richard T. Witek, inventors; and

Ser. No. 07/547597, filed Jun. 29, 1990, entitled BYTE-COMPARE OPERATION FOR HIGH-PERFORMANCE PROCESSOR, by Richard L. Sites and Richard T. Witek, inventors:

BACKGROUND OF THE INVENTION

This invention relates to digital computers, and more particular to a high-performance processor executing a reduced instruction set.

Complex instruction set or CISC processors are characterized by having a large number of instructions in their instruction set, often including memory-to-memory instructions with complex memory accessing modes. The instructions are usually of variable length, with simple instructions being only perhaps one byte in length, but the length ranging up to dozens of bytes. The VAX.TM. instruction set is a primary example of CISC and employs instructions having one to two byte opcodes plus from zero to six operand specifiers, where each operand specifier is from one byte to many bytes in length. The size of the operand specifier depends upon the addressing mode, size of displacement (byte, word or longword), etc. The first byte of the operand specifier describes the addressing mode for that operand, while the opcode defines the number of operands: one, two or three. When the opcode itself is decoded, however, the total length of the instruction is not yet known to the processor because the operand specifiers have not yet been decoded. Another characteristic of processors of the VAX type is the use of byte or byte string memory references, in addition to quadword or longword references; that is, a memory reference may be of a length variable from one byte to multiple words, including unaligned byte references.

Reduced instruction set or RISC processors are characterized by a smaller number of instructions which are simple to decode, and by requiring that all arithmetic/logic operations be performed register-to-register. Another feature is that of allowing no complex memory accesses; all memory accesses are register load/store operations, and there are a small number of relatively simple addressing modes, i.e., only a few ways of specifying operand addresses. Instructions are of only one length, and memory accesses are of a standard data width, usually aligned. Instruction execution is of the direct hardwired type, as distinct from microcoding. There is a fixed instruction cycle time, and the instructions are defined to be relatively simple so that they all execute in one short cycle (on average, since pipelining will spread the actual execution over several cycles).

One advantage of CISC processors is in writing source code. The variety of powerful instructions, memory accessing modes and data types should result in more work being done for each line of code (actually, compilers do not produce code taking full advantage of this), but whatever gain in compactness of source code is accomplished at the expense of execution time. Particularly as pipelining of instruction execution has become necessary to achieve performance levels demanded of systems presently, the data or state dependencies of successive instructions, and the vast differences in memory access time vs. machine cycle time, produce excessive stalls and exceptions, slowing execution. The advantage of RISC processors is the speed of execution of code, but the disadvantage is that less is accomplished by each line of code, and the code to accomplish a given task is much more lengthy. One line of VAX code can accomplish the same as many lines of RISC code.

When CPUs were much faster than memory, it was advantageous to do more work per instruction, because otherwise the CPU would always be waiting for the memory to deliver instructions--this factor led to more complex instructions that encapsulated what would be otherwise implemented as subroutines. When CPU and memory speed became more balanced, a simple approach such as that of the RISC concepts becomes more feasible, assuming the memory system is able to deliver one instruction and some data in each cycle. Hierarchical memory techniques, as well as faster access cycles, provide these faster memory speeds. Another factor that has influenced the CISC vs. RISC choice is the change in relative cost of off-chip vs. on-chip interconnection resulting from VLSI construction of CPUs. Construction on chips instead of boards changes the economics--first it pays to make the architecture simple enough to be on one chip, then more on-chip memory is possible (and needed) to avoid going off-chip for memory references. A further factor in the comparison is that adding more complex instructions and addressing modes as in a CISC solution complicates (thus slows down) stages of the instruction execution process. The complex function might make the function execute faster than an equivalent sequence of simple instructions, but it can lengthen the instruction cycle time, making all instructions execute slower; thus an added function must increase the overall performance enough to compensate for the decrease in the instruction execution rate.

The performance advantages of RISC processors, taking into account these and other factors, is considered to outweigh the shortcomings, and, were it not for the existing software base, most new processors would probably be designed using RISC features. A problem is that business enterprises have invested many years of operating background, including operator training as well as the cost of the code itself, in applications programs and data structures using the CISC type processors which were the most widely used in the past ten or fifteen years. The expense and disruption of operations to rewrite all of the code and data structures to accommodate a new processor architecture may not be justified, even though the performance advantages ultimately expected to be achieved would be substantial.

Accordingly, the objective is to accomplish all of the performance advantages of a RISC-type processor architecture, but yet allow the data structures and code previously generated for existing CISC-type processors to be translated for use in a high-performance processor.

SUMMARY OF THE INVENTION

In accordance with one embodiment of the invention, a high-performance processor is provided which is of the RISC type, using a standardized, fixed instruction size, and permitting only a simplified memory access data width, using simple addressing modes. The instruction set is limited to register-to-register operations (for arithmetic and logic type operations using the ALU, etc.) and register load/store operations where memory is referenced; there are no memory-to-memory operations, nor register-to-memory operations in which the ALU or other logic functions are done. The functions performed by instructions are limited to allow non-microcoded implementation, simple to decode and execute in a short cycle. On-chip floating point processing is provided, and on-chip instruction and data caches are employed in an example embodiment.

Byte manipulation instructions are included to permit use of previously-established data structures. These instructions include the facility for doing in-register byte extract, insert and masking, along with non-aligned load and store instructions, so that byte addresses can be made use of even though the actual memory operations are aligned quadword in nature.

The provision of load/locked and store/conditional instructions permits the implementation of atomic byte writes. To write to a byte address in a multibyte (e.g., quadword) aligned memory, the CPU loads a quadword (or longword) and locks this location, writes to the byte address in register while leaving the remainder of the quadword undisturbed, then stores the updated quadword in memory conditionally, depending upon whether the quadword has been written by another processor since the load/locked operation.

Another byte manipulation instruction, according to one feature of the invention, is a byte compare instruction. All bytes of a quadword in a register are compared to corresponding bytes in another register. The result is a single byte (one bit for each byte compared) in a third register. Since this operation is done to a general purpose register (rather than to a special hardware location), several of the byte compares can be done in sequence, and no added state must be accounted for upon interrupt or the like. This byte compare can be used to advantage with a byte zeroing instruction in which selected bytes of a quadword are zeroed, with the bytes being selected by bits in a low-order byte of a register. That is, the result of a byte compare can be used to zero bytes of another register.

Speed of execution is highly dependent on the sequentiality of the instruction stream; branches disrupt the sequence and generate stalls while the prefetched instruction stream is flushed and a new sequence is begun. By providing a conditional move instruction, many short branches can be eliminated altogether. A conditional move instruction tests a register and moves a second register to a third if the condition is met, this function can be substituted for short branches and thus maintain the sequentiality of the instruction stream.

If branches cannot be avoided, the performance can be speeded up by predicting the target of a branch and prefetching the new instruction based upon this prediction. According to a feature of one embodiment, a branch prediction rule is followed that requires all forward branches to be predicted not-taken and all backward branches (as is common for loops) to be predicted as taken. Upon compilation, the code is rearranged to make sure the most likely path is backward rather than forward, so more often than not the predicted path is taken and the proper instruction is prefetched.

Another performance improvement is to make use of unused bits in the standard-sized instruction to provide a hint of the expected target address for jump and jump to subroutine instructions or the like. The target can thus be prefetched before the actual address has been calculated and placed in a register. If the target address of the hint matches the calculated address when the instruction is executed, then the prefetched address is already in the pipeline and will execute much faster. The hint is added to the jump instruction by the compiler.

In addition, the unused displacement part of the jump instruction can contain a field to define the actual type of jump, i.e., jump, jump to subroutine, return from subroutine, and thus place a predicted target address in a stack to allow prefetching before the instruction has been executed, or take other action appropriate to the operation defined by the hint. A hint may be ignored by the hardward, and if so the code still executes properly, just slower.

According to a feature of one embodiment, the processor employs a variable memory page size, so that the entries in a translation buffer for implementing virtual addressing can be optimally used. A granularity hint is added to the page table entry to define the page size for this entry. If a large number of sequential pages share the same protection and access rights, all of these pages can be referenced with the same page table entry, and so the use of the translation buffer becomes more efficient. The likelihood of a hit in the translation buffer is increased, so the number of faults to access the page tables is minimized.

An additional feature is the addition of a prefetch instruction which serves to move a block of data to a faster-access cache in the memory hierarchy before the data block is to be used. This prefetch instruction would be inserted by the compiler to perform a function similar to that of a vector processor, but does not require vector hardware. The prefetch instruction does not generate memory exceptions or protection or access violations, and so does not slow down execution if the prefetch fails. Again, the instruction is optional, and if the processor cannot execute it the normal code executes without problems.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as other features and advantages thereof, will be best understood by reference to the detailed description of specific embodiments which follows, when read in conjunction with the accompanying drawings, wherein:

FIG. 1 is an electrical diagram in block form of a computer system employing a CPU which may employ features of the invention;

FIG. 2 is a diagram of data types used in the processor of FIG. 1;

FIG. 3 is an electrical diagram in block form of the instruction unit or I-box of the CPU of FIG. 1;

FIG. 4 is an electrical diagram in block form of the integer execution unit or E-box in the CPU of FIG. 1;

FIG. 5 is an electrical diagram in block form of the addressing unit or A-box in the CPU of FIG. 1;

FIG. 6 is an electrical diagram in block form of the floating point execution unit or F-box in the CPU of FIG. 1;

FIG. 7 is a timing diagram of the pipelining in the CPU of FIGS. 1-6;

FIG. 8 is a diagram of the instruction formats used in the instruction set of the CPU of FIGS. 1-6;

FIG. 9 is a diagram of the format of a virtual address used in the CPU of FIGS. 1-6;

FIG. 10 is a diagram of the format of a page table entry used in the CPU of FIGS. 1-6; and

FIG. 11 is a diagram of the addressing translation mechanism used in the CPU of FIGS. 1-6;

DETAILED DESCRIPTION OF SPECIFIC EMBODIMENT

Referring to FIG. 1, a computer system which may use features of the invention, according to one embodiment, includes a CPU 10 connected by a system bus 11 to a main memory 12, with an I/O unit (not shown) also accessed via the system bus. The system may be of various levels, from a stand-alone workstation up to a mid-range multiprocessor, in which case the other CPUs such as a CPU 15 also access the main memory 12 via the system bus 11.

The CPU 10 is preferably a single-chip integrated circuit device, although features of the invention could be employed in a processor constructed in multi-chip form. Within the signal chip an integer execution unit 16 (referred to as the "E-box") is included, along with a floating point execution unit 17 (referred to as the "F-box"). Instruction fetch and decoding is performed in an instruction unit 18 or "I-box", and an address unit or "A-box" 19 performs the functions of address generation, memory management, write buffering and bus interface. The memory is hierarchical, with on-chip instruction and data caches being included in the instruction unit 18 and address unit 19 in one embodiment, while a larger, second-level cache 20 is provided off-chip, being controlled by a cache controller in the address unit 19.

The CPU 10 employs an instruction set as described below in which all instructions are of a fixed size, in this case 32-bit or one longword. The instruction and data types employed are for byte, word, longword and quadword, as illustrated in FIG. 2. As used herein, a byte is 8-bits, a word is 16-bits or two bytes, a longword is 32-bits or four bytes, and a quadword is 64-bits or eight bytes. The data paths and registers within the CPU 10 are generally 64-bit or quadword size, and the memory 12 and caches use the quadword as the basic unit of transfer. Performance is enhanced by allowing only quadword or longword loads and stores, although, in order to be compatible with data types used in prior software development, byte manipulation is allowed by certain unique instructions, still maintaining the feature of only quadword or longword loads and stores.

Referring to FIG. 3, the instruction unit 18 or I-box is shown in more detail. The primary function of the instruction unit 18 is to issue instructions to the E-box 16, A-box 19 and F-box 17. The instruction unit 18 includes an instruction cache 21 which stores perhaps 8 Kbytes of instruction stream data, and a quadword (two instructions) of this instruction stream data is loaded to an instruction register 22 in each cycle where the pipeline advances. The instruction unit 18, in a preferred embodiment, decodes two instructions in parallel in decoders 23 and 24, then checks that the required resources are available for both instructions by check circuitry 25. If resources are available and dual issue is possible then both instructions may be issued by applying register addresses on busses 26 and 27 and control bits on microcontrol busses 28 and 29 to the appropriate elements in the CPU 10. If the resources are available for only the first instruction or the instructions cannot be dual issued then the instruction unit 18 issues only the first instruction from the decoder 23. The instruction unit 18 does not issue instructions out of order, even if the resources are available for the second instruction (from decoder 24) and not for the first instruction. The instruction unit 18 does not issue instructions until the resources for the first instruction become available. If only the first of a pair of instructions issues (from the decoder 23), the instruction unit 18 does not advance another instruction into the instruction register 22 to attempt to dual issue again. Dual issues is only attempted on aligned quadword pairs as fetched from memory (or instruction cache 21) and loaded to instruction register 22 as an aligned quadword.

The instruction unit 18 contains a branch prediction circuit 30 responsive to the instructions in the instruction stream to be loaded into register 22. The prediction circuit 30 along with a subroutine return stack 31 is used to predict branch addresses and to cause address generating circuitry 32 to prefetch the instruction stream before needed. The subroutine return stack 31 (having four-entries, for example) is controlled by the hint bits in the jump, jump to subroutine and return instructions as will be described. The virtual PC (program counter) 33 is included in the address generation circuitry 32 to produce addresses for instruction stream data in the selected order.

One branch prediction method is the use of the value of the sign bit of the branch displacement to predict conditional branches, so the circuit 30 is responsive to the sign bit of the displacement appearing in the branch instructions appearing at inputs 35. If the sign bit is negative, it predicts the branch is taken, and addressing circuit 32 adds the displacement to register Ra (one of the registers of register set 43, as selected by field Ra of the instruction) to produce the first address of the new address sequence to be fetched. If the sign is positive it predicts not taken, and the present instruction stream is continued in sequence.

The instruction unit 18 contains an 8-entry fully associative translation buffer (TB) 36 to cache recently used instruction-stream address translations and protection information for 8 Kbyte pages. Although 64-bit addresses are nominally possible, as a practical matter 43-bit addresses are adequate for the present. Every cycle the 43-bit virtual program counter 33 is presented to the instruction stream TB 36. If the page table entry (PTE) associated with the virtual PC is cached in the TB 36 then the page frame number (PFN) and protection bits for the page which contains the virtual PC is used by the instruction unit 18 to complete the address translation and access checks. A physical address is thus applied to the address input 37 of the instruction cache 21, or if there is a cache miss then this instruction stream physical address is applied by the bus 38 through the address unit 19 to the cache 20 or memory 12. In a preferred embodiment, the instruction stream TB 36 supports any of the four granularity hint block sizes as defined below, so that the probability of a hit in the TB 36 is increased.

The execution unit or E-box 16 is shown in more detail in FIG. 4. The execution unit 16 contains the 64-bit integer execution datapath including an arithmetic/logic unit (ALU) 40, a barrel shifter 41, and an integer multiplier 42. The execution unit 16 also contains the 32-register 64-bit wide register file 43, containing registers R0 to R31, although R31 is hardwired as all zeros. The register file 43 has four read ports and two write ports which allow the sourcing (sinking) of operands (results) to both the integer execution datapath and the address unit 19. A bus structure 44 connects two of the read ports of the register file 43 to the selected inputs of the ALU 40, the shifter 41 or the multiplier 42 as specified by the control bits of the decoded instruction on busses 28 or 29 from the instruction unit 18, and connects the output of the appropriate function to one of the write ports to store the result. That is, the address fields from the instruction are applied by the busses 26 or 27 to select the registers to be used in execution the instruction, and the control bits 28 or 29 define the operation in the ALU, etc., and defines which internal busses of the bus structure 44 are to be used when, etc.

The A-box or address unit 19 is shown in more detail in FIG. 5. The A-box 19 includes five functions: address translation using a translation buffer 48, a load silo 49 for incoming data, a write buffer 50 for outgoing write data, an interface 51 to a data cache, and the external interface 52 to the bus 11. The address translation datapath has the displacement adder 53 which generates the effective address (by accessing the register file 43 via the second set of read and write ports, and the PC), the data TB 48 which generates the physical address on address bus 54, and muxes and bypassers needed for the pipelining.

The 32-entry fully associative data translation buffer 48 caches recently-used data-stream page table entries for 8 Kbyte pages. Each entry supports any of the four granularity hint block sizes, and a detector 55 is responsive to the granularity hint as described below to change the number of low-order bits of the virtual address passed through from virtual address bus 56 to the physical address bus 54.

For load and store instructions, the effective 43-bit virtual address is presented to TB 48 via bus 56. If the PTE of the supplied virtual address is cached in the TB 48, the PFN and protection bits for the page which contains the address are used by the address unit 19 to complete the address translation and access checks.

The write buffer 50 has two purposes (1) To minimize the number of CPU stall cycles by providing a high bandwidth (but finite) resource for receiving store data. This is required since the CPU 10 can generate store data at the peak rate of one quadword every CPU cycle which may be greater than the rate at which the external cache 20 can accept the data; and (2) To attempt to aggregate store data into aligned 32-byte cache blocks for the purpose of maximizing the rate at which data may be written from the CPU 10 into the external cache 20. The write buffer 50 has eight entries. A write buffer entry is invalid if it does not contain data to be written or is valid if it contains data to be written. The write buffer 50 contains two pointers: the head pointer 57 and the tail pointer 58. The head pointer 57 points to the valid write buffer entry which has been valid the longest period of time. The tail pointer 58 points to the valid buffer entry slot which will next be validated. If the write buffer 50 is completely full (empty) the head and tail pointers point to the same valid (invalid) entry. Each time the write buffer 50 is presented with a new store instruction the physical address generated by the instruction is compared to the address in each valid write buffer entry. If the address is in the same aligned 32-byte block as an address in a valid write buffer entry then the store data is merged into that entry and the entry's longword mask bits are updated. If no matching address is found in the write buffer then the store data is written into the entry designated by the tail pointer 58, the entry is validated, and the tail pointer 58 is incremented to the next entry.

The address unit 19 contains a fully folded memory reference pipeline which may accept a new load or store instruction every cycle until a fill of a data cache 59 ("D-cache") is required. Since the data cache 59 lines are only allocated on load misses, the address unit 19 may accept a new instruction every cycle until a load miss occurs. When a load miss occurs the instruction unit 18 stops issuing all instructions that use the load port of the register file 43 (load, store, jump subroutine, etc., instructions).

Since the result of each data cache 59 lookup is known late in the pipeline (stage S7 as will be described) and instructions are issued in pipe stage S3, there may be two instructions in the address unit 19 pipeline behind a load instruction which misses the data cache 59. These two instructions are handled as follows: First, loads which hit the data cache 59 are allowed to complete, hit under miss. Second, load misses are placed in the silo 49 and replayed in order after the first load miss completes. Third, store instructions are presented to the data cache 59 at their normal time with respect to the pipeline. They are silo'ed and presented to the write buffer 50 in order with respect to load misses.

The on-chip pipelined floating point unit 17 or F-box as shown in more detail in FIG. 6 is capable of executing both DEC and IEEE floating point instructions according to the instruction set to be described. The floating point unit 17 contains a 32-entry, 64-bit, floating point register file 61, and a floating point arithmetic and logic unit 62. Divides and multiplies are performed in a multiply/divide circuit 63. A bus structure 64 interconnects two read ports of the register file 61 to the appropriate functional circuit as directed by the control bits of the decoded instruction on busses 28 or 29 from the instruction unit 18. The registers selected for an operation are defined by the output buses 26 or 27 from the instruction decode. The floating point unit 17 can accept an instruction every cycle, with the exception of floating point divide instructions, which can be accepted only every several cycles. A latency of more than one cycle is exhibited for all floating point instructions.

In an example embodiment, the CPU 10 has an 8 Kbyte data cache 59, and 8 Kbyte instruction cache 21, with the size of the caches depending on the available chip area. The on-chip data cache 59 is write-through, direct mapped, read-allocate physical cache and has 34-byte (1-hexaword) blocks. The system may keep the cache 59 coherent with memory 12 by using an invalidate bus, not shown. The cache 59 has longword parity in the data array 66 and there is a parity bit for each tag entry in tag store 67.

The instruction cache 21 may be 8 Kbytes or 16 Kbytes, for example, or may be larger or smaller, depending upon die area. Although described above as using physical addressing with a TB 36, it may also be a virtual cache, in which case it will contain no provision for maintaining its coherence with memory 12. If the cache 21 is a physical addressed cache the chip will contain circuitry for maintaining its coherence with memory: (1) when the write buffer 50 entries are sent to the internal interface 52, the address will be compared against a duplicate instruction cache 21 tag, and the corresponding block of instruction cache 21 will be conditionally invalidated; (2) the invalidate bus will be connected to the instruction cache 21.

The main data paths and registers in the CPU 10 are all 64-bits wide. That is, each of the integer registers 43, as well as each of the floating point registers 61, is a 64-bit register, and the ALU 40 has two 64-bit inputs 40a and 40b and a 64-bit output 40c. The bus structure 44 in the execution unit 16, which actually consists of more than one bus, has 64-bit wide data paths for transferring operands between the integer registers 43 and the inputs and output of the ALU 40. The instruction decoders 23 and 24 produce register address outputs 26 and 27 which are applied to the addressing circuits of the integer registers 43 and/or floating point registers 61 to select which register operands are used as inputs to the ALU 40 or 62, and which of the registers 43 or registers 61 is the destination for the ALU (or other functional unit) output.

The dual issue decision is made by the circuitry 25 according to the following requirement, where only one instruction from the first column and one instruction from the second column can be issued in one cycle:

______________________________________ Column A Column B ______________________________________ Integer Operate Floating Operate Floating Load/Store Integer Load/Store Floating Branch Integer Branch JSR ______________________________________

That is, the CPU 10 can allow dual issue of an integer load or store instruction with an integer operate instruction, but not an integer branch with an integer load or store. Of course, the circuitry 25 also checks to see if the resources are available before allowing two instructions to issue in the same cycle.

An important feature is the RISC characteristic of the CPU 10 of FIGS. 1-6. The instructions executed by this CPU 10 are always of the same size, in this case 32-bits, instead of allowing variable-length instructions. The instructions execute on average in one machine cycle (pipelined as described below, and assuming no stalls), rather than a variable number of cycles. The instruction set includes only register-to-register arithmetic/logic type of operations, or register-to-memory (or memory-to-register) load/store type of operations, and there are no complex memory addressing modes such as indirect, etc. An instruction performing an operation in the ALU 40 always gets its operands from the register file 43 (or from a field of the instruction itself) and always writes the result to the register file 43; these operands are never obtained from memory and the result is never written to memory. Loads from memory are always to a register in register files 43 or 61, and stores to memory are always from a register in the register files.

Referring to FIG. 7, the CPU 10 has a seven stage pipeline for integer operate and memory reference instructions. The instruction unit 18 has a seven stage pipeline to determine instruction cache 21 hit/miss. FIG. 7 is a pipeline diagram for the pipeline of execution unit 16, instruction unit 18 and address unit 19. The floating point unit 17 defines a pipeline in parallel with that of the execution unit 16, but ordinarily employs more stages to execute. The seven stages are referred to as S0-S6, where a stage is to be executed in one machine cycle (clock cycle). The first four stages S0, S1, S2 and S3 are executed in the instruction unit 18, and the last three stages S4, S5 and S6 are executed in one or the other of the execution unit 16 or address unit 19, depending upon whether the instruction is an operate or a load/store. There are bypassers in all of the boxes that allow the results of one instruction to be used as operands of a following instruction without having to be written to the register file 43 or 61.

The first stage S0 of the pipeline is the instruction fetch or IF stage, during which the instruction unit 18 fetches two new instructions from the instruction cache 21, using the PC 33 address as a base. The second stage S1 is the swap stage, during which the two fetched instructions are evaluated by the circuit 25 to see if they can be issued at the same time. The third stage S2 is the decode stage, during which the two instructions are decoded in the decoders 23 and 24 to produce the control signals 28 and 29 and register addresses 26 and 27. The fourth stage S3 is the register file 43 access stage for operate instructions, and also is the issue check decision point for all instructions, and the instruction issue stage. The fifth stage S4 is cycle one of the computation (in ALU 40, for example) if it is an operate instruction, and also the instruction unit 18 computes the new PC 33 in address generator 32; if it is a memory reference instruction the address unit 19 calculates the effective data stream address using the adder 53. The sixth stage S5 is cycle two of the computation (e.g., in ALU 40) if it is an operate instruction, and also the data TB 48 lookup stage for memory references. The last stage S6 is the write stage for operate instructions having a register write, during which, for example, the output 40c of the ALU 40 is written to the register file 43 via the write port, and is the data cache 59 or instruction cache 21 hit/miss decision point for instruction stream or data stream references.

The CPU 10 pipeline divides these seven stages S0-S6 of instruction processing into four static and three dynamic stages of execution. The first four stages S0-S3 consist of the instruction fetch, swap, decode and issue logic as just described. These stages S0-S3 are static in that instructions may remain valid in the same pipeline stage for multiple cycles while waiting for a resource or stalling for other reasons. These stalls are also referred to as pipeline freezes. A pipeline freeze may occur while zero instructions issue, or while one instruction of a pair issues and the second is held at the issue stage. A pipeline freeze implies that a valid instruction or instructions is (are) presented to be issued but can not proceed.

Upon satisfying all issue requirements, instructions are allowed to continue through the pipeline toward completion. After issuing in S3, instructions can not be held in a given pipe stage S4-S6. It is up to the issue stage S3 (circuitry 25) to insure that all resource conflicts are resolved before an instruction is allowed to continue. The only means of stopping instructions after the issue stage S3 is an abort condition.

Aborts may result from a number of causes. In general, they may be grouped into two classes, namely exceptions (including interrupts) and non-exceptions. The basic difference between the two is that exceptions require that the pipeline be flushed of all instructions which were fetched subsequent to the instruction which caused the abort condition, including dual issued instructions, and restart the instruction fetch at the redirected address. Examples of non-exception abort conditions are branch mispredictions, subroutine call and return mispredictions and instruction cache 21 misses. Data cache 59 misses do not produce abort conditions but can cause pipeline freezes.

In the event of an exception, the CPU 10 first aborts all instructions issued after the excepting instruction. Due to the nature of some error conditions, this may occur as late as the write cycle. Next, the address of the excepting instruction is latched in an internal processor register. When the pipeline is fully drained the processor begins instruction execution at the address given by a dispatch using a PAL code. The pipeline is drained when all outstanding writes to both the integer and floating point register file 43 and 61 have completed and all outstanding instructions have passed the point in the pipeline such that all instructions are guaranteed to complete without an exception in the absence of a machine check.

Referring to FIG. 8, the formats of the various types of instructions of the instruction set executed by the CPU 10 of FIGS. 1-7 are illustrated. One type is a memory instruction 70, which contains a 6-bit opcode in bits <31:26>, two 5-bit register address fields Ra and Rb in bits <25:21> and <20:16>, and a 16-bit signed displacement in bits <15:0>. This instruction is used to transfer data between registers 43 and memory (memory 12 or caches 59 or 20), to load an effective address to a register of the register file, and for subroutine jumps. The displacement field <15:0> is a byte offset; it is sign-extended and added to the contents of register Rb to form a virtual address. The virtual address is used as a memory load/store address or a result value depending upon the specific instruction.

The branch instruction format 71 is also shown in FIG. 8, and includes a 6-bit opcode in bits <31:26>, a 5-bit address field in bits <25:21>, and a 21-bit signed branch displacement in bits <20:0>. The displacement is treated as a longword offset, meaning that it is shifted left two bits (to address a longword boundary), sign-extended to 64-bits and added to the updated contents of PC 33 to form the target virtual address (overflow is ignored).

The operate instructions 72 and 73 are of the formats shown in FIG. 8, one format 72 for three register operands and one format 73 for two register operands and a literal. The operate format is used for instructions that perform integer register operations, allowing two source operands and one destination operand in register file 43. One of the source operands can be a literal constant. Bit-12 defines whether the operate instruction is for a two source register operation or one source register and a literal. In addition to the 6-bit opcode at bits <31:26>, the operate format has a 7-bit function field at bits <11:5> to allow a wider range of choices for arithmetic and logical operation. The source register Ra is specified in either case at bits <25:21>, and the destination register Rc at <4:0>. If a bit-12 is a zero, the source register Rb is defined at bits <20:16>, while if bit-12 is a one then an 8-bit zero-extended literal constant is formed by bits <20:13> of the instruction. This literal is interpreted as a positive integer in the range 0-255, and is zero-extended to 64-bits.

FIG. 8 also illustrates the floating point operate instruction format 74, used for instructions that perform floating point register 61 to floating point register 61 operations. The floating point operate instructions contain a 6-bit opcode at bits <31:26> as before, along with an 11-bit function field at bits <15:5>. There are three operand fields, Fa, Fb and Fc, each specifying either an integer or a floating-point operand as defined by the instruction; only the registers 13 are specified by Fa, Fb and Fc, but these registers can contain either integer or floating-point values. Literals are not supported. Floating point conversions use a subset of the floating point operate format 74 of FIG. 8 and perform register-to-register conversion operations; the Fb operand specifies the source and the Fa operand should be reg-31 (all zeros).

The other instruction format 75 of FIG. 8 is that for privileged architecture library (PAL or PALcode) instructions, which are used to specify extended processor functions. In these instructions a 6-bit opcode is present at bits <31:26> as before, and a 26-bit PALcode function field <25:0> specifies the operation. The source and destination operands for PALcode instructions are supplied in fixed registers that are specified in the individual instruction definitions.

The six-bit opcode field <31:26> in the instruction formats of FIG. 8 allows only 2.sup.6 or sixty-four different instructions to be coded. Thus the instruction set would be limited to sixty-four. However, the "function" fields in the instruction formats 72, 73 and 74 allow variations of instructions having the same opcode in bits <31:26>. Also, the "hint" bits in the jump instruction allow variations such as JSR, RET, as explained below.

Referring to FIG. 9, the format 76 of the virtual address asserted on the internal address bus 56 is shown. This address is nominally 64-bits in width, but of course practical implementations within the next few years will use much smaller addresses. For example, an address of 43-bits provides an addressing range of 8-Terabytes. The format includes a byte offset 77 of, for example, 13-bits to 16-bits in size, depending upon the page size employed. If pages are 8 -Kbytes, the byte-within-page field 77 is 13-bits, for 16 -Kbyte pages the field 77 is 14-bits, for 32-Kbyte pages it is 15-bits, and for 64-Kbyte pages it is 16-bits. The format 76 as shown includes three segment fields 78, 79 and 80, labelled Seg1, Seg2 and Seg3, also of variable size depending upon the implementation. The segments Seg1, Seg2, and Seg3 can be 10-to-13 bits, for example. If each segment size is 10-bits, then a segment defined by Seg3 is 1K pages, a segment for Seg2 is 1M pages, and a segment for Seg1 is 1G pages. Segment number fields Seg1, Seg2 and Seg3 are of the same size for a given implementation. The segment number fields are a function of the page size; all page table entries at a given level do not exceed one page, so page swapping to access the page table is minimized. The page frame number (PFN) field in the PTE is always 32-bits wide; thus, as the page size grows the virtual and physical address size also grows.

The physical addresses are at most 48-bits, but a processor may implement a smaller physical address space by not implementing some number of high-order bits. The two most significant implemented physical address bits select a caching policy or implementation-dependent type of address space. Different implementations may put different uses and restrictions on these bits as appropriate for the system. For example, in a workstation with a 30-bit <29:0> physical address space, bit <29> may select between memory and I/O and bit <28> may enable or disenable caching in I/O space and must be zero in memory space.

Typically, in a multiprogramming system, several processes may reside in physical memory 12 (or caches) at the same time, so memory protection and multiple address spaces are used by the CPU 10 to ensure that one process will not interfere with either other processes or the operating system. To further improve software reliability, four hierarchical access modes provide memory access con