|
|
|
| United States Patent | 5542059 |
| Link to this page | http://www.wikipatents.com/5542059.html |
| Inventor(s) | Blomgren; James S. (San Jose, CA) |
| Abstract | A CPU pipeline is able to process instructions from a complex instruction
set computer CISC instruction set and from a reduced instruction set
computer RISC set. A mode register is provided to indicate whether RISC or
CISC instructions are currently being processed. Two instruction decode
units are used, one for each instruction set. Compound CISC instructions
flow from the decode pipestage to the address generate stage, then to an
operand cache stage, and finally to an algebraic execute stage before the
results are written back to the GPR register. When the CPU switches to
RISC mode by clearing a mode bit in the mode register, the pipeline is
re-arranged for processing the simpler RISC instructions. Two outputs are
provided for the RISC instruction decoder. The first output is for simple
execute-type instructions, while the second output is for load/store-type
instructions, and connects to the address generate pipestage, which
generates an address for the operand cache stage. These instructions are
prevented from continuing to the execute stage by a mux. The mux normally
connects the operand cache stage to the execute stage when CISC
instructions are being processed, but the mux directly connects the second
output of the RISC instruction decoder to the execute stage when the mode
register enables RISC instruction decoding. This reduces the latency for
RISC instructions by 1 or 2 clocks. An alternate embodiment re-arranges
the pipeline dynamically as simple instructions are detected by the decode
units. The preferred embodiment uses a fixed pipeline with the execute
hardware relocatable to the D, C, or M pipestages. Thus the pipeline is
optimized for both RISC and CISC instructions. |
|
|
|
Title Information  |
|
|
|
|
|
Drawing from US Patent 5542059 |
|
|
Dual instruction set processor having a pipeline with a pipestage
functional unit that is relocatable in time and sequence order |
|
|
|
|
|
| Publication Date |
July 30, 1996 |
|
|
|
|
|
| Filing Date |
December 21, 1994 |
|
|
|
|
|
|
|
|
|
|
|
| Parent Case |
RELATED APPLICATION
This application is a continuation of Ser. No. 08/180,023 filed 1/11/94,
now abandoned. |
|
|
|
|
|
|
|
|
|
|
|
|
|
Title Information  |
|
|
References  |
|
|
| *references marked with an asterisk below are user-added references |
|
U.S. References |
|
|
| Add a new US reference: |
| | Reference | Relevancy | Comments | Reference | Relevancy | Comments | 5291586 Jen 712/220 Mar,1994 |      Your vote accepted [0 after 0 votes] | | 5287465 Kurosawa 712/215 Feb,1994 |      Your vote accepted [0 after 0 votes] | | 5269007 Hanawa 712/218 Dec,1993 |      Your vote accepted [0 after 0 votes] | | 5255384 Sachs 711/207 Oct,1993 |      Your vote accepted [0 after 0 votes] | | 5241636 Kohn 712/215 Aug,1993 |      Your vote accepted [0 after 0 votes] | | 5230069 Brelsford 718/100 Jul,1993 |      Your vote accepted [0 after 0 votes] | | 5226164 Nadas 712/209 Jul,1993 |      Your vote accepted [0 after 0 votes] | | 5222223 Webb, Jr. 711/140 Jun,1993 |      Your vote accepted [0 after 0 votes] | | 5210832 Maier 712/227 May,1993 |      Your vote accepted [0 after 0 votes] | | 5167023 de Nicolas
Nov,1992 |      Your vote accepted [0 after 0 votes] | | 5150468 Staplin 712/245 Sep,1992 |      Your vote accepted [0 after 0 votes] | | 5136696 Beckwith 712/240 Aug,1992 |      Your vote accepted [0 after 0 votes] | | 5097407 Hino 712/209 Mar,1992 |      Your vote accepted [0 after 0 votes] | | 5077657 Cooper
Dec,1991 |      Your vote accepted [0 after 0 votes] | | 5077654 Ohtsuki
Dec,1991 |      Your vote accepted [0 after 0 votes] | | 5073855 Staplin
Dec,1991 |      Your vote accepted [0 after 0 votes] | | 5067069 Fite
Nov,1991 |      Your vote accepted [0 after 0 votes] | | 4992934 Portanova 712/209 Feb,1991 |      Your vote accepted [0 after 0 votes] | | 4991081 Bosshart 711/3 Feb,1991 |      Your vote accepted [0 after 0 votes] | | 4972317 Buonomo 712/227 Nov,1990 |      Your vote accepted [0 after 0 votes] | | 4962519 Upadhya 378/133 Oct,1990 |      Your vote accepted [0 after 0 votes] | | 4841476 Mitchell 703/26 Jun,1989 |      Your vote accepted [0 after 0 votes] | | 4821187 Ueda 712/246 Apr,1989 |      Your vote accepted [0 after 0 votes] | | 4812975 Adachi 703/26 Mar,1989 |      Your vote accepted [0 after 0 votes] | | 4794522 Simpson 703/26 Dec,1988 |      Your vote accepted [0 after 0 votes] | | 4780819 Kashiwagi 703/23 Oct,1988 |      Your vote accepted [0 after 0 votes] | | 4763242 Lee 703/27 Aug,1988 |      Your vote accepted [0 after 0 votes] | | 4633417 Wilburn 703/28 Dec,1986 |      Your vote accepted [0 after 0 votes] | | 4538241 Levin 711/207 Aug,1985 |      Your vote accepted [0 after 0 votes] | | 4377844 Kaufman 711/220 Mar,1983 |      Your vote accepted [0 after 0 votes] | | 5230045 Sindhu 711/203 Dec,1969 |      Your vote accepted [0 after 0 votes] | | | | | |
|
|
|
|
U.S. References |
|
|
Foreign References |
|
|
|
|
|
|
Foreign References |
|
|
Other References |
|
|
|
|
|
|
Other References |
|
|
|
|
|
References  |
|
|
|
|
|
| Market Size |
|
Estimate the gross annual revenues of the relevant market
sector:
|
| | |
| |
|
|
| Market Share |
|
Estimate the percentage of the relevant market sector this invention will capture:
|
| | |
| |
|
|
| Reasonable Royalty |
|
What percentage of gross sales should the inventor or assignee be paid?
|
| | |
| |
|
|
|
Public's "Guesstimation" of Royalty Value
|
| Market Size | N/A | [No votes] | | x | Market Share | N/A | [No votes] | | x | Reasonable Royalty | N/A | [No votes] |
| | N/A | |
| |
|
|
|
|
|
|
|
|
|
|
|
|
Market Review  |
|
|
Technical Review  |
|
|
Claims  |
|
|
I claim:
1. A CPU having a pipeline for processing a plurality of instructions from
two separate instruction sets, said pipeline comprising:
first instruction decode means for decoding instructions from said first
instruction set, said first decode means providing decoded instructions to
said plurality of functional units, said first instruction set having a
first encoding of opcodes to instructions;
second instruction decode means for decoding instructions from said second
instruction set, said second instruction set having a second encoding of
opcodes to instructions, said second encoding of opcodes to instructions
being separate and independent from said first encoding of optodes to
instructions, said second decode means providing decoded instructions to
said plurality of functional units;
enable means for enabling either said first instruction decode means or
said second instruction decode means, said enable means coupled to said
first instruction decode means and coupled to said second instruction
decode means;
a plurality of functional units for processing said plurality of
instructions;
a relocatable functional unit in said plurality of functional units, said
relocatable functional unit for executing native instructions from both of
said two separate instruction sets;
means for indicating which one of said two separate instruction sets is
being processed by said plurality of functional units; and
means for relocating, responsive to said indicating means, said relocating
means for temporally relocating said relocatable functional unit in time
and sequence order to other functional units in said plurality of
functional units.
2. The CPU of claim 1 wherein said two separate instruction sets comprise a
first and a second instruction set, and wherein said relocatable
functional unit is relocated by said relocating means when instructions
from said first instruction set are being processed by said plurality of
functional units.
3. The CPU of claim 2 wherein said relocatable functional unit is a
functional unit for executing algebraic operations relocated by said
relocating means when instructions from said first instruction set are
being processed by said plurality of functional units.
4. The CPU of claim 2 wherein said first instruction decode means includes
means for detecting a simple execute instruction and wherein a functional
unit for executing algebraic operations is relocated by said relocating
means when said simple execute instruction is detected by said first
instruction decode means.
5. A CPU pipeline for executing a complex instruction set computer CISC
instruction set and a reduced instruction set computer RISC instruction
set, said pipeline comprising:
CISC decode means for decoding said CISC instruction set;
RISC decode means for decoding said RISC instruction set;
enable means for enabling either said RISC decode means or said CISC decode
means, said enable means coupled to said RISC decode means and coupled to
said CISC decode means:
a plurality of functional units, each functional unit in said plurality of
functional units for executing both native RISC and native CISC
instructions, said plurality of functional units receiving decoded
insauctions from said CISC decode means and said RISC decode means, said
plurality of functional units arranged in a sequence of functional units;
a relocatable functional unit in said plurality of functional units;
means for indicating if said RISC instruction set or said CISC instruction
set is being processed by said plurality of functional units;
means for relocating, responsive to said means for indicating, said
relocating means for relocating temporally said relocatable functional
unit in time and sequence to other functional units in said plurality of
functional units if said means for indicating indicates that said RISC
instruction set is being processed.
6. The pipeline of claim 5 wherein said RISC instruction set has an
encoding of opcodes to instructions separate and independent from the
encoding of said CISC instruction set.
7. The pipeline of claim 6 wherein said relocatable functional unit is
relocated relative to other functional units by said means for relocation
when said enable means enables said RISC decode means.
8. The pipeline of claim 7 wherein said relocatable functional unit
relocated by said means for relocation is a functional unit for execution
of algebraic and logic instructions.
9. The pipeline of claim 8 wherein said functional unit for execution is
relocated by said means for relocation to an earlier position in said
sequence of functional units in said pipeline.
10. The pipeline of claim 9 wherein said functional unit for execution is
relocated by said means for relocation to immediately after said RISC
decode means, said functional unit for execution receiving decoded RISC
execute instructions from said RISC decode means when said means for
indicating indicates that said RISC instruction set is being processed.
11. A CPU pipeline for executing a complex instruction set computer CISC
instruction set and a reduced instruction set computer RISC instruction
set, said pipeline comprising:
CISC decode means for decoding said CISC instruction set;
RISC decode means for decoding said RISC instruction set;
enable means for enabling either said RISC decode means or said CISC decode
means, said enable means coupled to said RISC decode means and coupled to
said CISC decode means;
a plurality of functional units, said plurality of functional units
receiving decoded instructions from said CISC decode means and said RISC
decode means, said plurality of functional units arranged in a sequence of
functional units;
a relocatable functional unit in said plurality of functional units, said
relocatable functional unit in said plurality of functional units for
executing both native RISC and native CISC instructions;
means for indicating if said RISC instruction set or said CISC instruction
set is being processed by said plurality of functional units;
means for relocating, responsive to said means for indicating, said
relocating means for relocating temporally said relocatable functional
unit in time and sequence to other functional units in said plurality of
functional units if said means for indicating indicates that said RISC
instruction set is being processed.
12. The pipeline of claim 11 wherein said RISC instruction set has an
encoding of opcodes to instructions separate and independent from the
encoding of said CISC instruction set.
13. The pipeline of claim 11 wherein said relocatable functional unit is
relocated relative to other functional units by said means for relocation
when said enable means enables said RISC decode means.
14. The pipeline of claim 13 wherein said relocatable functional unit
relocated by said means for relocation is a functional unit for execution
of algebraic and logic instructions.
15. The pipeline of claim 14 wherein said functional unit for execution is
relocated by said means for relocation to an earlier position in said
sequence of functional units in said pipeline.
16. The pipeline of claim 15 wherein said functional unit for execution is
relocated by said means for relocation to immediately after said RISC
decode means, said functional unit for execution receiving decoded RISC
execute instructions from said RISC decode means when said means for
indicating indicates that said RISC instruction set is being processed. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to microprocessor architectures, and more
particularly to a pipeline with variable latencies to support execution of
multiple instruction sets.
RELATED APPLICATION
This application is related to a co-pending application for a
"Dual-Instruction-Set Architecture CPU with Hidden Software Emulation
Mode", invented by Blomgren and Richter, Ser. No. 08/179,926, filed
1/11/94, and assigned to the same assignee as this application.
2. Description of the Related Art
The performance of microprocessors has been increased through the use of
the well-known technique of pipelining. A pipelined central processing
unit or CPU is divided into several units referred to as stages or
pipestages, each pipestage typically requiring one processor clock period
to perform its function. As an instruction is processed by the
microprocessor, it flows through the pipeline: first the instruction is
fetched from memory by the Fetch pipestage, then the instruction is
decoded by the D stage, the decoded instruction may then be executed by an
arithmetic-logic-unit (ALU) or adder, then the result from the execute
stage is written to a register file or to memory. While a first
instruction is in the execute stage, the following instruction is in the D
stage, and the next following instruction in the Fetch stage. Thus many
instructions are being processed at the same time, but each instruction is
processed over several clock periods. The result is that the clock period
may be reduced, improving performance.
Pipelining has worked very well with simple, well-organized instruction
sets such as with reduced instruction set computers or RISC instruction
sets. However, older, more complex instructions set computers or CISC
instruction sets contain instructions that require additional use of
functional units. Some complex, compound instructions actually perform the
equivalent work of two or more simple instructions. A high-performance
design may require adding more functional units and stages to the pipeline
than are necessary for the simpler instructions. The difficulty arises in
trying to process both simple and complex instructions in the same
pipeline. If the pipeline is to execute both a simple RISC and a complex
CISC instruction set, the difficulty is intensified.
When instructions are pipelined, the results from one instruction may be
needed by a subsequent instruction, even before the instruction completes.
Techniques such as bypassing and forwarding of results can route the
result from one instruction to a subsequent instruction, when both
instructions are in different pipestages of the pipeline. However, the
subsequent instruction will still have to wait for the next pipestage to
be released by the previous instruction.
All microprocessors perform 3 basic types of instructions: accessing
memory, performing algebraic operations, and control transfer. These 3
types can be referred to as LOAD/STORE, ALU, and BRANCH operations.
Regardless of the architecture or instruction set, all instructions are
composed of these 3 component operations. An example is in the x86
instruction set, made popular by personal computers (PC's) using Intel
Corporation (of Santa Clara, Calif.) 386 and 486 microprocessors. The x86
POP instruction performs a LOAD from a stack in memory followed by an ALU
operation to increment the stack pointer register. Compound instructions
are common in CISC instruction sets such as the x86 set, but are rare in
RISC instruction sets.
OPERATIONAL LATENCIES
ALU operations include addition, subtraction, Boolean operations, and bit
shifts. Multiplication, division, and other complex floating-point
operations may also be performed if sufficient hardware resources are
provided. This type of instruction usually takes one or two operands from
a high-speed internal CPU general-purpose register file (GPR), and stores
the result back to this register file. Since data is not transferred off
the CPU die, the operation is very fast, typically requiring one clock
period. The latency, or time required to perform the operation defined by
the instruction, is one clock period. Latency does not include fetch,
decode, or write-back time normally required for instruction processing;
latency here refers to the time to perform a component operation.
LOAD/STORE operations must first compute a memory address where the data
resides, and then write data to that memory location or read data from
that location or address. Data is transferred between a register in the
GPR and the memory, which can be slow DRAM-based system memory or cache
memory. The cache may be on the CPU die or off-chip. With a cache, the
transfer will usually take one clock, while the address computation, which
normally requires addition, takes an additional clock. Thus the LOAD/STORE
operations require a total of 2 clocks, for a latency of two.
Control transfer or BRANCH operations calculate a new address to load into
the instruction pointer. If the branch is conditional, a new target
address for the code to jump to is calculated, and a branch condition is
evaluated, usually from the condition code register associated with the
ALU. Branch operations may be quite complex to pipeline, but optimizations
and prediction techniques are possible. A supplementary adder may be
provided to calculate the target address early, during the decode
pipestage, and the Fetch stage may be designed to fetch both the target
instruction and the sequential (branch not taken) instruction. However,
since the branch may have to wait for the condition codes to be set by a
previous instruction's ALU pipestage, and the next instruction must be
decoded after the branch decision is made, the latency is at least two
clocks.
LATENCY DIAGRAM
FIG. 1 is a latency diagram that is useful in designing pipelines. Each box
in the figure represents an operation or function that requires one clock
to complete. Connections between boxes show how one operation may depend
upon the results of another operation. The computational work performed by
any instruction can be analyzed with this latency diagram. Instruction
cache 10 contains a buffer of instructions that have not yet been decoded,
and may contain instructions that will not be executed if a branch occurs.
Branch adder 12 is used to calculate the target address for a branch.
Instruction decode and register file 14 decodes an instruction fetched
into the instruction cache, and provides the register operands to adder
16, which performs an ALU operation, or can calculate an address. Operand
cache 18 is a cache of main memory data or operands and can be written
into for a STORE operation or read from in a LOAD operation.
If an operation has a greater latency than 1 clock, then the diagram may be
modified accordingly. For example, if the operand cache 18 were slow and
required 2 clocks, then box 18 could be replaced with two boxes in
sequence. Similarly, the adder could be replaced with two or three boxes
for floating-point operations. Connections may also be modified depending
upon the design; for example, a very high-speed design might not allow
connection "D", the bypass around the ALU. Another design might have adder
16 located after operand cache 18 rather than before it, or in both
locations.
A LOAD/STORE will flow through the latency diagram, FIG. 1, starting as an
instruction in the I-cache 10, decoded in block 14, which provides address
components from the register file or immediate values from the instruction
itself, and ALU control information, along path "B" to adder 16. Adder 16
generates a memory address from these address components and provides this
address along path "C" to the operand cache 18. The operand cache stores
or loads the data specified by the address. If the operation is a load,
then the data read from the cache is available to the adder 16 along path
"E", and is loaded into the register file (not shown). Thus the load
operation takes 4 clocks to execute and provide its data result. Four
clocks are required because of dependencies within the load instruction
itself: the operand cache could not be accessed before the address was
generated, and the address could not be generated before the register file
provided the components.
An operand dependency may exist with the instruction following the load. If
the subsequent instruction is an Add using the data loaded by the load
instruction, then the Add instruction will be in the adder block 16 while
the load instruction is in the operand cache 18. However, the adder cannot
perform the add until the end of the clock period when the data is
provided from the cache 18 to the adder 16 along path "E". Thus the add
instruction must wait or "stall" in the add stage 16 for an additional
clock before starting the add operation. The stall would still be
necessary even if several adder blocks 16 were provided, because the data
was not yet available to the add instruction.
Recently, compilers have been designed to re-order instructions to try to
reduce dependencies that cause stalls. In the above example, if the Add
instruction could occur after another instruction, rather than immediately
following the load instruction, then the stall would be avoided. RISC
compilers in particular have been successful at instruction re-ordering,
allowing for CPU's with multiple functional units to increase performance
using dual-pipeline techniques such as super-scalar designs. However, CISC
instructions may perform both the load and the add as one atomic
instruction, eliminating the possibility of re-ordering. Thus the CISC
instruction set itself imposes latencies.
RISC PIPELINE
RISC instructions are typically simple instructions that do not perform
both an ALU operation and a cache or memory operand access. Thus path "E"
of FIG. 1 is not used within a single instruction, but may be used by a
second instruction following a load instruction. However, re-ordering
compilers reduce or eliminate the need for path "E". Thus a simple
pipeline for RISC is:
##STR1##
A write-back stage is normally also included at the end of the pipeline
when the results are written back into the register file and the condition
codes are modified. However, this stage does not affect the dependencies
and is thus not shown. Likewise the fetch stage is not shown. The adder is
designed for both ALU operations and address generation, since address
generation is usually simple in RISC instructions. An instruction that
uses the ALU will store its result back to the register file rather than
the operand cache memory, while an instruction accessing the operand cache
will not use the ALU except for generating the address in the operand
cache. The Execute pipestage for RISC instructions can perform an address
generation or an ALU operation, but not both at the same time.
The diagram below indicates the | | |