ICED Canned CPU

ICED Canned CPU - IcpuED

April 13, 1999 (revised June 5, 2000)

Outline

1. Introduction
2. Architectural Model
        2.1 Basics
        2.2 Instruction Set
3. Microarchitecture
        3.1 Instruction Fetch
        3.2 Instruction Decode/Operand Fetch
        3.3 Execute
        3.4 Memory
        3.5 Write Back
4. Programming Conventions
5. Design Data
6. Test Status

1. Introduction

The ICED Canned CPU was created for students in the ICED computer engineering curriculum who do not wish to use their own CPU designs in the later core courses of the curriculum (ELE 405 and on). The architecture and microarchitecture are based on the classic Hennessy and Patterson 5-stage pipelined CPU. The current version does not make use of Instruction Level Parallelism, and only rudimentary static branch prediction is employed. Forwarding is used, therefore minimal data dependencies (flow only) are enforced.

NOTE: The information in this document has not been completely updated. There are bug fixes that have not been incorporated herein.

2. Architectural Model

2.1 Basics

IcpuED uses a 24-bit address space. Addressing is done at the byte level, therefore the main memory can hold up to 16 MBytes.

Two data sizes are supported: byte (8-bits) and word (4-bytes or 32-bits). Half-word is not supported.

Words are aligned in memory (two lsb's of every word address are 0's).

There are 16 addressable registers in the architectural register file, each 32-bits in length.

Other than the Program Counter (PC), there are no architectural registers besides the register file; conditional branches test the contents of a register in the main register file to make the taken/not-taken decision.

Data are assumed to be either 2's complement signed integers or unsigned integers, depending on the context (instruction being executed).

Instructions are always 1-word (4-bytes or 32-bits) in length and are word-aligned in memory, i.e., the two lsb's of an instruction's address are always 0. (That is why the PC is actually only 22-bits long.)
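
As a quick check of the numbers above, the following Python sketch derives the memory size from the 24-bit address, tests word alignment, and shows why 22 bits suffice for the PC. The function names are illustrative only and do not come from the IcpuED design data.

```python
ADDR_BITS = 24  # from the text: 24-bit byte address space


def memory_size_bytes() -> int:
    """Number of addressable bytes: 2**24 = 16 MBytes."""
    return 1 << ADDR_BITS


def is_word_aligned(addr: int) -> bool:
    """A word or instruction address has its two lsb's equal to 0."""
    return (addr & 0b11) == 0


def stored_pc(addr: int) -> int:
    """The 22 bits that actually need storing for an instruction address.

    Hypothetical helper: it drops the two lsb's, which are always 0.
    """
    if not is_word_aligned(addr):
        raise ValueError("instruction addresses must be word-aligned")
    return (addr & ((1 << ADDR_BITS) - 1)) >> 2
```

For example, the highest instruction address x"FFFFFC" maps to the largest 22-bit value.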

2.2 Instruction Set

There are 25 instructions in the IcpuED instruction set. They are organized into 6 groups as follows:

logical - (and, or, xor, not) and shift - (left, right logical, right arithmetic)
arithmetic - (add and subtract: plain, and w/unsigned overflow [carry] and w/signed overflow [2's complement overflow])
conditional branch - (branch on non-zero register, branch on register holding zero)
unconditional branch - (branch-and-link [for subroutine calls], indirect branch [for subroutine returns], return from interrupt)
loads - (load immediate lower and upper [for constant loading], memory load [word and byte, with address offset], memory store [word and byte, with address offset])
cpu control - (halt)

The complete instruction set is available via:

PDF: IcpuED Instruction Set [8 KBytes]
HTML: IcpuED Instruction Set [107 KBytes]
Original MS Excel 97 spreadsheet format: IcpuED Instruction Set [30 KBytes]

3. Microarchitecture

A high-level dataflow diagram of the ICED CPU is shown in IcpuED Dataflow. Each pipeline register is labeled: "pipe x" for each stage x. pipe 1 receives the output of stage 1, pipe 2 the output of stage 2, and so on. The five stages are:

IF - Instruction Fetch
ID/OF - Instruction Decode/Operand Fetch
EX - Execute
MEM - Memory Access
WB - Write Back

Each pipeline register includes a 1-bit register called "bubble"; when the bubble input to a stage is asserted, it indicates that the instruction accompanying it (in the rest of the pipeline register) has been squashed, and should not be processed within the stage. As an instruction goes through the pipeline, once it is squashed it stays squashed.
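
The "once squashed, stays squashed" rule can be sketched in Python as follows. This is only a behavioral illustration; the class and function names are invented here and are not taken from the IcpuED VHDL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipeReg:
    instr: str            # hypothetical stand-in for the rest of the pipeline register
    bubble: bool = False  # True means the accompanying instruction is squashed


def next_pipe(prev: PipeReg, squash: bool = False) -> PipeReg:
    """Value latched into the following pipeline register.

    The bubble bit is sticky: it is set if the instruction was already
    squashed or if the current stage squashes it.
    """
    return PipeReg(prev.instr, prev.bubble or squash)


# An instruction squashed on its way into pipe 2 stays squashed in pipes 3 and 4.
p1 = PipeReg("add")
p2 = next_pipe(p1, squash=True)
p3 = next_pipe(p2)
p4 = next_pipe(p3)
```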

Each stage is now described, including both its combinational logic components and its register.

3.1 Instruction Fetch

The main IF pipeline register hardware consists of the storage part of the Program Counter (PC) and the Instruction Register (IR). During normal sequential (assignment statement) execution, the PC is incremented at the end of every cycle. During the cycle an instruction is fetched from memory, and it is saved in the IR at the end of the cycle. Therefore, during any cycle the value held by the PC is actually 4 greater than the address of the instruction in the IR.

Other PC Sources

Besides PC+4, the PC can be loaded from three other sources via the mux feeding the PC's input:

the output of the ALU in stage 3 (EX), when a conditional branch is to be taken; the ALU computes the absolute address from the relative address held in the instruction itself
an output of the register file in stage 2 (ID/OF), when an indirect branch (bri or rti) is encountered
the part of the IR holding the absolute target address of a branch-and-link (brl) instruction
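
The four PC sources can be modeled as a simple selector. The Python sketch below assumes the control logic has already chosen the source; how that choice is made (and any priority between simultaneous requests) is not described here, so it is left to the caller. All names are illustrative.

```python
from enum import Enum

ADDR_MASK = 0xFFFFFC  # 24-bit byte address with the two lsb's forced to 0


class PcSource(Enum):
    SEQUENTIAL = "PC+4"
    ALU = "ALU output (taken conditional branch)"
    REGFILE = "register file output (bri or rti)"
    IR_TARGET = "absolute target held in the IR (brl)"


def next_pc(pc, source, alu_out=0, reg_out=0, ir_target=0):
    """Return the value loaded into the PC at the end of the cycle."""
    candidates = {
        PcSource.SEQUENTIAL: pc + 4,
        PcSource.ALU: alu_out,
        PcSource.REGFILE: reg_out,
        PcSource.IR_TARGET: ir_target,
    }
    # Masking models the 22-bit PC register (an assumption of this sketch).
    return candidates[source] & ADDR_MASK
```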

Other IR Sources

The IR is force-fed a branch-and-link instruction (brl) upon an exception. This effectively causes the CPU to perform a subroutine call to the appropriate service routine. The two cases are:

signed or unsigned overflow trap - execution vectors through memory location x"000004"
interrupt - execution vectors through memory location x"000008"

Each vector location would normally contain the equivalent of an unconditional branch to the appropriate service routine. Both forced brl's use R14 to save the return PC value.
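
The effect of a forced brl can be sketched as below. The vector addresses and the use of R14 are from the text; exactly which PC value is saved as the return address is not stated here, so the sketch takes it as a parameter. The function name is hypothetical.

```python
VECTORS = {"overflow": 0x000004, "interrupt": 0x000008}  # from the text
LINK_REGISTER = 14  # both forced brl's save the return PC in R14


def take_exception(cause, return_pc, regs):
    """Model a forced brl: return (new PC, new register file contents)."""
    if cause not in VECTORS:
        raise ValueError(f"unknown exception cause: {cause!r}")
    new_regs = list(regs)
    new_regs[LINK_REGISTER] = return_pc
    return VECTORS[cause], new_regs
```

The service routine reached through the vector would later return with an indirect branch through R14.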

Other Stage 1 Hardware

The PC is placed on the Address Bus (ABUS) via tri-state buffers in order to fetch an instruction. Instruction fetching is temporarily suspended when a memory access (load or store) is taking place in stage 4, thereby incurring a one cycle penalty.

The IR normally gets its input from the Data Bus (DBUS) (except at the start of an exception - see above).

3.2 Instruction Decode/Operand Fetch

Normally instruction decode would take place in stage 2; however, in the ICED CPU this is only partially done; instructions are re-decoded as necessary in later stages. This increases combinational logic cost and possibly critical cycle time, but reduces the pipeline register cost. The logic cost is not great, as the instructions are already partially decoded within the machine instructions themselves. It remains to be seen if these assumptions are accurate and if the tradeoff is a good one.

What is done is that various fields of the machine instruction in the IR are split off and fed to the proper places, including the register file address inputs and the operand multiplexers feeding the Execute stage (via pipe register 2).

Of major importance is the fetching of the register operands of the current instruction. The register address held in the r3 part of the IR is fed directly to caddr (the second port of the register file, which is read-only) and that operand is accessed. The register address in the r2 part of the IR is logically connected to the baddr input of the register file during the second half of the clock cycle, performing the fetch of the second register operand at that time.

The architectural registers are written in the first half of the clock cycle, using the register address and data forwarded from the Write Back stage (5).
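
The point of the split clock cycle is that a port B read in the second half sees a value written by Write Back in the first half of the same cycle. A minimal behavioral sketch in Python, with invented names and only port B modeled:

```python
class RegisterFile:
    """16 registers of 32 bits; write in the first half-cycle, read B in the second."""

    def __init__(self):
        self.regs = [0] * 16

    def clock_cycle(self, write_enable, waddr, wdata, baddr):
        # First half of the cycle: write the result forwarded from Write Back.
        if write_enable:
            self.regs[waddr] = wdata & 0xFFFFFFFF
        # Second half of the cycle: fetch the operand addressed by baddr.
        return self.regs[baddr]


rf = RegisterFile()
same_cycle_read = rf.clock_cycle(True, 3, 0xDEADBEEF, 3)  # sees the new value
```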

>>>>> NOTE: in order for behavioral simulation of the CPU to work, a slight delay (arbitrary - say 10 ns) must be added to the baddr register address input.

The two operand mux's select the appropriate inputs for the ALU in the next, Execute, stage.

Operand 1 Mux

The six inputs are:

0: register file output C (port 2) - for common assignment statements
1: aligned and sign-extended relative address - for conditional branches
2: aligned and 0-padded byte address offset - for loads and stores
3: aligned and sign-extended constant - for load immediate instructions
4: bypass path from ALU, so that instruction n can use the result of instruction n-1 without going through the register file (i.e., before the sink has reached the Write Back stage)
5: similarly, bypass path from the MEM stage, so that instruction n can use the result of instruction n-2 before Write Back
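
Inputs 1 to 3 differ mainly in how the instruction field is widened to 32 bits. The Python sketch below shows sign extension, zero padding, and the conditional branch target computation (target = PC + 4 + relative). The field widths are defined in the instruction set document, not here, so they are passed as parameters; the 16-bit width in the usage note is purely an example.

```python
MASK32 = 0xFFFFFFFF
MASK24 = 0xFFFFFF


def sign_extend(value, bits):
    """Widen a 2's complement field of the given width to 32 bits."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return ((value ^ sign) - sign) & MASK32


def zero_pad(value, bits):
    """Widen an unsigned field of the given width to 32 bits."""
    return value & ((1 << bits) - 1)


def branch_target(pc_plus_4, relative, bits):
    """Absolute target of a conditional branch, truncated to 24 address bits."""
    return (pc_plus_4 + sign_extend(relative, bits)) & MASK24
```

With an example 16-bit field, a relative value of x"FFFC" is -4, so a branch whose PC+4 is x"000104" would target x"000100".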

Operand 0 Mux

The five inputs are:

0: register file output B (port 1) - for common assignment statements and certain branches, loads and stores
1: all 0's - used with load immediate (ldi) to effectively send the output of operand 1 (the constant) unchanged through the ALU (the two are added: 0 + constant = constant)
2: the PC (+ 4), aligned and padded with 0's - for conditional branches, to compute the absolute target address from the relative address held in the machine instruction (target = PC + 4 + relative)
3: bypass path from ALU (stage 3), same as with input 4 of operand 1 above
4: bypass path from MEM, same as with input 5 of operand 1 above
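
The text does not spell out how the bypass inputs are selected. The sketch below shows one conventional way to do it for a register operand on the operand 0 mux: compare the source register address with the sink addresses of the two preceding instructions, preferring the newer result. This is an assumption, not the documented IcpuED control logic, and it ignores special cases such as instruction n-1 being a load.

```python
# Operand 0 mux input numbers, taken from the list above.
REGFILE_B = 0
ALU_BYPASS = 3
MEM_BYPASS = 4


def select_operand0(src_reg, ex_sink=None, mem_sink=None):
    """Pick the operand 0 mux input for a register operand.

    ex_sink:  sink register of instruction n-1, or None if it writes no register
    mem_sink: sink register of instruction n-2, or None if it writes no register
    """
    if ex_sink is not None and src_reg == ex_sink:
        return ALU_BYPASS  # newest value wins
    if mem_sink is not None and src_reg == mem_sink:
        return MEM_BYPASS
    return REGFILE_B
```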

Pipeline Register 2

Holds:

the opcode components opmaj and opmin
r1 (sink) address, for Write Back
memory sink (msnk) data, used in MEM as the data to be stored into memory; operand 0 and 1 are busy (why?)
opnd 1
opnd 0
"allzeroes" - the data from port 1 (B) is compared to 0 in preparation for conditional branch execution in stage 3 (EX); the data of each of the two bypass paths is also compared to 0, for the same reasons; only one of the three results (the valid one) is sent on to stage 3

3.3 Execute

The opcode information and the r1 address are fed directly through to the MEM stage.

There is a multiplexer in the msnk path to select the memory sink data from the prior stage (2 or ID/OF), which is the norm, or the bypassed result of the MEM stage. This is only used when a store follows a load and when the register result of the load is the register data source of the store. It also remains to be seen if this is a good tradeoff.

The ALU performs all of the Arithmetic and Logic operations necessary for the ICED instruction set. Its inputs are operands 0 and 1 from the prior stage (2 or ID/OF) and its main output is sent on to stage 4 (MEM) via rsnk. It also generates a carry (c) signal (used for unsigned overflow) and an overflow (v) signal (used for signed two's complement overflow). c and v are used to generate an overflow trap if the machine instruction specifies it.
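
The difference between c and v is easiest to see with a 32-bit addition. The following Python sketch computes both flags for an add and the resulting trap condition; the function and parameter names are illustrative, and subtraction is omitted because the borrow convention is not stated here.

```python
MASK32 = 0xFFFFFFFF


def alu_add(a, b):
    """32-bit add returning (result, c, v).

    c: carry out of bit 31 (unsigned overflow)
    v: signed 2's complement overflow
    """
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    c = bool(full >> 32)
    # Signed overflow: both operands have a sign different from the result's.
    v = bool((((a ^ result) & (b ^ result)) >> 31) & 1)
    return result, c, v


def overflow_trap(c, v, trap_on_carry, trap_on_overflow):
    """Trap only if the instruction asks for the corresponding check."""
    return (trap_on_carry and c) or (trap_on_overflow and v)
```

For example, x"7FFFFFFF" + 1 sets v but not c, while x"FFFFFFFF" + 1 sets c but not v.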

3.4 Memory

There are two parts to the Memory stage hardware: memory (bus) access, and data alignment.

Bus Access

Loads and stores have priority over instruction fetching. When either a load or store is to be executed, the output of the ALU from the prior stage (3 or EX) is used to drive the ABUS tri-state buffers (the ALU has computed the absolute memory address from the base address [held in the register file] and the offset [from the machine instruction]). If the instruction is a load, the DBUS is read and the data sent on to the Write Back stage via "sink" in pipeline register 4. If the instruction is a store, the data from msnk is placed on the DBUS via tri-state buffers and memory is written.

Data Alignment

Both loads and stores use the same complex of data-aligning mux's. As is seen in the IcpuED Dataflow, there is one mux for each byte of the 4-byte word. Data alignment (and possibly sign-extension) is only needed for byte loads and stores.

For a byte load, the mux labeled b0 is used to select the proper byte (indicated by the two lsb's of the memory address) and place it in the least significant byte position. Sign extension is performed. Other bytes returned from the memory system are ignored; they may not be assumed to be any particular value, including 0.
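
A byte load can be sketched as follows. Which byte lane of the DBUS corresponds to which value of the two address lsb's is not stated here; the sketch assumes that address lsb's 00 select the least significant byte, which may not match the actual hardware.

```python
def load_byte(dbus_word, addr):
    """Select one byte of the 32-bit DBUS word and sign-extend it to 32 bits.

    ASSUMPTION: lane number = two lsb's of the address, lane 0 = bits 7..0.
    """
    lane = addr & 0b11
    byte = (dbus_word >> (8 * lane)) & 0xFF
    if byte & 0x80:
        return byte | 0xFFFFFF00  # replicate the sign bit into bits 31..8
    return byte
```

Note that the other three bytes of dbus_word never reach the result, matching the statement that they are ignored.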

For a byte store the least significant byte of msnk is routed to the byte indicated by the two memory address lsb's, padded with 0's, then sent to the memory system. The memory system must ensure that only the selected byte is actually written.
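
The store direction can be sketched the same way. The byte-lane numbering (address lsb's 00 select bits 7..0) and the per-byte enable signals are assumptions of this sketch; the text only requires that the memory system write the selected byte and no other.

```python
def store_byte(msnk, addr):
    """Return (dbus_word, byte_enables) for a byte store.

    ASSUMPTION: lane number = two lsb's of the address, lane 0 = bits 7..0.
    byte_enables is a hypothetical 4-bit mask, one bit per byte lane.
    """
    lane = addr & 0b11
    dbus_word = (msnk & 0xFF) << (8 * lane)  # other lanes padded with 0's
    return dbus_word, 1 << lane


def merge_into_memory(old_word, dbus_word, byte_enables):
    """What the memory system must do: update only the enabled byte lanes."""
    result = old_word
    for lane in range(4):
        if byte_enables & (1 << lane):
            mask = 0xFF << (8 * lane)
            result = (result & ~mask & 0xFFFFFFFF) | (dbus_word & mask)
    return result
```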

3.5 Write Back

There is little in this stage. Bit 5 of opmaj (the most significant bit of the opcode) is used to enable writing of the piped result (sink) to the register file. The sink register address (r1) and data (sink) are sent back to the register file. This stage has no pipeline register.
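
The write enable is therefore a single bit test. In the sketch below, gating the write with the bubble bit is an assumption based on the rule that squashed instructions are not processed; the function name is illustrative.

```python
def register_write_enable(opmaj, bubble=False):
    """True if Write Back should write sink to register r1.

    Bit 5 of opmaj is the most significant bit of the opcode.
    """
    return bool((opmaj >> 5) & 1) and not bubble
```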

4. Programming Conventions

It is suggested that R0 be initialized to 0 and used as a constant 0
R1 may similarly be initialized to 1
R14 holds the last-generated return address (normally from a brl instruction)
R15 holds the stack pointer; this should be initialized to the top of physical main memory

The programmer must add the appropriate stack manipulation code when nested subroutines are used.

5. Design Data

THIS SECTION IS UNDER CONSTRUCTION

This is available via the departmental ftp site. NOTE: the design data has not yet been upgraded with the latest version of the EDA tools.

You will have to regenerate the generated VHDL and resulting downstream data. However, the Logiblox components have been pre-compiled for you; the work directory is the downstream directory /dp. The Logiblox source files are in the /lb directory.

You should copy all of ...../cpu to your own directory system. You must also set up library mappings in Renoir to create a library called "canned" having ..../cpu/{dd,hdl,dp} as the design data, hdl and downstream preparation directories, respectively. The components of IcpuED will then appear in the "canned" library.

..../cpu/lb holds the Logiblox components used in IcpuED. There are also *.mem files that can be used to initialize your memory with the machine code of the program in Appendix I ???. Specifically, the files are (most significant byte first): intmemb3.mem, intmemb2.mem, intmemb1.mem, and intmemb0.mem.

..../cpu has a *.gal project file for Galileo that you can modify for your own use. It's primarily good to get the settings from; all of the source and target files will have to be changed, though.

..../cpu/xprojsys is an M1 project that similarly you can copy settings from, but will have to modify the source file settings.

..../cpu/dp also holds some possibly useful *.do command files for qhsim.

NOTE: you must use the icedsysbus.ucf pad/pin constraints file in M1. It should be put in ..../cpu/dp

6. Test Status

To first order, IcpuED works in the hardware. It has executed benchmarks, and is able to display text on the hardware's LCD display.

Stay tuned for the rest.

Last modified June 5, 2000, Gus Uht - uht@ele.uri.edu