ICED Canned CPU - IcpuED
April 13, 1999 (revised June 5, 2000)

1. Introduction

The ICED Canned CPU was created for students in the ICED computer engineering curriculum who do not wish to use their own CPU designs in the later core courses of the curriculum (ELE 405 and on). The architecture and microarchitecture are based on the classic Hennessy and Patterson 5-stage pipelined CPU. The current version does not make use of Instruction Level Parallelism, and employs only rudimentary static branch prediction. Forwarding is used, so only minimal data dependencies (flow only) are enforced.

NOTE: The information in this document has not been completely updated. There are bug fixes that have not been incorporated herein.

2. Architectural Model

2.1 Basics

IcpuED uses a 24-bit address space. Addressing is done at the byte level, so main memory can hold up to 16 MBytes.

Two data sizes are supported: byte (8 bits) and word (4 bytes, or 32 bits). Half-words are not supported. Words are aligned in memory (the two lsb's of every word address are 0's).

There are 16 addressable registers in the architectural register file, each 32 bits long. Apart from the Program Counter (PC) there are no other architectural registers; conditional branches test the contents of the main register file for taken/not-taken decisions.

Data are assumed to be either 2's complement signed integers or unsigned integers, depending on the context (the instruction being executed).

Instructions are always 1 word (4 bytes, or 32 bits) long and are word-aligned in memory, i.e., the two lsb's of an instruction's address are always 0. (That is why the PC is actually only 22 bits long.)
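
The addressing rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (24-bit byte addresses, word alignment, a 22-bit PC field); the function and constant names are hypothetical and do not come from the IcpuED design data.

```python
# Illustrative sketch of the IcpuED addressing rules; all names are hypothetical.
ADDR_BITS = 24
ADDR_MASK = (1 << ADDR_BITS) - 1      # 0xFFFFFF
MEMORY_BYTES = 1 << ADDR_BITS         # 2**24 bytes = 16 MBytes


def is_word_aligned(addr):
    """A word (or instruction) address has its two lsb's equal to 0."""
    return (addr & 0b11) == 0


def pc_field_to_byte_address(pc22):
    """Expand a 22-bit PC value into the 24-bit byte address it denotes."""
    return (pc22 << 2) & ADDR_MASK


def byte_address_to_pc_field(addr):
    """Drop the two (always-zero) lsb's of an instruction address."""
    if not is_word_aligned(addr):
        raise ValueError("instruction addresses must be word-aligned")
    return (addr & ADDR_MASK) >> 2
```

Since the two lsb's of every instruction address are 0, storing only the upper 22 bits loses no information, which is the point of the parenthetical remark about the PC length.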

2.2 Instruction Set

There are 25 instructions in the IcpuED instruction set. They are organized into 6 groups as follows:

3. Microarchitecture

A high-level dataflow diagram of the ICED CPU is shown in IcpuED Dataflow. Each pipeline register is labeled "pipe x" for its stage x: pipe 1 receives the output of stage 1, pipe 2 the output of stage 2, and so on. The five stages are:

1. Instruction Fetch (IF)
2. Instruction Decode/Operand Fetch (ID/OF)
3. Execute (EX)
4. Memory (MEM)
5. Write Back

Each stage is now described, including both its combinational logic components and its register.

3.1 Instruction Fetch

The main IF pipeline register hardware consists of the storage part of the Program Counter (PC) and the Instruction Register (IR). During normal assignment statement (sequential) execution, the PC is incremented at the end of every cycle. During the cycle an instruction is fetched from memory, and it is saved in the IR at the end of the cycle. Therefore, during a cycle the value held by the PC is actually 4 greater than the PC value corresponding to the instruction in the IR.

Other PC Sources

Besides PC+4, the PC can be loaded from three other sources via the mux feeding the PC's input:

Other IR Sources

The IR is force-fed branch-and-link instructions (brl) upon an exception. This effectively causes the CPU to perform a subroutine call to the appropriate service routine. The two cases are:

Other Stage 1 Hardware

The PC is placed on the Address Bus (ABUS) via tri-state buffers in order to fetch an instruction. Instruction fetching is temporarily suspended when a memory access (load or store) is taking place in stage 4, thereby incurring a one-cycle penalty.

The IR normally gets its input from the Data Bus (DBUS) (except at the start of an exception; see above).
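
The relationship between the PC and the IR during sequential execution can be illustrated with a small behavioral sketch. It assumes a dictionary standing in for memory and models only the PC+4 path; branches, exceptions, and the stage 4 fetch suspension are deliberately left out. The function name and trace format are hypothetical.

```python
# Behavioral sketch of sequential fetch only; names are hypothetical.
ADDR_MASK = (1 << 24) - 1  # 24-bit byte address space


def fetch_cycles(memory, start_pc, n_cycles):
    """Return (pc, ir_addr, ir) as seen after each of n_cycles cycles.

    memory maps word-aligned byte addresses to instruction words.
    """
    pc = start_pc
    trace = []
    for _ in range(n_cycles):
        fetched = memory[pc]          # fetched during the cycle, PC on the ABUS
        ir_addr, ir = pc, fetched     # IR is loaded at the end of the cycle
        pc = (pc + 4) & ADDR_MASK     # PC is incremented at the end of the cycle
        trace.append((pc, ir_addr, ir))
    return trace
```

In every trace entry the PC is 4 greater than the address of the instruction held in the IR, which is the offset that later stages must account for when they use the PC value.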

3.2 Instruction Decode/Operand Fetch

Normally instruction decode would take place in stage 2; in the ICED CPU, however, this is only partially done, and instructions are re-decoded as necessary in later stages. This increases the combinational logic cost and possibly the critical cycle time, but reduces the pipeline register cost. The logic cost is not great, as the instructions are already partially decoded within the machine instructions themselves. It remains to be seen if these assumptions are accurate and if the tradeoff is a good one.

What is done is that various fields of the machine instruction in the IR are split off and fed to the proper places, including the register file address inputs and the operand multiplexers feeding the Execute stage (via pipe register 2).

Of major importance is the fetching of the register operands of the current instruction. The register address held in the r3 part of the IR is fed directly to caddr (the second port of the register file, which is read-only) and that operand is accessed. The register address in the r2 part of the IR is logically connected to the baddr input of the register file during the second half of the clock cycle, performing the fetch of the second register operand at that time. The architectural registers are written in the first half of the clock cycle, using the register address and data forwarded from the Write Back stage (5).

NOTE: in order for behavioral simulation of the CPU to work, a slight delay (arbitrary, say 10 ns) must be added to the baddr register address input.

The two operand muxes select the appropriate inputs for the ALU in the next stage, Execute.

Operand 1 Mux

The six inputs are:

Operand 0 Mux

The five inputs are:

Pipeline Register 2

Holds:

3.3 Execute

The opcode information and the r1 address are fed directly through to the MEM stage.

There is a multiplexer in the msnk path to select either the memory sink data from the prior stage (2 or ID/OF), which is the norm, or the bypassed result of the MEM stage. The bypass is only used when a store follows a load and the register result of the load is the register data source of the store. It also remains to be seen if this is a good tradeoff.

The ALU performs all of the arithmetic and logic operations necessary for the ICED instruction set. Its inputs are operands 0 and 1 from the prior stage (2 or ID/OF), and its main output is sent on to stage 4 (MEM) via rsnk. It also generates a carry (c) signal (used for unsigned overflow) and an overflow (v) signal (used for signed two's complement overflow). c and v are used to generate an overflow trap if the machine instruction specifies it.
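
As an illustration of how c and v differ, here is a minimal sketch of a 32-bit addition that produces both flags. It covers only the add case and uses the textbook definitions of carry-out and two's complement overflow; it is not taken from the IcpuED ALU source, and the function name is hypothetical.

```python
# Textbook flag generation for a 32-bit add; not the actual IcpuED ALU logic.
MASK32 = 0xFFFFFFFF


def add32(a, b):
    """Return (result, c, v) for a 32-bit addition of a and b."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    c = (full >> 32) & 1                       # carry out: unsigned overflow
    sign_a, sign_b, sign_r = a >> 31, b >> 31, result >> 31
    # Signed overflow: operands share a sign and the result's sign differs.
    v = 1 if (sign_a == sign_b and sign_r != sign_a) else 0
    return result, c, v
```

For example, `add32(0xFFFFFFFF, 1)` sets c but not v (an unsigned wrap that is a valid signed result, -1 + 1 = 0), while `add32(0x7FFFFFFF, 1)` sets v but not c.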

3.4 Memory

There are two parts to the Memory stage hardware: memory (bus) access, and data alignment.

Bus Access

Loads and stores have priority over instruction fetching. When either a load or a store is to be executed, the output of the ALU from the prior stage (3 or EX) is used to drive the ABUS tri-state buffers (the ALU has computed the absolute memory address from the base address [held in the register file] and the offset [from the machine instruction]). If the instruction is a load, the DBUS is read and the data are sent on to the Write Back stage via "sink" in pipeline register 4. If the instruction is a store, the data from msnk are placed on the DBUS via tri-state buffers and memory is written.

Data Alignment

Both loads and stores use the same complex of data-aligning muxes. As is seen in the IcpuED Dataflow, there is one mux for each byte of the 4-byte word. Data alignment (and possibly sign extension) is only needed for byte loads and stores.

For a byte load, the mux labeled b0 is used to select the proper byte (indicated by the two lsb's of the memory address) and place it in the least significant byte position. Sign extension is performed. Other bytes returned from the memory system are ignored; they may not be assumed to have any particular value, including 0.

For a byte store, the least significant byte of msnk is routed to the byte indicated by the two memory address lsb's, padded with 0's, then sent to the memory system. The memory system must ensure that only the selected byte is actually written.
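
The byte alignment described above can be sketched as follows. Two points are assumptions not stated in this document: that address lsb's 00 select bits 7..0 of the word (the byte-lane numbering), and that the memory system is told which byte to write through a one-hot byte-select value. Function names are hypothetical.

```python
# Sketch of byte alignment. ASSUMPTIONS: lane 0 (address lsb's 00) is bits 7..0,
# and a one-hot byte select tells the memory system which byte to write.
MASK32 = 0xFFFFFFFF


def byte_load(word, addr):
    """Select the addressed byte, move it to the lsb position, sign-extend."""
    lane = addr & 0b11
    byte = (word >> (8 * lane)) & 0xFF        # other bytes are ignored
    if byte & 0x80:
        return byte | 0xFFFFFF00              # sign extension to 32 bits
    return byte


def byte_store(msnk, addr):
    """Route the ls byte of msnk to the addressed lane, padded with 0's."""
    lane = addr & 0b11
    data = ((msnk & 0xFF) << (8 * lane)) & MASK32
    byte_select = 1 << lane                   # hypothetical one-hot select
    return data, byte_select
```

If the actual lane numbering is the opposite of what is assumed here, only the shift amount changes (8 * (3 - lane) instead of 8 * lane); the select-then-extend structure stays the same.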

3.5 Write Back

There is little in this stage. Bit 5 of opmaj (the most significant bit of the opcode) is used to enable writing of the piped result (sink) to the register file. The sink register address (r1) and data (sink) are sent back to the register file. This stage has no pipeline register.

4. Programming Conventions

5. Design Data

THIS SECTION IS UNDER CONSTRUCTION

This is available via the departmental ftp site. NOTE: the design data has not yet been upgraded with the latest version of the EDA tools. You will have to regenerate the generated VHDL and resulting downstream data. However, the Logiblox components have been pre-compiled for you; the work directory is the downstream directory /dp. The Logiblox source files are in the /lb directory.

You should copy all of ...../cpu to your own directory system. You must also set up library mappings in Renoir to create a library called "canned" having ..../cpu/{dd,hdl,dp} as the design data, hdl and downstream preparation directories, respectively. The components of IcpuED will then appear in the "canned" library.

..../cpu/lb holds the Logiblox components used in IcpuED. There are also *.mem files that can be used to initialize your memory with the machine code of the program in Appendix I ???. Specifically, the files are (most significant byte first): intmemb3.mem, intmemb2.mem, intmemb1.mem, and intmemb0.mem.

..../cpu has a *.gal project file for Galileo that you can modify for your own use. It is primarily good for getting the settings from; all of the source and target files will have to be changed, though.

..../cpu/xprojsys is an M1 project that you can similarly copy settings from, but you will have to modify the source file settings.

..../cpu/dp also holds some possibly useful *.do command files for qhsim.

NOTE: you must use the icedsysbus.ucf pad/pin constraints file in M1. It should be put in ..../cpu/dp

6. Test Status of IcpuED

To first order, IcpuED works in the hardware. It has executed benchmarks, and is able to display text on the hardware's LCD display.

Stay tuned for the rest.

Last modified June 5, 2000, Gus Uht - uht@ele.uri.edu