2008-03-16

Reading "Build your own RISC processor simulator", part 2

Throughput and turnaround time:
It is common to hear and read that most RISC processor instructions are single-cycle in nature. This confuses first time RISCers as it contradicts their understanding of the processor pipeline.
Many RISC processors are described as single-cycle, which seems hard to reconcile with pipelining.

The confusion can be reduced by rereading such statements as: "Most RISC processor instructions take one clock cycle per pipeline-stage". Alternately, the 'single-cycle' can be interpreted as referring to a single pipeline-cycle instead of being seen as a single clock-cycle.
In other words, 'single-cycle' here can be read as a single pipeline cycle rather than a single clock cycle.

The fact is that all instructions are multi-cycle in nature. Every instruction takes at least as many clock cycles to complete, as the number of pipeline stages. This is a measure of the instruction turnaround time.

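In fact every instruction is multi-cycle: completing one instruction takes at least as many clock cycles as there are pipeline stages. This is how instruction turnaround time is measured. For example, in Fig.1, completing one instruction takes 3 clock cycles and the pipeline has 3 stages.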

NOTE: How should "pipeline stages" be rendered in Chinese? 流水线阶段? 流水线站? 流水级? I lean towards 流水站 ("pipeline station"). Wikipedia puts it this way: The more pipeline stages a processor has, the more instructions it can be working on at once.

However, setting special conditions aside, a pipeline has a throughput of one instruction per clock-cycle, once it reaches steady state. Fig.2 through Fig.5 delineate these two concepts.

Once it reaches steady state, a pipeline has a throughput of one instruction per clock cycle.

As can be seen, it takes three clock cycles for this pipeline to get 'filled'. At the end of the third cycle, the first instruction completes execution and retires. With this the pipeline reaches a steady state as shown in Fig.1. From there on, one instruction retires per clock-cycle. While the ideal number of stages for a pipeline is debatable, a few limiting factors help make this decision easier for a RISC processor designer:
o The number of functional units usable in parallel (including internal buses!)
o The number of frequently used instructions that take more than one clock-cycle to complete any of their stages (e.g., memory load/store, multiplication, branch instructions)
o Non-interlocking vs. stalling pipeline approach

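In the figure above the pipeline takes 3 clock cycles to fill. At the end of the 3rd cycle the first instruction completes and retires. At that point the pipeline is in steady state, and from then on one instruction completes per clock cycle.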
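The difference between turnaround time and throughput can be put into a few lines of arithmetic. The sketch below is only an illustration for an ideal pipeline with no stalls or multi-cycle stages; the function names turnaround_cycles() and total_cycles() are made up for this note and do not come from the paper.

```c
#include <assert.h>

/* Clock cycles one instruction needs from fetch to retirement
 * (its turnaround time), assuming every stage takes one cycle. */
unsigned turnaround_cycles(unsigned n_stages)
{
    assert(n_stages > 0);
    return n_stages;
}

/* Total cycles needed to retire n_instrs instructions, assuming no
 * stalls: n_stages cycles to fill the pipeline and retire the first
 * instruction, then one more cycle per remaining instruction. */
unsigned total_cycles(unsigned n_stages, unsigned n_instrs)
{
    assert(n_stages > 0);
    if (n_instrs == 0)
        return 0;
    return n_stages + (n_instrs - 1);
}
```

For a 3-stage pipeline, one instruction still takes 3 cycles, but 10 instructions take 12 cycles rather than 30, which is where the "one instruction per clock cycle" figure comes from.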

Processor Simulation
The process of design, development and testing of a processor takes a long time during which many models are made to fine-tune its functionality and performance, before the production is commenced. These models simulate the processor behaviour in various levels of detail. For instance, typical FPGA models match their processor's functionality but not the timing characteristics. Yet, these models help the designers identify and correct most of the flaws.

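Everything so far was preamble; now for the main topic.

Designing, developing and testing a processor takes a long time, during which many models are built to tune its functionality and performance. These models simulate the processor's behaviour at different levels of detail.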

The production of hardware models is usually discontinued after the processor is proven and accepted in the market. However, the software models, also more popularly known as simulators, continue to be used, enhanced and produced as long as the processor is in use. In spite of certain limitations (such as being unable to exactly reproduce time critical behaviour such as interrupt latency and bus cycles), these simulators serve as close functional approximations and inexpensive alternatives to their processors, the reference hardware boards and associated environment.

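Hardware models are usually abandoned once the processor is on the market, while software models (that is, simulators) stay in use; their lifetime is often as long as the processor's. Despite certain limitations (for example, they cannot exactly reproduce time-critical behaviour such as interrupt latency and bus cycles), these simulators are functionally close enough and cheap compared with the processor itself and the hardware boards.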

Design Considerations
It is fairly trivial to design a processor simulator as a simple transformation function / mapping between the processor's instruction set (Ip) and the instruction set (Ih) of the simulator's host machine. This mapping may simply be based on a lookup-table if Ip is a functional subset of Ih, i.e., if there is a one-to-one mapping between Ip and Ih (with allowance for differences in instruction formats). If the two instruction sets are significantly different from each other, a slightly more involved mapping has to be employed. In this case, each instruction of Ip has to be implemented in terms of two or more instructions from Ih.
Designing a processor simulator that only converts between the processor's instruction set (Ip) and the host's instruction set (Ih) is fairly trivial. The conversion is little more than a lookup table holding the mapping, at most with some translation between instructions.

These mappings can be implemented by designing the simulator as an interpreter for the instruction stream of a program written for the target processor. The simulator can take as input either the executable instructions of Ip or their assembler mnemonics. In either case, the interpretation is easier by using an intermediate high-level language (HL) that is supported by the host. The translation from HL to Ih is best left to the host's HL translator (read as compiler/interpreter).

The mapping can be implemented like this: the simulator is an interpreter for the instruction stream of a program written for the target processor. It can accept two kinds of input: the executable instructions of Ip, or their assembler form. Either way, the interpreter can use an intermediate language supported by the host, HL. The translation from HL to the host instruction set can be left to the host's HL translator (compiler/interpreter). That gives the flow Ip -> HL -> Ih.

Granularity of simulation
The simple instruction mapping approach suffices for most normal programming tasks. However, on a closer look, it becomes evident that there is more to simulating a processor than implementing its instruction set. A fine grain behavioural simulation should involve modeling key functional blocks and macro blocks that make the processor.

The simple instruction-mapping approach described above is enough for many ordinary programs. However, simulating a processor involves more than implementing its instruction set. A fine-grained behavioural simulation should model the key functional blocks and macro blocks that make up the processor.

Typical blocks that constitute a RISC processor include ALU, instruction decoder, processor control logic, register files, instruction pipeline, barrel shifters, multipliers, write buffers and internal buses. Depending on the target application / users, a simulator designer has to include models of these blocks into the simulator. For instance, if the simulator is to be used for detailed clock-cycle level profiling, the simulator must include a good model of the instruction pipeline and its clock.

Typical blocks of a RISC processor are: ALU, instruction decoder, processor control logic, register files, instruction pipeline, barrel shifters, multipliers, write buffers and internal buses. The simulator designer decides which of these need a model and which do not, depending on the target application and users. For example, if the simulator is used for clock-cycle-level profiling, it must include a good model of the instruction pipeline and its clock.

NOTE: Barrel shifter (BS): A barrel shifter is a digital circuit that can shift a data word by a specified number of bits. It can be implemented as a sequence of multiplexers.

Simulator Components
The rest of this paper presents a detailed behavioural model of Crisp - a hypothetical RISC processor. Instead of explaining the architecture of Crisp as a separate section, its simulator design is used as a vehicle to introduce the processor and its components.

What follows describes a hypothetical RISC processor, Crisp.


The key processor components to be modeled are:
0. Clock
1. Memory Interface
2. Execution Unit
3. Arithmetic and Logic Unit
4. Pipeline and parallelism among components

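In short, the key modules of this processor are the clock, the memory interface, the execution unit, the arithmetic and logic unit, and the pipeline with its related modules.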

Clock
For a real processor, a clock signal provides the heartbeat. Each instruction takes a pre-designed number of clock cycles to complete. Such a clock is not an essential requirement for building a software model of the processor. Yet, instruction level profiling and fine grain performance analysis of programs will be difficult if such a model makes no provision for a clock. Also, as will be seen later, a model with a clock eases simulating the behaviour of an instruction pipeline.

In a real processor the clock provides the heartbeat, and each instruction takes a pre-designed number of clock cycles to complete. Such a clock is not strictly required to build a software model of a processor. But, as mentioned earlier, if the simulator is meant for instruction-level profiling or fine-grained performance analysis of programs, a clock is needed. As we will see later, a clock module also simplifies simulating the instruction pipeline.

While the hardware design of a system clock is fairly complicated and involves high precision engineering for the oscillator and phase locked loops for fine-tuning, its software equivalent can be modelled very easily. A system wide counter can act as the clock with its value being updated at appropriate stages of executing each instruction.
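A hardware clock is fairly complicated; its software counterpart is much easier. A system-wide counter can play the role of the clock, with its value updated at the appropriate points while executing each instruction.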

It is clear that this behaviour is opposite to that observed on a real processor where the clock drives the instruction execution. However letting the instruction execution phases drive the clock is a good enough approach for a software simulator.
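Of course, this is the opposite of a real processor: on real hardware the clock drives instruction execution, while in the simulator instruction execution drives the clock. For a software simulator, though, letting instruction execution drive the clock is good enough.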

It might be worthwhile to consider using a floating-point value for the clock counter so as to represent half/quarter cycles or any other intermediate points within a clock cycle for very fine grain timing analysis. e.g.,
o RD, WR signals go high/low at set points in a cycle
o data/address buses contain valid data only during a specific portion of the cycle.
A floating-point clock counter can represent 1/2 or 1/4 cycles, or any other point within a clock cycle. This is useful in some situations, for example:

- the RD and WR signals rise or fall at a set point in the cycle

- the data/address buses hold valid data only during a specific part of the cycle
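The clock described above can be sketched as a single counter that the instruction phases advance. This is a minimal illustration, not the paper's code: the variable sim_clock and the helpers tick(), now() and phase() are names assumed for this note (init_clock() is named in the paper, but its body here is a guess). A double is used so that fractional cycles can be represented; multiples of 0.25 are exact in binary floating point.

```c
#include <assert.h>

/* Hypothetical system-wide clock counter, measured in cycles. */
static double sim_clock = 0.0;

/* Reset the clock counter to zero. */
void init_clock(void)
{
    sim_clock = 0.0;
}

/* Advance the clock by a (possibly fractional) number of cycles.
 * The instruction execution phases call this, so in the simulator
 * execution drives the clock rather than the other way round. */
void tick(double cycles)
{
    assert(cycles >= 0.0);
    sim_clock += cycles;
}

/* Current time in cycles since reset. */
double now(void)
{
    return sim_clock;
}

/* Position within the current cycle, in the range [0, 1). */
double phase(void)
{
    return sim_clock - (double)(long long)sim_clock;
}
```

A signal model could then, for instance, treat the data bus as valid only while phase() lies in some chosen window of the cycle; the window itself would have to come from the processor's timing specification.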

Crisp receives its clock from an external source such as a PLL.

Crisp uses a PLL as its external clock source.

NOTE: PLL: Phase-Locked Loop. A PLL is basically a closed-loop frequency control system whose operation is based on phase-sensitive detection of the phase difference between the input and output signals of the controlled oscillator (CO).

Memory
Memory is best modelled as an array of data words. A more sophisticated approach would be to model memory as an abstract data type with features such as separate program and data memories, write protection and storage hierarchy (TLB, multi-level cache, primary memory, secondary memory etc.).
Memory can be modelled as an array. A more sophisticated approach is to model it as an abstract data type with features such as separate program and data memories, write protection, and a storage hierarchy (TLB, multi-level cache, primary memory, secondary memory, etc.).

Registers can be treated as an extension to the memory model. Register files can be supported by a two dimensional array of data words, with one column per register.

Registers can be seen as an extension of the memory model. Register files can be built as a two-dimensional array.

NOTE:For a two dimensional array, by convention, the first subscript is understood to be for rows and the second for columns.

Crisp has 15 general-purpose registers named r0 through r14. By convention, r13 is used as the stack pointer and r14 as the link register for procedure calls. r15, a special register, serves as the program counter (instruction pointer). These registers are 32 bits wide.

Crisp has 15 general-purpose registers (r0-r14). r13 holds the stack pointer (SP), r14 is the link register for procedure calls, and r15 serves as the program counter (PC). They are all 32 bits wide.
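As a rough sketch of the memory and register models just described, memory can be a flat array of 32-bit words and the register file a two-dimensional array with one column per register. The sizes MEM_WORDS and NUM_BANKS, the accessor names mem_read()/mem_write(), and the enum constants are assumptions made for this note; only init_regs() and init_memory() (each taking a fill value) are named in the paper, and their bodies here are guesses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_WORDS 1024   /* assumed memory size, not from the paper */
#define NUM_BANKS 1      /* assumed: a single register bank */
#define NUM_REGS  16     /* r0-r14 plus r15, the program counter */

/* Register roles as described in the text. */
enum { REG_SP = 13, REG_LR = 14, REG_PC = 15 };

static uint32_t memory[MEM_WORDS];
static uint32_t reg_file[NUM_BANKS][NUM_REGS]; /* one column per register */

/* Fill memory with a pattern (0, or something recognisable like 0xf00dcafe). */
void init_memory(uint32_t pattern)
{
    for (size_t i = 0; i < MEM_WORDS; i++)
        memory[i] = pattern;
}

/* Fill every register in every bank with a pattern. */
void init_regs(uint32_t pattern)
{
    for (size_t b = 0; b < NUM_BANKS; b++)
        for (size_t r = 0; r < NUM_REGS; r++)
            reg_file[b][r] = pattern;
}

/* Read the word at a word-aligned byte address. */
uint32_t mem_read(uint32_t byte_addr)
{
    assert(byte_addr % 4 == 0 && byte_addr / 4 < MEM_WORDS);
    return memory[byte_addr / 4];
}

/* Write the word at a word-aligned byte address. */
void mem_write(uint32_t byte_addr, uint32_t value)
{
    assert(byte_addr % 4 == 0 && byte_addr / 4 < MEM_WORDS);
    memory[byte_addr / 4] = value;
}
```

Hiding the array behind accessors is what would later allow features such as write protection or a cache model to be added without touching the callers.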

Execution Unit
The execution unit can be modelled as a mapping of the instruction set of the processor being modelled to that of the host processor. Or, as a simple translation of the semantics of a model instruction to that of a language construct interpretable on the host processor. e.g.,
The execution unit can be seen as a mapping from the instruction set of the simulated processor to that of the host processor, or simply as a translator at the level of semantics.

Model instruction:
operator operand_1 operand_2

A 'C' translation:
operator(operand_1, operand_2)
That is, the model instruction "operator operand_1 operand_2" translates to the C call "operator(operand_1, operand_2)".

Though it seems unnecessary to introduce one more level of indirection between the model instruction and translation in the form of a function call, its utility becomes evident when it is realised that different types of operators might involve different kinds of processor subsystems. e.g.,
add r0, r1
; involves only registers and ALU

add r0, [r1]
; involves registers, memory and ALU

mov r0, 0x10
; involves only registers (instruction register and r0)

mov [r0], 0x10
; involves registers and memory

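It may seem unnecessary to add one more level of indirection between the simulated instruction and its translation. In some cases it is needed, though: different kinds of operators may involve different processor subsystems.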
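To make the point about subsystems concrete, here is a small sketch in which each of the four instruction forms listed above gets its own C function. Everything in it is illustrative: the function names, the tiny 64-word memory and the 16-entry register array are assumptions for this note, not the paper's handlers.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS 64             /* assumed size, for illustration only */

static uint32_t regs[16];
static uint32_t memory[MEM_WORDS];

/* add rd, rs : touches only the registers and the ALU */
void add_reg_reg(int rd, int rs)
{
    regs[rd] += regs[rs];
}

/* add rd, [rs] : touches registers, memory and the ALU */
void add_reg_mem(int rd, int rs)
{
    uint32_t addr = regs[rs];
    assert(addr % 4 == 0 && addr / 4 < MEM_WORDS);
    regs[rd] += memory[addr / 4];
}

/* mov rd, imm : touches only registers */
void mov_reg_imm(int rd, uint32_t imm)
{
    regs[rd] = imm;
}

/* mov [rd], imm : touches registers and memory */
void mov_mem_imm(int rd, uint32_t imm)
{
    uint32_t addr = regs[rd];
    assert(addr % 4 == 0 && addr / 4 < MEM_WORDS);
    memory[addr / 4] = imm;
}
```

With the function-call layer in place, a profiler could count memory accesses in add_reg_mem() and mov_mem_imm() alone, which would be awkward if every instruction were expanded inline.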

Arithmetic and Logic Unit
ALU operations come next only to memory operations in number, in any typical program. The ALU can also be modelled on lines similar to those of the execution unit. The operators of the processor being modelled are mapped on to those of the host processor or to those of any language understood on the host processor. e.g.,
Model instruction:
add r0, r1


Execution Unit model:
_add(_reg_r0, _reg_r1)


ALU model:
return (_reg_r0 += _reg_r1);

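In a typical program, ALU operations are second only to memory operations in number. The ALU can be modelled along the same lines as the execution unit described earlier: the operators of the simulated processor are mapped to those of the host processor, or to those of any language the host understands.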


Crisp does not have a multiplier but has a barrel-shifter to perform shifts of length 1-32 in a single cycle. Most of the Crisp instructions are in 3-address code format (with unspecified operands filled by an assembler with default values).

Crisp has no multiplier, but it has a barrel shifter that can shift by 1 to 32 bits in a single cycle. Most Crisp instructions are in 3-address format.
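A software model of the barrel shifter needs a little care on a C host, because shifting a 32-bit value by 32 bits is undefined behaviour in C. The sketch below handles that case explicitly; the names bs_shl() and bs_shr() are assumptions for this note, and only logical (not arithmetic) shifts are shown.

```c
#include <assert.h>
#include <stdint.h>

/* Logical left shift by 1..32 bits. A shift by the full word width
 * is handled separately, since `value << 32` is undefined in C. */
uint32_t bs_shl(uint32_t value, unsigned amount)
{
    assert(amount >= 1 && amount <= 32);
    return amount == 32 ? 0u : (uint32_t)(value << amount);
}

/* Logical right shift by 1..32 bits, with the same special case. */
uint32_t bs_shr(uint32_t value, unsigned amount)
{
    assert(amount >= 1 && amount <= 32);
    return amount == 32 ? 0u : (uint32_t)(value >> amount);
}
```

Since the hardware shifter completes in a single cycle regardless of the shift length, a cycle-accurate model would charge one clock cycle per call.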

Pipeline and parallelism among components
Most modern processors have a 3-6 stage instruction execution pipeline.

A pipeline helps to maximise the utilisation of different components of a processor, which function in parallel and independent of each other (sharing the same clock).


A software model need not simulate parallelism in the real world time. It is necessary and sufficient if various components of the processor run in parallel with respect to the software clock that is available in the model.

Crisp employs a 3-stage fetch-decode-execute pipeline. The pipeline is clocked at the same speed as the external clock input.

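Many modern processors have a pipeline of 3 to 6 stages. A pipeline raises the utilisation of the processor's different modules. A software simulator does not need to simulate parallelism in real-world time; it is enough that the processor modules run in parallel with respect to the software clock (rather than as a blocking, strictly sequential instruction flow). Crisp has a 3-stage pipeline (fetch -> decode -> execute), driven by the external clock (the PLL mentioned earlier).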

A Crisp simulator
In this section, 'C' code fragments of the simulator will be presented along with suitable explanations wherever required. We take a top-down approach for the design and look at non-trivial functionalities in detail. Firstly, the super-structure of the simulator:

-------------------------------------------------------------------------------------
int main(int argc, char *argv[])
{
    extern char *progname;

    /* process arguments */
    progname = argv[0];
    /* ... */
    init_sim(); /* initialise Crisp functional blocks */

    /*
     * load the Crisp instruction stream to be executed into memory.
     * argv[1] holds the stream file name.
     */
    program_start = load_program(argv[1]);

    start_Crisp(); /* Crisp starts executing from address 0 */
}

-------------------------------------------------------------------------------------

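Next, a quick walk through Crisp's code. First, what the main function does:

1. Initialise Crisp

2. Load the program's instruction stream into memory

3. Start executing instructions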

Initialisation

-------------------------------------------------------------------------------------

void init_sim(void)
{
    init_regs(0);    /* clear (zero) Crisp registers */
    init_memory(0);  /* clear memory accessible to Crisp */
    init_clock();    /* reset the clock counter to zero */
    init_pipeline(); /* setup an [empty] queue of instructions */
}

-------------------------------------------------------------------------------------
Though it is usual, at startup, to set the registers and memory to zero, it is better to design for a value other than zero too. For instance, in order to understand memory usage patterns, it is useful to initialise the memory to relatively unique patterns such as 0xbaba and 0xf00dcafe. Hence, init_regs() and init_memory() take an integer argument.


The pipeline is best modeled as a queue of instructions. New instructions enter the queue at the tail while the completed instructions exit the pipeline from the head. init_pipeline() initialises these head and tail indices.

-------------------------------------------------------------------------------------

void init_pipeline(void)
{
    _p_head = _p_tail = 0;
}

-------------------------------------------------------------------------------------

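Initialisation:

1. Reset the registers

2. Reset memory

3. Reset the clock

4. Set up the pipeline (an empty queue to hold the instructions awaiting execution)

Reset mostly means setting to 0, but other values work too. For example, to understand memory usage you can make the default value of memory 0xbaba or similar. That is why init_regs() and init_memory() take an argument.

The pipeline is a FIFO queue.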
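The paper's notes show only init_pipeline(), so the following is a guess at how the rest of the queue might look: a fixed-size ring buffer with enqueue at the tail, dequeue at the head, and a peek operation matching the description of peek_pipeline(). The type instr_t, the depth PIPE_DEPTH, the function depipe() and the head/count representation are all assumptions for this sketch (the paper itself uses _p_head and _p_tail indices).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PIPE_DEPTH 3                 /* fetch, decode, execute */

typedef uint32_t instr_t;            /* assumed: one 32-bit word per instruction */
typedef enum { HEAD, TAIL } pipe_ref_t;

static instr_t  pipe_buf[PIPE_DEPTH];
static unsigned pipe_head;           /* index of the oldest instruction */
static unsigned pipe_count;          /* number of instructions in flight */

void init_pipeline(void)
{
    pipe_head = 0;
    pipe_count = 0;
}

/* Add a newly fetched instruction at the tail; returns 0 if full. */
int enpipe(instr_t instr)
{
    if (pipe_count == PIPE_DEPTH)
        return 0;
    pipe_buf[(pipe_head + pipe_count) % PIPE_DEPTH] = instr;
    pipe_count++;
    return 1;
}

/* Retire the oldest instruction from the head; returns 0 if empty. */
int depipe(instr_t *out)
{
    assert(out != NULL);
    if (pipe_count == 0)
        return 0;
    *out = pipe_buf[pipe_head];
    pipe_head = (pipe_head + 1) % PIPE_DEPTH;
    pipe_count--;
    return 1;
}

/* Look at the entry `offset` positions away from HEAD (towards the
 * tail) or from TAIL (towards the head); NULL if there is none. */
const instr_t *peek_pipeline(pipe_ref_t ref, unsigned offset)
{
    if (offset >= pipe_count)
        return NULL;
    unsigned from_head = (ref == HEAD) ? offset : pipe_count - 1 - offset;
    return &pipe_buf[(pipe_head + from_head) % PIPE_DEPTH];
}
```

Returning NULL from peek_pipeline() while the pipeline is still filling is what lets a decode stage tell "nothing to decode yet" apart from a real instruction.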

Crisp in action
The main phase of simulation opens with start_Crisp(), as the instruction stream execution starts from memory word zero, Crisp's reset vector address.

-------------------------------------------------------------------------------------
void start_Crisp(void)
{
    extern int pending_bds; /* see 'Handling branches' */

    set_reg_val(REG_NEXTPC, RESET_VEC_ADDR);          /* set NEXTPC */
    set_reg_val(REG_PC, get_reg_val(REG_NEXTPC) - 8); /* set PC two instructions behind */
    pending_bds = 0;

    /* pipelined execution */
    while (1) {
        start_new_cycle();
        exec_pipeline_stages(); /* fetch at NEXTPC, decode, then execute */
        retire_instrs();
    }
}

-------------------------------------------------------------------------------------

All Crisp instructions take exactly three cycles to complete. Each instruction in the pipeline completes one stage of execution per clock cycle. exec_pipeline_stages() illustrates this. The reason for REG_PC trailing REG_NEXTPC by 8 bytes (two instructions) becomes evident as we go through the inner workings of all the three stages.

Every Crisp instruction takes 3 clock cycles: one clock cycle per pipeline stage. (NOTE: the Handling branches section later explains the -8.)

-------------------------------------------------------------------------------------
void exec_pipeline_stages(void)
{
    _decoder_output cur_decoder_output;
    /*
     * simulate pipelining by retaining a decoded instruction
     * across invocations
     */
    static _decoder_output prev_decoded_instr = {INVALID};

    /*
     * The three pipeline stages:
     * 1. fetch the instruction pointed to by NEXTPC
     * 2. decode the instruction trailing the head by one position
     * 3. execute the instruction which was previously decoded
     */
    fetch(get_reg_val(REG_NEXTPC));
    decode(peek_pipeline(HEAD, 1), &cur_decoder_output);
    execute(&prev_decoded_instr); /* execute() takes a pointer */

    /* prepare for next cycle */
    prev_decoded_instr = cur_decoder_output;
}

-------------------------------------------------------------------------------------

As the inline comments suggest, peek_pipeline() takes as arguments, an enumerated reference position (HEAD/TAIL) and an offset (0-2) from that position (towards the other position). It returns a (possibly NULL) pointer to the required instruction.

Crisp pipeline mechanics
The fetch() stage simply requests the memory subsystem for the instruction at the address contained in an internal register REG_NEXTPC and puts it into the pipeline.

-------------------------------------------------------------------------------------
void fetch(word *instr_addr)
{
    enpipe((_instruction) *instr_addr);
}

-------------------------------------------------------------------------------------


The decode() stage is a bit more involved. In this stage, Crisp's instruction decode logic parses the instruction and generates necessary control signals that are needed for the 'execute' stage. The simulator can afford, however, to abstract most of these low-level details and only implement NEXTPC modification logic.

-------------------------------------------------------------------------------------

void decode(_instruction *instr, _decoder_output *out)
{
    if (instr) {
        out->opc = get_opcode(instr);
        /* most Crisp instructions are in 3-address code format */
        get_operands(instr, &out->opd1, &out->opd2, &out->opd3);
    } else {
        /* invalidate output so that execute stage ignores it */
        out->result = INVALID;
    }

    /* prepare REG_NEXTPC for next cycle's fetch stage */
    set_reg_val(REG_NEXTPC, get_reg_val(REG_NEXTPC) + 4);
}

decode() modifies NEXTPC just before it returns (+4 means pointing at the next instruction). Global state like this really does make the code harder to read...

-------------------------------------------------------------------------------------


In the execute() stage, Crisp's functional units such as the ALU, shifter and data memory interfaces are activated according to the control signals generated by the decode stage for this instruction in the previous cycle. The simulator only needs the decoded instruction for this phase:

-------------------------------------------------------------------------------------

void execute(_decoder_output *decoded_instr)
{
    extern void (* instr_handlers[])(_operand1 *, _operand2 *, _operand3 *);

    instr_handlers[decoded_instr->opc](&decoded_instr->opd1,
                                       &decoded_instr->opd2,
                                       &decoded_instr->opd3);

    /* prepare REG_PC for the next cycle's execute stage */
    set_reg_val(REG_PC, get_reg_val(REG_PC) + 4);
}

-------------------------------------------------------------------------------------

Each type of instruction is executed by its handler which can be obtained by indexing into instr_handlers[] with the instruction's opcode.
The instr_handlers array holds the handler function for each type of instruction.
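Since the sample handlers are elided here, the following is only a sketch of the dispatch idea: a table of function pointers indexed by opcode. The opcodes OP_ADD, OP_SUB and OP_XOR, their numbering, the simplified handler signature (a destination pointer and two source values) and the dispatch() wrapper are all assumptions for this note; the paper's handlers take three operand pointers instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcodes; the real Crisp encoding is not given here. */
enum { OP_ADD, OP_SUB, OP_XOR, OP_COUNT };

/* 3-address handler: dst = src1 <op> src2 */
typedef void (*instr_handler)(uint32_t *dst, uint32_t src1, uint32_t src2);

static void h_add(uint32_t *dst, uint32_t src1, uint32_t src2) { *dst = src1 + src2; }
static void h_sub(uint32_t *dst, uint32_t src1, uint32_t src2) { *dst = src1 - src2; }
static void h_xor(uint32_t *dst, uint32_t src1, uint32_t src2) { *dst = src1 ^ src2; }

/* One handler per opcode, looked up by indexing with the opcode. */
static const instr_handler instr_handlers[OP_COUNT] = {
    [OP_ADD] = h_add,
    [OP_SUB] = h_sub,
    [OP_XOR] = h_xor,
};

/* Execute one decoded instruction by calling its handler. */
void dispatch(unsigned opc, uint32_t *dst, uint32_t src1, uint32_t src2)
{
    assert(opc < OP_COUNT);
    instr_handlers[opc](dst, src1, src2);
}
```

Compared with a long switch statement, the table keeps execute() itself unchanged when an instruction is added: only a new handler and a new table entry are needed.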

Sample handlers
...

Handling branches

Arithmetic and logical instructions such as add/sub, shift, or/xor/and and compare update an internal register - REG_FLAGS, which holds processor state information related to carry, overflow, zero etc. Program flow can be altered by branching based on the state of these flags. This helps implementation of control structures such as if-else, for and do-while using branch instructions.
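Arithmetic and logic instructions (e.g. add/sub, shift, or/xor/and, and compare) update an internal register, REG_FLAGS. This register holds processor state information such as carry, overflow and zero. Program flow changes according to these flag bits, which makes it possible to implement control structures like if-else, for and do-while.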

Branch instructions break the smooth flow in the pipeline and hence need special handling. Crisp takes the non-interlocked approach in implementing branches, by unconditionally executing two instructions that immediately follow the branch instruction in the program. This is also known as delayed-branching and the two instructions following the branch instruction are said to be in branch-delay-slots. This approach helps the pipeline to run without stalling for the branch target to be fetched.
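Branch instructions disrupt the sequential instruction flow in the pipeline and so need special handling. Crisp implements branches with the non-interlocked approach: it unconditionally executes the two instructions immediately following the branch. This is known as delayed branching, and the two instructions following the branch are said to be in branch delay slots. This approach helps avoid pipeline stalls.
NOTE: Between the moment a branch instruction starts being processed and the moment it is known whether the branch is taken, there is a branch delay, called the branch delay slot (Branch-Delay-Slot).
Delayed branching is a static scheduling technique, implemented by having the compiler reorder the instruction sequence. The basic idea is not to flush the instruction pipeline when a branch occurs, but to let the few instructions that follow the branch and have already entered the pipeline run to completion. If these are useful instructions that do not depend on the branch outcome, the time that would otherwise be lost to the delay is put to good use. This "execute first, then branch" software approach suits RISC instruction pipelines with few stages. Such pipelines typically have only 3 to 4 stages and the value of C is typically 2 (the value Crisp uses), so it is feasible for the compiler to analyse and reorder a few instructions. Delayed branching effectively leaves the efficient handling of branches to software, namely an optimising compiler. At this point you may notice a problem: delayed branching applies when a branch instruction is encountered, but looking at Crisp's code, it seems ...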


The number of branch delay slots yet to be executed is tracked by pending_bds. This helps to maintain the integrity of REG_PC. The REG_PC update logic in execute() has to be modified to handle branches. During the execution of the two delay slots REG_PC should contain their addresses but should contain the address of the branch target immediately after the completion of the delay slots' execution. This is accomplished by resetting REG_PC to trail REG_NEXTPC by two instructions, as should be the normal case.
The number of delay slots still to be executed is tracked by pending_bds, which helps keep REG_PC correct. The "+4" update of REG_PC at the end of the execute() function shown earlier has to be modified to handle branches. While the two delay slots execute, REG_PC must point at their addresses, and right after both have executed it must point at the branch target. This is done by setting REG_PC to REG_NEXTPC - 8 (2 instructions).

void execute(_decoder_output *decoded_instr)
{
    extern void (* instr_handlers[])(_operand1 *, _operand2 *, _operand3 *);

    extern int pending_bds, bds_flag; /* to handle branch delay slots */

    instr_handlers[decoded_instr->opc](&decoded_instr->opd1,
                                       &decoded_instr->opd2,
                                       &decoded_instr->opd3);

    /* prepare REG_PC for the next cycle's execute stage */
    switch (pending_bds) {
    case 2:
        /*
         * first of the two delay slots executed in this cycle;
         * let REG_PC move forward to the next pending delay slot
         */
        set_reg_val(REG_PC, get_reg_val(REG_PC) + 4);
        pending_bds--;
        break;

    case 1:
        /*
         * second delay slot executed in this cycle; reset REG_PC
         * to trail REG_NEXTPC by two instructions again
         */
        set_reg_val(REG_PC, get_reg_val(REG_NEXTPC) - 8);
        pending_bds--;
        break;

    case 0: /* normal sequential flow */
        set_reg_val(REG_PC, get_reg_val(REG_PC) + 4);
        break;
    }
}
To avoid indeterminate behaviour, the Crisp architecture suggests that a BDS may not contain a branch instruction.
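The earlier execute() was a simplified version; the one above is closer to the real code. Crisp has one limitation: the instructions in the delay slots must not include a branch instruction.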
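The paper then has a section, A Simple Run, that demonstrates in detail how the relationship between REG_PC and REG_NEXTPC changes. To summarise roughly:
PC
points to the instruction to be executed
NEXTPC
points to the instruction to be fetched

Initially, NEXTPC points to the first instruction of the program's instruction stream (e.g. 0x00), and PC trails NEXTPC by 2 instructions (so initially it points to an invalid instruction before 0x00).
Fetch: the instruction NEXTPC points to
Decode: the instruction one position behind the pipeline head (fetched in the previous cycle), then NEXTPC += 4
Execute: the instruction PC points to, then PC += 4

When a branch instruction is encountered, pending_bds is assigned (2 here). The execute stage then uses the value of pending_bds to decide the values of NEXTPC and PC.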
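The bookkeeping can be checked with a small trace that models only the two address registers, not real instructions. This is a sketch under several assumptions that go beyond the text: the branch handler (not shown in these notes) is taken to set NEXTPC to the target and pending_bds to 2, and the delay-slot counter is sampled before the handler runs so that counting starts with the first delay slot rather than with the branch itself. The names bds_reset() and bds_step() are made up for this note, and signed addresses are used so that the two start-up bubbles show up as -8 and -4.

```c
#include <assert.h>
#include <stdint.h>

static int32_t pc, nextpc;     /* stand-ins for REG_PC and REG_NEXTPC */
static int pending_bds;

/* Reset: NEXTPC at the reset vector (0), PC two instructions behind. */
void bds_reset(void)
{
    nextpc = 0;
    pc = nextpc - 8;
    pending_bds = 0;
}

/* Simulate one clock cycle for a program whose only branch sits at
 * address branch_at and jumps to target. Returns the address executed
 * in this cycle (negative while the pipeline is still filling). */
int32_t bds_step(int32_t branch_at, int32_t target)
{
    int32_t executed = pc;
    int bds = pending_bds;     /* delay-slot state of this instruction */

    assert(target % 4 == 0);

    nextpc += 4;                        /* decode stage */

    if (executed == branch_at) {        /* execute stage: assumed branch handler */
        nextpc = target;
        pending_bds = 2;
    }

    switch (bds) {                      /* prepare PC for the next cycle */
    case 2:                             /* first delay slot just executed */
        pc += 4;
        pending_bds--;
        break;
    case 1:                             /* second delay slot just executed */
        pc = nextpc - 8;
        pending_bds--;
        break;
    default:                            /* normal sequential flow */
        pc += 4;
        break;
    }
    return executed;
}
```

With a branch at 0x04 targeting 0x20, successive calls return -8, -4, 0x00, 0x04, 0x08, 0x0c, 0x20, 0x24: two bubbles while the pipeline fills, then the branch, its two delay slots, and the target.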
