A Simulation Study: Simple Scalar Processor
Submitted to: Prof. Anshul Kumar
By:
Manish Gaur (2000MCS012)
Imran Ali (2000MCS003)
Om Prakash (2000MCS005)
on 7th May (Monday)
SIMPLE SCALAR PROJECT
1. INTRODUCTION
The SimpleScalar tool set is an architecture simulator that reproduces the behavior of a computing device. It takes system inputs and produces system outputs and system metrics. Programs written in C are compiled and executed by the simulator. The simulator tracks the microarchitectural state for each cycle and produces a detailed history of all instructions executed.
In this project we have simulated our binary search program on the
following simulators:
1. sim-outorder: This simulator implements a very detailed out-of-order issue superscalar processor with a two-level memory system and speculative execution support. It is a performance simulator, tracking the latency of all pipeline operations.
2. sim-cheetah: This program implements a functional simulator driver for Cheetah. Cheetah is a cache simulation package written by Rabin Sugumar and Santosh Abraham which can efficiently simulate multiple cache configurations in a single run of a program. Specifically, Cheetah can simulate ranges of single-level set-associative and fully-associative caches.
3. sim-cache: This simulator implements a functional cache simulator. Cache statistics are generated for a user-selected cache and TLB configuration, which may include up to two levels of instruction and data cache (with any level unified) and one level of instruction and data TLBs. No timing information is generated.
4. sim-bpred: This simulator implements a branch predictor analyzer.
In this project, one program is compiled and run on the four simulators described above, and the results are observed and analysed. The sim-outorder simulator generates detailed statistics for parameters such as performance, branch prediction miss rate, and mis-speculations. With the help of the simulated output, the effectiveness of different types of branch predictors is observed, along with the effect of the size of the branch target buffer (BTB) on the branch misprediction rate. If a speculation scheme is used to predict the branch target address, table updates can be done after the branch is executed; updating in the Instruction Decode (ID) stage is compared with updating in the WriteBack (WB) stage. The effect of the depth of the pipeline on CPU execution time is also observed, as is the effect of the instruction fetch queue size on the CPI (cycles per instruction) and IPC (instructions per cycle).
2. BENCHMARKS
The simulations are run on a binary search program written in C. It searches for an integer in an array and reports its location in the array if it is found. The total number of instructions executed is 189668, of which 36800 are executed branches.
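The benchmark source is not reproduced in this report. As an illustration only, the algorithm it exercises can be sketched as follows. The sketch is in Python rather than the C of the actual benchmark, and the function name and sample data are hypothetical.

```python
def binary_search(values, target):
    """Return the index of target in the sorted list values, or -1 if absent."""
    low, high = 0, len(values) - 1
    while low <= high:
        mid = (low + high) // 2
        if values[mid] == target:
            return mid
        if values[mid] < target:
            low = mid + 1    # target can only be in the upper half
        else:
            high = mid - 1   # target can only be in the lower half
    return -1


# Hypothetical sample input, not the data used in the simulations.
data = [3, 8, 15, 23, 42, 57, 91]
found_at = binary_search(data, 23)
missing = binary_search(data, 24)
```

Each loop iteration contains data-dependent comparisons, which is why a program of this kind executes a large share of conditional branches.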
3. SIMULATIONS
SimpleScalar has a wide range of simulation tools, such as sim-fast, sim-bpred, sim-cache, sim-profile, sim-cheetah, and sim-outorder. Of these, sim-outorder gives detailed out-of-order issue performance figures and has a multi-level memory system. Sim-outorder lets us vary a large number of options, such as the branch predictor type and the extra branch misprediction latency. The benchmark is run on all four of the simulator tools described above.
When a branch instruction is decoded, the CPU tries to predict its direction. The simulator allows the type of branch predictor to be specified. Simulations are run on the benchmark with the branch predictor set to one of: always taken, always not taken, a bimodal predictor using a branch target buffer, or a 2-level adaptive predictor.
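To make the difference between the static and dynamic schemes concrete, the following is a minimal sketch, assuming the common textbook form of a bimodal predictor (a table of 2-bit saturating counters indexed by branch address). It is not SimpleScalar's implementation; the function names, the synthetic trace, the table size, and the 8-byte instruction width used for indexing are all assumptions made for illustration.

```python
def always_taken_misses(trace):
    """Mispredictions of the static always-taken predictor."""
    return sum(1 for _pc, taken in trace if not taken)


def always_not_taken_misses(trace):
    """Mispredictions of the static always-not-taken predictor."""
    return sum(1 for _pc, taken in trace if taken)


def bimodal_misses(trace, table_size=2048):
    """Mispredictions of a table of 2-bit saturating counters."""
    counters = [2] * table_size          # 2 = weakly taken (assumed start state)
    misses = 0
    for pc, taken in trace:
        index = (pc // 8) % table_size   # assumes 8-byte instructions
        if (counters[index] >= 2) != taken:
            misses += 1
        if taken:
            counters[index] = min(3, counters[index] + 1)
        else:
            counters[index] = max(0, counters[index] - 1)
    return misses


# Synthetic trace: a loop branch taken 9 times out of 10, and a branch
# that is never taken. These addresses are made up.
trace = []
for _ in range(10):
    trace += [(0x400100, True)] * 9 + [(0x400100, False)]
    trace += [(0x400200, False)] * 10

static_taken = always_taken_misses(trace)
static_not_taken = always_not_taken_misses(trace)
dynamic = bimodal_misses(trace)
```

On this synthetic trace the counters adapt to each branch separately, so the dynamic scheme mispredicts far less often than either static scheme. The numbers describe this toy trace only, not the benchmark.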
When using a bimodal predictor, the size of the branch target buffer can be varied. The effect of the size of the BTB on the number of mispredicted branch directions is observed by running the benchmark with BTB sizes of 256, 512, 1024, 2048, 4096 and 8192 bytes.
In a speculative predictor, prediction table updates can be done at any stage after the branches are actually executed. The simulations are run with the updates done early and late in the pipeline, in the ID and WB stages respectively. The effectiveness of the speculative predictor in both cases is analyzed.
The pipeline depth affects the execution time of a program. The number of ALU functional units and the number of integer multiply units are each varied, and the effect on the execution time is observed for the two cases individually.
Varying the size of the instruction fetch queue changes the CPI or IPC. Increasing the size of the queue reduces the CPI and increases the IPC, thereby decreasing the total number of cycles for the simulation.
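The relation between cycle count, CPI and IPC can be checked with a short calculation. The sketch below uses the instruction and cycle counts reported in this document for the sim-outorder run; the variable names are chosen for illustration.

```python
# Figures reported in this document for the sim-outorder run.
committed_instructions = 189668
simulation_cycles = 204950

ipc = committed_instructions / simulation_cycles   # instructions per cycle
cpi = simulation_cycles / committed_instructions   # cycles per instruction

ipc_rounded = round(ipc, 4)
cpi_rounded = round(cpi, 4)
```

The rounded values agree with the sim_IPC and sim_CPI entries in the annexed sim-outorder output. Since CPI is the reciprocal of IPC, any change that raises IPC for a fixed instruction count lowers the total cycle count in proportion.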
4. RESULTS
a. sim-outorder
If the direction predicted by the predictor matches the direction the branch actually takes, it is termed a branch prediction hit. Otherwise it is considered to be a branch misprediction. The total number of mispredictions divided by the total number of branch lookups gives the branch prediction miss rate. The branch prediction miss rate is a measure of the effectiveness of a predictor type: the lower the miss rate, the more effective the predictor.
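As a worked instance of this definition, the sketch below computes the miss rate and the direction-prediction rate from the bimodal predictor counts reported in this document for the sim-bpred run. The variable names are illustrative.

```python
# Counts reported in this document for the sim-bpred run (bimodal predictor).
lookups = 32211
misses = 3210
direction_hits = 29001

miss_rate = misses / lookups
direction_rate = direction_hits / lookups

miss_rate_rounded = round(miss_rate, 4)
direction_rate_rounded = round(direction_rate, 4)
```

In this run the lookups and updates counts are equal, so dividing by lookups gives the same direction rate as the bpred_dir_rate entry in the annexed output.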
Important results are as follows:
Total no. of instructions : 189668
No. of memory references  : 47713
No. of load instructions  : 27141
No. of store instructions : 20572
Simulation cycles         : 204950
I-L1 hits                 : 217484
I-L1 misses               : 14217
I-L1 writebacks           : 0
D-L1 hits                 : 46712
D-L1 misses               : 577
* Detailed simulation results are annexed.
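The level-1 cache miss rates follow directly from the access and miss counts. The following is a small sketch using the il1 and dl1 figures from the annexed sim-outorder output; the helper name is hypothetical.

```python
def miss_rate(misses, accesses):
    """Fraction of cache accesses that miss."""
    return misses / accesses


# Access and miss counts from the annexed sim-outorder output.
il1_rate = round(miss_rate(14217, 231701), 4)
dl1_rate = round(miss_rate(577, 47289), 4)
```

The instruction cache misses about five times as often per access as the data cache in this run.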
b. sim-cheetah
LRU set-associative caches are simulated, with the number of sets ranging from 128 to 16384. The maximum associativity is 2 and the line size is 16 bytes.
Miss ratios observed are as follows:
No. of sets    Associativity 1    Associativity 2
128            0.048986           0.027377
256            0.034155           0.023916
512            0.025585           0.021795
1024           0.022145           0.021424
* Detailed simulation results are annexed.
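To relate these configurations to cache capacity, note that capacity is the number of sets times the associativity times the line size. The sketch below applies this to the miss ratios listed above; the function and variable names are illustrative.

```python
def cache_capacity_bytes(num_sets, associativity, line_size_bytes):
    """Total data capacity of a set-associative cache."""
    return num_sets * associativity * line_size_bytes


LINE_SIZE = 16  # bytes, as stated for the sim-cheetah run

# (number of sets, associativity) -> miss ratio, from the table above.
miss_ratio = {
    (128, 1): 0.048986, (128, 2): 0.027377,
    (256, 1): 0.034155, (256, 2): 0.023916,
    (512, 1): 0.025585, (512, 2): 0.021795,
    (1024, 1): 0.022145, (1024, 2): 0.021424,
}

capacity = {
    config: cache_capacity_bytes(config[0], config[1], LINE_SIZE)
    for config in miss_ratio
}
```

For example, 128 sets at associativity 2 and 256 sets at associativity 1 both amount to 4096 bytes, yet the 2-way configuration has the lower miss ratio (0.027377 against 0.034155), so for this program associativity helps at equal capacity.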
c. sim-bpred
The 2-level predictor is a dynamic predictor with a branch history table that records whether recent branches were taken or not. This table is accessed before predicting the direction. Since the recent behavior of branches is considered, it is more effective than the two static prediction schemes.
The bimodal predictor accesses the branch target buffer and fetches the predicted address for the decoded instruction. The prediction for a branch is based on that branch's own recent behavior. Hence its prediction is more accurate than that of the 2-level prediction scheme, where the prediction is based on the behavior of any recent branch. In addition to being more effective at prediction, a bimodal predictor reduces the branch penalty, and hence the total execution time, since the next instruction address is known in the ID stage itself.
Thus, the bimodal prediction scheme is more effective than the other schemes considered here. The selection of a suitable predictor is very important when the code has a high branch frequency, since the effect of the misprediction rate will be greater. Mispredictions always affect the execution time.
No. of instructions : 189669
No. of branches     : 32211
Bimod addr hits     : 28514
Bimod lookups       : 32211
* Detailed results are annexed.
d. sim-cache
The simulation results were compared with those obtained from sim-outorder; the comparison supports the observations and conclusions.
5. ANALYSIS AND GENERAL OBSERVATIONS
(i) EFFECT OF SIZE OF BTB
The prediction capability of a bimodal predictor varies with the size of the buffer. A larger buffer can hold the predicted address for a greater number of branch instructions, and hence predicts more effectively. This might suggest that an infinite buffer would give the most effective prediction rate, but that would be very expensive.
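The capacity effect described here can be illustrated with a toy model. The following is a minimal sketch, assuming a direct-mapped BTB indexed by branch address and a synthetic trace of 64 branches visited in round-robin order. The names, the trace, and the 8-byte instruction width are assumptions, and the model is not SimpleScalar's BTB.

```python
def btb_hit_rate(trace, entries):
    """Fraction of lookups that find the right target in a direct-mapped BTB."""
    table = [None] * entries             # each slot holds (pc, target) or None
    hits = 0
    for pc, target in trace:
        index = (pc // 8) % entries      # assumes 8-byte instructions
        if table[index] == (pc, target):
            hits += 1
        table[index] = (pc, target)      # install or refresh the entry
    return hits / len(trace)


# Synthetic trace: 64 distinct branches, each with a fixed target,
# executed in the same order 50 times. Addresses are made up.
branches = [(0x400000 + 8 * i, 0x500000 + 8 * i) for i in range(64)]
trace = branches * 50

rates = {entries: btb_hit_rate(trace, entries) for entries in (16, 32, 64, 128)}
```

In this toy trace a buffer smaller than the set of active branches thrashes and never hits, while a buffer of 64 entries already captures every branch after the first pass. Growing it to 128 entries gains nothing further, which matches the point that capacity beyond the program's needs only adds cost.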
(ii) SPECULATIVE PREDICTORS & UPDATES
Dynamic speculation can be done with hardware support, using branch prediction to parallelize the code. In this scheme, the memory or register file is updated only after the instruction is no longer speculative. Thus, the update can be done either after the ID stage, when the branch target address is calculated, or later in the pipeline. The comparison of these schemes is also evident in the simulation results.
Earlier updating improves the prediction hit rate, since hardware-based speculation uses the dynamic data dependences to choose when to execute instructions. Thus, the sooner the data values are updated, the sooner the data dependences are resolved dynamically, which helps in making better branch predictions.
6. CONCLUSION
A comparison of the various parameters helps us understand their effect on various performance issues, and thus helps in coming up with good architectures.
ANNEXURE
SIM-OUTORDER EXACT RESULTS
sim_total_insn               209429 # total number of inst
sim_total_refs               52742 # total number of load
sim_total_loads              30622 # total number of load
sim_total_stores             22120.0000 # total number of stor
sim_total_branches           36801 # total number of bran
sim_cycle                    204950 # total simulation tim
sim_IPC                      0.9254 # instructions per cyc
sim_CPI                      1.0806 # cycles per instructi
sim_exec_BW                  1.0219 # total instructions (... per cycle
sim_IPB                      5.8883 # instruction per bran
bpred_bimod.lookups          38412 # total number of bpre
bpred_bimod.updates          32211 # total number of updat
bpred_bimod.addr_hits        27836 # total number of addr
bpred_bimod.dir_hits         29017 # total number of dire... includes addr-hits)
bpred_bimod.misses           3194 # total number of miss
bpred_bimod.jr_hits          2432 # total number of addr... JR's
bpred_bimod.jr_seen          3139 # total number of JR's
bpred_bimod.bpred_addr_rate  0.8642 # branch address-pre... dr-hits/updates)
bpred_bimod.bpred_dir_rate   0.9008 # branch direction-pr... ll-hits/updates)
bpred_bimod.bpred_jr_rate    0.7748 # JR address-predictio... hits/JRs seen)
bpred_bimod.retstack_pushes  3100 # total number of ... et-addr stack
bpred_bimod.retstack_pops    4240 # total number of a... et-addr stack
il1.accesses                 231701.0000 # total number of acce
il1.hits                     217484 # total number of hits
il1.misses                   14217 # total number of miss
il1.replacements             13707 # total number of repl
il1.writebacks               0 # total number of writ
il1.invalidations            0 # total number of inva
il1.miss_rate                0.0614 # miss rate (i.e., mis
il1.repl_rate                0.0592 # replacement rate (i.
il1.wb_rate                  0.0000 # writeback rate (i.e.
il1.inv_rate                 0.0000 # invalidation rate (i
dl1.accesses                 47289.0000 # total number of acce
dl1.hits                     46712 # total number of hits
dl1.misses                   577 # total number of miss
dl1.replacements             91 # total number of repl
dl1.writebacks               86 # total number of writ
dl1.invalidations            0 # total number of inva
dl1.miss_rate                0.0122 # miss rate (i.e., mis
dl1.repl_rate                0.0019 # replacement rate (i.
dl1.wb_rate                  0.0018 # writeback rate (i.e.
dl1.inv_rate                 0.0000 # invalidation rate (i
ul2.accesses                 14880.0000 # total number of acce
ul2.hits                     13646 # total number of hits
ul2.misses                   1234 # total number of miss
ul2.replacements             0 # total number of repl
ul2.writebacks               0 # total number of writ
ul2.invalidations            0 # total number of inva
ul2.miss_rate                0.0829 # miss rate (i.e., mis
ul2.repl_rate                0.0000 # replacement rate (i.
ul2.wb_rate                  0.0000 # writeback rate (i.e.
ul2.inv_rate                 0.0000 # invalidation rate (i
itlb.accesses                231701.0000 # total number of acce
itlb.hits                    231678 # total number of hits
itlb.misses                  23 # total number of miss
itlb.replacements            0 # total number of repl
itlb.writebacks              0 # total number of writ
itlb.invalidations           0 # total number of inva
itlb.miss_rate               0.0001 # miss rate (i.e., mis
itlb.repl_rate               0.0000 # replacement rate (i.
itlb.wb_rate                 0.0000 # writeback rate (i.e.
itlb.inv_rate                0.0000 # invalidation rate (i
dtlb.accesses                47934.0000 # total number of acce
dtlb.hits                    47922 # total number of hits
dtlb.misses                  12 # total number of miss
dtlb.replacements            0 # total number of repl
dtlb.writebacks              0 # total number of writ
dtlb.invalidations           0 # total number of inva
dtlb.miss_rate               0.0003 # miss rate (i.e., mis
dtlb.repl_rate               0.0000 # replacement rate (i.
dtlb.wb_rate                 0.0000 # writeback rate (i.e.
dtlb.inv_rate                0.0000 # invalidation rate (i
ld_text_base                 0x00400000 # program text (code)
ld_text_size                 91744 # program text (code)
ld_data_base                 0x10000000 # program initialized
ld_data_size                 13028 # program init'ed `.da... s' size in bytes
ld_stack_base                0x7fffc000 # program stack segmen... s in stack)
ld_stack_size                16384 # program initial stac
ld_prog_entry                0x00400140 # program entry point
ld_environ_base              0x7fff8000 # program environment
ld_target_big_endian         0 # target executable en... big endian
mem_brk_point                0x10008000 # data segment break p
mem_stack_min                0x401271bc # lowest address acces
mem_total_data               13k # total bytes used in ... nt
mem_total_heap               20k # total bytes used in
mem_total_stack              1047380k # total bytes used in
mem_total_mem                1047413k # total bytes used in ... segments
SIM-CHEETAH EXACT RESULTS
sim: ** simulation statistics **
sim_num_insn                 189669 # total number of instructions executed
sim_num_refs                 47713 # total number of loads and stores executed
sim_elapsed_time             1 # total simulation time in seconds
sim_inst_rate                189669.0000 # simulation speed (in insts/sec)
ld_text_base                 0x00400000 # program text (code) segment base
ld_text_size                 91744 # program text (code) size in bytes
ld_data_base                 0x10000000 # program initialized data segment base
ld_data_size                 13028 # program init'ed `.data' and uninit'ed `.bss' size in bytes
ld_stack_base                0x7fffc000 # program stack segment base (highest address in stack)
ld_stack_size                16384 # program initial stack size
ld_prog_entry                0x00400140 # program entry point (initial PC)
ld_environ_base              0x7fff8000 # program environment base address address
ld_target_big_endian         0 # target executable endian-ness, non-zero if big endian
mem_brk_point                0x10008000 # data segment break point
mem_stack_min                0x7fff6c40 # lowest address accessed in stack segment
mem_total_data               13k # total bytes used in init/uninit data segment
mem_total_heap               20k # total bytes used in program heap segment
mem_total_stack              21k # total bytes used in stack segment
mem_total_mem                54k # total bytes used in data, heap, and stack segments
libcheetah: ** end of simulation **
Addresses processed: 48544
Line size: 16 bytes
Miss Ratios
___________
               Associativity
No. of sets    1           2
128            0.048986    0.027377
256            0.034155    0.023916
512            0.025585    0.021795
1024           0.022145    0.021424
2048           0.021692    0.021424
4096           0.021692    0.021424
8192           0.021424    0.021424
16384          0.021424    0.021424
SIM-BPRED EXACT RESULTS
sim: ** simulation statistics **
sim_num_insn                 189669 # total number of instructions executed
sim_num_refs                 47713 # total number of loads and stores executed
sim_elapsed_time             1 # total simulation time in seconds
sim_inst_rate                189669.0000 # simulation speed (in insts/sec)
sim_num_branches             32211 # total number of branches executed
sim_IPB                      5.8883 # instruction per branch
bpred_bimod.lookups          32211 # total number of bpred lookups
bpred_bimod.updates          32211 # total number of updates
bpred_bimod.addr_hits        28514 # total number of address-predicted hits
bpred_bimod.dir_hits         29001 # total number of direction-predicted hits (includes addr-hits)
bpred_bimod.misses           3210 # total number of misses
bpred_bimod.jr_hits          3124 # total number of address-predicted hits for JR's
bpred_bimod.jr_seen          3139 # total number of JR's seen
bpred_bimod.bpred_addr_rate  0.8852 # branch address-prediction rate (i.e., addr-hits/updates)
bpred_bimod.bpred_dir_rate   0.9003 # branch direction-prediction rate (i.e., all-hits/updates)
bpred_bimod.bpred_jr_rate    0.9952 # JR address-prediction rate (i.e., JR addr-hits/JRs seen)
bpred_bimod.retstack_pushes  3100 # total number of address pushed onto ret-addr stack
bpred_bimod.retstack_pops    3098 # total number of address popped off of ret-addr stack
ld_text_base                 0x00400000 # program text (code) segment base
ld_text_size                 91744 # program text (code) size in bytes
ld_data_base                 0x10000000 # program initialized data segment base
ld_data_size                 13028 # program init'ed `.data' and uninit'ed `.bss' size in bytes
ld_stack_base                0x7fffc000 # program stack segment base (highest address in stack)
ld_stack_size                16384 # program initial stack size
ld_prog_entry                0x00400140 # program entry point (initial PC)
ld_environ_base              0x7fff8000 # program environment base address address
ld_target_big_endian         0 # target executable endian-ness, non-zero if big endian
mem_brk_point                0x10008000 # data segment break point
mem_stack_min                0x7fff6c40 # lowest address accessed in stack segment
mem_total_data               13k # total bytes used in init/uninit data segment
mem_total_heap               20k # total bytes used in program heap segment
mem_total_stack              21k # total bytes used in stack segment
mem_total_mem                54k # total bytes used in data, heap, and stack segments
SIM-CACHE EXACT RESULTS
sim: ** simulation statistics **
sim_num_insn                 189669 # total number of instructions executed
sim_num_refs                 47713 # total number of loads and stores executed
sim_elapsed_time             1 # total simulation time in seconds
sim_inst_rate                189669.0000 # simulation speed (in insts/sec)
il1.accesses                 189669.0000 # total number of accesses
il1.hits                     168662 # total number of hits
il1.misses                   21007 # total number of misses
il1.replacements             20751 # total number of replacements
il1.writebacks               0 # total number of writebacks
il1.invalidations            0 # total number of invalidations
il1.miss_rate                0.1108 # miss rate (i.e., misses/ref)
il1.repl_rate                0.1094 # replacement rate (i.e., repls/ref)
il1.wb_rate                  0.0000 # writeback rate (i.e., wrbks/ref)
il1.inv_rate                 0.0000 # invalidation rate (i.e., invs/ref)
dl1.accesses                 48544.0000 # total number of accesses
dl1.hits                     47710 # total number of hits
dl1.misses                   834 # total number of misses
dl1.replacements             578 # total number of replacements
dl1.writebacks               428 # total number of writebacks
dl1.invalidations            0 # total number of invalidations
dl1.miss_rate                0.0172 # miss rate (i.e., misses/ref)
dl1.repl_rate                0.0119 # replacement rate (i.e., repls/ref)
dl1.wb_rate                  0.0088 # writeback rate (i.e., wrbks/ref)
dl1.inv_rate                 0.0000 # invalidation rate (i.e., invs/ref)
ul2.accesses                 22269.0000 # total number of accesses
ul2.hits                     21065 # total number of hits
ul2.misses                   1204 # total number of misses
ul2.replacements             0 # total number of replacements
ul2.writebacks               0 # total number of writebacks
ul2.invalidations            0 # total number of invalidations
ul2.miss_rate                0.0541 # miss rate (i.e., misses/ref)
ul2.repl_rate                0.0000 # replacement rate (i.e., repls/ref)
ul2.wb_rate                  0.0000 # writeback rate (i.e., wrbks/ref)
ul2.inv_rate                 0.0000 # invalidation rate (i.e., invs/ref)
itlb.accesses                189669.0000 # total number of accesses
itlb.hits                    189646 # total number of hits
itlb.misses                  23 # total number of misses
itlb.replacements            0 # total number of replacements
itlb.writebacks              0 # total number of writebacks
itlb.invalidations           0 # total number of invalidations
itlb.miss_rate               0.0001 # miss rate (i.e., misses/ref)
itlb.repl_rate               0.0000 # replacement rate (i.e., repls/ref)
itlb.wb_rate                 0.0000 # writeback rate (i.e., wrbks/ref)
itlb.inv_rate                0.0000 # invalidation rate (i.e., invs/ref)
dtlb.accesses                48544.0000 # total number of accesses
dtlb.hits                    48534 # total number of hits
dtlb.misses                  10 # total number of misses
dtlb.replacements            0 # total number of replacements
dtlb.writebacks              0 # total number of writebacks
dtlb.invalidations           0 # total number of invalidations
dtlb.miss_rate               0.0002 # miss rate (i.e., misses/ref)
dtlb.repl_rate               0.0000 # replacement rate (i.e., repls/ref)
dtlb.wb_rate                 0.0000 # writeback rate (i.e., wrbks/ref)
dtlb.inv_rate                0.0000 # invalidation rate (i.e., invs/ref)
ld_text_base                 0x00400000 # program text (code) segment base
ld_text_size                 91744 # program text (code) size in bytes
ld_data_base                 0x10000000 # program initialized data segment base
ld_data_size                 13028 # program init'ed `.data' and uninit'ed `.bss' size in bytes
ld_stack_base                0x7fffc000 # program stack segment base (highest address in stack)
ld_stack_size                16384 # program initial stack size
ld_prog_entry                0x00400140 # program entry point (initial PC)
ld_environ_base              0x7fff8000 # program environment base address address
ld_target_big_endian         0 # target executable endian-ness, non-zero if big endian
mem_brk_point                0x10008000 # data segment break point
mem_stack_min                0x7fff6c40 # lowest address accessed in stack segment
mem_total_data               13k # total bytes used in init/uninit data segment
mem_total_heap               20k # total bytes used in program heap segment
mem_total_stack              21k # total bytes used in stack segment
mem_total_mem                54k # total bytes used in data, heap, and stack segments