Assignment II
CS 718N Architecture of Large System

A Simulation Study of the SimpleScalar Processor

Submitted to:

Prof Anshul Kumar

By

Manish Gaur (2000MCS012)

Imran Ali (2000MCS003)

Om Prakash (2000MCS005)

7th May (Monday)

SIMPLESCALAR PROJECT

1. INTRODUCTION

The SimpleScalar tool set is an architecture simulator that reproduces the
behavior of a computing device. It takes system inputs and produces system
outputs and system metrics. Programs written in C are compiled and executed
by the simulator. The simulator tracks the microarchitectural state in each
cycle and produces a detailed history of all instructions executed.
In this project we simulated our binary search program on the following simulators:
1. sim-outorder: This simulator implements a very detailed out-of-order issue superscalar processor with a two-level memory system and speculative execution support. It is a performance simulator that tracks the latency of all pipeline operations.

2. sim-cheetah: This program implements a functional simulator driver for Cheetah. Cheetah is a cache simulation package written by Rabin Sugumar and Santosh Abraham that can efficiently simulate multiple cache configurations in a single run of a program. Specifically, Cheetah can simulate ranges of single-level set-associative and fully associative caches.

3. sim-cache: This simulator implements a functional cache simulator. Cache statistics are generated for a user-selected cache and TLB configuration, which may include up to two levels of instruction and data caches (with any level unified) and one level of instruction and data TLBs. No timing information is generated.

4. sim-bpred: This simulator implements a branch predictor analyzer.

In this project, one program is compiled and run on the four simulators
described above, and the results are observed and analysed. The
sim-outorder simulator generates detailed statistics for parameters such as
performance, branch prediction miss rate, and mis-speculation. With the help
of the simulated output, the effectiveness of different types of branch
predictors is observed. The effect of the size of the branch target buffer
(BTB) on the branch misprediction rate is observed. If a speculation scheme
is used to predict the branch target address, table updates can be done
after the branch is executed. Updating in the Instruction Decode (ID) stage
is compared with updating in the WriteBack (WB) stage. The effect of the
depth of the pipeline on CPU execution time is observed. The instruction
fetch queue size affects the CPI (cycles per instruction) and the IPC
(instructions per cycle).

2. BENCHMARKS

The simulations are run on a binary search program written in C. It searches for an integer in an array and, if the integer is found, reports its location in the array. The total number of instructions executed is 189668, of which 36800 are executed branches.
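
The benchmark source itself is not reproduced in this report. For readers who want a reminder of the algorithm, the following is a minimal Python sketch of an iterative binary search over a sorted list. It is an illustrative analogue only, not the C program that was simulated, and the names `binary_search`, `values`, and `target` are introduced here for the example.

```python
def binary_search(values, target):
    """Return the index of target in the sorted list values, or -1 if absent."""
    low, high = 0, len(values) - 1
    while low <= high:
        mid = (low + high) // 2
        if values[mid] == target:
            return mid
        if values[mid] < target:
            low = mid + 1      # target can only be in the upper half
        else:
            high = mid - 1     # target can only be in the lower half
    return -1


data = [3, 8, 15, 23, 42, 57, 91]   # must already be sorted
print(binary_search(data, 23))      # position of 23 in data
print(binary_search(data, 4))       # -1 signals "not found"
```

Each loop iteration contains at least two conditional branches, which is one reason a search kernel like this is a reasonable workload for studying branch predictors.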

3. SIMULATIONS

SimpleScalar has a wide range of simulation tools, such as sim-fast,
sim-bpred, sim-cache, sim-profile, sim-cheetah, and sim-outorder. Of these,
sim-outorder gives the most detailed performance results and has a
multi-level memory system. It lets us vary a large number of options, such
as the branch predictor type and the extra branch misprediction latency. The
benchmark is run on all four simulator tools described above.
When a branch instruction is decoded, the CPU tries to predict its
direction. The simulator allows specifying the type of branch predictor.
Simulations are run on the benchmark specifying the branch predictor
to be one of always taken, always untaken, bimodal predictor using a branch
target buffer, or a 2-level adaptive predictor.
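
To make the difference between the static and dynamic schemes concrete, here is a small Python sketch. It assumes that the bimodal predictor is a table of 2-bit saturating counters indexed by low-order bits of the branch address, initialised to "weakly not taken", and that instructions are 4-byte aligned. These are assumptions made for the example, not details taken from the simulator, and the 2-level adaptive predictor is left out for brevity. All class and function names are hypothetical.

```python
def always_taken(pc):
    return True


def always_not_taken(pc):
    return False


class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by the branch address."""

    def __init__(self, table_size=2048):
        self.table_size = table_size
        # Counter values 0-1 predict not taken, 2-3 predict taken.
        self.counters = [1] * table_size  # assumed start: weakly not taken

    def _index(self, pc):
        return (pc >> 2) % self.table_size  # assumes 4-byte instructions

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def miss_rate(predict, trace, update=None):
    """Fraction of branches in trace whose direction was mispredicted."""
    misses = 0
    for pc, taken in trace:
        if predict(pc) != taken:
            misses += 1
        if update is not None:
            update(pc, taken)  # train the predictor with the real outcome
    return misses / len(trace)


# Synthetic trace: one loop branch, taken 9 times then not taken, 10 loops.
trace = ([(0x400100, True)] * 9 + [(0x400100, False)]) * 10

bimodal = BimodalPredictor()
print("always taken    :", miss_rate(always_taken, trace))
print("always not taken:", miss_rate(always_not_taken, trace))
print("bimodal         :", miss_rate(bimodal.predict, trace, bimodal.update))
```

On this synthetic loop trace the static "taken" guess and the bimodal predictor are almost equally good, while "not taken" is poor. Which scheme wins on a real program depends on its branch behavior, which is what the simulations measure.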

When using a bimodal predictor, the size of the branch target buffer can be
varied. The effect of the size of the BTB on the number of mispredicted
branch directions is observed by running the benchmark varying the size
of the BTB to be 256, 512, 1024, 2048, 4096 and 8192 bytes.

In a speculative predictor, prediction table updates can be done at any
stage after the branches are actually executed. The simulations are run with
the updates done early and late in the pipeline, in ID and WB stages. The
effectiveness of the speculative predictor in both these cases is analyzed.

The pipeline depth affects the execution time of a program. The number of
ALU functional units and the number of integer multiply units are each
varied, and the effect on the execution time is observed in the two cases
individually.

Varying the size of the instruction fetch queue changes the CPI or IPC.
Increasing the size of the queue reduces the CPI and increases the IPC
thereby decreasing the total number of cycles for the total simulation.
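
The sweeps described in this section could be scripted roughly as follows. This is a sketch under several assumptions: the benchmark binary name `bsearch` is hypothetical, the listed BTB sizes are passed as the number of sets of a direct-mapped BTB, and the functional-unit counts and fetch queue sizes are illustrative values that do not come from this report. The option names are the ones sim-outorder lists in its help output and should be checked against the installed version. The script only prints the command lines and does not run the simulator.

```python
SIM = "sim-outorder"     # assumed to be on PATH
BENCHMARK = "bsearch"    # hypothetical name of the compiled benchmark


def build_runs():
    """Return one argument list per simulation run."""
    runs = []
    # Branch predictor type.
    for pred in ("taken", "nottaken", "bimod", "2lev"):
        runs.append([SIM, "-bpred", pred, BENCHMARK])
    # BTB size sweep (assumed: <sets> <associativity>, direct-mapped).
    for sets in (256, 512, 1024, 2048, 4096, 8192):
        runs.append([SIM, "-bpred", "bimod",
                     "-bpred:btb", str(sets), "1", BENCHMARK])
    # Speculative predictor update in the ID or WB stage.
    for stage in ("ID", "WB"):
        runs.append([SIM, "-bpred", "bimod",
                     "-bpred:spec_update", stage, BENCHMARK])
    # Functional unit counts (illustrative values).
    for count in (1, 2, 4):
        runs.append([SIM, "-res:ialu", str(count), BENCHMARK])
        runs.append([SIM, "-res:imult", str(count), BENCHMARK])
    # Instruction fetch queue size (illustrative values).
    for size in (1, 2, 4, 8):
        runs.append([SIM, "-fetch:ifqsize", str(size), BENCHMARK])
    return runs


for argv in build_runs():
    print(" ".join(argv))
```

Building each run as an argument list keeps the sweep definitions in one place and makes it easy to hand the same lists to a process launcher later.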

4. RESULTS

a. Sim-outorder

If the direction predicted by the predictor matches the direction the branch
actually takes, it is termed a branch prediction hit; otherwise it is a
branch misprediction. The total number of mispredictions divided by the
total number of branch lookups gives the branch prediction miss rate, which
measures the effectiveness of a predictor type. The lower the miss rate, the
more effective the predictor.
Important results are as follows:

Total no. of instructions : 189668
No. of memory references  : 47713
No. of load instructions  : 27141
No. of store instructions : 20572
Simulation cycles         : 204950
I-L1 hits                 : 217484
I-L1 misses               : 14217
I-L1 writebacks           : 0
D-L1 hits                 : 46712
D-L1 misses               : 577

* Detailed simulation results are annexed.
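
As a consistency check, the headline rates can be rederived from the raw counts. The sketch below uses the instruction and cycle counts from the table above and the L1 access and miss counts from the sim-outorder annexure. The helper name `ratio` and the variable names are introduced here for illustration only.

```python
def ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


committed_insns = 189668   # total instructions executed
cycles = 204950            # sim_cycle
il1_accesses, il1_misses = 231701, 14217   # from the annexure
dl1_accesses, dl1_misses = 47289, 577      # from the annexure

ipc = ratio(committed_insns, cycles)
cpi = ratio(cycles, committed_insns)
il1_miss_rate = ratio(il1_misses, il1_accesses)
dl1_miss_rate = ratio(dl1_misses, dl1_accesses)

print(f"IPC           = {ipc:.4f}")            # 0.9254
print(f"CPI           = {cpi:.4f}")            # 1.0806
print(f"I-L1 miss rate = {il1_miss_rate:.4f}") # 0.0614
print(f"D-L1 miss rate = {dl1_miss_rate:.4f}") # 0.0122
```

The computed values agree with sim_IPC, sim_CPI, il1.miss_rate, and dl1.miss_rate in the annexure, which indicates that the reported IPC and CPI are based on the 189668 committed instructions rather than on the larger sim_total_insn count.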

b. sim-cheetah

LRU set-associative caches are simulated. The number of sets ranges from 128 to 16384, the maximum associativity is 2, and the line size is 16 bytes.
The observed miss ratios are as follows:

No. of sets       Associativity
                  1            2
128               0.048986     0.027377
256               0.034155     0.023916
512               0.025585     0.021795
1024              0.022145     0.021424

* Detailed simulation results are annexed.
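
The drop in miss ratio when moving from associativity 1 to 2 is mostly due to conflict misses being removed. The following Python sketch of an LRU set-associative cache shows that effect on a tiny synthetic address trace. It is a simplified model written for this illustration, not the algorithm Cheetah uses internally, and the class name `SetAssociativeCache` is hypothetical.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """LRU set-associative cache that only counts hits and misses."""

    def __init__(self, num_sets, associativity, line_size=16):
        self.num_sets = num_sets
        self.associativity = associativity
        self.line_size = line_size
        # One ordered dict per set; the oldest entry is least recently used.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.accesses = 0
        self.misses = 0

    def access(self, address):
        """Simulate one reference; return True on a hit."""
        self.accesses += 1
        line = address // self.line_size
        index = line % self.num_sets
        tag = line // self.num_sets
        ways = self.sets[index]
        if tag in ways:
            ways.move_to_end(tag)        # mark as most recently used
            return True
        self.misses += 1
        if len(ways) >= self.associativity:
            ways.popitem(last=False)     # evict the least recently used line
        ways[tag] = True
        return False

    def miss_ratio(self):
        return self.misses / self.accesses if self.accesses else 0.0


# Two addresses that map to the same set when there are 128 sets of
# 16-byte lines (they are 128 * 16 = 2048 bytes apart).
trace = [0x0000, 0x0800] * 4

for assoc in (1, 2):
    cache = SetAssociativeCache(num_sets=128, associativity=assoc)
    for addr in trace:
        cache.access(addr)
    print(f"associativity {assoc}: miss ratio {cache.miss_ratio():.2f}")
```

With one way the two lines keep evicting each other, so every reference misses. With two ways only the first reference to each line misses.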

c. sim-bpred

The 2-level predictor is the simplest dynamic predictor, with a branch
history table recording whether the most recent branch was taken or not.
This table is accessed before predicting the direction. Since the recent
behavior of the branch is considered, it is more effective than the two
static prediction schemes.

The bimodal predictor accesses the branch target buffer and fetches the
predicted address for the decoded instruction. The prediction is based on
its own recent behavior. Hence its prediction is more exact than that of the
2-level prediction scheme where the prediction is based on the behavior of
any recent branch. In addition to the effectiveness in prediction, a bimodal
predictor reduces the branch penalty and hence the total execution time
since the next instruction address is known in the ID stage itself.

Thus, the bimodal prediction scheme is more effective than the other schemes
considered here. The selection of a suitable predictor is especially
important when the code has a high branch frequency, since the effect of the
misprediction rate is then larger. Mispredictions always affect the
execution time.

No. of instructions : 189669
No. of branches     : 32211
Bimod addr hits     : 28514
Bimod lookups       : 32211

*Detailed results are annexed.
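
The hit rate implied by these counts can be worked out directly. Note that the summary reports address hits only, so the figure below is an address-prediction rate and not a direction-prediction rate. The function name `hit_rate` is introduced here for the example.

```python
def hit_rate(hits, lookups):
    """Fraction of predictor lookups that were hits."""
    return hits / lookups


bimod_addr_hits = 28514
bimod_lookups = 32211

addr_hit_rate = hit_rate(bimod_addr_hits, bimod_lookups)
addr_miss_rate = 1.0 - addr_hit_rate

print(f"address hit rate  = {addr_hit_rate:.4f}")   # 0.8852
print(f"address miss rate = {addr_miss_rate:.4f}")  # 0.1148
```

For comparison, the sim-outorder annexure reports an address-prediction rate of 0.8642 and a direction-prediction rate of 0.9008 for the same bimodal predictor.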

d. sim-cache
The simulation results were compared with those obtained from sim-outorder, and the comparison supports the observations and conclusions.

5. ANALYSIS AND GENERAL OBSERVATIONS
(i) EFFECT OF SIZE OF BTB

The prediction capability of a bimodal predictor varies with the size of the
buffer. A larger buffer can hold the predicted address for a larger number of branch instructions, and hence predicts more effectively. This might suggest that an infinite buffer would give a much better prediction rate, but such a buffer would be very expensive.
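
A toy model helps show why the benefit of a larger buffer levels off. The sketch below assumes a direct-mapped BTB indexed by low-order bits of the branch address, with 4-byte instructions, and feeds it a synthetic sequence of eight branches visited round-robin. The function name `btb_hit_rate` and the trace are made up for this illustration and are not simulator output.

```python
def btb_hit_rate(branch_pcs, num_entries):
    """Hit rate of a direct-mapped BTB over a sequence of branch addresses."""
    table = [None] * num_entries   # each entry remembers one branch address
    hits = 0
    for pc in branch_pcs:
        index = (pc >> 2) % num_entries   # assumes 4-byte instructions
        if table[index] == pc:
            hits += 1
        else:
            table[index] = pc             # allocate or replace on a miss
    return hits / len(branch_pcs)


# Eight distinct branches at consecutive addresses, visited 10 times each.
distinct = [0x400000 + 4 * i for i in range(8)]
pcs = distinct * 10

for entries in (4, 8, 16):
    print(f"{entries:2d} entries: hit rate {btb_hit_rate(pcs, entries):.2f}")
```

With 4 entries the eight branches keep displacing one another and nothing hits. With 8 entries every branch fits and only the first visit misses. Going to 16 entries changes nothing, because the buffer already holds the whole working set. The sim-cheetah miss ratios show the same kind of flattening as the number of sets grows.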

(ii) SPECULATIVE PREDICTORS & UPDATES

Dynamic speculation can be done with hardware support, using branch
prediction to parallelize the code. In this scheme, the memory or register
file is updated only after the instruction is no longer speculative. Thus,
the update can be done either after the ID stage, when the branch target
address is calculated, or later in the pipeline. The comparison of these
schemes is also evident in the simulation results.
Updating early improves the prediction hit rate, since hardware-based
speculation uses the dynamic data dependences to choose when to execute
instructions. Thus, the sooner the data values are updated, the sooner the
data dependences are resolved dynamically, which helps in making better
branch predictions.

6. CONCLUSION

A comparison of the various parameters will help us know their effect on
various performance issues and thus help in coming up with good
architectures.

ANNEXURE

    SIM-OUTORDER EXACT RESULTS
sim_total_insn               209429 # total number of inst
sim_total_refs                52742 # total number of load
sim_total_loads               30622 # total number of load
sim_total_stores         22120.0000 # total number of stor
sim_total_branches            36801 # total number of bran
sim_cycle                    204950 # total simulation tim
sim_IPC                      0.9254 # instructions per cyc
sim_CPI                      1.0806 # cycles per instructi
sim_exec_BW                  1.0219 # total instructions (
per cycle
sim_IPB                      5.8883 # instruction per bran
bpred_bimod.lookups           38412 # total number of bpre
bpred_bimod.updates          32211 # total number of updat
bpred_bimod.addr_hits         27836 # total number of addr
bpred_bimod.dir_hits          29017 # total number of dire
includes addr-hits)
bpred_bimod.misses             3194 # total number of miss
bpred_bimod.jr_hits            2432 # total number of addr
JR's
bpred_bimod.jr_seen            3139 # total number of JR's
bpred_bimod.bpred_addr_rate    0.8642 # branch address-pre
dr-hits/updates)
bpred_bimod.bpred_dir_rate    0.9008 # branch direction-pr
ll-hits/updates)
bpred_bimod.bpred_jr_rate    0.7748 # JR address-predictio
hits/JRs seen)
bpred_bimod.retstack_pushes         3100 # total number of
et-addr stack
bpred_bimod.retstack_pops         4240 # total number of a
et-addr stack
il1.accesses            231701.0000 # total number of acce
il1.hits                     217484 # total number of hits
il1.misses                    14217 # total number of miss
il1.replacements              13707 # total number of repl
il1.writebacks                    0 # total number of writ
il1.invalidations                 0 # total number of inva
il1.miss_rate                0.0614 # miss rate (i.e., mis
il1.repl_rate                0.0592 # replacement rate (i.
il1.wb_rate                  0.0000 # writeback rate (i.e.
il1.inv_rate                 0.0000 # invalidation rate (i
dl1.accesses             47289.0000 # total number of acce
dl1.hits                      46712 # total number of hits
dl1.misses                      577 # total number of miss
dl1.replacements                 91 # total number of repl

dl1.writebacks                   86 # total number of writ
dl1.invalidations                 0 # total number of inva
dl1.miss_rate                0.0122 # miss rate (i.e., mis
dl1.repl_rate                0.0019 # replacement rate (i.
dl1.wb_rate                  0.0018 # writeback rate (i.e.
dl1.inv_rate                 0.0000 # invalidation rate (i
ul2.accesses             14880.0000 # total number of acce
ul2.hits                      13646 # total number of hits
ul2.misses                     1234 # total number of miss
ul2.replacements                  0 # total number of repl
ul2.writebacks                    0 # total number of writ
ul2.invalidations                 0 # total number of inva
ul2.miss_rate                0.0829 # miss rate (i.e., mis
ul2.repl_rate                0.0000 # replacement rate (i.
ul2.wb_rate                  0.0000 # writeback rate (i.e.
ul2.inv_rate                 0.0000 # invalidation rate (i
itlb.accesses           231701.0000 # total number of acce
itlb.hits                    231678 # total number of hits
itlb.misses                      23 # total number of miss
itlb.replacements                 0 # total number of repl
itlb.writebacks                   0 # total number of writ
itlb.invalidations                0 # total number of inva
itlb.miss_rate               0.0001 # miss rate (i.e., mis
itlb.repl_rate               0.0000 # replacement rate (i.
itlb.wb_rate                 0.0000 # writeback rate (i.e.
itlb.inv_rate                0.0000 # invalidation rate (i
dtlb.accesses            47934.0000 # total number of acce
dtlb.hits                     47922 # total number of hits
dtlb.misses                      12 # total number of miss
dtlb.replacements                 0 # total number of repl
dtlb.writebacks                   0 # total number of writ
dtlb.invalidations                0 # total number of inva
dtlb.miss_rate               0.0003 # miss rate (i.e., mis
dtlb.repl_rate               0.0000 # replacement rate (i.
dtlb.wb_rate                 0.0000 # writeback rate (i.e.
dtlb.inv_rate                0.0000 # invalidation rate (i
ld_text_base             0x00400000 # program text (code)
ld_text_size                  91744 # program text (code)
ld_data_base             0x10000000 # program initialized
ld_data_size                  13028 # program init'ed `.da
s' size in bytes
ld_stack_base            0x7fffc000 # program stack segmen
s in stack)
ld_stack_size                 16384 # program initial stac
ld_prog_entry            0x00400140 # program entry point
ld_environ_base          0x7fff8000 # program environment
ld_target_big_endian              0 # target executable en
big endian
mem_brk_point            0x10008000 # data segment break p
mem_stack_min            0x401271bc # lowest address acces
mem_total_data                  13k # total bytes used in
nt
mem_total_heap                  20k # total bytes used in
mem_total_stack            1047380k # total bytes used in
mem_total_mem              1047413k # total bytes used in
segments

SIM-CHEETAH EXACT RESULTS

libcheetah: ** end of simulation **
Addresses processed: 48544
Line size: 16 bytes

Miss Ratios
___________

                Associativity
                1               2
No. of sets
128             0.048986        0.027377
256             0.034155        0.023916
512             0.025585        0.021795
1024            0.022145        0.021424
2048            0.021692        0.021424
4096            0.021692        0.021424
8192            0.021424        0.021424
16384           0.021424        0.021424

SIM-BPRED EXACT RESULTS

SIM-CACHE EXACT RESULTS

sim: ** simulation statistics **
sim_num_insn                 189669 # total number of instructions
executed
sim_num_refs                  47713 # total number of loads and stores
executed
sim_elapsed_time                  1 # total simulation time in seconds
sim_inst_rate           189669.0000 # simulation speed (in insts/sec)
il1.accesses            189669.0000 # total number of accesses
il1.hits                     168662 # total number of hits
il1.misses                    21007 # total number of misses
il1.replacements              20751 # total number of replacements
il1.writebacks                    0 # total number of writebacks
il1.invalidations                 0 # total number of invalidations
il1.miss_rate                0.1108 # miss rate (i.e., misses/ref)
il1.repl_rate                0.1094 # replacement rate (i.e., repls/ref)
il1.wb_rate                  0.0000 # writeback rate (i.e., wrbks/ref)
il1.inv_rate                 0.0000 # invalidation rate (i.e., invs/ref)
dl1.accesses             48544.0000 # total number of accesses
dl1.hits                      47710 # total number of hits
dl1.misses                      834 # total number of misses
dl1.replacements                578 # total number of replacements
dl1.writebacks                  428 # total number of writebacks
dl1.invalidations                 0 # total number of invalidations
dl1.miss_rate                0.0172 # miss rate (i.e., misses/ref)
dl1.repl_rate                0.0119 # replacement rate (i.e., repls/ref)
dl1.wb_rate                  0.0088 # writeback rate (i.e., wrbks/ref)
dl1.inv_rate                 0.0000 # invalidation rate (i.e., invs/ref)
ul2.accesses             22269.0000 # total number of accesses
ul2.hits                      21065 # total number of hits
ul2.misses                     1204 # total number of misses
ul2.replacements                  0 # total number of replacements
ul2.writebacks                    0 # total number of writebacks
ul2.invalidations                 0 # total number of invalidations
ul2.miss_rate                0.0541 # miss rate (i.e., misses/ref)
ul2.repl_rate                0.0000 # replacement rate (i.e., repls/ref)
ul2.wb_rate                  0.0000 # writeback rate (i.e., wrbks/ref)
ul2.inv_rate                 0.0000 # invalidation rate (i.e., invs/ref)
itlb.accesses           189669.0000 # total number of accesses
itlb.hits                    189646 # total number of hits
itlb.misses                      23 # total number of misses
itlb.replacements                 0 # total number of replacements
itlb.writebacks                   0 # total number of writebacks
itlb.invalidations                0 # total number of invalidations
itlb.miss_rate               0.0001 # miss rate (i.e., misses/ref)
itlb.repl_rate               0.0000 # replacement rate (i.e., repls/ref)
itlb.wb_rate                 0.0000 # writeback rate (i.e., wrbks/ref)
itlb.inv_rate                0.0000 # invalidation rate (i.e., invs/ref)
dtlb.accesses            48544.0000 # total number of accesses
dtlb.hits                     48534 # total number of hits
dtlb.misses                      10 # total number of misses
dtlb.replacements                 0 # total number of replacements
dtlb.writebacks                   0 # total number of writebacks
dtlb.invalidations                0 # total number of invalidations
dtlb.miss_rate               0.0002 # miss rate (i.e., misses/ref)
dtlb.repl_rate               0.0000 # replacement rate (i.e., repls/ref)
dtlb.wb_rate                 0.0000 # writeback rate (i.e., wrbks/ref)
dtlb.inv_rate                0.0000 # invalidation rate (i.e., invs/ref)
ld_text_base             0x00400000 # program text (code) segment base
ld_text_size                  91744 # program text (code) size in bytes
ld_data_base             0x10000000 # program initialized data segment
base
ld_data_size                  13028 # program init'ed `.data' and
uninit'ed `.bss' size in bytes
ld_stack_base            0x7fffc000 # program stack segment base (highest
address in stack)
ld_stack_size                 16384 # program initial stack size
ld_prog_entry            0x00400140 # program entry point (initial PC)
ld_environ_base          0x7fff8000 # program environment base address
address
ld_target_big_endian              0 # target executable endian-ness,
non-zero if big endian
mem_brk_point            0x10008000 # data segment break point
mem_stack_min            0x7fff6c40 # lowest address accessed in stack
segment
mem_total_data                  13k # total bytes used in init/uninit data
segment
mem_total_heap                  20k # total bytes used in program heap
segment
mem_total_stack                 21k # total bytes used in stack segment
mem_total_mem                   54k # total bytes used in data, heap, and
stack segments
