Assignment II
                CS 718N Architecture of Large Systems

        A Simulation Study of the SimpleScalar Processor

                         Submitted to:

                       Prof Anshul Kumar
 
                             By

                     Manish Gaur(2000MCS012)

                     Imran  Ali (2000MCS003)

                     Om Prakash (2000MCS005)

                              on

                         7th May(Monday)

                        SIMPLE SCALAR PROJECT
 
 

1. INTRODUCTION

The SimpleScalar tool set is an architecture simulator that reproduces the
behavior of a computing device. It takes system inputs and produces system
outputs and system metrics. Programs written in C are compiled and executed
by the simulator. The simulator tracks the microarchitectural state for each
cycle and produces a detailed history of all instructions executed.

In this project we simulated our binary search program on the following
simulators:

1. sim-outorder: This simulator implements a very detailed out-of-order
issue superscalar processor with a two-level memory system and speculative
execution support. It is a performance simulator that tracks the latency of
all pipeline operations.

2. sim-cheetah: This program implements a functional simulator driver for
Cheetah. Cheetah is a cache simulation package written by Rabin Sugumar and
Santosh Abraham that can efficiently simulate multiple cache configurations
in a single run of a program. Specifically, Cheetah can simulate ranges of
single-level set-associative and fully-associative caches.

3. sim-cache: This simulator implements a functional cache simulator. Cache
statistics are generated for a user-selected cache and TLB configuration,
which may include up to two levels of instruction and data cache (with any
level unified) and one level of instruction and data TLBs. No timing
information is generated.

4. sim-bpred: This simulator implements a branch predictor analyzer.
 

In this project, one program is compiled and run on the four simulators
described above, and the results are observed and analysed. The sim-outorder
simulator generates detailed statistics on parameters such as performance,
branch prediction miss rate, and mis-speculation. With the help of the
simulated output, the effectiveness of different types of branch predictors
is observed, as is the effect of the size of the branch target buffer (BTB)
on the branch misprediction rate. If a speculation scheme is used to predict
the branch target address, table updates can be done after the branch is
executed; updating in the Instruction Decode (ID) stage is compared with
updating in the WriteBack (WB) stage. The effect of the depth of the
pipeline on CPU execution time is observed. The instruction fetch queue size
affects the CPI (cycles per instruction) or IPC (instructions per cycle).

2. BENCHMARKS

The simulations are run on a binary search program written in C. It checks whether an integer is present in an array and, if it is found, reports its location in the array. The total number of instructions executed is 189668, of which 36800 are executed branches.
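The benchmark source is not reproduced in this report. As a rough
illustration only, and not the authors' C code, a binary search of the kind
described can be sketched in Python as follows. The function name and the
sample data are assumptions made for this example. Each loop iteration
contains data-dependent comparisons, which is what makes such a program a
useful workload for studying branch prediction.

```python
def binary_search(values, target):
    """Return the index of target in the sorted list values, or -1 if absent."""
    low, high = 0, len(values) - 1
    while low <= high:
        mid = (low + high) // 2
        if values[mid] == target:
            return mid               # found: report the location
        if values[mid] < target:
            low = mid + 1            # continue in the upper half
        else:
            high = mid - 1           # continue in the lower half
    return -1                        # not present in the array


data = [3, 8, 15, 23, 42, 57, 91]    # hypothetical sorted input
print(binary_search(data, 23))       # index of 23 in data
print(binary_search(data, 10))       # -1, since 10 is not in data
```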

3. SIMULATIONS

SimpleScalar has a wide range of simulation tools such as sim-fast,
sim-bpred, sim-cache, sim-profile, sim-cheetah, and sim-outorder. Of these,
sim-outorder gives the most detailed performance results and models a
multi-level memory system. It also lets us vary a large number of options,
such as the branch predictor type and the extra branch misprediction
latency. The benchmark is run on all four simulator tools described above.

When a branch instruction is decoded, the CPU tries to predict its
direction. The simulator allows the type of branch predictor to be
specified. Simulations are run on the benchmark with the branch predictor
set to one of: always taken, always not taken, a bimodal predictor using a
branch target buffer, or a 2-level adaptive predictor.

When using a bimodal predictor, the size of the branch target buffer can be
varied. The effect of the size of the BTB on the number of mispredicted
branch directions is observed by running the benchmark with BTB sizes of
256, 512, 1024, 2048, 4096 and 8192 bytes.
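The exact command lines used for these runs are not recorded in this report.
The sketch below only builds candidate command lines for such a sweep and
does not run anything. It assumes the sim-outorder options `-bpred` and
`-bpred:btb <num_sets> <associativity>` as found in SimpleScalar 3.0 (check
them against the `sim-outorder -h` output of the installed version), treats
the listed sizes as BTB set counts, and uses a hypothetical benchmark binary
name, `bsearch.ss`.

```python
def bpred_sweep_commands(binary, btb_sets=(256, 512, 1024, 2048, 4096, 8192), assoc=1):
    """Build one sim-outorder command line (as an argv list) per BTB size."""
    commands = []
    for sets in btb_sets:
        commands.append([
            "sim-outorder",
            "-bpred", "bimod",                        # bimodal predictor
            "-bpred:btb", str(sets), str(assoc),      # assumed: <num_sets> <associativity>
            binary,                                   # hypothetical benchmark binary
        ])
    return commands


for cmd in bpred_sweep_commands("bsearch.ss"):
    print(" ".join(cmd))
```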

In  a speculative  predictor, prediction  table updates  can be done  at any
stage after the branches are actually executed. The simulations are run with
the updates  done early and late  in the pipeline, in  ID and WB stages. The
effectiveness of the speculative predictor in both these cases is analyzed.

The pipeline depth affects the execution time of a program. The number of
integer ALU functional units and the number of integer multiply units are
each varied in turn, and the effect on execution time is observed for each
case individually.

Varying  the size  of the instruction  fetch queue  changes the CPI  or IPC.
Increasing  the size  of the  queue reduces  the CPI  and increases  the IPC
thereby decreasing the total  number of cycles for the total simulation.

4. RESULTS

a. Sim-outorder

If the direction predicted for a branch matches the direction the branch
actually takes, it is termed a branch prediction hit; otherwise it is a
branch misprediction. The number of mispredictions divided by the total
number of branch lookups gives the branch prediction miss rate, which is a
measure of the effectiveness of a predictor type: the lower the miss rate,
the more effective the predictor.
Important results are as follows:
 
Total no. of instructions : 189668
No. of memory references  :  47713
No. of load instructions  :  27141
No. of store instructions :  20572
Simulation cycles         : 204950
I-L1 hits                 : 217484
I-L1 misses               :  14217
I-L1 writebacks           :      0
D-L1 hits                 :  46712
D-L1 misses               :    577

* Detailed simulation results are annexed.
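As a cross-check, the derived metrics can be recomputed from the raw counts.
The short sketch below uses the instruction and cycle counts listed above
and the L1 hit and miss counts from the annexed sim-outorder output; the
variable names are chosen for this example only.

```python
# Raw counts from the sim-outorder run
instructions = 189668
cycles = 204950
il1_hits, il1_misses = 217484, 14217
dl1_hits, dl1_misses = 46712, 577

ipc = instructions / cycles                          # instructions per cycle
cpi = cycles / instructions                          # cycles per instruction
il1_miss_rate = il1_misses / (il1_hits + il1_misses)
dl1_miss_rate = dl1_misses / (dl1_hits + dl1_misses)

print(f"IPC           = {ipc:.4f}")
print(f"CPI           = {cpi:.4f}")
print(f"I-L1 miss rate = {il1_miss_rate:.4f}")
print(f"D-L1 miss rate = {dl1_miss_rate:.4f}")
```

The results agree with the sim_IPC (0.9254), sim_CPI (1.0806),
il1.miss_rate (0.0614) and dl1.miss_rate (0.0122) values in the annexure.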

b. sim-cheetah

LRU set-associative caches are simulated, with the number of sets ranging
from 128 to 16384. The maximum associativity is 2 and the line size is 16
bytes.
Miss ratios observed are as follows:

                Associativity
                1               2
No. of sets
128             0.048986        0.027377
256             0.034155        0.023916
512             0.025585        0.021795
1024            0.022145        0.021424

  * Detailed simulation results are annexed.
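To illustrate what a single entry of such a table measures, the following
is a minimal sketch of an LRU set-associative cache model. It is not how
Cheetah works internally (Cheetah evaluates many configurations in one
pass); it simulates one configuration at a time, and the function name and
the synthetic address trace are assumptions made for this example.

```python
from collections import OrderedDict


def miss_ratio(addresses, num_sets, assoc, line_size=16):
    """Miss ratio of an LRU set-associative cache over a list of byte addresses."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_size
        index = line % num_sets          # which set the line maps to
        tag = line // num_sets
        ways = sets[index]
        if tag in ways:
            ways.move_to_end(tag)        # hit: mark as most recently used
        else:
            misses += 1
            if len(ways) >= assoc:
                ways.popitem(last=False) # evict the least recently used line
            ways[tag] = True
    return misses / len(addresses)


# Synthetic trace: two addresses that collide in a 128-set direct-mapped cache
trace = [0, 128 * 16] * 8
for num_sets, assoc in [(128, 1), (128, 2), (256, 1)]:
    print(num_sets, assoc, miss_ratio(trace, num_sets, assoc))
```

On this trace the conflict misses disappear either by doubling the number of
sets or by moving to 2-way associativity, which is the same trend as in the
table above.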

c. sim-bpred

The 2-level predictor is a simple dynamic predictor with a branch history
table that records whether recent branches were taken or not. This table is
accessed before predicting the direction. Since recent branch behavior is
taken into account, it is more effective than the two static prediction
schemes (always taken and always not taken).

The  bimodal predictor  accesses the  branch target  buffer and  fetches the
predicted address  for the  decoded instruction. The prediction  is based on
its own recent behavior. Hence its prediction is more exact than that of the
2-level prediction  scheme where the prediction is  based on the behavior of
any recent branch. In addition to the effectiveness in prediction, a bimodal
predictor  reduces the  branch penalty  and hence  the total  execution time
since the next instruction address is known in the ID stage itself.

Thus, the bimodal prediction scheme is more effective than the other schemes
considered here. The selection of a suitable predictor is very important
when the code has a high branch frequency, since the effect of the
misprediction rate will then be greater. Mispredictions always affect the
execution time.
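The idea of predicting a branch from its own recent behavior can be sketched
with a table of 2-bit saturating counters, the usual textbook form of a
bimodal predictor. This is an illustrative model only, not the sim-bpred
implementation: the class name, the indexing function, the initial counter
value and the synthetic trace are all assumptions, and branch target
addresses are not modeled.

```python
class BimodalPredictor:
    """Direction predictor using one 2-bit saturating counter per table entry."""

    def __init__(self, table_size=2048):
        self.table_size = table_size
        self.counters = [1] * table_size      # assumed start: weakly not taken

    def _index(self, pc):
        return (pc >> 2) % self.table_size    # assumed indexing: drop 2 low PC bits

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # True means "taken"

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def direction_hit_rate(predictor, trace):
    """Fraction of (pc, taken) pairs whose direction was predicted correctly."""
    hits = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            hits += 1
        predictor.update(pc, taken)
    return hits / len(trace)


# Synthetic loop branch: taken nine times, then not taken once, repeated ten times
trace = [(0x400100, i % 10 != 9) for i in range(100)]
print(direction_hit_rate(BimodalPredictor(), trace))
```

On this trace the predictor misses only the first iteration and each loop
exit, so the direction hit rate is 0.89.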

 No. of instructions : 189669
 No. of branches     :  32211
 Bimod addr hits     :  28514
 Bimod lookups       :  32211

  *Detailed results are annexed.
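The prediction rates reported by sim-bpred follow directly from these
counts. The sketch below recomputes them; the direction-hit count (29001) is
taken from the annexed sim-bpred output, and the variable names are chosen
for this example only.

```python
lookups = 32211      # bpred_bimod.lookups (equal to updates in this run)
addr_hits = 28514    # bpred_bimod.addr_hits
dir_hits = 29001     # bpred_bimod.dir_hits, from the annexure

addr_rate = addr_hits / lookups
dir_rate = dir_hits / lookups
dir_miss_rate = (lookups - dir_hits) / lookups

print(f"address-prediction rate   = {addr_rate:.4f}")
print(f"direction-prediction rate = {dir_rate:.4f}")
print(f"direction miss rate       = {dir_miss_rate:.4f}")
```

The first two values match bpred_addr_rate (0.8852) and bpred_dir_rate
(0.9003) in the annexure.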

d. sim-cache
     The simulation results were compared against those obtained from sim-outorder and support the observations and conclusions.
 

5. ANALYSIS AND GENERAL OBSERVATIONS

   (i) EFFECT OF SIZE OF BTB

The prediction capability of a bimodal predictor varies with the size of the
buffer. A larger buffer holds predicted addresses for a greater number of
branch instructions and is therefore more effective. This might suggest that
an infinite buffer would give a much better prediction rate, but such a
buffer would be very expensive.
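The dependence on buffer size can be illustrated with a very small model: a
direct-mapped, tagged table indexed by the branch address. This is a sketch
under stated assumptions, not the sim-outorder BTB (which is
set-associative); the function name, the indexing and the synthetic trace of
64 distinct branches are all made up for this example.

```python
def btb_hit_rate(branch_pcs, num_entries):
    """Hit rate of a direct-mapped, tagged BTB over a sequence of branch PCs."""
    table = [None] * num_entries
    hits = 0
    for pc in branch_pcs:
        index = (pc >> 2) % num_entries   # assumed indexing: drop 2 low PC bits
        if table[index] == pc:
            hits += 1                     # entry for this branch is present
        else:
            table[index] = pc             # allocate or replace on a miss
    return hits / len(branch_pcs)


# Synthetic trace: 64 distinct branches executed in order, repeated 20 times
pcs = [0x400000 + 4 * i for i in range(64)] * 20
for size in (16, 32, 64, 128):
    print(size, btb_hit_rate(pcs, size))
```

In this trace the hit rate jumps once the table can hold all 64 branches and
does not improve further at 128 entries, which mirrors the point that
enlarging the buffer beyond the working set of branches adds cost without
adding prediction accuracy.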

   (ii) SPECULATIVE PREDICTORS & UPDATES

Dynamic speculation can be done with hardware support, using branch
prediction to expose parallelism in the code. In this scheme, the memory or
register file is updated only after the instruction is no longer
speculative. Thus, the update can be done either after the ID stage, when
the branch target address is calculated, or later in the pipeline. The
difference between these schemes is also evident in the simulation results.
Updating early improves the prediction hit rate, since hardware-based
speculation uses dynamic data dependences to choose when to execute
instructions. The sooner the data values are updated, the sooner the data
dependences are resolved dynamically, which helps in making better branch
predictions.

6. CONCLUSION

Comparing these parameters shows how each of them affects performance, and
this understanding helps in designing better architectures.

ANNEXURE
 

    SIM-OUTORDER EXACT RESULTS
sim_total_insn               209429 # total number of inst
sim_total_refs                52742 # total number of load
sim_total_loads               30622 # total number of load
sim_total_stores         22120.0000 # total number of stor
sim_total_branches            36801 # total number of bran
sim_cycle                    204950 # total simulation tim
sim_IPC                      0.9254 # instructions per cyc
sim_CPI                      1.0806 # cycles per instructi
sim_exec_BW                  1.0219 # total instructions (
per cycle
sim_IPB                      5.8883 # instruction per bran
bpred_bimod.lookups           38412 # total number of bpred lookups
bpred_bimod.updates          32211 # total number of updates
bpred_bimod.addr_hits         27836 # total number of address-predicted
hits
bpred_bimod.dir_hits          29017 # total number of direction-predicted
hits (includes addr-hits)
bpred_bimod.misses             3194 # total number of misses
bpred_bimod.jr_hits            2432 # total number of address-predicted
hits for JR's
bpred_bimod.jr_seen            3139 # total number of JR's seen
bpred_bimod.bpred_addr_rate    0.8642 # branch address-prediction rate
(i.e., addr-hits/updates)
bpred_bimod.bpred_dir_rate    0.9008 # branch direction-prediction rate
(i.e., all-hits/updates)
bpred_bimod.bpred_jr_rate    0.7748 # JR address-prediction rate (i.e., JR
addr-hits/JRs seen)
bpred_bimod.retstack_pushes         3100 # total number of address pushed
onto ret-addr stack
bpred_bimod.retstack_pops         4240 # total number of address popped
off of ret-addr stack
il1.accesses            231701.0000 # total number of accesses
il1.hits                     217484 # total number of hits
il1.misses                    14217 # total number of misses
il1.replacements              13707 # total number of replacements
il1.writebacks                    0 # total number of writebacks
il1.invalidations                 0 # total number of invalidations
il1.miss_rate                0.0614 # miss rate (i.e., misses/ref)
il1.repl_rate                0.0592 # replacement rate (i.e., repls/ref)
il1.wb_rate                  0.0000 # writeback rate (i.e., wrbks/ref)
il1.inv_rate                 0.0000 # invalidation rate (i.e., invs/ref)
dl1.accesses             47289.0000 # total number of accesses
dl1.hits                      46712 # total number of hits
dl1.misses                      577 # total number of misses
dl1.replacements                 91 # total number of replacements
dl1.writebacks                   86 # total number of writebacks
dl1.invalidations                 0 # total number of invalidations
dl1.miss_rate                0.0122 # miss rate (i.e., misses/ref)
dl1.repl_rate                0.0019 # replacement rate (i.e., repls/ref)
dl1.wb_rate                  0.0018 # writeback rate (i.e., wrbks/ref)
dl1.inv_rate                 0.0000 # invalidation rate (i.e., invs/ref)
ul2.accesses             14880.0000 # total number of accesses
ul2.hits                      13646 # total number of hits
ul2.misses                     1234 # total number of misses
ul2.replacements                  0 # total number of replacements
ul2.writebacks                    0 # total number of writebacks
ul2.invalidations                 0 # total number of invalidations
ul2.miss_rate                0.0829 # miss rate (i.e., misses/ref)
ul2.repl_rate                0.0000 # replacement rate (i.e., repls/ref)
ul2.wb_rate                  0.0000 # writeback rate (i.e., wrbks/ref)
ul2.inv_rate                 0.0000 # invalidation rate (i.e., invs/ref)
itlb.accesses           231701.0000 # total number of accesses
itlb.hits                    231678 # total number of hits
itlb.misses                      23 # total number of misses
itlb.replacements                 0 # total number of replacements
itlb.writebacks                   0 # total number of writebacks
itlb.invalidations                0 # total number of invalidations
itlb.miss_rate               0.0001 # miss rate (i.e., misses/ref)
itlb.repl_rate               0.0000 # replacement rate (i.e., repls/ref)
itlb.wb_rate                 0.0000 # writeback rate (i.e., wrbks/ref)
itlb.inv_rate                0.0000 # invalidation rate (i.e., invs/ref)
dtlb.accesses            47934.0000 # total number of accesses
dtlb.hits                     47922 # total number of hits
dtlb.misses                      12 # total number of misses
dtlb.replacements                 0 # total number of replacements
dtlb.writebacks                   0 # total number of writebacks
dtlb.invalidations                0 # total number of invalidations
dtlb.miss_rate               0.0003 # miss rate (i.e., misses/ref)
dtlb.repl_rate               0.0000 # replacement rate (i.e., repls/ref)
dtlb.wb_rate                 0.0000 # writeback rate (i.e., wrbks/ref)
dtlb.inv_rate                0.0000 # invalidation rate (i.e., invs/ref)
ld_text_base             0x00400000 # program text (code) segment base
ld_text_size                  91744 # program text (code) size in bytes
ld_data_base             0x10000000 # program initialized data segment
base
ld_data_size                  13028 # program init'ed `.data' and
uninit'ed `.bss' size in bytes
ld_stack_base            0x7fffc000 # program stack segment base (highest
address in stack)
ld_stack_size                 16384 # program initial stack size
ld_prog_entry            0x00400140 # program entry point (initial PC)
ld_environ_base          0x7fff8000 # program environment base address
ld_target_big_endian              0 # target executable endian-ness,
non-zero if big endian
mem_brk_point            0x10008000 # data segment break point
mem_stack_min            0x401271bc # lowest address accessed in stack
segment
mem_total_data                  13k # total bytes used in init/uninit data
segment
mem_total_heap                  20k # total bytes used in program heap
segment
mem_total_stack            1047380k # total bytes used in stack segment
mem_total_mem              1047413k # total bytes used in data, heap, and
stack segments

                         SIM-CHEETAH EXACT RESULTS

sim: ** simulation statistics **
sim_num_insn                 189669 # total number of instructions
executed
sim_num_refs                  47713 # total number of loads and stores
executed
sim_elapsed_time                  1 # total simulation time in seconds
sim_inst_rate           189669.0000 # simulation speed (in insts/sec)
ld_text_base             0x00400000 # program text (code) segment base
ld_text_size                  91744 # program text (code) size in bytes
ld_data_base             0x10000000 # program initialized data segment
base
ld_data_size                  13028 # program init'ed `.data' and
uninit'ed `.bss' size in bytes
ld_stack_base            0x7fffc000 # program stack segment base (highest
address in stack)
ld_stack_size                 16384 # program initial stack size
ld_prog_entry            0x00400140 # program entry point (initial PC)
ld_environ_base          0x7fff8000 # program environment base address
address
ld_target_big_endian              0 # target executable endian-ness,
non-zero if big endian
mem_brk_point            0x10008000 # data segment break point
mem_stack_min            0x7fff6c40 # lowest address accessed in stack
segment
mem_total_data                  13k # total bytes used in init/uninit data
segment
mem_total_heap                  20k # total bytes used in program heap
segment
mem_total_stack                 21k # total bytes used in stack segment
mem_total_mem                   54k # total bytes used in data, heap, and
stack segments

libcheetah: ** end of simulation **
Addresses processed: 48544
Line size: 16 bytes

Miss Ratios
___________

                Associativity
                1               2
No. of sets
128             0.048986        0.027377
256             0.034155        0.023916
512             0.025585        0.021795
1024            0.022145        0.021424
2048            0.021692        0.021424
4096            0.021692        0.021424
8192            0.021424        0.021424
16384           0.021424        0.021424

                  SIM-BPRED EXACT RESULTS

sim: ** simulation statistics **
sim_num_insn                 189669 # total number of instructions
executed
sim_num_refs                  47713 # total number of loads and stores
executed
sim_elapsed_time                  1 # total simulation time in seconds
sim_inst_rate           189669.0000 # simulation speed (in insts/sec)
sim_num_branches              32211 # total number of branches executed
sim_IPB                      5.8883 # instruction per branch
bpred_bimod.lookups           32211 # total number of bpred lookups
bpred_bimod.updates          32211 # total number of updates
bpred_bimod.addr_hits         28514 # total number of address-predicted
hits
bpred_bimod.dir_hits          29001 # total number of direction-predicted
hits (includes addr-hits)
bpred_bimod.misses             3210 # total number of misses
bpred_bimod.jr_hits            3124 # total number of address-predicted
hits for JR's
bpred_bimod.jr_seen            3139 # total number of JR's seen
bpred_bimod.bpred_addr_rate    0.8852 # branch address-prediction rate
(i.e., addr-hits/updates)
bpred_bimod.bpred_dir_rate    0.9003 # branch direction-prediction rate
(i.e., all-hits/updates)
bpred_bimod.bpred_jr_rate    0.9952 # JR address-prediction rate (i.e., JR
addr-hits/JRs seen)
bpred_bimod.retstack_pushes         3100 # total number of address pushed
onto ret-addr stack
bpred_bimod.retstack_pops         3098 # total number of address popped
off of ret-addr stack
ld_text_base             0x00400000 # program text (code) segment base
ld_text_size                  91744 # program text (code) size in bytes
ld_data_base             0x10000000 # program initialized data segment
base
ld_data_size                  13028 # program init'ed `.data' and
uninit'ed `.bss' size in bytes
ld_stack_base            0x7fffc000 # program stack segment base (highest
address in stack)
ld_stack_size                 16384 # program initial stack size
ld_prog_entry            0x00400140 # program entry point (initial PC)
ld_environ_base          0x7fff8000 # program environment base address
address
ld_target_big_endian              0 # target executable endian-ness,
non-zero if big endian
mem_brk_point            0x10008000 # data segment break point
mem_stack_min            0x7fff6c40 # lowest address accessed in stack
segment
mem_total_data                  13k # total bytes used in init/uninit data
segment
mem_total_heap                  20k # total bytes used in program heap
segment
mem_total_stack                 21k # total bytes used in stack segment
mem_total_mem                   54k # total bytes used in data, heap, and
stack segments

                   SIM-CACHE EXACT RESULTS
 

  sim: ** simulation statistics **
sim_num_insn                 189669 # total number of instructions
executed
sim_num_refs                  47713 # total number of loads and stores
executed
sim_elapsed_time                  1 # total simulation time in seconds
sim_inst_rate           189669.0000 # simulation speed (in insts/sec)
il1.accesses            189669.0000 # total number of accesses
il1.hits                     168662 # total number of hits
il1.misses                    21007 # total number of misses
il1.replacements              20751 # total number of replacements
il1.writebacks                    0 # total number of writebacks
il1.invalidations                 0 # total number of invalidations
il1.miss_rate                0.1108 # miss rate (i.e., misses/ref)
il1.repl_rate                0.1094 # replacement rate (i.e., repls/ref)
il1.wb_rate                  0.0000 # writeback rate (i.e., wrbks/ref)
il1.inv_rate                 0.0000 # invalidation rate (i.e., invs/ref)
dl1.accesses             48544.0000 # total number of accesses
dl1.hits                      47710 # total number of hits
dl1.misses                      834 # total number of misses
dl1.replacements                578 # total number of replacements
dl1.writebacks                  428 # total number of writebacks
dl1.invalidations                 0 # total number of invalidations
dl1.miss_rate                0.0172 # miss rate (i.e., misses/ref)
dl1.repl_rate                0.0119 # replacement rate (i.e., repls/ref)
dl1.wb_rate                  0.0088 # writeback rate (i.e., wrbks/ref)
dl1.inv_rate                 0.0000 # invalidation rate (i.e., invs/ref)
ul2.accesses             22269.0000 # total number of accesses
ul2.hits                      21065 # total number of hits
ul2.misses                     1204 # total number of misses
ul2.replacements                  0 # total number of replacements
ul2.writebacks                    0 # total number of writebacks
ul2.invalidations                 0 # total number of invalidations
ul2.miss_rate                0.0541 # miss rate (i.e., misses/ref)
ul2.repl_rate                0.0000 # replacement rate (i.e., repls/ref)
ul2.wb_rate                  0.0000 # writeback rate (i.e., wrbks/ref)
ul2.inv_rate                 0.0000 # invalidation rate (i.e., invs/ref)
itlb.accesses           189669.0000 # total number of accesses
itlb.hits                    189646 # total number of hits
itlb.misses                      23 # total number of misses
itlb.replacements                 0 # total number of replacements
itlb.writebacks                   0 # total number of writebacks
itlb.invalidations                0 # total number of invalidations
itlb.miss_rate               0.0001 # miss rate (i.e., misses/ref)
itlb.repl_rate               0.0000 # replacement rate (i.e., repls/ref)
itlb.wb_rate                 0.0000 # writeback rate (i.e., wrbks/ref)
itlb.inv_rate                0.0000 # invalidation rate (i.e., invs/ref)
dtlb.accesses            48544.0000 # total number of accesses
dtlb.hits                     48534 # total number of hits
dtlb.misses                      10 # total number of misses
dtlb.replacements                 0 # total number of replacements
dtlb.writebacks                   0 # total number of writebacks
dtlb.invalidations                0 # total number of invalidations
dtlb.miss_rate               0.0002 # miss rate (i.e., misses/ref)
dtlb.repl_rate               0.0000 # replacement rate (i.e., repls/ref)
dtlb.wb_rate                 0.0000 # writeback rate (i.e., wrbks/ref)
dtlb.inv_rate                0.0000 # invalidation rate (i.e., invs/ref)
ld_text_base             0x00400000 # program text (code) segment base
ld_text_size                  91744 # program text (code) size in bytes
ld_data_base             0x10000000 # program initialized data segment
base
ld_data_size                  13028 # program init'ed `.data' and
uninit'ed `.bss' size in bytes
ld_stack_base            0x7fffc000 # program stack segment base (highest
address in stack)
ld_stack_size                 16384 # program initial stack size
ld_prog_entry            0x00400140 # program entry point (initial PC)
ld_environ_base          0x7fff8000 # program environment base address
address
ld_target_big_endian              0 # target executable endian-ness,
non-zero if big endian
mem_brk_point            0x10008000 # data segment break point
mem_stack_min            0x7fff6c40 # lowest address accessed in stack
segment
mem_total_data                  13k # total bytes used in init/uninit data
segment
mem_total_heap                  20k # total bytes used in program heap
segment
mem_total_stack                 21k # total bytes used in stack segment
mem_total_mem                   54k # total bytes used in data, heap, and
stack segments
 
