A brief description of the simulators included with the release.
The following command-line arguments are available in all simulators included with the release:
-h    prints the simulator help message.
-d    turns on debug messages.
-i    starts execution in the DLite! debugger. This option is not supported in the sim-fast simulator.
-q    terminates immediately.
-dumpconfig <file>  generates a configuration file saving the command-line parameters. Comments are permitted in config files and begin with a #.
-config <file>  reads in and uses a configuration file. These files may reference other config files.

sim-fast: It does no time accounting, only functional simulation: it executes each instruction serially, simulating no instructions in parallel. sim-fast is optimized for raw speed; it assumes no caches and performs no instruction checking.

sim-safe: It also performs functional simulation, but checks for correct alignment and access permissions for each memory reference. sim-fast and sim-safe do not accept any additional command line arguments.
sim-profile: It can generate detailed profiles on instruction classes and addresses, text symbols, memory accesses, branches, and data segment symbols. It accepts the following additional command-line arguments, which toggle the various profiling features:
-iclass      instruction class profiling (e.g., ALU, branch).
-iprof       instruction profiling (e.g., bnez, addi).
-brprof      branch class profiling (e.g., direct, calls, conditional).
-amprof      addr. mode profiling (e.g., displaced, R+R).
-segprof     load/store segment profiling (e.g., data, heap).
-tsymprof    execution profile by text symbol (functions).
-dsymprof    reference profile by data segment symbol.
-taddrprof   execution profile by text address.
-all         turn on all profiling listed above.

sim-cache and sim-cheetah: These simulators are ideal for fast simulation of caches if the effect of cache performance on execution time is not needed.

sim-cache accepts the following arguments, in addition to the universal arguments.
-cache:dl1 <config> configures a level-one data cache.
-cache:dl2 <config> configures a level-two data cache.
-cache:il1 <config>  configures a level-one instr. cache.
-cache:il2 <config>  configures a level-two instr. cache.
-tlb:dtlb <config>     configures the data TLB.
-tlb:itlb <config>      configures the instruction TLB.
-flush <boolean>      flush all caches on a system call; (<boolean> = 0 | 1 | true | TRUE | false | FALSE).
-icompress               remap SimpleScalar's 64-bit instructions to a 32-bit equivalent in the simulation (i.e., model a machine with 4-word instructions).
-pcstat <stat>         generate a text-based profile.

The cache configuration (<config>) is formatted as follows:
<name>:<nsets>:<bsize>:<assoc>:<repl>
Each of these fields has the following meaning:
<name>    cache name, must be unique.
<nsets>    number of sets in the cache.
<bsize>    block size (for TLBs, use the page size).
<assoc>   associativity of the cache (power of two).
<repl>       replacement policy (l | f | r), where l = LRU, f = FIFO, r = random replacement.
The cache size is therefore the product of <nsets>, <bsize>, and <assoc>.
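The size formula can be checked mechanically. A minimal Python sketch that parses a <config> string and computes the resulting cache size (the parser is illustrative only, not part of the tool set):

```python
# Parse a sim-cache <config> string, <name>:<nsets>:<bsize>:<assoc>:<repl>,
# and compute the total cache size as nsets * bsize * assoc.
def cache_size(config):
    name, nsets, bsize, assoc, repl = config.split(":")
    assert repl in ("l", "f", "r"), "repl must be l (LRU), f (FIFO), or r (random)"
    return int(nsets) * int(bsize) * int(assoc)

# The default L1 data cache, dl1:256:32:1:l, is 256 * 32 * 1 = 8 KB.
print(cache_size("dl1:256:32:1:l"))   # -> 8192
# The default unified L2, ul2:1024:64:4:l, is 1024 * 64 * 4 = 256 KB.
print(cache_size("ul2:1024:64:4:l"))  # -> 262144
```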
To have a unified level in the hierarchy, "point" the instruction cache to the name of the data cache in the corresponding level, as in the following example:
-cache:il1 il1:128:64:1:l
-cache:il2 dl2
-cache:dl1 dl1:256:32:1:l
-cache:dl2 ul2:1024:64:2:l
The defaults used in sim-cache are as follows:
L1 instruction cache:   il1:256:32:1:l    (8 KB)
L1 data cache:            dl1:256:32:1:l    (8 KB)
L2 unified cache:      ul2:1024:64:4:l    (256 KB)
instruction TLB:       itlb:16:4096:4:l    (64 entries)
data TLB:                  dtlb:32:4096:4:l    (128 entries)

sim-cheetah accepts the following command-line arguments, in addition to the universal command line parameters.
-refs [inst | data | unified]  specify which reference stream to analyze.
-C [fa | sa | dm] fully associative, set associative, or direct-mapped cache.
-R [lru | opt]       replacement policy.
-a <sets>           log base 2 minimum bound on number of sets to simulate simultaneously.
-b <sets>           log base 2 maximum bound on set number.
-l <line>             cache line size (in bytes).
-n <assoc>        maximum associativity to analyze (in log base 2).
-in <interval>    cache size interval to report when simulating fully associative caches.
-M <size>         maximum cache size of interest.
-C <size>          cache size for direct-mapped analyses.
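Because -a and -b are log-base-2 bounds, one sim-cheetah run covers every power-of-two set count between 2^a and 2^b simultaneously. A small sketch of the set counts implied by hypothetical bounds -a 7 -b 10:

```python
# For log2 bounds a and b, sim-cheetah analyzes set counts
# 2^a, 2^(a+1), ..., 2^b in a single simulation run.
def set_counts(a, b):
    return [2 ** k for k in range(a, b + 1)]

print(set_counts(7, 10))  # -> [128, 256, 512, 1024]
```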

sim-outorder: This simulator supports out-of-order issue and execution, based on the Register Update Unit (RUU). The RUU scheme uses a reorder buffer to automatically rename registers and hold the results of pending instructions. Each cycle, the reorder buffer retires completed instructions in program order to the architected register file.
The processor's memory system employs a load/store queue. Store values are placed in the queue if the store is speculative. Loads are dispatched to the memory system when the addresses of all previous stores are known. Loads may be satisfied either by the memory system or by an earlier store value residing in the queue, if their addresses match. Speculative loads may generate cache misses, but speculative TLB misses stall the pipeline until the branch condition is known.
sim-outorder runs about an order of magnitude slower than sim-fast. In addition to the universal arguments, sim-outorder uses the following command-line arguments:

Specifying the processor core
-fetch:ifqsize <size>   set the fetch width to be <size> instructions. Must be a power of two. The default is 4.
-fetch:speed <ratio>   set the ratio of the front end speed relative to the execution core (allowing <ratio> times as many instructions to be fetched as decoded per cycle).
-fetch:mplat <cycles>  set the branch misprediction latency. The default is 3 cycles.
-decode:width <insts> set the decode width to be <insts>, which must be a power of two. The default is 4.
-issue:width <insts>    set the maximum issue width in a given cycle. Must be a power of two. The default is 4.
-issue:inorder                force the simulator to use in-order issue. The default is false.
-issue:wrongpath          allow instructions to issue after a misspeculation. The default is true.
-ruu:size <insts>          capacity of the RUU (in instructions). The default is 16.
-lsq:size <insts>          capacity of the load/store queue (in instructions). The default is 8.
-res:ialu <num>            specify number of integer ALUs. The default is 4.
-res:imult <num>          specify number of integer multipliers/dividers. The default is 1.
-res:memports <num>  specify number of L1 cache ports. The default is 2.
-res:fpalu <num>          specify number of floating point ALUs. The default is 4.
-res:fpmult <num>        specify number of floating point multipliers/dividers. The default is 1.

Specifying the memory hierarchy
All of the cache arguments and formats used in sim-cache are also used in sim-outorder, with the following additions:
-cache:dl1lat <cycles> specify the hit latency of the L1 data cache. The default is 1 cycle.
-cache:dl2lat <cycles> specify the hit latency of the L2 data cache. The default is 6 cycles.
-cache:il1lat <cycles>  specify the hit latency of the L1 instruction cache. The default is 1 cycle.
-cache:il2lat <cycles>  specify the hit latency of the L2 instruction cache. The default is 6 cycles.
-mem:lat <1st> <next> specify main memory access latency (first, rest). The defaults are 18 cycles and 2 cycles.
-mem:width <bytes>     specify width of memory bus in bytes. The default is 8 bytes.
-tlb:lat <cycles>            specify latency (in cycles) to service a TLB miss. The default is 30 cycles.
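Under the straightforward chunked reading of -mem:lat, the first bus-width chunk of a block costs <1st> cycles and each remaining chunk costs <next>. A worked example under the defaults (-mem:lat 18 2, -mem:width 8), assuming a 32-byte block (this interpretation is a sketch, not taken from the simulator source):

```python
# Main memory access time under the -mem:lat <1st> <next> model:
# the first bus-width chunk costs `first` cycles and each subsequent
# chunk costs `rest` cycles.
def mem_access_cycles(block_bytes, bus_width=8, first=18, rest=2):
    chunks = (block_bytes + bus_width - 1) // bus_width  # round up
    return first + (chunks - 1) * rest

# A 32-byte block moves as four 8-byte chunks: 18 + 3 * 2 = 24 cycles.
print(mem_access_cycles(32))  # -> 24
```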

Specifying the branch predictor
Branch prediction is specified by choosing the following flag with one of the six subsequent arguments. The default is a bimodal predictor with 2048 entries.
-bpred <type>
nottaken    always predict not taken.
taken         always predict taken.
perfect       perfect predictor.
bimod         bimodal predictor, using a branch target buffer (BTB) with 2-bit counters.
2lev            2-level adaptive predictor.
comb          combined predictor (bimodal and 2-level adaptive).
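The bimodal predictor's 2-bit counters act as saturating counters: by the usual convention, states 0-1 predict not taken and 2-3 predict taken, and each resolved branch moves the counter one step toward its outcome. A minimal sketch of this update rule (illustrative, not sim-outorder's source):

```python
# 2-bit saturating counter: 0,1 -> predict not taken; 2,3 -> predict taken.
def predict(counter):
    return counter >= 2  # True = predict taken

def update(counter, taken):
    if taken:
        return min(counter + 1, 3)  # saturate at strongly taken
    return max(counter - 1, 0)      # saturate at strongly not taken

c = 2                   # weakly taken
c = update(c, True)     # -> 3 (strongly taken)
c = update(c, False)    # -> 2; one mispredict does not flip the prediction
print(predict(c))       # -> True
```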
The predictor-specific arguments are listed below:
-bpred:bimod <size>  set the bimodal predictor table size to be <size> entries.
-bpred:2lev <l1size> <l2size> <hist_size> <xor>  specify the 2-level adaptive predictor.
<l1size>        specifies the number of entries in the first-level table,
<l2size>        specifies the number of entries in the second-level table,
<hist_size>  specifies the history width, and
<xor>            allows you to xor the history and the address in the second level of the predictor.
The default settings for the above four parameters are 1, 1024, 8, and 0, respectively.
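A sketch of how the second-level table can be indexed under these parameters: the history register selects an entry directly, and with <xor> enabled it is xor'ed with low branch-address bits first (the exact bit selection in sim-outorder may differ; the values below are hypothetical):

```python
# Second-level index for a 2-level adaptive predictor: with <xor> = 0
# the history selects the entry directly; with <xor> = 1 the history is
# xor'ed with the branch address before indexing.
def l2_index(history, branch_addr, l2size, use_xor):
    if use_xor:
        return (history ^ branch_addr) % l2size
    return history % l2size

# 8 bits of history and a 1024-entry second-level table (the defaults).
print(l2_index(0b10110011, 0x4A0, 1024, use_xor=0))  # -> 179
print(l2_index(0b10110011, 0x4A0, 1024, use_xor=1))  # -> 19
```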
-bpred:comb <size>              set the meta-table size of the combined predictor to be <size> entries. The default is 1024.
-bpred:ras <size>                  set the return stack size to <size> (0 entries means no return stack). The default is 8 entries.
-bpred:btb <sets> <assoc>  configure the BTB to have <sets> sets and an associativity of <assoc>. The defaults are 512 sets and an associativity of 4.
-bpred:spec_update <stage> allow speculative updates of the branch predictor in the decode or writeback stages (<stage> = [ID|WB]). The default is nonspeculative updates in the commit stage.

Visualization
-pcstat <stat>  record statistic <stat> by text address.
-ptrace <file> <range>  pipeline tracing.
