Update to latest accel-sim:dev #236

Merged (110 commits, Oct 18, 2021)

Commits
b7b9dc0
Fix for bugs in lazy write handling
gvoskuilen Oct 26, 2020
2d73b61
Merge pull request #3 from gvoskuilen/dev
mkhairy Oct 26, 2020
950464e
change address type into ull
allencho1222 Nov 9, 2020
07f77e1
do not truncate 32 MSB bits of the memory address
allencho1222 Nov 9, 2020
132c2ce
added MSHR_HIT
JRPan Nov 15, 2020
85e36b9
Merge pull request #6 from JRPan/add_mshr
mkhairy Nov 22, 2020
29cce50
Merge pull request #4 from allencho1222/patch-1
tgrogers Jan 27, 2021
e6b0608
Merge pull request #5 from allencho1222/patch-2
tgrogers Jan 27, 2021
5ac0b60
Merge branch 'dev' of https://github.com/accel-sim/gpgpu-sim_distribu…
mkhairy Jan 28, 2021
f3a0077
bug fix was_writeback_sent
JRPan Feb 12, 2021
67f89ab
Merge pull request #7 from JRPan/fix-was_writeback_sent
mkhairy Feb 16, 2021
51d9925
fix hash function
JRPan Feb 15, 2021
2f96645
Merge pull request #9 from JRPan/fix-cache-hash
mkhairy Feb 19, 2021
b430b36
adding new RTX 3070 config
mkhairy Feb 25, 2021
deb5eb5
Merge branch 'dev' of https://github.com/accel-sim/gpgpu-sim_distribu…
mkhairy Feb 25, 2021
09f10eb
change the L1 cache policy to be on-miss based on recent ubench
mkhairy Mar 25, 2021
1ee03f0
change the L1 cache policy based on recent ubench
mkhairy Mar 25, 2021
5533464
partition CU allocation, add prints
barnes88 May 9, 2021
645a0ea
minor fixes
barnes88 May 9, 2021
46423a2
useful print statement
barnes88 May 9, 2021
b672880
validated collector unit partitioning based on scheduler
barnes88 May 9, 2021
fa76ab4
sub core model dispatches only to assigned exec pipelines
barnes88 May 10, 2021
c905726
minor fix accessing du
barnes88 May 10, 2021
a72b84e
fix find_ready reg_id
barnes88 May 10, 2021
6ad5bad
dont need du id
barnes88 May 10, 2021
9219236
remove prints
barnes88 May 10, 2021
52a890c
need at least 1 cu per sched for sub_core model, fix find_ready() reg_id
barnes88 May 11, 2021
2db9120
move reg_id calc to cu object init
barnes88 May 11, 2021
4825a1d
fix assert
barnes88 May 11, 2021
e2b410d
clean up redundant method args
barnes88 May 11, 2021
9c0156b
more cleanup
barnes88 May 11, 2021
28c3c94
cleanup find_ready
barnes88 May 11, 2021
28d0565
partition issue() in the shader execute stage
barnes88 May 11, 2021
08ad045
Merge branch 'sub_core_devel' of github.com:barnes88/gpgpu-sim_distri…
barnes88 May 11, 2021
ec55c68
minor fixes, pure virtual calls
barnes88 May 11, 2021
71455d8
add prints for ex issue validation
barnes88 May 12, 2021
640674b
issue function needed to be constrained
barnes88 May 12, 2021
9b6af84
fix print, move simd::issue() impl to .cc file
barnes88 May 12, 2021
6ae2391
fix prints / segfault
barnes88 May 12, 2021
a450d74
remove prints
barnes88 May 12, 2021
6a09900
rm unnecessary instr get
barnes88 May 12, 2021
5945d70
specialized unit should be partitioned too
barnes88 May 13, 2021
92c814a
run changes through clang-format
barnes88 May 13, 2021
db10197
rm old dirs in format-code.sh
barnes88 May 13, 2021
c526262
fix adaptive cache cfg option parsing data type
JRPan May 13, 2021
c51350d
Merge pull request #13 from JRPan/fix-config-parser
tgrogers May 13, 2021
f2a7d9c
fixing streaming cache based on recent ubench
mkhairy May 15, 2021
1347395
adding the missing xoring hashing
mkhairy May 15, 2021
6319e31
moving reg file read to read_operands function as before
mkhairy May 15, 2021
d89f9f7
Merge branch 'dev' of https://github.com/accel-sim/gpgpu-sim_distribu…
mkhairy May 17, 2021
c94b883
code refactoring cycle()
mkhairy May 17, 2021
7d9a12f
specialized unit get_ready() was missing subcore
barnes88 May 17, 2021
585dcf5
Merge pull request #12 from barnes88/sub_core_devel
mkhairy May 18, 2021
6121a88
Merge branch 'dev' of https://github.com/accel-sim/gpgpu-sim_distribu…
mkhairy May 18, 2021
0f30305
dirty counter added. No incrementing yet
JRPan Feb 15, 2021
615f173
store ack for new warps
JRPan Feb 20, 2021
ad72041
sending cache block byte mask
JRPan Mar 2, 2021
bb19c0c
update mf breakdown at L2
JRPan Mar 2, 2021
e05fa4a
little bug fix - flush()
JRPan Mar 2, 2021
804ee90
sending byte mask for all policies
JRPan Mar 8, 2021
b3dab5e
set byte mask on fill
JRPan Mar 8, 2021
40077df
solve deadlock for non-sectored cache configs
JRPan Mar 8, 2021
64bf6fd
dirty counter not resetting after kernel finish
JRPan Mar 18, 2021
a374b33
remove MSHR_HIT from cache total access
JRPan Mar 26, 2021
f6fb56b
check sector readable only on reads
JRPan Apr 6, 2021
994fb19
reset dirty counter
JRPan May 4, 2021
7306930
remove runtime check of dirty counter
JRPan May 12, 2021
0601354
Add WT to lazy_fetch_on_read
JRPan May 18, 2021
f783351
new configs - adaptive cache and cache write ratio
JRPan May 17, 2021
a2b1b1c
adaptive cache - update
JRPan May 17, 2021
f70f5d6
re-wording/formatting
JRPan May 19, 2021
4a762a9
formatting again
JRPan May 19, 2021
4c354eb
minor improvements
JRPan May 19, 2021
f27da22
Use cache config multiplier when possible
JRPan May 19, 2021
0e4f12a
Merge pull request #14 from JRPan/spring-2021-all
mkhairy May 19, 2021
1875132
Merge branch 'dev' into adaptive-cache
JRPan May 19, 2021
2b2b6a2
Merge pull request #15 from JRPan/adaptive-cache
mkhairy May 19, 2021
14f22bc
add checking on spec unit in subcore
mkhairy May 19, 2021
3363536
Merge branch 'dev' of https://github.com/accel-sim/gpgpu-sim_distribu…
mkhairy May 19, 2021
604baaf
fixing the failing of merging
mkhairy May 19, 2021
a2ba2f5
updating config files with right adaptive cache parameters
mkhairy May 19, 2021
b63d19a
updating config files
mkhairy May 19, 2021
e3d186b
changing @sets to 4 based on recent ubenches
mkhairy May 19, 2021
24ffab2
moving shmem option to the base class and change the code to accept t…
mkhairy May 20, 2021
fedcde3
moving the unified size from the base class config to l1 config
mkhairy May 20, 2021
8aee56d
rename set_dirty_byte_mask
mkhairy May 20, 2021
b466afe
eliminate redundant code in gpu-cache.h
mkhairy May 20, 2021
7fac247
change L1 cache config in Volta+ to be write-through and write-alloca…
mkhairy May 20, 2021
0d33266
oops delete this config, it should not be pushed
mkhairy May 20, 2021
2aef4e3
Merge pull request #16 from mkhairy/dev
mkhairy May 20, 2021
c8eca04
fix merge conflict
JRPan May 17, 2021
f665ad5
L2 breakdown - reuse mf allocator
JRPan May 21, 2021
b814c52
cast to float - dirty line percentage
JRPan May 21, 2021
ce4f20f
Merge pull request #17 from JRPan/rewrite-l2-breakdown
mkhairy May 21, 2021
3b75d8f
Update version
mkhairy May 22, 2021
7e48560
Update CHANGES
mkhairy May 22, 2021
b6409b4
Update README.md
mkhairy May 22, 2021
6c9e13d
format code
mkhairy May 23, 2021
778962e
updating the configs based on the tuner output
mkhairy May 26, 2021
3eea014
changing kernel latency
mkhairy May 26, 2021
6ad461a
fixing configs
mkhairy May 27, 2021
110aeb1
rewrite shmem_option parsing
JRPan May 31, 2021
04462cb
update readable
JRPan Jun 3, 2021
e9d781a
minor improvements
JRPan Jun 3, 2021
0f088dc
correct dirty counter
JRPan Jun 16, 2021
3cf24b8
WT in lazy fetch on read
JRPan Jun 23, 2021
b1befa8
Adding restricted round robin scheduler
JRPan Aug 16, 2021
b658147
better oc selecting when sub core enabled
JRPan Aug 16, 2021
a8256e5
Update volta to use lrr scheduler
JRPan Aug 23, 2021
84c4f46
Ampere and Turing also lrr scheduler
JRPan Aug 23, 2021
12 changes: 12 additions & 0 deletions CHANGES
@@ -1,4 +1,16 @@
LOG:
Version 4.1.0 versus 4.0.0
-Features:
1- Supporting L1 write-allocate with sub-sector writing policy as in Volta+ hardware, and changing the Volta+ cards config to make L1 write-allocate with write-through
2- Making the L1 adaptive cache policy to be configurable
3- Adding Ampere RTX 3060 config files
-Bugs:
1- Fixing L1 bank hash function bug
2- Fixing L1 read hit counters in gpgpu-sim to match nvprof, to achieve more accurate L1 correlation with the HW
3- Fixing bugs in lazy write handling, thanks to Gwendolyn Voskuilen from Sandia labs for this fix
4- Fixing the backend pipeline for sub_core model
5- Fixing Memory stomp bug at the shader_config
6- Some code refactoring:
Version 4.0.0 (development branch) versus 3.2.3
-Front-End:
1- Support .nc cache modifier and __ldg function to access the read-only L1D cache
9 changes: 7 additions & 2 deletions README.md
@@ -11,22 +11,26 @@ This version of GPGPU-Sim has been tested with a subset of CUDA version 4.2,
Please see the copyright notice in the file COPYRIGHT distributed with this
release in the same directory as this file.

GPGPU-Sim 4.0 is compatible with Accel-Sim simulation framework. With the support
of Accel-Sim, GPGPU-Sim 4.0 can run NVIDIA SASS traces (trace-based simulation)
generated by NVIDIA's dynamic binary instrumentation tool (NVBit). For more information
about Accel-Sim, see [https://accel-sim.github.io/](https://accel-sim.github.io/)

If you use GPGPU-Sim 4.0 in your research, please cite:

Mahmoud Khairy, Zhesheng Shen, Tor M. Aamodt, Timothy G Rogers.
Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling.
In proceedings of the 47th IEEE/ACM International Symposium on Computer Architecture (ISCA),
May 29 - June 3, 2020.

If you use CuDNN or PyTorch support, checkpointing or our new debugging tool for functional
If you use CuDNN or PyTorch support (execution-driven simulation), checkpointing or our new debugging tool for functional
simulation errors in GPGPU-Sim for your research, please cite:

Jonathan Lew, Deval Shah, Suchita Pati, Shaylin Cattell, Mengchi Zhang, Amruth Sandhupatla,
Christopher Ng, Negar Goli, Matthew D. Sinclair, Timothy G. Rogers, Tor M. Aamodt
Analyzing Machine Learning Workloads Using a Detailed GPU Simulator, arXiv:1811.08933,
https://arxiv.org/abs/1811.08933


If you use the Tensor Core model in GPGPU-Sim or GPGPU-Sim's CUTLASS Library
for your research please cite:

@@ -261,6 +265,7 @@ To clean the docs run
The documentation resides at doc/doxygen/html.

To run Pytorch applications with the simulator, install the modified Pytorch library as well by following instructions [here](https://github.com/gpgpu-sim/pytorch-gpgpu-sim).

## Step 3: Run

Before we run, we need to make sure the application's executable file is dynamically linked to CUDA runtime library. This can be done during compilation of your program by introducing the nvcc flag "--cudart shared" in makefile (quotes should be excluded).
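For example, a minimal compile invocation that satisfies this requirement might look like the following (the source and output names are placeholders, not files shipped with the simulator):

    # link the application against the shared CUDA runtime so the simulator's
    # libcudart.so can be substituted at run time
    nvcc --cudart shared -o myapp myapp.cu

Running ldd on the resulting binary should then list libcudart as a dynamically resolved dependency.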
119 changes: 57 additions & 62 deletions configs/tested-cfgs/SM75_RTX2060/gpgpusim.config
@@ -1,8 +1,3 @@
# This config models the Turing RTX 2060
# For more info about turing architecture:
# 1- https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
# 2- "RTX on—The NVIDIA Turing GPU", IEEE MICRO 2020

# functional simulator specification
-gpgpu_ptx_instruction_classification 0
-gpgpu_ptx_sim_mode 0
@@ -14,6 +9,7 @@
-gpgpu_runtime_sync_depth_limit 2
-gpgpu_runtime_pending_launch_count_limit 2048
-gpgpu_kernel_launch_latency 5000
-gpgpu_TB_launch_latency 0

# Compute Capability
-gpgpu_compute_capability_major 7
@@ -27,91 +23,93 @@
-gpgpu_n_clusters 30
-gpgpu_n_cores_per_cluster 1
-gpgpu_n_mem 12
-gpgpu_n_sub_partition_per_mchannel 2
-gpgpu_n_sub_partition_per_mchannel 2

# volta clock domains
# clock domains
#-gpgpu_clock_domains <Core Clock>:<Interconnect Clock>:<L2 Clock>:<DRAM Clock>
-gpgpu_clock_domains 1365.0:1365.0:1365.0:3500.0
# boost mode
# -gpgpu_clock_domains 1680.0:1680.0:1680.0:3500.0
-gpgpu_clock_domains 1365:1365:1365:3500.5

# shader core pipeline config
-gpgpu_shader_registers 65536
-gpgpu_registers_per_block 65536
-gpgpu_occupancy_sm_number 75

# This implies a maximum of 32 warps/SM
-gpgpu_shader_core_pipeline 1024:32
-gpgpu_shader_cta 32
-gpgpu_shader_core_pipeline 1024:32
-gpgpu_shader_cta 16
-gpgpu_simd_model 1

# Pipeline widths and number of FUs
# ID_OC_SP,ID_OC_DP,ID_OC_INT,ID_OC_SFU,ID_OC_MEM,OC_EX_SP,OC_EX_DP,OC_EX_INT,OC_EX_SFU,OC_EX_MEM,EX_WB,ID_OC_TENSOR_CORE,OC_EX_TENSOR_CORE
## Turing has 4 SP SIMD units, 4 INT units, 4 SFU units, 8 Tensor core units
## We need to scale the number of pipeline registers to be equal to the number of SP units
-gpgpu_pipeline_widths 4,0,4,4,4,4,0,4,4,4,8,4,4
-gpgpu_pipeline_widths 4,4,4,4,4,4,4,4,4,4,8,4,4
-gpgpu_num_sp_units 4
-gpgpu_num_sfu_units 4
-gpgpu_num_dp_units 4
-gpgpu_num_int_units 4
-gpgpu_tensor_core_avail 1
-gpgpu_num_tensor_core_units 4
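# As a purely positional restatement of the comment above (13 stage names,
# 13 widths), the new value reads as follows:
# -gpgpu_pipeline_widths 4,4,4,4,4,4,4,4,4,4,8,4,4
#   ID_OC_SP=4   ID_OC_DP=4   ID_OC_INT=4   ID_OC_SFU=4   ID_OC_MEM=4
#   OC_EX_SP=4   OC_EX_DP=4   OC_EX_INT=4   OC_EX_SFU=4   OC_EX_MEM=4
#   EX_WB=8      ID_OC_TENSOR_CORE=4        OC_EX_TENSOR_CORE=4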

# Instruction latencies and initiation intervals
# "ADD,MAX,MUL,MAD,DIV"
# All Div operations are executed on SFU unit
-ptx_opcode_latency_int 4,13,4,5,145,32
-ptx_opcode_initiation_int 2,2,2,2,8,4
-ptx_opcode_latency_fp 4,13,4,5,39
-ptx_opcode_latency_int 4,4,4,4,21
-ptx_opcode_initiation_int 2,2,2,2,2
-ptx_opcode_latency_fp 4,4,4,4,39
-ptx_opcode_initiation_fp 2,2,2,2,4
-ptx_opcode_latency_dp 8,19,8,8,330
-ptx_opcode_initiation_dp 4,4,4,4,130
-ptx_opcode_latency_sfu 100
-ptx_opcode_latency_dp 64,64,64,64,330
-ptx_opcode_initiation_dp 64,64,64,64,130
-ptx_opcode_latency_sfu 21
-ptx_opcode_initiation_sfu 8
-ptx_opcode_latency_tesnor 64
-ptx_opcode_initiation_tensor 64

# Turing has four schedulers per core
-gpgpu_num_sched_per_core 4
# Greedy then oldest scheduler
-gpgpu_scheduler gto
## In Turing, a warp scheduler can issue 1 inst per cycle
-gpgpu_max_insn_issue_per_warp 1
-gpgpu_dual_issue_diff_exec_units 1

# shared memory bankconflict detection
-gpgpu_shmem_num_banks 32
-gpgpu_shmem_limited_broadcast 0
-gpgpu_shmem_warp_parts 1
-gpgpu_coalesce_arch 75

# Trung has sub core model, in which each scheduler has its own register file and EUs
# sub core model: in which each scheduler has its own register file and EUs
# i.e. schedulers are isolated
-gpgpu_sub_core_model 1
# disable specialized operand collectors and use generic operand collectors instead
-gpgpu_enable_specialized_operand_collector 0
-gpgpu_operand_collector_num_units_gen 8
-gpgpu_operand_collector_num_in_ports_gen 8
-gpgpu_operand_collector_num_out_ports_gen 8
# turing has 8 banks dual-port, 4 schedulers, two banks per scheduler
# we increase #banks to 16 to mitigate the effect of Regisrer File Cache (RFC) which we do not implement in the current version
-gpgpu_num_reg_banks 16
# register banks
-gpgpu_num_reg_banks 8
-gpgpu_reg_file_port_throughput 2

# warp scheduling
-gpgpu_num_sched_per_core 4
-gpgpu_scheduler lrr
# a warp scheduler issue mode
-gpgpu_max_insn_issue_per_warp 1
-gpgpu_dual_issue_diff_exec_units 1

## L1/shared memory configuration
# <nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>:<set_index_fn>,<mshr>:<N>:<merge>,<mq>:**<fifo_entry>
# ** Optional parameter - Required when mshr_type==Texture Fifo
-gpgpu_adaptive_cache_config 0
# In adaptive cache, we adaptively assign the remaining shared memory to L1 cache
# For more info, see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-7-x
-gpgpu_adaptive_cache_config 1
-gpgpu_shmem_option 32,64
-gpgpu_unified_l1d_size 96
# L1 cache configuration
-gpgpu_l1_banks 4
-gpgpu_cache:dl1 S:1:128:512,L:L:s:N:L,A:256:8,16:0,32
-gpgpu_shmem_size 65536
-gpgpu_shmem_sizeDefault 65536
-gpgpu_shmem_per_block 65536
-gpgpu_cache:dl1 S:4:128:64,L:T:m:L:L,A:256:32,16:0,32
-gpgpu_l1_latency 32
-gpgpu_gmem_skip_L1D 0
-gpgpu_n_cluster_ejection_buffer_size 32
-gpgpu_l1_latency 20
-gpgpu_smem_latency 20
-gpgpu_flush_l1_cache 1
-gpgpu_n_cluster_ejection_buffer_size 32
-gpgpu_l1_cache_write_ratio 25
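# The new dl1 string can be read field by field against the template documented
# at the top of this block; the expansions of the single-letter codes below are
# assumptions inferred from that template and from this PR's change log, not
# something the diff states explicitly:
# -gpgpu_cache:dl1 S:4:128:64,L:T:m:L:L,A:256:32,16:0,32
#   S           cache type (assumed: S = sectored, N = normal)
#   4:128:64    <nsets>:<bsize>:<assoc> -> 4 sets x 128 B lines x 64 ways = 32 KB of data store
#   L:T:m:L:L   <rep>:<wr>:<alloc>:<wr_alloc>:<set_index_fn>
#               (assumed: LRU, write-through, on-miss allocation, lazy-fetch-on-read, plus the set hash)
#   A:256:32    <mshr>:<N>:<merge> -> MSHR with 256 entries, up to 32 merged requests each (assumed)
#   16:0,32     miss queue and trailing fields per <mq>:**<fifo_entry> (assumed)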

# 64 sets, each 128 bytes 16-way for each memory sub partition (128 KB per memory sub partition). This gives us 3MB L2 cache
# shared memory configuration
-gpgpu_shmem_size 65536
-gpgpu_shmem_sizeDefault 65536
-gpgpu_shmem_per_block 49152
-gpgpu_smem_latency 30
# shared memory bankconflict detection
-gpgpu_shmem_num_banks 32
-gpgpu_shmem_limited_broadcast 0
-gpgpu_shmem_warp_parts 1
-gpgpu_coalesce_arch 75

# L2 cache
-gpgpu_cache:dl2 S:64:128:16,L:B:m:L:P,A:192:4,32:0,32
-gpgpu_cache:dl2_texture_only 0
-gpgpu_dram_partition_queues 64:64:64:64
@@ -122,44 +120,41 @@
-gpgpu_cache:il1 N:64:128:16,L:R:f:N:L,S:2:48,4
-gpgpu_inst_fetch_throughput 4
# 128 KB Tex
# Note, TEX is deprected in Volta, It is used for legacy apps only. Use L1D cache instead with .nc modifier or __ldg mehtod
# Note, TEX is deprected since Volta, It is used for legacy apps only. Use L1D cache instead with .nc modifier or __ldg mehtod
-gpgpu_tex_cache:l1 N:4:128:256,L:R:m:N:L,T:512:8,128:2
# 64 KB Const
-gpgpu_const_cache:l1 N:128:64:8,L:R:f:N:L,S:2:64,4
-gpgpu_perfect_inst_const_cache 1

# interconnection
#-network_mode 1
#-inter_config_file config_turing_islip.icnt
# use built-in local xbar
-network_mode 2
-icnt_in_buffer_limit 512
-icnt_out_buffer_limit 512
-icnt_subnets 2
-icnt_arbiter_algo 1
-icnt_flit_size 40
-icnt_arbiter_algo 1

# memory partition latency config
-gpgpu_l2_rop_latency 160
-dram_latency 100
-gpgpu_l2_rop_latency 194
-dram_latency 96

# dram model config
# dram sched config
-gpgpu_dram_scheduler 1
-gpgpu_frfcfs_dram_sched_queue_size 64
-gpgpu_dram_return_queue_size 192

# Turing has GDDR6
# http://monitorinsider.com/GDDR6.html
# dram model config
-gpgpu_n_mem_per_ctrlr 1
-gpgpu_dram_buswidth 2
-gpgpu_dram_burst_length 16
-dram_data_command_freq_ratio 4
-gpgpu_mem_address_mask 1
-gpgpu_mem_addr_mapping dramid@8;00000000.00000000.00000000.00000000.0000RRRR.RRRRRRRR.RBBBCCCC.BCCSSSSS

# Use the same GDDR5 timing, scaled to 3500MHZ
-gpgpu_dram_timing_opt "nbk=16:CCD=4:RRD=10:RCD=20:RAS=50:RP=20:RC=62:
CL=20:WL=8:CDLR=9:WR=20:nbkgrp=4:CCDL=4:RTPL=4"
# Mem timing
-gpgpu_dram_timing_opt nbk=16:CCD=4:RRD=12:RCD=24:RAS=55:RP=24:RC=78:CL=24:WL=8:CDLR=10:WR=24:nbkgrp=4:CCDL=6:RTPL=4
-dram_dual_bus_interface 0

# select lower bits for bnkgrp to increase bnkgrp parallelism
-dram_bnk_indexing_policy 0
@@ -174,7 +169,7 @@
-enable_ptx_file_line_stats 1
-visualizer_enabled 0

# power model configs, disable it untill we create a real energy model for Volta
# power model configs, disable it untill we create a real energy model
-power_simulation_enabled 0

# tracing functionality
23 changes: 13 additions & 10 deletions configs/tested-cfgs/SM7_QV100/gpgpusim.config
@@ -94,12 +94,12 @@
-gpgpu_shmem_num_banks 32
-gpgpu_shmem_limited_broadcast 0
-gpgpu_shmem_warp_parts 1
-gpgpu_coalesce_arch 60
-gpgpu_coalesce_arch 70

# Volta has four schedulers per core
-gpgpu_num_sched_per_core 4
# Greedy then oldest scheduler
-gpgpu_scheduler gto
-gpgpu_scheduler lrr
## In Volta, a warp scheduler can issue 1 inst per cycle
-gpgpu_max_insn_issue_per_warp 1
-gpgpu_dual_issue_diff_exec_units 1
@@ -113,17 +113,21 @@
# For more info, see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-7-x
# disable this mode in case of multi kernels/apps execution
-gpgpu_adaptive_cache_config 1
# Volta unified cache has four banks
-gpgpu_shmem_option 0,8,16,32,64,96
-gpgpu_unified_l1d_size 128
# L1 cache configuration
-gpgpu_l1_banks 4
-gpgpu_cache:dl1 S:1:128:256,L:L:s:N:L,A:256:8,16:0,32
-gpgpu_cache:dl1 S:4:128:64,L:T:m:L:L,A:512:8,16:0,32
-gpgpu_l1_cache_write_ratio 25
-gpgpu_l1_latency 20
-gpgpu_gmem_skip_L1D 0
-gpgpu_flush_l1_cache 1
-gpgpu_n_cluster_ejection_buffer_size 32
# shared memory configuration
-gpgpu_shmem_size 98304
-gpgpu_shmem_sizeDefault 98304
-gpgpu_shmem_per_block 65536
-gpgpu_gmem_skip_L1D 0
-gpgpu_n_cluster_ejection_buffer_size 32
-gpgpu_l1_latency 20
-gpgpu_smem_latency 20
-gpgpu_flush_l1_cache 1
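# As an illustration of the adaptive split described for these configs (the
# part of the unified 128 KB that a kernel does not claim as shared memory is
# assumed to be handed to L1D): if a kernel's shared-memory demand selects the
# 32 KB option from -gpgpu_shmem_option, L1D gets 128 - 32 = 96 KB; selecting
# the 96 KB option leaves 32 KB for L1D.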

# 32 sets, each 128 bytes 24-way for each memory sub partition (96 KB per memory sub partition). This gives us 6MB L2 cache
-gpgpu_cache:dl2 S:32:128:24,L:B:m:L:P,A:192:4,32:0,32
@@ -201,5 +205,4 @@
# tracing functionality
#-trace_enabled 1
#-trace_components WARP_SCHEDULER,SCOREBOARD
#-trace_sampling_core 0

#-trace_sampling_core 0
18 changes: 11 additions & 7 deletions configs/tested-cfgs/SM7_TITANV/gpgpusim.config
@@ -100,7 +100,7 @@
# Volta has four schedulers per core
-gpgpu_num_sched_per_core 4
# Greedy then oldest scheduler
-gpgpu_scheduler gto
-gpgpu_scheduler lrr
## In Volta, a warp scheduler can issue 1 inst per cycle
-gpgpu_max_insn_issue_per_warp 1
-gpgpu_dual_issue_diff_exec_units 1
Expand All @@ -114,17 +114,21 @@
# For more info, see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-7-x
# disable this mode in case of multi kernels/apps execution
-gpgpu_adaptive_cache_config 1
# Volta unified cache has four banks
-gpgpu_shmem_option 0,8,16,32,64,96
-gpgpu_unified_l1d_size 128
# L1 cache configuration
-gpgpu_l1_banks 4
-gpgpu_cache:dl1 S:1:128:256,L:L:s:N:L,A:256:8,16:0,32
-gpgpu_cache:dl1 S:4:128:64,L:T:m:L:L,A:512:8,16:0,32
-gpgpu_l1_cache_write_ratio 25
-gpgpu_gmem_skip_L1D 0
-gpgpu_l1_latency 20
-gpgpu_flush_l1_cache 1
-gpgpu_n_cluster_ejection_buffer_size 32
# shared memory configuration
-gpgpu_shmem_size 98304
-gpgpu_shmem_sizeDefault 98304
-gpgpu_shmem_per_block 65536
-gpgpu_gmem_skip_L1D 0
-gpgpu_n_cluster_ejection_buffer_size 32
-gpgpu_l1_latency 20
-gpgpu_smem_latency 20
-gpgpu_flush_l1_cache 1

# 32 sets, each 128 bytes 24-way for each memory sub partition (96 KB per memory sub partition). This gives us 4.5MB L2 cache
-gpgpu_cache:dl2 S:32:128:24,L:B:m:L:P,A:192:4,32:0,32