SNC and membw
Typically, to populate a cache one simply runs: ./membw -c <core_number> -b <MB/s> --write
The tool allocates a chunk of memory twice the size of the socket's physical L3 cache, which is enough to populate 100% of the cache.
The situation changes when SNC (Sub-NUMA Clustering) is enabled. The membw tool allocates memory through the standard C library, so it is subject to all of the OS memory placement policies, including NUMA awareness. A NUMA-aware OS allocates only addresses local to the SNC domain of the core running membw, so only the cache slices local to that SNC domain are populated. With SNC-2, 50% of the cache slices are populated; with SNC-3, 33%; with SNC-4, only 25%. Consequently, a workload that needs 100% socket cache occupancy must be distributed across all the SNC domains on the socket, using at least one CPU core from each domain.
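The percentages above follow from splitting the socket's cache slices evenly among the SNC domains. A minimal shell sketch of that arithmetic, assuming an even split and integer rounding (the function name `snc_populated_percent` is made up for illustration and is not part of membw):

```shell
#!/bin/sh
# Share of a socket's L3 slices that a single membw instance can populate
# when the socket is split into $1 SNC domains (assumes an even split;
# integer division, so 3 domains gives 33).
snc_populated_percent() {
    echo $(( 100 / $1 ))
}

for n in 2 3 4; do
    echo "SNC-$n: $(snc_populated_percent "$n")% of the socket L3 per domain"
done
```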
Example of membw usage for 100% cache population in the presence of SNC-2
System configuration excerpts obtained with lscpu:
CPU(s): 256
On-line CPU(s) list: 0-255
Vendor ID: GenuineIntel
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
Caches (sum of all):
L1d: 6 MiB (128 instances)
L1i: 4 MiB (128 instances)
L2: 256 MiB (128 instances)
L3: 640 MiB (2 instances)
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s): 0-31,128-159
NUMA node1 CPU(s): 32-63,160-191
NUMA node2 CPU(s): 64-95,192-223
NUMA node3 CPU(s): 96-127,224-255
The test system has 2 sockets and 256 logical CPUs in total, or 128 per socket. Consider socket 0. It contains two NUMA nodes, 0 and 1, which correspond to SNC domains 0 and 1. SNC domain 0 has two ranges of CPUs: 0-31 and 128-159. SNC domain 1 has two ranges of CPUs: 32-63 and 160-191. The system L3 cache size is 640 MiB: 320 MiB per socket, 160 MiB per SNC domain. To reach 100% cache occupancy on physical socket 0, we have to pick one core from each of SNC domains 0 and 1 and run membw on each of them. Let's choose core 4 from SNC domain 0 and core 32 from SNC domain 1. Note that binding the memory allocation to both SNC domains 0 and 1 does not make a single membw instance populate more than 160 MiB, because only addresses allocated on the SNC domain local to the core populate the cache:
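Choosing one core per SNC domain can be read straight off the lscpu output. The sketch below is one possible way to do it, assuming the "NUMA nodeN CPU(s):" line format shown in the excerpt; the helper name `first_cpu_per_node` is hypothetical, and it simply takes the first CPU of each node (any core in the domain would do, such as core 4 in the text).

```shell
#!/bin/sh
# Read lscpu-style text on stdin and print "<node> <first CPU>" for each
# NUMA node. Assumes lines of the form "NUMA node0 CPU(s): 0-31,128-159".
first_cpu_per_node() {
    awk -F': *' '/^[ \t]*NUMA node[0-9]+ CPU\(s\)/ {
        node = $1
        sub(/^[ \t]*NUMA node/, "", node)   # strip the prefix, leaving "0 CPU(s)"
        sub(/ CPU\(s\)$/, "", node)         # strip the suffix, leaving "0"
        split($2, ranges, ",")              # "0-31,128-159" -> "0-31", "128-159"
        split(ranges[1], bounds, "-")       # "0-31" -> "0", "31"
        print node, bounds[1]
    }'
}

# Fed with the excerpt from this page instead of a live lscpu call.
first_cpu_per_node <<'EOF'
NUMA node(s): 4
NUMA node0 CPU(s): 0-31,128-159
NUMA node1 CPU(s): 32-63,160-191
NUMA node2 CPU(s): 64-95,192-223
NUMA node3 CPU(s): 96-127,224-255
EOF
```

On a live system the same function could be fed with `lscpu | first_cpu_per_node`. Nodes 0 and 1 belong to socket 0 in this configuration.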
cache-qos-source/tools/membw$ numactl --membind=0,1 ./membw -c 4 -b 2400 --write
cache-qos-source/tools/membw$ numactl --membind=0,1 ./membw -c 32 -b 2400 --write
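Presumably both instances need to be running at the same time for the two domains to be populated together. The sketch below only prints one command line per chosen core and executes nothing; the helper name `membw_commands` is hypothetical, and it uses only the numactl and membw options already shown above.

```shell
#!/bin/sh
# Print (do not run) one membw command line per core.
# $1 = NUMA nodes for --membind, $2 = bandwidth in MB/s, rest = core numbers.
membw_commands() {
    nodes=$1
    bandwidth=$2
    shift 2
    for core in "$@"; do
        echo "numactl --membind=$nodes ./membw -c $core -b $bandwidth --write"
    done
}

# One core from each SNC domain of socket 0, as chosen in the text.
membw_commands 0,1 2400 4 32
```

Each printed line can then be started in its own terminal from the membw directory.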
Measure the cache occupancy on the cores under load:
$ LD_LIBRARY_PATH=lib cache-qos-source/pqos/pqos --iface=msr -r snc-total -m llc:4,32
The command will show 160 MiB of cache occupancy for each of the cores 4 and 32.
Summing the values gives ~100% cache occupancy in SNC-total mode (the socket cache size is 320 MiB on this machine).
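The summation can be checked with a few lines of shell. This is a sketch using the figures quoted above rather than fresh measurements, and the function name `occupancy_percent` is made up for illustration:

```shell
#!/bin/sh
# Total occupancy as an integer percentage of the socket L3 size.
# $1 = socket L3 size in MiB, remaining args = per-core occupancy in MiB.
occupancy_percent() {
    socket_mib=$1
    shift
    total=0
    for mib in "$@"; do
        total=$(( total + mib ))
    done
    echo $(( total * 100 / socket_mib ))
}

# Cores 4 and 32 at 160 MiB each against a 320 MiB socket cache.
echo "socket 0 occupancy: $(occupancy_percent 320 160 160)%"
```

With a single instance (one 160 MiB value) the same function reports 50, which matches the SNC-2 case described earlier.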