seeing more than one cpu id on one output stream with MAP_TO_RECORDED_OUTPUT #6354

prasun3 · 2023-10-10T06:40:24Z

Describe the bug

See https://groups.google.com/g/DynamoRIO-Users/c/Dsf4As64r7o

In some of my tests, I am seeing more than one cpu id on one output stream with MAP_TO_RECORDED_OUTPUT.

The value returned by stream->get_output_cpuid matches the initial CPU_ID marker values, but at some point on some streams, we see other CPU_ID values as shown below.

Example:

Output #5 processing CPU #7                   (from get_stream_cpuid)
Output #5 saw CPU_ID marker: 7
Output #5 saw CPU_ID marker: 7
Output #5 saw CPU_ID marker: 7
...
Output #5 saw CPU_ID marker: 7
Output #5 saw CPU_ID marker: 7
Output #5 saw CPU_ID marker: 5

To Reproduce
Steps to reproduce the behavior:

Collect a trace with drrun -t drcachesim -offline
Use the scheduler infrastructure with the MAP_TO_RECORDED_OUTPUT option
We see multiple CPU IDs on the same output stream which is not expected

Please also answer these questions:

What happens when you run without any client?
- Not applicable
What happens when you run with debug build ("-debug" flag to drrun/drconfig/drinject)?
- No difference

Expected behavior
With the MAP_TO_RECORDED_OUTPUT option, each output stream maps to exactly one CPU

Screenshots or Pasted Text

Versions

What version of DynamoRIO are you using?
- Commit 910ccb2
Does the latest build from https://github.com/DynamoRIO/dynamorio/releases solve the problem?
- No
What operating system version are you running on? ("Windows 10" is not sufficient: give the release number.)
- Ubuntu 18
Is your application 32-bit or 64-bit?
- 64-bit

Additional context

The text was updated successfully, but these errors were encountered:

Added check to verify that output stream's CPU ID matches the CPU ID markers seen in the trace. Issue: #6354

Issue: #6354

derekbruening · 2023-10-10T20:50:56Z

It looks like this is the culprit, this thread with zero instructions:

$ bin64/drrun -t drcachesim -simulator_type view -only_thread 85300 -indir ../src/clients/drcachesim/tests/drmemtrace.schedtest.x64
Output format:
<--record#-> <--instr#->: <---tid---> <record details>
------------------------------------------------------------
           1           0:       85300 <marker: version 6>
           2           0:       85300 <marker: filetype 0xe40>
           3           0:       85300 <marker: cache line size 64>
           4           0:       85300 <marker: chunk instruction count 10000000>
           5           0:       85300 <marker: page size 4096>
           6           0:       85300 <marker: timestamp 13341391202394706>
           7           0:       85300 <marker: tid 85300 on core 1>
           8           0:       85300 <marker: function==syscall #202>
           9           0:       85300 <marker: function return value 0x0>
          10           0:       85300 <marker: timestamp 13341391214912820>
          11           0:       85300 <marker: tid 85300 on core 0>
          12           0:       85300 <thread 85300 exited>
View tool results:
              0 : total instructions

All the scheduling is instruction-count-based so this degenerate thread has 2 entries both at instruction count 0:

$ unzip -p ../src/clients/drcachesim/tests/drmemtrace.schedtest.x64/cpu_schedule.bin.zip 0 | od -t d8 -w32
0000000                85303    13341391204829725                    0                    0
0000040                85304    13341391205074867                    0                    0
0000100                85300    13341391214912820                    0                    0
0000140                85304    13341391214943240                    0               109071
0000200

$ unzip -p ../src/clients/drcachesim/tests/drmemtrace.schedtest.x64/cpu_schedule.bin.zip 1 | od -t d8 -w32
0000000                85300    13341391202394706                    1                    0
0000040                85303    13341391206798536                    1                29985
0000100

The first question is, how did this thread end up having no instructions and still being output: I thought we had a filter to remove empty threads. That filter is based bytes written: so maybe the futex syscall exit (must have attached mid-futex as there's no entry) is what made it past the filter?

Tightens the normal-mode output rule to discard a thread with no instructions (keeping the no-data check for i-filtered modes). Adds the scheduler_launcher CPU check from PR #6355. It is not easy to create a test with a thread that reliably executes zero instructions yet still outputs some data. I added an invariant check for a zero-instruction thread and confirmed it fires on the trace in PR #6355: ``` $ bin64/drrun -t drcachesim -simulator_type invariant_checker -indir ~/tmp/drmemtrace.schedtest.x64 Trace invariant failure in T85300 at ref # 12 (0 instrs since timestamp 13341391214912820): An unfiltered thread should have at least 1 instruction ``` Issue: #6354

prasun3 · 2023-10-11T07:49:24Z

Thanks for the debug! I added a comment here: #6356 (comment)

Issue: #6354

derekbruening · 2023-10-12T19:35:53Z

The second case from #6355 (comment) is rep string:

Input    1 @ 13109305 refs,   3329113 instrs
ERROR: CPU marker 0 on core #1 differs from output stream CPU ID 1

$ for i in 0 1 12; do echo "Core #$i"; unzip -p ../src/clients/drcachesim/tests/drmemtrace.schedtest.x64/cpu_schedule.bin.zip $i | od -t d8 -w32; done
Core #0
0000000               180525    13341574296169946                    0                    0
0000040
Core #1
0000000               180525    13341574300611346                    1              3329113
0000040
Core #12
0000000               180524    13341574296003803                   12                    0
0000040

$ bin64/drrun -t drcachesim -simulator_type view -only_thread 180525 -indir ../src/clients/drcachesim/tests/drmemtrace.schedtest.x64 -skip_instrs 3329110 -sim_refs 30
Output format:
<--record#-> <--instr#->: <---tid---> <record details>
------------------------------------------------------------
    13109282     3329110:      180525 <marker: timestamp 13341574300611269>
    13109283     3329110:      180525 <marker: tid 180525 on core 0>
    13109284     3329111:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f20 48 89 c7             mov    %rax -> %rdi
    13109285     3329112:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f23 48 89 d6             mov    %rdx -> %rsi
    13109286     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109287     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6ae0 by PC 0x00007f25505e5f26
    13109288     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6ce0 by PC 0x00007f25505e5f26
    13109289     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109290     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6ae8 by PC 0x00007f25505e5f26
    13109291     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6ce8 by PC 0x00007f25505e5f26
    13109292     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109293     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6af0 by PC 0x00007f25505e5f26
    13109294     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6cf0 by PC 0x00007f25505e5f26
    13109295     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109296     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6af8 by PC 0x00007f25505e5f26
    13109297     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6cf8 by PC 0x00007f25505e5f26
    13109298     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109299     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6b00 by PC 0x00007f25505e5f26
    13109300     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6d00 by PC 0x00007f25505e5f26
    13109301     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109302     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6b08 by PC 0x00007f25505e5f26
    13109303     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6d08 by PC 0x00007f25505e5f26
    13109304     3329113:      180525 <marker: timestamp 13341574300611280>
    13109305     3329113:      180525 <marker: tid 180525 on core 0>
    13109306     3329113:      180525 <marker: timestamp 13341574300611346>
    13109307     3329113:      180525 <marker: tid 180525 on core 1>
    13109308     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx
    13109309     3329113:      180525 read         8 byte(s) @ 0x00007f25519a6b10 by PC 0x00007f25505e5f26
    13109310     3329113:      180525 write        8 byte(s) @ 0x00007f25519a6d10 by PC 0x00007f25505e5f26
    13109311     3329113:      180525 ifetch       3 byte(s) @ 0x00007f25505e5f26 f3 48 a5             rep movs %ds:(%rsi)[8byte] %rsi %rdi %rcx -> %es:(%rdi)[8byte] %rsi %rdi %rcx

All the rep strings have the same instr count: making it not work well as a divider point! Will have to consider how to handle this.

derekbruening · 2023-10-12T19:44:00Z

If we eliminated the ifetch record for rep string (#4948, #4915) it doesn't help: the load/store records themselves could fill the buffer up and trigger cpuid markers in the middle.

derekbruening · 2023-10-12T19:45:22Z

Want instr ord for fast skip. So have to store both instr and record count?
And have skip_instructions() take in both and no longer always align at the
start of an instruction?

This affects record-replay too: though the switch point would just always be the
start of the rep string loop.

derekbruening · 2023-10-12T19:47:59Z

Theoretically the same thing could happen with a ton of markers in between two instructions.

derekbruening · 2023-10-12T19:54:16Z

Since the rep string loop could be quite long, it doesn't seem like we can just ignore this case unfortunately (for the marker case I would say it's not worth more effort than throwing away the offending cpuid marker in the middle).

derekbruening · 2023-10-12T22:44:59Z

@prasun3 do you have an opinion on how important this is: if real-world rep string loops are generally not very long maybe never switching in the middle of one is ok? (There are still other aspects to sort out like record-replay with time quanta if we went this route.)

Tightens the normal-mode output rule to discard a thread with no instructions (keeping the no-data check for i-filtered modes) and expands the empty-thread behavior to include regular exit runs and not just detach. Adds the scheduler_launcher CPU check from PR #6355. It is not easy to create a test with a thread that reliably executes zero instructions yet still outputs some data. I added an invariant check for a zero-instruction thread and confirmed it fires on the trace in PR #6355: ``` $ bin64/drrun -t drcachesim -simulator_type invariant_checker -indir ~/tmp/drmemtrace.schedtest.x64 Trace invariant failure in T85300 at ref # 12 (0 instrs since timestamp 13341391214912820): An unfiltered thread should have at least 1 instruction ``` Issue: #6354

prasun3 · 2023-10-13T04:02:39Z

@prasun3 do you have an opinion on how important this is: if real-world rep string loops are generally not very long maybe never switching in the middle of one is ok? (There are still other aspects to sort out like record-replay with time quanta if we went this route.)

I think not switching in the middle of a rep should be okay.

derekbruening · 2023-10-13T04:38:25Z

@prasun3 -- another question: what is your preference for whether each iteration of a rep string loop should have a regular instruction fetch record, or not? We had decided in the past that it should not; but an alternative solution here would be to revisit that, to have a separate instruction ordinal for each iteration: either with a regular fetch or even without.

prasun3 · 2023-10-13T04:46:33Z

@prasun3 -- another question: what is your preference for whether each iteration of a rep string loop should have a regular instruction fetch record, or not? We had decided in the past that it should not; but an alternative solution here would be to revisit that, to have a separate instruction ordinal for each iteration: either with a regular fetch or even without.

We'd prefer if each iteration did not have an instr fetch record.

derekbruening · 2023-10-25T17:27:38Z

For the rep string:

Currently: scheduler records and replays the start only. Problems:

Wrong cpuid in record-as-traced
Instr quantum won't switch mid-loop
Time quantum will switch mid-loop but on replay would switch at start so won't match

Proposal #A: Consider not a big deal to not switch mid-loop

Hide the wrong cpuid: raw2trace deletes??
Time quantum delays mid-loop switch??

Proposal #B: Count non-fetched instrs in all instr ordinals

Confusing with instr records not matching ordinal

Proposal #C: Reverse earlier decisions and insert instr record for each iter

Won't match PMU counts
Seems unpopular when users are polled (e.g., seeing more than one cpu id on one output stream with MAP_TO_RECORDED_OUTPUT #6354 (comment))

Proposal #D: Store both an instruction ordinal and a record ordinal everywhere we store instruction ordinals today (as-traced and serial schedule files; record-replay files; range_t region-of-interest bounds; skip_instructions API) and update the skipping code to advance to mid-instruction points (any point?)

Proposal #E: Like #D but we store both an instruction ordinal and sub-instruction "load/store ordinal". This then does not rely on having unfetched instr records (which we have proposed removing) and has more structure than an asolute record ordinal.

Proposal #F: New counter of {fetched+unfetched} instructions used by the scheduler. But skipping and zipfile chunks are based on fetched only.

We've decided that Proposal #E is the winner. This involves bumping the version for a new field in all the schedule file records (the raw2trace-written format and the record-replay format).

derekbruening · 2023-10-25T17:29:05Z

The rep string solution may also serve to support running instruction-filtered traces or portions of traces in the scheduler with instruction quanta, skips, and replayed schedules.

prasun3 added a commit that referenced this issue Oct 10, 2023

i#6354 test: Verify one CPU ID per stream with MAP_TO_RECORDED

2416231

Added check to verify that output stream's CPU ID matches the CPU ID markers seen in the trace. Issue: #6354

prasun3 linked a pull request Oct 10, 2023 that will close this issue

i#6354 test: Verify one CPU ID per stream with MAP_TO_RECORDED #6355

Open

prasun3 added a commit that referenced this issue Oct 10, 2023

i#6354 test: Fixed build when HAS_ZIP is not defined

439e491

Issue: #6354

derekbruening mentioned this issue Oct 10, 2023

i#6354: Do not output zero-instruction (unfiltered) threads #6356

Merged

derekbruening self-assigned this Oct 10, 2023

prasun3 added a commit that referenced this issue Oct 12, 2023

i6354 test: removed traces with zero-inst trace

8e19465

Issue: #6354

prasun3 added a commit that referenced this issue Oct 12, 2023

i6354 test: Added test trace

27bb137

Issue: #6354

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

seeing more than one cpu id on one output stream with MAP_TO_RECORDED_OUTPUT #6354

seeing more than one cpu id on one output stream with MAP_TO_RECORDED_OUTPUT #6354

prasun3 commented Oct 10, 2023

derekbruening commented Oct 10, 2023

prasun3 commented Oct 11, 2023

derekbruening commented Oct 12, 2023 •

edited

Loading

derekbruening commented Oct 12, 2023

derekbruening commented Oct 12, 2023 •

edited

Loading

derekbruening commented Oct 12, 2023

derekbruening commented Oct 12, 2023

derekbruening commented Oct 12, 2023

prasun3 commented Oct 13, 2023

derekbruening commented Oct 13, 2023

prasun3 commented Oct 13, 2023

derekbruening commented Oct 25, 2023

derekbruening commented Oct 25, 2023

seeing more than one cpu id on one output stream with MAP_TO_RECORDED_OUTPUT #6354

seeing more than one cpu id on one output stream with MAP_TO_RECORDED_OUTPUT #6354

Comments

prasun3 commented Oct 10, 2023

derekbruening commented Oct 10, 2023

prasun3 commented Oct 11, 2023

derekbruening commented Oct 12, 2023 • edited Loading

derekbruening commented Oct 12, 2023

derekbruening commented Oct 12, 2023 • edited Loading

derekbruening commented Oct 12, 2023

derekbruening commented Oct 12, 2023

derekbruening commented Oct 12, 2023

prasun3 commented Oct 13, 2023

derekbruening commented Oct 13, 2023

prasun3 commented Oct 13, 2023

derekbruening commented Oct 25, 2023

derekbruening commented Oct 25, 2023

derekbruening commented Oct 12, 2023 •

edited

Loading

derekbruening commented Oct 12, 2023 •

edited

Loading