-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Atomic store operations are not detected on cached private pages #52
Comments
ioanev
pushed a commit
to ioanev/argodsm
that referenced
this issue
Mar 1, 2021
As per issue etascale#52, changes were made to mempool and the collective allocators, in order for only the home node of the data structure allocated to apply changes on it (initialization/modification). As the above didn't solve all issues, further modifications were applied on the failing tests. More specifically: - PerfectForwardingCollectiveNew under allocators.cpp In this test, a global pointer object is attached to the collecti- vely allocated tuple, in order to identify the home node of this page and allow only that process to proceed on the modifications. - selectiveArray, selectiveUnaligned, writeBufferLoad under backend.cpp In these tests, only node_0 was set to handle the initialization of the global data structures worked on, thus leading to failure when utilizing the first-touch memory policy, as the backing store of node_0 is not sufficient to host all pages. To spread the pages across all nodes and avoid this issue, the initialization is done collectively. The ArgoDSM global memory size is also increased from 16 to 32MB, as writeBufferLoad was still failing even by team process initializing the data structures. - ReadUninitializedSinglenode under uninitialized.cpp This test was also modified to collectively initialize the data structure worked on due to the aforementioned limitation. The case where if one node under first-touch initializes all data leads to failure is left to be handled on future work, by moving the allocations to the next node of sufficient backing store size.
ioanev
pushed a commit
to ioanev/argodsm
that referenced
this issue
Mar 1, 2021
As per issue etascale#52, changes were made to mempool and the collective allocators, in order for only the home node of the data structure allocated to apply changes on it (initialization/modification). As the above didn't solve all issues, further modifications were applied on the failing tests. More specifically: - PerfectForwardingCollectiveNew under allocators.cpp In this test, a global pointer object is attached to the collecti- vely allocated tuple, in order to identify the home node of this page and allow only that process to proceed on the modifications. - selectiveArray, selectiveUnaligned, writeBufferLoad under backend.cpp In these tests, only node_0 was set to handle the initialization of the global data structures worked on, thus leading to failure when utilizing the first-touch memory policy, as the backing store of node_0 is not sufficient to host all pages. To spread the pages across all nodes and avoid this issue, the initialization is done collectively. The ArgoDSM global memory size is also increased from 16 to 32MB, as writeBufferLoad was still failing even by team process initializing the data structures. - ReadUninitializedSinglenode under uninitialized.cpp This test was also modified to collectively initialize the data structure worked on due to the aforementioned limitation. The case where if one node under first-touch initializes all data leads to failure is left to be handled on future work, by moving the allocations to the next node of sufficient backing store size.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Issue
When an ArgoDSM page is cached by a remote node, and the page is in private state (caching node is single writer or no writer, caching node is sharer), subsequent atomic writes to the page are not detected on this node by regular reads even after self-invalidation.
The question is whether this is a bug (I believe so), or if the semantics of the ArgoDSM atomic functions simply do not define such behavior in the first place. This is related to #20, but is further exposed by Ioannis work on memory allocation policies which fails some of the atomic tests when the allocated data is owned by a node other than 0.
Reproduction
This following test exposes the bug on 2 or more nodes with the naive allocation policy. In order to display the second case (no writer, caching node is sharer), simply substitute
*counter = 0;
withvolatile int temp = *counter
. The important point is thatcounter
is allocated on node 0, and that exactly one other node performs a read or write to it.Detail
This issue is courtesy of the following optimization:
argodsm/src/backend/mpi/swdsm.cpp
Lines 984 to 994 in 5f9b572
The reason is that ArgoDSM atomics do not alter the Pyxis directory (
globalSharers
) state, and therefore cached remote pages in single writer or no writer, shared state are not invalidated upon self-invalidation causing the node to miss updates until the state of the page changes.Solution?
The fact that cached private pages are not downgraded to shared on atomic writes means that it is never completely safe to mix atomic writes and regular reads/writes. I believe that the correct solution would be to write atomic changes to the cache and to update local (and remote when needed as a result) Pyxis directories to the correct state.
The text was updated successfully, but these errors were encountered: