Fix segmentation fault in mvn update_shapes #27263
base: master
Conversation
@@ -300,7 +300,7 @@ inline void update_shapes(kernel_selector::Params& p, const kernel_impl_params&
         auto& fd = bp.fused_ops[i];
         fd.output_tensor = convert_data_tensor(fused_prim.output_layout);
         for (size_t i = fd.dep_idx_start; i < fd.dep_idx_start + fd.dep_size; i++) {
-            fd.tensors.push_back(convert_data_tensor(impl_param.get_input_layout(i)));
+            fd.AddTensor(convert_data_tensor(impl_param.get_input_layout(i)));
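The definition of AddTensor added by this PR is not shown in this hunk. Purely as an illustration (the class and member names below are assumptions, not the PR's actual code), an AddTensor-style helper could serialize appends to the shared tensor list roughly like this:

#include <mutex>
#include <vector>

// Hypothetical sketch only; not the definition introduced by this PR.
// It shows one way an AddTensor helper could guard concurrent appends
// so that two threads never grow the underlying vector at the same time.
template <typename Tensor>
class guarded_tensor_list {
public:
    void AddTensor(const Tensor& t) {
        std::lock_guard<std::mutex> guard(mutex_);  // one appender at a time
        tensors_.push_back(t);
    }
    size_t size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return tensors_.size();
    }
private:
    mutable std::mutex mutex_;
    std::vector<Tensor> tensors_;
};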
I think a data race is unexpected here. The kernel_impl_params object should be unique for each primitive in each thread. Could you describe a bit what causes the data race?
Hi @vladimir-paramuzov, previously I found from the debug log that update_shapes was called twice:
--> primitive_inst: multiply:Multiply_22381, execute update_impl start
--> primitive_inst: multiply:Multiply_22381, execute update_impl start
[+] update_shapes
[+] update_shapes
and it failed at fd.tensors.push_back, since one of the two calls updated the fd.tensors data pointer (which is why the second "[after ] ***" line is never printed in the log):
[before] fd.tensors.size: 0x10, fd.tensors.data: 0x7ffd2484f960
[before] fd.tensors.size: 0x10, fd.tensors.data: 0x7ffd2484f960
[after ] fd.tensors.size: 0x11, fd.tensors.data: 0x7ffd24a88840
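(For reference, the pointer change above is what std::vector reallocation on push_back looks like; a minimal standalone sketch, not taken from the OpenVINO sources:)

#include <cstdio>
#include <vector>

// Standalone illustration: when push_back exceeds the current capacity,
// std::vector reallocates and data() changes. Any other thread still using
// the old storage then touches freed memory, matching the crash pattern above.
int main() {
    std::vector<int> tensors(16);       // like fd.tensors holding 0x10 elements
    const int* before = tensors.data();
    tensors.push_back(42);              // size becomes 0x11; storage typically reallocates
    const int* after = tensors.data();
    std::printf("before=%p after=%p %s\n",
                static_cast<const void*>(before), static_cast<const void*>(after),
                before == after ? "(same storage)" : "(reallocated)");
    return 0;
}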
I think this may have been caused by the two Arc A770 cards installed on that machine.
After switching to another machine with only one iGPU, I can no longer reproduce the duplicate update_shapes call in my debug log, so update_shapes is only called once (per primitive) and should not be the problem. However, the segmentation fault issue still exists.
Having ruled out update_shapes, I will trace fd to see whether it is used simultaneously somewhere else (to find the potential cause of the data race).
BTW, the commit in this PR also works for the iGPU (as well as the dGPU A770 used earlier for debugging).
update_shapes is only called once (per primitive) and should not be the problem. However, the segmentation fault issue still exists.
Could you dig deeper into why it happens?
@p-durandin Yes, this issue happens when running "start_async()" multiple times in the test_cross-lingual-books-alignment.ipynb notebook.
Changing the repeated "start_async()" calls to "infer()" shows no issue, so the problem is related to running multiple asynchronous requests.
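For illustration only, here is a minimal C++ sketch of that distinction (the notebook itself uses the Python API and AsyncInferQueue; the model path and input setup below are placeholders):

#include <openvino/openvino.hpp>

// Sketch of the reproduction condition: infer() is synchronous, so two
// requests never overlap, while start_async() lets two requests from the
// same compiled model run concurrently on different stream threads.
int main() {
    ov::Core core;
    auto compiled = core.compile_model(
        "model.xml", "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    auto req0 = compiled.create_infer_request();
    auto req1 = compiled.create_infer_request();

    // Synchronous path: no issue, the second call starts after the first returns.
    req0.infer();
    req1.infer();

    // Asynchronous path: both requests can be in flight at the same time,
    // which is where the segmentation fault was observed.
    req0.start_async();
    req1.start_async();
    req0.wait();
    req1.wait();
    return 0;
}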
I added a log to detect whether concurrent access happens. It does indeed happen when running in the notebook; benchmark_app does not reproduce the issue even with many nstreams.
for (size_t i = 0; i < bp.fused_ops.size(); i++) {
    const auto& fused_prim = impl_param.fused_desc[i];
    auto& fd = bp.fused_ops[i];
    // Diagnostic only: if try_lock fails, another thread is already inside this block.
    const bool locked = vector_mutex.try_lock();
    if (!locked) {
        log_message("WARNING: Concurrent access detected! Another thread is accessing the vector.");
    }
    fd.output_tensor = convert_data_tensor(fused_prim.output_layout);
    for (size_t i = fd.dep_idx_start; i < fd.dep_idx_start + fd.dep_size; i++) {
        fd.tensors.push_back(convert_data_tensor(impl_param.get_input_layout(i)));
    }
    if (locked) {  // Unlock only if this thread actually acquired the mutex above.
        vector_mutex.unlock();
    }
}
Another suggestion for the fix: instead of using a mutex, change the std::vector to a tbb::concurrent_vector.
src/plugins/intel_gpu/src/kernel_selector/kernel_selector_params.h
#include <tbb/concurrent_vector.h>
// using MultiDataTensor = std::vector<DataTensor>;
using MultiDataTensor = tbb::concurrent_vector<DataTensor>;
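As a standalone illustration of why this helps (not the OpenVINO code), tbb::concurrent_vector allows concurrent push_back without relocating elements that other threads may already be using:

#include <tbb/concurrent_vector.h>
#include <iostream>
#include <thread>
#include <vector>

// Toy demo: several threads append concurrently to one concurrent_vector.
// Unlike std::vector, growth never moves existing elements, so concurrent
// push_back calls are safe.
int main() {
    tbb::concurrent_vector<int> tensors;
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) {
        workers.emplace_back([&tensors, t] {
            for (int i = 0; i < 1000; ++i)
                tensors.push_back(t * 1000 + i);  // safe from multiple threads
        });
    }
    for (auto& w : workers)
        w.join();
    std::cout << "total elements: " << tensors.size() << "\n";  // prints 4000
    return 0;
}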
When the following happens, the segfault is triggered:
[2024-11-06 18:51:51.444] [Thread 135250088822336] WARNING: Concurrent access detected! Another thread is accessing the vector.
[2024-11-06 18:51:51.449] [Thread 135245015811648] WARNING: Concurrent access detected! Another thread is accessing the vector.
[2024-11-06 18:51:51.449] [Thread 135245015811648] WARNING: Concurrent access detected! Another thread is accessing the vector.
Do I understand correctly that you are referring to the AsyncInferQueue and its start_async() calls in the test_cross-lingual-books-alignment.ipynb notebook?
It seems that this segmentation fault is a side effect of another issue. Could you please check why and how the single network (and the impl_params of its primitives) is being used simultaneously by two different threads? That is not expected behavior.
- With config={hints.performance_mode(): hints.PerformanceMode.THROUGHPUT}, the NUM_STREAMS will be 2:
PERFORMANCE_HINT: PerformanceMode.THROUGHPUT
EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
COMPILATION_NUM_THREADS: 20
NUM_STREAMS: 2
PERFORMANCE_HINT_NUM_REQUESTS: 0
INFERENCE_PRECISION_HINT: <Type: 'float16'>
DYNAMIC_QUANTIZATION_GROUP_SIZE: 32
ACTIVATIONS_SCALE_FACTOR: 0.0
DEVICE_ID: 0
EXECUTION_DEVICES: ['GPU.0']
And two different threads (729e28c00640 and 729e17e00640) will be used for start_async:
--> sentence idx = 0 start_async
--> cpu_streams_executor Impl run task with thread: 729e28c00640
--> sentence idx = 1 start_async
--> cpu_streams_executor Impl run task with thread: 729e17e00640
These two threads run different networks:
// 1st thread
#8 0x00007ffedcf31007 in cldnn::network::execute[abi:cxx11](this=0x55556fcf3450,
#10 ov::intel_gpu::SyncInferRequest::infer (this=0x5555647b64d0)
#25 ov::threading::CPUStreamsExecutor::Impl::Execute 0x55556467b250
// 2nd thread
#8 0x00007ffedcf31007 in cldnn::network::execute[abi:cxx11](this=0x55556fdf5370,
#10 ov::intel_gpu::SyncInferRequest::infer (this=0x55556fac2240)
#25 ov::threading::CPUStreamsExecutor::Impl::Execute 0x55556467b250
But they access the same fd (fused_operation_desc: 0x729bdc004490) and push_back new elements to the same fd.tensors (std::vector: 0x729c0c040850).
--> update_shapes[729e28c00640] i: 0, &fd: 0x729bdc004490, id: mvn:__module.encoder.layer.0.attention.output.LayerNorm/aten::layer_norm/MVN, fd.tensors size: 3, ptr: 0x729c0c040850
--> update_shapes[729e17e00640] i: 0, &fd: 0x729bdc004490, id: mvn:__module.encoder.layer.0.attention.output.LayerNorm/aten::layer_norm/MVN, fd.tensors size: 4, ptr: 0x729c0c040850
And thus there is a data race on fd.tensors under this THROUGHPUT mode.
- BTW, for different mvn primitive_inst objects, the fd (and fd.tensors) are different, so there should be no data race between primitive_inst objects within the same thread:
--> update_shapes[729e17e00640] i: 0, &fd: 0x729bdc004490, id: mvn:__module.encoder.layer.0.attention.output.LayerNorm/aten::layer_norm/MVN, fd.tensors size: 1, ptr: 0x729bdc003bd0
--> update_shapes[729e17e00640] i: 0, &fd: 0x729bb9fe0f40, id: mvn:__module.encoder.layer.0.output.LayerNorm/aten::layer_norm/MVN, fd.tensors size: 1, ptr: 0x729bb83128c0
--> update_shapes[729e17e00640] i: 0, &fd: 0x729bf40106e0, id: mvn:__module.encoder.layer.1.attention.output.LayerNorm/aten::layer_norm/MVN, fd.tensors size: 1, ptr: 0x729bf4002690
--> update_shapes[729e17e00640] i: 0, &fd: 0x729bba26f320, id: mvn:__module.encoder.layer.1.output.LayerNorm/aten::layer_norm/MVN, fd.tensors size: 1, ptr: 0x729bb8685ad0
Branch updated from b8d8db9 to 27ddd4d (Compare).
Reverted the mutex for fd.tensors (std::vector) and instead copied the _impl KernelData params for different networks when running start_async() with multiple threads (THROUGHPUT mode). With commit 2f4b607 there is no segmentation fault anymore for the test_cross-lingual-books-alignment.ipynb notebook, on both the iGPU and the Arc A770.
auto prim_impl = make_unique<permute_impl>(*this);
kernel_params_t* params_ptr = dynamic_cast<kernel_params_t*>(prim_impl->_kernel_data.params.get());
prim_impl->_kernel_data.params = make_unique<kernel_params_t>(*params_ptr);
return prim_impl;
I think we can have a template helper to minimize code copy-paste.
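A rough sketch of what such a helper could look like (purely illustrative; it mirrors the member names in the snippet above and is not existing OpenVINO API):

#include <memory>

// Hypothetical helper based on the clone snippet above: deep-copy the
// implementation together with its kernel params so that each network
// owns its own tensors and no KernelData params are shared across threads.
template <typename ImplType, typename KernelParamsType>
std::unique_ptr<ImplType> clone_with_own_params(const ImplType& impl) {
    auto prim_impl = std::make_unique<ImplType>(impl);
    if (auto* params_ptr = dynamic_cast<KernelParamsType*>(prim_impl->_kernel_data.params.get())) {
        prim_impl->_kernel_data.params = std::make_unique<KernelParamsType>(*params_ptr);
    }
    return prim_impl;
}

// Usage, e.g. in permute_impl::clone():
//     return clone_with_own_params<permute_impl, kernel_params_t>(*this);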
Signed-off-by: yuan.xiong <[email protected]>
…th multi-threads under THROUGHPUT mode Signed-off-by: yuan.xiong <[email protected]>
This reverts commit 27ddd4d.
Details:
Tickets: