Fix segmentation fault in mvn update_shapes #27263
base: master
Conversation
@@ -300,7 +300,7 @@ inline void update_shapes(kernel_selector::Params& p, const kernel_impl_params&
         auto& fd = bp.fused_ops[i];
         fd.output_tensor = convert_data_tensor(fused_prim.output_layout);
         for (size_t i = fd.dep_idx_start; i < fd.dep_idx_start + fd.dep_size; i++) {
-            fd.tensors.push_back(convert_data_tensor(impl_param.get_input_layout(i)));
+            fd.AddTensor(convert_data_tensor(impl_param.get_input_layout(i)));
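The definition of AddTensor added by this PR is not shown in this hunk. Purely as an illustration (the class and member names below are assumptions, not the PR's actual code), an AddTensor-style helper could serialize appends to the shared tensor list roughly like this:

#include <mutex>
#include <vector>

// Hypothetical sketch only; not the definition introduced by this PR.
// It shows one way an AddTensor helper could guard concurrent appends
// so that two threads never grow the underlying vector at the same time.
template <typename Tensor>
class guarded_tensor_list {
public:
    void AddTensor(const Tensor& t) {
        std::lock_guard<std::mutex> guard(mutex_);  // one appender at a time
        tensors_.push_back(t);
    }
    size_t size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return tensors_.size();
    }
private:
    mutable std::mutex mutex_;
    std::vector<Tensor> tensors_;
};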
I think a data race is unexpected here. The kernel_impl_params object should be unique for each primitive in each thread. Could you describe a bit what causes the data race?
Hi @vladimir-paramuzov, previously I found from the debug log that update_shapes was called twice:
--> primitive_inst: multiply:Multiply_22381, execute update_impl start
--> primitive_inst: multiply:Multiply_22381, execute update_impl start
[+] update_shapes
[+] update_shapes
and it failed at fd.tensors.push_back, since one of the two calls updated the fd.tensors data pointer (which is why the second "[after ] ***" line is never printed in the log):
[before] fd.tensors.size: 0x10, fd.tensors.data: 0x7ffd2484f960
[before] fd.tensors.size: 0x10, fd.tensors.data: 0x7ffd2484f960
[after ] fd.tensors.size: 0x11, fd.tensors.data: 0x7ffd24a88840
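(For reference, the pointer change above is what std::vector reallocation on push_back looks like; a minimal standalone sketch, not taken from the OpenVINO sources:)

#include <cstdio>
#include <vector>

// Standalone illustration: when push_back exceeds the current capacity,
// std::vector reallocates and data() changes. Any other thread still using
// the old storage then touches freed memory, matching the crash pattern above.
int main() {
    std::vector<int> tensors(16);       // like fd.tensors holding 0x10 elements
    const int* before = tensors.data();
    tensors.push_back(42);              // size becomes 0x11; storage typically reallocates
    const int* after = tensors.data();
    std::printf("before=%p after=%p %s\n",
                static_cast<const void*>(before), static_cast<const void*>(after),
                before == after ? "(same storage)" : "(reallocated)");
    return 0;
}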
I think this may have been caused by the two Arc A770 cards installed on that machine.
After switching to another machine with only one iGPU, I can no longer reproduce the duplicate update_shapes call in my debug log, so update_shapes is only called once (per primitive) and should not be the problem. However, the segmentation fault issue still exists.
Having ruled out update_shapes, I will trace fd to see whether it is used simultaneously somewhere else (to find the potential cause of the data race).
BTW, the commit in this PR also works for the iGPU (as well as the dGPU A770 used earlier for debugging).
update_shapes is only called once (per primitive) and should not be the problem. However, the segmentation fault issue still exists.
Could you dig deeper into why it happens?
@p-durandin Yes, this issue happens when running "start_async()" multiple times in the test_cross-lingual-books-alignment.ipynb notebook.
Changing the repeated "start_async()" calls to "infer()" shows no issue, so the problem is related to running multiple asynchronous requests.
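For illustration only, here is a minimal C++ sketch of that distinction (the notebook itself uses the Python API and AsyncInferQueue; the model path and input setup below are placeholders):

#include <openvino/openvino.hpp>

// Sketch of the reproduction condition: infer() is synchronous, so two
// requests never overlap, while start_async() lets two requests from the
// same compiled model run concurrently on different stream threads.
int main() {
    ov::Core core;
    auto compiled = core.compile_model(
        "model.xml", "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    auto req0 = compiled.create_infer_request();
    auto req1 = compiled.create_infer_request();

    // Synchronous path: no issue, the second call starts after the first returns.
    req0.infer();
    req1.infer();

    // Asynchronous path: both requests can be in flight at the same time,
    // which is where the segmentation fault was observed.
    req0.start_async();
    req1.start_async();
    req0.wait();
    req1.wait();
    return 0;
}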
I added a log to detect whether concurrent access happens. It does indeed happen when running in the notebook; benchmark_app does not reproduce the issue even with many nstreams.
for (size_t i = 0; i < bp.fused_ops.size(); i++) {
    const auto& fused_prim = impl_param.fused_desc[i];
    auto& fd = bp.fused_ops[i];
    // Diagnostic only: if try_lock fails, another thread is already inside this block.
    const bool locked = vector_mutex.try_lock();
    if (!locked) {
        log_message("WARNING: Concurrent access detected! Another thread is accessing the vector.");
    }
    fd.output_tensor = convert_data_tensor(fused_prim.output_layout);
    for (size_t i = fd.dep_idx_start; i < fd.dep_idx_start + fd.dep_size; i++) {
        fd.tensors.push_back(convert_data_tensor(impl_param.get_input_layout(i)));
    }
    if (locked) {  // Unlock only if this thread actually acquired the mutex above.
        vector_mutex.unlock();
    }
}
Another suggestion for the fix: instead of using a mutex, change the std::vector to a tbb::concurrent_vector.
src/plugins/intel_gpu/src/kernel_selector/kernel_selector_params.h
#include <tbb/concurrent_vector.h>
// using MultiDataTensor = std::vector<DataTensor>;
using MultiDataTensor = tbb::concurrent_vector<DataTensor>;
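As a standalone illustration of why this helps (not the OpenVINO code), tbb::concurrent_vector allows concurrent push_back without relocating elements that other threads may already be using:

#include <tbb/concurrent_vector.h>
#include <iostream>
#include <thread>
#include <vector>

// Toy demo: several threads append concurrently to one concurrent_vector.
// Unlike std::vector, growth never moves existing elements, so concurrent
// push_back calls are safe.
int main() {
    tbb::concurrent_vector<int> tensors;
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) {
        workers.emplace_back([&tensors, t] {
            for (int i = 0; i < 1000; ++i)
                tensors.push_back(t * 1000 + i);  // safe from multiple threads
        });
    }
    for (auto& w : workers)
        w.join();
    std::cout << "total elements: " << tensors.size() << "\n";  // prints 4000
    return 0;
}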
When the following happens, the segfault is triggered:
[2024-11-06 18:51:51.444] [Thread 135250088822336] WARNING: Concurrent access detected! Another thread is accessing the vector.
[2024-11-06 18:51:51.449] [Thread 135245015811648] WARNING: Concurrent access detected! Another thread is accessing the vector.
[2024-11-06 18:51:51.449] [Thread 135245015811648] WARNING: Concurrent access detected! Another thread is accessing the vector.
Do I understand correctly that you are referring to the AsyncInferQueue and its start_async() calls in the test_cross-lingual-books-alignment.ipynb notebook?
It seems that this segmentation fault is a side effect of another issue. Could you please check why and how the single network (and the impl_params of its primitives) is being used simultaneously by two different threads? That is not expected behavior.
- With config={hints.performance_mode(): hints.PerformanceMode.THROUGHPUT}, the NUM_STREAMS will be 2:
PERFORMANCE_HINT: PerformanceMode.THROUGHPUT
EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
COMPILATION_NUM_THREADS: 20
NUM_STREAMS: 2
PERFORMANCE_HINT_NUM_REQUESTS: 0
INFERENCE_PRECISION_HINT: <Type: 'float16'>
DYNAMIC_QUANTIZATION_GROUP_SIZE: 32
ACTIVATIONS_SCALE_FACTOR: 0.0
DEVICE_ID: 0
EXECUTION_DEVICES: ['GPU.0']
And two different threads (729e28c00640 and 729e17e00640) will be used for start_async:
--> sentence idx = 0 start_async
--> cpu_streams_executor Impl run task with thread: 729e28c00640
--> sentence idx = 1 start_async
--> cpu_streams_executor Impl run task with thread: 729e17e00640
These two threads run different networks:
// 1st thread
#8 0x00007ffedcf31007 in cldnn::network::execute[abi:cxx11](this=0x55556fcf3450,
#10 ov::intel_gpu::SyncInferRequest::infer (this=0x5555647b64d0)
#25 ov::threading::CPUStreamsExecutor::Impl::Execute 0x55556467b250
// 2nd thread
#8 0x00007ffedcf31007 in cldnn::network::execute[abi:cxx11](this=0x55556fdf5370,
#10 ov::intel_gpu::SyncInferRequest::infer (this=0x55556fac2240)
#25 ov::threading::CPUStreamsExecutor::Impl::Execute 0x55556467b250
But they access the same fd (fused_operation_desc: 0x729bdc004490) and push_back new elements to the same fd.tensors (std::vector: 0x729c0c040850).
--> update_shapes[729e28c00640] i: 0, &fd: 0x729bdc004490, id: mvn:__module.encoder.layer.0.attention.output.LayerNorm/aten::layer_norm/MVN, fd.tensors size: 3, ptr: 0x729c0c040850
--> update_shapes[729e17e00640] i: 0, &fd: 0x729bdc004490, id: mvn:__module.encoder.layer.0.attention.output.LayerNorm/aten::layer_norm/MVN, fd.tensors size: 4, ptr: 0x729c0c040850
And thus there is a data race on fd.tensors under this THROUGHPUT mode.
- BTW, for different mvn primitive_inst objects, the fd (and fd.tensors) are different, so there should be no data race between primitive_inst objects within the same thread:
--> update_shapes[729e17e00640] i: 0, &fd: 0x729bdc004490, id: mvn:__module.encoder.layer.0.attention.output.LayerNorm/aten::layer_norm/MVN, fd.tensors size: 1, ptr: 0x729bdc003bd0
--> update_shapes[729e17e00640] i: 0, &fd: 0x729bb9fe0f40, id: mvn:__module.encoder.layer.0.output.LayerNorm/aten::layer_norm/MVN, fd.tensors size: 1, ptr: 0x729bb83128c0
--> update_shapes[729e17e00640] i: 0, &fd: 0x729bf40106e0, id: mvn:__module.encoder.layer.1.attention.output.LayerNorm/aten::layer_norm/MVN, fd.tensors size: 1, ptr: 0x729bf4002690
--> update_shapes[729e17e00640] i: 0, &fd: 0x729bba26f320, id: mvn:__module.encoder.layer.1.output.LayerNorm/aten::layer_norm/MVN, fd.tensors size: 1, ptr: 0x729bb8685ad0
Branch updated from b8d8db9 to 27ddd4d (Compare).
Reverted the mutex for fd.tensors (std::vector) and instead copied the _impl KernelData params for different networks when running start_async() with multiple threads (THROUGHPUT mode). With commit 2f4b607 there is no segmentation fault anymore for the test_cross-lingual-books-alignment.ipynb notebook, on both the iGPU and the Arc A770.
auto prim_impl = make_unique<permute_impl>(*this);
kernel_params_t* params_ptr = dynamic_cast<kernel_params_t*>(prim_impl->_kernel_data.params.get());
prim_impl->_kernel_data.params = make_unique<kernel_params_t>(*params_ptr);
return prim_impl;
I think we can have a template helper to minimize code copy-paste.
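A rough sketch of what such a helper could look like (purely illustrative; it mirrors the member names in the snippet above and is not existing OpenVINO API):

#include <memory>

// Hypothetical helper based on the clone snippet above: deep-copy the
// implementation together with its kernel params so that each network
// owns its own tensors and no KernelData params are shared across threads.
template <typename ImplType, typename KernelParamsType>
std::unique_ptr<ImplType> clone_with_own_params(const ImplType& impl) {
    auto prim_impl = std::make_unique<ImplType>(impl);
    if (auto* params_ptr = dynamic_cast<KernelParamsType*>(prim_impl->_kernel_data.params.get())) {
        prim_impl->_kernel_data.params = std::make_unique<KernelParamsType>(*params_ptr);
    }
    return prim_impl;
}

// Usage, e.g. in permute_impl::clone():
//     return clone_with_own_params<permute_impl, kernel_params_t>(*this);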
Signed-off-by: yuan.xiong <[email protected]>
…th multi-threads under THROUGHPUT mode Signed-off-by: yuan.xiong <[email protected]>
This reverts commit 27ddd4d.
Details:
Tickets: