[C++] Allow configuration of size of AWS event loop thread pool #34118
I thought the default behavior was for AWS to not use a pool at all and to spin up a brand-new detached thread per request, but that article is pretty old, so maybe that is no longer the behavior. Furthermore, the docs state "which will create one for each processor on the machine." Perhaps that is a typo on their part, but unless you have a multi-CPU machine (e.g. NUMA) I would expect this to use a single thread (and it would be weird if their default went against their recommendations). Although, looking at the linked issue, it does indeed seem to be a lot of threads. And... after further debugging... it does seem to be one thread per physical core on my system.
I wouldn't be terribly worried about this. I expect these threads will spend the majority of their time in a blocked state, not scheduled by the OS. I agree there is some minor cost to having more threads than you need, but it isn't the more significant hit you get from over-scheduling CPU-bound threads, which leads to an excess of context switches.
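The "mostly blocked" point can be illustrated with a small stdlib sketch (the helper and thread count here are my own, not Arrow's or AWS's): threads parked on a synchronization primitive each cost stack memory, but the OS keeps them off the run queue, so they consume essentially no CPU time.

```python
import threading
import time

def park(event: threading.Event) -> None:
    # Each worker blocks here; the OS keeps it off the run queue
    # until the event is set, so it costs memory but ~no CPU.
    event.wait()

wake = threading.Event()
workers = [threading.Thread(target=park, args=(wake,)) for _ in range(16)]
for t in workers:
    t.start()

# Measure this process's CPU time while 16 parked threads exist.
start = time.process_time()
time.sleep(0.2)
cpu_used = time.process_time() - start

wake.set()  # unblock everyone
for t in workers:
    t.join()

print(f"CPU time consumed while 16 threads were parked: {cpu_used:.4f}s")
```

On a typical system the reported CPU time is close to zero, which is the sense in which extra idle event-loop threads are cheap compared to oversubscribing runnable threads.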
Agreed, there is already
It sounds like it would be a good idea in general to change the default to 1 anyway. Though this could use some benchmarking.
Do you want to open a separate issue for this? Seems like a reasonable request.
I looked a bit harder, and you can do it; you need to call
…34134)

This also changes the default # of threads to 1 per advice in the linked PR.
* Closes: #34118
Authored-by: Weston Pace <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
Thanks!
@westonpace @wence

$ ARROW_IO_THREADS=1 OMP_NUM_THREADS=1 ipython
>>> from pyarrow._s3fs import initialize_s3
>>> initialize_s3(num_event_loop_threads=1)
>>> import pyarrow as pa
>>> pa.__version__
'12.0.1'

$ ps -eT -o "ppid,pid,tid,comm" | grep 2406868
1047424 2406868 2406868 ipython
1047424 2406868 2406896 ipython
1047424 2406868 2407167 jemalloc_bg_thd
1047424 2406868 2407291 AwsEventLoop 1
1047424 2406868 2407292 AwsEventLoop 2
1047424 2406868 2407293 AwsEventLoop 3
1047424 2406868 2407294 AwsEventLoop 4
1047424 2406868 2407295 AwsEventLoop 5
1047424 2406868 2407296 AwsEventLoop 6
1047424 2406868 2407297 AwsEventLoop 7
1047424 2406868 2407298 AwsEventLoop 8
1047424 2406868 2407299 AwsEventLoop 9
1047424 2406868 2407300 AwsEventLoop 10
1047424 2406868 2407301 AwsEventLoop 11
1047424 2406868 2407302 AwsEventLoop 12
1047424 2406868 2407303 AwsEventLoop 13
1047424 2406868 2407304 AwsEventLoop 14
1047424 2406868 2407305 AwsEventLoop 15
1047424 2406868 2407306 AwsEventLoop 16
1047424 2406868 2407307 ipython

This also happens in pyarrow 12.0.1 because ensure_s3_initialized is called from __init__.py. I suggest we reopen this issue.
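For anyone who wants to check this programmatically instead of grepping ps output, here is a small Linux-only sketch (the helper name is my own, not part of pyarrow) that counts a process's threads by the name recorded in /proc/<pid>/task/<tid>/comm. After S3 initialization one would look for names starting with "AwsEventLoop".

```python
import os
from typing import Optional

def count_threads_by_prefix(prefix: str, pid: Optional[int] = None) -> int:
    """Count threads of a process whose comm name starts with `prefix`.

    Linux-only: reads /proc/<pid>/task/<tid>/comm for each thread.
    Defaults to the current process.
    """
    pid = pid if pid is not None else os.getpid()
    task_dir = f"/proc/{pid}/task"
    count = 0
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/comm") as f:
                if f.read().strip().startswith(prefix):
                    count += 1
        except OSError:
            pass  # thread exited between listdir and open
    return count

# After pyarrow's S3 init, one would inspect e.g.:
#   count_threads_by_prefix("AwsEventLoop")
print(count_threads_by_prefix(""))  # empty prefix: total thread count
```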
Hmm, I do not see this behavior.
Is it possible that
I agree this is a concern. I believe #35575 is tracking this.
I see the same behavior as @austinzh from plain Python (3.9.6); confirmed using 13.0 too.
One of the annoying things is that this is triggered just by importing
I added a compile check in s3fs.cc and found that both the ubuntu_cpp and python manylinux builds fail if I add the following code.
Here is my build script to trigger this error.
I had to apply the following patch to remove the precompile macro:
diff -r arrow/cpp/src/arrow/CMakeLists.txt apache-arrow-12.0.1/cpp/src/arrow/CMakeLists.txt
488,497d487
< try_compile(S3_HAS_CRT ${CMAKE_CURRENT_BINARY_DIR}/try_compile
< SOURCES "${CMAKE_CURRENT_SOURCE_DIR}/filesystem/try_compile/check_s3fs_crt.cc"
< CMAKE_FLAGS "-DINCLUDE_DIRECTORIES=${CURRENT_INCLUDE_DIRECTORIES}"
< LINK_LIBRARIES ${AWSSDK_LINK_LIBRARIES} CXX_STANDARD 17)
<
< if(S3_HAS_CRT)
< message(STATUS "AWS SDK is new enough to have CRT support")
< add_definitions(-DARROW_S3_HAS_CRT)
< endif()
<
diff -r arrow/cpp/src/arrow/filesystem/s3fs.cc apache-arrow-12.0.1/cpp/src/arrow/filesystem/s3fs.cc
54d53
< #ifdef ARROW_S3_HAS_CRT
58d56
< #endif
2629d2626
< #ifdef ARROW_S3_HAS_CRT
2641d2637
< #endif
I have this issue as well on pyarrow 13.0.0
Attempting to initialize the number of threads using
Use #37394 instead of this issue.
Describe the enhancement requested
When calling
DoInitializeS3
, arrow initialises the AWS API, which by default creates a thread pool for the background AWS event loop with one thread per physical core on the system.

This is particularly noticeable when using pyarrow, where merely importing
from pyarrow.fs import PyFileSystem
will cause eager initialisation of the AWS API, so you're paying (something) for these threads even when you have no intention of using them. This is rather unfriendly when running a multi-process or otherwise parallelised workload on a multicore box, since it leads to oversubscription. Moreover, it may well not be the best option depending on the number of concurrent connections this arrow process is going to be making to the AWS API: quoth the documentation
It would be nice if there were a way to control the size of this thread pool, in the same way that one can control the number of IO threads arrow uses via
ARROW_IO_THREADS
. [Aside: AFAICT there's no programmatic way of controlling arrow's thread pool size; it must be done via environment variables, which is also rather unfriendly.]

I think the following diff is kind of a sketch in this direction, although it just unilaterally sets the size of the thread pool to a single thread.
I'm not really familiar with the arrow layout, so I don't know how to plumb this into any configuration options that might abound: is there such a thing, or would one just introduce a (new?) env var?
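For reference, the Python-level entry point that appears earlier in this thread can be wrapped defensively; this is a sketch of my own (the wrapper function name is hypothetical), guarded so it degrades gracefully when pyarrow is missing, was built without S3 support, or is a version that predates the knob.

```python
def configure_s3_event_loop_threads(num_threads: int = 1) -> bool:
    """Try to cap the AWS event loop pool before any S3 use.

    Uses initialize_s3(num_event_loop_threads=...) from pyarrow._s3fs,
    as shown earlier in this thread. Returns False if the knob is
    unavailable, so callers can fall back gracefully.
    """
    try:
        from pyarrow._s3fs import initialize_s3  # private module; see thread
    except ImportError:
        return False  # pyarrow missing, or built without S3 support
    try:
        initialize_s3(num_event_loop_threads=num_threads)
    except TypeError:
        return False  # older pyarrow without the num_event_loop_threads knob
    return True

if __name__ == "__main__":
    ok = configure_s3_event_loop_threads(1)
    print("configured" if ok else "pyarrow S3 knob not available")
```

Calling this before any filesystem work matters, because the pool is created once at AWS API initialization and cannot be resized afterwards.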
Component(s)
C++