[C++] Allow configuration of size of AWS event loop thread pool #34118

Closed
wence- opened this issue Feb 10, 2023 · 9 comments · Fixed by #34134

Comments


wence- commented Feb 10, 2023

Describe the enhancement requested

When calling DoInitializeS3, Arrow initialises the AWS API, which by default creates a thread pool for the background AWS event loop with one thread per physical core on the system.

This is particularly noticeable when using pyarrow, where merely running from pyarrow.fs import PyFileSystem causes eager initialisation of the AWS API, so you're paying (something) for these threads even when you have no intention of using them.

This is rather unfriendly when running a multi-process or otherwise parallelised workload on a multicore box, since it leads to oversubscription. Moreover, it may well not be the best option depending on the number of concurrent connections this Arrow process is going to make to the AWS API: quoth the documentation

The number of threads used depends on your use-case. IF you have a maximum of less than a few hundred connections 1 thread is the ideal threadCount.

It would be nice if there were a way to control the size of this thread pool, in the same way that one can control the number of IO threads Arrow uses via ARROW_IO_THREADS. [Aside: AFAICT there's no programmatic way of controlling Arrow's thread pool size; it must be done via environment variables, which is also rather unfriendly.]

I think the following diff is a sketch in this direction, although it just unilaterally sets the size of the thread pool to a single thread.

diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
index 16ffe2526..a71ec93f7 100644
--- a/cpp/src/arrow/filesystem/s3fs.cc
+++ b/cpp/src/arrow/filesystem/s3fs.cc
@@ -2604,6 +2604,13 @@ Status DoInitializeS3(const S3GlobalOptions& options) {
   // This configuration options is only available with AWS SDK 1.9.272 and later.
   aws_options.httpOptions.compliantRfc3986Encoding = true;
 #endif
+  aws_options.ioOptions.clientBootstrap_create_fn = []() {
+    Aws::Crt::Io::EventLoopGroup eventLoopGroup(1);
+    Aws::Crt::Io::DefaultHostResolver defaultHostResolver(eventLoopGroup, 8, 30);
+    auto clientBootstrap = Aws::MakeShared<Aws::Crt::Io::ClientBootstrap>(ALLOCATION_TAG, eventLoopGroup, defaultHostResolver);
+    clientBootstrap->EnableBlockingShutdown();
+    return clientBootstrap;
+  };
   Aws::InitAPI(aws_options);
   aws_initialized.store(true);
   return Status::OK();

I'm not really familiar with the Arrow layout, so I don't know how to plumb this into whatever configuration options might exist: is there such a thing, or would one just introduce a (new?) env var?
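
For illustration, here is a minimal sketch of the env-var route, assuming a hypothetical ARROW_S3_EVENT_LOOP_THREADS variable; the variable name and helper below are invented for this sketch, not existing Arrow configuration.

// Sketch only: ARROW_S3_EVENT_LOOP_THREADS and GetEventLoopThreadCount are
// invented names for illustration, not existing Arrow configuration.
#include <cstdlib>
#include <exception>
#include <string>

namespace {

int GetEventLoopThreadCount() {
  const char* raw = std::getenv("ARROW_S3_EVENT_LOOP_THREADS");
  if (raw == nullptr) return 1;  // conservative default
  try {
    const int n = std::stoi(raw);
    return n > 0 ? n : 1;
  } catch (const std::exception&) {
    return 1;  // ignore malformed values
  }
}

}  // namespace

// DoInitializeS3 would then construct the EventLoopGroup with
// GetEventLoopThreadCount() instead of the hard-coded 1 in the diff above.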

Component(s)

C++

westonpace (Member) commented

When calling DoInitializeS3, Arrow initialises the AWS API, which by default creates a thread pool for the background AWS event loop with one thread per physical core on the system.

I thought the default behavior was for AWS not to use a pool at all and to spin up a brand new detached thread per request, but that article is pretty old, so maybe that is no longer the behavior.

Furthermore, the docs state "which will create one for each processor on the machine." Perhaps it is a typo on their part, but unless you have a multi-CPU machine (e.g. NUMA) I would expect this to use a single thread (and it would be weird if their default went against their recommendations). Although, looking at the linked issue, it does indeed seem to be a lot of threads. And... after further debugging... it does seem to be one thread per physical core on my system.

This is rather unfriendly when running a multi-process or otherwise parallelised workload on a multicore box, since it leads to oversubscription.

I wouldn't be terribly worried about this. I expect these threads will spend the majority of their time in a blocked state, not scheduled by the OS. I agree there is some minor hit to having more threads than you need, but this isn't the more significant hit you get from over-scheduling CPU threads, which leads to an excess of context switches.

It would be nice if there were a way to control the size of this thread pool

Agreed, there is already arrow::fs::S3GlobalOptions, so we have some precedent. I don't know if there are Python bindings for it, and it seems we need to add an "event loop thread pool count" option to the mix.
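
To make that concrete, here is a rough sketch (not a final design) of how the option could be plumbed through arrow::fs::S3GlobalOptions; the field name num_event_loop_threads is chosen to match the pyarrow snippets later in this thread, and the bootstrap construction mirrors the diff in the issue description.

// Sketch only: add an event loop thread count to S3GlobalOptions and use it
// when building the AWS client bootstrap.
struct S3GlobalOptions {
  S3LogLevel log_level;
  // 1 thread suffices for up to a few hundred concurrent connections,
  // per the AWS documentation quoted above.
  int num_event_loop_threads = 1;
};

Status DoInitializeS3(const S3GlobalOptions& options) {
  Aws::SDKOptions aws_options;
  // Capture the count by value: the bootstrap factory may run after
  // `options` has gone out of scope.
  aws_options.ioOptions.clientBootstrap_create_fn =
      [num_threads = options.num_event_loop_threads]() {
        Aws::Crt::Io::EventLoopGroup event_loop_group(num_threads);
        Aws::Crt::Io::DefaultHostResolver host_resolver(
            event_loop_group, /*maxHosts=*/8, /*maxTTL=*/30);
        auto bootstrap = Aws::MakeShared<Aws::Crt::Io::ClientBootstrap>(
            "Arrow-S3ClientBootstrap", event_loop_group, host_resolver);
        bootstrap->EnableBlockingShutdown();
        return bootstrap;
      };
  Aws::InitAPI(aws_options);
  return Status::OK();
}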

I think the following diff is a sketch in this direction, although it just unilaterally sets the size of the thread pool to a single thread.

It sounds like it would be a good idea in general to change the default to 1 anyway, though this could use some benchmarking.

Aside: AFAICT there's no programmatic way of controlling Arrow's thread pool size; it must be done via environment variables, which is also rather unfriendly

Do you want to open a separate issue for this? Seems like a reasonable request.

kou changed the title from "Allow configuration of size of AWS event loop thread pool" to "[C++] Allow configuration of size of AWS event loop thread pool" Feb 10, 2023
wence- (Author) commented Feb 13, 2023

Aside: AFAICT there's no programmatic way of controlling Arrow's thread pool size; it must be done via environment variables, which is also rather unfriendly

Do you want to open a separate issue for this? Seems like a reasonable request.

I looked a bit harder, and you can do it: you need to call ThreadPool->SetCapacity after creation, so I'll retract this aside.
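
For the record, a minimal sketch of the programmatic route; to the best of my understanding Arrow exposes SetCpuThreadPoolCapacity and arrow::io::SetIOThreadPoolCapacity for this, though treat the exact headers as approximate.

// Minimal sketch: shrink Arrow's global thread pools programmatically
// instead of relying on environment variables such as ARROW_IO_THREADS.
#include <arrow/io/interfaces.h>     // arrow::io::SetIOThreadPoolCapacity
#include <arrow/status.h>
#include <arrow/util/thread_pool.h>  // arrow::SetCpuThreadPoolCapacity

arrow::Status ConfigureThreadPools() {
  // CPU pool: used for compute and decoding work.
  ARROW_RETURN_NOT_OK(arrow::SetCpuThreadPoolCapacity(1));
  // IO pool: used for filesystem and network reads.
  ARROW_RETURN_NOT_OK(arrow::io::SetIOThreadPoolCapacity(1));
  return arrow::Status::OK();
}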

westonpace added a commit that referenced this issue Feb 22, 2023
…34134)

This also changes the default # of threads to 1 per advice in the linked PR.
* Closes: #34118

Authored-by: Weston Pace <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
westonpace added this to the 12.0.0 milestone Feb 22, 2023
wence- (Author) commented Feb 22, 2023

Thanks!

fatemehp pushed a commit to fatemehp/arrow that referenced this issue Feb 24, 2023
…rable (apache#34134)

This also changes the default # of threads to 1 per advice in the linked PR.
* Closes: apache#34118

Authored-by: Weston Pace <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
austinzh commented Aug 22, 2023

@westonpace @wence-
The issue still exists with Python 3.8 and pyarrow 12.0.1:

ARROW_IO_THREADS=1  OMP_NUM_THREADS=1 ipython
>>> from  pyarrow._s3fs import initialize_s3
>>> initialize_s3(num_event_loop_threads=1)
>>> import pyarrow as pa
>>> pa.__version__
'12.0.1'
 ps -eT -o "ppid,pid,tid,comm" | grep 2406868
1047424 2406868 2406868 ipython
1047424 2406868 2406896 ipython
1047424 2406868 2407167 jemalloc_bg_thd
1047424 2406868 2407291 AwsEventLoop 1
1047424 2406868 2407292 AwsEventLoop 2
1047424 2406868 2407293 AwsEventLoop 3
1047424 2406868 2407294 AwsEventLoop 4
1047424 2406868 2407295 AwsEventLoop 5
1047424 2406868 2407296 AwsEventLoop 6
1047424 2406868 2407297 AwsEventLoop 7
1047424 2406868 2407298 AwsEventLoop 8
1047424 2406868 2407299 AwsEventLoop 9
1047424 2406868 2407300 AwsEventLoop 10
1047424 2406868 2407301 AwsEventLoop 11
1047424 2406868 2407302 AwsEventLoop 12
1047424 2406868 2407303 AwsEventLoop 13
1047424 2406868 2407304 AwsEventLoop 14
1047424 2406868 2407305 AwsEventLoop 15
1047424 2406868 2407306 AwsEventLoop 16
1047424 2406868 2407307 ipython

Also, in pyarrow 12.0.1, because S3 initialization is ensured in pyarrow.fs's __init__.py at import time, the user cannot effectively call pyarrow.fs.initialize_s3 themselves.

I suggest we reopen this issue.

westonpace (Member) commented

Hmm, I do not see this behavior.

(arrow-release-12-0-1) pace@pace-desktop:~$ ARROW_IO_THREADS=1 OMP_NUM_THREADS=1 python
Python 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 18:08:17) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from  pyarrow._s3fs import initialize_s3
>>> initialize_s3(num_event_loop_threads=1)
>>> import pyarrow as pa
>>> pa.__version__
'12.0.1'
>>> import os
>>> os.getpid()
18795
pace@pace-desktop:~$ ps -eT -o "ppid,pid,tid,comm" | grep 18795
  11208   18795   18795 python
  11208   18795   18797 jemalloc_bg_thd
  11208   18795   18800 AwsEventLoop 1

Is it possible that ipython's initialization is importing pyarrow.fs somewhere else?

Also, in pyarrow 12.0.1, because S3 initialization is ensured in pyarrow.fs's __init__.py at import time, the user cannot effectively call pyarrow.fs.initialize_s3 themselves.

I agree this is a concern. I believe #35575 is tracking this.

mauropagano commented

I see the same behavior as @austinzh from plain Python (3.9.6), confirmed with 13.0 too:

ARROW_IO_THREADS=1 OMP_NUM_THREADS=1 poetry run python
Python 3.9.6 (default, Jul 27 2021, 22:14:48) 
[GCC 9.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from  pyarrow._s3fs import initialize_s3
>>> initialize_s3(num_event_loop_threads=1)
>>> import pyarrow as pa
>>> import os
>>> os.getpid()
15589

ps -eT -o "ppid,pid,tid,comm" | grep 15589
30694 15589 15589 python
30694 15589 15623 jemalloc_bg_thd
30694 15589 15624 AwsEventLoop 1
30694 15589 15625 AwsEventLoop 2
30694 15589 15626 AwsEventLoop 3
30694 15589 15627 AwsEventLoop 4
30694 15589 15628 AwsEventLoop 5
30694 15589 15629 AwsEventLoop 6
30694 15589 15630 AwsEventLoop 7
30694 15589 15631 AwsEventLoop 8
30694 15589 15632 AwsEventLoop 9
30694 15589 15633 AwsEventLoop 10
30694 15589 15634 AwsEventLoop 11
30694 15589 15635 AwsEventLoop 12
30694 15589 15636 AwsEventLoop 13
30694 15589 15637 AwsEventLoop 14
30694 15589 15638 AwsEventLoop 15
30694 15589 15639 AwsEventLoop 16

One of the annoying things is that this is also triggered just by importing pyarrow.parquet, no matter whether you are reading from S3 or not.

austinzh commented

I added a compile check in s3fs.cc and found that both the ubuntu_cpp and the Python manylinux builds fail if I add the following code:

#ifndef ARROW_S3_HAS_CRT
#error "ARROW_S3_HAS_CRT does not exist"
#endif

Here is the build command that triggers this error:

archery docker run  --env CMAKE_BUILD_TYPE=release python-wheel-manylinux-2014

I had to apply the following patch to remove the compile-time macro guard:

diff -r arrow/cpp/src/arrow/CMakeLists.txt apache-arrow-12.0.1/cpp/src/arrow/CMakeLists.txt
488,497d487
<     try_compile(S3_HAS_CRT ${CMAKE_CURRENT_BINARY_DIR}/try_compile
<                 SOURCES "${CMAKE_CURRENT_SOURCE_DIR}/filesystem/try_compile/check_s3fs_crt.cc"
<                 CMAKE_FLAGS "-DINCLUDE_DIRECTORIES=${CURRENT_INCLUDE_DIRECTORIES}"
<                 LINK_LIBRARIES ${AWSSDK_LINK_LIBRARIES} CXX_STANDARD 17)
<
<     if(S3_HAS_CRT)
<       message(STATUS "AWS SDK is new enough to have CRT support")
<       add_definitions(-DARROW_S3_HAS_CRT)
<     endif()
<
diff -r arrow/cpp/src/arrow/filesystem/s3fs.cc apache-arrow-12.0.1/cpp/src/arrow/filesystem/s3fs.cc
54d53
< #ifdef ARROW_S3_HAS_CRT
58d56
< #endif
2629d2626
< #ifdef ARROW_S3_HAS_CRT
2641d2637
< #endif

igozali commented Aug 25, 2023

I have this issue as well on pyarrow 13.0.0

igozali@host $ python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.fs, os; print(os.getpid())
3575184
>>> pyarrow.__version__
'13.0.0'
>>> os.cpu_count()
128

# On a separate shell
igozali@host $ ps -eT -o "ppid,pid,tid,comm" | grep 3575184
3505885 3575184 3575184 python
3505885 3575184 3575286 jemalloc_bg_thd
3505885 3575184 3575288 AwsEventLoop 1
3505885 3575184 3575289 AwsEventLoop 2
3505885 3575184 3575290 AwsEventLoop 3
3505885 3575184 3575291 AwsEventLoop 4
3505885 3575184 3575292 AwsEventLoop 5
3505885 3575184 3575293 AwsEventLoop 6
3505885 3575184 3575294 AwsEventLoop 7
3505885 3575184 3575295 AwsEventLoop 8
3505885 3575184 3575296 AwsEventLoop 9
3505885 3575184 3575297 AwsEventLoop 10
3505885 3575184 3575298 AwsEventLoop 11
3505885 3575184 3575299 AwsEventLoop 12
3505885 3575184 3575300 AwsEventLoop 13
3505885 3575184 3575301 AwsEventLoop 14
3505885 3575184 3575302 AwsEventLoop 15
3505885 3575184 3575303 AwsEventLoop 16
3505885 3575184 3575304 AwsEventLoop 17
3505885 3575184 3575305 AwsEventLoop 18
3505885 3575184 3575306 AwsEventLoop 19
3505885 3575184 3575307 AwsEventLoop 20
3505885 3575184 3575308 AwsEventLoop 21
3505885 3575184 3575309 AwsEventLoop 22
3505885 3575184 3575310 AwsEventLoop 23
3505885 3575184 3575311 AwsEventLoop 24
3505885 3575184 3575312 AwsEventLoop 25
3505885 3575184 3575313 AwsEventLoop 26
3505885 3575184 3575314 AwsEventLoop 27
3505885 3575184 3575315 AwsEventLoop 28
3505885 3575184 3575316 AwsEventLoop 29
3505885 3575184 3575317 AwsEventLoop 30
3505885 3575184 3575318 AwsEventLoop 31
3505885 3575184 3575319 AwsEventLoop 32
3505885 3575184 3575320 AwsEventLoop 33
3505885 3575184 3575321 AwsEventLoop 34
3505885 3575184 3575322 AwsEventLoop 35
3505885 3575184 3575323 AwsEventLoop 36
3505885 3575184 3575324 AwsEventLoop 37
3505885 3575184 3575325 AwsEventLoop 38
3505885 3575184 3575326 AwsEventLoop 39
3505885 3575184 3575327 AwsEventLoop 40
3505885 3575184 3575328 AwsEventLoop 41
3505885 3575184 3575329 AwsEventLoop 42
3505885 3575184 3575330 AwsEventLoop 43
3505885 3575184 3575331 AwsEventLoop 44
3505885 3575184 3575332 AwsEventLoop 45
3505885 3575184 3575333 AwsEventLoop 46
3505885 3575184 3575334 AwsEventLoop 47
3505885 3575184 3575335 AwsEventLoop 48
3505885 3575184 3575336 AwsEventLoop 49
3505885 3575184 3575337 AwsEventLoop 50
3505885 3575184 3575338 AwsEventLoop 51
3505885 3575184 3575339 AwsEventLoop 52
3505885 3575184 3575340 AwsEventLoop 53
3505885 3575184 3575341 AwsEventLoop 54
3505885 3575184 3575342 AwsEventLoop 55
3505885 3575184 3575343 AwsEventLoop 56
3505885 3575184 3575344 AwsEventLoop 57
3505885 3575184 3575345 AwsEventLoop 58
3505885 3575184 3575346 AwsEventLoop 59
3505885 3575184 3575347 AwsEventLoop 60
3505885 3575184 3575348 AwsEventLoop 61
3505885 3575184 3575349 AwsEventLoop 62
3505885 3575184 3575350 AwsEventLoop 63
3505885 3575184 3575351 AwsEventLoop 64

Attempting to set the number of event loop threads using initialize_s3(num_event_loop_threads=1) still shows 64 threads. Maybe an env var is needed here?

kou (Member) commented Aug 25, 2023

Use #37394 instead of this issue.
