Skip to content
This repository has been archived by the owner on Oct 15, 2020. It is now read-only.

Integrate Bed reader code from pysnptools #23

Closed
eric-czech opened this issue Aug 6, 2020 · 12 comments
Closed

Integrate Bed reader code from pysnptools #23

eric-czech opened this issue Aug 6, 2020 · 12 comments

Comments

@eric-czech
Copy link
Collaborator

To make our dependencies lighter and streamline the library a bit, it would be helpful to vendor in the pysnptools code for reading PLINK.

@eric-czech
Copy link
Collaborator Author

eric-czech commented Aug 10, 2020

This is a summary of an email exchange with @CarlKCarlK on concurrency within a single process in pysnptools.

I had asked if there was any reason we should expect thread-safety issues or if the c++ code ever had locks or other synchronizations in place. He very kindly tested this with a short example here: https://github.com/CarlKCarlK/sgkit-plink/blob/4365a350d9356f32b4aaef39b6340cf16eee45c2/docs/ipynb/multithread.ipynb. This indicates that some speedup occurs when bed reads are made concurrently.

His last question to me was whether or not I had some code that could reproduce some of the suboptimal parallelism or any other thread-related issues.

Hey @CarlKCarlK, can we pick this thread up here instead of over email?

@eric-czech
Copy link
Collaborator Author

eric-czech commented Aug 10, 2020

To your question though, the closest thing I have to a case the reproduces some low CPU utilization when using Bed concurrently in the same process is https://github.com/pystatgen/sgkit/issues/48#issuecomment-668665296.

It's not a huge problem though since it can be worked around with multiprocessing. My larger concern was that results might differ when doing concurrent reads. I haven't ever observed that in some small tests looking for equality with and without threading, and it would be good to get a sense from you what kind of unsynchronized state shared by threads could get corrupted or if you think there are some good ways to test for that.

For some more context, the default parallelism behavior in Dask use to use multiple threads in a single process. They do this because most of the code running on threads is not Python and wouldn't block on the GIL. It would be awesome if the PLINK reader had these some characteristics and it sounds like it does to an extent. The litmus test we're using in https://github.com/pystatgen/sgkit/issues/48 is simply whether or not CPU utilization is maxed out on all available CPUs during the entire reading process (as well as how long it takes overall). I suspect there is still some room for improvement in the Bed code since we get a 3x speedup from using multiprocessing rather than threading. Let me know if that make sense or if you think there is another way we should be looking at getting the highest possible read throughput.

If that does make sense, I'll happily put together a complete script for reproduction/testing.

@CarlKCarlK
Copy link
Collaborator

CarlKCarlK commented Aug 10, 2020 via email

@eric-czech
Copy link
Collaborator Author

eric-czech commented Aug 11, 2020

I’m happy to announce that the Bed reader is now not just thread safe, it is also multithreaded!

Awesome! How does the threading work in the c++? I wasn't able to tell in a glance at CarlKCarlK/sgkit-plink@7228d14.

Dask has a flag that would let us choose between reading array chunks in parallel or synchronizing so that .read would only be on one thread and parallelism would be left up to the underlying library. The upside of the former is that it means parallelism/memory restrications are hooked into the dask configuration system and users have more control over it. The upside of the latter is that it could be faster. I think we should make this switch and simply relay the num_threads setting from Dask if we can show that it is faster when parallelized at the c++ level instead.

I split that test out into a separate issue at #26.

Also thanks again @CarlKCarlK for jumping on these change so quickly.

@CarlKCarlK
Copy link
Collaborator

CarlKCarlK commented Aug 11, 2020 via email

@CarlKCarlK
Copy link
Collaborator

CarlKCarlK commented Aug 11, 2020 via email

@eric-czech
Copy link
Collaborator Author

Thank you @CarlKCarlK! I'll kick the tires on that soon.

In the meantime is it easy to add a test for results with num_threads > 1 being equivalent to results from num_threads=1 in https://github.com/CarlKCarlK/sgkit-plink/blob/d74c60c9fdd2ed6675a2eb2e87d7a4b85f826f36/sgkit_plink/tests/test_open_bed.py?

Those other tests look great! The test_properties function is a particularly good candidate for pytest.parametrize if you were interested in learning that. You don't have to actually do anything in the code to switch to using pytest btw, i.e. you could remove the main method in that test and then run something like this from the root level of the repo:

pytest -v --cov=sgkit_plink --cov-report term-missing -k open_bed

The "-k" filter would look for the string "open_bed" in any test file name or function name. Alternatively you could pass a path totest_open_bed.py and it would run any function that starts with "test_" in it.

That would also spit out a coverage report saying whether or not any lines of code (only Python code though) weren't tested.

@CarlKCarlK
Copy link
Collaborator

CarlKCarlK commented Aug 13, 2020 via email

@CarlKCarlK
Copy link
Collaborator

CarlKCarlK commented Aug 20, 2020 via email

@horta
Copy link

horta commented Aug 20, 2020

Linux: Everything working great. OpenMP linked statically.
Windows: Working, but not great. Intel doesn’t support static OpenMP on Windows, so it is using the Python ctypes library its own copy of the OpenMP DLL. Could there be versioning problems with other Intel or LVMM versions of the DLL? Maybe.
MacOS: Not working. Default Mac gcc doesn’t support openmp. Using switching to regular gcc breaks “delocate”, the Mac clone of audit and fix that cibuildwheel uses.

I do build a Python project that depends on a C library that does use OpenMP for Macos and Linux (not Windows, for now.): https://github.com/EBI-Metagenomics/imm-py
The CFFI building is done here: https://github.com/EBI-Metagenomics/imm-py/blob/master/imm/build_ext.py

The underlying C library is this one: https://github.com/EBI-Metagenomics/imm/
I looked up at the travis for MacOS and indeed it does not find OpenMP:

-- Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES) 
-- Could NOT find OpenMP (missing: OpenMP_C_FOUND) 

But that underlying C library can work without OpenMP. So I guess I just don't use OpenMP on MacOS for wheel buids...

@CarlKCarlK
Copy link
Collaborator

[Sorry if I wasn't suppose to check-in to master with a code review. etc.]

I just checked in 823e246
It switches sgkit-plink from PySnpTools to the alpha version bed-reader.
bed-reader needs more work to get the docs, etc, in shape, but testing coverage is almost 100% and performance should be good.

@eric-czech
Copy link
Collaborator Author

Closed by #30

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants