-
Notifications
You must be signed in to change notification settings - Fork 4
Integrate Bed reader code from pysnptools #23
Comments
This is a summary of an email exchange with @CarlKCarlK on concurrency within a single process in pysnptools. I had asked if there was any reason we should expect thread-safety issues or if the c++ code ever had locks or other synchronizations in place. He very kindly tested this with a short example here: https://github.com/CarlKCarlK/sgkit-plink/blob/4365a350d9356f32b4aaef39b6340cf16eee45c2/docs/ipynb/multithread.ipynb. This indicates that some speedup occurs when bed reads are made concurrently. His last question to me was whether or not I had some code that could reproduce some of the suboptimal parallelism or any other thread-related issues. Hey @CarlKCarlK, can we pick this thread up here instead of over email? |
To your question though, the closest thing I have to a case the reproduces some low CPU utilization when using Bed concurrently in the same process is https://github.com/pystatgen/sgkit/issues/48#issuecomment-668665296. It's not a huge problem though since it can be worked around with multiprocessing. My larger concern was that results might differ when doing concurrent reads. I haven't ever observed that in some small tests looking for equality with and without threading, and it would be good to get a sense from you what kind of unsynchronized state shared by threads could get corrupted or if you think there are some good ways to test for that. For some more context, the default parallelism behavior in Dask use to use multiple threads in a single process. They do this because most of the code running on threads is not Python and wouldn't block on the GIL. It would be awesome if the PLINK reader had these some characteristics and it sounds like it does to an extent. The litmus test we're using in https://github.com/pystatgen/sgkit/issues/48 is simply whether or not CPU utilization is maxed out on all available CPUs during the entire reading process (as well as how long it takes overall). I suspect there is still some room for improvement in the Bed code since we get a 3x speedup from using multiprocessing rather than threading. Let me know if that make sense or if you think there is another way we should be looking at getting the highest possible read throughput. If that does make sense, I'll happily put together a complete script for reproduction/testing. |
Eric,
I’m happy to announce that the Bed reader is now not just thread safe, it is also multithreaded!
A notebook showing usage and times is here:
https://github.com/CarlKCarlK/sgkit-plink/blob/frompysnptools/docs/ipynb/multithread.ipynb
Disclaimer: I’m still evolving the “open_bed” API, so there will be minor API changes.
If you want to try it:
* Get code form: https://github.com/CarlKCarlK/sgkit-plink/tree/7228d147
* Set your PYTHONPATH to that directory
* Build the C++ extensions with
* python setup.py build_ext –inplace
For example on my 6 proc (12 with hyperthreading) machine reading from a fast SSD drive:
For 487K x 10K:
open_bed(bigfile,shape=shape,num_threads=1).read(np.s_[:,:10*1000]).shape
Takes 11.3 seconds
open_bed(bigfile,shape=shape,num_threads=12).read(np.s_[:,:10*1000]).shape
Takes 3.03 seconds and pegs the CPUs
Likewise, for 1000 x 100K:
open_bed(bigfile,shape=shape,num_threads=1).read(np.s_[:1000,:100*1000]).shape
38.1 seconds
open_bed(bigfile,shape=shape).read(np.s_[:1000,:100*1000]).shape
10.6 seconds (pegs the drive)
* Carl
|
Awesome! How does the threading work in the c++? I wasn't able to tell in a glance at CarlKCarlK/sgkit-plink@7228d14. Dask has a flag that would let us choose between reading array chunks in parallel or synchronizing so that I split that test out into a separate issue at #26. Also thanks again @CarlKCarlK for jumping on these change so quickly. |
Eric writes:
* How does the threading work in the c++?
I’m using the OpenMP library<https://en.wikipedia.org/wiki/OpenMP>. The core code is below. The only thing new is the “#prama omp” statements (plus some new compiler settings.).
(My first task today is to make sure the setup includes all needed dependencies. FaST-LMM already uses OpenMP, so I can just copy from myself.).
The advantages of multithreading at the C++ level is:
* The file is opened only once and the threads share the file pointer.
* The threads share inputs and output memory.
You’ll notice it only parallelizes across variants (aka SNPs). That’s a disadvantage, but doubt that any threading across samples (aka individuals) would improve throughput because of the layout of the *.bed file.
* Carl
|
Greetings,
I’ve created an alpha wheel for Python 3.7 on Linux (and Windows, too, and [maybe in a while Mac]).
You call install with:
pip install https://github.com/CarlKCarlK/sgkit-plink/releases/download/0.0.1/sgkit_plink-0.0.1-cp37-cp37m-linux_x86_64.whl
Below is a script for testing the speed (it requires a sample file download). Run with
python bedspeedtest.py
(This all corresponds to commit https://github.com/CarlKCarlK/sgkit-plink/tree/d74c60c9)
* Carl
bedspeedtest.py
import time
import numpy as np
from sgkit_plink._open_bed import open_bed
#Can get file from https://www.dropbox.com/sh/xluk9opjiaobteg/AABgEggLk0ZoO0KQq0I4CaTJa?dl=0
#bigfile = r'M:\deldir\genbgen\2\merged_487400x220000.1.bed'
bigfile = '/mnt/m/deldir/genbgen/2/merged_487400x220000.1.bed'
slicer = np.s_[4000:6000,:20000]
with open_bed(bigfile) as bed:
for order in ['C','F']:
for dtype in ['float32','float64','int8']:
start = time.time()
val = bed.read(slicer,order=order,dtype=dtype)
print(f'{order},{dtype},{time.time()-start}')
|
Thank you @CarlKCarlK! I'll kick the tires on that soon. In the meantime is it easy to add a test for results with Those other tests look great! The test_properties function is a particularly good candidate for pytest.parametrize if you were interested in learning that. You don't have to actually do anything in the code to switch to using pytest btw, i.e. you could remove the main method in that test and then run something like this from the root level of the repo:
The "-k" filter would look for the string "open_bed" in any test file name or function name. Alternatively you could pass a path to That would also spit out a coverage report saying whether or not any lines of code (only Python code though) weren't tested. |
Eric,
I just added a thread test. (I used an old established version of PySnpTools to create the reference). The code passed the treads test on the first try.
I also just ported all the Bed-related tests from PySnpTools.
With respect to testing, I have more to do:
* add tests for coverage
* put .write() tests back in
* Update to use the modern PyTest params, etc y’all are using.
* Carl
|
[Adding Danilo via email for is knowledge about cibuildwheel and general wisdom. [Do @mentions in Github work for people not in the project?])
Summary:
* Status of bed-reader – Plan to have it live with fastlmm/pysntools, not with SGKit, to minimize dependencies, but it will still be super easy and good for SGKit to use.
* I’m happy with API and performance
* Because of wheel problems on Mac (and to a lesser degree Windows) there will be both OpenMP and single-threaded versions of the C++ read methods.
Details:
Jeff & Eric and I talked about making a low-dependency stand-alone PLINK Bed reader part of SGKit. I now think I’d like to just be part of the that fastlmm/pysnptools project. This lets it avoid dependencies on the SGKit project (and Dask, etc).
In its current state, it depends only on numpy,pandas,dataclass. I hope there are no hard feelings with this decision. I still consider SGKit and PySnpTools to be my two main customers.
I’ve put it at https://github.com/fastlmm/bed-reader and am calling it “bed_reader”. Usage looks like:
>> from bed_reader import open_bed
>>
>> with open_bed("distributed_bed_test1_X.bed") as bed:
... print(bed.iid)
... print(bed.chromosome)
... print(bed.read())
I’m happy with API and performance. Specifically, users can now access (and/or override) all the information in the *.fam and *.bim files. However, if they never ask for that information, those files are never opened. The reading function supports full slicing and is (by default) multithreaded. Both the reading and writing functions use a “sparse” dictionary of metadata (family id, iid, sex, bp_position, etc) to keep their number of inputs reasonable.
Wheel Problems:
I’m using cibuildwheel to test and build binaries for linux, windows, and mac x {3.7,3.8). Status:
Linux: Everything working great. OpenMP linked statically.
Windows: Working, but not great. Intel doesn’t support static OpenMP on Windows, so it is using the Python ctypes library its own copy of the OpenMP DLL. Could there be versioning problems with other Intel or LVMM versions of the DLL? Maybe.
MacOS: Not working. Default Mac gcc doesn’t support openmp. Using switching to regular gcc breaks “delocate”, the Mac clone of audit and fix that cibuildwheel uses.
Proposal:
Create both OpenMP and single-treaded-only versions of the C++ read function. MacOS will only use single-threaded. Linux and Windows users can choose.
[This will be a pain because it means the number of C++ read functions goes from 6 to 12: {int8,float32,float64}x{order F,C}x{OpenMP,single}. These are all created with C++ Macros, but Python has know about all 12.
I’m very open to comments and suggestions on any topic from the names of things to plans.
Thanks,
* Carl
From: Carl KADIE
Sent: Wednesday, August 12, 2020 5:05 PM
To: pystatgen/sgkit-plink <[email protected]>; pystatgen/sgkit-plink <[email protected]>
Cc: Mention <[email protected]>
Subject: RE: [pystatgen/sgkit-plink] Integrate Bed reader code from pysnptools (#23)
Eric,
I just added a thread test. (I used an old established version of PySnpTools to create the reference). The code passed the treads test on the first try.
I also just ported all the Bed-related tests from PySnpTools.
With respect to testing, I have more to do:
* add tests for coverage
* put .write() tests back in
* Update to use the modern PyTest params, etc y’all are using.
* Carl
From: Eric Czech <[email protected]<mailto:[email protected]>>
Sent: Wednesday, August 12, 2020 8:46 AM
To: pystatgen/sgkit-plink <[email protected]<mailto:[email protected]>>
Cc: Carl Kadie <[email protected]<mailto:[email protected]>>; Mention <[email protected]<mailto:[email protected]>>
Subject: Re: [pystatgen/sgkit-plink] Integrate Bed reader code from pysnptools (#23)
Thank you @CarlKCarlK<https://eur06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FCarlKCarlK&data=02%7C01%7C%7Cbb4e63594fe2453aa62708d83ed6d234%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637328439654572143&sdata=Znr971HS1KX4qjrHh86%2BRiDVsNIsuC8x3gg3D%2FDtD2s%3D&reserved=0>! I'll kick the tires on that soon.
In the meantime is it easy to add a test for results with num_threads > 1 being equivalent to results from num_threads=1 in https://github.com/CarlKCarlK/sgkit-plink/blob/d74c60c9fdd2ed6675a2eb2e87d7a4b85f826f36/sgkit_plink/tests/test_open_bed.py<https://eur06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FCarlKCarlK%2Fsgkit-plink%2Fblob%2Fd74c60c9fdd2ed6675a2eb2e87d7a4b85f826f36%2Fsgkit_plink%2Ftests%2Ftest_open_bed.py&data=02%7C01%7C%7Cbb4e63594fe2453aa62708d83ed6d234%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637328439654582138&sdata=AQYL7foEC0bNnNQ%2BqHQ%2FUgPCvvYIQAsS1fHYXp%2BTp8w%3D&reserved=0>?
Those other tests look great! The test_properties<https://eur06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FCarlKCarlK%2Fsgkit-plink%2Fblob%2Fd74c60c9fdd2ed6675a2eb2e87d7a4b85f826f36%2Fsgkit_plink%2Ftests%2Ftest_open_bed.py%23L62&data=02%7C01%7C%7Cbb4e63594fe2453aa62708d83ed6d234%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637328439654582138&sdata=j6P9CpGjAcKcKS36PiOunKx%2Bm%2FQzRj7qpWIMOIWRZqs%3D&reserved=0> function is a particularly good candidate for pytest.parametrize<https://eur06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.pytest.org%2Fen%2Fstable%2Fparametrize.html&data=02%7C01%7C%7Cbb4e63594fe2453aa62708d83ed6d234%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637328439654582138&sdata=eLtErzvBj2a7NWHGHn1zvrXaDK0Na9n9xPeuEQzaGJM%3D&reserved=0> if you were interested in learning that. You don't have to actually do anything in the code to switch to using pytest btw, i.e. you could remove the main method in that test and then run something like this from the root level of the repo:
pytest -v --cov=sgkit_plink --cov-report term-missing -k open_bed
The "-k" filter would look for the string "open_bed" in any test file name or function name. Alternatively you could pass a path totest_open_bed.py and it would run any function that starts with "test_" in it.
That would also spit out a coverage report saying whether or not any lines of code (only Python code though) weren't tested.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<https://eur06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fpystatgen%2Fsgkit-plink%2Fissues%2F23%23issuecomment-672954123&data=02%7C01%7C%7Cbb4e63594fe2453aa62708d83ed6d234%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637328439654592131&sdata=V2oEyikiXOziD1mKG7UPthBHv%2FBi%2Bj%2B81VQFvSRk%2FTM%3D&reserved=0>, or unsubscribe<https://eur06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABR65P5JMXOGG6PJFXQJOW3SAK2LVANCNFSM4PWSV7VQ&data=02%7C01%7C%7Cbb4e63594fe2453aa62708d83ed6d234%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637328439654592131&sdata=Q7s5sF96w9OgV2SCCabLsf%2BrCcrGKbsnJMGskY%2B7w10%3D&reserved=0>.
|
I do build a Python project that depends on a C library that does use OpenMP for Macos and Linux (not Windows, for now.): https://github.com/EBI-Metagenomics/imm-py The underlying C library is this one: https://github.com/EBI-Metagenomics/imm/
But that underlying C library can work without OpenMP. So I guess I just don't use OpenMP on MacOS for wheel buids... |
[Sorry if I wasn't suppose to check-in to master with a code review. etc.] I just checked in 823e246 |
Closed by #30 |
To make our dependencies lighter and streamline the library a bit, it would be helpful to vendor in the pysnptools code for reading PLINK.
The text was updated successfully, but these errors were encountered: