High level file processing #76
This all seems great to me. I'd prefer slightly more verbose string macros for readability (e.g. …)
I think our file format packages are more like libraries that provide composable elements/structures and algorithms for use in the Julia ecosystem. A command-line interface is probably not a like-for-like comparison to our file format packages, but the comparison certainly demonstrates a desired ease of use. I prefer the series-of-iterators-like approaches where you can pipe from source to sink. And like @kescobo, I prefer more verbose names.

FileIO sets a precedent for loading and saving files. I think part of the solution, whatever the approach, is registering our file format packages with FileIO and implementing their streaming IO interface -- I have always wanted to register the BioJulia file format packages with FileIO.

In terms of designing interfaces and composable data structures, maybe the BioJulia tutorials are a good place to do that. The tutorials would lay out the common tasks and goals, and then the process of writing the tutorials will identify gaps/missing features and sore points. And where there are gaps or sore points, I suspect writing those sections in the way we wish to use the packages will show us or help us design the interfaces. What do you reckon?
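For what it's worth, registering a format with FileIO would presumably look something like the following (a hedged sketch based on FileIO's `add_format`; the magic byte, extensions, and loader package name here are illustrative, not an existing registration):

```julia
using FileIO

# Hypothetical registration: FASTA files conventionally begin with '>',
# which can serve as the magic byte. :FASTAFileIO is an invented loader
# package name for the sake of the sketch.
add_format(format"FASTA", b">", [".fasta", ".fa", ".fna"], [:FASTAFileIO])
```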
💯💯💯 I've just recently started the process of reigniting that repo - it hadn't really been touched since Julia 0.7. I definitely think that's (one of) the right place(s) to solidify the expected interface.
So I'm a bit torn here. On one hand, I completely agree that we should play to Julia's strengths of enabling convenient high-level syntax suitable for interactive use, while also enabling stricter syntax that is more error resistant, and more optimised, for libraries. And I, too, am a bit tired of the pattern:

```julia
open(path) do io
    io = last(splitext(path)) == ".gz" ? GzipDecompressorStream(io) : io
    reader = FASTA.Reader(io)
    for record in reader
        # ...
    end
    close(reader)
end
```

...which certainly feels like it could be better. That was my original intent for making …

On the other hand, however, I am a bit wary of conflating low- and high-level packages. The two pitfalls I see are: …
Okay, after writing this I'm pretty sure I'm on board with this, @SabrinaJaye. For your concrete proposals: yes, I think it's a good idea! I think we should just use the existing … Perhaps a good idea would be to "just" have readers be iterators of records (any headers can be accessed on instantiation like …). Lots of bikeshedding potential there. I don't think the details matter too much. An improvement is an improvement.

W.r.t. tutorials, yes, that would be a nice improvement. But that's a separate issue I think - no matter the outcome of the proposal here, updating the tutorials would be a good idea.
Ok cool. It's funny you mention Transducers, as this issue got me onto reading about those - but I think, given that library works with iterators as well, that if we do this right, Transducers and iterate methods should ideally just work - 'cause Julia is awesome like that. I'll make some smaller PRs that move towards some of these improvements.
I think at a minimum it would be great to have …

On that, I also wrote something along these lines: https://github.com/jonathanBieler/BioRecordsProcessing.jl. I think my examples look pretty good (there's minimal boilerplate) but you can't specify source and sink; it's only disk-to-disk.

One concern I have is that having sinks and sources with processing in between isn't specific to biological records, and when you think about adding parallel processing it starts to look a lot like Dagger.jl or other generic frameworks. That is to say, this doesn't really concern the FASTX, BED, ... packages (or only in making sure they are compatible with those tools). I think it becomes a bit more relevant if you include domain-specific knowledge and handle common use cases as easily as possible (I have some examples in my readme, but we could also look at common samtools/vcftools/... usages). What I have in BioRecordsProcessing is 1) processing paired files, 2) processing whole folders in parallel, and 3) grouping records before processing. Another one would be reading only selected regions of indexed files.
In the queryverse ecosystem, it seems that the convention is to have a separate package that implements the …
Looking at the initial examples with …, I think

```julia
Iterators.filter(fn::Function) = itr -> Iterators.filter(fn, itr)
```

I wonder why this is not already in the standard Julia libraries. I s'pose it's difficult to decide whether eager or lazy variants should be distinguished.
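With that curried method defined, filter stages compose into pipelines, for example:

```julia
# Partial application returns a function of the iterator,
# so stages can be chained with |>.
keep_big = Iterators.filter(x -> x > 3)
collect(1:6 |> keep_big)  # [4, 5, 6]
```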
I think we would do the usability of BioJulia a lot of favours if we made it easier to use the file format packages to do the processing of HTS files a la seqtk, samtools etc.
It's currently not as convenient to do the equivalent in Julia - comparing to seqtk - as it is to just do

```sh
seqtk seq -a in.fq.gz > out.fa
```

or

```sh
seqtk subseq in.fq name.lst > out.fq
```
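For comparison, the first one-liner currently takes something like the following in Julia (a sketch assuming FASTX.jl and CodecZlib.jl; the record accessors may differ slightly between FASTX versions):

```julia
using FASTX, CodecZlib

reader = FASTQ.Reader(GzipDecompressorStream(open("in.fq.gz")))
writer = FASTA.Writer(open("out.fa", "w"))
for record in reader
    # Drop the quality information by rebuilding each record as FASTA.
    write(writer, FASTA.Record(identifier(record), sequence(record)))
end
close(reader)
close(writer)
```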
One of Julia's strengths - even if you're not using the parts of BioJulia that provide efficient core data structures and algorithms - is the set of features that make it a pretty great glue language. So the ability to very tersely express the processing of HTS files at a high level is important.
There are a few proposals and options I'm considering to improve this area, and I wanted to get feedback or other ideas.
1. The `open(FASTQ/A.Reader, ...)` pattern gets irritating after a while. I'd like to propose a filehandle kind of string literal that, when used in BioJulia ecosystem functions, does that for you - e.g. the sketch below. It could even detect FASTA or FASTQ at the metaprogramming stage by inspecting the string. These could generate the code to open a stream, or just create a filehandle type that causes dispatch to use a transcribe function that does the appropriate opens.
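A minimal sketch of what such a literal could look like; the `@fastq_str`/`@fasta_str` macros and the `FileHandle` type are invented names for illustration, not an existing API:

```julia
# Hypothetical: a string literal tags a path with its format, so
# downstream functions can dispatch on it and open the right reader.
struct FileHandle{format}
    path::String
end

macro fastq_str(path)
    :(FileHandle{:fastq}($path))
end

macro fasta_str(path)
    :(FileHandle{:fasta}($path))
end

# Usage: fh = fastq"in.fq.gz"
# A method like openreader(fh::FileHandle{:fastq}) could then wrap the
# stream in a gzip decompressor when the path ends in ".gz".
```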
2. A lot of file processing involves opening a source, doing some stuff, and writing to a sink. I wonder if we should implement that as a function (or functions), and then express e.g. `transcribe` using that instead. With 1, this could look something like: …

We could also have a few of these methods, e.g. a `filter`, `transform`, etc. that make various assumptions about the functions. Or we could just have a more general function where, if the provided function returns a record, it gets written, and if it returns nothing, it doesn't. That way a user can make the decision about dropping records in the same function that may also transform records. A sketch of that general form follows below.
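For instance, the general record-or-nothing version might look like this (`process` is a hypothetical function name, not an existing API):

```julia
# Hypothetical source→sink driver: fn may transform a record, or
# return nothing to drop it from the output.
function process(fn, reader, writer)
    for record in reader
        result = fn(record)
        result === nothing || write(writer, result)
    end
end

# e.g. keep only records at least 50 bp long (assuming FASTX.jl):
# process(rec -> length(sequence(rec)) >= 50 ? rec : nothing, reader, writer)
```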
Another alternative to this would be to do something similar to what I did with the DSL in Pseudoseq: https://github.com/bioinfologics/Pseudoseq.jl/blob/master/src/sequencing/DSL.jl

That is, provide a series of `Iterators`-like types that wrap either a reader or another iterator, and that can be plugged into a writer - so more like the sketch below. Basically I'm angling for a kind of API that approaches a DSL on top of the readers and writers.
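A hedged sketch of one such wrapper type (the `Filtered` name is invented for illustration):

```julia
# Hypothetical wrapper: lazily yields only the records from the inner
# iterator (a reader, or another wrapper) that satisfy the predicate.
struct Filtered{F,I}
    pred::F
    inner::I
end

function Base.iterate(it::Filtered, state...)
    next = iterate(it.inner, state...)
    while next !== nothing
        record, st = next
        it.pred(record) && return (record, st)
        next = iterate(it.inner, st)
    end
    return nothing
end

Base.IteratorSize(::Type{<:Filtered}) = Base.SizeUnknown()

# Usage:
# for record in Filtered(rec -> length(sequence(rec)) >= 50, reader)
#     write(writer, record)
# end
```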
One such iterator could be one that modifies the iteration behaviour of an unwrapped reader, to do in-place record reading rather than the default of allocating a new record copy every time - e.g. the sketch below.
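Something like the following, assuming the reader supports in-place reads via `read!` and end-of-input checks via `eof`, as the FASTX.jl readers do (the `InPlace` wrapper name is invented):

```julia
# Hypothetical wrapper: reuses a single record buffer across iterations
# instead of allocating a fresh record for each one.
struct InPlace{R}
    reader::R
end

function Base.iterate(it::InPlace, record=FASTA.Record())
    eof(it.reader) && return nothing
    read!(it.reader, record)  # overwrite the buffer record in place
    return (record, record)
end

Base.IteratorSize(::Type{<:InPlace}) = Base.SizeUnknown()
```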
Or an alternative to this - we just try to provide as many individual functions like filter, trim, transcode etc. as possible and hope we cover everyone's common needs.
pinging @jakobnissen @CiaranOMara @kescobo