Reference data set for automated build testing #9

Open
sven1103 opened this issue Dec 27, 2017 · 8 comments
@sven1103
Collaborator

sven1103 commented Dec 27, 2017

Hey guys,

maybe it is too early for that, but I was thinking about which reference data sets to use for pipeline evaluation and automated build testing.

We can use this thread to collect ideas :)

@sven1103 sven1103 changed the title Reference data set for automated built testing Reference data set for automated build testing Dec 27, 2017
@subwaystation
Collaborator

I think at some point we will want to compare our results with the output of the CellRanger pipeline, right?
Would it make sense to include some datasets from 10X (https://support.10xgenomics.com/single-cell-gene-expression/datasets)? That would give us a first direct comparison.
But we should include other datasets as well.
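
One way such a comparison could look, as a rough sketch only: assume each pipeline's output has already been reduced to a mapping from gene ID to total count (a hypothetical input format, not something CellRanger emits directly). The function names `pearson` and `compare_gene_totals` are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def compare_gene_totals(ours, reference):
    """Compare per-gene total counts from two pipelines.

    Both arguments map gene ID -> total count (assumed format).
    """
    shared = sorted(set(ours) & set(reference))
    return {
        "shared_genes": len(shared),
        # Genes reported by only one of the two pipelines
        "only_ours": sorted(set(ours) - set(reference)),
        "only_reference": sorted(set(reference) - set(ours)),
        # Agreement on the genes both pipelines report
        "pearson_r": pearson([ours[g] for g in shared],
                             [reference[g] for g in shared]),
    }
```

A single correlation value hides a lot, so this is probably only useful as a first smoke test in automated builds, not as a real benchmark.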

@wikiselev

I was actually wondering whether we go with CellRanger or do it a different way?

@ewels
Member

ewels commented Jan 4, 2018

It's never too early! I've found it super useful to have these for early development work. If possible it's best to find something from yeast / an organism with a small reference genome, to keep the file size small. Otherwise we'll need to mess around subsampling the data to a single chromosome or something to make the tests run quickly (possible, but a faff).
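
For reference, the single-chromosome route could be sketched roughly like this, assuming the reads have already been aligned and are available as SAM text. The function name `keep_chromosome` is hypothetical, and the sketch ignores paired-end mates that map to other chromosomes; the kept alignments would still need converting back to FASTQ to serve as pipeline input.

```python
def keep_chromosome(sam_lines, chrom):
    """Yield only the SAM lines that belong to one reference sequence.

    sam_lines: iterable of SAM-format text lines.
    chrom: reference sequence name to keep, e.g. "chrI".
    """
    for line in sam_lines:
        if line.startswith("@"):
            # Header line: drop @SQ entries for other sequences, keep the rest.
            if line.startswith("@SQ"):
                fields = line.rstrip("\n").split("\t")
                if f"SN:{chrom}" not in fields[1:]:
                    continue
            yield line
            continue
        fields = line.split("\t")
        # Column 3 (index 2) of an alignment record is RNAME, the reference name.
        if len(fields) > 2 and fields[2] == chrom:
            yield line
```

Unmapped reads (RNAME `*`) are dropped too, which is usually what you want for a small test fixture.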

@ewels
Member

ewels commented Jan 4, 2018

@wikiselev - as to which tools to use, probably nice to create a separate issue for that. But also check out ideas.md if you haven't already. I think it was @subwaystation's idea that we'd want to compare output to cellranger, not necessarily run cellranger.

Phil

@sven1103
Collaborator Author

sven1103 commented Jan 8, 2018

@wikiselev - Currently, we would not just "rebuild" CellRanger. I would rather regard it as a reference pipeline, but we are free to build it differently, depending on what we find out over the next few weeks. I think we should probably schedule a new hangout call for further discussion :)

@wikiselev

I feel that CellRanger is widely used and in demand, so it makes sense to start by rebuilding it. Also, keeping in mind that it's 10X's own solution, I doubt we can do significantly better.

@apeltzer
Collaborator

apeltzer commented Jan 8, 2018

I agree - we should start with CellRanger and then improve upon that once we have something working reasonably well.

@sven1103
Collaborator Author

sven1103 commented Jan 8, 2018

@wikiselev I mean, don't get me wrong, CellRanger might be a good customised solution. IMHO the first goal would be to put it into a scientific workflow framework, with stable tool environments using Singularity as the container solution, and to give the community the possibility to easily install and run it on any cluster and have it reproducible.

A modular design should make it easier to customise the pipeline (e.g. swapping in a different mapper). Moreover, this would be a good basis for future benchmarks of the pipeline.

I do not completely agree on the performance point. Take the duplicate removal step, for example: I would really like to see the performance differences between different tools here, as this is a crucial step :)
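
To make concrete what a duplicate removal step does, here is a minimal sketch that assumes reads have already been reduced to (cell barcode, UMI, gene) tuples, which is a hypothetical input format. It collapses exact duplicates only; real tools may additionally merge UMIs that differ by sequencing errors, which this sketch does not attempt, and that is exactly where tools could differ. The name `count_unique_molecules` is made up for illustration.

```python
from collections import defaultdict


def count_unique_molecules(reads):
    """Count distinct UMIs per (cell barcode, gene) pair.

    reads: iterable of (cell_barcode, umi, gene) tuples (assumed format).
    Returns {(cell_barcode, gene): number of distinct UMIs}.
    """
    umis = defaultdict(set)
    for cell, umi, gene in reads:
        # Reads sharing cell, gene and UMI are treated as PCR duplicates
        # of one molecule, so the set keeps a single entry for them.
        umis[(cell, gene)].add(umi)
    return {key: len(seen) for key, seen in umis.items()}
```

Running several deduplication strategies on the same small reference data set and comparing the resulting counts would be one way to see the differences.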


5 participants