This repository has been archived by the owner on Nov 9, 2023. It is now read-only.

link Bokulich et al mock community test files from the QIIME Resources page #2105

Open
gregcaporaso opened this issue Nov 9, 2015 · 34 comments

Comments

@gregcaporaso
Contributor

I have all of the links in email. This is a great resource for testing new methods.

@gregcaporaso gregcaporaso self-assigned this Nov 9, 2015
@gregcaporaso gregcaporaso added this to the QIIME 1.9.2 milestone Nov 12, 2015
@colinbrislawn
Contributor

👍 Yes please. While these are on qiita, they are not easily accessible.

We could place them in a github repo, similar to www.github.com/torognes/vsearch-data

@colinbrislawn
Contributor

I'm still interested in this data. Other people are too.

I'm also interested in contributing to the resource page or to a new github repo for this data set. Let me know how I can help.

@gregcaporaso
Contributor Author

@colinbrislawn, if you're available to issue a PR adding the links to the QIIME Resources page (content is here), that'd be fantastic!

@gregcaporaso
Contributor Author

Also, ping @nbokulich so he's aware of this.

@colinbrislawn
Contributor

Sure thing! Should I link to the studies on qiita, on an FTP server, or in a github repo like I mentioned before?

@gregcaporaso
Contributor Author

I think Qiita is ideal, if @antgonza agrees. Otherwise FTP. The files are too big to make sense in a GitHub repo. Thank you!

@colinbrislawn
Contributor

I guess I kind of like hosting these on FTP and qiita, if possible.

The vsearch test data and the Mothur example data are really easy to download, and this encourages reuse. While I use and love qiita, it's new, and we can lower the barrier to entry with an FTP site. Could we host these files along with the other files on ftp://ftp.microbio.me?

@gregcaporaso
Contributor Author

Are the files hosted in Qiita accessible through FTP, @antgonza? I can't comment on ftp.microbio.me; that's a Knight Lab resource.

@colinbrislawn
Contributor

Oh, if we could hard link to the files in qiita, that would remove the barrier to entry without duplicating effort. That would be perfect!

@gregcaporaso
Contributor Author

Agree - that would be ideal.

@nbokulich

Greg, didn't we copy all the raw data to the taxa assignment github? Or is it still in the S3 bucket? I know we have these deposited somewhere outside of qiita already...

@gregcaporaso
Contributor Author

@nbokulich, thanks for the reminder. We did, and those links are here and other relevant data here. I thought we had these on S3, in which case we'd be paying for the data transfer and it's pretty expensive there, but these are all already on ftp.microbio.me. So, I think we're good to go, and we can link to these and to the Qiita studies.

All good @colinbrislawn?

@colinbrislawn
Contributor

I'm ready to start. Could you assign it to me?

I'll use ftp.microbio.me as much as possible, defaulting to the S3 links when needed. I'll also mention the qiita study IDs.

@colinbrislawn
Contributor

I'll do this in waves, starting with qiita and Bokulich, 2013

The original study mentions these data sets:

where data set number can be found in Supplementary Table 7: data set 1, 719; data set 2, 1685; data set 3, 1686; data set 4, 1626; data set 5, 1687; data set 6, 1688; data set 7, 1683; data set 8, 1684; data set 9, 1689; and data set 10, 1690.

From that list, these studies are missing from qiita:
https://qiita.ucsd.edu/study/description/719
https://qiita.ucsd.edu/study/description/1626
Any ideas @antgonza? Maybe these were split into 721 and 722?

Like you mentioned, this one is not yet publicly available:
https://qiita.ucsd.edu/study/description/1685

@josenavas
Member

@colinbrislawn, the IDs that you are seeing in the original study have been kept in Qiita, so you just need to put the study ID at the end of those links to reach each of them. @antgonza is working on getting all of them available through Qiita.
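As a rough illustration of "put the study ID at the end of those links", here is a minimal Python sketch. The base URL and the data set numbers are taken from the comments above; the names `QIITA_STUDY_BASE`, `SUPP_TABLE_7`, and `qiita_study_url` are hypothetical helpers, not part of QIIME or Qiita, and the sketch does not check whether a study is actually public.

```python
# Base URL as it appears in the links posted in this thread.
QIITA_STUDY_BASE = "https://qiita.ucsd.edu/study/description/"

# Data set number -> study ID, copied from the Supplementary Table 7
# quote earlier in this thread.
SUPP_TABLE_7 = {
    1: 719, 2: 1685, 3: 1686, 4: 1626, 5: 1687,
    6: 1688, 7: 1683, 8: 1684, 9: 1689, 10: 1690,
}


def qiita_study_url(study_id: int) -> str:
    """Build a study description URL by appending the study ID."""
    return f"{QIITA_STUDY_BASE}{study_id}"


if __name__ == "__main__":
    for number, study_id in sorted(SUPP_TABLE_7.items()):
        print(f"data set {number}: {qiita_study_url(study_id)}")
```

This only builds the links; whether each one resolves depends on the study being available in Qiita.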

@colinbrislawn
Contributor

Good to know. Once the links are live I will add them posthaste!

Which study are 1972 and 1973 from? Those aren't mentioned in the Nature Methods paper.
https://qiita.ucsd.edu/study/description/1973

@nbokulich

1972 AN

@nbokulich

Sorry, finger slip hit send prematurely.

1972 and 1973 are from a study we are working on now. It is unpublished, but you should include these; they are good 16S V4 mock community datasets. The ITS links listed will also be useful. They are included in the same study, which is described in this preprint: https://peerj.com/preprints/934.pdf.

1626 is actually 1517 (it was given a different ID when ported to qiita; credit to @antgonza for previously uncovering this).

719 was split into 721 (5' reads) and 722 (3' reads). Credit again goes to @antgonza for sleuthing a few months ago when we had this problem.

NOTE: some of these are not actually mock communities. The following IDs
are for natural communities that were analyzed in the 2013 paper:
1683
1684
1689
1690
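
The ID changes and exclusions described in this comment can be sketched as a small lookup. This is only an illustration in Python: the names `REMAPPED`, `NATURAL_COMMUNITIES`, and `current_mock_ids` are hypothetical, and the values are copied from the comment above rather than from any Qiita API.

```python
# Old study ID -> current Qiita ID(s), as described in the comment above.
REMAPPED = {
    1626: [1517],      # given a different ID when ported to Qiita
    719: [721, 722],   # split into 5' reads (721) and 3' reads (722)
}

# Natural (non-mock) communities analyzed in the 2013 paper.
NATURAL_COMMUNITIES = {1683, 1684, 1689, 1690}


def current_mock_ids(paper_ids):
    """Drop natural communities and translate old IDs to current ones."""
    resolved = []
    for study_id in paper_ids:
        if study_id in NATURAL_COMMUNITIES:
            continue
        resolved.extend(REMAPPED.get(study_id, [study_id]))
    return resolved


# Study IDs for data sets 1 to 10, in the order quoted from the paper.
paper_ids = [719, 1685, 1686, 1626, 1687, 1688, 1683, 1684, 1689, 1690]
print(current_mock_ids(paper_ids))
# -> [721, 722, 1685, 1686, 1517, 1687, 1688]
```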

@colinbrislawn
Contributor

Thanks for the info! I've added in those links.

I may be a bit slow on this Friday, but I'm having trouble connecting samples on the ftp site with the Qiita studies mentioned in Supp Table 7.
[image: screen shot 2016-02-12 at 3 56 32 pm]

Help!

@nbokulich

Yeah... I think the links on the FTP site use a different nomenclature.

I dug up the attached document, which should clear things up: it gives the old and new names, the links in qiita and on the FTP, and some details on the type of data.

Does that clear it up?

@nbokulich

Looks like the link won't attach. Here's the relevant text (or email me directly for the table):

| Eval Framework ID | Nature Methods ID | Eval Framework Link | QIITA ID / Link |
|---|---|---|---|
| B1 | Data set 5 | ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/S16S-1/ | http://qiita.ucsd.edu/study/description/1687 |
| B2 | Data set 6 | ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/S16S-2/ | http://qiita.ucsd.edu/study/description/1688 |
| B3 | Data set 2 | ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/Broad1/ | http://qiita.ucsd.edu/study/description/1685 |
| B4 | Data set 3 | ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/Broad2/ | http://qiita.ucsd.edu/study/description/1686 |
| B5 | NA | ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/Broad3/ | 1972 |
| B6 | Data set 1 | | http://qiita.ucsd.edu/study/description/719 |
| B7 | NA | ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/Turnbaugh1/ | 1319 |
| B8 | NA | ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/Turnbaugh2/ | 1973 |
| F1 | NA | ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/RDBW/ | 1974 |
| F2 | NA | ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/ITS_SAG/ | 1975 |
| NA | Data set 4 | ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/L18S-1/ | http://qiita.ucsd.edu/study/description/1626 |
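
For anyone scripting against this mapping, the rows above could be held in a small data structure. The following Python sketch copies the values exactly as posted in this comment (later comments in the thread revise some of the IDs). `MockDataset`, `DATASETS`, `ftp_url`, and `by_framework_id` are hypothetical names used only for illustration, and no network access is attempted.

```python
from typing import NamedTuple, Optional

# Common prefix of the FTP links posted in this comment.
FTP_BASE = "ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/"


class MockDataset(NamedTuple):
    framework_id: Optional[str]       # Eval Framework ID (None where "NA")
    nature_methods_id: Optional[int]  # data set number (None where "NA")
    ftp_folder: Optional[str]         # folder under FTP_BASE (None if not listed)
    qiita_id: int                     # Qiita study ID as posted


# Rows copied from the comment above, in the same order.
DATASETS = [
    MockDataset("B1", 5, "S16S-1", 1687),
    MockDataset("B2", 6, "S16S-2", 1688),
    MockDataset("B3", 2, "Broad1", 1685),
    MockDataset("B4", 3, "Broad2", 1686),
    MockDataset("B5", None, "Broad3", 1972),
    MockDataset("B6", 1, None, 719),
    MockDataset("B7", None, "Turnbaugh1", 1319),
    MockDataset("B8", None, "Turnbaugh2", 1973),
    MockDataset("F1", None, "RDBW", 1974),
    MockDataset("F2", None, "ITS_SAG", 1975),
    MockDataset(None, 4, "L18S-1", 1626),
]


def ftp_url(dataset: MockDataset) -> Optional[str]:
    """Return the FTP folder URL, or None when no folder was listed."""
    if dataset.ftp_folder is None:
        return None
    return f"{FTP_BASE}{dataset.ftp_folder}/"


def by_framework_id(framework_id: str) -> MockDataset:
    """Look up a row by its Eval Framework ID, e.g. "B3"."""
    for dataset in DATASETS:
        if dataset.framework_id == framework_id:
            return dataset
    raise KeyError(framework_id)


# Qiita IDs of rows that have no FTP folder in the posted text.
missing_ftp = [d.qiita_id for d in DATASETS if ftp_url(d) is None]
```

A named tuple keeps each row immutable and readable; a plain dict keyed by framework ID would also work, but one row has no framework ID.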

@antgonza
Contributor

GitHub doesn't like FTPs.

@colinbrislawn
Contributor

Oh thanks! I'll take another shot at it.

@colinbrislawn
Contributor

I've got most of this wrapped up in a PR. With all your help, I'm really close!

I have a quick question: the following folders and associated qiita studies are never mentioned in the Bokulich paper. Are these the data sets from that peerj paper? How should I present these?

| Eval Framework ID | Nature Methods ID | Eval Framework Link | QIITA ID / Link |
|---|---|---|---|
| B5 | NA | ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/Broad3/ | 1972 |
| B6 | Data set 1 | | http://qiita.ucsd.edu/study/description/719 |
| B7 | NA | ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/Turnbaugh1/ | 1319 |
| B8 | NA | ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/Turnbaugh2/ | 1973 |
| F1 | NA | ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/RDBW/ | 1974 |
| F2 | NA | ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/ITS_SAG/ | 1975 |

Conversely, I don't have FTP links for these qiita studies:

1626 (now 1517), 1683, 1684, 1689, 1690, 719 (now 721 and 722)

Thanks for helping me construct this.

@nbokulich

That's correct: the studies not mentioned in the Nature Methods paper are described in the PeerJ preprint.

Studies 1683, 1684, 1689, and 1690 are NOT mock communities. They are natural communities (i.e., real samples) that we examined in the 2013 paper. Hence, these are not on the FTP and not relevant to your current needs.

I'm not sure why 719 isn't on the FTP. I think there was another outside link that we used for this, and hence we didn't copy it. @gregcaporaso, this is the mock community from your 2011 PNAS paper... do you still have another link to these data?

The datasets on the FTP are those used in the PeerJ preprint. We dropped 1626 (now 1517) from the PeerJ preprint, since it is an 18S dataset and we wanted to focus on 16S and ITS. Hence, it is not on the FTP.

@colinbrislawn
Contributor

I was planning to wait for the official publication before posting the data sets from the PeerJ paper. Should I just post them now?

I'm pretty close to finishing the 2013 paper. I was hoping to add the 1683, 1684, 1689, and 1690 studies. While they are not mock communities, they are included in the paper. I guess I'll just use qiita links if they are not on the server...

@gregcaporaso
Contributor Author

I don't think there's any reason to wait for the peer-reviewed publication. I also don't think you should post 1683, 1684, 1689, or 1690. Since those are not mock communities, they'll just add some confusion about what this resource is. Let's keep it to just the mock communities.

@gregcaporaso
Contributor Author

@nbokulich, aren't the Turnbaugh 1 sequences the ones from my 2011 PNAS paper?

@nbokulich

Yes, Turnbaugh 1 = your 2011 PNAS paper.

@colinbrislawn
Contributor

Just to be sure,
Turnbaugh1 == qiita 721 == One half of Data set 1
Turnbaugh2 == qiita 722 == Other half of Data set 1

@benjjneb

I was wondering if there were reference sequences for the Bokulich mock communities that were analyzed in "Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing"?

What I'm looking for is something like a list of the 16S sequences present in each of the mock community strains (e.g., http://www.mothur.org/MiSeqDevelopmentData/HMP_MOCK.fasta). All I can find in the original paper are lists of the strains used. The sequences themselves would be very useful to have for certain benchmarking purposes!

@nbokulich

@colinbrislawn: Incorrect.
qiita 721 == forward reads for Turnbaugh1. This was also formerly known as qiime 719.
qiita 722 == reverse reads for Turnbaugh1

Turnbaugh2 == qiita 1973

@benjjneb: No, we do not have such reference sequences compiled yet. We are currently working on this, and I believe you had emailed us about this already. We will post these as a public resource once they are available.

@colinbrislawn
Contributor

OK, I think my PR is done. Can someone review #2130 ?

Should I include study 1626 / 1517? It's 18S, not used in the preprint, and private on qiita.
NA Data set 4 ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/L18S-1/ http://qiita.ucsd.edu/study/description/1517

@gregcaporaso gregcaporaso removed their assignment Sep 11, 2018