Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High count of no-barcode that map to my targets #363

Closed
lborcard opened this issue Jun 20, 2024 · 15 comments
Closed

High count of no-barcode that map to my targets #363

lborcard opened this issue Jun 20, 2024 · 15 comments
Labels
question Further information is requested

Comments

@lborcard
Copy link

Dear Readfish team,

Thank you for your work. I am using readfish to enrich for certain targets and I am testing it with a simulated run ( 2hours of WGS). It seems that since dorado there is a work around to deal with the empty barcode signal (ref issue commented by matt Loose). I generally am under the impression that I find a lot of no_barcode_found and unclassified barcodes when doing a simulation with readfish. I was wondering if this was due to the simulation run or an issue linked to Dorado ?

Readfish simulated run with AS

Condition          ,Reads     ,Alignments      ,Alignments          ,Alignments ,Yield             ,Yield               ,Yield     ,Yield   ,Median read lengths ,Median read lengths ,Median read lengths ,Number of targets ,Percent target ,Estimated coverage
                   ,Total     ,On-Target       ,Off-Target          ,Total      ,On-Target         ,Off-Target          ,Total     ,Ratio   ,On-target           ,Off-target          ,Combined            ,Total             ,Total
barcode01          ,"63,753"  ,"4,227 (6.63%)" ,"59,526 (93.37%)"   ,"63,753"   ,10.12 Mb (23.36%) ,33.22 Mb (76.64%)   ,43.34 Mb  ,1:3.28  ,0 b                 ,556 b               ,556 b               ,0                 ,0.00%          ,0.00 X
barcode02          ,"74,569"  ,"5,005 (6.71%)" ,"69,564 (93.29%)"   ,"74,569"   ,12.28 Mb (23.53%) ,39.91 Mb (76.47%)   ,52.20 Mb  ,1:3.25  ,0 b                 ,565 b               ,565 b               ,0                 ,0.00%          ,0.00 X
classified_reads   ,77        ,0 (0.00%)       ,77 (100.00%)        ,77         ,0 b (0.00%)       ,88.17 Kb (100.00%)  ,88.17 Kb  ,0:0.00  ,0 b                 ,1.24 Kb             ,1.24 Kb             ,7                 ,100.00%        ,0.00 X
no_barcode_reads   ,"432,798" ,"9,232 (2.13%)" ,"423,566 (97.87%)"  ,"432,798"  ,22.41 Mb (5.21%)  ,407.29 Mb (94.79%)  ,429.70 Mb ,1:18.18 ,0 b                 ,961 b               ,961 b               ,0                 ,0.00%          ,0.00 X
unclassified_reads ,"294,399" ,0 (0.00%)       ,"294,399 (100.00%)" ,"294,399"  ,0 b (0.00%)       ,334.07 Mb (100.00%) ,334.07 Mb ,0:0.00  ,0 b                 ,1.30 Kb             ,1.30 Kb             ,0                 ,0.00%          ,0.00 X


Original WGS run:

barcode      	alias        	type        	target_unclassified 	acquisition_run_id                       	protocol_group_id                   	sample_id 	flow_cell_id 	started
barcode01    	barcode01    	test_sample 	             144944 	769dbe7dcada77ac43bd9d77ac63a7cd03d1e6d0 	sample 	no_sample 	FAY86484     	2024-05-15T12:37:41.871230+02:00
barcode02    	barcode02    	test_sample 	             170079 	769dbe7dcada77ac43bd9d77ac63a7cd03d1e6d0 	sample 	no_sample 	FAY86484     	2024-05-15T12:37:41.871230+02:00
unclassified 	unclassified 	na          	              36927 	769dbe7dcada77ac43bd9d77ac63a7cd03d1e6d0 	sample 	no_sample 	FAY86484     	2024-05-15T12:37:41.871230+02:00

best,

Loïc

Copy link

Thank you for your issue. Give us a little time to review it.

PS. You might want to check the FAQ if you haven't done so already.

This is an automated reply, generated by FAQtory

@mattloose mattloose added the question Further information is requested label Jun 20, 2024
@mattloose
Copy link
Contributor

When you run readfish on playback with a barcoded sample you can get very unexpected results. What you are seeing here is a consequence of how playback works. If you imagine a single channel on the sequencer, playback provides the signal that passed through the channel exactly as it happened on the original run. If you send a message to the sequencer to throw away the read, the the sequencer will break the playback read at that point and then start a new read from the signal that is coming from the playback file. That is not a true read and so it will therefore not begin with a barcode - you are starting half way through a read and it is correctly classified as unclassified (if that makes sense!).

I hope that helps.

@lborcard
Copy link
Author

Thank you for the answer, I was wondering how it worked. So in general a simulated run will underestimate the number of true hits? Do you usually run several simulated run to get an idea of the perfomance of AS?

@lborcard
Copy link
Author

Do you usually get a better idea of the performance by not testing with barcoded experiments?

@mattloose
Copy link
Contributor

If you wish to simulate barcoded runs (or non barcoded runs) you can use our icarust tool:

https://academic.oup.com/bioinformatics/article/40/4/btae141/7628125

It all depends on what you are trying to test with respect to performance as to whihc approach is better.

@lborcard
Copy link
Author

The idea for us is to compare A wgs run to adaptive sampling basically. So we perform a wgs run and we then run a simulated AS run for the same amount of time using the bulk file.

@mattloose
Copy link
Contributor

OK - assume you are targetting a 1mb region of a 10mb genome. When you run your normal WGS (and record a bulkfile) you obtain 100x coverage of the 10 mb genome (and therefore 100x of your 1mb region). When you run adaptive sampling on the playback of that bulkfile you will still end up with approximately 100x coverage of your 10 mb genome. The reason for this is that in playback you don't actully throw away the read - you merely break a read that you don't want into smaller bits. So to see the effect of adaptive sampling on you run on playback you would need to look at the read lengths on your 1mb target region vs your 9mb of off-target. You should see long reads "on-target" and short reads "off-target".

In contrast, Icarust will simulate reads being removed, but does so from simulated data. With icarust you will see actual enrichment - but it is more theoretical.

@lborcard
Copy link
Author

I was also wondering how the playback works when your amount of "on target" reads is limited (because of WGS), what happens when there no more reads corresponding to your ref (say your 1mb region)?

I saw the release of Icarust, I was under the impression that it was not yet finished but I will try it eventually, my initial thought was that playback runs were a more realistic approach to testing but apparently I was wrong.
thanks for the explanation it helps a lot!

@mattloose
Copy link
Contributor

I wouldn't say you are wrong... it's just that playback doesn't actually remove the molecule from the sequencer and so it's really easy to misinterpret the results. We use playback to look at the relative lengths of the molecules we get on and off target. They should be as short as possible off target and as long as they orginally were on target. You can also cehck mapping efficiency using playback. You just won't see any actual enrichment!

Hope all these comments are useful :-)

@Adoni5
Copy link
Contributor

Adoni5 commented Jun 20, 2024

image

If you look at the yield columns for playback, you can see what Matt is talking about. You can't ever enrich in playback as you can't create new reads, only chunks of the original reads.

@lborcard
Copy link
Author

Ok gotcha! thanks for the plots very informative. How does Minknow pick the next read to come in when using playback, does the initial "timeline" hold or is it randomly picked in the bag of molecules that went through this channel during the original run?

@mattloose
Copy link
Contributor

The initial timeline holds - so reads playout exactly as they were recorded. All that happens is the signal from a read gets broken up into smaller chunks.

@lborcard
Copy link
Author

So what happens to reads that are broken down, the next read will come in at the same time as it was originally sequenced? the pore just waits ?

@mattloose
Copy link
Contributor

No. Imagine you have a read of 10 kb. The sequencer plays back the 1st 1000 bases, but you then "unblock" so the sequencer ends the reads at 1kb. But the signal from the original 10kb read keeps playing, so you will get a new read starting at 1kb (plus a small bit) into the old read. And so on - until the original read finishes - and then it goes on to the next read. So, your original 10kb read could be chopped up into 10 fragments of 1 kb each.

Does that make sense?

That is why you end up with reads without a barcode when you run adaptvie sampling on a playback run.

@lborcard
Copy link
Author

I see what you mean it is much clearer now, so while the single of the unblocked read is playing another reads can be playing in the same "pore" and thus chopping it ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants