Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Questions About MPRAflow - Low CRS-BC Associations and Sequencing Details #85

Open
BioSenior opened this issue Dec 14, 2024 · 2 comments

Comments

@BioSenior
Copy link

I want to sincerely thank you for developing the lentiMPRA protocol and the MPRAflow tool. This is truly an outstanding piece of work, and it has been incredibly helpful to the field of high-throughput functional genomics. I deeply appreciate the effort and expertise that went into creating such a comprehensive framework.

Recently, I followed your experimental protocol for performing lentiMPRA, and I am now using MPRAflow for analysis. However, I have encountered some issues in the Association step that I would like to ask for your advice on. Specifically, I am observing a surprisingly low number of CRS-BC (candidate regulatory sequence - barcode) associations.

Here are the details:
• In my experimental design, each CRS was assigned 185 barcodes. I have a total of 7,530 CRS, and the sequencing depth I used was 19 million reads (>185 * 7,530 * 10).
• Despite this, the number of associated CRS and barcodes is very low. Even before filtering, the median number of barcodes per CRS is far below 20, and after filtering, the associations drop even further.
• Some of my CRS sequences only differ by a single nucleotide, so I set the mapq parameter to 0 in an attempt to increase associations. However, this setting seemed to have no impact, as the output remained the same.

I am quite puzzled by these results and was wondering if you could help me identify the potential cause. Could this issue arise from my parameter settings, or is it more likely to be a problem with my experimental design or sequencing?

Additionally, I have one more question about the Association step:
•In the sequencing files for this step, the R1.fq.gz file contains the first half of each CRS sequence, while R2.fq.gz contains the second half. How does the program associate barcodes with CRS sequences in this case? I would have expected that at least one read would contain both the barcode and the CRS for the association to be made.

Thank you so much for your time and assistance. I greatly appreciate your help, and I look forward to hearing your insights.

@visze
Copy link
Collaborator

visze commented Dec 17, 2024

MPRAflow has known issuews when edit distances between oligos are small. The mapper (bwa) seems to have issues assigning the reads to the "BEST" oligo and sees both oligos (e.g. reference and alternative when you have a variant) as best hits. Therefore discards it.

A replacement would be MPRAsnakeflow using the bbmap strategy which is optimized for variants (or short edit distances).

Your assignment reads have to overlap! We are using the tool fastq-join here. You can run it by yourself to see you many reads you are able to merge. Here is the line of code:

fastq-join $fastq_insert $fastq_insertPE -o ${fastq_insert}_merged.fastq

Your low number of BCs can be due to the low number of merged reads. There are two solutions.

  1. You can merge them by your own (concatenating) and then run in single end mode
  2. You add e.g. 20 bp from the end of each read to the end of the other read and run it in PE mode.

But you are also saying you have variants. So if the variant is in the center then this approach will totally fail because you will generate errors in the center of the oligos. When having variants in the center you really need overlapping PE sequences.

@BioSenior
Copy link
Author

Thank you so much for taking the time to provide such detailed feedback and for introducing me to MPRAsnakeflow. I would like to update you on my recent progress. Initially, my MPRAflow program encountered an error:

barcodes:   1%|          | 35981/4769119.0 [00:00<00:13, 345425.24it/s]/home/fxu23/MPRAflow
Traceback (most recent call last):
  File "/home/fxu23/MPRAflow/src/nf_ori_map_barcodes.py", line 154, in <module>
    coords_to_barcodes = get_coords_to_barcodes(fastq_in, n_fastq, bamfile, n_bam, mapq, baseq, cigar)
  File "/home/fxu23/MPRAflow/src/nf_ori_map_barcodes.py", line 117, in get_coords_to_barcodes
    min_base_quality = min(map(ord, barcode.quality))
TypeError: 'NoneType' object is not iterable

Based on the error message, I suspected that some of my sequences might be missing quality information. However, after checking all the sequences, I found that they all had quality scores. So, initially, I simply modified the MPRAflow/src/nf_ori_map_barcodes.py file at line 117:

min_base_quality = min(map(ord, barcode.quality))

I added a condition to handle missing quality scores:

if barcode.quality is not None:
    min_base_quality = min(map(ord, barcode.quality))
else:
    min_base_quality = 31

After this change, the code ran normally and completed successfully. However, the result was the one I sent you via email. Then, when I ran the bash .command.sh in the map_element_barcodes working directory, the results were completely different. All barcodes were read and recognized correctly, and the results were as expected. But when I ran bash .command.run, the same error occurred! This is very strange. I tried to investigate the issue, but due to my unfamiliarity with Nextflow, I couldn’t identify the cause.

Regardless, I just wanted to share the issues I’ve encountered and hope that this feedback might help improve MPRAflow.

Thanks again for your enthusiastic help and guidance, as well as for developing the amazing tool!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants