vcf_annotation.gz file cannot be processed #51

Open
xuefenfei712 opened this issue Jan 11, 2025 · 0 comments

Hi,
Thanks for your software. I used your example data and encountered this problem: vcf_annotation.gz cannot be read and processed to produce the TwP.txt file. Could you help me figure it out? Thank you.

Run command

$ nextflow run https://github.com/cgroza/GraffiTE --reference hs37d5.chr22.fa --assemblies assemblies.csv --reads reads.csv --TE_library human_DFAM3.6.fasta

Command error:
Parse outputs...
Cleanup...

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

  filter, lag

The following objects are masked from ‘package:base’:

  intersect, setdiff, setequal, union


 *****       ***   vcfR   ***       *****
 This is vcfR 1.15.0
   browseVignettes('vcfR') # Documentation
   citation('vcfR') # Citation
 *****       *****      *****       *****

Warning message:
The x argument of as_tibble.matrix() must have unique column names if
.name_repair is omitted as of tibble 2.0.0.
ℹ Using compatibility .name_repair.
Scanning file to determine attributes.
File attributes:
meta lines: 34
header_line: 35
variant count: 259
column count: 10

Meta line 34 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
Character matrix gt rows: 259
Character matrix gt cols: 10
skip: 0
nrows: 259
row_num: 0

Processed variant: 259
All variants processed
[1] "CHROM" "POS" "qry_id" "REF"
[5] "ALT" "n_hits" "fragmts" "match_lengths"
[9] "repeat_ids" "matching_classes" "strands" "RM_id"
compute repeat proportion for each SVs...
Mammalian filters OFF, writing vcf...
The tag "~ID" is not defined in vcf_annotation.gz
Failed to read from standard input: unknown file type

Work dir:
~/TE/GraffiTE/test/GraffiTE_testset/work/3c/39e0978978f03d2ce4a63556c48160

Tip: you can replicate the issue by changing to the process work dir and entering the command bash .command.run

-- Check '.nextflow.log' file for details

Following the tip, I ran bash .command.run in the work dir ~/TE/GraffiTE/test/GraffiTE_testset/work/3c/39e0978978f03d2ce4a63556c48160:

$ bash .command.run

Mammalian filters OFF, writing vcf...
The tag "~ID" is not defined in vcf_annotation.gz
Failed to read from standard input: unknown file type

Looking further into the vcf_annotation file, I found that it has no header:

22 16147398 HG002.mat.svim_asm.INS.1 G GCCTCAGCCTCCCAAAGTGCTGGGATTATAAGCGTGAGCCACTGTGCCCAACCGATTTTTTTGTATTTTTAGTAAAGATGGGGGTTTCATCATCTTGGCTAGGCTGGTCTTGAACTCCTGATCTCGTGATCCACCCA 1 1 136 AluSg4 SINE/Alu C 209
22 16212443 HG002.mat.svim_asm.DEL.1 TTCTGTGAGATGAATGCACACATCACAAAGAAGTTTCTCAGAATGCTTCTGTCTAGTTTTTATGTGAAGATATTCCCTTTTCCACCACAGGCCTCAAAGCGCTCCAAATATCCACTCGCGGTTTCTGCAAAAAGAGTGTTTCAAAACTTCTCAATCAAAAGAAAGGTTCAAC T 0 0 0 None None None None
22 16287420 HG002.mat.svim_asm.DEL.2 GGCCCACTTTGTTCTTGCCGCTCCCCCTGCAGCAGGGGAAGCAGTGGCAGCACCACTTGCCCATCTTGCTCCTGAGTGTCTTCATAGCAGAGTCGTCGTGGTCTCCAGAAGT G 0 0 0 None None None None
22 16658445 HG002.mat.svim_asm.INS.2 A ATCATTTATTTTCTTTTTTTTCCCAGATCTTCGTGTTTTTTTTTAGATTTTTTTTTTTTTTATTTTACTTTAAGCTTTAGTGTACATGTGCACAATGTGCAGGTTAGTTACATATGTATACATGTGCCATGCTGGTGCCCTGCACCCACTAACTCGTCATCTAGCATTAGGTATATCTCCCAATGCTATCCCTCCCCCCACCCCCACCCCATAACAGTCCCCAGAGGGGGATATTCCCCTTCCTGTTTCCTTGTGATCTCATTGTTCAATTCCCACCTATGATTGAGAATATGCGGTGTTTGGTTTTTTGTTCTTGCAATAGTTTACTGAGAATGATGATTTCCAATTTCATCCATGTCCCTACAAAGGACATGAACTCATCATTTTTTATGGCTGCATAGTATTCCATGGTGTATATGTGCCACATTTTCTTAATCCAGTCTATCGTTGTTGGACATTTGGTTTGGTTCCAAGTCTGTGCTATTGTGAATAATGCCACAATAAACATACGTGTGCATGTGTCTTTATAGCAGCATGATTTATAGTCCTTTGGGTATATATCCAGTAATGGGATGGCTGGGTCAAATGGTATTTCCAGCTCTGGATCCCTGAGGAATCGCCACACTGACTTCCACAATGGTTGAACTAGTTTCCAGACCCACCAACAGTGTAAAAGTGTTCCTATTTCTCCACATCCTCTCCAGCACCTGTTGTTTCCTGACTTTTTAATAATCGCCATTCCAACTGGTGTGAGATGGTATCTCATTGTGGTATTGATTTGCATTTCTGTGATGGCCAGTGATGATGAGCATTTTTTCATGGGTTTTTTGGGTGCATAAATGCCTTCTTTTGAGAAGTGCCTGTTCATGTCCTTCGCCCACATTTTGATGGGGTTGTTTGTTTTTTTCTTGTAAATTTGTTTTAGTTCATTGTAGATTCTGGATATTAGCCCTTTGTCAGATGAGTAGGTTGTGAAAATTTTCTCCCATTTTGTGGGTTGCCTGTTCACTCTGATGGTAGTTTCTTTTGCTGTGCAGAAGCTCTTTAGTTTCATTAGATCCCATTTGTCAATTTTGTCTTTTGTTGCCATTGCTTTTGGTGTTTTAGACATGAAGTCCTTGCCCATGCCTATGTCCTGAATGGTAATGCCTAGGTTTTCTTCTAGGGTTTTTATGGTTTTAGGTCTAACATTTAAGTCTTTAATCCATCTTGAATTGATTTTTGTATAAGGTGTAAGGAAGGGATCCAGTTTCAGCTCTCTACATATGGCTAGCCAGTTTTCCCAGCACCATTTATTAAATAGGGAATCTTTTCCCCATTGCTTGTTTTTCTCAGGTTTGTCAAAGATCAGATAGTTGTAGATATGCAGCGTTATTTCTGAGGGCTCTGTTCTGTTCCATTGATCTATATCTCTGTTTTGGTACCAGTACTGTGCTGTTTTGGTTACTGTAGCCGTGTAGTATAGTTTGAAGTCAGGTACCATCATGCCTCCAGCTTTGTTCTTTTGGCTTAGGATTGACTTGGTGATGTGGGCTCTTTTTTGGTTCCATATGAACTTTAAAGTAGTTTTTTCCAATTCTGTGAAGAAAGTCATTGGTAGCTTGATGGGGATGGCATTAAATCTATAAATTACCTTGGGCAGTATGGCCATTTTCACGATATTGATTCTTCCTACCCATGAGCATGGAATGTTCTTCCATTAGTTTGTATCCTCTTTTATTTCCTTGAGCAGTGGTTTGTAGTTCTCCTTGAAGAGGCCCTTCACATCCCCTTTAAGTTGGATTCCTAGGTATTTTATTCTCTTTGAAGCAATTGTGAATGGGAGTTGACTCATGATTTGGCTCTCTGTTTGTCTGTTGTTGGTGTATAAGAATGCTTGTGATTTTTGTACATTGATTTTGTATCCTGAGACTTTGCTGAAGTTTCTTATCAGCTTAAGGAGATTTTGGGCTGAGACGATG
GGGTTTTCTAGATATACAATGATGTCGTCTGCAAACAGGGGCAATTTGACTTCCTCTTTTCCTAATTGAATACCCTTTGTTTCCTTCTCCTGCGTAATTGCCCTGGCCAGAACTTCCAACACTATGTTGAATAGGAGTGGTGATAGAGGGCATCAATGTCTTGTGCCAGTTTTCAAAGGGAATGCTTCTAGTTTTTGCCCATTCATTATGATCTTGGCTGTGGGTTTGTCATAGATAGCTCTTATTATTTTGAAATACGTCCCATCAATACCTAATTTATTGAGAGTTTTTAGCATAAAGGGTTGTTGAATTTTGTCAAGGCCTTTTCTGCATCTATTGAGATAATCATGTGGTTTTTGTCTTTGGTTCTGTTTATATGCTGGTTTACATTTATTAATTTGTGTATATTGAACCAGCCTTGCATCCCAGGGATGAAGCCCACTTGATCATGGTGGATAAGCTTTTTGATGTGCTGCTTGATTCGTTTTGCCGGTATTTTATTGAGGATTTTTGCATCAGTGTTCATCAAGGATATTGGTCTAAAATTCTCTTTTTTGGTTGTGTCTCTGGCCGGCTTTGGTATCAGAATGATGCTGGCCTCATAAAATGAGTTAGGAAGGATTCCCTCTTTTTCTATTGATTGGAATAGTTTCAGAAGGAATGGTACCAGTTCCTCCTTGTACCTCTGGTAGAATTCGGCTGTGAATCCATCTGGTCCTGGACTCTTTTTGGTTGGTAAGCTTTTGATTATTGCCACAATTTCAGATCCTGTTATTGGTCTATTCAGAGATTCAACTTCTTCCTGGTTTAGTCTTGGGAGAGGGTATGTGTCCAGGAATTTATCCCTTTCTTCTAGATTTTCTAGTTTATTTGTGTAGAGGTGATTGTAGCATTCTGTGATGGTAGTTTGTATTTCTGTGGGATCGGTGGTGATATCCCCTTTATCAGTTTTTATTGCATCTATTTGATTCTTCTCTCTTTTTTTCTTTATTAGTCTTGCTAGCGGTCTATCAATTTTGTTGATCCTTTCAAAAAACCAGCTCCTGGATTCATTAATTTTTTGAAGGGTTTTTTGTGTCTGTATTTCCTTCAGTTCTGCTCTGATTTTAGTTATTTCTTGCCTTCTGCTAGCTTTTGAATGTGTTTGCTCTTGTTTTTCTAGTTCTTTTAATTGTGATGTTAGGGTGTCAATTTTGGATCTTTCCTGCTTTCTCTTGTGGGCATTTAGTGCTATAAATTTCCCTCTACACACTGCTTTGACTGCATCCCGGAAATTCTGGTATGTTGTATCTTTGTTCTCATTGGTTTCAAAGAACATCTTTATTTCTGCCTTCATTTCGTTATGTACCCAGTAGTCATTCAGGAGCATGTTGTTCATTTTCCATGTAGTTGAGCGGTTTTGAGTGAGTTTCTTAATCCTGAGTTCTAGTTTGATTGCACTGTGGTCTGAGAGATACTTTGTTATAATTTCTGTTCTTTTACATTTGCTGAGGAGAGCTTTACTTCCAAGTATGTGGTCAATTTTGGAATAGGTGTGGTGTGGTGCTGAAAAACATGTATATTCTGTTGATTTGGGGTGGAGAGTTCTGTAGATGTCAATTAGGTCCGCTTGGTGCAGAGCTGAGTTCAATTCCTGGGTGTCCTTGTTGACTTTCTGTCTCGTTGATCTGTCTAATGTTGACAGTGGGGTGTTAAAGTCTCCCATTATTAATGTGTGGGAGTCTAAATCTCTTTGTAGGTCACTCAGGACTTGCTTTATGAATCTGGGTGCTCCTGTATTGTGTGCATATATATTTAAGATAGTTAGCTCTTCTTCTTGAATTGATCCCTTGACCATTATGTAATGGCCTTCTTTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTACCAGAGACTAGGATTGCAACCCCTGCCTTTTTTTGTTTTCCATTTGCTTGGTAGATCTTCCTCCCTCCTTTTATTTTGAGCCTATGTGTGTCTCTGCACGTGAGATGG
GTTTCCTGAATACAGCACACTGATGGGTCTTGACTCTTTGTCCAATTTGCCAGTCTGTGTCTCTTAATTGGAGCATTTAATCCATTCACATTTAAAGTTAATATTGTTATGTGTGAATTTGATCCTGTCATTATGATGTTAGCTCGTTATTTTGCTCGTTAGTTGATGTAGTTTCTTCCTAGTCTCGATGGTCTTTACATTTTGGCATGATATTGCAGTGGCTGGTACCTGTTGTTCCTTTCCATGTTTAGCGCTTCCTTCAGGAGCTCTTTTAGGGCAGGCCTGGTGGTGACAAAATCTCTCAGCATTTGCTTGTCTGTAAAGTATTTTATTTCTCCTTCACTTATGAAGCTTAGTTTGGCTGGATATGAAATTCTCTGTTGAAAATTGTTTTCTTTAAGAATATTGAATATTGGCCCCCACTGCCTTCTGACTTGTAGGGTTTCTGCCGAGAGATCCGCTGTTAGTCTGATGGGCTTCCCTTTGAGGGTAACCCGACCTTTCTCTCTGGCTGCCCTTAACATTTTTTCCTTCATTTCAACTTTGGTGAATCTGACAAGTATGTGTCTTGGAGTTGCTCTTCTCGAGGAGTGTGGTGTTCTCTGTATTTCCTGAATCTGAACGTTGGCCTGCCTTGCTAGATTGGGGAAGTTCTCCTGGATAATATCTTGCAGAGTGTTTTCCAACTTGGTTTCATTCTCCCCATCACTTTCAGGTACACCAATCAGACTTAGATTTGGTCTTTTCACATAGTCCCATATTTCTTGGAGGCTTTTCTCATTTCTTTTTATTCTTTTTTCTCTAAACTTCCCTTCTCACTTCATTTCATTCATTTCATCTTCCATCGCTGATACCCTTTCTTCCAGTTGATTGCATCGGCTCCTGAGGCTTCTGCATTCTTCACGTAGTTCTCGAGCCTTGGTTTTCAGCTTCATCAGCTCCTTTAAGCACTTCTCTGTATTCGTTATTCTAGTTTTACATTCTTCTAAATTTTTTTCAAAGTTTTCAACTTCTTTGCCTTTGGTTTGAATATCCTCCCATAGCTCGGAGTAATTTGATCATCTGAAGCCTTCTTCTCTCAGCTCGTCAAAGTCATTCTCCATCCAGCTTTGTTCCGTTGGTGGTGAGGAACTGCGTTCCTTTGGAGGAGGAGAGGTGCTCTGCTTTTTAGAGTTTCCAGTTTTTCTGTTCTCTTTTTTCCCCATCTTTGTGGTTTTATCTACTTTTGGTCTTTGATGATGGTGATGTACAAATGGGTTTTTGATGTGGATGTCCTTTCTGTTAGTTTTCCTTCTACCAGACAGGACCCTCAGCTCCAGGTCTGTTCGAATACCCTGCCGTGTGAGGTGTCAGTGTGCCCCCGCTGGGGGGTGCCTCTCAGTTAGGCTGCTCGGGTGTCAGAGGTCAGGGACCCACTTGAGGAGGCAGTCTGCCCGTTCTCAGATCTCCAGCTGCATTCTGGGAGAACCACTGCTCTCTTCAAAGCTGTCAGACAGGGACATTTAAGTCTGCAGAGGTTACTGCTGTCTTTTTGTTTGTCTGTGCCCTGCCCCCAGAGGTGGAGCCTACAGAGGCAGGCAGGCCTCTTTGAGCTGTGATGGGTTCCACCCAGTTTGAGATTCTCAGCTGCTTTGTTTACCTAAGCAATCCTGGGCAATGGCGGGCGACGGTCCCACAACCTCGCTGCTGCCTTGCAGTTTAAACTCAGACTGTTGTGCTAGCAATCAGCGAGACTCCGTGGGTGTAGGACCCTCCAAGCCAGGTGCGGGATATAATCTCGTGGTGCACCGTTTTTTAAGCCGGTCGGAAAAGCGCAGTAATCGGGTGGGAGTGACCGGATTTTCCAGATGCCGTCTGTCACTCTTTCTTTGACTCGGAAAGGGAACTCCCTTACCCCTTGCGCTTCCCAAGTGAGGCAATGCGTCGCCCTGCTTCAGCTCGCGTATGGTGAGCACACCCACTGACCTGCACCCACTGTCTGGCACTCCCTAGTGAGATGA
ACCTGGTACCTCAGATGGAAATGCAGAAATCACCCGACTTCTGCGTCGCTGATGCTGGGAGCTGTAGACCGGAGCTGTTCCTATTCGGCCATCTTGGCTCCTCCGCCTGAATATTAT 1 1 6020 L1PA2 LINE/L1 C 301
22 16676838 HG002.mat.svim_asm.INS.3 T TAAGAAACTCACTCAAGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAAGAGATCGAGACCATCCCGGCTAAAACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAAATTAGCCGGGCGTAGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 1 1 312 AluYa5 SINE/Alu + 329
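
Since the file has no header, one way to sanity-check it is to confirm that every record has the 12 fields listed in the column names printed in the log above. The following is a minimal sketch, assuming the real file is tab-delimited; check_columns is a hypothetical helper name and the demo rows are made up, not taken from the real data.

```shell
#!/usr/bin/env bash
# Report lines whose field count differs from the expected number of columns.
# Assumption: the annotation table is tab-delimited with 12 columns.
check_columns() {
    file=$1
    expected=${2:-12}
    awk -F'\t' -v n="$expected" \
        'NF != n { printf "line %d: %d fields\n", NR, NF }' "$file"
}

# Toy demo with made-up rows (not the real data)
demo=$(mktemp)
printf '22\t100\tsv1\tG\tGA\t1\t1\t136\tAluSg4\tSINE/Alu\tC\t209\n' > "$demo"
printf '22\t200\tsv2\tA\tATC\n' >> "$demo"    # truncated record
check_columns "$demo"                         # prints: line 2: 5 fields
rm -f "$demo"
```

If this prints nothing for vcf_annotation, every record has the expected number of fields and the long sequences are only wrapped by the display.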

vcf_annotation.gz still doesn't open properly after replacing the header:

$ bcftools view vcf_annotation.gz.bac

Failed to read from vcf_annotation.gz.bac: unknown file type
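
One thing that may be worth ruling out for an "unknown file type" message is whether the file is actually compressed despite its .gz name. The sketch below only looks at the first two bytes using standard tools; is_gzip is a hypothetical helper name, and the check cannot tell bgzip output apart from plain gzip, since both start with the same magic bytes (1f 8b).

```shell
#!/usr/bin/env bash
# Print "gzip" if the file starts with the gzip/BGZF magic bytes (1f 8b),
# otherwise print "not-gzip".
is_gzip() {
    magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    if [ "$magic" = "1f8b" ]; then
        echo gzip
    else
        echo not-gzip
    fi
}

# Toy demo on temporary files (not the real data)
demo=$(mktemp)
printf 'plain text\n' > "$demo"
is_gzip "$demo"                  # plain text file
gzip -c "$demo" > "$demo.gz"
is_gzip "$demo.gz"               # gzip-compressed copy
rm -f "$demo" "$demo.gz"
```

Running is_gzip vcf_annotation.gz.bac in the work dir would show whether the file is compressed at all; a "not-gzip" result would point at the compression step rather than at the table content.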

The TwP.txt file generated by this step in the repmask_vcf.sh script is empty (size zero):

# If --mammal is set, search for L1 5' inversion (Twin Priming and similar) and check whether SVA hits are within VNTR only (non-retrotransposition polymorphism)

if [[ ${MAM} == "MAM" ]]
then
echo "Mammalian filters ON. Filtering..."
# FILTER 1: L1 Twin Priming and similar
rm TwP.txt &> /dev/null
# awk 1: find SVs IDs only matched by TwP, sort by SV ID
# join: join SV ID with its matched TEs from the OneCode output (sort by SV ID and reverse sort by strand to keep the order of "C,+")
# awk 2: write SV ID and coordinate of C L1 (odd line) and + L1 (even line) to one line
# awk 3: compare coordinate and annotate 5P_INV
awk '{
    if ($6 == 2 && $11 == "C,+" && $10 ~ /LINE/) {
        names = split($10, n, ",", seps);
        ids = split($12, i, ",", seps);

        if (n[1] == n[2] && i[1] == i[2]) {
            print $3;
        }
    }
}' vcf_annotation | sort | \
join -11 -21 - <(awk '{print $5"\t"$9"\t"$12"\t"$13"\t"$14}' repeatmasker_dir/indels.fa.onecode.out | sort -k1,1 -k2,2r) | \
awk 'NR % 2 == 1 {odd_line = $0; getline; print odd_line"\t"$3"\t"$4"\t"$5}' | \
awk '{if ($4<$7) {print $1"\t5P_INV:plus"} else {print $1"\t5P_INV:minus"} }' > TwP.txt
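
To see whether an empty TwP.txt is simply the result of no record passing the first filter, the awk 1 condition can be run on its own. This is a minimal sketch, not the script's code: twp_candidates is a hypothetical helper name, the demo rows are made up, and it uses the portable 3-argument split() in place of the 4-argument gawk form above.

```shell
#!/usr/bin/env bash
# List SV IDs that pass the first filter: exactly two hits, strands "C,+",
# a LINE class, and identical class and RM_id for both hits.
# Assumption: whitespace-delimited 12-column table as in vcf_annotation.
twp_candidates() {
    awk '$6 == 2 && $11 == "C,+" && $10 ~ /LINE/ {
        split($10, n, ",")
        split($12, i, ",")
        if (n[1] == n[2] && i[1] == i[2]) print $3
    }' "$1" | sort
}

# Toy demo with made-up rows (not the real data)
demo=$(mktemp)
printf '22\t100\tsv1\tA\tAT\t2\t2\t500,900\tL1HS,L1HS\tLINE/L1,LINE/L1\tC,+\t7,7\n' > "$demo"
printf '22\t200\tsv2\tA\tAT\t1\t1\t300\tAluY\tSINE/Alu\t+\t9\n' >> "$demo"
twp_candidates "$demo"    # prints: sv1
rm -f "$demo"
```

If twp_candidates vcf_annotation prints nothing, the join receives no SV IDs and an empty TwP.txt would be the expected outcome of this step.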