What's the format of the output MSA, fasta, a3m or a2m? #14

Open
permia opened this issue Nov 15, 2024 · 5 comments

permia commented Nov 15, 2024

I aligned ~3k RdRP domain sequences from RNA viruses. The output looks like the following:

>1
...................................................................................V--------...............................................................................-..................-PFSVWNRRFPAAQ.-QRIHEL......................................---...TYRETCGGYPE..EKDIRIKTFLKIEDTIQGIDSDMKVK................................................................-AA-RVISGMTPH-..VNVAMGPECLGI--.........---........................................................................................AKTLVKAFDGSDKICYTAGWSAE..AISEVLFGKR.....................RQT..DWS......G...DEELDQ..LHLDSIYLEQRN...........................................................................................................-...ARD...........KIF..Q.....MY........RFGLETDCS..VWDGSITIPLLEFEQWVFQSW...GHQ........................................................................................................----S......................................................................KRFVYS..IDSCECLAVLPTVSNIE..RMA................................................-QD....YQVWQRRQA..........................................-RIPCLIH..............SYMLS...-........SLKNLESLNTSLS--H---wv.................................T--------..........-.-TLPLYQMTHDYCD-CQLSQNWPTLDLGPSQFSTQAYLQKYPSVQHISGQQQMDPSLGQNP.-SE-SWPNLASHI-.QTCGVCLSMLLESL...---KACSTIPTTSLSFTKSYIGSCVMSIRKLMLQLSLKStDHTSQRDMSARRIRMHFSLTFT.......E.PHS.KSSCQDSNRPCHITNCHVSLPTICSTTASVLTCCKD.HLT..........................GHGDPDPLEAT.IHFVLLDFQNTTLPYNNVAQ...........AMPQKSKSNASK-KARSGQRKPTGK..RRIAQVKKAIIAAA--G---paAKPSYNAPTANAEPKYKPPKAASASGAAKEWVDSEDNAMAAGLAYYNMITDPFSTPARGGWVGDIEIGTQEGQDIMKISAPPTILNATYNANYKFCCVGIEAKASDPVVMMTTLGNTGAMTLSSAGNPPSWSGLVSNANLLITRCVGLKINNFNAFQQRNGRAYVLPRWGAYQNGSVVYPANVTDITYNDDTMVYDAANMPEDFLLTVKNELIAGEDLYAPTASVSLNQNTSNMAFVLFVFDGVD...................................................................................................................
>2
pnfrawvrkfpekvrr...................................................................RLEEAMAAMtghkpmadt......................................................................I..................GRVLFRPAFMKDEK.-ANTGYP......................................GGS...KISDPRLIQPG..SPELNAVWGPMFFAIAGMWKACYATH................................................................-SC-LVWAAGLTG-..DEIGDSMYQACCYT.........---........................................................................................NGFQFVENDCSRFDASVQRELLM..LRQDIYSLYF.....................DLD..VDS......P...MGMSYR..KLLKLMCDKQGI...........................................................................................................T...PHG...........IRY..K.....TV........GTVASGDGD..TSMWNFFLNSMADLFAYCTN-...PLFapdgas..................................................................................................LVTPL......................................................................VLYGTS..VLNHHDSQCQFAAVREA..WAQ................................................REA....LESKFEAGE..........................................-PEAWREH..............VATLE...H........WEAAQDWAAKTVL---PSMrlprdvlmmttsiasgsvpdsstatarrgpdsaglGLSENVTVPrehkddpsdaH.RQEWLLTAAASDDG-YVRGRA--AQSYVPTTRASLDGTHRFLVPATAASGPAFAARDVDRE.-ELDRAWAERKDA-.APVSGRLPS-SPLT...TDIPTAHLIERDAGRAHTYVTPGRWNDHAAMDSAVQPNV.RAP--GGTAGHHGAAFAHAPPS.......-.ARP.QRLDGLPSRRTLNETLEVGPSQLALYFVFLAHSEL-.EISsstrsflstwcesrnfsfadylrdlsALTPVQWLEWC.FYKSYTNGDDFSCIQGP-CN...........-PGLAHRHGVYQ-SLGFRPEFKTYQ..-EVAHTEFCSSVLM---PCY..-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------...................................................................................................................

etc.

The MSA seems to be in a2m format, so I tried to convert it from a2m to fasta format with the reformat.pl script from hhsuite. However, the resulting MSA seemed to be aligned terribly. What's the format of the output MSA: fasta, a3m or a2m?
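A rough way to tell the three flavors apart is sketched below. It assumes the usual conventions (A2M marks insert columns with lower-case residues and ".", keeping all rows the same length; A3M omits the "." characters, so rows usually differ in length). The function name `guess_msa_flavor` is hypothetical and not part of learnMSA or hhsuite, and the check is only a heuristic.

```python
def guess_msa_flavor(sequences):
    """Heuristically label aligned rows as 'fasta', 'a2m', or 'a3m'.

    Assumed conventions: A2M uses lower-case residues and '.' for insert
    columns with equal-length rows; A3M drops the '.' characters.
    """
    has_insert_marks = any(
        ch.islower() or ch == "." for seq in sequences for ch in seq
    )
    same_length = len({len(seq) for seq in sequences}) <= 1
    if not has_insert_marks:
        # Only upper-case residues and '-' gaps were seen.
        return "fasta" if same_length else "unaligned"
    return "a2m" if same_length else "a3m"


# Toy rows, not taken from the real alignment.
print(guess_msa_flavor(["..V--wv.T", "pnRLEEAMt"]))
print(guess_msa_flavor(["V--wvT", "pnRLEEAMt"]))
print(guess_msa_flavor(["AC-G", "ACTG"]))
```

An A3M file whose rows happen to have equal length would be reported as a2m, so the result is a hint rather than a verdict.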

@felbecker
Collaborator

Hi,
I wasn't aware of the a2m and a3m formats (which are technically still fasta). You assumed correctly that learnMSA outputs a2m. To convert it to ordinary fasta, make everything upper case and replace "." with "-".

I don't think anything went wrong in your conversion.

Otherwise: What exactly do you mean by "aligned terribly"? Could you send me the input, the learnMSA output, and the learnMSA version and settings (if non-default) you used?
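The conversion described in this comment (upper-case everything, replace "." with "-") can be sketched in a few lines of Python. This is a minimal illustration, not learnMSA code; the helper name `a2m_to_fasta` is made up, and it assumes header lines start with ">" and should be left untouched.

```python
def a2m_to_fasta(lines):
    """Convert A2M-style lines to plain aligned FASTA lines.

    Header lines (starting with '>') pass through unchanged. In sequence
    lines, lower-case residues are upper-cased and '.' becomes '-'.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            out.append(line)
        else:
            out.append(line.upper().replace(".", "-"))
    return out


# Toy records, not taken from the real alignment.
records = [">1", "..V--wv.T", ">2", "pnRLEEAMt"]
print("\n".join(a2m_to_fasta(records)))
```

Because every character is mapped one-to-one, the row lengths are unchanged and the columns stay exactly where they were.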

Author

permia commented Nov 15, 2024

> What exactly do you mean by "aligned terribly"?

The three motifs (A, B and C) of the RdRP domain weren't aligned, unlike in the muscle super alignment. So I thought I might have misunderstood the output format.

I installed learnMSA as described in the README and ran the following command:

learnMSA -i ./RdRP1_dedup.faa -o ./learnMSA_all.afa --sequence_weights

I have sent the input and output files to you by e-mail (beckerfelix94).

Collaborator

felbecker commented Nov 15, 2024

Update: I checked reformat.pl. Perhaps it's the character limit per line. learnMSA does not have a character limit, but reformat.pl seems to assume one (100 by default). Using the argument -l "alignment length" in reformat.pl could help.

I'll also check your files.

I will add more options to configure the learnMSA output soon and make the documentation clearer.
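Whether line wrapping is the culprit can also be checked independently of reformat.pl. The sketch below (the names `read_fasta` and `row_lengths` are hypothetical helpers) joins wrapped sequence lines per record and then reports the distinct row lengths; a consistent alignment should yield exactly one length.

```python
def read_fasta(lines):
    """Parse FASTA text into (header, sequence) pairs, joining wrapped lines."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def row_lengths(records):
    """Return the set of distinct row lengths; a valid MSA has exactly one."""
    return {len(seq) for _, seq in records}


# Toy input with each sequence wrapped over two lines.
wrapped = [">1", "AC-G", "T-A", ">2", "ACTG", "TTA"]
records = read_fasta(wrapped)
print(records)
print(row_lengths(records))
```

If the set contains more than one length after conversion, the rows were split or truncated somewhere along the way rather than misaligned by the aligner.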

Author

permia commented Nov 15, 2024

> Update: I checked reformat.pl. Perhaps it's the character limit per line. learnMSA does not have a character limit, but reformat.pl seems to assume one (100 by default). Using the argument -l "alignment length" in reformat.pl could help.
>
> I'll also check your files.
>
> I will add more options to configure the learnMSA output soon and make the documentation clearer.

Thanks for the reply. I think some sequences share low similarity with the other RdRP sequences (or have longer insertions), which made the alignment look bad. Actually, most of the sequences (about 2500 of 3265) are well aligned.

PS: I expected three motifs to be aligned: DXXXX[D/E], [S/T]G and [G/S/A]D[D/N].
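One way to check whether these motifs land in the same columns is sketched below. It translates the three patterns into regular expressions (reading "X" as "any residue"), searches each ungapped row, and maps the hits back to alignment columns. Labeling the patterns A, B and C in the order given is an assumption, and `motif_columns` is a hypothetical helper, not part of learnMSA.

```python
import re

# Regex translations of the motifs quoted above; "X" is read as any residue.
# Assigning the labels A, B, C in this order is an assumption.
MOTIFS = {
    "A": re.compile(r"D.{4}[DE]"),
    "B": re.compile(r"[ST]G"),
    "C": re.compile(r"[GSA]D[DN]"),
}


def motif_columns(aligned_row, pattern):
    """Return 0-based alignment columns where `pattern` starts.

    Gap characters ('-' and '.') are removed before matching, and
    lower-case (insert) residues are upper-cased. Only non-overlapping
    matches are reported, as with re.finditer.
    """
    columns = [i for i, ch in enumerate(aligned_row) if ch not in "-."]
    ungapped = "".join(aligned_row[i] for i in columns).upper()
    return [columns[m.start()] for m in pattern.finditer(ungapped)]


# Toy rows, not taken from the real alignment.
rows = {"seq1": "--DAAAAE-SG--GDD", "seq2": "DKKKKD--TG-a.SDN"}
for name, row in rows.items():
    hits = {label: motif_columns(row, rx) for label, rx in MOTIFS.items()}
    print(name, hits)
```

Short patterns such as [S/T]G match by chance very often, so in practice the hits would need to be restricted to the expected region of the domain before comparing columns across sequences.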

@felbecker
Collaborator

I checked your files and I agree: learnMSA found your motifs, but some sequences seem to be misaligned because of low amino acid similarity.

I was curious whether aligning with language model support (--use_language_model) would improve the alignment (it should in your case). I sent you the output. Can you confirm that this alignment looks better than the one without pLM support?

To reproduce: I aligned with version 2.0.8 (published today) using the following command:

learnMSA -i ./RdRP1_dedup.faa -o ./learnMSA2_all.e2m --sequence_weights --use_language_model

Best,
Felix
