chenkenbio/SpliceBERT-analysis

SpliceBERT-analysis

Additional analysis on SpliceBERT. The original repository is available at SpliceBERT.

Benchmark

On SpliceAI's GTEx dataset

We fine-tuned SpliceBERT on SpliceAI's GTEx dataset with R-Drop regularization five times, using a different random seed each time (model weights: Google Drive). The average AP of SpliceBERT (900 nt) is comparable to (donor) or slightly higher than (acceptor) that of SpliceAI-10k. The SpliceBERT ensemble (averaging the predictions of the 5 models), however, underperforms the SpliceAI-10k ensemble, likely because the SpliceBERT models were all fine-tuned from the same pre-trained model and thus lack sufficient diversity.

The source code is available in benchmark_spliceai-gtex.
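
For reference, R-Drop as formulated in the original R-Drop paper passes each input through the model twice with independent dropout masks and penalizes disagreement between the two output distributions. The notation below ($P_1$, $P_2$, $\alpha$) follows that general formulation and is not taken from this repository; the value of $\alpha$ used in these experiments is not stated here.

$$
\mathcal{L}_i = -\log P_1(y_i \mid x_i) - \log P_2(y_i \mid x_i) + \frac{\alpha}{2}\left[ D_{\mathrm{KL}}\big(P_1(\cdot \mid x_i) \,\|\, P_2(\cdot \mid x_i)\big) + D_{\mathrm{KL}}\big(P_2(\cdot \mid x_i) \,\|\, P_1(\cdot \mid x_i)\big) \right]
$$

The first two terms are the usual classification loss on each pass, and the bracketed term is the symmetric KL divergence between the two passes.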

| model | receptive field size (nt) | AP (donor) | AP (acceptor) |
| --- | --- | --- | --- |
| SpliceBERT | 900 | 0.8547 $\pm$ 0.0012 | 0.8458 $\pm$ 0.0009 |
| SpliceAI-10k | 10001 | 0.8547 $\pm$ 0.0027 | 0.8434 $\pm$ 0.0023 |
| SpliceAI-2k | 2001 | 0.8369 $\pm$ 0.0015 | 0.8270 $\pm$ 0.0017 |
| SpliceAI-400 | 401 | 0.7961 $\pm$ 0.0020 | 0.7873 $\pm$ 0.0026 |
| SpliceAI-80 | 81 | 0.5216 $\pm$ 0.0022 | 0.4449 $\pm$ 0.0020 |

| model (ensemble) | receptive field size (nt) | AP (donor) | AP (acceptor) |
| --- | --- | --- | --- |
| SpliceAI-10k (ensemble) | 10001 | 0.8735 | 0.8644 |
| SpliceBERT (ensemble) | 900 | 0.8608 | 0.8524 |
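
The two kinds of scores reported above (a per-model average across seeds and a single ensemble score) can be illustrated with a minimal, dependency-free sketch. It assumes that AP denotes average precision and that the $\pm$ values are a standard deviation across seeds; neither is stated explicitly here. The function names and the toy data are hypothetical and are not taken from benchmark_spliceai-gtex. Tied scores are ranked in input order, which may differ from library implementations.

```python
from statistics import mean, stdev


def average_precision(labels, scores):
    """Average precision: mean of precision@rank over the ranks of the positives."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive labels")
    # Rank sites from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


def ensemble_scores(per_model_scores):
    """Average the predictions of several models position by position."""
    n_models = len(per_model_scores)
    return [sum(column) / n_models for column in zip(*per_model_scores)]


# Toy data (made up for illustration): 1 = true splice site, 0 = background.
labels = [1, 0, 1, 0, 0, 1]
toy_scores = [
    [0.9, 0.2, 0.6, 0.7, 0.1, 0.8],  # hypothetical model, seed 1
    [0.8, 0.3, 0.4, 0.5, 0.2, 0.9],  # hypothetical model, seed 2
    [0.7, 0.6, 0.8, 0.3, 0.1, 0.6],  # hypothetical model, seed 3
]

# Per-model AP summarized as mean and sample standard deviation.
per_model_ap = [average_precision(labels, s) for s in toy_scores]
print(f"per-model AP: {mean(per_model_ap):.4f} +/- {stdev(per_model_ap):.4f}")

# Ensemble AP: score the averaged predictions once.
print(f"ensemble AP: {average_precision(labels, ensemble_scores(toy_scores)):.4f}")
```

Averaging helps most when the individual models make different mistakes, which is consistent with the limited gain observed for models fine-tuned from one shared pre-trained checkpoint.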

On DeepSTARR's dataset

Though SpliceBERT was pre-trained on primary RNA sequences, it can also be applied to DNA sequences. We fine-tuned SpliceBERT on DeepSTARR's dataset (https://zenodo.org/records/5502060) to identify sequences with potential enhancer activity. SpliceBERT outperformed both DeepSTARR (a convolutional model) and Nucleotide Transformer (a DNA language model). The results are available at benchmark_deepstarr.

| model | Developmental | Housekeeping |
| --- | --- | --- |
| SpliceBERT | 0.70 | 0.78 |
| DeepSTARR | 0.68 | 0.74 |
| Nucleotide Transformer (multi-species) | 0.64 | 0.75 |

Figure: SpliceBERT on DeepSTARR (20% of points shown)
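
Feeding DNA to a model pre-trained on RNA mainly requires putting both into one alphabet. The sketch below assumes the input convention shown in the original SpliceBERT repository's usage example: upper-case nucleotides separated by spaces, with U written as T. The function name `to_splicebert_input` is hypothetical, and the exact preprocessing used in benchmark_deepstarr should be checked against the code there.

```python
def to_splicebert_input(seq: str) -> str:
    """Normalize a DNA or RNA string into space-separated nucleotide tokens.

    Assumed convention (not confirmed by this README): alphabet A/C/G/T/N,
    upper case, U replaced by T, one whitespace-separated token per base.
    """
    seq = seq.strip().upper().replace("U", "T")
    unexpected = set(seq) - set("ACGTN")
    if unexpected:
        raise ValueError(f"unexpected characters in sequence: {sorted(unexpected)}")
    return " ".join(seq)


# A DNA enhancer fragment and an RNA fragment map to the same token format.
print(to_splicebert_input("acgtnACGT"))
print(to_splicebert_input("ACGUACGU"))
```

Because DNA sequences already use T, only case normalization and tokenization apply to them under this assumption.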

Contact

For any questions, contact chenkenbio_[at]_gmail.com

Citation

@article{chen2024self_bbae163,
  title={Self-supervised learning on millions of primary RNA sequences from 72 vertebrates improves sequence-based RNA splicing prediction},
  author={Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},
  journal={Briefings in Bioinformatics},
  volume={25},
  number={3},
  pages={bbae163},
  year={2024},
  publisher={Oxford University Press}
}
