Summary of medical NLP evaluations/competitions, datasets, papers and pre-trained models.
Since Cris Lee left the medical NLP field in 2021, this repo has been maintained by Xidong Wang, Ziyue Lin, and Jing Tang.
- 1. Evaluation
- 2. Competitions
- 3. Datasets
- 4. Open-source Models
- 5. Relevant Papers
- 6. Open-source Toolkits
- 7. Industrial Solutions
- 8. Blog Sharing
- 9. Friendly Links
-
CMB
- GitHub Link: https://github.com/FreedomIntelligence/CMB
- Source: Various clinical medical examinations; complex clinical case consultations
-
CMExam
- GitHub Link: https://github.com/williamliujl/CMExam
- Source: Past questions from the Medical Practitioner Qualification Examination
-
PromptCBLUE
- GitHub Link: https://github.com/michael-wzhu/PromptCBLUE
- Source: CBLUE
-
CBLUE
- GitHub Link: https://github.com/CBLUEbenchmark/CBLUE
- Source: Academic evaluation competitions from past CHIP conferences and datasets from Alibaba Quark's medical search service
-
MultiMedBench
- Description: A large-scale multimodal generative model
- None at the moment. Additions are welcome~
-
BioNLP Workshop 2023 Shared Task
- Link: https://aclweb.org/aclwiki/BioNLP_Workshop#SHARED_TASKS_2023
- Source: BioNLP Workshop
-
MedVidQA 2023
- Link: https://medvidqa.github.io/index.html
- Source: National Institutes of Health
-
MEDIQA-2021
- Link: https://sites.google.com/view/mediqa2021
- Source: NAACL-BioNLP 2021 workshop
-
ICLR-2021 International Competition for Medical Dialogue Generation and Automatic Diagnosis
- Source: ICLR 2021 workshop
-
NLP for Medical Imaging - Medical Imaging Diagnostic Report Generation
- Link: https://gaiic.caai.cn/ai2023/
- Source: NLP for Medical Imaging - Medical Imaging Diagnostic Report Generation
-
NLP for Medical Imaging - Medical Imaging Diagnostic Report Generation
- Link: http://challenge.xfyun.cn/topic/info?type=disease-claims-2022&ch=ds22-dw-sq03
- Source: iFlytek
-
Evaluation Task of the 8th China Health Information Processing Conference (CHIP2022)
- Link: http://cips-chip.org.cn/
- Source: CHIP2022
-
iFlytek - Medical Entity and Relationship Recognition Challenge
- Link: http://www.fudan-disc.com/sharedtask/imcs21/index.html
- Source: iFlytek
-
Huatuo-26M
- Link: https://github.com/FreedomIntelligence/Huatuo-26M
- Description: Huatuo-26M is the largest Chinese medical Q&A dataset to date.
-
Chinese Medical Dialogue Dataset
- Link: https://github.com/FreedomIntelligence/Huatuo-26M
- Description: Medical Q&A data from six departments
-
CBLUE
- Link: https://github.com/CBLUEbenchmark/CBLUE
- Description: Covers medical text information extraction (entity recognition, relation extraction)
-
cMedQA2 (108K)
- Link: https://github.com/zhangsheng93/cMedQA2
- Description: Chinese medical Q&A dataset with over 100,000 entries.
-
xywy-KG (294K triples)
- Link: https://github.com/baiyang2464/chatbot-base-on-Knowledge-Graph
- Description: 44.1K entities, 294.1K triples
-
39Health-KG (210K triples)
- Link: https://github.com/zhihao-chen/QASystemOnMedicalGraph
- Description: Includes 15 pieces of information, with 7 types of entities, about 37,000 entities, and 210,000 entity relationships.
-
Medical-Dialogue-System
- Link: https://github.com/UCSD-AI4H/Medical-Dialogue-System
- Description: The MedDialog dataset (Chinese) contains conversations between doctors and patients, with 1.1 million dialogues and 4 million utterances. The data is continuously growing and more dialogues will be added.
-
Chinese medical dialogue data
- Link: https://github.com/Toyhom/Chinese-medical-dialogue-data
- Description: The dataset contains a total of 792,099 records from six departments: Andrology, Pediatrics, Obstetrics and Gynecology, Internal Medicine, Surgery, and Oncology.
-
Yidu-S4K
- Link: http://openkg.cn/dataset/yidu-s4k
- Description: Named Entity Recognition, Entity and Attribute Extraction
-
Yidu-N7K
- Link: http://openkg.cn/dataset/yidu-n7k
- Description: Clinical Language Standardization
-
Chinese Medical Q&A Dataset
- Link: https://github.com/zhangsheng93/cMedQA2
- Description: Medical Q&A
-
Chinese Doctor-Patient Dialogue Data
- Link: https://github.com/UCSD-AI4H/Medical-Dialogue-System
- Description: Medical Q&A
-
CPubMed-KG (4.4M triples)
- Link: https://cpubmed.openi.org.cn/graph/wiki
- Description: Full-text journal data of high quality from the Chinese Medical Association
-
Chinese Medical Knowledge Graph CMeKG (1M triples)
- Link: http://cmekg.pcl.ac.cn/
- Description: CMeKG (Chinese Medical Knowledge Graph)
-
CHIP Annual Evaluation (Official Evaluation)
- Link: http://cips-chip.org.cn/2022/callforeval ; http://www.cips-chip.org.cn/2021/ ; http://cips-chip.org.cn/2020/
- Description: CHIP Annual Evaluation (Official Evaluation)
-
Ruijin Hospital Diabetes Dataset (Diabetes)
- Link: https://tianchi.aliyun.com/competition/entrance/231687/information
- Description: Diabetes literature mining and knowledge graph construction using diabetes-related textbooks and research papers
-
Tianchi Novel Coronavirus Pneumonia Question Matching Competition (Novel Coronavirus)
- Link: https://tianchi.aliyun.com/competition/entrance/231776/information
- Description: The competition data includes anonymized medical question pairs and their annotations.
-
MedMentions
- Link: https://github.com/chanzuckerberg/MedMentions
- Description: Biomedical entity linking dataset based on PubMed abstracts
-
webMedQA
- Link: https://github.com/hejunqing/webMedQA
- Description: Medical Q&A
-
COMETA
- Link: https://www.siphs.org/
- Description: Medical entity linking data in social media. Published at EMNLP 2020
-
PubMedQA
- Link: https://arxiv.org/abs/1909.06146
- Description: Biomedical research question answering dataset built from PubMed abstracts, with yes/no/maybe answers. Published at EMNLP 2019
-
MediQA
- Link: https://sites.google.com/view/mediqa2021
- Description: Text summarization
-
ChatDoctor Dataset-1
- Link: https://drive.google.com/file/d/1lyfqIwlLSClhgrCutWuEe_IACNq6XNUt/view?usp=sharing
- Description: 100k real conversations between patients and doctors from HealthCareMagic.com
-
ChatDoctor Dataset-2
- Link: https://drive.google.com/file/d/1ZKbqgYqWc7DJHs3N9TQYQVPdDQmZaClA/view?usp=sharing
- Description: 10k real conversations between patients and doctors from icliniq.com
-
Visual Med-Alpaca Data
- Link: https://github.com/cambridgeltl/visual-med-alpaca/tree/main/data
- Description: Data used for Visual Med-Alpaca training, produced from BigBio, ROCO, and GPT-3.5-Turbo
-
CheXpert Plus
- Link: https://github.com/Stanford-AIMI/chexpert-plus
- Description: The largest publicly available text dataset in radiology, with 36 million text tokens and reports paired with high-quality images in DICOM format. The dataset also includes extensive image and patient metadata covering various clinical and social groups, as well as numerous pathology labels and RadGraph annotations.
- BioBERT:
- Website: https://github.com/naver/biobert-pretrained
- Introduction: A language representation model for biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc.
- BlueBERT:
- Website: https://github.com/ncbi-nlp/BLUE_Benchmark
- Introduction: The BLUE benchmark consists of five different biomedical text-mining tasks with ten corpora. It relies on preexisting datasets because they have been widely used by the BioNLP community as shared tasks. These tasks cover a diverse range of text genres (biomedical literature and clinical notes), dataset sizes, and degrees of difficulty and, more importantly, highlight common biomedical text-mining challenges.
- BioFLAIR:
- Website: https://github.com/flairNLP/flair
- Introduction: Flair is a powerful NLP library that lets you apply state-of-the-art natural language processing (NLP) models to your text, such as named entity recognition (NER), sentiment analysis, part-of-speech (PoS) tagging, sense disambiguation, and classification, with special support for biomedical data and a rapidly growing number of languages. Flair is also a text embedding library and a PyTorch NLP framework.
- COVID-Twitter-BERT:
- Website: https://github.com/digitalepidemiologylab/covid-twitter-bert
- Introduction: COVID-Twitter-BERT (CT-BERT) is a transformer-based model pretrained on a large corpus of Twitter messages on the topic of COVID-19. The v2 model is trained on 97M tweets (1.2B training examples).
- bio-lm (Biomedical and Clinical Language Models):
- Website: https://github.com/facebookresearch/bio-lm
- Introduction: This work evaluates many models used for biomedical and clinical NLP tasks, and trains new models that perform much better.
- BioALBERT:
- Website: https://github.com/usmaann/BioALBERT
- Introduction: A biomedical language representation model trained on large domain-specific (biomedical) corpora and designed for biomedical text mining tasks.
- BenTsao:
- Website: https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese
- Introduction: BenTsao is based on LLaMA-7B and has been instruction-tuned on Chinese medical instructions. Researchers built a Chinese medical instruction dataset using a medical knowledge graph and the GPT-3.5 API, and used this dataset to instruction-tune LLaMA, thereby improving its question-answering capabilities in the medical field.
- BianQue:
- Website: https://github.com/scutcyr/BianQue
- Introduction: A large medical conversation model fine-tuned through joint training with instructions and multi-turn inquiry dialogues. It is based on ClueAI/ChatYuan-large-v2 and fine-tuned using a blended dataset of Chinese medical question and answer instructions as well as multi-turn inquiry dialogues.
- SoulChat:
- Website: https://github.com/scutcyr/SoulChat
- Introduction: SoulChat is initialized with ChatGLM-6B and instruction-tuned on a large-scale dataset of Chinese long-form instructions and multi-turn empathetic conversations in the field of psychological counseling. The tuning aims to enhance the model's empathy, its ability to guide users in expressing themselves, and its capacity to provide thoughtful advice.
- DoctorGLM:
- Website: https://github.com/xionghonglin/DoctorGLM
- Introduction: A large Chinese diagnostic model based on ChatGLM-6B. This model has been fine-tuned on a Chinese medical conversation dataset using techniques such as LoRA and P-Tuning v2, and has been deployed for practical use.
- HuatuoGPT:
- Website: https://github.com/FreedomIntelligence/HuatuoGPT
- Introduction: HuatuoGPT is a GPT-like model fine-tuned on Chinese medical instructions. It is a Chinese large language model (LLM) designed specifically for medical consultation. Its training data includes distilled data from ChatGPT and real data from doctors. Reinforcement learning from human feedback (RLHF) was incorporated during training to improve its performance.
- HuatuoGPT-II:
- Website: https://github.com/FreedomIntelligence/HuatuoGPT-II
- Introduction: HuatuoGPT-II employs an innovative domain adaptation method to significantly boost its medical knowledge and dialogue proficiency. It shows state-of-the-art performance on several medical benchmarks, notably surpassing GPT-4 in expert evaluations and on recent medical licensing exams.
- GatorTron:
- Website: https://github.com/uf-hobi-informatics-lab/GatorTron
- Introduction: An early LLM developed for the healthcare domain. It aims to investigate how systems that use unstructured EHRs can benefit from clinical LLMs with billions of parameters.
- Codex-Med:
- Website: https://github.com/vlievin/medical-reasoning
- Introduction: Codex-Med aimed to investigate the effectiveness of GPT-3.5 models. Two multiple-choice medical exam question datasets, namely USMLE and MedMCQA, as well as a medical reading comprehension dataset called PubMedQA were utilized.
- Galactica:
- Website: https://galactica.org/
- Introduction: Aiming to solve the problem of information overload in the scientific field, Galactica was proposed to store, combine, and reason about scientific knowledge, including healthcare. Galactica was trained on a large corpus of papers, reference material, and knowledge bases to potentially discover hidden connections between different research and bring insights to the surface.
- DeID-GPT:
- Website: https://github.com/yhydhx/ChatGPT-API
- Introduction: A novel GPT-4-enabled de-identification framework that automatically identifies and removes identifying information.
- ChatDoctor:
- Website: https://github.com/Kent0n-Li/ChatDoctor
- Introduction: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge.
- MedAlpaca:
- Website: https://github.com/kbressem/medAlpaca
- Introduction: MedAlpaca adopts an open-source policy that enables on-site deployment, aiming to mitigate privacy concerns. MedAlpaca is built upon LLaMA with 7 and 13 billion parameters.
- PMC-LLaMA:
- Website: https://github.com/chaoyi-wu/PMC-LLaMA
- Introduction: PMC-LLaMA is an open-source language model obtained by further tuning LLaMA-7B on a total of 4.8 million biomedical academic papers to inject medical knowledge and enhance its capability in the medical domain.
- Visual Med-Alpaca:
- Website: https://github.com/cambridgeltl/visual-med-alpaca
- Introduction: Visual Med-Alpaca is an open-source, parameter-efficient biomedical foundation model that can be integrated with medical "visual experts" for multimodal biomedical tasks. Built upon the LLaMA-7B architecture, this model is trained using an instruction set curated collaboratively by GPT-3.5-Turbo and human experts.
- GatorTronGPT:
- Website: https://github.com/uf-hobi-informatics-lab/GatorTronGPT
- Introduction: GatorTronGPT is a clinical generative LLM designed with a GPT-3 architecture comprising 5 or 20 billion parameters. It utilizes a vast corpus of 277 billion words, consisting of a combination of clinical and English text.
- MedAGI:
- Website: https://github.com/JoshuaChou2018/MedAGI
- Introduction: MedAGI is a paradigm, rather than an LLM itself, for unifying domain-specific medical LLMs at the lowest cost, and a possible path toward medical AGI.
- LLaVA-Med:
- Website: https://github.com/microsoft/LLaVA-Med
- Introduction: LLaVA-Med was initialized with the general-domain LLaVA and then continuously trained in a curriculum learning fashion (first biomedical concept alignment then full-blown instruction-tuning).
- Med-Flamingo:
- Website: https://github.com/snap-stanford/med-flamingo
- Introduction: Med-Flamingo is a vision-language model specifically designed to handle interleaved multimodal data comprising both images and text. Building on the achievements of Flamingo, Med-Flamingo further enhances these capabilities for the medical domain by pre-training on diverse multimodal knowledge sources across various medical disciplines.
- Large Language Models Encode Clinical Knowledge. Paper Link
- Performance of ChatGPT on USMLE: Potential for AI-Assisted Medical Education Using Large Language Models. Paper Link
- Putting ChatGPT's Medical Advice to the (Turing) Test. Paper Link
- Toolformer: Language Models Can Teach Themselves to Use Tools. Paper Link
- Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback. Paper Link
- Capabilities of GPT-4 on Medical Challenge Problems. Paper Link
- Pre-trained Language Models in Biomedical Domain: A Systematic Survey. Paper Link
- A Guide to Deep Learning in Healthcare. Paper Link. Published in Nature Medicine.
- A Survey of Large Language Models for Healthcare. Paper Link
Articles Related to Electronic Health Records
- Transfer Learning from Medical Literature for Section Prediction in Electronic Health Records. Paper Link
- MUFASA: Multimodal Fusion Architecture Search for Electronic Health Records. Paper Link
Medical Relation Extraction
- Leveraging Dependency Forest for Neural Medical Relation Extraction. Paper Link
Medical Knowledge Graph
- Learning a Health Knowledge Graph from Electronic Medical Records. Paper Link
Auxiliary Diagnosis
- Evaluation and Accurate Diagnoses of Pediatric Diseases Using Artificial Intelligence. Paper Link
Medical Entity Linking (Normalization)
- Medical Entity Linking Using Triplet Network. Paper Link
- A Generate-and-Rank Framework with Semantic Type Regularization for Biomedical Concept Normalization. Paper Link
- Deep Neural Models for Medical Concept Normalization in User-Generated Texts. Paper Link
List of Medical-Related Papers from ACL 2020
- A Generate-and-Rank Framework with Semantic Type Regularization for Biomedical Concept Normalization. Paper Link
- Biomedical Entity Representations with Synonym Marginalization. Paper Link
- Document Translation vs. Query Translation for Cross-Lingual Information Retrieval in the Medical Domain. Paper Link
- MIE: A Medical Information Extractor towards Medical Dialogues. Paper Link
- Rationalizing Medical Relation Prediction from Corpus-level Statistics. Paper Link
List of Medical NLP Related Papers from AAAI 2020
- On the Generation of Medical Question-Answer Pairs. Paper Link
- LATTE: Latent Type Modeling for Biomedical Entity Linking. Paper Link
- Learning Conceptual-Contextual Embeddings for Medical Text. Paper Link
- Understanding Medical Conversations with Scattered Keyword Attention and Weak Supervision from Responses. Paper Link
- Simultaneously Linking Entities and Extracting Relations from Biomedical Text without Mention-level Supervision. Paper Link
- Can Embeddings Adequately Represent Medical Terminology? New Large-Scale Medical Term Similarity Datasets Have the Answer! Paper Link
List of Medical NLP Related Papers from EMNLP 2020
- Towards Medical Machine Reading Comprehension with Structural Knowledge and Plain Text. Paper Link
- MedDialog: Large-scale Medical Dialogue Datasets. Paper Link
- COMETA: A Corpus for Medical Entity Linking in the Social Media. Paper Link
- Biomedical Event Extraction as Sequence Labeling. Paper Link
- FedED: Federated Learning via Ensemble Distillation for Medical Relation Extraction. Paper Link Paper Analysis
- Infusing Disease Knowledge into BERT for Health Question Answering, Medical Inference and Disease Name Recognition. Paper Link
- A Knowledge-driven Generative Model for Multi-implication Chinese Medical Procedure Entity Normalization. Paper Link
- BioMegatron: Larger Biomedical Domain Language Model. Paper Link
- Querying Across Genres for Medical Claims in News. Paper Link
- Tokenization tool: PKUSEG. Project Link
- Project Description: A multi-domain Chinese word segmentation tool released by Peking University, with a model option for the medical domain.
- Lingyi Wisdom
- Left Hand Doctor
- Yidu Cloud Research Institute - Medical Natural Language Processing
- Baidu - Medical Text Structuring
- Alibaba Cloud - Medical Natural Language Processing
- Alpaca: A Powerful Open Source Instruction Following Model
- Lessons Learned from Building Natural Language Processing Systems in the Medical Field
- Introduction to Medical Public Databases and Data Mining Techniques in the Big Data Era
- Looking at the Development of NLP Application in the Medical Field from ACL 2021, with Resource Download
- awesome_Chinese_medical_NLP
- Chinese NLP Dataset Search
- medical-data (Large Amount of Medical Related Data)
- Tianchi Dataset (Includes Multiple Medical NLP Datasets)
@misc{medical_NLP_github,
author = {Xidong Wang and Ziyue Lin and Jing Tang},
title = {Medical NLP},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/FreedomIntelligence/Medical_NLP}}
}