ChatGPT has demonstrated the capability of machine learning for natural language processing (NLP) and text mining (TM) of unstructured data. While OpenAI’s GPT-4 works across many domains of knowledge, it is a “black box”: details about its training data have not been made public.
Open-source resources are available to support the TM of biomedical literature and unstructured clinical text (e.g., clinical letters, imaging reports) to automatically identify information required by health data researchers. These resources include rule-based TM methods, machine learning methods involving state-of-the-art neural-network-based techniques (e.g., BERT, BERN2), and text corpora for testing (and training) the methods. Public corpora include annotated publication abstracts and full texts, and unstructured clinical text for synthetic or anonymised patients. It remains a challenge for researchers to find the most appropriate TM resources for their use case, and to store the identified data in a standard annotation format to maximise reuse.
This project will make sense of current TM resources that are appropriate for the health data domain by collating them into a single catalogue and producing a decision tree diagram to help researchers identify the most appropriate approach. Health data use cases contributed by project participants will be evaluated against the decision tree, and new exemplar TM pipelines will be built. Existing TM methods will be optimised (and trained if necessary) and tested for identifying biomedical entities of interest. The entities will be represented using standardised annotation formats such as BioC and the Web Annotation Data Model specification.
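As an illustration of what a standardised representation might look like, the following is a minimal sketch of a single identified entity expressed with the Web Annotation Data Model. The helper name make_annotation, the example.org IRIs, and the sample sentence are hypothetical and are not part of the project; only the JSON-LD context, the "Annotation" type, and the two selector types come from the W3C specification.

```python
import json


def make_annotation(anno_id, source_iri, text, start, end, concept_iri):
    """Build a Web Annotation (as a plain dict) for the span text[start:end].

    `end` is the offset of the first character after the span, which matches
    both Python slicing and the TextPositionSelector definition.
    """
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": anno_id,
        "type": "Annotation",
        "motivation": "tagging",
        # The body is the concept the text span has been tagged with.
        "body": concept_iri,
        "target": {
            "source": source_iri,
            # Two selectors describing the same span: offsets and exact text.
            "selector": [
                {"type": "TextPositionSelector", "start": start, "end": end},
                {"type": "TextQuoteSelector", "exact": text[start:end]},
            ],
        },
    }


# Hypothetical sentence and identifiers, for illustration only.
sentence = "Patient has type 2 diabetes."
mention = "type 2 diabetes"
start = sentence.index(mention)
annotation = make_annotation(
    anno_id="http://example.org/annotations/1",
    source_iri="http://example.org/documents/letter-1",
    text=sentence,
    start=start,
    end=start + len(mention),
    concept_iri="http://example.org/concepts/type-2-diabetes",
)
print(json.dumps(annotation, indent=2))
```

Storing both a position selector and a quote selector is a design choice in this sketch: the offsets make the span easy to locate, while the quoted text allows a consumer to check that the offsets still point at the intended mention.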
The health data TM resources catalogue and decision tree diagram will be completed during BH 2023. The complexity of the methods required to support the use cases will determine how long the exemplars take to complete. This will also be influenced by the availability of an annotated corpus to validate the outputs. For example, the NLM provides corpora of abstracts annotated with diseases and drugs, which could be used to validate results.
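One common way to validate TM output against an annotated corpus is to compare predicted entity spans with the gold-standard spans and report precision, recall, and F1. The sketch below assumes exact-match scoring on (document id, start, end, entity type) tuples; the function name evaluate and the sample tuples are hypothetical, and the project may adopt a different matching scheme (for example, partial overlap).

```python
def evaluate(predicted, gold):
    """Return (precision, recall, f1) for exact-match entity spans.

    Each annotation is a hashable tuple: (doc_id, start, end, entity_type).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold-standard and predicted annotations, for illustration only.
gold_spans = [("doc1", 0, 8, "Disease"), ("doc1", 20, 27, "Chemical")]
predicted_spans = [("doc1", 0, 8, "Disease"), ("doc1", 30, 35, "Chemical")]

precision, recall, f1 = evaluate(predicted_spans, gold_spans)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```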
We will focus on selecting use cases where we can validate the results (and train machine learning models if necessary) using existing corpora. We will continue the coordination of longer-term project goals through the HDFG. There is no minimum number of participants. Experience with TM would be an advantage, although health data domain experts are also welcome to provide use cases and validate results.
Tim Beck, University of Nottingham
Venkata Satagopam, University of Luxembourg