We built a small system and wrote a short paper about it for the Computation+Journalism Symposium 2024: the Science De-jargonizer simplifies scientific jargon for journalists without scientific backgrounds, using Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). The tool transforms complex terms into clear explanations, supporting accurate and accessible science reporting. It is a prototype, intended as a proof-of-concept to demonstrate the potential benefits and drawbacks of such an application. This repository contains our codebase for the study, including all our data collection code, data analysis, prompts, and datasets.
- Jargon Identification: Automatically identifies complex scientific terms within academic texts.
- Personalization: Identifies jargon terms based on the user’s expertise and background.
- Clear Explanations: Generates easy-to-understand definitions and explanations based on the context of a paper.
We ran a short pilot study to evaluate the potential of GPT-4 and RAG for identifying and defining jargon terms for the benefit of science reporters. We tested different prompts (i) to personalize jargon identification based on the reader's science expertise, and (ii) to generate accurate, high-quality definitions of jargon terms for easy reading.
We evaluated the identified jargon terms and definitions by comparing them to ground-truth annotations from two annotators with varying scientific expertise. This was a relatively small-scale study: we looked at jargon terms in arXiv CS abstracts (n=64), sampled from articles published in March 2024 in the following primary categories: cs.AI, cs.HC, cs.CY. We also compared two approaches to definition generation: RAG with context retrieved from the fulltext, and simply using the paper abstract as context in the prompt. The abstract-only condition performed slightly better!
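To make the comparison concrete, here is a minimal sketch of how model-identified terms can be scored against an annotator's list. This is an exact-string-match precision/recall sketch of our own, not necessarily the metric reported in the paper, and the function names are illustrative:

```python
def parse_terms(csv_line: str) -> set:
    """Split a comma-separated list of jargon terms into a normalized set."""
    return {t.strip().lower() for t in csv_line.split(",") if t.strip()}


def term_overlap(predicted: str, gold: str) -> dict:
    """Exact-match precision, recall, and F1 between model output and
    an annotator's ground-truth term list."""
    pred, ref = parse_terms(predicted), parse_terms(gold)
    tp = len(pred & ref)  # terms both the model and the annotator flagged
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, if the model returns "RAG, latent space, ablation" and the annotator marked "latent space, ablation, fine-tuning", two of three terms match on each side, giving precision, recall, and F1 of 2/3.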
We aim to continue this work further, at scale, and with improved approaches to prompting and UI design.
The interface displays the jargon identified by one of our annotators, along with the correct definitions generated for each term by the abstract-only system. You can run it locally as follows:

```shell
cd my-app
npm install
npm start
```
System prompt for GPT-4:
Your task is to identify jargon terms in scientific abstracts for readers, specific to the methods and concepts introduced in the study discussed. Jargon terms can encompass multiple words that refer to a concept. Only identify jargon that prevents readers from developing a basic understanding of the important concepts in the study. If a term is defined in the abstract, it is not jargon. Here is some information about the reader, their expertise, and the domain of the scientific abstract they are reading:
Reader Description: ANNOTATOR DESCRIPTION
Domain of Abstract: Computer Science, focusing on SUB-CATEGORY OF ABSTRACT.
From the provided abstract, only return a comma-separated list of jargon terms given what you know about the reader, their expertise, and the abstract domain. Retain the exact wording of the jargon terms as they appear in the abstract. Do not make any changes in wording or punctuation.
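As a minimal sketch of how the placeholders above might be filled in before calling GPT-4 (the function name, parameter names, and message structure here are our own illustration, not code from the repo), the prompt can be assembled like this:

```python
JARGON_SYSTEM_PROMPT = """\
Your task is to identify jargon terms in scientific abstracts for readers, \
specific to the methods and concepts introduced in the study discussed. \
Jargon terms can encompass multiple words that refer to a concept. Only \
identify jargon that prevents readers from developing a basic understanding \
of the important concepts in the study. If a term is defined in the \
abstract, it is not jargon. Here is some information about the reader, \
their expertise, and the domain of the scientific abstract they are reading:
Reader Description: {reader_description}
Domain of Abstract: Computer Science, focusing on {sub_category}.
From the provided abstract, only return a comma-separated list of jargon \
terms given what you know about the reader, their expertise, and the \
abstract domain. Retain the exact wording of the jargon terms as they \
appear in the abstract. Do not make any changes in wording or punctuation."""


def build_jargon_messages(reader_description: str, sub_category: str,
                          abstract: str) -> list:
    """Fill the system-prompt template and pair it with the abstract,
    in the chat-message format expected by the OpenAI API."""
    system = JARGON_SYSTEM_PROMPT.format(
        reader_description=reader_description, sub_category=sub_category)
    return [{"role": "system", "content": system},
            {"role": "user", "content": abstract}]
```

The resulting message list can then be passed to a chat-completion call, and the comma-separated response split back into individual terms.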
We retained the default system prompt from llama-index for both the RAG-based and the abstract-based jargon definitions:
You are an expert Q&A system that is trusted around the world. Always answer the query using the provided context information, and not prior knowledge. Some rules to follow:
- Never directly reference the given context in your answer.
- Avoid statements like "Based on the context, ..." or "The context information ..." or anything along those lines.
The query prompt for the RAG-based system:
Context information is below:
CONTEXT FROM RETRIEVAL STEP
Given the context information and not prior knowledge, answer the query. Query: Please use 1-2 sentences to explain the following term so that even a reader without deep scientific and technical knowledge can understand it easily:
JARGON TERM
Answer:
The query prompt for the abstract-based system was similar to the RAG-based system, but it used the entire abstract as context instead of retrieved fulltext snippets.
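Since the two conditions share the same query template and differ only in what fills the context slot, the assembly step can be sketched as follows (the function name and template constant are our own, for illustration):

```python
DEFINITION_QUERY = """\
Context information is below:
{context}
Given the context information and not prior knowledge, answer the query. \
Query: Please use 1-2 sentences to explain the following term so that even \
a reader without deep scientific and technical knowledge can understand it \
easily:
{term}
Answer:"""


def build_definition_query(term: str, context: str) -> str:
    """Fill the shared query template. `context` is either the retrieved
    fulltext snippets (RAG condition) or the paper abstract
    (abstract-only condition)."""
    return DEFINITION_QUERY.format(context=context, term=term)
```

Swapping the abstract in for the retrieved snippets is the only change needed to move between the two conditions.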
- Project Maintainers:
- Sachita Nishal: [email protected]
- Eric Lee: [email protected]
- Thanks to the Computational Journalism Lab at Northwestern for their support.