Perhaps this exists elsewhere, but if so I'm not aware of it. It's a common problem to need expert human review of NLP output to establish its accuracy (see https://en.wikipedia.org/wiki/Precision_and_recall), e.g. as:

- recall, e.g. P(NLP says X occurred | X actually occurred)
- precision, e.g. P(X actually occurred | NLP says X occurred)
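For concreteness, the two quantities in confusion-matrix terms (a minimal sketch; the counts are invented for illustration only):

```python
# Precision and recall from human-rated counts (invented numbers).
tp = 180  # NLP says X occurred, and the human confirms X actually occurred
fp = 20   # NLP says X occurred, but the human says it did not
fn = 15   # NLP missed X, but the human says X actually occurred

precision = tp / (tp + fp)  # P(X actually occurred | NLP says X occurred)
recall = tp / (tp + fn)     # P(NLP says X occurred | X actually occurred)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```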
Assessing precision in this format is conceptually straightforward: sample N records from the NLP "hits" for a topic X, where N is as large a number as the human rater can tolerate; then view the corresponding source records, and rate whether X did or did not happen for each. This could be very straightforward in CRATE, which surrounds all NLP tools, be they internal or external, with a relational database format that indexes back to the source record and the relevant position within the source record (which, of course, is expected to be de-identified).
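In code, the manual version of that workflow might look roughly like this (a sketch only; the table and column names are illustrative, not necessarily CRATE's exact schema):

```python
# Rough sketch of the manual precision workflow, assuming a CRATE-style NLP
# output table (here "crp") whose rows index back to the (de-identified)
# source record. Table/column names are illustrative.
import random
import sqlite3  # stand-in for the real research database connection

conn = sqlite3.connect("research.db")
hits = conn.execute(
    "SELECT _srctable, _srcpkfield, _srcpkval, _srcfield FROM crp"
).fetchall()

random.seed(1234)  # fix the seed so the sample is reproducible
sample = random.sample(hits, k=min(200, len(hits)))

for srctable, srcpkfield, srcpkval, srcfield in sample:
    # Fetch the source text containing the hit and show it to the human rater,
    # who records whether X (here, a CRP value) genuinely occurred.
    (text,) = conn.execute(
        f"SELECT {srcfield} FROM {srctable} WHERE {srcpkfield} = ?", (srcpkval,)
    ).fetchone()
    # ... present `text`, collect and store the rating ...
```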
Assessing recall is trickier because the total number of records is usually unfeasibly large. An approximation is often a keyword search. (And, for CRATE's internal NLP tools, every tool has a corresponding "Validator" tool, e.g. for "Crp" [C-reactive protein, CRP] there is "CrpValidator", which uses the same keywords as the core NLP tool but omits surrounding "grammar"-like structure and numerical requirements.)
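The validator-based recall estimate then amounts to: of the validator hits that a human judges to be genuine, what fraction did the core tool also find? A minimal sketch (the data structure and numbers are invented for illustration):

```python
# Recall approximation via the validator: sample from the broader
# "CrpValidator" hits, have a human mark which genuinely report a CRP value,
# then see what fraction of those the core "Crp" tool also detected.

def estimate_recall(rated_validator_hits):
    """rated_validator_hits: iterable of (human_says_present, core_tool_also_hit)."""
    core_hits_among_genuine = [
        core_hit for present, core_hit in rated_validator_hits if present
    ]
    if not core_hits_among_genuine:
        return float("nan")
    return sum(core_hits_among_genuine) / len(core_hits_among_genuine)

# e.g. 150 validator hits rated by hand:
ratings = [(True, True)] * 120 + [(True, False)] * 10 + [(False, False)] * 20
print(estimate_recall(ratings))  # ≈ 0.923
```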
Doing this in practice for arbitrary NLP is challenging for those without significant computational experience, and overly laborious.
So I was wondering whether it would be worth building a system for researchers, within the existing Django web site, to:
- create a "sample" (e.g. 200 records where "Crp" found a hit; or 150 records where "CrpValidator" found a hit) and save it for later; creation being at random (e.g. from integer PKs), with an optional prespecified random number generator seed for reproducibility (if not specified, using a standard quasi-random seed) -- see the sketch after this list;
- define a question and possible answers (typically binary -- perhaps not always, but binary would deal with most use cases);
- associate a question with a sample to make a job (NB the same question might obviously be applied to different samples, but also sometimes you want >1 question per sample -- e.g. "was clozapine prescribed?" versus the more specific "was clozapine prescribed at the time of the source record?", e.g. https://pubmed.ncbi.nlm.nih.gov/27336041/);
- present unclassified records from the job to a human, along with the question (and perhaps the NLP record that prompted the question); record each decision; allow the human to stop at any point and resume later;
- ... and perhaps allow >1 user to contribute to the work, ensuring that they don't overlap (job-sharing) -- or, in other methodologies, that they do (inter-rater reliability);
- enable the classification results to be saved/downloaded (even if the source database is later wiped/recreated).
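A minimal sketch of the reproducible sample creation mentioned above (the function and variable names are hypothetical):

```python
# Freeze a reproducible sample: given the integer PKs of the NLP hits, draw N
# of them using an optional user-supplied seed, falling back to a standard
# seed if none is given.
import random

DEFAULT_SEED = 1234  # the "standard quasi-random seed" used when none is specified

def draw_sample(hit_pks: list[int], n: int, seed: int | None = None) -> list[int]:
    rng = random.Random(DEFAULT_SEED if seed is None else seed)
    return sorted(rng.sample(hit_pks, k=min(n, len(hit_pks))))

# e.g. 200 records where "Crp" found a hit:
all_crp_hit_pks = list(range(1, 5001))  # stand-in for the real hit PKs
crp_sample = draw_sample(all_crp_hit_pks, n=200, seed=42)
# Saving the chosen PKs (plus the seed and parameters) is what makes the
# sample available for later and reproducible.
```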
The question might involve seeing the NLP output. For example, if there is an NLP "hit" for CRP that says the CRP is 43 mg/L, the users might be happy with a result-independent question ["Does this text show a C-reactive protein (CRP) value?"]. But they might want a specific one ["Does this text show a C-reactive protein (CRP) value AND does that value match the NLP output?"] -- in case the NLP mistakenly thinks it's 4.3, or 430 (e.g. via unit mis-conversion), or some other failure. So displaying the NLP record is likely to be needed often when classifying.
Just moving some old notes here so I can clear out some paper. Vague thoughts about potential table structure:
- rating_task (a concept like "assessing CRP accuracy for Bob's study");
- rating_question (1 or more per task);
- rating_options (2 or more per question, assuming all questions are MCQ to start with);
- rating_sample (freezing a selection of records, e.g. a random one, potentially across several source tables) [there might be "sample creation" options to create a random sample, or a sample non-overlapping with other sample(s)];
- rating_job (a combination of rating_task, rating_sample, and one or more users? Somehow dealing with the "job share" versus "rate twice" option -- perhaps all users on a job are job-sharing, and you create two jobs if you want to do inter-rater reliability);
- rating_answer (responses to each question for each record in the job, and who responded when)?
Presentationally, the NLP record might need to be defined (e.g. in the job?) as the point picked out by the NLP result, plus a backwards and forwards span of context.
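To make the table ideas concrete, a very rough Django model sketch (field names and relationships are guesses, not a final design):

```python
# Rough Django model sketch of the proposed rating_* tables.
# All names and fields are illustrative guesses.
from django.conf import settings
from django.db import models


class RatingTask(models.Model):          # e.g. "assessing CRP accuracy for Bob's study"
    name = models.CharField(max_length=255)
    description = models.TextField(blank=True)


class RatingQuestion(models.Model):      # 1 or more per task
    task = models.ForeignKey(RatingTask, on_delete=models.CASCADE)
    text = models.TextField()


class RatingOption(models.Model):        # 2 or more per question (MCQ)
    question = models.ForeignKey(RatingQuestion, on_delete=models.CASCADE)
    label = models.CharField(max_length=255)


class RatingSample(models.Model):        # a frozen selection of source records
    name = models.CharField(max_length=255)
    rng_seed = models.BigIntegerField(null=True, blank=True)
    created_at = models.DateTimeField(auto_now_add=True)


class RatingSampleRecord(models.Model):  # one row per frozen source record
    sample = models.ForeignKey(RatingSample, on_delete=models.CASCADE)
    src_db = models.CharField(max_length=255)
    src_table = models.CharField(max_length=255)
    src_pk = models.BigIntegerField()
    src_field = models.CharField(max_length=255)
    # Span of the NLP hit within the source field, so it can be displayed with
    # a backwards/forwards window of context:
    span_start = models.IntegerField(null=True, blank=True)
    span_end = models.IntegerField(null=True, blank=True)


class RatingJob(models.Model):           # task + sample + one or more users
    task = models.ForeignKey(RatingTask, on_delete=models.CASCADE)
    sample = models.ForeignKey(RatingSample, on_delete=models.CASCADE)
    raters = models.ManyToManyField(settings.AUTH_USER_MODEL)
    # All raters on one job share the work; create a second job over the same
    # task/sample if inter-rater reliability is wanted instead.


class RatingAnswer(models.Model):        # who answered what, for which record, when
    job = models.ForeignKey(RatingJob, on_delete=models.CASCADE)
    question = models.ForeignKey(RatingQuestion, on_delete=models.CASCADE)
    record = models.ForeignKey(RatingSampleRecord, on_delete=models.CASCADE)
    option = models.ForeignKey(RatingOption, on_delete=models.CASCADE)
    rater = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.PROTECT)
    answered_at = models.DateTimeField(auto_now_add=True)
```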