Both the query expansion code and the code which maps SNOMED codes to UMLS concepts require a local copy of the UMLS database.
NOTE! The UMLS CANNOT be redistributed or used without an agreement.
NOTE! The 2019 version of the UMLS database must be used for two reasons:
- the annotation creation part of SemEHR uses the 2019 version so it makes sense to keep the concepts identical
- the 2021 version of the UMLS removed the relationships from the MRREL file and the replacement MRHIER file is not so useful
There are two types of parent/child relationship:
- Broader/Narrower
- Parent/Child
See the list of abbreviations at https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html
Honghan thinks only Narrower is needed but the UMLS people say:
- RB/RN are generally used when the relations are not part of a broader hierarchy.
- PAR/CHD relations are generally part of a hierarchy. There may be some exceptions to this.
Maybe they mean that PAR/CHD are used between concepts within one particular source vocabulary (e.g. SNOMED) whereas RB/RN are not.
"The Semantic Structure of the UMLS Metathesaurus" by Stuart J. Nelson says:
In some instances these relationships are labelled, indicating that the two concepts are similar, with one being broader in meaning (but otherwise similar) than the other. Contextual information shows where the concept occurs in a source hierarchy, together with its parents, and siblings. The relationship with a parent is often that of an "is-a" relationship, but some other vertical relationships have been labelled as well. While not entirely true, it is frequently useful to think of non-labelled parent-child relationships as being of some type of relationship which involves subsumption. That is, in some sense the parent concept is broader in meaning than the child. Co-occurring data do not have any label on the link between concepts; the fact that two concepts have both been used to index the same article implies an empirically discovered relationship. Two entries with the same semantic type have an implied relationship, that of "similar to".
Download the UMLS Metathesaurus (warning: needs 40GB of space!). Unpack:
unzip umls-2019.zip
cd 2019*
unzip 2019*-1-meta.nlm
unzip 2019*-2-meta.nlm
cd 2019*/META
gunzip *.gz
cat MRCONSO.RRF.a? > MRCONSO.RRF
cat MRHIER.RRF.a? > MRHIER.RRF
cat MRREL.RRF.a? > MRREL.RRF
cat MRSAT.RRF.a? > MRSAT.RRF
cat MRXNS_ENG.RRF.a? > MRXNS_ENG.RRF
cat MRXNW_ENG.RRF.a? > MRXNW_ENG.RRF
cat MRXW_ENG.RRF.a? > MRXW_ENG.RRF
wget https://lhncbc.nlm.nih.gov/ii/tools/MetaMap/Docs/SemanticTypes_2018AB.txt
wget https://lhncbc.nlm.nih.gov/ii/tools/MetaMap/Docs/SemGroups_2018.txt
Convert the useful parts of the database to CSV files in our own format. These keep only a useful selection of columns and also reshape the rows so that, for example, the Narrower Concepts are all listed in a single row for each concept rather than as multiple rows.
./umls_to_csv.py [-h] [--sources SOURCES] [--chd] [--test TEST]
--sources SOURCES comma-separated list of sources default MTH,SNOMEDCT_US
--chd whether to include CHD (child) relationships, default True
--test TEST CUI to display narrower concepts, default C0205076
There is an option to specify whether the relationships should be filtered to include only those defined in the Metathesaurus or in SNOMED. By default both are included, which means that relationships defined by other vocabularies are ignored. You can filter to MTH only, to SNOMEDCT_US only, or to both using:
umls_to_csv.py --sources MTH
umls_to_csv.py --sources SNOMEDCT_US
umls_to_csv.py --sources MTH,SNOMEDCT_US
The difference, as an example using concept C0205076 (Chest Wall) is:
- MTH - only has 1 child concept, fully expands to 2 children
- SNOMED - has 16 child concepts, fully expands to 2283 children
- both - same 16 child concepts, fully expands to 8075 children
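The gap between the direct child counts and the "fully expands" counts above comes from following the narrower/child links transitively. A minimal sketch of such an expansion, assuming the relationships are already loaded into a dict mapping each CUI to its direct narrower CUIs (the function name expand_narrower and the toy data are hypothetical, not part of umls.py):

```python
from collections import deque


def expand_narrower(cui, narrower):
    """Return every concept reachable from `cui` by following narrower links."""
    seen = set()
    queue = deque([cui])
    while queue:
        current = queue.popleft()
        for child in narrower.get(current, []):
            # Skip concepts already visited, and the start concept itself,
            # so that cycles in the relationships cannot loop forever.
            if child not in seen and child != cui:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical toy data, not real UMLS content.
toy = {"A": ["B", "C"], "B": ["D"], "D": ["A"]}
print(sorted(expand_narrower("A", toy)))  # ['B', 'C', 'D']
```

The breadth-first walk visits each concept once, so the cost grows with the number of relationships reached rather than with the number of paths through them.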
The option --chd can be used to disable the inclusion of Child relations, as the default is to include them.
umls.py --cfg . --csvs SNOMED/
umls.py --cfg . --csvs MTH/
umls.py --cfg . --csvs MTH+SNOMED/
N.B. you can read the
Maps a semantic type (tui) to a semantic type group
i.e. groups together related tui into a set
Group|GroupName|Type|TypeLabel
e.g. ACTI|Activities & Behaviors|T052|Activity
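As an illustration, a line in this Group|GroupName|Type|TypeLabel layout can be split and reordered into the tui|tuigroup|tuigrouplabel layout described further down. This is only a sketch; the function name semgroup_to_sty_row is hypothetical and the real conversion in umls_to_csv.py may differ.

```python
def semgroup_to_sty_row(line):
    """Reorder 'Group|GroupName|Type|TypeLabel' into 'tui|tuigroup|tuigrouplabel'."""
    group, group_name, tui, _type_label = line.rstrip("\n").split("|")
    return "|".join([tui, group, group_name])


print(semgroup_to_sty_row("ACTI|Activities & Behaviors|T052|Activity"))
# T052|ACTI|Activities & Behaviors
```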
Maps a concept to a semantic type
cui|tui|stn|sty|atui|cvf...
e.g. C0000005|T116|A1.4.1.2.1.7|Amino Acid, Peptide, or Protein|AT17648347|256|
Maps from concept (cui) to other vocabularies
cui|lang|ts|lui| stt|sui|IsPref|aui |saui |scui|sdui|sab|tty|code|str|srl|suppress|cvf
e.g.
C0205076|ENG|S|L0248726|PF|S2717960|N|A32395146| |C62484||NCI_caDSR|SY|C62484|Chest Wall|0|N||
C0205076|ENG|S|L0248726|PF|S2717960|Y|A26648386| |M0407552|D035441|MSH|ET|D035441|Chest Wall|0|N|256|
C0205076|ENG|P|L0780053|PF|S0836022|N|A3108835|503946010|78904004||SNOMEDCT_US|PT|78904004|Chest wall structure|9|N|256|
C0205076|ENG|S|L0248726|VO|S0282525|Y|A2895894|130920016|78904004||SNOMEDCT_US|SY|78904004|Chest wall|9|N|256|
Relationships between CUIs
cui1 |aui1 |typ1|rel|cui2 |aui2 |typ2|rela|rui|srui|sab|sl|rg|dir|suppress|cvf
e.g.
C0000005|A13433185|SCUI|RB|C0036775|A7466261|SCUI||R86000559||MSHFRE|MSHFRE|||N||
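For illustration, a row like the one above can be split on the pipe character and matched to the column names, after which filtering by source (the sab column) and by relationship type (the rel column) is straightforward. The sketch below is an assumption about how such filtering could be written, not the code in umls_to_csv.py; parse_mrrel_line and keep_relation are hypothetical names.

```python
MRREL_COLUMNS = [
    "cui1", "aui1", "typ1", "rel", "cui2", "aui2", "typ2", "rela",
    "rui", "srui", "sab", "sl", "rg", "dir", "suppress", "cvf",
]


def parse_mrrel_line(line):
    """Turn one pipe-separated relationship row into a dict keyed by column name."""
    # Each row ends with a trailing pipe, so split() yields one extra
    # empty field; zip() silently drops it.
    return dict(zip(MRREL_COLUMNS, line.rstrip("\n").split("|")))


def keep_relation(row, sources=("MTH", "SNOMEDCT_US"), rels=("RN", "CHD")):
    """Assumed filter: keep rows from the chosen sources with a wanted relationship."""
    return row["sab"] in sources and row["rel"] in rels


line = "C0000005|A13433185|SCUI|RB|C0036775|A7466261|SCUI||R86000559||MSHFRE|MSHFRE|||N||"
row = parse_mrrel_line(line)
print(row["cui1"], row["rel"], row["cui2"], row["sab"])  # C0000005 RB C0036775 MSHFRE
print(keep_relation(row))  # False
```

The example row is rejected on both counts: its source is MSHFRE and its relationship is RB.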
Note that the files are pipe-separated so that the columns can contain commas more easily.
Maps CUI (concept id) to TUI (semantic type id), and also gives the TUI group names and a label (description) of the CUI. A CUI can have several semantic types, so they are comma-separated. The source file (MRCONSO) has multiple rows per CUI, so the preferred label is taken from the row where LAT='ENG', TS='P', STT='PF' and ISPREF='Y'.
cui|tui|tuigroup|cuilabel
C0000005|T116,T121,T130|CHEM,CHEM,CHEM|(131)I-MAA
(In practice the 3 group names are identical so only CHEM is output)
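A minimal sketch of the two rules described above: picking the preferred label and collapsing repeated group names. It assumes the MRCONSO rows have already been parsed into dicts keyed by lowercase column names (lat, ts, stt, ispref, str) and that a TUI-to-group lookup is available; the function names and the toy rows are hypothetical.

```python
def pick_preferred_label(rows):
    """Return the label of the first row flagged as the preferred English form, or None."""
    for row in rows:
        if (row["lat"], row["ts"], row["stt"], row["ispref"]) == ("ENG", "P", "PF", "Y"):
            return row["str"]
    return None


def format_cui_row(cui, tuis, tui_to_group, label):
    """Build one 'cui|tui|tuigroup|cuilabel' line."""
    # dict.fromkeys() removes duplicate group names while keeping first-seen order.
    groups = list(dict.fromkeys(tui_to_group[tui] for tui in tuis))
    return "|".join([cui, ",".join(tuis), ",".join(groups), label])


# Toy rows for illustration only (not copied from MRCONSO).
rows = [
    {"lat": "ENG", "ts": "S", "stt": "PF", "ispref": "Y", "str": "a synonym"},
    {"lat": "ENG", "ts": "P", "stt": "PF", "ispref": "Y", "str": "(131)I-MAA"},
]
tui_to_group = {"T116": "CHEM", "T121": "CHEM", "T130": "CHEM"}
label = pick_preferred_label(rows)
print(format_cui_row("C0000005", ["T116", "T121", "T130"], tui_to_group, label))
# C0000005|T116,T121,T130|CHEM|(131)I-MAA
```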
Relationship between CUI1 and CUI2 with the implied meaning "has", i.e. cui1 has a narrower (RN) concept cui2. The list of narrower concepts is comma-separated.
cui1|has|cui2
C0000039|RN|C0043950,C0615231,C3253442,C3885037,C0621533,C0216971,C1611431,C0381030,C4489915,C1310941
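The reshaping from one row per relationship to one row per concept can be sketched as follows, assuming the relationships arrive as (cui1, cui2) pairs that have already been filtered. The function names are hypothetical and the toy pairs are not real UMLS data.

```python
def group_narrower(pairs):
    """Collect the cui2 values for each cui1, dropping repeats but keeping order."""
    grouped = {}
    for cui1, cui2 in pairs:
        targets = grouped.setdefault(cui1, [])
        if cui2 not in targets:
            targets.append(cui2)
    return grouped


def format_rel_row(cui1, cui2_list, rel="RN"):
    """Build one 'cui1|has|cui2' line with comma-separated cui2 values."""
    return "|".join([cui1, rel, ",".join(cui2_list)])


def parse_rel_row(line):
    """Inverse of format_rel_row: split a line back into (cui1, rel, [cui2, ...])."""
    cui1, rel, cui2_field = line.rstrip("\n").split("|")
    return cui1, rel, cui2_field.split(",")


# Toy pairs for illustration only.
pairs = [("X1", "X2"), ("X1", "X3"), ("X2", "X4"), ("X1", "X2")]
for cui1, targets in group_narrower(pairs).items():
    print(format_rel_row(cui1, targets))
# X1|RN|X2,X3
# X2|RN|X4
```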
Maps a SNOMED code to a CUI. If a code had multiple CUIs, only one was chosen.
snomed|cui
100000000|C0308478
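One way to build such a mapping is to scan the MRCONSO rows whose source (sab) is SNOMEDCT_US and keep the first CUI seen for each code. This is a sketch under that assumption: the text above does not say which CUI is kept when there are several, the use of the code column is a guess, and snomed_to_cui is a hypothetical name.

```python
MRCONSO_COLUMNS = [
    "cui", "lat", "ts", "lui", "stt", "sui", "ispref", "aui", "saui",
    "scui", "sdui", "sab", "tty", "code", "str", "srl", "suppress", "cvf",
]


def snomed_to_cui(lines):
    """Map each SNOMEDCT_US code to the first CUI it appears with."""
    mapping = {}
    for line in lines:
        row = dict(zip(MRCONSO_COLUMNS, line.rstrip("\n").split("|")))
        if row["sab"] == "SNOMEDCT_US":
            # setdefault() keeps the first CUI and ignores later ones (assumed rule).
            mapping.setdefault(row["code"], row["cui"])
    return mapping


lines = [
    "C0205076|ENG|S|L0248726|PF|S2717960|Y|A26648386| |M0407552|D035441|MSH|ET|D035441|Chest Wall|0|N|256|",
    "C0205076|ENG|P|L0780053|PF|S0836022|N|A3108835|503946010|78904004||SNOMEDCT_US|PT|78904004|Chest wall structure|9|N|256|",
]
print(snomed_to_cui(lines))  # {'78904004': 'C0205076'}
```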
Describes a Semantic Type. The identifier TUI is a member of a group of similar types tuigroup, and the group has a label tuigrouplabel. This is used to find out if a concept is related to other concepts that also share a similar Semantic Type, i.e. their types are both members of the same Semantic Group.
tui|tuigroup|tuigrouplabel
T001|LIVB|Living Beings
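The group lookup described above can be sketched as below, assuming the rows in this format and each concept's comma-separated TUI list have already been read from the CSV files. The names load_sty and share_semantic_group are hypothetical, and the CHEM rows are toy input written in the same layout.

```python
def load_sty(lines):
    """Build a TUI -> group lookup from 'tui|tuigroup|tuigrouplabel' lines."""
    tui_to_group = {}
    for line in lines:
        tui, group, _label = line.rstrip("\n").split("|")
        tui_to_group[tui] = group
    return tui_to_group


def share_semantic_group(tuis_a, tuis_b, tui_to_group):
    """True if any semantic type of one concept is in the same group as any of the other."""
    groups_a = {tui_to_group[tui] for tui in tuis_a.split(",")}
    groups_b = {tui_to_group[tui] for tui in tuis_b.split(",")}
    return bool(groups_a & groups_b)


sty = load_sty([
    "T001|LIVB|Living Beings",
    "T116|CHEM|Chemicals & Drugs",
    "T121|CHEM|Chemicals & Drugs",
])
print(share_semantic_group("T116", "T121", sty))  # True
print(share_semantic_group("T001", "T116,T121", sty))  # False
```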
Use this script to load the resulting CSV files into a database:
./umls_create_postgres.py
Schema: umls. Permission: granted to semehr_user and semehr_admin.
CREATE TABLE umls.cui(cui varchar(16), tui varchar(99), tuigroup varchar(99), cuilabel text);
CREATE INDEX icui ON umls.cui (cui);
CREATE TABLE umls.rel(cui1 varchar(16), has varchar(4), cui2 text);
CREATE INDEX icui1 ON umls.rel (cui1);
CREATE TABLE umls.snomed(snomed text, cui text NOT NULL);
CREATE INDEX isnomed ON umls.snomed (snomed);
CREATE TABLE umls.sty(tui varchar(8), tuigroup varchar(8), tuigrouplabel varchar(99));
CREATE INDEX isty ON umls.sty (tui);