Binh Vu edited this page Nov 18, 2021 · 18 revisions

CLI

We can invoke GRAMS from the command line by calling grams or python -m grams.cli. Run grams --help to see the following input options.

Usage: cli.py [OPTIONS]

  Annotate tables using GRAMS

  Args:
      infiles: path to input tables. You can use a combination of wildcard (`*`) or named wildcard wrapped by curly brackets (e.g., `{name}` or `{group1}`), which behaves as wildcard. The named wildcard can be used in `outfiles` to name the output file. The list of input tables is retrieved by glob.glob (unix style pathname pattern expansion). Examples:
          - `/tables/{filename}.json`
          - `/tables/{filename}.csv`
      outfiles: path to the output annotations. You can only use named wildcard defined in `infiles`. Examples:
          - `/outputs/{filename}/version.01.json`
      data_dir: pass through option to GRAMS.data_dir
      proxy: pass through option to GRAMS.proxy
      cfg_file: cfg_file contains configuration of GRAMS

Options:
  -i, --infiles TEXT   path to input tables. You can use a combination of
                       wildcard (`*`) or named wildcard wrapped by curly
                       brackets (e.g., `{name}` or `{group1}`), which behaves
                       as wildcard. The named wildcard can be used in
                       `outfiles` to name the output file. The list of input
                       tables is retrieved by glob.glob (unix style pathname
                       pattern expansion). Examples: -
                       `/tables/{filename}.json` - `/tables/{filename}.csv`
                       [required]

  -o, --outfiles TEXT  path to the output annotations. You can only use named
                       wildcard defined in `infiles`. Examples: -
                       `/outputs/{filename}/version.01.json`

  -d, --data_dir TEXT  pass through option to GRAMS.data_dir
  -p, --proxy          pass through option to GRAMS.proxy
  --cfg_file TEXT      cfg_file contains configuration of GRAMS
  -r, --viz            visualize the annotated models
  --help               Show this message and exit.  

Besides the input tables and output directory, you can specify:

  • -d or --data_dir: the directory to store Wikidata's data. You can download pre-built databases as mentioned in the installation guide to avoid sending queries to a remote Wikidata service.
  • -p or --proxy: if enabled, we will send queries to the Wikidata service to retrieve entities that are not found in the local databases.
  • --cfg_file: location of GRAMS's configuration (you can find an example here)
  • -r or --viz: if enabled, we will visualize the output semantic model using graphviz. You must have dot available in your PATH to use this function.
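
The named wildcards described in the help text can be pictured with a short sketch. The code below only illustrates the idea and is not GRAMS's own implementation: the helper names (to_glob_pattern, to_capture_regex, resolve_outfile) are hypothetical, and it assumes that a wildcard never spans a path separator and that each name appears only once in a pattern.

```python
import re

# Matches either a named wildcard such as {filename} or a plain `*`.
WILDCARD = re.compile(r"\{(\w+)\}|\*")


def to_glob_pattern(infiles):
    """Replace every named wildcard with `*` so the pattern suits glob.glob."""
    return re.sub(r"\{\w+\}", "*", infiles)


def to_capture_regex(infiles):
    """Build a regex that captures the text matched by each named wildcard."""
    parts = []
    pos = 0
    for match in WILDCARD.finditer(infiles):
        # Literal text between wildcards must match exactly.
        parts.append(re.escape(infiles[pos:match.start()]))
        if match.group(1):
            # Named wildcard: capture it under its own name.
            parts.append(f"(?P<{match.group(1)}>[^/]*)")
        else:
            # Plain `*`: match without capturing.
            parts.append("[^/]*")
        pos = match.end()
    parts.append(re.escape(infiles[pos:]))
    return re.compile("".join(parts) + r"\Z")


def resolve_outfile(infiles, outfiles, infile):
    """Fill the names captured from one input path into the outfiles template."""
    match = to_capture_regex(infiles).match(infile)
    if match is None:
        raise ValueError(f"{infile!r} does not match {infiles!r}")
    return outfiles.format(**match.groupdict())


if __name__ == "__main__":
    infiles = "/tables/{filename}.csv"
    outfiles = "/outputs/{filename}/version.01.json"
    print(to_glob_pattern(infiles))
    print(resolve_outfile(infiles, outfiles, "/tables/cities.csv"))
```

In this sketch, to_glob_pattern produces the pattern that would be handed to glob.glob, and resolve_outfile derives the output path for each input file that the glob returns.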

Here is an example:

cd <gram_repo>
python -m grams.cli -pr \
    -i ./examples/semtab2020_novartis/tables/{filename}.csv \
    -o /tmp/test/{filename}/version.01.json

Output of a table from the command:

predicted model

The correct semantic description:

gold model

Input format

The data structure of the input table is specified in the grams.inputs.linked_table.LinkedTable class. An input can be loaded either from a JSON file whose properties align with the properties of the class (see the class definition for the schema of the JSON object), or from a CSV file containing the table data plus a TSV file containing the links/entities associated with each cell in the table, using the function grams.inputs.linked_table.LinkedTable.from_csv_file. If the TSV file (link file) is not provided, grams.inputs.linked_table.LinkedTable.from_csv_file by default looks for a file with the same name in the same folder but ending with .links.tsv.

The CSV file containing the table data follows the common format: the first row is the header and the remaining rows are the rows of the table. Below is an example:

col1,col2,col3
row1-cell1,row1-cell2,row1-cell3
row2-cell1,row2-cell2,row2-cell3
...
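
As a rough illustration of this layout, the sketch below reads such a file with Python's standard csv module and separates the header from the data rows. It is not how GRAMS itself loads tables (that is done by LinkedTable.from_csv_file); the function name read_table is hypothetical, and the sample text simply repeats the example above.

```python
import csv
import io


def read_table(csv_text):
    """Split CSV text into a header row and a list of data rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # the first row is the header
    rows = list(reader)    # every remaining row is a table row
    return header, rows


if __name__ == "__main__":
    sample = (
        "col1,col2,col3\n"
        "row1-cell1,row1-cell2,row1-cell3\n"
        "row2-cell1,row2-cell2,row2-cell3\n"
    )
    header, rows = read_table(sample)
    # Data rows are indexed from 0 and do not include the header.
    print(header)
    print(rows[0][1])
```

Because the header is split off first, rows[0] is the first data row of the table.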

The tsv file containing links/entities in the table has a simple format. Each row is <row_index>\t<col_index>\t(<link>|(<entity_id>(\t<entity_id>)*)), where:

  • row_index and col_index start from 0
  • row_index does not count the header of the table (i.e., skip the first row of infile if it's the header)
  • (<link>|(<entity_id>(\t<entity_id>)*)) is either:
    • <link>: a JSON string that encodes a Link object and can be deserialized using the Link.from_dict function
    • or (<entity_id>(\t<entity_id>)*): a list of entity ids joined by the tab character (\t). Each entity id can be a Wikidata qnode id (e.g., Q414) or a full qnode URI (e.g., "http://www.wikidata.org/entity/Q414"). The first entity is considered the correct entity of the cell, and the rest are considered candidate entities of the cell

Note that a pair (<row_index>, <col_index>) does not have to be unique, as each cell in a table can have more than one link.
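
To make the row format concrete, here is a minimal parsing sketch that uses only the standard library. It is an illustration under stated assumptions, not the parser GRAMS uses: the function names and the payload keys (link, correct_entity, candidate_entities) are hypothetical, and it assumes a serialized Link is a JSON object, so a third field starting with { is treated as JSON. Apart from Q414, the entity ids in the usage example are arbitrary placeholders.

```python
import json
from collections import defaultdict

WIKIDATA_ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def normalize_entity_id(value):
    """Turn a full qnode URI into a bare qnode id; leave bare ids untouched."""
    value = value.strip()
    if value.startswith(WIKIDATA_ENTITY_PREFIX):
        return value[len(WIKIDATA_ENTITY_PREFIX):]
    return value


def parse_link_line(line):
    """Parse one row of the link file into (row_index, col_index, payload)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 tab-separated fields: {line!r}")
    row_index, col_index, rest = int(fields[0]), int(fields[1]), fields[2:]
    if rest[0].lstrip().startswith("{"):
        # Assumption: a serialized Link is a JSON object. GRAMS would pass the
        # decoded dict to Link.from_dict; this sketch keeps the raw dict.
        payload = {"link": json.loads("\t".join(rest))}
    else:
        entity_ids = [normalize_entity_id(v) for v in rest]
        payload = {
            "correct_entity": entity_ids[0],          # first id: correct entity
            "candidate_entities": entity_ids[1:],     # remaining ids: candidates
        }
    return row_index, col_index, payload


def parse_link_lines(lines):
    """Group parsed rows by cell; a cell may appear in more than one row."""
    links = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row_index, col_index, payload = parse_link_line(line)
        links[(row_index, col_index)].append(payload)
    return dict(links)


if __name__ == "__main__":
    sample = [
        "0\t1\tQ414\thttp://www.wikidata.org/entity/Q30\n",
        "0\t1\tQ16\n",
    ]
    for cell, payloads in parse_link_lines(sample).items():
        print(cell, payloads)
```

Grouping by (row_index, col_index) keeps every link of a cell, which matches the note that the pair need not be unique.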

Output format

The JSON output file contains a list of semantic_models. Currently, the list has only one item. Each item is a serialized graph of a semantic model following the Python class definition (see here). In particular, it is {"nodes": List<Node>, "edges": List<Edge>}, where the basic properties of an edge are source and target (the ids of the source and target nodes) and abs_uri (the URI of the ontology predicate), and the basic properties of a node are id and label. A column node also has col_index, and a class node also has abs_uri, denoting the URI of the ontology class.
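
The following sketch shows one way to walk a single serialized semantic model of this shape. It works on one {"nodes": ..., "edges": ...} object and makes no assumption about how the list is wrapped in the file. The function names are hypothetical, and the sample model (node ids, labels, and URIs) is made up purely for illustration; it is not real GRAMS output.

```python
def index_nodes(model):
    """Map each node id to its node dictionary."""
    return {node["id"]: node for node in model["nodes"]}


def describe_edges(model):
    """Return (source label, predicate URI, target label) for every edge."""
    nodes = index_nodes(model)
    return [
        (nodes[edge["source"]]["label"], edge["abs_uri"], nodes[edge["target"]]["label"])
        for edge in model["edges"]
    ]


def column_nodes(model):
    """Select the nodes that carry a col_index, i.e. the column nodes."""
    return [node for node in model["nodes"] if "col_index" in node]


if __name__ == "__main__":
    # Made-up example following the described schema; all values are hypothetical.
    sample_model = {
        "nodes": [
            {"id": "class-0", "label": "human", "abs_uri": "http://www.wikidata.org/entity/Q5"},
            {"id": "col-0", "label": "name", "col_index": 0},
        ],
        "edges": [
            {
                "source": "class-0",
                "target": "col-0",
                "abs_uri": "http://www.w3.org/2000/01/rdf-schema#label",
            },
        ],
    }
    for source, predicate, target in describe_edges(sample_model):
        print(f"{source} --[{predicate}]--> {target}")
    print([node["col_index"] for node in column_nodes(sample_model)])
```

Indexing the nodes by id first keeps the edge lookup simple, since edges refer to nodes only through their ids.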

API

GRAMS can also be used as a library. The starting point is the class GRAMS in main.py. The class takes three parameters, data_dir, cfg, and proxy, which are explained above in the CLI section. To annotate a table, we only need to pass it to the function annotate. The function performs two steps, as illustrated in the image below. You can find the code that constructs the candidate graph (sometimes also called the semantic graph) in two modules: grams.algorithm.data_graph and grams.algorithm.semantic_graph. The code that implements the last step can be found in grams.algorithm.psl_solver.

approach.

Demo

We created two Jupyter notebooks, demo and semtab-medicine, that demonstrate how to use GRAMS's API. The former shows how to fetch a table from a Wikipedia page and run GRAMS to annotate it. The latter shows how to load tables from disk and annotate them with the help of GRAMS.

Below is a screenshot of the second half of the demo notebook. Because the notebook uses JavaScript to support data curation, opening it on GitHub does not show the rendered results; the output appears as just <IPython.core.display.Javascript object>.

screenshot of the demo notebook
