
Merge pull request #69 from usc-isi-i2/dev
merging dev to master for a release
saggu authored Jun 12, 2020
2 parents a569138 + 0c4697e commit e694425
Showing 113 changed files with 13,205 additions and 1,847 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -136,3 +136,6 @@ dmypy.json

# MacOS file system hidden file
.DS_Store

# warning log exception
!corrupted_warning.log
8 changes: 8 additions & 0 deletions Makefile
@@ -7,3 +7,11 @@ lint:

requirements:
pip install -r requirements.txt
pip install -r requirements-dev.txt

mkdocs:
mkdocs build --clean
mkdocs serve -a localhost:8080

precommit:
tox
17 changes: 15 additions & 2 deletions README.md
@@ -1,4 +1,6 @@
# KGTK: Knowledge Graph Toolkit [![doi](https://zenodo.org/badge/DOI/10.5281/zenodo.3828068.svg)](https://doi.org/10.5281/zenodo.3828068)
# KGTK: Knowledge Graph Toolkit

[![doi](https://zenodo.org/badge/DOI/10.5281/zenodo.3828068.svg)](https://doi.org/10.5281/zenodo.3828068) ![travis ci](https://travis-ci.org/usc-isi-i2/kgtk.svg?branch=dev)

KGTK is a Python library for easy manipulation of knowledge graphs. It provides a flexible framework that allows chaining of common graph operations, such as: extraction of subgraphs, filtering, computation of graph metrics, validation, cleaning, generating embeddings, and so on. Its principal format is TSV, though we do support a number of other inputs.
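As an illustration of the TSV edge format (a minimal sketch; the `node1`/`label`/`node2` column names follow the examples later in these docs), such a file can be read with plain Python:

```python
import csv
import io

# A minimal KGTK-style edge file: tab-separated with a header row.
edge_tsv = "node1\tlabel\tnode2\njohn\tzipcode\t12345\npeter\tzipcode\t12040\n"

# Each row becomes a dict keyed by the header columns.
rows = list(csv.DictReader(io.StringIO(edge_tsv), delimiter="\t"))
print(rows[0]["node1"], rows[0]["label"], rows[0]["node2"])  # john zipcode 12345
```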

@@ -23,7 +25,15 @@ https://kgtk.readthedocs.io/en/latest/

* [Source code](https://github.com/usc-isi-i2/kgtk/releases)

## Installation
## Installation through Docker

```
docker pull uscisii2/kgtk:0.2.0
```

More information about versions and tags is available here: https://hub.docker.com/repository/docker/uscisii2/kgtk

## Local installation

0. Our installations will be in a conda environment. If you don't have conda installed, follow this [link](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) to install it.
1. Set up your own conda environment:
@@ -48,6 +58,9 @@ You can test if `kgtk` is installed properly now with: `kgtk -h`.

More installation options for `mlr` can be found [here](https://johnkerl.org/miller/doc/build.html).

If you can't install miller with `yum install` on CentOS, please follow this [link](https://centos.pkgs.org/7/openfusion-x86_64/miller-5.3.0-1.of.el7.x86_64.rpm.html).


## Running KGTK commands

To list all the available KGTK commands, run:
21 changes: 21 additions & 0 deletions docker/Dockerfile
@@ -0,0 +1,21 @@
FROM continuumio/miniconda3

RUN apt-get update && apt-get install -y \
libxdamage-dev \
libxcomposite-dev \
libxcursor1 \
libxfixes3 \
libgconf-2-4 \
libxi6 \
libxrandr-dev \
libxinerama-dev \
gcc \
miller

RUN pip install thinc==7.4.0

RUN pip install kgtk

RUN conda update -n base -c defaults conda

RUN conda install -c conda-forge graph-tool
29 changes: 29 additions & 0 deletions docker/readme.md
@@ -0,0 +1,29 @@
## KGTK as a Docker image

How to build this docker image:

```
docker build -t uscisii2/kgtk:0.2.0 .
```

How to run this docker image (from DockerHub):

```
docker run -it uscisii2/kgtk:0.2.0 /bin/bash
```

This will log you into the image and let you operate with KGTK. Once you have executed the step above, just type:

```
kgtk -h
```

to see the KGTK help command.

## Next features:
We will include the following features in the next releases of KGTK:

- Examples on how to load volumes with your data.

- How to launch a Jupyter notebook to operate with KGTK in your browser.

File renamed without changes.
File renamed without changes.
File renamed without changes.
104 changes: 104 additions & 0 deletions docs/analysis/stats.md
@@ -0,0 +1,104 @@
Given a KGTK edge file, this command can compute centrality metrics and connectivity statistics. The set of metrics to compute are specified by the user.

The statistics for individual nodes are printed as edges to stdout. The summary statistics over the entire graph can be written to a summary file.

## Usage
```
kgtk graph_statistics [-h] [--directed] [--degrees] [--pagerank]
[--hits] [--summary LOG_FILE] [--statistics-only]
[--vertex-in-degree-property VERTEX_IN_DEGREE]
[--vertex-out-degree-property VERTEX_OUT_DEGREE]
[--page-rank-property VERTEX_PAGERANK]
[--vertex-hits-authority-property VERTEX_AUTH]
[--vertex-hits-hubs-property VERTEX_HUBS]
filename
```

positional arguments:
```
filename filename here
```

optional arguments:
```
-h, --help show this help message and exit
--directed Is the graph directed or not?
--degrees Whether or not to compute degree distribution.
  --pagerank            Whether or not to compute PageRank centrality.
  --hits                Whether or not to compute HITS centrality.
--summary LOG_FILE Summary file for the global statistics of the graph.
--statistics-only If this flag is set, output only the statistics edges.
Else, append the statistics to the original graph.
--vertex-in-degree-property VERTEX_IN_DEGREE
Label for edge: vertex in degree property
--vertex-out-degree-property VERTEX_OUT_DEGREE
Label for edge: vertex out degree property
  --page-rank-property VERTEX_PAGERANK
                        Label for page rank property
  --vertex-hits-authority-property VERTEX_AUTH
                        Label for edge: vertex hits authority
--vertex-hits-hubs-property VERTEX_HUBS
Label for edge: vertex hits hubs
```

## Examples

Given this file `input.tsv`:

| node1 | label | node2 |
| -- | -- | -- |
| john | zipcode | 12345 |
| john | zipcode | 12346 |
| peter | zipcode | 12040 |
| peter | zipcode | 12040 |
| steve | zipcode | 45601 |
| steve | zipcode | 45601 |

We can use the following command to compute degree and PageRank statistics over the graph:

```
kgtk graph_statistics --directed --summary summary.txt --pagerank --statistics-only input.tsv
```

The output (printed to stdout) is as follows:

| node1 | label | node2 | id |
| -- | -- | -- | -- |
| john | vertex_in_degree | 0 | john-vertex_in_degree-0 |
| john | vertex_out_degree | 2 | john-vertex_out_degree-1 |
| john | vertex_pagerank | 0.10471144347252878 | john-vertex_pagerank-2 |
| 12345 | vertex_in_degree | 1 | 12345-vertex_in_degree-3 |
| 12345 | vertex_out_degree | 0 | 12345-vertex_out_degree-4 |
| 12345 | vertex_pagerank | 0.14921376206743192 | 12345-vertex_pagerank-5 |
| 12346 | vertex_in_degree | 1 | 12346-vertex_in_degree-6 |
| 12346 | vertex_out_degree | 0 | 12346-vertex_out_degree-7 |
| 12346 | vertex_pagerank | 0.14921376206743192 | 12346-vertex_pagerank-8 |
| peter | vertex_in_degree | 0 | peter-vertex_in_degree-9 |
| peter | vertex_out_degree | 2 | peter-vertex_out_degree-10 |
| peter | vertex_pagerank | 0.10471144347252878 | peter-vertex_pagerank-11 |
| 12040 | vertex_in_degree | 2 | 12040-vertex_in_degree-12 |
| 12040 | vertex_out_degree | 0 | 12040-vertex_out_degree-13 |
| 12040 | vertex_pagerank | 0.1937160806623351 | 12040-vertex_pagerank-14 |
| steve | vertex_in_degree | 0 | steve-vertex_in_degree-15 |
| steve | vertex_out_degree | 2 | steve-vertex_out_degree-16 |
| steve | vertex_pagerank | 0.10471144347252878 | steve-vertex_pagerank-17 |
| 45601 | vertex_in_degree | 2 | 45601-vertex_in_degree-18 |
| 45601 | vertex_out_degree | 0 | 45601-vertex_out_degree-19 |
| 45601 | vertex_pagerank | 0.1937160806623351 | 45601-vertex_pagerank-20 |

Note that the statistics are printed as edges. Also, the original graph is not printed because we set the `--statistics-only` flag. We have also stored a summary of our metrics in `summary.txt`, which looks like this:

```
graph loaded! It has 7 nodes and 6 edges
###Top relations:
zipcode 6
###PageRank
Max pageranks
5 steve 0.104711
1 12345 0.149214
4 12040 0.193716
2 12346 0.149214
6 45601 0.193716
```
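The PageRank numbers in the tables above can be reproduced to several decimal places with a short power-iteration sketch. The damping factor 0.85 and the final normalization so the ranks sum to 1 are assumptions inferred from the output, not taken from the KGTK source:

```python
# Power-iteration PageRank over the example graph, counting parallel edges.
edges = [("john", "12345"), ("john", "12346"),
         ("peter", "12040"), ("peter", "12040"),
         ("steve", "45601"), ("steve", "45601")]
nodes = sorted({n for e in edges for n in e})
out_deg = {n: 0 for n in nodes}
for src, _ in edges:
    out_deg[src] += 1

d, n = 0.85, len(nodes)
pr = {v: 1.0 / n for v in nodes}
for _ in range(50):  # converges quickly on this small graph
    new = {v: (1 - d) / n for v in nodes}
    for src, dst in edges:
        new[dst] += d * pr[src] / out_deg[src]
    pr = new

total = sum(pr.values())                    # normalize so the ranks sum to 1,
pr = {v: p / total for v, p in pr.items()}  # matching the table above
print(round(pr["john"], 5), round(pr["12040"], 5))
```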
@@ -1,16 +1,41 @@
# KGTK Text Embedding Utilities

## Install
The program requires Python version >= `3` and the `kgtk` package installed.
The corresponding package requirements are stored in `text_embedding_requirement.txt`.

## Assumptions
The input is an edge file sorted by subject.

## Usage
```
kgtk text_embedding OPTIONS
```
Computes embeddings of nodes using node properties. The values are concatenated into sentences defined by a template, and embedded using a pre-trained language model.

The output is an edge file where each node appears once; a user-defined property is used to store the embedding, and the value is a string containing the embedding. For example:

To generate the embeddings, the command first generates a sentence for each node using the properties listed in the label-properties, description-properties, isa-properties and has-properties options. Each sentence is generated using the following template:

```
{label-properties}, {description-properties} is a {isa-properties}, and has {has-properties}
```

An example sentence is “Saint David, patron saint of Wales is a human, Catholic priest, Catholic bishop, and has date of death, religion and canonization status”
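A sketch of how such a sentence might be assembled from the property values (the exact joining rules in KGTK may differ; this simply reproduces the example above):

```python
def build_sentence(label, description, isa, has):
    """Assemble a template sentence from property value lists (a sketch)."""
    def join_with_and(values):
        # "a, b and c": comma-separated with "and" before the last item
        if len(values) == 1:
            return values[0]
        return ", ".join(values[:-1]) + " and " + values[-1]

    return "{}, {} is a {}, and has {}".format(
        label, description, ", ".join(isa), join_with_and(has))

s = build_sentence(
    "Saint David", "patron saint of Wales",
    ["human", "Catholic priest", "Catholic bishop"],
    ["date of death", "religion", "canonization status"])
print(s)
```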

```
subject predicate object
Q1 text_embedding “0.222, 0.333, ..”
Q2 text_embedding “0.444, 0.555, ..”
```
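One way to serialize such an output row (the three-column layout follows the example above; the rounding and quoting choices are assumptions for illustration):

```python
def embedding_edge(qnode, vector, prop="text_embedding"):
    # Store the vector as a comma-separated string in the object column.
    value = ", ".join("{:.3f}".format(x) for x in vector)
    return '{}\t{}\t"{}"'.format(qnode, prop, value)

row = embedding_edge("Q1", [0.222, 0.333])
print(row)  # Q1	text_embedding	"0.222, 0.333"
```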

### Run
You can call the command directly with arguments as follows:
```
kgtk text_embedding \
--input/ -i <string> \ # * required, path to the file
--format/ -f <string> \ # optional, default is `kgtk_format`
--model/ -m <list_of_string> \ # optional, default is `bert-base-wikipedia-sections-mean-tokens`
<string> \ # * required, path to the file
--format / -f <string> \ # optional, default is `kgtk_format`
--model / -m <list_of_string> \ # optional, default is `bert-base-wikipedia-sections-mean-tokens`
--label-properties <list_of_string> \ # optional, default is ["label"]
--description-properties <list_of_string> \ # optional, default is ["description"]
--isa-properties <list_of_string> \ # optional, default is ["P31"]
@@ -20,21 +45,21 @@ kgtk text_embedding \
--output-property <string> \ # optional, default is "text_embedding"
--embedding-projector-metatada <list_of_string> \ # optional
--embedding-projector-path/ -o <string> # optional, default is the home directory of current user
--black-list/ -b <string> # optional,default is None
--logging-level/ -l <string> \ # optional, default is `info`
--black-list / -b <string> # optional,default is None
--logging-level / -l <string> \ # optional, default is `info`
--dimensional-reduction pca \ # optional, default is none
--dimension 5 \ #optional, default is 2
--parallel 4 # optional, default is 1
--save-embedding-sentence # optional
```
##### Example 1:
For easiest running, just give the input file as
`kgtk text_embedding -i input_file.csv`
For the easiest run, just give the input file and let it write the output to `output_embeddings.csv` in the current folder:
`kgtk text_embedding < input_file.csv > output_embeddings.csv`
##### Example 2:
Running with more specific parameters and then running TSNE to reduce the output dimension:
```
kgtk text_embedding --debug \
--input test_edges_file.tsv \
test_edges_file.tsv \
--model bert-base-wikipedia-sections-mean-tokens bert-base-nli-cls-token \
--label-properties P1449 P1559 \
--description-properties P94 \
@@ -44,16 +69,14 @@ kgtk text_embedding --debug \
Running with test format input and tsv output (for visualization with the Google Embedding Projector):
```
kgtk text_embedding \
--countries_candidates.csv \
countries_candidates.csv \
--model bert-base-wikipedia-sections-mean-tokens bert-base-nli-cls-token \
--black-list all_instances_of_Q732577.tsv.zip \
--output-format tsv_format
```

#### --input / -i (input files)
The path to the input file(s). If multiple file given, please separate each with a white space ` `.

For example: `input_file1.csv input_file2.csv`
#### (input files)
The path to the input file, for example `input_file1.csv`. The input can also be piped in, as in `< input_file1.csv`.

#### --format/ -f (input format)
The input file should be a CSV file; it supports 2 different types of input for different purposes.
@@ -159,6 +182,7 @@ User can specify where to store the metadata file for the vectors. If not given,

##### Embedding Vectors
This will have all the embedded vector values for each Q node. This will be printed to stdout and can be redirected to a file.
Note: only text-embedding-related edges are output; please use other commands to obtain the remaining edges of the graph.

If output as `kgtk_format`, the output file will look like:
```
@@ -187,7 +211,7 @@ This will have embedded vectors values after running dimensional reduction algor

#### Query / cache related
##### --query-server
You can change the query wikidata server address when the input format is `test_format`. The default is to use wikidata official query server, but it has limit on query time and frequency. Alternatively, you can choose to use dsbox02's one as `https://dsbox02.isi.edu:8888/bigdata/namespace/wdq/sparql` (vpn needed).
You can change the query wikidata server address when the input format is `test_format`. The default is to use wikidata official query server, but it has limit on query time and frequency. Alternatively, you can choose to use dsbox02's one as `https://dsbox02.isi.edu:8888/bigdata/namespace/wdq/sparql` (vpn needed, only for ISI users).

##### --use-cache
If set to true, the system will try to fetch cached results for embedding computations. The default value is False (no cache). The cache service is a Redis server.
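A cache lookup needs a stable key per (sentence, model) pair; one possible key scheme (an assumption for illustration, not KGTK's actual implementation) hashes both together:

```python
import hashlib

def embedding_cache_key(sentence, model_name):
    """Derive a stable cache key for a (sentence, model) pair -- a sketch."""
    # Separate the two parts with NUL so distinct pairs cannot collide
    # by concatenation, then hash for a fixed-length key.
    digest = hashlib.sha256(
        (model_name + "\x00" + sentence).encode("utf-8")).hexdigest()
    return "kgtk:embedding:" + digest

key = embedding_cache_key("Saint David, patron saint of Wales ...",
                          "bert-base-wikipedia-sections-mean-tokens")
print(key[:15])  # kgtk:embedding:
```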
75 changes: 0 additions & 75 deletions docs/cat.md

This file was deleted.

