
Merge pull request #69 from usc-isi-i2/dev
merging dev to master for a release
saggu authored Jun 12, 2020
2 parents a569138 + 0c4697e commit e694425
Showing 113 changed files with 13,205 additions and 1,847 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -136,3 +136,6 @@ dmypy.json

# MacOS file system hidden file
.DS_Store

# warning log exception
!corrupted_warning.log
8 changes: 8 additions & 0 deletions Makefile
@@ -7,3 +7,11 @@ lint:

requirements:
pip install -r requirements.txt
pip install -r requirements-dev.txt

mkdocs:
mkdocs build --clean
mkdocs serve -a localhost:8080

precommit:
tox
17 changes: 15 additions & 2 deletions README.md
@@ -1,4 +1,6 @@
# KGTK: Knowledge Graph Toolkit [![doi](https://zenodo.org/badge/DOI/10.5281/zenodo.3828068.svg)](https://doi.org/10.5281/zenodo.3828068)
# KGTK: Knowledge Graph Toolkit

[![doi](https://zenodo.org/badge/DOI/10.5281/zenodo.3828068.svg)](https://doi.org/10.5281/zenodo.3828068) ![travis ci](https://travis-ci.org/usc-isi-i2/kgtk.svg?branch=dev)

KGTK is a Python library for easy manipulation of knowledge graphs. It provides a flexible framework that allows chaining of common graph operations, such as: extraction of subgraphs, filtering, computation of graph metrics, validation, cleaning, generating embeddings, and so on. Its principal format is TSV, though we do support a number of other inputs.
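As an illustration of the TSV edge format (a minimal sketch; the `node1`/`label`/`node2` column names follow the examples later in these docs), such a file can be read with plain Python:

```python
import csv
import io

# A minimal KGTK-style edge file: tab-separated with a header row.
edge_tsv = "node1\tlabel\tnode2\njohn\tzipcode\t12345\npeter\tzipcode\t12040\n"

# Each row becomes a dict keyed by the header columns.
rows = list(csv.DictReader(io.StringIO(edge_tsv), delimiter="\t"))
print(rows[0]["node1"], rows[0]["label"], rows[0]["node2"])  # john zipcode 12345
```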

@@ -23,7 +25,15 @@ https://kgtk.readthedocs.io/en/latest/

* [Source code](https://github.com/usc-isi-i2/kgtk/releases)

## Installation
## Installation through Docker

```
docker pull uscisii2/kgtk:0.2.0
```

More information about versions and tags is available here: https://hub.docker.com/repository/docker/uscisii2/kgtk

## Local installation

0. Our installations will be in a conda environment. If you don't have conda installed, follow this [link](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) to install it.
1. Set up your own conda environment:
@@ -48,6 +58,9 @@ You can test if `kgtk` is installed properly now with: `kgtk -h`.

More installation options for `mlr` can be found [here](https://johnkerl.org/miller/doc/build.html).

If you can't install miller with `yum install` on CentOS, please follow this [link](https://centos.pkgs.org/7/openfusion-x86_64/miller-5.3.0-1.of.el7.x86_64.rpm.html).


## Running KGTK commands

To list all the available KGTK commands, run:
21 changes: 21 additions & 0 deletions docker/Dockerfile
@@ -0,0 +1,21 @@
FROM continuumio/miniconda3

RUN apt-get update && apt-get install -y \
libxdamage-dev \
libxcomposite-dev \
libxcursor1 \
libxfixes3 \
libgconf-2-4 \
libxi6 \
libxrandr-dev \
libxinerama-dev \
gcc \
miller

RUN pip install thinc==7.4.0

RUN pip install kgtk

RUN conda update -n base -c defaults conda

RUN conda install -c conda-forge graph-tool
29 changes: 29 additions & 0 deletions docker/readme.md
@@ -0,0 +1,29 @@
## KGTK as a Docker image

How to build this docker image:

```
docker build -t uscisii2/kgtk:0.2.0 .
```

How to run this docker image (from DockerHub):

```
docker run -it uscisii2/kgtk:0.2.0 /bin/bash
```

This will log you into the image and let you operate with KGTK. Once you have executed the step above, just type:

```
kgtk -h
```

to see the KGTK help command.

## Next features:
We will include the following features in the next releases of KGTK:

- Examples on how to load volumes with your data.

- How to launch a Jupyter notebook to operate with KGTK in your browser.

File renamed without changes.
File renamed without changes.
File renamed without changes.
104 changes: 104 additions & 0 deletions docs/analysis/stats.md
@@ -0,0 +1,104 @@
Given a KGTK edge file, this command can compute centrality metrics and connectivity statistics. The set of metrics to compute are specified by the user.

The statistics for individual nodes are printed as edges to stdout. The summary statistics over the entire graph can be written to a summary file.

## Usage
```
kgtk graph_statistics [-h] [--directed] [--degrees] [--pagerank]
[--hits] [--summary LOG_FILE] [--statistics-only]
[--vertex-in-degree-property VERTEX_IN_DEGREE]
[--vertex-out-degree-property VERTEX_OUT_DEGREE]
[--page-rank-property VERTEX_PAGERANK]
[--vertex-hits-authority-property VERTEX_AUTH]
[--vertex-hits-hubs-property VERTEX_HUBS]
filename
```

positional arguments:
```
filename filename here
```

optional arguments:
```
-h, --help show this help message and exit
--directed Is the graph directed or not?
--degrees Whether or not to compute degree distribution.
  --pagerank            Whether or not to compute PageRank centrality.
  --hits                Whether or not to compute HITS centrality.
--summary LOG_FILE Summary file for the global statistics of the graph.
--statistics-only If this flag is set, output only the statistics edges.
Else, append the statistics to the original graph.
--vertex-in-degree-property VERTEX_IN_DEGREE
Label for edge: vertex in degree property
--vertex-out-degree-property VERTEX_OUT_DEGREE
Label for edge: vertex out degree property
  --page-rank-property VERTEX_PAGERANK
                        Label for page rank property
  --vertex-hits-authority-property VERTEX_AUTH
                        Label for edge: vertex hits authority
--vertex-hits-hubs-property VERTEX_HUBS
Label for edge: vertex hits hubs
```

## Examples

Given this file `input.tsv`:

| node1 | label | node2 |
| -- | -- | -- |
| john | zipcode | 12345 |
| john | zipcode | 12346 |
| peter | zipcode | 12040 |
| peter | zipcode | 12040 |
| steve | zipcode | 45601 |
| steve | zipcode | 45601 |

We can use the following command to compute degree and PageRank statistics over the graph:

```
kgtk graph_statistics --directed --summary summary.txt --pagerank --statistics-only input.tsv
```

The output (printed to stdout) is as follows:

| node1 | label | node2 | id |
| -- | -- | -- | -- |
| john | vertex_in_degree | 0 | john-vertex_in_degree-0 |
| john | vertex_out_degree | 2 | john-vertex_out_degree-1 |
| john | vertex_pagerank | 0.10471144347252878 | john-vertex_pagerank-2 |
| 12345 | vertex_in_degree | 1 | 12345-vertex_in_degree-3 |
| 12345 | vertex_out_degree | 0 | 12345-vertex_out_degree-4 |
| 12345 | vertex_pagerank | 0.14921376206743192 | 12345-vertex_pagerank-5 |
| 12346 | vertex_in_degree | 1 | 12346-vertex_in_degree-6 |
| 12346 | vertex_out_degree | 0 | 12346-vertex_out_degree-7 |
| 12346 | vertex_pagerank | 0.14921376206743192 | 12346-vertex_pagerank-8 |
| peter | vertex_in_degree | 0 | peter-vertex_in_degree-9 |
| peter | vertex_out_degree | 2 | peter-vertex_out_degree-10 |
| peter | vertex_pagerank | 0.10471144347252878 | peter-vertex_pagerank-11 |
| 12040 | vertex_in_degree | 2 | 12040-vertex_in_degree-12 |
| 12040 | vertex_out_degree | 0 | 12040-vertex_out_degree-13 |
| 12040 | vertex_pagerank | 0.1937160806623351 | 12040-vertex_pagerank-14 |
| steve | vertex_in_degree | 0 | steve-vertex_in_degree-15 |
| steve | vertex_out_degree | 2 | steve-vertex_out_degree-16 |
| steve | vertex_pagerank | 0.10471144347252878 | steve-vertex_pagerank-17 |
| 45601 | vertex_in_degree | 2 | 45601-vertex_in_degree-18 |
| 45601 | vertex_out_degree | 0 | 45601-vertex_out_degree-19 |
| 45601 | vertex_pagerank | 0.1937160806623351 | 45601-vertex_pagerank-20 |

Note that the statistics are printed as edges. Also, the original graph is not printed because we set the `--statistics-only` flag. We have also stored a summary of our metrics in `summary.txt`, which looks like this:

```
graph loaded! It has 7 nodes and 6 edges
###Top relations:
zipcode 6
###PageRank
Max pageranks
5 steve 0.104711
1 12345 0.149214
4 12040 0.193716
2 12346 0.149214
6 45601 0.193716
```
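The PageRank numbers in the tables above can be reproduced to several decimal places with a short power-iteration sketch. The damping factor 0.85 and the final normalization so the ranks sum to 1 are assumptions inferred from the output, not taken from the KGTK source:

```python
# Power-iteration PageRank over the example graph, counting parallel edges.
edges = [("john", "12345"), ("john", "12346"),
         ("peter", "12040"), ("peter", "12040"),
         ("steve", "45601"), ("steve", "45601")]
nodes = sorted({n for e in edges for n in e})
out_deg = {n: 0 for n in nodes}
for src, _ in edges:
    out_deg[src] += 1

d, n = 0.85, len(nodes)
pr = {v: 1.0 / n for v in nodes}
for _ in range(50):  # converges quickly on this small graph
    new = {v: (1 - d) / n for v in nodes}
    for src, dst in edges:
        new[dst] += d * pr[src] / out_deg[src]
    pr = new

total = sum(pr.values())                    # normalize so the ranks sum to 1,
pr = {v: p / total for v, p in pr.items()}  # matching the table above
print(round(pr["john"], 5), round(pr["12040"], 5))
```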
@@ -1,16 +1,41 @@
# KGTK Text Embedding Utilities

## Install
The program requires Python version >= `3` and the `kgtk` package installed.
The corresponding package requirements are stored in `text_embedding_requirement.txt`.

## Assumptions
The input is an edge file sorted by subject.

## Usage
```
kgtk text_embedding OPTIONS
```
Computes embeddings of nodes using node properties. The values are concatenated into sentences defined by a template, and embedded using a pre-trained language model.

The output is an edge file where each node appears once; a user-defined property is used to store the embedding, and the value is a string containing the embedding. For example:

To generate the embeddings, the command first generates a sentence for each node using the properties listed in the label-properties, description-properties, isa-properties and has-properties options. Each sentence is generated using the following template:

```
{label-properties}, {description-properties} is a {isa-properties}, and has {has-properties}
```

An example sentence is “Saint David, patron saint of Wales is a human, Catholic priest, Catholic bishop, and has date of death, religion and canonization status”
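A sketch of how such a sentence might be assembled from the property values (the exact joining rules in KGTK may differ; this simply reproduces the example above):

```python
def build_sentence(label, description, isa, has):
    """Assemble a template sentence from property value lists (a sketch)."""
    def join_with_and(values):
        # "a, b and c": comma-separated with "and" before the last item
        if len(values) == 1:
            return values[0]
        return ", ".join(values[:-1]) + " and " + values[-1]

    return "{}, {} is a {}, and has {}".format(
        label, description, ", ".join(isa), join_with_and(has))

s = build_sentence(
    "Saint David", "patron saint of Wales",
    ["human", "Catholic priest", "Catholic bishop"],
    ["date of death", "religion", "canonization status"])
print(s)
```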

```
subject predicate object
Q1 text_embedding “0.222, 0.333, ..”
Q2 text_embedding “0.444, 0.555, ..”
```
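One way to serialize such an output row (the three-column layout follows the example above; the rounding and quoting choices are assumptions for illustration):

```python
def embedding_edge(qnode, vector, prop="text_embedding"):
    # Store the vector as a comma-separated string in the object column.
    value = ", ".join("{:.3f}".format(x) for x in vector)
    return '{}\t{}\t"{}"'.format(qnode, prop, value)

row = embedding_edge("Q1", [0.222, 0.333])
print(row)  # Q1	text_embedding	"0.222, 0.333"
```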

### Run
You can call the command directly with arguments as follows:
```
kgtk text_embedding \
--input/ -i <string> \ # * required, path to the file
--format/ -f <string> \ # optional, default is `kgtk_format`
--model/ -m <list_of_string> \ # optional, default is `bert-base-wikipedia-sections-mean-tokens`
<string> \ # * required, path to the file
--format / -f <string> \ # optional, default is `kgtk_format`
--model / -m <list_of_string> \ # optional, default is `bert-base-wikipedia-sections-mean-tokens`
--label-properties <list_of_string> \ # optional, default is ["label"]
--description-properties <list_of_string> \ # optional, default is ["description"]
--isa-properties <list_of_string> \ # optional, default is ["P31"]
@@ -20,21 +45,21 @@ kgtk text_embedding \
--output-property <string> \ # optional, default is "text_embedding"
--embedding-projector-metatada <list_of_string> \ # optional
--embedding-projector-path/ -o <string> # optional, default is the home directory of current user
--black-list/ -b <string> # optional,default is None
--logging-level/ -l <string> \ # optional, default is `info`
--black-list / -b <string> # optional,default is None
--logging-level / -l <string> \ # optional, default is `info`
--dimensional-reduction pca \ # optional, default is none
--dimension 5 \ #optional, default is 2
--parallel 4 # optional, default is 1
--save-embedding-sentence # optional
```
##### Example 1:
For easiest running, just give the input file as
`kgtk text_embedding -i input_file.csv`
For the easiest run, just give the input file and let it write the output to `output_embeddings.csv` in the current folder:
`kgtk text_embedding < input_file.csv > output_embeddings.csv`
##### Example 2:
Running with more specific parameters and then running TSNE to reduce the output dimension:
```
kgtk text_embedding --debug \
--input test_edges_file.tsv \
test_edges_file.tsv \
--model bert-base-wikipedia-sections-mean-tokens bert-base-nli-cls-token \
--label-properties P1449 P1559 \
--description-properties P94 \
@@ -44,16 +69,14 @@ kgtk text_embedding --debug \
Running with test format input and tsv output (for visualization with the Google Embedding Projector):
```
kgtk text_embedding \
--countries_candidates.csv \
countries_candidates.csv \
--model bert-base-wikipedia-sections-mean-tokens bert-base-nli-cls-token \
--black-list all_instances_of_Q732577.tsv.zip \
--output-format tsv_format
```

#### --input / -i (input files)
The path to the input file(s). If multiple file given, please separate each with a white space ` `.

For example: `input_file1.csv input_file2.csv`
#### (input files)
The path to the input file, for example `input_file1.csv`. The input can also be piped in, as in `< input_file1.csv`.

#### --format/ -f (input format)
The input file should be a CSV file; it supports 2 different types of input for different purposes.
@@ -159,6 +182,7 @@ User can specify where to store the metadata file for the vectors. If not given,

##### Embedding Vectors
This will have all the embedded vector values for each Q node. This will be printed to stdout and can be redirected to a file.
Note: only text-embedding-related edges are output; please use other commands to obtain the remaining edges of the graph.

If output as `kgtk_format`, the output file will look like:
```
@@ -187,7 +211,7 @@ This will have embedded vectors values after running dimensional reduction algor

#### Query / cache related
##### --query-server
You can change the query wikidata server address when the input format is `test_format`. The default is to use wikidata official query server, but it has limit on query time and frequency. Alternatively, you can choose to use dsbox02's one as `https://dsbox02.isi.edu:8888/bigdata/namespace/wdq/sparql` (vpn needed).
You can change the query wikidata server address when the input format is `test_format`. The default is to use wikidata official query server, but it has limit on query time and frequency. Alternatively, you can choose to use dsbox02's one as `https://dsbox02.isi.edu:8888/bigdata/namespace/wdq/sparql` (vpn needed, only for ISI users).

##### --use-cache
If set to true, the system will try to fetch cached results for embedding computations. The default value is False (no cache). The cache service is a Redis server.
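A cache lookup needs a stable key per (sentence, model) pair; one possible key scheme (an assumption for illustration, not KGTK's actual implementation) hashes both together:

```python
import hashlib

def embedding_cache_key(sentence, model_name):
    """Derive a stable cache key for a (sentence, model) pair -- a sketch."""
    # Separate the two parts with NUL so distinct pairs cannot collide
    # by concatenation, then hash for a fixed-length key.
    digest = hashlib.sha256(
        (model_name + "\x00" + sentence).encode("utf-8")).hexdigest()
    return "kgtk:embedding:" + digest

key = embedding_cache_key("Saint David, patron saint of Wales ...",
                          "bert-base-wikipedia-sections-mean-tokens")
print(key[:15])  # kgtk:embedding:
```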
75 changes: 0 additions & 75 deletions docs/cat.md

This file was deleted.

