fusion-jena/JenTab

JenTab

  • Previous Releases
    • SemTab 2020 and Knowledge Graph Construction Workshop @ESWC 2021

Matching Tabular Data to Knowledge Graphs

[Image: tasks]

Architecture

[Image: architecture]

The image above shows the distributed architecture of JenTab. Here is a brief description of each service:

  • Manager: the central node; it is responsible for load balancing and collects results, errors, and audit records.
  • Runner: a client node that handles the communication among
    • pre-processing services (Clean Cells, Type Prediction)
    • Approach
    • Manager
  • Generic Lookup: a pre-computed lookup service; our primary solution for handling misspellings.
  • Solver: encapsulates our pipeline as a series of calls across the dependent services.
  • Wikidata_Proxy: encapsulates the lookup and SPARQL query endpoint for Wikidata.
  • DBpedia_Proxy: encapsulates the lookup and SPARQL query endpoint for DBpedia.
  • Caching Server: a centralized caching server.
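
The Runner's coordinating role described above can be pictured with a small sketch. This is not JenTab's actual code: every function name and the dict-based table shape are hypothetical, and the order (Clean Cells, then Type Prediction, then the Approach) is an assumption based only on the list above. The services are passed in as plain callables so the sketch runs without any network access.

```python
from typing import Callable, Optional

# Hypothetical table representation, e.g. {"name": "t1", "rows": [[...], ...]}.
Table = dict


def run_once(
    fetch_work: Callable[[], Optional[Table]],          # ask the Manager for a table
    clean_cells: Callable[[Table], Table],               # pre-processing: Clean Cells
    predict_types: Callable[[Table], Table],             # pre-processing: Type Prediction
    solve: Callable[[Table], dict],                      # the Approach
    report_result: Callable[[Table, dict], None],        # send results to the Manager
    report_error: Callable[[Table, Exception], None],    # send errors to the Manager
) -> bool:
    """Process one table; return False when the Manager has no work left."""
    table = fetch_work()
    if table is None:
        return False
    try:
        prepared = predict_types(clean_cells(table))
        report_result(table, solve(prepared))
    except Exception as exc:
        # Report the failure to the Manager instead of crashing the Runner.
        report_error(table, exc)
    return True
```

A Runner built this way would call run_once in a loop until it returns False. In a real deployment each callable would wrap an HTTP request to the corresponding service.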

Quick Setup

The first step of the JenTab setup is to structure the assets folder. For demonstration, we will set up Round 1 here:

  1. Input configuration (dataset)
    • 2020 Dataset per Round
    • Download tables and targets for Round 1
    • Your downloaded tables should go under
      • /assets/data/input/2020/Round 1/
    • Your downloaded CEA_Round1_Targets.csv, CTA_Round1_Targets.csv and CPA_Round1_Targets.csv should go under
      • /assets/data/input/2020/Round 1/targets/
  2. Pre-computed Generic_Lookup db3 files
    • Generic_Lookup per Round
    • Download the db3 file for R1
    • Your downloaded lookup.db3 should go under
      • /assets/cache/Generic_Lookup/
  3. Baseline_Approach requires a stopword list and a fastText model
    • Download stopwords.txt
    • Rename the downloaded file to stopwords.txt
    • Download lid.176.ftz, the fastText language identification model
    • Your files should go under:
      • /assets/Baseline_Approach/
  • After the previous steps, assets must have the following directory structure:
assets
+---data
|   +---cache
|   |   \---Generic_Lookup
|   |           lookup.db3
|   |
|   \---input
|       \---2020
|           \---Round 1
|               +---tables
|               \---targets
|                       CEA_Round1_Targets.csv
|                       CTA_Round1_Targets.csv
|                       CPA_Round1_Targets.csv
|
+---Baseline_Approach
|       stopwords.txt
|       lid.176.ftz
|
\---Wikidata_Endpoint
        excluded_classes.csv
        excluded_colheaders.csv
After the assets are ready, the fastest way to get JenTab up and running is via Docker, in the following order:

  1. cd /services
  2. Manager
    • Change the default credentials in services/Manager/config.py to yours
      • username: YourManagerUsername
      • password: YourManagerPassword
    • Make sure that the dataset configuration in services/Manager/config.py is set to:
      • ROUND = 1
      • YEAR = 2020
    • Use the following command to launch the Manager node
      • docker-compose -f docker-compose.manager.yml up
    • The Manager should then be running at http://localhost:5100
  3. All other services: docker-compose -f docker-compose.yml up
  4. Runner
    • cd /Runner
    • Change the Manager credentials in services/Runner/config.py to the ones you selected
    • Make sure that manager_url = 'http://127.0.0.1:5100' #local is set in services/Runner/config.py
    • Build an image for the Runner: docker build -t runner .
    • Run it: docker run --network="host" runner
  • Note 1: for a basic understanding of Docker commands, please visit the official Docker documentation.
  • Note 2: We also support native execution, but in this case you will have to set up each service on its own. For this, refer to:
    • the folder of each service under services.
    • services.md, which summarizes the currently used services and their ports.
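
Since the Runner needs the Manager to be reachable, a quick check before starting it can save some debugging. The helper below is a hypothetical sketch that uses only the Python standard library. It does not call any JenTab API; it only tests whether an HTTP server answers at the given URL, and it treats any HTTP status code as "up".

```python
import http.client
import urllib.error
import urllib.request


def is_service_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at url, whatever the status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded (for example with 401 or 404), so it is running.
        return True
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or a non-HTTP reply.
        return False
```

For the setup above, is_service_up("http://localhost:5100") would tell you whether the Manager container is answering before you launch the Runner.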

Results

Materials

Citation

@inproceedings{abdelmageed_semtab2021,
  title={{JenTab Meets SemTab 2021's New Challenges}},
  author={Abdelmageed, Nora and Schindler, Sirko},
  booktitle={The 20th International Semantic Web Conference (ISWC)},
  year={2021}
}

@article{abdelmageed2021jentab,
  title={JenTab: A Toolkit for Semantic Table Annotations},
  author={Abdelmageed, Nora and Schindler, Sirko},
  booktitle={Knowledge Graph Construction (KGC) Workshop ESWC 2021, Accepted},
  year={2021}
}

@inproceedings{abdelmageed2020jentab,
  title={Jentab: Matching tabular data to knowledge graphs},
  author={Abdelmageed, Nora and Schindler, Sirko},
  booktitle={The 19th International Semantic Web Conference (ISWC)},
  year={2020}
}