
Batch Mode for JSON-LD Generation


Karma can be used in batch mode to generate JSON-LD for large datasets. This can be done with the command-line utility OfflineRdfGenerator or programmatically with the Karma JSON-LD generation API (GenericRDFGenerator, described below).

OfflineRdfGenerator

OfflineRdfGenerator is a command-line utility that loads a model and a source and then generates RDF and JSON-LD. The source can be a JSON, XML, or CSV file, or a database table. For database sources, the utility loads 10,000 rows at a time.

Using Maven to Run Batch Mode

To generate RDF and JSON-LD when the source is a file, go to the karma-offline sub-directory of Karma and execute the following command:

mvn exec:java -Dexec.mainClass="edu.isi.karma.rdf.OfflineRdfGenerator" -Dexec.args="
--sourcetype <sourcetype> \
--filepath <filepath> \
--modelfilepath <modelfilepath> \
--sourcename <sourcename> \
--outputfile <rdf-outputfile> \
--jsonoutputfile <json-outputfile> \
[--contextfile <contextfile> | --contexturl <contextUrl>] \
[--selection <selectionName>] \
[--root <rootClassForJsonLD>] \
[--killtriplemap <triplemapid to stop from expansion> ] \
[--stoptriplemap <stop the rdf generation from this triplemapid onwards> ] \
" -Dexec.classpathScope=compile

Example invocation for a JSON file:

mvn exec:java -Dexec.mainClass="edu.isi.karma.rdf.OfflineRdfGenerator" -Dexec.args="
--sourcetype JSON \
--filepath \"/files/data/wikipedia.json\" \
--modelfilepath \"/files/models/model-wikipedia.ttl\" \
--sourcename wikipedia \
--outputfile wikipedia-rdf.n3 \
--contextfile wiki-context.json \
--root \"http://schema.org/Document\" \
--jsonoutputfile wikipedia.json" -Dexec.classpathScope=compile

To generate RDF and JSON-LD from a database table, go to the karma-offline subdirectory of Karma and run the following command from the terminal:

mvn exec:java -Dexec.mainClass="edu.isi.karma.rdf.OfflineRdfGenerator" -Dexec.args="
--sourcetype DB \
--modelfilepath <modelfilepath> \
--outputfile <outputfile> \
--jsonoutputfile <json-outputfile> \
[--contextfile <contextfile> | --contexturl <contextUrl>] \
[--selection <selectionName>] \
[--root <rootClassForJsonLD>] \
[--killtriplemap <triplemapid to stop from expansion> ] \
[--stoptriplemap <stop the rdf generation from this triplemapid onwards> ] \
--dbtype <dbtype> \
--hostname <hostname> \
--username <username> \
--password <password> \
--portnumber <portnumber> \
--dbname <dbname> \
--tablename <tablename>" -Dexec.classpathScope=compile

Valid values for dbtype are Oracle, MySQL, SQLServer, PostGIS, and Sybase.

Example invocation:

mvn exec:java -Dexec.mainClass="edu.isi.karma.rdf.OfflineRdfGenerator" -Dexec.args="
--sourcetype DB --dbtype SQLServer \
--hostname example.com --username root --password secret \
--portnumber 1433 --dbname Employees --tablename Person \
--modelfilepath \"/files/models/db-r2rml-model.ttl\" \
--outputfile db-rdf.n3 \
--contextfile db-context.json \
--root \"http://schema.org/Person\" \
--jsonoutputfile db.json" -Dexec.classpathScope=compile

Using a Self-Contained JAR to Run Batch Mode

If you wish to use a JAR file instead of Maven, build the jar with the following commands:

cd karma-offline
mvn assembly:assembly -DdescriptorId=jar-with-dependencies
cp target/karma-offline-0.0.1-SNAPSHOT-jar-with-dependencies.jar ./karma-offline.jar

Now, to generate RDF and JSON-LD when the source is a file, go to the karma-offline sub-directory of Karma and execute the following command:

java -jar karma-offline.jar --sourcetype <sourcetype> --filepath <filepath> --modelfilepath <modelfilepath> --sourcename <sourcename> --outputfile <rdf-outputfile> --jsonoutputfile <json-outputfile> [--contextfile <contextfile> | --contexturl <contextUrl>] [--selection <selectionName>] [--root <rootClassForJsonLD>] [--killtriplemap <triplemapid to stop from expansion>] [--stoptriplemap <stop the rdf generation from this triplemapid onwards>]

Example:

java -jar karma-offline.jar --sourcetype CSV --filepath "/files/datasets/person.csv" --modelfilepath "/files/models/person-model.ttl" --sourcename person --outputfile person-rdf.n3 --contextfile person-context.json --jsonoutputfile person-jsonld.json --root "http://schema.org/Person"

To generate RDF and JSON-LD from a database table, go to the karma-offline subdirectory of Karma, copy the database connector JAR into this directory, and run the following command from the terminal:

java -cp <db-connector>.jar:karma-offline.jar edu.isi.karma.rdf.OfflineRdfGenerator --sourcetype DB --modelfilepath <modelfilepath> --outputfile <rdf-outputfile> --dbtype <dbtype> --hostname <hostname> --username <username> --password <password> --portnumber <portnumber> --dbname <dbname> --tablename <tablename> --jsonoutputfile <json-outputfile> [--contextfile <contextfile> | --contexturl <contextUrl>] [--selection <selectionName>] [--root <rootClassForJsonLD>] [--killtriplemap <triplemapid to stop from expansion>] [--stoptriplemap <stop the rdf generation from this triplemapid onwards>]

Valid values for dbtype are Oracle, MySQL, SQLServer, PostGIS, and Sybase.

Example invocation:

java -cp mysql-connector-java-5.0.8-bin.jar:karma-offline.jar edu.isi.karma.rdf.OfflineRdfGenerator --sourcetype DB --dbtype MySQL --hostname localhost --username root --password mypassword --portnumber 3306 --dbname karma --tablename offlineUsers --modelfilepath "/Users/dipsy/karma-projects/offlineUsers-model.ttl" --outputfile offlineUsers-rdf.n3 --contextfile person-context.json --jsonoutputfile offlineUsers-jsonld.json --root "http://schema.org/Person"

Using the Selection Feature in Offline Mode

If the model requires a selection, the selection name 'DEFAULT_TEST' needs to be passed to OfflineRdfGenerator with the --selection command-line argument. This makes it possible to execute the same model with or without the selection in offline mode. Example invocation:

mvn exec:java -Dexec.mainClass="edu.isi.karma.rdf.OfflineRdfGenerator" -Dexec.args="
--sourcetype DB --dbtype SQLServer \
--hostname example.com --username root --password secret \
--portnumber 1433 --dbname Employees --tablename Person \
--modelfilepath \"/files/models/db-r2rml-model.ttl\" \
--outputfile db-rdf.n3 \
--contextfile db-context.json \
--root \"http://schema.org/Person\" \
--selection DEFAULT_TEST \
--jsonoutputfile db.json" -Dexec.classpathScope=compile

GenericRDFGenerator

This API is meant for repeated RDF/JSON-LD generation from the same model: the models are loaded once up front, and each subsequent request reuses them to generate RDF. The input can be JSON, CSV, XML, or AVRO data, supplied as a File, String, or InputStream.

edu.isi.karma.rdf.GenericRDFGenerator

API to add a model to the RDF Generator

// modelIdentifier : Provides a name and location of the model file
void addModel(R2RMLMappingIdentifier modelIdentifier); 

API to generate the JSON-LD for a request

//request : Provides all details of the inputs to the RDF generator, such as the input data, the provenance setting, etc.
void generateRDF(RDFGeneratorRequest request)

edu.isi.karma.rdf.RDFGeneratorRequest

API to set the input data

//inputData : Input Data as String
public void setInputData(String inputData)

//inputStream: Input data as a Stream
public void setInputStream(InputStream inputStream)

//inputFile: Input data file
public void setInputFile(File inputFile)

API to set the input data type

//dataType: Valid values: CSV,JSON,XML,AVRO
public void setDataType(InputType dataType)

Setting to generate provenance information

//addProvenance -> flag to indicate if provenance information should be added to the RDF
public void setAddProvenance(boolean addProvenance) 

The writer for RDF

//writer -> Writer for the output. For JSON-LD generation, this should be JSONKR2RMLRDFWriter
public void addWriter(KR2RMLRDFWriter writer)
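
The writer wraps a java.io.PrintWriter, so the output can go anywhere a PrintWriter can. As a minimal sketch (the output path below is a hypothetical example), the JSON-LD can be written directly to a file, analogous to the --jsonoutputfile option of the command-line utility:

// Write the JSON-LD to a file instead of an in-memory buffer (hypothetical path).
PrintWriter filePw = new PrintWriter(new FileWriter("people-jsonld.json"));
JSONKR2RMLRDFWriter fileWriter = new JSONKR2RMLRDFWriter(filePw);
// fileWriter is then passed to RDFGeneratorRequest.addWriter(...) as in the example below.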

Example use:

GenericRDFGenerator rdfGenerator = new GenericRDFGenerator();

//Construct an R2RMLMappingIdentifier that provides a name and the location of the model, and add the model to the GenericRDFGenerator. You can add multiple models using this API.
R2RMLMappingIdentifier modelIdentifier = new R2RMLMappingIdentifier(
				"people-model", new File("/files/models/people-model.ttl").toURI().toURL());
rdfGenerator.addModel(modelIdentifier);

String filename = "files/data/people.json";
StringWriter sw = new StringWriter();
PrintWriter pw = new PrintWriter(sw);
JSONKR2RMLRDFWriter writer = new JSONKR2RMLRDFWriter(pw);
RDFGeneratorRequest request = new RDFGeneratorRequest("people-model", filename);
request.setInputFile(new File(filename));
request.setAddProvenance(true);
request.setDataType(InputType.JSON);
request.addWriter(writer);
rdfGenerator.generateRDF(request);
String jsonld = sw.toString();
System.out.println("Generated JSON-LD: " + jsonld);
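
Because the loaded models stay in memory, the same GenericRDFGenerator instance can serve many requests. The following is a minimal sketch of that reuse, this time feeding in-memory JSON strings with setInputData; the JSON records and the source name "people-inline" are hypothetical, and the field names must match whatever the people-model actually maps:

// Hypothetical in-memory JSON records; reuses the rdfGenerator and the model added above.
String[] inputs = {
    "{\"name\": \"John Doe\"}",
    "{\"name\": \"Jane Roe\"}"
};
for (String input : inputs) {
    StringWriter out = new StringWriter();
    JSONKR2RMLRDFWriter outWriter = new JSONKR2RMLRDFWriter(new PrintWriter(out));
    RDFGeneratorRequest req = new RDFGeneratorRequest("people-model", "people-inline");
    req.setInputData(input);            // in-memory JSON instead of a file
    req.setDataType(InputType.JSON);
    req.addWriter(outWriter);
    rdfGenerator.generateRDF(req);      // reuses the already-loaded model
    System.out.println(out.toString()); // one JSON-LD document per request
}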

Using the Selection Feature in the API

If the model requires a selection, GenericRDFGenerator provides a constructor that takes the selection name 'DEFAULT_TEST' as an argument.

Example use:

GenericRDFGenerator rdfGenerator = new GenericRDFGenerator("DEFAULT_TEST");

//Construct an R2RMLMappingIdentifier that provides a name and the location of the model, and add the model to the GenericRDFGenerator. You can add multiple models using this API.
R2RMLMappingIdentifier modelIdentifier = new R2RMLMappingIdentifier(
				"people-model", new File("/files/models/people-model.ttl").toURI().toURL());
rdfGenerator.addModel(modelIdentifier);

String filename = "files/data/people.json";
StringWriter sw = new StringWriter();
PrintWriter pw = new PrintWriter(sw);
JSONKR2RMLRDFWriter writer = new JSONKR2RMLRDFWriter(pw);
RDFGeneratorRequest request = new RDFGeneratorRequest("people-model", filename);
request.setInputFile(new File(filename));
request.setAddProvenance(true);
request.setDataType(InputType.JSON);
request.addWriter(writer);
rdfGenerator.generateRDF(request);
String jsonld = sw.toString();
System.out.println("Generated JSON-LD: " + jsonld);