Pass Command Line Arguments

Jan Ehmueller edited this page Oct 15, 2017 · 2 revisions

Set options via the command line

Command line arguments are parsed with scallop. Pass --help to list the available options. Each option can be set with either a short flag (e.g., -o) or a long flag (e.g., --option).

# example call starting a Deduplication with a specific config file
spark.sh -m yarn -c de.hpi.ingestion.deduplication.Deduplication ingestion_master.jar --config deduplication_wikidata.xml

Access options set via the command line in a job

The SparkJob trait defines the value conf: CommandLineConf. The parsed command line options are written to this value in the method execute(). Values in the config can be accessed in two ways:

// return an Option of the value (is None when the option was not set)
conf.configOpt

// return the value or throw an error if it is not set
conf.config
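
The difference between the two access styles can be sketched with a small stand-in class. SketchConf below is hypothetical and only mimics the behaviour described above; it is not the project's CommandLineConf. The exception type and the fallback file name default.xml are assumptions made for illustration.

```scala
// Hypothetical stand-in for CommandLineConf; it only models the two access styles.
final case class SketchConf(options: Map[String, String]) {
  // Option-style access: None when --config was not passed.
  def configOpt: Option[String] = options.get("config")

  // Direct access: throws when --config was not passed.
  // (The exception type is an assumption of this sketch.)
  def config: String = configOpt.getOrElse(
    throw new NoSuchElementException("option 'config' is not set"))
}

object SketchConfDemo {
  def main(args: Array[String]): Unit = {
    val withConfig = SketchConf(Map("config" -> "deduplication_wikidata.xml"))
    val withoutConfig = SketchConf(Map.empty)

    // Safe to call because the option was set.
    println(withConfig.config)

    // The option is missing, so fall back to a (hypothetical) default.
    println(withoutConfig.configOpt.getOrElse("default.xml"))
  }
}
```

As a rule of thumb, the Option form suits options a job can run without, while the direct form suits options the job cannot proceed without.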

Overview of the command line options

  • config: sets the config file used by the job
  • importConfig: sets the import config file used by the job (only used by DataLakeImports)
  • commitJson: sets the input for the Commit Job (created by the Curation Interface)
  • comment: sets the comment used by a Blocking Job
  • tokenizer: sets the tokenizer used by the TermFrequencyCounter. Accepts up to three values (tokenizer, stop words, stemming). An example call would be: --tokenizer CleanCoreNLPTokenizer true true.
  • toReduced: sets whether or not the LinkAnalysis writes to the reduced columns. This option is used by the ReducedLinkAnalysis.
  • restoreVersion: sets the version to which the subject table is restored (used by the VersionRestore Job)
  • diffVersions: sets the versions to diff in the VersionDiff Job. Must be exactly two versions. An example call would be: --diffVersions 7b410340-243e-11e7-937a-ad9adce5e136 f44df8b0-2425-11e7-aec2-2d07f82c7921
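
The value-count constraints on tokenizer and diffVersions can be illustrated with a minimal sketch. The object OptionArity, its method names, and the error messages are hypothetical and not part of the project. The sketch also assumes that tokenizer needs at least one value, which the list above does not state.

```scala
// Hypothetical helpers; names and messages are assumptions of this sketch.
object OptionArity {
  // --tokenizer: up to three values (tokenizer, stop words, stemming).
  // Requiring at least one value is an assumption.
  def checkTokenizer(values: List[String]): Either[String, List[String]] =
    if (values.nonEmpty && values.length <= 3) Right(values)
    else Left(s"tokenizer expects 1 to 3 values, got ${values.length}")

  // --diffVersions: exactly two version identifiers.
  def checkDiffVersions(values: List[String]): Either[String, (String, String)] =
    values match {
      case first :: second :: Nil => Right((first, second))
      case _ => Left(s"diffVersions expects exactly 2 values, got ${values.length}")
    }
}

object OptionArityDemo {
  def main(args: Array[String]): Unit = {
    println(OptionArity.checkTokenizer(List("CleanCoreNLPTokenizer", "true", "true")))
    println(OptionArity.checkDiffVersions(List(
      "7b410340-243e-11e7-937a-ad9adce5e136",
      "f44df8b0-2425-11e7-aec2-2d07f82c7921")))
    // A single version is rejected.
    println(OptionArity.checkDiffVersions(List("7b410340-243e-11e7-937a-ad9adce5e136")))
  }
}
```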