config.yml

config.yml contains the configuration information for the 'collection'2caom2 repositories in this organization. An 'entry' may be a CAOM2 observation ID value, or a file name as found in CADC storage.

It contains:

  • working_directory - this is the WORKDIR value on the container. It can change based on a Dockerfile directive, or the docker run command.
  • netrc_filename - this is the name of the netrc file supplied to the container. It must be a fully-qualified name. One of netrc_filename or proxy_file_name must have a value.
  • proxy_file_name - this is the name of the proxy certificate file supplied to the container. It must be a fully-qualified name. One of netrc_filename or proxy_file_name must have a value.
  • resource_id - this identifies which service to use for metadata storage. ivo://cadc.nrc.ca/sc2repo is the default, and will result in entries written to sc2.canfar.net. ivo://cadc.nrc.ca/ams is for modifying the content of production collections.
  • tap_id - this identifies which service to use for metadata queries. ivo://cadc.nrc.ca/sc2tap is the default, and will query entries visible on sc2.canfar.net. ivo://cadc.nrc.ca/ams/<collection> is for querying production collections.
  • todo_file_name - this is the name of the file containing the list of file IDs to process.
  • use_local_files
    • When False, this will retrieve metadata and data to a temporary local location.
    • When True:
      • the application will look for files ending in data_source_extensions in the directories listed in data_sources.
      • the md5 checksum for the local file will have to match the md5 checksum for the file stored at CADC. This check is done for the store task type.
  • store_modified_files_only
    • When False, has no effect
    • When True:
      • if use_local_files is also True, checks that the local version of the file has a different md5 checksum than the file at CADC before transferring the file for storage at CADC. This affects only the store task type.
  • data_sources - if use_local_files is True, specify the directories in which to search for files. This is a YAML list. It may be a list of length 1, set to the same value as working_directory (see the sketch following this list).
  • data_source_extensions - the file extensions to be recognized for processing by the pipeline, e.g. '.fits', '.fits.fz', '.hdf5'. Specify as a YAML list.
  • recurse_data_sources - set it to True if the directories in data_sources contain sub-directories that should also be searched.
  • source_host - the host where the input files originate, if different from the host where the pipeline runs.
  • logging_level - set it to one of DEBUG, INFO, WARNING, ERROR, depending on how much output you'd like
  • log_to_file - set it to True if you want an entry.log file for each work item
  • log_file_directory - set a fully qualified value - log and footprint files will be written here.
  • success_log_file_name - the filename where successes are written, default is success_log.txt. This file is written in the log_file_directory.
  • failure_log_file_name - the filename where failures are written, default is failure_log.txt. This file is written in the log_file_directory.
  • retry_file_name - the filename where entries are written if there was a failure in 'collection'_run for the entry. This file is written to the log_file_directory.
  • retry_failures - if True, the pipeline will retry execution for any entries in the retry_file_name.
  • retry_count - the number of times that the pipeline will retry execution for any entries in the retry_file_name. Defaults to 1.
  • retry_decay - factor applied to how long the application will wait before retrying the entries in the retries.txt file. The default delay is 1 minute, so a value of 0.25 for retry_decay will result in a 15 second delay, and a value of 10 will result in a 10 minute delay.
  • rejected_directory - if the pipeline for the collection tracks known failures, this is the location where the information is persisted. Defaults to <working_directory>/rejected
  • rejected_file_name - if the pipeline for the collection tracks known failures, this is the file where the information is persisted. Defaults to rejected.yml.
  • progress_file_name - an on-going log of numbers of entries processed by a pipeline. Defaults to progress.txt, and is found in log_file_directory.
  • state_file_name - for information that needs to be persisted between pipeline executions. If the pipeline is run in increments, this file keeps the latest bookmark for the last successful increment. Defaults to state.yml, and is found in working_directory. An example state file can be seen below.
  • interval - if using a state file to time-box execution chunks, this is the interval, in minutes, that defines the start and end of the time-box (see the sketch before the example state.yml content below).
  • observe_execution - set to True if you want metrics on CADC service execution time.
  • observable_directory - if observe_execution is True, the location where files are written that accumulate CADC service execution times, for later evaluation.
  • stream - set it to default, if using the store task type. CADC will provide other values for this entry.
  • collection - the collection string that shows up in the UI. Defaults to TEST.
  • archive - the name of the CADC storage namespace.
  • task_types - this controls the work that gets done by the application. The possible options are: scrape, store, ingest, modify, visit.
    • use scrape by itself when you want to test CAOM model observation creation - the output will be written to the working directory
    • use scrape, modify when you also want to test locally any model observation augmentation that requires access to the file on disk (e.g. preview generation, footprint generation, or time bounds, depending on the collection)
    • use store, ingest, modify with use_local_files set to True, when you want to store data to CADC from the working directory, as well as create the CAOM model observations, and augment them
    • use ingest when the data is already at CADC, but you want to update something in the metadata for the records
    • use ingest, modify when you need to update existing records at CADC that rely on the metadata, and the data
    • use store to update files at CADC, without updating any of the associated metadata. This may be via http, ftp, or, if use_local_files is set to True, a copy from local disk.
    • use visit to retrieve existing CAOM observation records, update their content without access to data or metadata, and store the result back
  • cache_file_name - metadata that is looked up once can be retrieved from this file. Not all pipelines have a cache file.
  • storage_inventory_resource_id - only required if features.supports_latest_client is True. Possible values are under the heading "storage inventory services" from here.
  • cleanup_files_when_storing - when False, has no effect. When True, will move files that transferred successfully to CADC to the directory in cleanup_success_destination. Files that failed to transfer are moved to cleanup_failure_destination. FITS files are run through astropy.io.fits.open().verify('warn') before transfer.
  • cleanup_success_destination - if cleanup_files_when_storing is True, files that end up in CADC storage with the same md5sum as the local file will be moved to this location. If a file already has the same md5sum at CADC, and store_modified_files_only is set to True, it will also be moved to this location.
  • cleanup_failure_destination - if cleanup_files_when_storing is True, files that fail to be sent to CADC storage will be moved to this location. FITS files that fail astropy.io.fits.open().verify('warn') will end up in this location.
  • features - this defines which features will be supported by the application. By default, all features are turned on (set to True). There are currently no feature flags.
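
The local-file options above can be combined for a store run. A minimal sketch, assuming illustrative staging directories (/data/staging, /data/success, and /data/failure are examples, not defaults):

use_local_files: True
# directories searched for files to store
data_sources:
  - /data/staging
data_source_extensions:
  - .fits
  - .fits.fz
recurse_data_sources: False
# only transfer files whose md5sum differs from the CADC copy
store_modified_files_only: True
# move files out of data_sources after the transfer attempt
cleanup_files_when_storing: True
cleanup_success_destination: /data/success
cleanup_failure_destination: /data/failure
task_types:
  - store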

Example config.yml file content:

working_directory: /usr/src/app
# The proxy_filename must be a fully-qualified name
proxy_file_name: /usr/src/app/cadcproxy.pem
# operational value is ivo://cadc.nrc.ca/ams
resource_id: ivo://cadc.nrc.ca/sc2repo
todo_file_name: todo.txt
# values True False
use_local_files: False
# values DEBUG INFO WARNING ERROR
logging_level: DEBUG
# values True False
log_to_file: False
# fully qualified name for a directory to write log files
log_file_directory: /usr/src/app/logs
# ad stream value - sorry
stream: raw
retry_failures: True
retry_count: 1
# how to control the work that gets done
task_types: 
  - ingest
  - modify
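
If the pipeline runs in increments, interval and state_file_name work together to time-box each execution chunk. A minimal sketch (the 10-minute value is illustrative):

# time-box each execution chunk to 10 minutes
interval: 10
state_file_name: state.yml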

Example state.yml file content:

bookmarks:
  collection_timestamp: 
    last_record: 2020-08-21 06:04:34.418794

Retries

When retrying, the application will:

  1. use the retries.txt file as the todo list
  2. retry as many times as the retry_count value in the config.yml file (the default retry_count is 1)
  3. make a new log directory, in the working directory, with the name logs_{retry_count}. Any failures for the retry execution that need to be logged will be logged here.
  4. in the new log directory, make a new .xml file for the output, with the name {obs_id}.xml

The pipeline attempts to only retry transient failures.
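
As an illustration of the timing described for retry_decay, these (illustrative) values would retry failed entries up to 3 times, waiting 30 seconds (0.5 x the default 1 minute delay) before each retry:

retry_failures: True
retry_count: 3
retry_decay: 0.5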