config.yml

SharonGoliath edited this page Dec 8, 2022
`config.yml` contains the configuration information for the 'collection'2caom2 repositories in this organization. An 'entry' may be a CAOM2 observation ID value, or a file name, as found in CADC storage.

It contains:
- `working_directory` - this is the WORKDIR value on the container. It can change based on a Dockerfile directive, or the docker run command.
- `netrc_filename` - this is the name of the netrc file supplied to the container. It must be a fully-qualified name. One of `netrc_filename` or `proxy_file_name` must have a value.
- `proxy_file_name` - this is the name of the proxy certificate file supplied to the container. It must be a fully-qualified name. One of `netrc_filename` or `proxy_file_name` must have a value.
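As a sketch, a container authenticating with a proxy certificate might set the following (the paths here are illustrative, not required values):

```yaml
working_directory: /usr/src/app
# at least one of netrc_filename or proxy_file_name must have a value;
# either must be a fully-qualified path inside the container
proxy_file_name: /usr/src/app/cadcproxy.pem
# netrc_filename: /usr/src/app/test_netrc
```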
- `resource_id` - this identifies which service to use for metadata storage. `ivo://cadc.nrc.ca/sc2repo` is the default, and will result in entries written to sc2.canfar.net. `ivo://cadc.nrc.ca/ams` is for modifying the content of production collections.
- `tap_id` - this identifies which service to use for metadata queries. `ivo://cadc.nrc.ca/sc2tap` is the default, and will query entries visible on sc2.canfar.net. `ivo://cadc.nrc.ca/ams/<collection>` is for querying production collections.
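The two settings above pair up by target environment; a minimal sketch of each pairing, using only the identifiers named in this page:

```yaml
# test against sc2.canfar.net (the defaults)
resource_id: ivo://cadc.nrc.ca/sc2repo
tap_id: ivo://cadc.nrc.ca/sc2tap
# production equivalents:
# resource_id: ivo://cadc.nrc.ca/ams
# tap_id: ivo://cadc.nrc.ca/ams/<collection>
```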
- `todo_file_name` - this is the name of the file containing the list of file ids to process.
- `use_local_files`
  - When False, this will retrieve metadata and data to a temporary local location.
  - When True:
    - the application will look for files ending in `data_source_extensions` in the directories listed in `data_sources`.
    - the md5 checksum for the local file will have to match the md5 checksum for the file stored at CADC. This check is done for the `store` task type.
- `store_modified_files_only`
  - When False, has no effect.
  - When True:
    - if `use_local_files` is also True, checks that the local version of the file has a different md5 checksum than the file at CADC before transferring the file for storage at CADC. This affects only the `store` task type.
- `data_sources` - if `use_local_files` is `True`, specify directories in which to search for files. This is a YAML list. It may be a list of length 1, set to the same value as the `working_directory`.
- `data_source_extensions` - the file extensions to be recognized for processing by the pipeline, e.g. '.fits', '.fits.fz', '.hdf5'. Specify as a YAML list.
- `recurse_data_sources` - set it to True if the items in `data_sources` have a hierarchy.
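Putting the local-file settings together, a sketch of a configuration that picks up FITS and HDF5 files from the working directory and its sub-directories (directory and extension values are illustrative):

```yaml
use_local_files: True
data_sources:
  - /usr/src/app
# also search sub-directories of the data_sources entries
recurse_data_sources: True
data_source_extensions:
  - .fits
  - .fits.fz
  - .hdf5
```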
- `source_host` - if different from where the pipeline runs.
- `logging_level` - set it to one of DEBUG, INFO, WARNING, ERROR, depending on how much output you'd like.
- `log_to_file` - set it to True if you want an entry.log file for each work item.
- `log_file_directory` - set a fully-qualified value - log and footprint files will be written here.
- `success_log_file_name` - the file name where successes are written, default is `success_log.txt`. This file is written in the `log_file_directory`.
- `failure_log_file_name` - the file name where failures are written, default is `failure_log.txt`. This file is written in the `log_file_directory`.
- `retry_file_name` - the file name where entries are written if there was a failure in 'collection'_run for the entry. This file is written to the `log_file_directory`.
- `retry_failures` - if True, the pipeline will retry execution for any entries in the `retry_file_name`.
- `retry_count` - the number of times that the pipeline will retry execution for any entries in the `retry_file_name`. Defaults to 1.
- `retry_decay` - factor applied to how long the application will wait before retrying the entries in the `retries.txt` file. The default delay is 1 minute, so a value of 0.25 for `retry_decay` will result in a 15 second delay. A value of 10 will result in a 10 minute delay.
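The retry settings above can be sketched as a config fragment; the decay arithmetic follows the description here:

```yaml
retry_failures: True
# one retry pass over the entries in retry_file_name
retry_count: 1
retry_file_name: retries.txt
# 0.25 * the default 1 minute delay = a 15 second wait before retrying
retry_decay: 0.25
```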
- `rejected_directory` - if the pipeline for the collection tracks known failures, this is the location where the information is persisted. Defaults to `<working_directory>/rejected`.
- `rejected_file_name` - if the pipeline for the collection tracks known failures, this is the file where the information is persisted. Defaults to `rejected.yml`.
- `progress_file_name` - an on-going log of the number of entries processed by a pipeline. Defaults to `progress.txt`, and is found in `log_file_directory`.
- `state_file_name` - for information that needs to be persisted between pipeline executions. Defaults to `state.yml`, and is found in `working_directory`.
- `interval` - if using a state file to time-box execution chunks, this is the interval, in minutes, that defines the start and end of the time-box.
- `observe_execution` - set to True if you want metrics on CADC service execution time.
- `observable_directory` - if `observe_execution` is True, the location where files are written that accumulate CADC service execution times, for later evaluation.
- `stream` - set it to `default` if using the task type `store`. CADC will provide other values for this entry.
- `collection` - the collection string that shows up in the UI. Will default to `TEST`.
- `archive` - the name of the CADC storage namespace.
- `task_types` - this controls the work that gets done by the application. The possible options are: `scrape`, `store`, `ingest`, `modify`, `visit`.
  - use `scrape` by itself when you want to test CAOM model observation creation - the output will be written to the working directory
  - use `scrape`, `modify` when you want to test any model observation augmentation that requires access to the file on disk (e.g. preview generation, footprint generation, time bounds, depending on the collection) locally as well
  - use `store`, `ingest`, `modify` with `use_local_files` set to True, when you want to store data to CADC from the working directory, as well as create the CAOM model observations, and augment them
  - use `ingest` when the data is already at CADC, but you want to update something in the metadata for the records
  - use `ingest`, `modify` when you need to update existing records at CADC that rely on the metadata, and the data
  - use `store` to update files at CADC, without updating any of the associated metadata. This may be via http, ftp, or, if `use_local_files` is set to True, it will copy the file from local disk.
  - use `visit` to retrieve existing CAOM observation records, update their content without access to data or metadata, and store the result back
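For example, the third combination above - store local files at CADC, create the CAOM model observations, then augment them - looks like this in `config.yml`:

```yaml
# store local files, create the CAOM2 records, then augment them
task_types:
  - store
  - ingest
  - modify
use_local_files: True
```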
- `state_file_name` - if the pipeline is run in increments, this is the name of the file that keeps the latest `bookmark` for the last successful increment. An example state file can be seen below.
- `cache_file_name` - metadata that is looked up once can be retrieved from this file. Not all pipelines have a cache file.
- `storage_inventory_resource_id` - only required if `features.supports_latest_client` is `True`. Possible values are under the heading "storage inventory services" from here.
- `cleanup_files_when_storing` - when `False`, has no effect. When `True`, will move files that transferred successfully to CADC to the directory in `cleanup_success_destination`. Files that failed to transfer are moved to `cleanup_failure_destination`. FITS files that are transferred are run through `astropy.io.fits.open().verify('warn')` before transfer.
- `cleanup_success_destination` - if `cleanup_files_when_storing` is `True`, files that end up in CADC storage with the same md5sum as locally will be moved to this location. If a file already has the same md5sum at CADC, and `store_modified_files_only` is set to `True`, files will also be moved to this location.
- `cleanup_failure_destination` - if `cleanup_files_when_storing` is `True`, files that fail to be sent to CADC storage will be moved to this location. FITS files that fail `astropy.io.fits.open().verify('warn')` will end up in this location.
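A sketch of the cleanup settings together (the destination directories are illustrative - use locations that make sense for your deployment):

```yaml
cleanup_files_when_storing: True
# successfully transferred files are moved here
cleanup_success_destination: /usr/src/app/success
# files that fail transfer, or fail FITS verification, are moved here
cleanup_failure_destination: /usr/src/app/failure
```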
- `features` - this defines which features will be supported by the application. By default, all features are turned on (set to True). There are currently no feature flags.
An example `config.yml`:

```yaml
working_directory: /usr/src/app
# The proxy_file_name must be a fully-qualified name
proxy_file_name: /usr/src/app/cadcproxy.pem
# operational value is ivo://cadc.nrc.ca/ams
resource_id: ivo://cadc.nrc.ca/sc2repo
todo_file_name: todo.txt
# values True False
use_local_files: False
# values DEBUG INFO WARNING ERROR
logging_level: DEBUG
# values True False
log_to_file: False
# fully-qualified name for a directory to write log files
log_file_directory: /usr/src/app/logs
# ad stream value - sorry
stream: raw
retry_failures: True
retry_count: 1
# how to control the work that gets done
task_types:
  - ingest
  - modify
```
An example state file:

```yaml
bookmarks:
  collection_timestamp:
    last_record: 2020-08-21 06:04:34.418794
```
When retrying, the application will:

- use the retries.txt file as the todo list
- retry as many times as the `retry_count` in the config.yml file. The default `retry_count` is 1.
- make a new log directory, in the working directory, with the name logs_{retry_count}. Any failures for the retry execution that need to be logged will be logged here.
- in the new log directory, make a new .xml file for the output, with the name {obs_id}.xml

The pipeline attempts to retry only transient failures.
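Assuming `working_directory: /usr/src/app`, `log_file_directory: /usr/src/app/logs`, and one retry pass, a run with a failed entry might leave a layout roughly like this (the entry name is hypothetical; only the logs_{retry_count} naming comes from the description above):

```
/usr/src/app/logs/failure_log.txt
/usr/src/app/logs/retries.txt
/usr/src/app/logs_1/some_obs_id.log
/usr/src/app/logs_1/some_obs_id.xml
```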