The bashdatacatalog is a command-line tool that facilitates synchronizing local data collections with a remote data source. With the bashdatacatalog, you can run queries on your local data collections to answer questions like "What files am I missing?" or "What files aren't bitwise identical to the remote data?". Queries can include a date range, in which case collections with temporal assets are filtered accordingly. The bashdatacatalog can format query results as a URL download list, a Globus transfer list, an rsync transfer list, or a plain file list.
The bashdatacatalog was written to facilitate downloading input data for users of the GEOS-Chem atmospheric chemistry model. The canonical GEOS-Chem input data repository has >1 M files and >100 TB of data, and the input data required for a simulation depends on the model version and simulation parameters such as start and end date.
Note: Consider giving the bashdatacatalog a Star ⭐ if you find it useful. This increases visibility and helps justify maintaining this repository.
Data is organized with collections and catalogs.
collection - A data collection is a directory (folder) that contains data files. A data collection may have any number of files of any type, and may include subdirectories.
catalog - A file that lists collections and collection settings. A catalog file includes (1) the local paths to data collections, (2) the URLs of the data sources, and (3) boolean flags to enable/disable data collections.
You can install the bashdatacatalog with the following command. Follow the prompts and restart your terminal.
$ bash <(curl -s https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/install.sh)
Note: This command upgrades the bashdatacatalog if it's already installed.
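After restarting your terminal, one way to confirm that the install worked is to check that the commands used in this README can be found by your shell. The following is a minimal sketch: the function name `check_bashdatacatalog` is hypothetical and not part of the bashdatacatalog, and it only looks for the two commands mentioned in this README.

```shell
# Hypothetical helper (not part of the bashdatacatalog): report whether the
# commands used in this README can be found by the current shell.
check_bashdatacatalog() {
  local cmd missing=0
  for cmd in bashdatacatalog-fetch bashdatacatalog-list; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "found: $cmd"
    else
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  # Non-zero exit status if at least one command was not found.
  return "$missing"
}
```

Running `check_bashdatacatalog` prints one line per command and returns a non-zero status if either one is missing, in which case rerunning the install command above is a reasonable first step.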
- Download a catalog file:
$ curl https://raw.githubusercontent.com/LiamBindle/bashdatacatalog/main/sandbox/catalog1.csv -o catalog1.csv
- Fetch collection metadata:
$ bashdatacatalog-fetch catalog1.csv
- Run listing queries (e.g., download all missing files with 4 parallel download streams):
$ bashdatacatalog-list -am -f xargs-curl catalog1.csv | xargs -P 4 curl
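The pipeline above starts downloading as soon as it runs. If you would rather review the list first, a more cautious variant is to save the query output to a file and download in a second step. The following is a minimal sketch: the function name `stage_missing_files` and the default file name `missing.txt` are hypothetical, and it uses only the commands and flags shown above (`-am`, `-f xargs-curl`).

```shell
# Hypothetical helper (not part of the bashdatacatalog): refresh collection
# metadata, then save the list of missing files instead of downloading it.
# Usage: stage_missing_files CATALOG [LISTFILE]
stage_missing_files() {
  local catalog="$1"
  local listfile="${2:-missing.txt}"   # assumed default file name
  local lines

  # Refresh collection metadata (same command as the fetch step above).
  bashdatacatalog-fetch "$catalog" || return 1

  # Same query as above, written to a file rather than piped to xargs.
  bashdatacatalog-list -am -f xargs-curl "$catalog" > "$listfile" || return 1

  # Report how many lines were written (tr strips padding some wc builds add).
  lines=$(wc -l < "$listfile" | tr -d ' ')
  echo "$lines line(s) written to $listfile"
}
```

After `stage_missing_files catalog1.csv`, you can inspect `missing.txt` and then start the same 4-stream download with `xargs -P 4 curl < missing.txt`, since xargs reads the list from standard input either way.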
See the Wiki for documentation on using the bashdatacatalog: https://github.com/LiamBindle/bashdatacatalog/wiki.
If you are a GEOS-Chem user, see the standalone instructions for GEOS-Chem users here.
For options and arguments to the bashdatacatalog-list command, see:
$ bashdatacatalog-list -h