legacy Distcp

Fabiano V. Santos edited this page Jun 13, 2017 · 1 revision

Nightfall Distcp

Makes a distributed copy of S3 data to a file system destination. It runs as a Spark job.

The copy builds on the backup format produced by Secor: all files for a given date are aggregated into a single file at the destination. If a file already exists at the target, it is removed before the new copy from S3 starts.

For example, consider the following source structure:

    ---- dt=2016-05-22
               ---- 10000589.gz
               ---- 10000689.gz
    ---- dt=2016-05-23
               ---- 10000789.gz
               ---- 10000889.gz

With /tmp/distcp as the destination, the copy produces the following structure:

        ---- 2016-05-22.gz
        ---- 2016-05-23.gz
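The aggregation described above can be sketched on a local file system. This is only an illustration, not the Nightfall implementation (which runs on Spark and reads from S3): the function name `aggregate_by_date` and the use of local directories are assumptions made for the example. It relies on the fact that concatenated gzip members form a valid gzip stream, so the parts can be appended without recompressing.

```python
import gzip
import shutil
from pathlib import Path


def aggregate_by_date(source_root, dest_dir):
    """Merge each dt=YYYY-MM-DD directory into a single <date>.gz file.

    Hypothetical helper for illustration; paths are local directories.
    """
    source_root = Path(source_root)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for day_dir in sorted(source_root.glob("dt=*")):
        if not day_dir.is_dir():
            continue
        day = day_dir.name[len("dt="):]
        target = dest_dir / (day + ".gz")
        # An existing target is removed before the new copy starts.
        if target.exists():
            target.unlink()
        with target.open("wb") as out:
            # Append the raw gzip members in file-name order.
            for part in sorted(day_dir.glob("*.gz")):
                with part.open("rb") as src:
                    shutil.copyfileobj(src, out)
        written.append(target)
    return written
```

Calling `aggregate_by_date("backup", "/tmp/distcp")` on the structure above would leave one file per date in `/tmp/distcp`, each readable with `gzip.open` as a single stream.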

Base configurations:

  • aws.region: region where the bucket is located, optional. Example: aws.region=us-east-1.
  • aws.access.key: AWS access key, optional. Example: aws.access.key=SECRET_ID.
  • aws.secret.key: AWS secret key, optional. Example: aws.secret.key=SECRET_KEY.
  • aws.s3.bucket: bucket from which the data will be read, required. Example:
  • aws.s3.path: path where you can find the backups, required. Example: aws.s3.path=raw_logs/secor/backup.
  • distcp.output.dir: destination where the S3 backups will be saved, required. Formats:
    • /tmp/distcp: local file system
    • hdfs://tmp/discp: HDFS
  • distcp.window.size.days: size of the data window in days, that is, how many days back will be copied. Default: 1. Example: distcp.window.size.days=30.
  • end date of the data window. Default: current day. Example:
  • distcp.window.type: type of window used by Distcp:
    • DAY: downloads the DataPoints for the window obtained by going back the number of days set in property distcp.window.size.days (default)
    • MONTH: downloads the DataPoints of the month specified in property
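A rough sketch of how the two window types could translate into a list of dates. The function `window_dates` and its parameters are hypothetical, and two behaviours are assumed rather than taken from the implementation: that a DAY window ends on and includes the end date, and that a MONTH window covers the whole calendar month of the date passed in.

```python
import calendar
from datetime import date, timedelta


def window_dates(window_type, reference_date, size_days=1):
    """Return the dates covered by a window (hypothetical helper)."""
    if window_type == "DAY":
        # Assumption: the window ends on reference_date, inclusive.
        return [reference_date - timedelta(days=offset)
                for offset in range(size_days - 1, -1, -1)]
    if window_type == "MONTH":
        # Assumption: every day of reference_date's month is covered.
        last_day = calendar.monthrange(reference_date.year,
                                       reference_date.month)[1]
        return [date(reference_date.year, reference_date.month, day)
                for day in range(1, last_day + 1)]
    raise ValueError("unknown window type: " + str(window_type))
```

Under these assumptions, `window_dates("DAY", date(2016, 5, 17), 3)` yields 15, 16 and 17 May 2016, and `window_dates("MONTH", date(2016, 1, 1))` yields the 31 days of January 2016.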

Example to download days 15, 16 and 17 of May 2016


Example to download the entire month of January 2016


./gradlew ':distcp':run


For execution in the development environment:

  • Generate from the sample file: cp distcp/src/main/resources/ distcp/src/main/resources/
  • Change the required properties