Automatically convert CSV or TSV files to TFRecord, and upload them to Google Cloud Storage.

harrydrippin/tfrecorder

TFRecorder

A simple tool that converts your NLP dataset to TFRecord files and uploads them to Google Cloud Storage for GPU/TPU training.

Key Features

  • Generate TFRecord files from the given CSV or TSV files and a metadata file
  • Upload the files to Google Cloud Storage for convenient use when training on TPUs with TensorFlow
  • Convert and upload simultaneously using asyncio.Queue
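
The simultaneous convert-and-upload feature can be pictured as a producer/consumer pair sharing an asyncio.Queue. The following is a minimal sketch of that pattern only, not the tool's actual code: the function names (convert, upload, run_pipeline), the None sentinel, and the asyncio.sleep(0) calls standing in for real work are all assumptions made for illustration.

```python
import asyncio


async def convert(paths, queue):
    """Producer: pretend to convert each source file, then hand it over."""
    for path in paths:
        await asyncio.sleep(0)  # stand-in for the real conversion work
        await queue.put(path.rsplit(".", 1)[0] + ".tfrecord")
    await queue.put(None)  # sentinel: nothing more to upload


async def upload(queue, uploaded):
    """Consumer: pretend to upload each converted file as soon as it arrives."""
    while True:
        item = await queue.get()
        if item is None:
            break
        await asyncio.sleep(0)  # stand-in for the real upload to GCS
        uploaded.append(item)


async def run_pipeline(paths):
    queue = asyncio.Queue()
    uploaded = []
    # Both coroutines run concurrently, so uploading can start
    # before every file has been converted.
    await asyncio.gather(convert(paths, queue), upload(queue, uploaded))
    return uploaded


result = asyncio.run(run_pipeline(["a.tsv", "b.tsv"]))
```

Because the consumer starts draining the queue while the producer is still filling it, upload time overlaps with conversion time instead of being added to it.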

Metadata File Usage

{
  "name": "<Name of the dataset. Will be used for TFRecord file name and GCS bucket name.>",
  "convert": {
    "from_path": "<Path of the dataset, by glob pattern.>",
    "file_type": "<Type of the file. Can use 'csv' or 'tsv'.>",
    "skip_header": false,
    "to_path": "<Path to store TFRecord files.>"
  },
  "columns": [
    {
      "name": "<Name of the column>",
      "feature_type": "<Type of the column value. 'str', 'bool', 'int', 'float'>"
    }
  ]
}
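
As an illustration of the template above, the sketch below fills it in with made-up values and checks the result against the constraints the template states ('csv' or 'tsv' for file_type; 'str', 'bool', 'int', or 'float' for feature_type). The dataset name, paths, column names, and the check_metadata helper are all hypothetical; the tool itself may validate the file differently.

```python
import json

# Hypothetical values, chosen only for illustration.
metadata = {
    "name": "my-nlp-dataset",
    "convert": {
        "from_path": "data/*.tsv",
        "file_type": "tsv",
        "skip_header": True,
        "to_path": "tfrecords/",
    },
    "columns": [
        {"name": "text", "feature_type": "str"},
        {"name": "label", "feature_type": "int"},
    ],
}

ALLOWED_FILE_TYPES = {"csv", "tsv"}
ALLOWED_FEATURE_TYPES = {"str", "bool", "int", "float"}


def check_metadata(meta):
    """Return a list of problems; an empty list means the layout looks right."""
    problems = []
    if not meta.get("name"):
        problems.append("missing name")
    convert = meta.get("convert", {})
    for key in ("from_path", "file_type", "skip_header", "to_path"):
        if key not in convert:
            problems.append(f"missing convert.{key}")
    if convert.get("file_type") not in ALLOWED_FILE_TYPES:
        problems.append("file_type must be 'csv' or 'tsv'")
    for column in meta.get("columns", []):
        if column.get("feature_type") not in ALLOWED_FEATURE_TYPES:
            problems.append(f"bad feature_type for column {column.get('name')!r}")
    return problems


# Serialized form, ready to be saved as the metadata JSON file.
metadata_json = json.dumps(metadata, indent=2)
```

The columns list is presumably matched to the columns of each CSV or TSV row in order, so it should list them in the same order as they appear in the source files.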

Tool Usage

$ tfr -h
usage: tfr [-h] [-m MODE] [--max-error MAX_ERROR] [-c COMPRESSION_TYPE]
           [--only-convert [ONLY_CONVERT]] [--only-upload [ONLY_UPLOAD]]
           [--batch-size BATCH_SIZE] [-g GOOGLE_APPLICATION_CREDENTIALS]
           [--max-pool-size MAX_POOL_SIZE] [--chunksize CHUNKSIZE]
           METADATA_PATH

Automatically convert CSV or TSV files to TFRecord, and upload them to Google
Cloud Storage.

positional arguments:
  METADATA_PATH         Path of JSON file which have metadata

optional arguments:
  -h, --help            show this help message and exit
  -m MODE, --mode MODE  Method to parse the file. Use tsv by default.
  --max-error MAX_ERROR
                        Max error records while parsing. Not set (-1) by
                        default.
  -c COMPRESSION_TYPE, --compression-type COMPRESSION_TYPE
                        TFRecord compression type. Use GZIP by default.
  --only-convert [ONLY_CONVERT]
                        Only convert the files, not upload to GCS
  --only-upload [ONLY_UPLOAD]
                        Only upload the files to GCS, not convert (will read
                        TFRECORD_PATH only)
  --batch-size BATCH_SIZE
                        Size of the examples one file should have. Use 1000 by
                        default.
  -g GOOGLE_APPLICATION_CREDENTIALS, --google-application-credentials GOOGLE_APPLICATION_CREDENTIALS
                        Google Application Credential JSON file path. Will use
                        environment variable as a default.
  --max-pool-size MAX_POOL_SIZE
                        Max pool size for multiprocessing. Use all cores by
                        default.
  --chunksize CHUNKSIZE
                        Chunksize for multiprocessing. Use 10 by default.
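
According to the help text, --batch-size sets how many examples go into one output file (1000 by default). A rough sketch of that kind of grouping is shown below; the batched helper is hypothetical and is not taken from the tool's source.

```python
from itertools import islice


def batched(rows, batch_size=1000):
    """Yield consecutive lists of at most batch_size rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 2500 parsed rows would be split across three output files.
sizes = [len(batch) for batch in batched(range(2500))]
```

Under this scheme only the last file can hold fewer examples than the batch size.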

Test

To test this locally, install the package in development mode by running the command below.

python setup.py develop
