Merge pull request #10 from sebastianswms/filepath
Test PR with filepath
sebastianswms authored Jul 24, 2023
2 parents 338e031 + aaf5605 commit ecea52c
Showing 7 changed files with 58 additions and 30 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/main_test.yml
@@ -15,8 +15,8 @@ jobs:
env:
AWS_ACCESS_KEY_ID: AKIAZPOBIXUJJ434XT5T
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
TAP_UNIVERSAL_FILE_FILEPATH: derek-tap-filetesting/2023
TAP_UNIVERSAL_FILE_FILE_REGEX: ^airtravel\.csv$
TAP_UNIVERSAL_FILE_FILE_PATH: derek-tap-filetesting/2023
TAP_UNIVERSAL_FILE_FILE_REGEX: ^.*airtravel\.csv$
TAP_UNIVERSAL_FILE_PROTOCOL: s3
steps:
- uses: actions/checkout@v3
4 changes: 2 additions & 2 deletions .github/workflows/s3_test.yml
@@ -21,8 +21,8 @@ jobs:
env:
AWS_ACCESS_KEY_ID: AKIAZPOBIXUJJ434XT5T
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
TAP_UNIVERSAL_FILE_FILEPATH: derek-tap-filetesting/2023
TAP_UNIVERSAL_FILE_FILE_REGEX: ^airtravel\.csv$
TAP_UNIVERSAL_FILE_FILE_PATH: derek-tap-filetesting/2023
TAP_UNIVERSAL_FILE_FILE_REGEX: ^.*airtravel\.csv$
TAP_UNIVERSAL_FILE_PROTOCOL: s3
steps:
- uses: actions/checkout@v3
66 changes: 47 additions & 19 deletions README.md
@@ -22,7 +22,7 @@ pipx install git+https://github.com/MeltanoLabs/tap-universal-file.git
|:----------------------------|:--------:|:-------:|:------------|
| stream_name | False | file | The name of the stream that is output by the tap. |
| protocol | True | None | The protocol to use to retrieve data. Must be either `file` or `s3`. |
| filepath | True | None | The path to obtain files from. Example: `/foo/bar`. Or, for `protocol==s3`, use `s3-bucket-name` instead. |
| file_path | True | None | The path to obtain files from. Example: `/foo/bar`. Or, for `protocol==s3`, use `s3-bucket-name` instead. |
| file_regex | False | None | A regex pattern to only include certain files. Example: `.*\.csv`. |
| file_type | False | delimited | Must be one of `delimited`, `jsonl`, or `avro`. Indicates the type of file to sync, where `delimited` is for CSV/TSV files and similar. Note that *all* files will be read as that type, regardless of file extension. To only read from files with a matching file extension, appropriately configure `file_regex`. |
| compression | False | detect | The encoding used to decompress data. Must be one of `none`, `zip`, `bz2`, `gzip`, `lzma`, `xz`, or `detect`. If set to `none` or any encoding, that setting will be applied to *all* files, regardless of file extension. If set to `detect`, encodings will be applied based on file extension. |
@@ -51,7 +51,7 @@ pipx install git+https://github.com/MeltanoLabs/tap-universal-file.git
| batch_config.storage | False | None | Object containing information about how batch files should be stored. Has two child entries: `root` and `prefix`. |
| batch_config.encoding.format | False | None | Format to store batch files in. Example: `jsonl`. |
| batch_config.encoding.compression | False | None | Method with which to compress batch files. Example: `gzip`. |
| batch_config.storage.root | False | None | Location to store batch files. Examples: `file:///foo/bar`, `file://output`, `s3://bar/foo`. Note that the triple-slash is not a typo: it indicates an absolute filepath. |
| batch_config.storage.root | False | None | Location to store batch files. Examples: `file:///foo/bar`, `file://output`, `s3://bar/foo`. Note that the triple-slash is not a typo: it indicates an absolute file path. |
| batch_config.storage.prefix | False | None | Prepended to the names of all batch files. Example: `batch-`. |

A full list of supported settings and capabilities for this
@@ -61,11 +61,33 @@ tap is available by running:
tap-universal-file --about
```

### Additional S3 Dependency
### Regular Expressions

If you use `protocol=s3` and/or if you use batching to send data to S3, you will need to add the additional dependency `s3`. For example, you could update `meltano.yml` to have `pip_url: -e .[s3]`.
To allow configuration for which files are synced, this tap supports the use of regular expressions to match file paths. First, the tap will find the directory specified by the `file_path` config option. Then it will compare the provided regular expression to the full file path of each file in that directory.

### Sample Batching Config
To demonstrate this, consider the following directory structure and suppose that you want to sync only the file `apple.csv`.

```
.
└── top-level/
├── alpha/
└── bravo/
├── apple.csv
├── pineapple.csv
└── orange.csv
```

If you set `file_path` to be `/top-level/bravo` and you've set `file_regex` to be `^apple\.csv$`, you won't sync any files. That's because the regular expression you provide is compared against the string `"top-level/bravo/apple.csv"`. Instead, correct values for `file_regex` include `^.*\/apple\.csv$`, `^.*bravo\/apple\.csv$`, or `^top-level\/bravo\/apple\.csv$`. Alternatively, to sync both `apple.csv` and `pineapple.csv`, you could use `^.*\/(pine)?apple\.csv$`.
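
The matching behavior described above can be checked with a short Python sketch. The path list mirrors the example directory, and the helper `matching_paths` is hypothetical; using `re.search` is an assumption, since the tap's actual matching call is not shown here. Because every pattern is anchored with `^` and `$`, `re.match` would give the same results.

```python
import re

# Full paths as the tap would see them for the example directory.
paths = [
    "top-level/bravo/apple.csv",
    "top-level/bravo/pineapple.csv",
    "top-level/bravo/orange.csv",
]


def matching_paths(pattern: str, candidates: list[str]) -> list[str]:
    """Return the candidates whose full path matches the pattern (hypothetical helper)."""
    compiled = re.compile(pattern)
    return [path for path in candidates if compiled.search(path)]


# Anchoring on the bare file name fails: the regex sees the whole path.
name_only = matching_paths(r"^apple\.csv$", paths)

# Allowing any leading directories selects just apple.csv.
with_prefix = matching_paths(r"^.*\/apple\.csv$", paths)

# An optional "pine" prefix selects both apple.csv and pineapple.csv.
both_apples = matching_paths(r"^.*\/(pine)?apple\.csv$", paths)

print(name_only)
print(with_prefix)
print(both_apples)
```

Running a candidate `file_regex` through a check like this before a sync is a cheap way to catch patterns that are anchored too tightly.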

### Using S3

Some additional configuration is needed when using Amazon S3.

#### Additional Dependency

If you use `protocol==s3` and/or if you use batching to send data to S3, you will need to add the additional dependency `s3`. For example, you could update `meltano.yml` to have `pip_url: -e .[s3]`.

#### Sample Batching Config

Here is an example `meltano.yml` entry to configure batch files, and then the same sample configuration in JSON.
```yml
@@ -95,21 +117,9 @@ config:
}
```

### Incremental Replication

If this tap is provided a state or `start_date`, it assumes that incremental replication is desired, in which case only files most recently modified will be synced. Attempting to override this behavior in `meltano.yml` can cause unintended behavior due to this tap's use of state during the discovery process. Further note that this tap does not support incremental replication on any column other than `_sdc_last_modified`.

### Configure using environment variables

This Singer tap will automatically import any environment variables within the working directory's
`.env` if the `--config=ENV` is provided, such that config values will be considered if a matching
environment variable is set either in the terminal context or in the `.env` file.

### Source Authentication and Authorization

#### S3
#### Authentication and Authorization

If you use S3, either for fetching files or for batching, you will need to obtain an access key and secret from AWS IAM. Specifically, `protocol=s3` requires the ListBucket and GetObject permissions, and batching requires the PutObject permission.
If you use `protocol==s3` and/or if you use batching to send data to S3, you will need to obtain an access key and secret from AWS IAM. Specifically, `protocol==s3` requires the ListBucket and GetObject permissions, and batching requires the PutObject permission.

You can create a policy that grants the requisite permissions with the following JSON:

@@ -146,6 +156,24 @@ If you already have two access keys for an account, you will have to delete one
aws iam delete-access-key --user-name=YOUR_ACCOUNT_NAME --access-key-id=YOUR_ACCESS_KEY_ID
```

#### Subfolders

To sync a subfolder in S3, add it like you would add any other file path. For example, to sync all files in the `foo` subfolder of the S3 bucket named `bar-bucket`, set `file_path==bar-bucket/foo`.

### Incremental Replication

If this tap is provided a state or `start_date`, it assumes that incremental replication is desired, in which case only files most recently modified will be synced. Attempting to override this behavior in `meltano.yml` can cause unintended behavior due to this tap's use of state during the discovery process. Further note that this tap does not support incremental replication on any column other than `_sdc_last_modified`.

### Configure using environment variables

This Singer tap will automatically import any environment variables within the working directory's
`.env` if the `--config=ENV` is provided, such that config values will be considered if a matching
environment variable is set either in the terminal context or in the `.env` file.
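
As a rough illustration of which variable names are picked up, the sketch below derives the name for a top-level setting. The helper `env_var_name` is hypothetical, and the naming rule (upper-cased plugin name with dashes replaced by underscores, followed by the upper-cased setting name) is inferred from the workflow files in this commit, which set `TAP_UNIVERSAL_FILE_FILE_PATH` and `TAP_UNIVERSAL_FILE_FILE_REGEX`. Nested settings such as `batch_config.storage.root` are not covered.

```python
def env_var_name(setting: str, plugin_name: str = "tap-universal-file") -> str:
    """Build the environment variable name for a top-level setting.

    Hypothetical helper: the rule is inferred from the CI workflow files,
    not taken from the tap's source.
    """
    prefix = plugin_name.upper().replace("-", "_")
    return f"{prefix}_{setting.upper()}"


for setting in ("protocol", "file_path", "file_regex"):
    print(env_var_name(setting))
```

This also shows why the rename in this commit changes the variable from `TAP_UNIVERSAL_FILE_FILEPATH` to `TAP_UNIVERSAL_FILE_FILE_PATH`.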

### Source Authentication and Authorization

To authenticate or authorize using S3, see [S3 Authentication and Authorization](#authentication-and-authorization) above.

## Usage

You can easily run `tap-universal-file` by itself or in a pipeline using [Meltano](https://meltano.com/).
2 changes: 1 addition & 1 deletion meltano.yml
@@ -18,7 +18,7 @@ plugins:
settings:
- name: stream_name
- name: protocol
- name: filepath
- name: file_path
- name: file_regex
- name: file_type
- name: compression
4 changes: 2 additions & 2 deletions tap_universal_file/files.py
@@ -97,7 +97,7 @@ def get_files(

file_dict_list = []

for file_path in self.filesystem.find(self.config["filepath"]):
for file_path in self.filesystem.find(self.config["file_path"]):
file = self.filesystem.info(file_path)
if (
file["type"] == "directory" # Ignore nested folders.
@@ -135,7 +135,7 @@ def get_files(

if none_found:
msg = (
"No files found. Choose a different `filepath` or try a more lenient "
"No files found. Choose a different `file_path` or try a more lenient "
"`file_regex`."
)
raise RuntimeError(msg)
2 changes: 1 addition & 1 deletion tap_universal_file/tap.py
@@ -63,7 +63,7 @@ class TapUniversalFile(Tap):
),
),
th.Property(
"filepath",
"file_path",
th.StringType,
required=True,
description=(
6 changes: 3 additions & 3 deletions tests/test_core.py
@@ -19,14 +19,14 @@ def data_dir() -> str:
"""Gets the directory in tests/data where data is stored.
Returns:
A str representing a filepath to tests/data.
A str representing a file path to tests/data.
"""
return str(Path(__file__).parent / Path("./data"))


base_file_config = {
"protocol": "file",
"filepath": data_dir(),
"file_path": data_dir(),
}


@@ -134,7 +134,7 @@ def test_avro_execution():
def test_s3_execution():
s3_config = {
"protocol": "s3",
"filepath": "derek-tap-filetesting/2023",
"file_path": "derek-tap-filetesting/2023",
"file_regex": "^.*airtravel\\.csv$",
}
execute_tap(s3_config)
