Add JSONPath parser support #44

Open: wants to merge 2 commits into base: main

14 changes: 13 additions & 1 deletion README.md
@@ -114,7 +114,7 @@ Each object in the 'tables' array describes one or more CSV or Excel spreadsheet
 - **worksheet_name**: (optional) the worksheet name to pull from in the targeted xls file(s). Only required when format is excel
 - **delimiter**: (optional) the delimiter to use when format is 'csv'. Defaults to a comma ',' but you can set delimiter to 'detect' to leverage the csv "Sniffer" for auto-detecting delimiter.
 - **quotechar**: (optional) the character used to surround values that may contain delimiters - defaults to a double quote '"'
-- **json_path**: (optional) the JSON key under which the list of objets to use is located. Defaults to None, corresponding to an array at the top level of the JSON tree.
+- **json_path**: (optional) the JSON key under which the list of objects to use is located (corresponding to an array at the top level of the JSON tree) or [JSONPath](https://pypi.org/project/jsonpath-ng/) (should return array of objects, could be tested on (https://jsonpath.com)). Defaults to None.

 ### Automatic Config Generation

@@ -152,6 +152,18 @@ JSON files are expected to parse as a root-level array of objects where each obj
]
```

JSONPath could be used to parse deep nested array of objects, i.e., `json_path: response.data[*]` could be used to parse the following JSON file:
```json
{
  "response": {
    "data": [
      { "name": "row one", "key": 42 },
      { "name": "row two", "key": 43 }
    ]
  }
}
```

### JSONL (JSON Lines) support

JSONL files are expected to parse as one object per line, where each row in a file is a set of key-value pairs.
3 changes: 2 additions & 1 deletion setup.py
@@ -19,7 +19,8 @@
         'openpyxl',
         'xlrd',
         'paramiko',
-        'azure-storage-blob>=12.14.0'
+        'azure-storage-blob>=12.14.0',
+        'jsonpath-ng>=1.5.3'
     ],
     entry_points="""
     [console_scripts]
11 changes: 6 additions & 5 deletions tap_spreadsheets_anywhere/json_handler.py
@@ -1,4 +1,5 @@
 import json
+from jsonpath_ng.ext import parse
 import re
 from json import JSONDecodeError
 import logging
@@ -25,8 +26,12 @@ def get_row_iterator(table_spec, reader):
     try:
         json_array = json.load(reader)
         json_path = table_spec.get('json_path', None)
+
         if json_path is not None:
-            json_array = json_array[json_path]
+            if json_path in json_array:
+                json_array = json_array[json_path]
+            else:
+                return generator_wrapper(match.value for match in parse(json_path).find(json_array))

         # throw a TypeError if the root json object can not be iterated
         return generator_wrapper(iter(json_array))
@@ -39,7 +44,3 @@ def get_row_iterator(table_spec, reader):
             return generator_wrapper(json_objects)
         else:
             raise jde
-
-
-
-
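
The lookup order in the hunk above (literal top-level key first, path expression second) can be tried out without the tap itself. This is a minimal sketch, not the PR's code: `select_rows` and `resolve_dotted` are hypothetical names, and `resolve_dotted` is a deliberately tiny stand-in for `jsonpath-ng` that only understands dotted keys with an optional trailing `[*]`, so the example runs on the standard library alone.

```python
import json
from io import StringIO


def resolve_dotted(document, path):
    """Hypothetical stand-in for a JSONPath engine.

    Understands only dotted keys with an optional trailing "[*]",
    e.g. "response.data[*]". Real JSONPath supports far more.
    """
    if path.endswith("[*]"):
        path = path[: -len("[*]")]
    node = document
    for key in path.split("."):
        node = node[key]  # raises KeyError if a segment is missing
    return node


def select_rows(reader, json_path=None):
    """Mirror the hunk's precedence: top-level key first, then path lookup."""
    document = json.load(reader)
    if json_path is None:
        return iter(document)  # root-level array
    if json_path in document:
        return iter(document[json_path])  # literal top-level key wins
    return iter(resolve_dotted(document, json_path))


nested = StringIO('{"someKey": [{"k": "v"}]}')
deep = StringIO('{"response": {"data": [{"k": "v"}, {"k": "w"}]}}')
print(list(select_rows(nested, "someKey")))
print(list(select_rows(deep, "response.data[*]")))
```

Because the literal-key check runs first, a top-level key whose name happens to look like a path expression is treated as a key, which keeps existing `json_path` configurations working unchanged.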




18 changes: 12 additions & 6 deletions tap_spreadsheets_anywhere/test/test_json.py
Collaborator

Can these test cases be added as net-new test cases, rather than updating existing tests?

Contributor Author

I decided to clean the tests up a bit to remove some confusion, and I also added one new test for the JSONPath case at the end of the file.

All of the tests call a single function, json_handler.get_row_iterator, which reads only one option (json_path) from the configuration. The table specs dict contained an Excel config, which is confusing in a JSON handler test, so I updated it and regrouped the specs and their related tests to make the relationship clearer. I still missed the "badnewlines" name of the first table spec, though (:

If you consider changing old tests bad practice, I can roll back my changes and add the new tests on top of the old ones.
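
One way to make regrouping like this less fragile is to look specs up by name instead of by list position, since the tests in this diff select specs with `TEST_TABLE_SPEC['tables'][n]`. The sketch below is only an illustration under assumptions: `SPECS_BY_NAME` is a hypothetical helper that is not part of the PR, and the specs are trimmed copies of the ones in the diff.

```python
# Trimmed copies of the table specs from the diff; only the fields this
# sketch needs are kept.
TEST_TABLE_SPEC = {
    "tables": [
        {"name": "badnewlines", "format": "detect"},
        {"name": "nestedlist", "json_path": "someKey", "format": "detect"},
        {"name": "deepnestedlist", "json_path": "response.data[*]", "format": "detect"},
    ]
}

# Hypothetical helper mapping: spec name -> spec dict.
SPECS_BY_NAME = {spec["name"]: spec for spec in TEST_TABLE_SPEC["tables"]}

# Reordering or inserting specs no longer changes which spec a test gets.
nested_spec = SPECS_BY_NAME["nestedlist"]
deep_spec = SPECS_BY_NAME["deepnestedlist"]
print(nested_spec["json_path"])
print(deep_spec["json_path"])
```

This only works when spec names are unique; with duplicate names the later spec silently replaces the earlier one in the mapping.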

Contributor Author

@TyShkan TyShkan Mar 30, 2023

I've managed to build an environment with working dependencies and run the tests. Some of the old tests fail, but the failures are not caused by the changes in this pull request; they are related to Excel files.

Contributor Author

A clean test branch, converted to Poetry and with a GitHub workflow that runs the tests on different Python versions, is committed here: https://github.com/TyShkan/tap-spreadsheets-anywhere/commits/poetry

@@ -9,27 +9,27 @@
         {
             "path": "file://./tap_spreadsheets_anywhere/test",
             "name": "badnewlines",
-            "pattern": ".*\\.xlsx",
+            "pattern": ".*\\.json",
             "start_date": "2017-05-01T00:00:00Z",
             "key_properties": [],
-            "format": "excel",
-            "worksheet_name": "sample_with_bad_newlines"
+            "format": "detect"
         },
         {
             "path": "file://./tap_spreadsheets_anywhere/test",
-            "name": "badnewlines",
+            "name": "nestedlist",
             "pattern": ".*\\.json",
             "start_date": "2017-05-01T00:00:00Z",
             "key_properties": [],
+            "json_path": "someKey",
             "format": "detect"
         },
         {
             "path": "file://./tap_spreadsheets_anywhere/test",
-            "name": "nestedlist",
+            "name": "deepnestedlist",
             "pattern": ".*\\.json",
             "start_date": "2017-05-01T00:00:00Z",
             "key_properties": [],
-            "json_path": "someKey",
+            "json_path": "response.data[*]",
             "format": "detect"
         }
     ]
@@ -48,6 +48,12 @@ def test_json_object_lists(self):

     def test_json_nested_array(self):
         reader = StringIO('{"someKey": [{"k":"v"},{"k":"v"},{"k":"v"}]}')
+        iterator = json_handler.get_row_iterator(TEST_TABLE_SPEC['tables'][1], reader)
+        for row in iterator:
+            self.assertEqual(row['k'], 'v')
+
+    def test_json_deep_nested_array(self):
+        reader = StringIO('{"response": {"data": [{"k":"v"},{"k":"v"},{"k":"v"}]}}')
         iterator = json_handler.get_row_iterator(TEST_TABLE_SPEC['tables'][2], reader)
         for row in iterator:
             self.assertEqual(row['k'], 'v')
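
The tests above assert inside a `for` loop, which passes without checking anything if the iterator yields no rows (for example, if a path expression matches nothing). A possible tightening, sketched here under assumptions, is to materialize the rows and check their count as well. `assert_rows` is a hypothetical helper, and the plain `iter(...)` calls stand in for the iterator returned by `json_handler.get_row_iterator`.

```python
import unittest


def assert_rows(test_case, iterator, expected_count):
    """Hypothetical helper: check row contents and that rows exist at all."""
    rows = list(iterator)  # materialize so the count can be checked
    test_case.assertEqual(len(rows), expected_count)
    for row in rows:
        test_case.assertEqual(row["k"], "v")


class RowCountExample(unittest.TestCase):
    def test_three_rows(self):
        # Stand-in for the iterator returned by json_handler.get_row_iterator.
        iterator = iter([{"k": "v"}, {"k": "v"}, {"k": "v"}])
        assert_rows(self, iterator, 3)

    def test_empty_iterator_is_caught(self):
        # A bare for-loop assertion would pass here; the count check fails.
        with self.assertRaises(AssertionError):
            assert_rows(self, iter([]), 3)
```

Saved as a test module, this can be run with `python -m unittest`.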