enable .json formats for uploading and metadata editing #443

Open. Wants to merge 1 commit into master.
internetarchive/cli/ia_metadata.py (8 changes: 6 additions & 2 deletions)

@@ -66,6 +66,7 @@
 from internetarchive.cli.argparser import get_args_dict, get_args_dict_many_write,\
     get_args_header_dict
 from internetarchive.exceptions import ItemLocateError
+from internetarchive.utils import (is_json)
Contributor:
The parentheses should be removed.


# Only import backports.csv for Python2 (in support of FreeBSD port).
PY2 = sys.version_info[0] == 2
@@ -270,8 +271,11 @@ def main(argv, session):
if args['--spreadsheet']:
if not args['--priority']:
args['--priority'] = -5
-with io.open(args['--spreadsheet'], 'rU', newline='', encoding='utf-8') as csvfp:
-    spreadsheet = csv.DictReader(csvfp)
+with io.open(args['--spreadsheet'], 'rU', newline='', encoding='utf-8') as fp:
+    if is_json(args['--spreadsheet']):
+        spreadsheet = json.load(fp)
+    else:
+        spreadsheet = csv.DictReader(fp)
responses = []
for row in spreadsheet:
if not row['identifier']:
internetarchive/cli/ia_upload.py (10 changes: 7 additions & 3 deletions)

@@ -78,7 +78,7 @@
from internetarchive.cli.argparser import get_args_dict, convert_str_list_to_unicode
from internetarchive.session import ArchiveSession
from internetarchive.utils import (InvalidIdentifierException, get_s3_xml_text,
-    is_valid_metadata_key, validate_s3_identifier)
+    is_valid_metadata_key, validate_s3_identifier, is_json)
Contributor:
Move the new import into its proper alphabetical position (between get_s3_xml_text and is_valid_metadata_key), while minding the maximum line length of 79 characters.

Contributor:
Actually, 102 in internetarchive:

max-line-length = 102

Owner:
In the past I've shot for 90 chars max per line, personally. If 88 is the tendency, I'm happy to change to that! Much less than that tends to frustrate me, though (it creates a lot of weird multi-line statements).


# Only import backports.csv for Python2 (in support of FreeBSD port).
PY2 = sys.version_info[0] == 2
@@ -260,8 +260,12 @@ def main(argv, session):
# Bulk upload using spreadsheet.
else:
# Use the same session for each upload request.
-with io.open(args['--spreadsheet'], 'rU', newline='', encoding='utf-8') as csvfp:
-    spreadsheet = csv.DictReader(csvfp)
+
+with io.open(args['--spreadsheet'], 'rU', newline='', encoding='utf-8') as fp:
+    if is_json(args['--spreadsheet']):
+        spreadsheet = json.load(fp)
+    else:
+        spreadsheet = csv.DictReader(fp)
Comment on lines +265 to +268
Contributor:

It is strange to call a dict a spreadsheet.
Does the for row in spreadsheet: logic below really work for all .json files?

Contributor:

The csv.DictReader is an iterator producing a dict for each row in the CSV file. You only get dicts in the loop below.
All that means is that the JSON file must have a similar structure, i.e. an array of objects:

[
	{"identifier": "item", "file": "./foobar"},
	{"file": "./foobaz"}
]

This needs test coverage though and an example in the documentation.
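As a quick illustration (my own sketch, not code from the PR): a csv.DictReader and a parsed JSON array of objects both yield one dict per row, so the downstream loop works unchanged. One caveat worth covering in those tests: DictReader fills in every field name for every row, while a JSON object can omit a key entirely, so row['identifier'] would raise KeyError for the keyless second row in the example above.

```python
import csv
import io
import json

# Equivalent inputs: a two-row CSV and a JSON array of objects.
csv_text = "identifier,file\nitem,./foobar\n,./foobaz\n"
json_text = '[{"identifier": "item", "file": "./foobar"}, {"identifier": "", "file": "./foobaz"}]'

csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
json_rows = json.loads(json_text)

# Both produce the same list of dicts, so `for row in spreadsheet:` works for either.
assert csv_rows == json_rows

# Caveat: a JSON row that omits the key entirely (as in the example above)
# has no 'identifier' entry at all, unlike a DictReader row.
keyless_row = {"file": "./foobaz"}
assert "identifier" not in keyless_row  # row['identifier'] would raise KeyError
```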

prev_identifier = None
for row in spreadsheet:
for metadata_key in row:
internetarchive/utils.py (5 changes: 5 additions & 0 deletions)

@@ -398,3 +398,8 @@ def merge_dictionaries(dict0, dict1, keys_to_drop=None):
# Items from `dict1` take precedence over items from `dict0`.
new_dict.update(dict1)
return new_dict


def is_json(filename):
Contributor:
The name should most likely be changed to has_json_extension, since it does not inspect and verify the content of the file and could therefore lead to future confusion.

Contributor:
Now that we have dropped legacy Python, let's add type hints on the function parameters and return type.
Also, given that import os is already done above, let's leverage that.

def is_json(file_path: str) -> bool:
    return os.path.splitext(file_path)[1] == ".json"

https://docs.python.org/3/library/os.path.html#os.path.splitext

Of course, my favorite would be from pathlib import Path above and then:

def is_json(file_path: str) -> bool:
    return Path(file_path).suffix == ".json"

https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.suffix

Contributor:

I don't see any value in doing either of those over file_path.endswith('.json') in this case. They're exactly equivalent, just with extra machinery.

Owner (@jjjake, Apr 14, 2022):

I'd love to make more use of pathlib now that we're PY3 only. With that in mind, it might be worth using it here vs. the simplified file_path.endswith('.json') for the sake of consistency. I'm all for type hints, sounds great.

I hear your argument against it though, too, @JustAnotherArchivist. I'm fine with either, but would lean towards cclauss's suggestion above.

Contributor:

In my opinion the os.path.splitext solution is clearly inferior to path.endswith.

Yet if we managed to change to paths where possible and, over time, consistently managed to use pathlib, that would be pretty nice, e.g. if functions where it made sense would already take a path as input instead of a string.
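For comparison, a small self-contained sketch (function names are mine) of the three variants discussed above. They agree on ordinary file names; the one edge case where they diverge is a bare dotfile named '.json', which endswith matches but splitext and pathlib treat as a stem with no suffix.

```python
import os
from pathlib import Path

def is_json_endswith(file_path: str) -> bool:
    return file_path.endswith('.json')

def is_json_splitext(file_path: str) -> bool:
    return os.path.splitext(file_path)[1] == '.json'

def is_json_pathlib(file_path: str) -> bool:
    return Path(file_path).suffix == '.json'

# All three agree on ordinary file names (case-sensitive, last suffix only).
for name in ('rows.json', 'rows.csv', 'rows.JSON', 'archive.json.bak'):
    assert is_json_endswith(name) == is_json_splitext(name) == is_json_pathlib(name)

# Edge case: splitext and pathlib treat a leading dot as part of the stem,
# so a dotfile named '.json' has no suffix; endswith still matches it.
assert is_json_endswith('.json')
assert not is_json_splitext('.json')
assert not is_json_pathlib('.json')
```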

# Checks whether filetype is .json, else fallsback to .csv file format
Contributor:
"falls back" should be written as two words.

return bool(re.fullmatch('^.*\.json$', filename))
Contributor (@maxz, Oct 28, 2021):
Use filename.endswith('.json') instead of a regex. The bool will not be required either.

Contributor:
@maxz In this instance, the bool() is appropriate because a one-line utility is_xyz() function should return True or False. Without bool() the function could return two different data types which is a bit confusing for the caller.

>>> import re
>>> type(re.fullmatch('^.*\.json$', "filename"))
<class 'NoneType'>
>>> type(re.fullmatch('^.*\.json$', "filename.json"))
<class 're.Match'>

Contributor:

@cclauss I'm pretty sure they meant that the bool isn't necessary with str.endswith.
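To make the type difference concrete, a short sketch (my own, not from the diff): the regex form needs bool() because re.fullmatch returns a Match object or None, while str.endswith already returns a bool, so the wrapper becomes unnecessary.

```python
import re

def is_json_regex(filename: str) -> bool:
    # Without bool(), this would return re.Match or None rather than True/False.
    return bool(re.fullmatch(r'^.*\.json$', filename))

def is_json_endswith(filename: str) -> bool:
    # str.endswith already returns a bool; no wrapper needed.
    return filename.endswith('.json')

assert is_json_regex('item.json') is True
assert is_json_endswith('item.json') is True
assert is_json_regex('item.csv') is False
assert is_json_endswith('item.csv') is False
```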