enable .json formats for uploading and metadata editing #443

Open
wants to merge 1 commit into master
Conversation

laptopsftw

It's a small edit of mine: I was having problems uploading a file since it had commas in it. Instead of fixing that with double quotes, I just tried editing the code. Just a small change.

What it does is check for *.json in the filename; otherwise the file is treated as *.csv. That's pretty much it.

If the file ends in *.json it will be processed as JSON, else as *.csv.

def is_json(filename):
    # Checks whether filetype is .json, else fallsback to .csv file format
    return bool(re.fullmatch('^.*\.json$', filename))
Contributor

@maxz maxz Oct 28, 2021

Use filename.endswith('.json') instead of a regex. The bool will not be required either.
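A minimal sketch of that suggestion (hypothetical code, not part of the patch):

```python
def is_json(filename):
    # str.endswith already returns a bool, so no bool() wrapper is needed.
    return filename.endswith('.json')
```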

Contributor

@maxz In this instance, the bool() is appropriate because a one-line utility is_xyz() function should return True or False. Without bool() the function could return two different data types which is a bit confusing for the caller.

>>> import re
>>> type(re.fullmatch('^.*\.json$', "filename"))
<class 'NoneType'>
>>> type(re.fullmatch('^.*\.json$', "filename.json"))
<class 're.Match'>

Contributor

@cclauss I'm pretty sure they meant that the bool isn't necessary with str.endswith.



def is_json(filename):
    # Checks whether filetype is .json, else fallsback to .csv file format
Contributor

"falls back" should be written as two words.

@@ -66,6 +66,7 @@
from internetarchive.cli.argparser import get_args_dict, get_args_dict_many_write,\
    get_args_header_dict
from internetarchive.exceptions import ItemLocateError
from internetarchive.utils import (is_json)
Contributor

The parentheses should be removed.

@@ -78,7 +78,7 @@
from internetarchive.cli.argparser import get_args_dict, convert_str_list_to_unicode
from internetarchive.session import ArchiveSession
from internetarchive.utils import (InvalidIdentifierException, get_s3_xml_text,
                                   is_valid_metadata_key, validate_s3_identifier)
                                   is_valid_metadata_key, validate_s3_identifier, is_json)
Contributor

Move the new import into its proper alphabetical position (between get_s3_xml_text and is_valid_metadata_key)
while minding the proper maximum line length of 79 characters.

Contributor

Actually, 102 in internetarchive:

max-line-length = 102
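For context, a flake8 line-length setting like that would typically live in the project's setup.cfg or tox.ini (the exact file in this repository is an assumption):

```ini
[flake8]
max-line-length = 102
```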

Owner

In the past I've shot for 90 chars max per line, personally. If 88 is the tendency, I'm happy to change to that! Much less than that tends to frustrate me, though (it creates a lot of weird multi-line statements).

@@ -398,3 +398,8 @@ def merge_dictionaries(dict0, dict1, keys_to_drop=None):
    # Items from `dict1` take precedence over items from `dict0`.
    new_dict.update(dict1)
    return new_dict


def is_json(filename):
Contributor

The name should most likely be changed to has_json_extension since it does not inspect and verify the content of a file and therefore could lead to future confusion.

Contributor

Now that we have dropped legacy Python, let's add type hints on the function parameters and return type.
Also, given that import os is already done above, let's leverage that.

def is_json(file_path: str) -> bool:
    return os.path.splitext(file_path)[1] == ".json"

https://docs.python.org/3/library/os.path.html#os.path.splitext

Of course, my favorite would be from pathlib import Path above and then:

def is_json(file_path: str) -> bool:
    return Path(file_path).suffix == ".json"

https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.suffix

Contributor

I don't see any value in doing either of those over file_path.endswith('.json') in this case. They're exactly equivalent, just with extra machinery.

Owner

@jjjake jjjake Apr 14, 2022

I'd love to make more use of pathlib now that we're PY3 only. With that in mind, it might be worth using it here vs. the simplified file_path.endswith('.json') for the sake of consistency. I'm all for type hints, sounds great.

I hear your argument against it though, too, @JustAnotherArchivist. I'm fine with either, but would lean towards cclauss's suggestion above.

Contributor

In my opinion the os.path.splitext solution is clearly inferior to file_path.endswith.

Yet if we managed to change to paths where possible and, over time, consistently managed to use pathlib, that would be pretty nice, e.g. if functions where it made sense already took a path as input instead of a string.
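A sketch of what such a path-friendly helper could look like, combining the has_json_extension rename suggested earlier with pathlib (hypothetical code, not from the patch):

```python
from pathlib import Path
from typing import Union


def has_json_extension(file_path: Union[str, Path]) -> bool:
    # Path() accepts both str and Path, so callers can pass either type.
    return Path(file_path).suffix == '.json'
```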

@maxz
Contributor

maxz commented Oct 28, 2021

In addition to the previously suggested changes, you should definitely amend the first line of your commit to something more descriptive than First Commit.

I also don't really like these changes, since a JSON file is not a spreadsheet. So the functionality should probably be separate from --spreadsheet and e.g. have a --json equivalent.

@jjjake
Owner

jjjake commented Nov 1, 2021

Thanks for the contribution @laptopsftw, and thanks for reviewing @maxz!

I agree with all of @maxz's suggestions. Happy to merge once these are resolved. Thanks again.


Comment on lines +265 to +268
if is_json(args['--spreadsheet']):
    spreadsheet = json.load(fp)
else:
    spreadsheet = csv.DictReader(fp)
Contributor

It is strange to call a dict a spreadsheet.
Does the for row in spreadsheet: logic below really work for all .json files?

Contributor

The csv.DictReader is an iterator producing a dict for each row in the CSV file. You only get dicts in the loop below.
All that means is that the JSON file must have a similar structure, i.e. an array of objects:

[
	{"identifier": "item", "file": "./foobar"},
	{"file": "./foobaz"}
]

This needs test coverage though and an example in the documentation.
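A rough sketch of that test coverage (pytest-style; names and data are illustrative, not from the repository):

```python
import csv
import io
import json

# A JSON array of objects, mirroring the rows a CSV spreadsheet would yield.
JSON_TEXT = '[{"identifier": "item", "file": "./foobar"}, {"file": "./foobaz"}]'
CSV_TEXT = 'identifier,file\nitem,./foobar\n,./foobaz\n'


def load_rows(fp, from_json):
    # Both branches must produce an iterable of dicts for `for row in spreadsheet:`.
    return list(json.load(fp)) if from_json else list(csv.DictReader(fp))


def test_json_rows_are_dicts():
    rows = load_rows(io.StringIO(JSON_TEXT), from_json=True)
    assert rows[0] == {"identifier": "item", "file": "./foobar"}
    # Note: JSON may omit keys entirely, while csv.DictReader fills them with "".
    assert rows[1] == {"file": "./foobaz"}


def test_csv_rows_are_dicts():
    rows = load_rows(io.StringIO(CSV_TEXT), from_json=False)
    assert rows[0] == {"identifier": "item", "file": "./foobar"}
    assert rows[1] == {"identifier": "", "file": "./foobaz"}
```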

@maxz
Contributor

maxz commented Apr 14, 2022

Frankly, I do not understand what motivated the recent review activity on this pull request.

To me it seems like @laptopsftw does not have much interest in getting this pull request integrated, or else they would at least have let us know that they are working on the changes (unless something major came up in their life which hindered them).

@laptopsftw
Author

laptopsftw commented Apr 17, 2022

Frankly, I do not understand what motivated the recent review activity on this pull request.

To me it seems like @laptopsftw does not have much interest in getting this pull request integrated, or else they would at least have let us know that they are working on the changes (unless something major came up in their life which hindered them).

Honestly, I was still on the fence about actually integrating it into the mainline, since .csv was already a good workaround for this and it was just me being unfamiliar with the file format.

As for me, I have a separate local fork working for my case. I would try to open a new pull request if I start working with 3.0.0 (I'm still uploading my files with 2.1.0 at the moment).

@maxz
Contributor

maxz commented Apr 17, 2022

Good to know, thanks for your reply.
