enable .json formats for uploading and metadata editing #443

Open
wants to merge 1 commit into master
Conversation

laptopsftw

It's a small edit of mine: I was having problems uploading a file since it had commas in it. Instead of fixing that with double quotes, I just tried editing the code. Just a small change.

What it does is check for *.json in the filename; otherwise the file is treated as *.csv. That's pretty much it.

If the file ends in *.json it will be processed as JSON, else as *.csv.

def is_json(filename):
    # Checks whether filetype is .json, else fallsback to .csv file format
    return bool(re.fullmatch('^.*\.json$', filename))
Contributor

@maxz maxz Oct 28, 2021

Use filename.endswith('.json') instead of a regex. The bool will not be required either.
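A minimal sketch of that suggestion (hypothetical code, not part of the patch):

```python
def is_json(filename):
    # str.endswith already returns a bool, so no bool() wrapper is needed.
    return filename.endswith('.json')
```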

Contributor

@maxz In this instance, the bool() is appropriate because a one-line utility is_xyz() function should return True or False. Without bool() the function could return two different data types which is a bit confusing for the caller.

>>> import re
>>> type(re.fullmatch('^.*\.json$', "filename"))
<class 'NoneType'>
>>> type(re.fullmatch('^.*\.json$', "filename.json"))
<class 're.Match'>

Contributor

@cclauss I'm pretty sure they meant that the bool isn't necessary with str.endswith.



def is_json(filename):
    # Checks whether filetype is .json, else fallsback to .csv file format
Contributor

"falls back" should be written as two words.

@@ -66,6 +66,7 @@
from internetarchive.cli.argparser import get_args_dict, get_args_dict_many_write,\
    get_args_header_dict
from internetarchive.exceptions import ItemLocateError
from internetarchive.utils import (is_json)
Contributor

The parentheses should be removed.

@@ -78,7 +78,7 @@
from internetarchive.cli.argparser import get_args_dict, convert_str_list_to_unicode
from internetarchive.session import ArchiveSession
from internetarchive.utils import (InvalidIdentifierException, get_s3_xml_text,
                                   is_valid_metadata_key, validate_s3_identifier)
                                   is_valid_metadata_key, validate_s3_identifier, is_json)
Contributor

Move the new import into its proper alphabetical position (between get_s3_xml_text and is_valid_metadata_key)
while minding the proper maximum line length of 79 characters.

Contributor

Actually, 102 in internetarchive:

max-line-length = 102
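For context, a flake8 line-length setting like that would typically live in the project's setup.cfg or tox.ini (the exact file in this repository is an assumption):

```ini
[flake8]
max-line-length = 102
```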

Owner

In the past I've shot for 90 chars max per line, personally. If 88 is the tendency, I'm happy to change to that! Much less than that tends to frustrate me, though (it creates a lot of weird multi-line statements).

@@ -398,3 +398,8 @@ def merge_dictionaries(dict0, dict1, keys_to_drop=None):
    # Items from `dict1` take precedence over items from `dict0`.
    new_dict.update(dict1)
    return new_dict


def is_json(filename):
Contributor

The name should most likely be changed to has_json_extension since it does not inspect and verify the content of a file and therefore could lead to future confusion.

Contributor

Now that we have dropped legacy Python, let's add type hints on the function parameters and return type.
Also, given that import os is already done above, let's leverage that.

def is_json(file_path: str) -> bool:
    return os.path.splitext(file_path)[1] == ".json"

https://docs.python.org/3/library/os.path.html#os.path.splitext

Of course, my favorite would be from pathlib import Path above and then:

def is_json(file_path: str) -> bool:
    return Path(file_path).suffix == ".json"

https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.suffix

Contributor

I don't see any value in doing either of those over file_path.endswith('.json') in this case. They're exactly equivalent, just with extra machinery.

Owner

@jjjake jjjake Apr 14, 2022

I'd love to make more use of pathlib now that we're PY3 only. With that in mind, it might be worth using it here vs. the simplified file_path.endswith('.json') for the sake of consistency. I'm all for type hints, sounds great.

I hear your argument against it though, too, @JustAnotherArchivist. I'm fine with either, but would lean towards cclauss's suggestion above.

Contributor

In my opinion the os.path.splitext solution is clearly inferior to file_path.endswith.

Yet if we managed to change to paths where possible and, over time, consistently managed to use pathlib, that would be pretty nice, e.g. if functions where it made sense already took a path as input instead of a string.
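A sketch of what such a path-friendly helper could look like, combining the has_json_extension rename suggested earlier with pathlib (hypothetical code, not from the patch):

```python
from pathlib import Path
from typing import Union


def has_json_extension(file_path: Union[str, Path]) -> bool:
    # Path() accepts both str and Path, so callers can pass either type.
    return Path(file_path).suffix == '.json'
```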

@maxz
Contributor

maxz commented Oct 28, 2021

In addition to the previously suggested changes, you should definitely amend the first line of your commit to something more descriptive than First Commit.

I also don't really like these changes, since a JSON file is not a spreadsheet. So the functionality should probably be separate from --spreadsheet and e.g. have a --json equivalent.

@jjjake
Owner

jjjake commented Nov 1, 2021

Thanks for the contribution @laptopsftw, and thanks for reviewing @maxz!

I agree with all of @maxz's suggestions. Happy to merge once these are resolved. Thanks again.


Comment on lines +265 to +268
if is_json(args['--spreadsheet']):
    spreadsheet = json.load(fp)
else:
    spreadsheet = csv.DictReader(fp)
Contributor

It is strange to call a dict a spreadsheet.
Does the for row in spreadsheet: logic below really work for all .json files?

Contributor

The csv.DictReader is an iterator producing a dict for each row in the CSV file. You only get dicts in the loop below.
All that means is that the JSON file must have a similar structure, i.e. an array of objects:

[
	{"identifier": "item", "file": "./foobar"},
	{"file": "./foobaz"}
]

This needs test coverage though and an example in the documentation.
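A rough sketch of that test coverage (pytest-style; names and data are illustrative, not from the repository):

```python
import csv
import io
import json

# A JSON array of objects, mirroring the rows a CSV spreadsheet would yield.
JSON_TEXT = '[{"identifier": "item", "file": "./foobar"}, {"file": "./foobaz"}]'
CSV_TEXT = 'identifier,file\nitem,./foobar\n,./foobaz\n'


def load_rows(fp, from_json):
    # Both branches must produce an iterable of dicts for `for row in spreadsheet:`.
    return list(json.load(fp)) if from_json else list(csv.DictReader(fp))


def test_json_rows_are_dicts():
    rows = load_rows(io.StringIO(JSON_TEXT), from_json=True)
    assert rows[0] == {"identifier": "item", "file": "./foobar"}
    # Note: JSON may omit keys entirely, while csv.DictReader fills them with "".
    assert rows[1] == {"file": "./foobaz"}


def test_csv_rows_are_dicts():
    rows = load_rows(io.StringIO(CSV_TEXT), from_json=False)
    assert rows[0] == {"identifier": "item", "file": "./foobar"}
    assert rows[1] == {"identifier": "", "file": "./foobaz"}
```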

@maxz
Contributor

maxz commented Apr 14, 2022

Frankly, I do not understand what motivated the recent review activity on this pull request.

To me it seems like @laptopsftw does not have much interest in getting this pull request integrated, or else they would at least have let us know that they are working on the changes (unless something major came up in their life which hindered them).

@laptopsftw
Author

laptopsftw commented Apr 17, 2022

Frankly, I do not understand what motivated the recent review activity on this pull request.

To me it seems like @laptopsftw does not have much interest in getting this pull request integrated, or else they would at least have let us know that they are working on the changes (unless something major came up in their life which hindered them).

Honestly, I was still on the fence about actually integrating it into the mainline, since .csv was already a good workaround for this and it was just me being unfamiliar with the file format.

As for me, I have a separate local fork working for my case. I would try to open a new pull request if I start working with 3.0.0 (I'm still uploading my files with 2.1.0 at the moment).

@maxz
Contributor

maxz commented Apr 17, 2022

Good to know, thanks for your reply.
