If ORCID null, don't include in datapackage #3005

Closed
e-belfer opened this issue Nov 2, 2023 · 4 comments · Fixed by catalyst-cooperative/pudl-archiver#486
Labels

  • bug: Things that are just plain broken.
  • community
  • datapkg: Frictionless data package input, output, metadata, manipulation
  • metadata: Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages.
  • zenodo: Issues having to do with Zenodo data archiving and retrieval.

Comments

@e-belfer
Member

e-belfer commented Nov 2, 2023

Describe the bug

Datapackages in the pudl-archiver repository are generated using methods imported from the pudl repository.

  • The problem: In datapackage.json files, every contributor who doesn't have an ORCID ID gets an orcid: null entry.
  • Where is this problem taking place? orcid is a field of the Contributor class in pudl.metadata.classes.py, and can be either a string or None. For most contributors, this field is not provided in the CONTRIBUTORS dictionary in pudl.metadata.sources.py. When we initialize the Contributor class using the Contributor.from_id() method in pudl.metadata.classes.py, it produces a dictionary in which orcid has a null value.
  • Where are datapackages being generated? The method used to create a new datapackage in the pudl-archiver repository (in pudl_archiver.frictionless.py) is DataSource.from_id() in pudl.metadata.classes.py. The Contributor class is also called directly in the pudl-archiver repository, so the fix will need to be made in both the Contributor and the DataSource classes.
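
The behavior described above can be illustrated without PUDL installed. This is only a minimal sketch using the standard library: `ContributorSketch` and `to_datapackage_dict` are hypothetical names made up for illustration, not the real `Contributor` class or its API. It shows how an optional field left unset ends up as `null` in JSON, and one possible way to leave it out instead.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ContributorSketch:
    """Hypothetical stand-in for a contributor record (not PUDL's Contributor)."""

    title: str
    email: Optional[str] = None
    orcid: Optional[str] = None  # optional, so it is None when not provided


def to_datapackage_dict(contributor: ContributorSketch) -> dict:
    """Convert to a dict, dropping every field whose value is None."""
    return {k: v for k, v in asdict(contributor).items() if v is not None}


person = ContributorSketch(title="Example Person")  # made-up contributor

# Naive serialization keeps the unset fields, which become null in JSON.
naive_json = json.dumps(asdict(person))

# Filtering out None values omits those fields entirely.
filtered_json = json.dumps(to_datapackage_dict(person))
```

If the real classes are Pydantic models (an assumption here, not something stated in this issue), Pydantic's `exclude_none=True` option on its dict/JSON export methods may achieve the same effect without a hand-written filter.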

Bug Severity

How badly is this bug affecting you?

  • Medium: With some effort, I can work around the bug.
    This occasionally causes issues but is also the status quo in all existing archives.

To Reproduce

See the datapackage.json file for any existing data archive, e.g.: https://zenodo.org/records/8164776

To produce a datapackage from an existing dataset, install PUDL locally following these instructions and run:

from pudl.metadata.classes import DataSource
DataSource.from_id("eia860").contributors

You can replace "eia860" with any of the data source IDs in pudl.metadata.sources.py.

Expected behavior

The orcid field should only be included as a field in the datapackage when it exists.
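
One way to check for the expected behavior is to scan a loaded datapackage for contributors that still carry an explicit null orcid. The sketch below assumes the datapackage has a top-level `contributors` list of dictionaries with a `title` key; the function name `contributors_with_null_orcid` and the sample data (including the placeholder ORCID) are hypothetical.

```python
def contributors_with_null_orcid(datapackage: dict) -> list:
    """Return titles of contributors whose record has an explicit null orcid."""
    return [
        contributor.get("title", "<untitled>")
        for contributor in datapackage.get("contributors", [])
        if "orcid" in contributor and contributor["orcid"] is None
    ]


# Made-up datapackage content standing in for a parsed datapackage.json.
example = {
    "name": "example",
    "contributors": [
        {"title": "A", "orcid": None},  # the bug: orcid present but null
        {"title": "B", "orcid": "0000-0000-0000-0000"},  # placeholder ORCID
        {"title": "C"},  # expected shape when there is no ORCID
    ],
}

offenders = contributors_with_null_orcid(example)
```

With the fix in place, a check like this should return an empty list for any newly generated datapackage.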

Software Environment?

  • Operating System. (e.g. MacOS 14.5, Ubuntu 22.04, Windows Subsystem for Linux v2)
    Ubuntu 22.04

  • Python version and distribution (e.g. Anaconda Python 3.10.6)
    Python 3.11.6

  • How did you install PUDL?
    git clone dev

@yolandazzz13

Hi, I'm a student at the University of Michigan currently working on a final project for a software engineering course that expects us to contribute to open source GitHub projects. Could you assign me to this issue? @catalyst-cooperative/com-dev Also, it would be awesome if someone could tell me how to generate new datapackage.json files. Thank you in advance!

@e-belfer
Member Author

e-belfer commented Nov 8, 2024

@yolandazzz13 Awesome, happy you found us! Let me flesh this issue out a bit more to make it clearer what needs to get done and make sure it's a good fit for a first-time contributor.

@e-belfer
Member Author

e-belfer commented Nov 8, 2024

@yolandazzz13 I took a stab at updating my description of the problem and have assigned you. Let me know if you have any questions! Otherwise, I'm happy to review a design proposal or a draft PR when you're ready.

@yolandazzz13

yolandazzz13 commented Nov 19, 2024 via email

@zaneselvans zaneselvans added datapkg Frictionless data package input, output, metadata, manipulation metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. zenodo Issues having to do with Zenodo data archiving and retrieval. labels Nov 21, 2024
@zaneselvans zaneselvans moved this from New to In progress in Catalyst Megaproject Nov 21, 2024
@github-project-automation github-project-automation bot moved this from In review to Done in Catalyst Megaproject Nov 30, 2024