Add support for ORCIDs #30

Open
proycon opened this issue Oct 12, 2022 · 7 comments
Assignees
Labels
enhancement New feature or request

Comments

@proycon
Owner

proycon commented Oct 12, 2022

Authors are best identified by their ORCID. We ideally need a way of resolving user emails to orcids automatically (does their API offer such a function?).

@proycon proycon self-assigned this Oct 12, 2022
@proycon proycon added the enhancement New feature or request label Oct 12, 2022
@broeder-j

broeder-j commented Oct 18, 2022

Yes, it does: https://info.orcid.org/faq/how-do-i-find-orcid-record-holders-at-my-institution/
BUT (this is the catch I suspected): user emails are by default not visible to the outside; a member has to change this to either internal or public on a per-email level. So only if people have done this do you have a chance of finding them by email via an authorized query to the API. I think most people do not change the default, so I expect this approach to yield about 10%. (test query: https://pub.orcid.org/v3.0/csv-search/?q=affiliation-org-name:ORCID&fl=orcid,given-names,family-name,current-institution-affiliation-name,email)

A better way could be to find people by name plus affiliation, i.e. institution name or identifier.
Here codemetapy probably only has a chance if the institution is given or if it can get it from the metadata already there...
How to do this I do not know, since contributors can be from anywhere; maybe a first thing would be to allow for a list to try.

Let me know if you plan to work on this. I have a layout of what I want, but I have not implemented anything yet and it is currently not on my to-do list.

In terms of existing code, I found these, which are old and may or may not work:
https://github.com/ORCID/python-orcid
https://github.com/scholrly/orcid-python
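
A minimal sketch of the name-plus-affiliation lookup suggested above, assuming the same `csv-search` endpoint as the test query. The function names are hypothetical, and the query fields `given-names`, `family-name` and `affiliation-org-name` are assumed to be searchable (they appear in the test query's `q` and `fl` parameters); this should be checked against the ORCID API documentation before relying on it. The code only builds the URL and performs no request.

```python
from typing import Optional, Sequence
from urllib.parse import urlencode

# Endpoint and output fields are copied from the test query in this thread.
ORCID_CSV_SEARCH = "https://pub.orcid.org/v3.0/csv-search/"
DEFAULT_FIELDS = (
    "orcid",
    "given-names",
    "family-name",
    "current-institution-affiliation-name",
)


def quote_term(value: str) -> str:
    """Wrap a term in double quotes so multi-word values stay one phrase."""
    return '"' + value.replace('"', "") + '"'


def build_orcid_search_url(
    given_name: str,
    family_name: str,
    affiliation: Optional[str] = None,
    fields: Sequence[str] = DEFAULT_FIELDS,
) -> str:
    """Build a search URL for a person by name, optionally narrowed by affiliation."""
    clauses = [
        "given-names:" + quote_term(given_name),
        "family-name:" + quote_term(family_name),
    ]
    if affiliation:
        # Assumption: affiliation-org-name matches on the institution name.
        clauses.append("affiliation-org-name:" + quote_term(affiliation))
    query = " AND ".join(clauses)
    return ORCID_CSV_SEARCH + "?" + urlencode({"q": query, "fl": ",".join(fields)})


# Placeholder person and institution, for illustration only.
url = build_orcid_search_url("Jane", "Doe", "Example University")
```

A name-only query can return several candidates, so a caller would still need to decide what to do with ambiguous matches; that part is left open here.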

@proycon
Owner Author

proycon commented Oct 18, 2022

> user emails are by default not visible to the outside; a member has to change this to either internal or public on a per-email level. So only if people have done this do you have a chance of finding them by email via an authorized query to the API. I think most people do not change the default, so I expect this approach to yield about 10%.

Too bad, this would be the ideal method but if it yields only 10% it's not very useful indeed.

> A better way could be to find people by name plus affiliation, i.e. institution name or identifier.

That sounds viable yes, though one issue with affiliations is that people tend to come and go in institutions.

> ..maybe a first thing would be to allow for a list to try.

Like explicitly passing a tsv file to codemetapy with say emails and orcids? That would work yes, though it isn't as fully automated as we'd want ideally.
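
A minimal sketch of what such a mapping could look like, assuming a two-column TSV of email and ORCID iD. The file layout, the function names and the sample row are all hypothetical; codemetapy has no such option at this point in the thread.

```python
import csv
import io


def load_orcid_map(tsv_file) -> dict:
    """Read 'email<TAB>orcid' rows into a dict keyed by lower-cased email."""
    mapping = {}
    for row in csv.reader(tsv_file, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank lines, short rows and comment lines
        mapping[row[0].strip().lower()] = row[1].strip()
    return mapping


def apply_orcid_map(person: dict, mapping: dict) -> dict:
    """Return a copy of a Person dict whose @id is the mapped ORCID URL, if any."""
    updated = dict(person)
    orcid = mapping.get(str(person.get("email", "")).lower())
    if orcid is not None:
        updated["@id"] = "https://orcid.org/" + orcid
    return updated


# Placeholder data: the email is made up, the iD is a sample value.
sample = io.StringIO("# email\torcid\njane@example.org\t0000-0002-1825-0097\n")
mapping = load_orcid_map(sample)
person = apply_orcid_map({"@type": "Person", "email": "Jane@Example.org"}, mapping)
```

Persons without a mapped email are returned unchanged, so the mapping only ever adds information.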

@broeder-j

An addition to this: codemetapy parses the CITATION.cff file, but it does not use the ORCIDs in there as author/contributor IDs but instead the GitLab id (account page): "@id": "https://iffgit.fz-juelich.de/fleur/fleur/person/cmax347".

Ideally one would keep both pieces of information, i.e. state somewhere that the ORCID and the git id are the same.

Also in that context, the familyName and givenName parsing is not optimal if the link of the person does not contain the name. Example:

    {
        "@id": "https://iffgit.fz-juelich.de/fleur/fleur/person/cmax347",
        "@type": "Person",
        "email": "[email protected]@gmail.com",
        "familyName": "",
        "givenName": "cMax347",
        "position": 71
    },
    {
        "@id": "https://iffgit.fz-juelich.de/fleur/fleur/person/christian-roman-gerhorst",
        "@type": "Person",
        "email": "[email protected]",
        "familyName": "Gerhorst",
        "givenName": "Christian-Roman",
        "position": 72
    }

So it also has problems with middle names. I would assume that these would be easier to parse from a CITATION.cff file.
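
A minimal sketch of how a CITATION.cff author entry could be turned into a Person that keeps the ORCID as its identifier, assuming the CFF author keys `given-names`, `family-names`, `email` and `orcid`. The function name is hypothetical, and using `sameAs` to retain the other identifier is only one possible way of keeping both; it is not what codemetapy currently does.

```python
from typing import Optional


def cff_author_to_person(author: dict, fallback_id: Optional[str] = None) -> dict:
    """Convert one CFF author mapping into a schema.org-style Person dict."""
    person = {"@type": "Person"}
    orcid = author.get("orcid")
    if orcid:
        # Prefer the ORCID as the identifier when CITATION.cff provides one.
        person["@id"] = orcid
        if fallback_id:
            # Assumption: keep the other identifier as a sameAs link.
            person["sameAs"] = fallback_id
    elif fallback_id:
        person["@id"] = fallback_id
    # Names come directly from the CFF fields, so no string parsing is needed.
    if author.get("given-names"):
        person["givenName"] = author["given-names"]
    if author.get("family-names"):
        person["familyName"] = author["family-names"]
    if author.get("email"):
        person["email"] = author["email"]
    return person


# Placeholder author entry; the iD is a sample value.
example = cff_author_to_person(
    {
        "given-names": "Christian-Roman",
        "family-names": "Gerhorst",
        "orcid": "https://orcid.org/0000-0002-1825-0097",
    },
    fallback_id="https://iffgit.fz-juelich.de/fleur/fleur/person/christian-roman-gerhorst",
)
```

Because CFF stores given and family names separately, hyphenated or multi-part given names survive without any guessing.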

@proycon
Owner Author

proycon commented Nov 4, 2022

> An addition to this: codemetapy parses the CITATION.cff file, but it does not use the ORCIDs in there as author/contributor IDs

Hmm.. Agreed, if there are ORCIDs then they shouldn't be overwritten. I wonder if it's an issue in codemetapy or in https://github.com/citation-file-format/cff-converter-python; we don't do the CITATION.cff parsing ourselves.

> but instead the GitLab id (account page): "@id": "https://iffgit.fz-juelich.de/fleur/fleur/person/cmax347".

(it's not the gitlab id, see #34)

> Ideally one would keep both pieces of information, i.e. state somewhere that the ORCID and the git id are the same.

@proycon
Owner Author

proycon commented Nov 4, 2022

> Also in that context, the familyName and givenName parsing is not optimal if the link of the person does not contain the name. Example:

    {
        "@id": "https://iffgit.fz-juelich.de/fleur/fleur/person/cmax347",
        "@type": "Person",
        "email": "[email protected]@gmail.com",
        "familyName": "",
        "givenName": "cMax347",
        "position": 71
    },

Yes, we'd better just use schema:name if we can't decipher given and family names; this needs some fine-tuning. That e-mail looks malformed too.
For the actual name parsing from arbitrary strings I'm using nameparser.
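
A minimal sketch of the schema:name fallback, assuming a naive whitespace split as a stand-in for nameparser (so that the example has no third-party dependency). The function name is hypothetical and the splitting rule is deliberately simplistic; it is not how codemetapy parses names.

```python
def name_properties(raw: str) -> dict:
    """Return givenName/familyName when the string splits, else only name."""
    parts = raw.split()
    if len(parts) < 2:
        # A single token such as a username cannot be split reliably,
        # so emit only schema:name instead of guessing.
        return {"name": raw.strip()}
    # Stand-in rule: the last token is the family name, the rest the given name.
    return {"givenName": " ".join(parts[:-1]), "familyName": parts[-1]}


username_only = name_properties("cMax347")
full_name = name_properties("Christian-Roman Gerhorst")
```

With this rule a username like the one in the example above ends up as a plain name rather than as a given name with an empty family name.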

@proycon
Owner Author

proycon commented Mar 2, 2023

I've been giving this some more thought and there are some challenges to solve, mostly related to 'affiliations':

  1. In the current implementation, whenever an author appears in multiple
    software metadata projects (or even multiple times in the same one), there
    is a high risk of properties getting conflated if they are not consistently named.
    The most notable one is 'affiliation': if an author has had different
    affiliations at various points (or even the same one, inconsistently named), then these will all
    be propagated to all instances when the full graph of multiple software projects is loaded.
  2. Related to the above: 'affiliation' is a property of a schema:Person. But
    that means it is no longer attached to any specific software project,
    meaning we can't differentiate between affiliations at the time of the
    software project or later/before. We'd always get all of them, which may be
    less informative than desired. It's common for people to have (had) multiple
    affiliations throughout their career. We do use schema:producer to tie
    software projects to institutions directly, so at least that is expressible
    (relates to codemeta/codemeta#286, "Discussing the inclusion of producer and creator terms").
  3. We already ascertained that automatically going from names or e-mails to
    ORCIDs is hard. We probably need a custom mapping as input (like a tsv
    file).
  4. The reverse, going from ORCIDs to all the names/emails/urls, is fairly easy: we can
    just query orcid.org and request application/ld+json to get a schema.org
    representation that is compatible with codemeta. Some caveats there:
    * It does not contain the e-mail, even if it is public. The turtle
      output, however, does (it uses a completely different vocabulary than
      the JSON-LD serialisation).
    * The JSON-LD output lists all affiliations it knows (including those
      that have ended, without indicating that they have ended). The turtle
      output lists no affiliations at all.
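
A minimal sketch of the ORCID-to-JSON-LD lookup described in point 4, assuming plain content negotiation against orcid.org with the standard library. The function names are hypothetical, and the shape of the returned document is not assumed beyond being JSON.

```python
import json
from urllib.request import Request, urlopen


def orcid_jsonld_request(orcid: str) -> Request:
    """Build a request for an ORCID record that asks for JSON-LD."""
    prefix = "https://orcid.org/"
    url = orcid if orcid.startswith(prefix) else prefix + orcid
    return Request(url, headers={"Accept": "application/ld+json"})


def fetch_orcid_person(orcid: str, timeout: float = 10.0) -> dict:
    """Fetch the schema.org representation of an ORCID record (needs network)."""
    with urlopen(orcid_jsonld_request(orcid), timeout=timeout) as response:
        return json.load(response)


# Building the request does not touch the network; the iD is a sample value.
request = orcid_jsonld_request("0000-0002-1825-0097")
```

Given the caveats listed above, any affiliations in the result would need to be treated as "at some point" rather than "current", and the e-mail would have to come from elsewhere.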

@proycon
Owner Author

proycon commented Apr 17, 2024

Possibly relevant: ORCID profiles can be tied to GitHub accounts. If the GitHub API exposes this, it provides a nice way to find ORCIDs.

See https://scicomm.xyz/@ORCID_Org/112282433046701907


2 participants