Extracting mentions by DOI publications (Crossref and DataCite) #266

dmijatovic · 2022-05-13T10:33:15Z

dmijatovic
May 13, 2022
Maintainer

Mentions

On the software page we have a mentions section, where publications, blogs and other "types" of software mentions (references) are listed. In the legacy version of RSD this information is scraped from Zotero. For this purpose the Zotero library (company space) and API key are required. This worked well for the eScience Center.

Extracting mentions on demand from different sources

Talking with our clients, we see that they use different reference manager applications similar to Zotero. Pure seems to be widely used in the Netherlands, but the clients mentioned a number of other alternatives too. Some also suggested using the ORCID API (?) for this purpose (I assume more from a personal perspective).

Note that I am not an expert in this area and I need some assistance. Over time I have seen a growing number of (open) API efforts that enable extraction of (meta)data from various systems. I expect to find various ways to extract the information we need.

I think that pulling/extracting mentions on demand might be more suitable. The scraping approach requires certain definitions (incl. secrets) to be stored somewhere in RSD per company. In addition, scraping (on a schedule in the background) usually pulls all information and stores it somewhere in the RSD, while some of the items might never be used.

Manually adding mention items

Looking at the Zotero manual, they seem to support adding reference items from different sources on demand (see image below for reference or use link).

Pure website mentions these data sources (see image below for reference or use link).

I wonder what would be the optimal approach for RSD to enable users to add specific mentions to software/projects. If at least some of the sources offer an open API, without key/secret and with reasonable limits, I would lean toward using this (on-demand) approach over the scraper approach with a direct link to Zotero/Pure or other similar services. In addition, it seems to me that adding a mention completely manually would always be needed as the "last resort" option.

@jmaassen, @cmeessen, @ruester and others, what are your thoughts on how to obtain mentions in RSD in the most flexible way?

cmeessen · 2022-05-13T14:08:51Z

cmeessen
May 13, 2022
Collaborator

I think, if possible, mentions should be scraped automatically, as it may be too much work to keep them up to date manually, and maintainers may forget at some point. Citation scraping would not have to be done very frequently either. My guess is that once every week or every two weeks would be enough. Nevertheless, there could still be a button in the UI to trigger the scraping for a specific DOI or to add one manually. However, this requires that there is a service that could be used, and apparently this is not so easy.

I had a look at different services that offer mention scraping a while ago. Pure was new to me though, and I think it is not widely used here, but I might be wrong. The services I found that have an API are:

DataCite (not all citations listed, collaborates with Crossref)
Crossref Cited-by (seems to be quite restrictive and only for participating organisations)
Google Scholar
Web of Science (only papers, no software)
Zenodo (beta, relies on DataCite)

In fact, Google Scholar is doing a good job of identifying mentions, and they also have good API documentation. The rate limit in the free version is 100/month, which is not really sufficient. Also, there is no way to query for DOIs, so we would have to search using titles and authors, and I am afraid this is very prone to errors.

In conclusion, I think there is no easy path to go right now, and manual maintenance might be the best solution for now.

1 reply

ruester May 18, 2022
Collaborator

About Google Scholar:
If we run this update routine monthly we could have 100 software entries, I guess. If we have more than 100 "lookups" per month we could have some kind of pipeline/queue so that every entry gets at least one update a year (meaning we could have 1200 software entries), which would also be enough IMO since updating the mentions does not need to be "realtime".
Regarding the "false-positive" mentions: in this case I would suggest to have some kind of "reviewing queue", so that we can give suggested mentions, but a reviewer needs to verify the mentions (for example gets notified by mail about new possible mentions)
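
The quota arithmetic above (100 lookups per month, so 1200 entries per year) can be sketched with a rotating queue. This is only an illustration in Python: the function names and the use of an in-memory deque are assumptions made for the example, not an existing RSD component.

```python
from collections import deque
from typing import Deque, List


def yearly_capacity(monthly_quota: int) -> int:
    """Number of entries that can get at least one update per year."""
    return monthly_quota * 12


def next_batch(queue: Deque[str], monthly_quota: int) -> List[str]:
    """Take up to `monthly_quota` entries from the front of the queue.

    Each entry taken is appended to the back again, so it is only
    revisited after all other entries have had their turn.
    """
    batch: List[str] = []
    for _ in range(min(monthly_quota, len(queue))):
        entry = queue.popleft()
        batch.append(entry)
        queue.append(entry)
    return batch
```

Calling `next_batch` once a month with a quota of 100 would cycle through up to 1200 entries in a year; a persistent store would be needed in practice so the queue order survives restarts.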

After fiddling around a bit with the Google Scholar API: the "cited-by API" always needs an author ID, which may be difficult to identify for the software entries. So you cannot search for specific articles (in our case software) in the cited-by API.

dmijatovic · 2022-05-16T18:40:56Z

dmijatovic
May 16, 2022
Maintainer Author

Just came across this information; the source is the Crossref Annual Meeting. I am not sure how reliable/accurate this stat is or what it is based on.

0 replies

dmijatovic · 2022-05-18T18:32:14Z

dmijatovic
May 18, 2022
Maintainer Author

I plan to start with this approach

Adding mention by DOI

Use the DOI API to determine the RA

https://doi.org/doiRA/10.1007/978-3-319-92016-0_13

response

[
    {
        "DOI": "10.1007/978-3-319-92016-0_13",
        "RA": "Crossref"
    }
]
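
Reading the RA out of this response could look like the sketch below. It assumes the response body is a JSON list shaped like the one shown above; the function name and the `SUPPORTED_RAS` set are made up for the example.

```python
import json
from typing import Optional

# Only these two registration agencies are handled in this sketch.
SUPPORTED_RAS = {"Crossref", "DataCite"}


def registration_agency(doi_ra_body: str) -> Optional[str]:
    """Return the RA from a doi.org/doiRA response body.

    Returns None when the list is empty, the "RA" key is missing,
    or the RA is not one of the supported agencies.
    """
    items = json.loads(doi_ra_body)
    if not items:
        return None
    ra = items[0].get("RA")
    return ra if ra in SUPPORTED_RAS else None
```

For the response shown above this returns "Crossref".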

If Crossref, use the Crossref API to retrieve info

https://api.crossref.org/works/10.1007/978-3-319-92016-0_13

response

{
    "status": "ok",
    "message-type": "work",
    "message-version": "1.0.0",
    "message": {
        "indexed": {
            "date-parts": [
                [
                    2022,
                    4,
                    1
                ]
            ],
            "date-time": "2022-04-01T17:31:37Z",
            "timestamp": 1648834297868
        },
        "publisher-location": "Cham",
        "reference-count": 26,
        "publisher": "Springer International Publishing",
        "isbn-type": [
            {
                "value": "9783319920153",
                "type": "print"
            },
            {
                "value": "9783319920160",
                "type": "electronic"
            }
        ],
        "license": [
            {
                "start": {
                    "date-parts": [
                        [
                            2018,
                            1,
                            1
                        ]
                    ],
                    "date-time": "2018-01-01T00:00:00Z",
                    "timestamp": 1514764800000
                },
                "content-version": "unspecified",
                "delay-in-days": 0,
                "URL": "http://www.springer.com/tdm"
            }
        ],
        "content-domain": {
            "domain": [
                "link.springer.com"
            ],
            "crossmark-restriction": false
        },
        "short-container-title": [],
        "published-print": {
            "date-parts": [
                [
                    2018
                ]
            ]
        },
        "DOI": "10.1007/978-3-319-92016-0_13",
        "type": "book-chapter",
        "created": {
            "date-parts": [
                [
                    2018,
                    5,
                    21
                ]
            ],
            "date-time": "2018-05-21T13:54:16Z",
            "timestamp": 1526910856000
        },
        "page": "133-145",
        "update-policy": "http://dx.doi.org/10.1007/springer_crossmark_policy",
        "source": "Crossref",
        "is-referenced-by-count": 1,
        "title": [
            "Query Disambiguation Based on Clustering Techniques"
        ],
        "prefix": "10.1007",
        "author": [
            {
                "given": "Panagiota",
                "family": "Kotoula",
                "sequence": "first",
                "affiliation": []
            },
            {
                "given": "Christos",
                "family": "Makris",
                "sequence": "additional",
                "affiliation": []
            }
        ],
        "member": "297",
        "published-online": {
            "date-parts": [
                [
                    2018,
                    5,
                    22
                ]
            ]
        },
        "reference": [
            {
                "key": "13_CR1",
                "doi-asserted-by": "crossref",
                "unstructured": "Agrawal, R., Collapudi, S., Halverson, A., Ieong S.: Diversifying search results. In: Proceedings of the 2nd International Conference on Web Search and Data Mining, pp. 5–14 (2009)",
                "DOI": "10.1145/1498759.1498766"
            } ...
        ],
        "container-title": [
            "IFIP Advances in Information and Communication Technology",
            "Artificial Intelligence Applications and Innovations"
        ],
        "original-title": [],
        "link": [
            {
                "URL": "http://link.springer.com/content/pdf/10.1007/978-3-319-92016-0_13",
                "content-type": "unspecified",
                "content-version": "vor",
                "intended-application": "similarity-checking"
            }
        ],
        "deposited": {
            "date-parts": [
                [
                    2018,
                    5,
                    21
                ]
            ],
            "date-time": "2018-05-21T13:58:13Z",
            "timestamp": 1526911093000
        },
        "score": 1,
        "resource": {
            "primary": {
                "URL": "http://link.springer.com/10.1007/978-3-319-92016-0_13"
            }
        },
        "subtitle": [],
        "short-title": [],
        "issued": {
            "date-parts": [
                [
                    2018
                ]
            ]
        },
        "ISBN": [
            "9783319920153",
            "9783319920160"
        ],
        "references-count": 26,
        "URL": "http://dx.doi.org/10.1007/978-3-319-92016-0_13",
        "relation": {},
        "ISSN": [
            "1868-4238",
            "1868-422X"
        ],
        "issn-type": [
            {
                "value": "1868-4238",
                "type": "print"
            },
            {
                "value": "1868-422X",
                "type": "electronic"
            }
        ],
        "published": {
            "date-parts": [
                [
                    2018
                ]
            ]
        }
    }
}
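
Only a handful of these properties are needed to show a mention. A possible way to reduce the "message" object to a flat record is sketched below; the output keys (`doi`, `title`, `authors`, ...) are assumptions for illustration and not the RSD data model.

```python
from typing import Any, Dict


def crossref_to_mention(message: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the display fields out of a Crossref "message" object."""
    titles = message.get("title") or []
    # Join given and family name; either part may be missing.
    authors = [
        " ".join(p for p in (a.get("given"), a.get("family")) if p)
        for a in message.get("author") or []
    ]
    # "published" holds nested date-parts, e.g. [[2018]] or [[2018, 5, 22]].
    parts = (message.get("published") or {}).get("date-parts") or []
    year = parts[0][0] if parts and parts[0] else None
    return {
        "doi": message.get("DOI"),
        "title": titles[0] if titles else None,
        "authors": authors,
        "publisher": message.get("publisher"),
        "year": year,
        "type": message.get("type"),
        "url": message.get("URL"),
    }
```

Missing properties end up as None (or an empty author list) rather than raising, since not every Crossref record carries every field.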

If DataCite, use the DataCite API

https://api.datacite.org/dois/10.5281/zenodo.5873940

{
    "data": {
        "id": "10.5281/zenodo.5873940",
        "type": "dois",
        "attributes": {
            "doi": "10.5281/zenodo.5873940",
            "prefix": "10.5281",
            "suffix": "zenodo.5873940",
            "identifiers": [],
            "alternateIdentifiers": [],
            "creators": [
                {
                    "name": "van Hees, Vincent",
                    "givenName": "Vincent",
                    "familyName": "van Hees",
                    "affiliation": [
                        "Netherlands eScience Center"
                    ],
                    "nameIdentifiers": [
                        {
                            "schemeUri": "https://orcid.org",
                            "nameIdentifier": "https://orcid.org/0000-0003-0182-9008",
                            "nameIdentifierScheme": "ORCID"
                        }
                    ]
                },
                {
                    "name": "Fang, Zhou",
                    "givenName": "Zhou",
                    "familyName": "Fang",
                    "affiliation": [
                        "Activinsights Ltd."
                    ],
                    "nameIdentifiers": []
                },
                {
                    "name": "Mirkes, Evgeny",
                    "givenName": "Evgeny",
                    "familyName": "Mirkes",
                    "affiliation": [
                        "University of Leicester"
                    ],
                    "nameIdentifiers": [
                        {
                            "schemeUri": "https://orcid.org",
                            "nameIdentifier": "https://orcid.org/0000-0003-1474-1734",
                            "nameIdentifierScheme": "ORCID"
                        }
                    ]
                },
                {
                    "name": "Heywood, Joe",
                    "givenName": "Joe",
                    "familyName": "Heywood",
                    "affiliation": [
                        "University College London"
                    ],
                    "nameIdentifiers": []
                },
                {
                    "name": "Zhao, Jing Hua",
                    "givenName": "Jing Hua",
                    "familyName": "Zhao",
                    "affiliation": [
                        "MRC Epidemiology Unit"
                    ],
                    "nameIdentifiers": [
                        {
                            "schemeUri": "https://orcid.org",
                            "nameIdentifier": "https://orcid.org/0000-0003-4930-3582",
                            "nameIdentifierScheme": "ORCID"
                        }
                    ]
                },
                {
                    "name": "Joan, Capdevila Pujol",
                    "givenName": "Capdevila Pujol",
                    "familyName": "Joan",
                    "affiliation": [
                        "Polytechnical University of Catalonia"
                    ],
                    "nameIdentifiers": []
                },
                {
                    "name": "Sabia, Séverine",
                    "givenName": "Séverine",
                    "familyName": "Sabia",
                    "affiliation": [
                        "Inserm"
                    ],
                    "nameIdentifiers": [
                        {
                            "schemeUri": "https://orcid.org",
                            "nameIdentifier": "https://orcid.org/0000-0003-3109-9720",
                            "nameIdentifierScheme": "ORCID"
                        }
                    ]
                },
                {
                    "name": "Migueles, Jairo H.",
                    "givenName": "Jairo H.",
                    "familyName": "Migueles",
                    "affiliation": [
                        "University of Granada"
                    ],
                    "nameIdentifiers": [
                        {
                            "schemeUri": "https://orcid.org",
                            "nameIdentifier": "https://orcid.org/0000-0003-0366-6935",
                            "nameIdentifierScheme": "ORCID"
                        }
                    ]
                }
            ],
            "titles": [
                {
                    "title": "GGIR"
                }
            ],
            "publisher": "Zenodo",
            "container": {},
            "publicationYear": 2022,
            "subjects": [
                {
                    "subject": "activity tracker"
                },
                {
                    "subject": "health"
                },
                {
                    "subject": "fitness"
                },
                {
                    "subject": "sleep research"
                },
                {
                    "subject": "accelerometer"
                }
            ],
            "contributors": [],
            "dates": [
                {
                    "date": "2022-01-18",
                    "dateType": "Issued"
                }
            ],
            "language": null,
            "types": {
                "ris": "COMP",
                "bibtex": "misc",
                "citeproc": "article",
                "schemaOrg": "SoftwareSourceCode",
                "resourceTypeGeneral": "Software"
            },
            "relatedIdentifiers": [
                {
                    "relationType": "IsSupplementTo",
                    "relatedIdentifier": "https://github.com/wadpac/GGIR/tree/2.5-6",
                    "relatedIdentifierType": "URL"
                },
                {
                    "relationType": "IsVersionOf",
                    "relatedIdentifier": "10.5281/zenodo.1051064",
                    "relatedIdentifierType": "DOI"
                }
            ],
            "sizes": [],
            "formats": [],
            "version": "2.5-6",
            "rightsList": [
                {
                    "rights": "GNU Library General Public License v2 only",
                    "rightsUri": "https://www.gnu.org/licenses/old-licenses/lgpl-2.0-standalone.html",
                    "schemeUri": "https://spdx.org/licenses/",
                    "rightsIdentifier": "lgpl-2.0",
                    "rightsIdentifierScheme": "SPDX"
                },
                {
                    "rights": "Open Access",
                    "rightsUri": "info:eu-repo/semantics/openAccess"
                }
            ],
            "descriptions": [
                {
                    "description": "Converts raw data from wearables into insightful reports for researchers investigating human daily physical activity and sleep.",
                    "descriptionType": "Abstract"
                }
            ],
            "geoLocations": [],
            "fundingReferences": [],
            "xml": "",
            "url": "https://zenodo.org/record/5873940",
            "contentUrl": null,
            "metadataVersion": 0,
            "schemaVersion": "http://datacite.org/schema/kernel-4",
            "source": "mds",
            "isActive": true,
            "state": "findable",
            "reason": null,
            "viewCount": 0,
            "viewsOverTime": [],
            "downloadCount": 0,
            "downloadsOverTime": [],
            "referenceCount": 0,
            "citationCount": 0,
            "citationsOverTime": [],
            "partCount": 0,
            "partOfCount": 0,
            "versionCount": 0,
            "versionOfCount": 1,
            "created": "2022-01-18T16:54:54.000Z",
            "registered": "2022-01-18T16:54:55.000Z",
            "published": "2022",
            "updated": "2022-01-18T16:54:55.000Z"
        },
        "relationships": {
            "client": {
                "data": {
                    "id": "cern.zenodo",
                    "type": "clients"
                }
            },
            "provider": {
                "data": {
                    "id": "cern",
                    "type": "providers"
                }
            },
            "media": {
                "data": {
                    "id": "10.5281/zenodo.5873940",
                    "type": "media"
                }
            },
            "references": {
                "data": []
            },
            "citations": {
                "data": []
            },
            "parts": {
                "data": []
            },
            "partOf": {
                "data": []
            },
            "versions": {
                "data": []
            },
            "versionOf": {
                "data": [
                    {
                        "id": "10.5281/zenodo.1051064",
                        "type": "dois"
                    }
                ]
            }
        }
    }
}
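
The DataCite response can be reduced in a similar way. The sketch below reads from `data.attributes` as shown in the response above; the output keys are again assumptions for illustration, not the RSD data model.

```python
from typing import Any, Dict, List


def datacite_to_mention(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the display fields out of a DataCite REST /dois response."""
    attrs = (payload.get("data") or {}).get("attributes") or {}
    titles = attrs.get("titles") or []
    authors: List[Any] = []
    for creator in attrs.get("creators") or []:
        # Prefer "givenName familyName"; fall back to the combined "name".
        parts = (creator.get("givenName"), creator.get("familyName"))
        full = " ".join(p for p in parts if p)
        authors.append(full or creator.get("name"))
    return {
        "doi": attrs.get("doi"),
        "title": titles[0].get("title") if titles else None,
        "authors": authors,
        "publisher": attrs.get("publisher"),
        "year": attrs.get("publicationYear"),
        "type": (attrs.get("types") or {}).get("resourceTypeGeneral"),
        "version": attrs.get("version"),
        "url": attrs.get("url"),
    }
```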

0 replies

dmijatovic · 2022-05-24T07:59:50Z

dmijatovic
May 24, 2022
Maintainer Author

Extracting mention info using DOI

After some research and testing, I decided to implement the following approach for extracting mentions of software and projects in scientific publications.

The process consists of 2 steps:

Use doi.org to extract information about the RA (provider)
Based on the RA, use Crossref or DataCite to extract information

In addition, we can offer search by publication title in the Crossref (journals, publications etc.) and DataCite (software and datasets) registries.

Extract RA of DOI

The first step is to extract information about the RA. We support Crossref and DataCite, which are also 2 widely used providers. DataCite is mainly used for software and datasets, while Crossref is widely used for other types of scientific publications (books, journals etc.).

https://doi.org/doiRA/10.1007/978-3-319-92016-0_13
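
The two-step flow (ask doi.org for the RA, then call the matching provider) can be sketched as below. This is a rough outline under a few assumptions: it uses the plain REST endpoints `https://api.crossref.org/works/` and `https://api.datacite.org/dois/`, the function name is made up, and the HTTP call is passed in as a `fetch` callable so the logic can be tried without network access.

```python
import json
from typing import Any, Callable, Dict
from urllib.parse import quote

DOI_RA_ENDPOINT = "https://doi.org/doiRA/"
PROVIDER_ENDPOINTS = {
    "Crossref": "https://api.crossref.org/works/",
    "DataCite": "https://api.datacite.org/dois/",
}


def lookup_by_doi(doi: str, fetch: Callable[[str], str]) -> Dict[str, Any]:
    """Step 1: resolve the RA of the DOI. Step 2: query that provider.

    `fetch` takes a URL and returns the response body as text.
    """
    safe_doi = quote(doi, safe="/")
    ra_items = json.loads(fetch(DOI_RA_ENDPOINT + safe_doi))
    ra = ra_items[0].get("RA") if ra_items else None
    if ra not in PROVIDER_ENDPOINTS:
        raise ValueError(f"Unsupported or unknown RA for {doi}: {ra}")
    record = json.loads(fetch(PROVIDER_ENDPOINTS[ra] + safe_doi))
    return {"ra": ra, "record": record}
```

With the standard library, `fetch` could be something like `lambda url: urllib.request.urlopen(url).read().decode("utf-8")`; error handling and rate limiting are left out of this sketch.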

Extract mention info using RA APIs

As mentioned, I will implement support for extracting "works" from the Crossref and DataCite APIs.

Crossref api

Crossref offers a REST API with select and filter options. We use the works endpoint because it supports the select option to obtain specific properties rather than the complete object (which is significantly larger).

https://api.crossref.org/works?filter=doi:10.2174/0929867327666200727114410&select=DOI,ISBN,ISSN,URL,title,subtitle,author,publisher,published,type,subject
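
Building this URL for an arbitrary DOI could look like the sketch below. The function and constant names are made up for the example; the field list is the one from the URL above.

```python
from typing import Sequence
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"
CROSSREF_FIELDS = (
    "DOI", "ISBN", "ISSN", "URL", "title", "subtitle",
    "author", "publisher", "published", "type", "subject",
)


def crossref_doi_url(doi: str, fields: Sequence[str] = CROSSREF_FIELDS) -> str:
    """Build a works URL that filters on one DOI and selects a few fields."""
    params = {"filter": f"doi:{doi}", "select": ",".join(fields)}
    # Keep ":", "/" and "," readable; everything else is percent-encoded.
    return CROSSREF_WORKS + "?" + urlencode(params, safe=":/,")
```

Note that with the filter the matching record is returned as a list under `message.items` rather than directly under `message`.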

DataCite api (GraphQL)

DataCite offers various options: REST API, GraphQL API and more.
I will use GraphQL because it offers the option to select specific properties rather than returning the complete object.

query {
  work(id: "10.1007/978-1-4939-9145-7_4") {
    doi,
    type,
    types {
      resourceType,
      resourceTypeGeneral
    },
    sizes,
    version,
    titles(first: 1) {
      title
    },
    publisher,
    publicationYear,
    creators {
      givenName,
      familyName,
      affiliation {
        name
      }
    },
    contributors {
      givenName,
      familyName,
      affiliation {
        name
      }
    }
  }
}

You can paste above query into GraphiQL playground
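
Outside the playground, a GraphQL query is normally sent as a JSON body in a POST request. The sketch below only builds such a request and does not send it. The endpoint URL `https://api.datacite.org/graphql` is an assumption to verify against the DataCite documentation, the function name is made up, and the query is shortened to a few of the fields above.

```python
import json
import urllib.request

# Assumed endpoint; check the DataCite docs before relying on it.
DATACITE_GRAPHQL = "https://api.datacite.org/graphql"

WORK_FIELDS = (
    "doi type version publisher publicationYear "
    "titles(first: 1) { title } "
    "creators { givenName familyName }"
)


def datacite_work_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) a GraphQL POST request for one work."""
    # json.dumps yields a quoted, escaped string usable as a GraphQL literal.
    query = "query { work(id: %s) { %s } }" % (json.dumps(doi), WORK_FIELDS)
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        DATACITE_GRAPHQL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the call; the response JSON is expected to carry the selected fields under `data.work`.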

Searching for publication by title

While researching how to extract mention information from a DOI and examining the API endpoints at Crossref and DataCite, I discovered that these APIs also enable searching for publications using text queries. If I am correct, the documentation mentions that both providers use Elasticsearch as their database, which enables extensive text search. At the moment I think it is most suitable to limit the (text) search to the publication title and to 30 results (per provider).

Crossref searching by title

https://api.crossref.org/works?query.title=knime&select=DOI,ISBN,URL,title,subtitle,author,publisher,published,type,subject&rows=30
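
A small helper that builds this search URL and enforces the limit of 30 results might look like this. The function and constant names are assumptions for the example; the parameters are the ones from the URL above.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"
SEARCH_FIELDS = "DOI,ISBN,URL,title,subtitle,author,publisher,published,type,subject"
MAX_ROWS = 30  # limit per provider, as proposed above


def crossref_title_search_url(title: str, rows: int = MAX_ROWS) -> str:
    """Build a title search URL, never asking for more than MAX_ROWS rows."""
    params = {
        "query.title": title,
        "select": SEARCH_FIELDS,
        "rows": max(1, min(rows, MAX_ROWS)),
    }
    # Keep the commas in the select list readable; encode the rest.
    return CROSSREF_WORKS + "?" + urlencode(params, safe=",")
```

Spaces and other special characters in the title are encoded, so free text typed by a user can be passed in directly.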

DataCite search by title

query {
  works(query: "titles.title:knime", first: 30) {
    nodes {
      doi,
      type,
      sizes,
      version,
      titles(first: 1) {
        title
      },
      descriptions(first: 1) {
        description
      },
      publisher,
      publicationYear,
      creators {
        givenName,
        familyName,
        affiliation {
          name
        }
      },
      contributors {
        givenName,
        familyName,
        affiliation {
          name
        }
      }
    }
  }
}

You can paste above query into GraphiQL playground

0 replies
