Extracting mentions by DOI publications (Crossref and DataCite) #266
-
Mentions
On the software page we have a mentions section, where publications, blogs and other "types" of software mentions (references) are listed. In the legacy version of RSD this information is scraped from Zotero. For this purpose the Zotero library (company space) and an API key are required. This worked well for the eScience Center.
Extracting mentions on demand from different sources
Talking with our clients, we see that they use different reference manager applications, such as Zotero. Pure seems to be widely used in the Netherlands, but the clients mentioned a number of other alternatives too. Some also suggested using the ORCID API (?) for this purpose (I assume more from a personal perspective).
Note that I am not an expert in this area and I need some assistance. Over time I have seen a growing number of (open) API efforts that enable the extraction of (meta)data from various systems, so I expect to find various ways to extract the information we need.
I think that pulling/extracting mentions on demand might be more suitable. The scraping approach requires certain definitions (incl. secrets) to be stored somewhere in RSD per company. In addition, scraping (on a schedule in the background) usually pulls all information and stores it somewhere in RSD, while some of the items might never be used.
Manually adding mention items
Looking at the Zotero manual, they seem to support adding reference items from different sources on demand (see image below for reference or use link). The Pure website mentions these data sources (see image below for reference or use link).
I wonder what the optimal approach would be for RSD to enable a user to add specific mentions to a software/project page? If at least some of the sources offer an open API, without key/secret and with reasonable limits, I might lean toward using this (on demand) approach over the scraper approach with a direct link to Zotero/Pure or other similar services.
In addition, it seems to me that adding a mention completely by hand will always be needed as the "last resort" option. @jmaassen, @cmeessen, @ruester and others, I wonder what your thoughts are on how to obtain mentions in RSD in the most flexible way?
-
I think, if possible, mentions should be scraped automatically, as it may be too much work to keep them up to date by hand, and maintainers may forget at some point. Citation scraping would not have to be done very frequently either. My guess is that once every week or every two weeks would be enough. Nevertheless, there could still be a button in the UI to trigger the scraping for a specific DOI or to add a mention manually. However, this requires a service that can be used, and apparently this is not so easy. A while ago I had a look at different services that offer the possibility of mention scraping. Pure was new to me though, and I think it is not widely used here - but I might be wrong. The services I found that have an API are
In fact, Google Scholar is doing a good job of identifying mentions, and they also have good API documentation. The rate limit in the free version is 100/month, which is not really sufficient. Also, there is no way to query for DOIs, so we would have to search using titles and authors, and I am afraid this is very prone to errors. In conclusion, I think there is no easy path right now, and manual maintenance might be the best solution for the time being.
-
Just came across this information; the source is the Crossref Annual Meeting. I am not sure how reliable/accurate this stat is or what it is based on.
-
I plan to start with this approach:
Adding mention by DOI
response
[
{
"DOI": "10.1007/978-3-319-92016-0_13",
"RA": "Crossref"
}
]
response
{
"status": "ok",
"message-type": "work",
"message-version": "1.0.0",
"message": {
"indexed": {
"date-parts": [
[
2022,
4,
1
]
],
"date-time": "2022-04-01T17:31:37Z",
"timestamp": 1648834297868
},
"publisher-location": "Cham",
"reference-count": 26,
"publisher": "Springer International Publishing",
"isbn-type": [
{
"value": "9783319920153",
"type": "print"
},
{
"value": "9783319920160",
"type": "electronic"
}
],
"license": [
{
"start": {
"date-parts": [
[
2018,
1,
1
]
],
"date-time": "2018-01-01T00:00:00Z",
"timestamp": 1514764800000
},
"content-version": "unspecified",
"delay-in-days": 0,
"URL": "http://www.springer.com/tdm"
}
],
"content-domain": {
"domain": [
"link.springer.com"
],
"crossmark-restriction": false
},
"short-container-title": [],
"published-print": {
"date-parts": [
[
2018
]
]
},
"DOI": "10.1007/978-3-319-92016-0_13",
"type": "book-chapter",
"created": {
"date-parts": [
[
2018,
5,
21
]
],
"date-time": "2018-05-21T13:54:16Z",
"timestamp": 1526910856000
},
"page": "133-145",
"update-policy": "http://dx.doi.org/10.1007/springer_crossmark_policy",
"source": "Crossref",
"is-referenced-by-count": 1,
"title": [
"Query Disambiguation Based on Clustering Techniques"
],
"prefix": "10.1007",
"author": [
{
"given": "Panagiota",
"family": "Kotoula",
"sequence": "first",
"affiliation": []
},
{
"given": "Christos",
"family": "Makris",
"sequence": "additional",
"affiliation": []
}
],
"member": "297",
"published-online": {
"date-parts": [
[
2018,
5,
22
]
]
},
"reference": [
{
"key": "13_CR1",
"doi-asserted-by": "crossref",
"unstructured": "Agrawal, R., Collapudi, S., Halverson, A., Ieong S.: Diversifying search results. In: Proceedings of the 2nd International Conference on Web Search and Data Mining, pp. 5–14 (2009)",
"DOI": "10.1145/1498759.1498766"
} ...
],
"container-title": [
"IFIP Advances in Information and Communication Technology",
"Artificial Intelligence Applications and Innovations"
],
"original-title": [],
"link": [
{
"URL": "http://link.springer.com/content/pdf/10.1007/978-3-319-92016-0_13",
"content-type": "unspecified",
"content-version": "vor",
"intended-application": "similarity-checking"
}
],
"deposited": {
"date-parts": [
[
2018,
5,
21
]
],
"date-time": "2018-05-21T13:58:13Z",
"timestamp": 1526911093000
},
"score": 1,
"resource": {
"primary": {
"URL": "http://link.springer.com/10.1007/978-3-319-92016-0_13"
}
},
"subtitle": [],
"short-title": [],
"issued": {
"date-parts": [
[
2018
]
]
},
"ISBN": [
"9783319920153",
"9783319920160"
],
"references-count": 26,
"URL": "http://dx.doi.org/10.1007/978-3-319-92016-0_13",
"relation": {},
"ISSN": [
"1868-4238",
"1868-422X"
],
"issn-type": [
{
"value": "1868-4238",
"type": "print"
},
{
"value": "1868-422X",
"type": "electronic"
}
],
"published": {
"date-parts": [
[
2018
]
]
}
}
}
If the RA is DataCite, use the DataCite API:
{
"data": {
"id": "10.5281/zenodo.5873940",
"type": "dois",
"attributes": {
"doi": "10.5281/zenodo.5873940",
"prefix": "10.5281",
"suffix": "zenodo.5873940",
"identifiers": [],
"alternateIdentifiers": [],
"creators": [
{
"name": "van Hees, Vincent",
"givenName": "Vincent",
"familyName": "van Hees",
"affiliation": [
"Netherlands eScience Center"
],
"nameIdentifiers": [
{
"schemeUri": "https://orcid.org",
"nameIdentifier": "https://orcid.org/0000-0003-0182-9008",
"nameIdentifierScheme": "ORCID"
}
]
},
{
"name": "Fang, Zhou",
"givenName": "Zhou",
"familyName": "Fang",
"affiliation": [
"Activinsights Ltd."
],
"nameIdentifiers": []
},
{
"name": "Mirkes, Evgeny",
"givenName": "Evgeny",
"familyName": "Mirkes",
"affiliation": [
"University of Leicester"
],
"nameIdentifiers": [
{
"schemeUri": "https://orcid.org",
"nameIdentifier": "https://orcid.org/0000-0003-1474-1734",
"nameIdentifierScheme": "ORCID"
}
]
},
{
"name": "Heywood, Joe",
"givenName": "Joe",
"familyName": "Heywood",
"affiliation": [
"University College London"
],
"nameIdentifiers": []
},
{
"name": "Zhao, Jing Hua",
"givenName": "Jing Hua",
"familyName": "Zhao",
"affiliation": [
"MRC Epidemiology Unit"
],
"nameIdentifiers": [
{
"schemeUri": "https://orcid.org",
"nameIdentifier": "https://orcid.org/0000-0003-4930-3582",
"nameIdentifierScheme": "ORCID"
}
]
},
{
"name": "Joan, Capdevila Pujol",
"givenName": "Capdevila Pujol",
"familyName": "Joan",
"affiliation": [
"Polytechnical University of Catalonia"
],
"nameIdentifiers": []
},
{
"name": "Sabia, Séverine",
"givenName": "Séverine",
"familyName": "Sabia",
"affiliation": [
"Inserm"
],
"nameIdentifiers": [
{
"schemeUri": "https://orcid.org",
"nameIdentifier": "https://orcid.org/0000-0003-3109-9720",
"nameIdentifierScheme": "ORCID"
}
]
},
{
"name": "Migueles, Jairo H.",
"givenName": "Jairo H.",
"familyName": "Migueles",
"affiliation": [
"University of Granada"
],
"nameIdentifiers": [
{
"schemeUri": "https://orcid.org",
"nameIdentifier": "https://orcid.org/0000-0003-0366-6935",
"nameIdentifierScheme": "ORCID"
}
]
}
],
"titles": [
{
"title": "GGIR"
}
],
"publisher": "Zenodo",
"container": {},
"publicationYear": 2022,
"subjects": [
{
"subject": "activity tracker"
},
{
"subject": "health"
},
{
"subject": "fitness"
},
{
"subject": "sleep research"
},
{
"subject": "accelerometer"
}
],
"contributors": [],
"dates": [
{
"date": "2022-01-18",
"dateType": "Issued"
}
],
"language": null,
"types": {
"ris": "COMP",
"bibtex": "misc",
"citeproc": "article",
"schemaOrg": "SoftwareSourceCode",
"resourceTypeGeneral": "Software"
},
"relatedIdentifiers": [
{
"relationType": "IsSupplementTo",
"relatedIdentifier": "https://github.com/wadpac/GGIR/tree/2.5-6",
"relatedIdentifierType": "URL"
},
{
"relationType": "IsVersionOf",
"relatedIdentifier": "10.5281/zenodo.1051064",
"relatedIdentifierType": "DOI"
}
],
"sizes": [],
"formats": [],
"version": "2.5-6",
"rightsList": [
{
"rights": "GNU Library General Public License v2 only",
"rightsUri": "https://www.gnu.org/licenses/old-licenses/lgpl-2.0-standalone.html",
"schemeUri": "https://spdx.org/licenses/",
"rightsIdentifier": "lgpl-2.0",
"rightsIdentifierScheme": "SPDX"
},
{
"rights": "Open Access",
"rightsUri": "info:eu-repo/semantics/openAccess"
}
],
"descriptions": [
{
"description": "Converts raw data from wearables into insightful reports for researchers investigating human daily physical activity and sleep.",
"descriptionType": "Abstract"
}
],
"geoLocations": [],
"fundingReferences": [],
"xml": "",
"url": "https://zenodo.org/record/5873940",
"contentUrl": null,
"metadataVersion": 0,
"schemaVersion": "http://datacite.org/schema/kernel-4",
"source": "mds",
"isActive": true,
"state": "findable",
"reason": null,
"viewCount": 0,
"viewsOverTime": [],
"downloadCount": 0,
"downloadsOverTime": [],
"referenceCount": 0,
"citationCount": 0,
"citationsOverTime": [],
"partCount": 0,
"partOfCount": 0,
"versionCount": 0,
"versionOfCount": 1,
"created": "2022-01-18T16:54:54.000Z",
"registered": "2022-01-18T16:54:55.000Z",
"published": "2022",
"updated": "2022-01-18T16:54:55.000Z"
},
"relationships": {
"client": {
"data": {
"id": "cern.zenodo",
"type": "clients"
}
},
"provider": {
"data": {
"id": "cern",
"type": "providers"
}
},
"media": {
"data": {
"id": "10.5281/zenodo.5873940",
"type": "media"
}
},
"references": {
"data": []
},
"citations": {
"data": []
},
"parts": {
"data": []
},
"partOf": {
"data": []
},
"versions": {
"data": []
},
"versionOf": {
"data": [
{
"id": "10.5281/zenodo.1051064",
"type": "dois"
}
]
}
}
}
}
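The DataCite response above carries far more than a mention entry needs. As a minimal sketch (the function name `mention_from_datacite` and the shape of the returned record are my own assumptions, not something RSD defines), it could be flattened like this:

```python
from __future__ import annotations

from typing import Any


def mention_from_datacite(payload: dict[str, Any]) -> dict[str, Any]:
    """Flatten a DataCite 'dois' response into a small mention record.

    The output keys are hypothetical; adjust them to the real RSD schema.
    """
    attrs = payload["data"]["attributes"]
    titles = attrs.get("titles") or []
    authors = [
        # Prefer given + family name; fall back to the combined "name" field.
        " ".join(p for p in (c.get("givenName"), c.get("familyName")) if p)
        or c.get("name", "")
        for c in attrs.get("creators") or []
    ]
    return {
        "doi": attrs["doi"],
        "title": titles[0]["title"] if titles else None,
        "authors": authors,
        "publisher": attrs.get("publisher"),
        "year": attrs.get("publicationYear"),
        "type": (attrs.get("types") or {}).get("resourceTypeGeneral"),
        "url": attrs.get("url"),
        "source": "DataCite",
    }
```

Only `data.attributes` is read, so the `relationships` part of the response can be ignored for this purpose.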
-
Extracting mention info using DOI
After some research and testing, I decided to implement the following approach for extracting mentions of software and projects in scientific publications. The process consists of 2 steps:
Extract RA of DOI
The first step is to extract information about the RA (registration agency) of the DOI. We support Crossref and DataCite, which are also two widely used providers. DataCite is mainly used for software and datasets, while Crossref is widely used for other types of scientific publications (books, journals, etc.).
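A minimal sketch of this first step, assuming the lookup goes through the `https://doi.org/doiRA/<doi>` service (an assumption on my side, but it returns the same list shape as the `"RA": "Crossref"` response shown earlier in this thread). The function names are hypothetical:

```python
from __future__ import annotations

import json
from urllib.parse import quote
from urllib.request import urlopen

# Only these two agencies are handled; anything else falls back to manual entry.
SUPPORTED_RAS = {"Crossref", "DataCite"}


def ra_lookup_url(doi: str) -> str:
    """Build the lookup URL, keeping the slash between DOI prefix and suffix."""
    return "https://doi.org/doiRA/" + quote(doi.strip(), safe="/")


def parse_ra(response: list[dict]) -> str | None:
    """Return the RA name if it is one we support, otherwise None."""
    if not response:
        return None
    ra = response[0].get("RA")  # absent when the lookup did not resolve
    return ra if ra in SUPPORTED_RAS else None


def fetch_ra(doi: str, timeout: float = 10.0) -> str | None:
    """Network call; kept separate so the parsing can be tested offline."""
    with urlopen(ra_lookup_url(doi), timeout=timeout) as resp:
        return parse_ra(json.load(resp))
```

Returning `None` for unsupported or unresolved DOIs lets the caller decide whether to offer manual entry instead.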
Extract mention info using RA APIs
As mentioned, I will implement support for extracting "works" from the Crossref and DataCite APIs.
Crossref API
Crossref offers a REST API with select and filter options. We use the works endpoint because it supports the select option, which returns only specific properties instead of the complete object (which is significantly larger).
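A sketch of how the works endpoint could be called with `filter` and `select`, and how one returned work could be reduced to a mention record. The chosen select fields, the function names and the output keys are my assumptions; the field names in the work object follow the Crossref response shown earlier in this thread:

```python
from __future__ import annotations

from typing import Any
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"
# Assumed subset of properties that a mention entry needs.
SELECT_FIELDS = ("DOI", "type", "title", "author", "publisher", "issued", "URL")


def crossref_doi_url(doi: str) -> str:
    """Works-list URL that filters on one DOI and selects a few properties."""
    params = {"filter": f"doi:{doi}", "select": ",".join(SELECT_FIELDS)}
    return f"{CROSSREF_WORKS}?{urlencode(params, safe=':/,')}"


def mention_from_crossref(work: dict[str, Any]) -> dict[str, Any]:
    """Flatten one Crossref work object into a small mention record."""
    titles = work.get("title") or []
    date_parts = (work.get("issued") or {}).get("date-parts") or [[]]
    first_date = date_parts[0] or []
    authors = [
        " ".join(p for p in (a.get("given"), a.get("family")) if p)
        for a in work.get("author") or []
    ]
    return {
        "doi": work.get("DOI"),
        "title": titles[0] if titles else None,
        "authors": authors,
        "publisher": work.get("publisher"),
        "year": first_date[0] if first_date else None,
        "type": work.get("type"),
        "url": work.get("URL"),
        "source": "Crossref",
    }
```

Note that the list route wraps results in `message.items`, so `mention_from_crossref` would be applied to each item rather than to `message` itself.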
DataCite API (GraphQL)
DataCite offers various options: REST API, GraphQL API and more.
query{
work(id:"10.1007/978-1-4939-9145-7_4"){
doi,
type,
types{
resourceType,
resourceTypeGeneral
},
sizes,
version,
titles(first: 1){
title
},
publisher,
publicationYear,
creators{
givenName,
familyName,
affiliation{
name
}
},
contributors{
givenName,
familyName,
affiliation{
name
}
}
}
}
You can paste the above query into the GraphiQL playground.
Searching for publication by title
While researching how to extract mention information from a DOI and examining the API endpoints of Crossref and DataCite, I discovered that these APIs also support searching for publications using text queries. If I am correct, the documentation mentions that both providers use Elasticsearch as their database, which enables extensive text search. At the moment I think it is most suitable to limit the (text) search to the publication title and to 30 results (per provider).
Crossref searching by title
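A possible shape for the Crossref title search, limited to 30 rows. This is a sketch under assumptions: I use the `query.bibliographic` field query, which as far as I know matches citation-like text (titles, but also authors and years), so it is broader than a strict title match. The function names and the select list are hypothetical:

```python
from __future__ import annotations

from typing import Any
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def crossref_title_search_url(title: str, rows: int = 30) -> str:
    """Works-list URL for a free-text search, capped at `rows` results."""
    params = {
        "query.bibliographic": title,  # assumed field query, see lead-in
        "rows": rows,
        "select": "DOI,type,title,author,publisher,issued",
    }
    return f"{CROSSREF_WORKS}?{urlencode(params, safe=',')}"


def items_from_search(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the list of works from a works-list response, or an empty list."""
    return (response.get("message") or {}).get("items") or []
```

Spaces in the title are encoded as `+`, so multi-word titles can be passed as typed by the user.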
DataCite search by title
query{
works(query:"titles.title:knime",first:30){
nodes{
doi,
type,
sizes,
version,
titles(first: 1){
title
},
descriptions(first:1){
description
},
publisher,
publicationYear,
creators{
givenName,
familyName,
affiliation{
name
}
},
contributors{
givenName,
familyName,
affiliation{
name
}
}
}
}
}
You can paste the above query into the GraphiQL playground.
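Outside the playground, the same search can be sent as a JSON POST request. Below is a sketch, assuming the GraphQL endpoint is `https://api.datacite.org/graphql`; the function names are hypothetical and the field list is shortened compared to the query above:

```python
from __future__ import annotations

import json
from typing import Any
from urllib.request import Request, urlopen

DATACITE_GRAPHQL = "https://api.datacite.org/graphql"  # assumed endpoint


def title_search_payload(term: str, first: int = 30) -> dict[str, str]:
    """Build the JSON body for a title search, capped at `first` results."""
    search = f"titles.title:{term}"
    query = (
        "query{ works(query:%s, first:%d){ nodes{ "
        "doi type titles(first: 1){ title } publisher publicationYear "
        "} } }"
    ) % (json.dumps(search), first)  # json.dumps quotes and escapes the string
    return {"query": query}


def nodes_from_response(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the matched works, or an empty list if the response has none."""
    return ((response.get("data") or {}).get("works") or {}).get("nodes") or []


def post_graphql(payload: dict[str, str], timeout: float = 10.0) -> dict[str, Any]:
    """Network call; kept separate so the helpers can be tested offline."""
    request = Request(
        DATACITE_GRAPHQL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

The search term is inserted as-is after `titles.title:`, so multi-word titles or terms with special query characters would still need extra quoting or escaping.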