Skip to content

Matcher for affiliations - link raw affiliation to ROR ids, country and RNSR

License

Notifications You must be signed in to change notification settings

dataesr/affiliation-matcher

Repository files navigation

Affiliation matcher

Discord Follow license GitHub release (latest by date) Tests Build

Goal

The affiliation matcher aims to automatically align an affiliation with different reference systems, including :

And specifically for French affiliations :

Methodology

The methodology is fully explained in a publication freely available on HAL: https://hal.archives-ouvertes.fr/hal-03365806.

Run it locally

⚠️ Please use docker-compose version 1.27.0 up to 1.19.2.

git clone [email protected]:dataesr/affiliation-matcher.git
cd affiliation-matcher
make docker-build start

Wait for Elasticsearch to be up. Then run :

make load

In your browser, you now have :

In python, you can call the matcher this way:

import requests
url = 'http://localhost:5004/match'
r=requests.post(url, json={
  "type": "ror", 
  "name": "Paris Dauphine University", 
  "city": "Paris",
  "country": "France",
  "verbose": False}
)
r.json()

For RoR, available criteria are: id, grid_id, name, city, country, supervisor_name, acronym, city_zone_emploi, city_nuts_level2, web_url, web_domain. Default strategies are detauked https://github.com/dataesr/affiliation-matcher/blob/master/project/server/main/match_ror.py

For RNSR, available criteria are: year, id, code_number, acronym, name, supervisor_name, supervisor_acronym, zone_emploi, city, web_url. Default strategies are detailed in https://github.com/dataesr/affiliation-matcher/blob/master/project/server/main/match_rnsr.py

Run unit tests

make test

Build docker image

make docker-build

Build python package

To generate the tarball package into the dist folder :

make python-build

To install the generated package into your project :

pip install /path/to/your/package.tar.gz

Then import the package into your python file

import affiliation-matcher

Release

It uses semver.

To create a new release:

make release VERSION=x.x.x

API

Match a single query /match

Query the API by setting your own strategies :

curl "YOUR_API_IP/match" -X POST -d '{"type": "YOUR_TYPE", "query": "YOUR_QUERY", "strategies": "YOUR_STRATEGIES", "year": "YOUR_YEAR"}'

YOUR_TYPE is optional, has to be a string and can be one of :

  • "country"
  • "grid"
  • "rnsr"
  • "ror"

By default, YOUR_TYPE is equal to "rnsr".

YOUR_QUERY is mandatory, has to be a string and is your affiliation text.

By example : IPAG Institut de Planétologie et d'Astrophysique de Grenoble.

YOUR_STRATEGIES is optional, has to be a 3 dimensional arrays of criteria (see next paragraph).

By example : [[["grid_name", "grid_country"], ["grid_name", "grid_country_code"]]].

YOUR_YEAR is optional, and can be used only if you use the "rnsr" matcher type, has te be a string.

By example : 1998.

By default, YOUR_YEAR is not set ie. it will be match over all years.

Match multiple queries /match_list

curl "YOUR_API_IP/match_list" -X POST -d '{"match_types": "YOUR_TYPES", "affiliations": "YOUR_AFFILIATIONS"}'

YOUR_TYPES is optional, has to be a list of string and can contain one of :

  • "country"
  • "grid"
  • "rnsr"
  • "ror"

By default, YOUR_TYPES is equal to ["grid", "rnsr"].

YOUR_AFFILIATIONS is optional, has to be a list of string. By example : ["affiliation_01", "affiliation_02"].

By default, YOUR_AFFILIATIONS is equal to [].

Criteria

Here is a list of the criteria available for the country matcher:

  • country_alpha3
  • country_name
  • country_subdivision_code
  • country_subdivision_name

Here is a list of the criteria available for the grid matcher:

  • grid_acronym
  • grid_acronym_unique
  • grid_cities_by_region [indirect]
  • grid_city
  • grid_country
  • grid_country_code
  • grid_department
  • grid_id
  • grid_name
  • grid_name_unique
  • grid_parent
  • grid_region

Here is a list of the criteria available for the rnsr matcher:

  • rnsr_acronym
  • rnsr_city
  • rnsr_code_number
  • rnsr_code_prefix
  • rnsr_country_code
  • rnsr_id
  • rnsr_name
  • rnsr_name_txt
  • rnsr_supervisor_acronym
  • rnsr_supervisor_name
  • rnsr_urban_unit
  • rnsr_web_url
  • rnsr_year
  • rnsr_zone_emploi [indirect]

Here is a list of the criteria available for the ror matcher:

  • ror_acronym
  • ror_acronym_unique
  • ror_city
  • ror_country
  • ror_country_code
  • ror_grid_id
  • ror_id
  • ror_name
  • ror_name_unique
  1. You can combine criteria to create a strategy.
  2. You can cumulate strategies to create a family of strategies.
  3. And then you can cumulate families of strategies to create the final object.
  4. This final object strategies is then a 3 dimensional array that you will give as an argument to the "/match" API endpoint. By example : [[["grid_name", "grid_country"], ["grid_name", "grid_country_code"]]].

Results

matcher precision recall
country 0.9953 0.9690
grid 0.7946 0.5944
rnsr 0.9654 0.8192
ror 0.8891 0.2356