This repository has been archived by the owner on Dec 2, 2021. It is now read-only.

RIALTO Combine Load Procedure

Michael J. Giarlo edited this page Oct 18, 2018 · 1 revision

Initial Load

Use these steps when writing to an empty store.

  1. cd ~/workspace/rialto-etl
  2. Make sure you have the latest ETL code.
  3. Get the SPARQL Proxy URL and API key from shared_configs, and put them into config/settings.local.yml or the corresponding environment variables.
  4. Connect to the Stanford VPN using full-tunnel mode.
  5. Test the connection by sending a simple count query to the SPARQL Proxy.
  6. Extract, Transform, Load - Organizations from Profiles
    1. Ensure you have the CAP/Profiles API key in either config/settings.local.yml or an environment variable. See shared_configs.
    2. Run the organization ETL steps
  7. Extract, Transform, Load - Researchers from Profiles
    1. Run the researcher ETL steps
  8. Extract, Transform, Load - Grants from SeRA
    1. Using the researchers.ndj file from the researchers extract step above, run the grant ETL steps. Note that researchers without SUNet IDs will not have their grants imported.
  9. Extract, Transform, Load - Publications from Web of Science
    1. Using the researchers.ndj file from the researchers extract step above, run the publications ETL steps. This process will create new co-authors, link publications to authors, create new topics, link topics to publications, and link publications to grants.
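
The connection test in step 5 could look roughly like the Python sketch below. It only builds the request and parses a response, so it can be read without VPN access. The `X-API-Key` header name, the example URL, and both function names are assumptions for illustration, not taken from the proxy's documentation; the query is posted as a form parameter in the style of the SPARQL 1.1 Protocol, which the proxy may or may not accept as-is.

```python
from urllib.parse import urlencode
from urllib.request import Request

# A simple query that counts every triple in the store.
COUNT_QUERY = "SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }"


def build_count_request(proxy_url, api_key):
    """Build (but do not send) a POST request carrying the count query."""
    body = urlencode({"query": COUNT_QUERY}).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "application/sparql-results+json",
        # Assumption: the proxy reads its key from this header.
        "X-API-Key": api_key,
    }
    return Request(proxy_url, data=body, headers=headers, method="POST")


def parse_count(results):
    """Extract the integer count from a SPARQL JSON results document."""
    return int(results["results"]["bindings"][0]["count"]["value"])
```

To actually run the check, pass the request to `urllib.request.urlopen`, decode the body with `json.load`, and hand the result to `parse_count`. On an empty store the count should be zero before the initial load and grow after each ETL step.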

Subsequent Loads

Use these steps when loading data into a store that already has data.

  1. Querying the data store for people returns everyone who has ever been affiliated, which is more people than we want to update given how long a load takes. We may want to re-query Profiles for "current people" instead, or we could mark people as "inactive"?
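
As a rough illustration of the two options raised above, the Python sketch below compares the identifiers already in the store with the identifiers Profiles reports as current. The function name, the key names, and the idea of comparing plain ID strings are all hypothetical; nothing here reflects how rialto-etl actually represents people.

```python
def partition_people(store_ids, current_ids):
    """Split people into those to refresh, those to add, and inactive ones.

    store_ids:   identifiers already present in the data store
    current_ids: identifiers Profiles reports as current
    """
    store = set(store_ids)
    current = set(current_ids)
    return {
        # Already in the store and still current: worth re-loading.
        "update": sorted(store & current),
        # Current in Profiles but not yet in the store: new people.
        "add": sorted(current - store),
        # In the store but no longer current: candidates for "inactive".
        "inactive": sorted(store - current),
    }
```

Only the `update` and `add` groups would need a full load, which keeps the run time tied to the number of current people and not to the historical total.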