
Harvesting using ResourceSync

Cory Lown edited this page Jan 16, 2024 · 5 revisions

The POD Aggregator uses ResourceSync, an extension of the Sitemaps Protocol, to expose aggregated data in three forms. The links below point to specific ResourceSync resource lists that serve as the starting point for harvesting.

Like uploads, harvesting data requires an access token. To harvest the data, start by fetching the appropriate sitemap, then retrieve the resource lists and resources it links to. The POD Aggregator currently supports baseline synchronization only; incremental synchronization (as defined by the ResourceSync specification) is not yet implemented. Note that you must parse and inspect the returned sitemaps to determine the URLs of the original or normalized data you wish to harvest.

Because harvesting with ResourceSync requires parsing the returned sitemaps to find the additional links to follow, we recommend using a ResourceSync client such as resync.
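For readers who want to see what the parsing step involves without a dedicated client, the following is a minimal sketch in Python. It assumes the returned documents follow the standard Sitemaps schema, where each entry carries its URL in a `loc` element. The sample XML and the `extract_locations` helper are hypothetical illustrations, not actual POD output or part of any POD tooling.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol, which ResourceSync extends.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical sample document for illustration; real POD responses may differ.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2024-01-16T00:00:00Z"/>
  <url>
    <loc>https://example.org/files/a.xml.gz</loc>
  </url>
  <url>
    <loc>https://example.org/files/b.xml.gz</loc>
  </url>
</urlset>
"""


def extract_locations(xml_text):
    """Return every <loc> URL found in a sitemap or sitemap index.

    Searching for all <loc> elements covers both <url> entries (in a
    urlset) and <sitemap> entries (in a sitemapindex).
    """
    root = ET.fromstring(xml_text)
    return [
        element.text.strip()
        for element in root.iter(f"{{{SITEMAP_NS}}}loc")
        if element.text
    ]


if __name__ == "__main__":
    for location in extract_locations(SAMPLE):
        print(location)
```

Each extracted URL is either another sitemap to fetch and parse in the same way or a data file to download; a client such as resync handles that distinction for you.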

API documentation

All requests must include an access token generated by the POD Aggregator.

| Method | URL | Description |
| --- | --- | --- |
| GET | https://pod.stanford.edu/organizations/resourcelist | Resource list for original uploads (updated in real-time; without normalization) |
| GET | https://pod.stanford.edu/organizations/normalized_resourcelist/$FLAVOR | Resource list for normalized data (flavor is either marc21 or marcxml) |
| GET | https://pod.stanford.edu/organizations/$ORG_CODE/streams/$STREAM_ID/normalized_resourcelist/$FLAVOR | Resource list for specific organization's normalized data in a specific flavor |
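As a rough illustration of how these endpoints fit together, the sketch below builds the normalized resource list URLs and attaches the access token as a bearer token in the `Authorization` header, matching the curl examples on this page. The helper names `normalized_resourcelist_url` and `authorized_request` are hypothetical and exist only for this example.

```python
import urllib.request

BASE_URL = "https://pod.stanford.edu"
FLAVORS = ("marc21", "marcxml")


def normalized_resourcelist_url(flavor, org_code=None, stream_id=None):
    """Build a normalized resource list URL (hypothetical helper).

    With only a flavor, this returns the aggregate list; with an
    organization code and stream ID, it returns that stream's list.
    """
    if flavor not in FLAVORS:
        raise ValueError(f"flavor must be one of {FLAVORS}")
    if org_code is None and stream_id is None:
        return f"{BASE_URL}/organizations/normalized_resourcelist/{flavor}"
    if org_code is None or stream_id is None:
        raise ValueError("org_code and stream_id must be given together")
    return (
        f"{BASE_URL}/organizations/{org_code}/streams/{stream_id}"
        f"/normalized_resourcelist/{flavor}"
    )


def authorized_request(url, access_token):
    """Prepare a GET request that carries the access token as a bearer token."""
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {access_token}"}
    )
```

A prepared request can then be sent with `urllib.request.urlopen(authorized_request(url, token))`, which returns the resource list XML for inspection.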

Full example (using curl)

# Set the access token in the environment (in real use, take care to keep your ACCESS_TOKEN secret)
$ export ACCESS_TOKEN="..." # put your access token here

# Get a resource list for normalized data to inspect
$ curl -H "Authorization: Bearer $ACCESS_TOKEN" --url https://pod.stanford.edu/organizations/normalized_resourcelist/marcxml

# Fetch the sitemap for one organization's stream of normalized data (its URL is listed in the output above)
$ curl -H "Authorization: Bearer $ACCESS_TOKEN" --url https://pod.stanford.edu/organizations/brown/streams/2020-11-17b/normalized_resourcelist/marcxml

Full example (using resync)

# resync requires Python, and the example assumes that Python is already installed.
# First, install resync:
$ pip install resync

# Fetch original, unnormalized uploads into a new directory called "pod"
$ resync-sync -v --sitemap https://pod.stanford.edu/organizations/resourcelist --access-token $ACCESS_TOKEN -b https://pod.stanford.edu/ pod

# Fetch normalized data as MARCXML
$ resync-sync -v --sitemap https://pod.stanford.edu/organizations/normalized_resourcelist/marcxml --access-token $ACCESS_TOKEN -b https://pod.stanford.edu/ pod

# Fetch only Brown University's normalized data as MARC21
$ resync-sync -v --sitemap https://pod.stanford.edu/organizations/brown/streams/2020-11-17b/normalized_resourcelist/marc21 --access-token $ACCESS_TOKEN -b https://pod.stanford.edu/ pod