Notebook for bulk downloading of AJCP material #58

wragge · 2022-05-04T23:41:05Z

See: https://twitter.com/MichWatsonOz/status/1521725616735014912

dleetalb · 2022-05-06T00:50:08Z

I'd like to be able to download sections from the AJCP digitised collection.

For instance, material from the Miscellaneous Series, London Missionary Society Collection.

From here, it would be great to search by three categories: name, date, and geographical location.

The example below shows the general data in the finding aid:

Letters mainly from missionaries in the Society, Hervey and Samoan Islands and also the New Hebrides, Loyalty Islands and Savage Island (Niue), 1862 - 1863 (File Box 29)

But what I'd really like is to harvest files based on the descriptive section of the file, as seen below. For my research, I would target information about Lawes.

The correspondents include Charles Barff (Huahine), P.G. Bird (Savaii, Apia), Stephen M. Creagh (Uea, Lifu), George Drummond (Upolu), Samuel Ella (Aneiteum), John Geddie (Aneiteum), Henry Gee (Apia), W. Wyatt Gill (Mangaia), James L. Green (Taha'a), William Howe (Papeete), John Jones (Mare), Ernst R.W. Krause (Rarotonga), William G. Lawes (Savage Island), Samuel Macfarlane (Lifu), George Morris (Raiatea), Archibald W. Murray (Malua), Henry Nisbet (Malua), George Platt (Raiatea), Thomas Powell (Tutuila), George Pratt (Matautu, Savage Island), Carl Schmidt (Apia), James Sleigh (Lifu) and George Turner (Sydney).

wragge · 2022-05-06T12:12:09Z

So to break this down:

1. You'd provide the notebook with a finding aid URL and a search term.
2. The notebook would then search for the term within the finding aid, getting a list of matching boxes/item groups.
3. The notebook would then download all of the images in those boxes.

Is that what you'd like?
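A minimal sketch of how those three steps might fit together. Everything here is hypothetical: `harvest`, `finding_aid_id`, and the three callables passed in (`search`, `list_images`, `download`) are placeholder names, not existing notebook functions, and the assumption that a finding aid URL contains an `nla.obj-` identifier is based only on the identifier format seen in this thread.

```python
import re
from typing import Callable, Iterable


def finding_aid_id(url: str) -> str:
    """Pull an identifier such as 'nla.obj-1126174847' out of a finding aid URL.

    Assumes the URL embeds the identifier somewhere in its path.
    """
    match = re.search(r"nla\.obj-\d+", url)
    if match is None:
        raise ValueError(f"No nla.obj identifier found in {url!r}")
    return match.group(0)


def harvest(
    finding_aid_url: str,
    term: str,
    search: Callable[[str, str], Iterable[str]],
    list_images: Callable[[str], Iterable[str]],
    download: Callable[[str], None],
) -> int:
    """Run the three steps and return the number of images downloaded.

    The callables are supplied by the caller (all hypothetical):
      search(fa_id, term)   -> identifiers of matching boxes/item groups
      list_images(item_id)  -> identifiers of the images in one box
      download(image_id)    -> saves one image
    """
    fa_id = finding_aid_id(finding_aid_url)
    count = 0
    for item_id in search(fa_id, term):
        for image_id in list_images(item_id):
            download(image_id)
            count += 1
    return count
```

Passing the three steps in as functions keeps the outline independent of how searching and downloading end up being implemented.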

wragge · 2022-05-06T12:18:45Z

Notes to self:

Searching within a finding aid fires off a POST request that returns an HTML fragment.

The params are something like this:

params = {"faIdentifier":"nla.obj-1126174847","term":"lawes","nuc":"ANL:AJCP","facets":"all","zone":"collection","selectedFacets":[],"pageSize":10,"cursorMark":"AoErc3UyMzcxMDI4Nzk=","start":1,"previous":["*"]}

They are posted as JSON to https://nla.gov.au/tarkine/nla.obj-1126174847/findingaid/search. Results are paginated -- increment the start value by the page size, so with 10 results per page the next page would be "start": 11. Looks like the number of results per page can be changed.
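A rough sketch of that request in Python, using only the standard library. The field names and URL pattern are copied from the params above. The function names are made up for illustration, and whether `cursorMark` can be left out on a first request is an untested assumption. `page_starts` needs a total result count, which the caller would have to obtain some other way (or just keep requesting pages until one comes back empty).

```python
import json
from urllib import request

# URL pattern taken from the observed request; fa_id is the finding aid identifier.
SEARCH_URL = "https://nla.gov.au/tarkine/{fa_id}/findingaid/search"


def build_payload(fa_id, term, start=1, page_size=10, cursor_mark=None):
    """Build the JSON body for one page of finding aid search results."""
    payload = {
        "faIdentifier": fa_id,
        "term": term,
        "nuc": "ANL:AJCP",
        "facets": "all",
        "zone": "collection",
        "selectedFacets": [],
        "pageSize": page_size,
        "start": start,
        "previous": ["*"],
    }
    # Assumption: cursorMark is optional, so only send it when supplied.
    if cursor_mark is not None:
        payload["cursorMark"] = cursor_mark
    return payload


def page_starts(total, page_size=10):
    """Yield the 'start' value for each page: 1, 11, 21, ... for page_size=10."""
    return range(1, total + 1, page_size)


def fetch_page(fa_id, term, start=1, page_size=10, cursor_mark=None):
    """POST one search request and return the HTML fragment as text."""
    body = json.dumps(
        build_payload(fa_id, term, start, page_size, cursor_mark)
    ).encode("utf-8")
    req = request.Request(
        SEARCH_URL.format(fa_id=fa_id),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=30) as response:
        return response.read().decode("utf-8")
```

For example, `fetch_page("nla.obj-1126174847", "lawes", start=11)` would request the second page of results for "lawes".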

Results are HTML so would need to scrape identifiers from the HTML for further processing.
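One possible way to do the scraping, assuming (unverified) that each result in the fragment links to its item with an `nla.obj-` identifier somewhere in the markup. Rather than depending on particular tags or classes, this sketch just collects every identifier-shaped string with a regular expression. `extract_identifiers` is a made-up name, and the sample HTML in the usage note is invented, not real output from the search endpoint.

```python
import re

# Matches identifiers of the form seen in this thread, e.g. nla.obj-1126174847.
NLA_ID = re.compile(r"nla\.obj-\d+")


def extract_identifiers(html_fragment, exclude=()):
    """Return unique nla.obj identifiers in order of first appearance.

    `exclude` lets the caller drop identifiers that are not results,
    such as the identifier of the finding aid itself.
    """
    excluded = set(exclude)
    # dict.fromkeys removes duplicates while preserving order.
    unique = dict.fromkeys(NLA_ID.findall(html_fragment))
    return [identifier for identifier in unique if identifier not in excluded]
```

With an invented fragment such as `'<a href="/nla.obj-123">Box 29</a> <a href="/nla.obj-123">view</a>'`, the function returns a single `nla.obj-123` entry. If the real fragment turns out to contain identifiers that are not search results, a proper HTML parser targeting the result links would be the safer option.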

wragge · 2022-05-30T06:23:42Z

Worth noting too that dezoomify (https://dezoomify.ophir.dev/) works a treat in downloading high-resolution versions of pages in the AJCP.

dleetalb · 2022-05-30T06:26:57Z

Thanks for the dezoomify link, Tim. Bart mentioned he spoke with you recently and just commented on how good the images are!

As for the query above, I think that sounds good!

wragge · 2024-07-10T06:04:21Z

Some recent notebooks should meet parts of this need:

In addition, the 'Images' tutorials in the Trove Data Guide provide detailed instructions on getting data out of a finding aid and then loading it into other tools for analysis/annotation:

There's also documentation about getting high-res versions of images in the TDG, here and here.

I'll think more about the searching part of this issue as I do some more work on Finding Aids as part of wragge/trove-data-guide#157
