Sample, annotate and apply computer vision models to the Newspaper Navigator dataset
nnanno
is a modest collection of tools to help work with the delightful Newspaper Navigator data.
Newspaper Navigator is a project which extracted visual content (pictures, maps etc.) from the Library of Congress Chronicling America digitised newspaper collection.
Newspaper Navigator has released data in a number of formats including json
files which contain a range of metadata about the newspaper from which the image was taken from. nnanno
was thrown together to help work with this collection as part of the preparation of some example datasets used in a series of Programming Historian tutorials (currently under review). Since this code was developed using the wonderful nbdev library it is possible to install it for your own use.
nnanno doesn't to provide an end-to-end to end software for using machine learning with the Newspaper Navigator data. Instead it is a minimal collection of code that may help you if you want to work with the Newspaper Navigator data.
There are three particular areas where nnanno tries to help a little:
- sampling subsets from the full Newspaper Navigator data
- annotating this sample with additional labels using label studio
- inference (experimental 😬) running inference i.e making new predictions with a machine learning model on the newspaper navigator dataset using IIIF.
This code was written mainly to help develop some example datasets for a series of Programming Historian tutorials. It has some tests but there are likely bugs and issues with the code. The code in this repository was all written in notebooks, some people hate notebooks. Those people will probably hate this too 🤷♂️
If you want to work with the full Newspaper Navigator data you will likely be better of accessing it via the provided S3 bucket see news-navigator.labs.loc.gov/ for more information.
This code was written using nbdev
. This is a tool that helps use Jupyter notebooks for developing Python libraries. Inside the documentation you will see code cells followed by output. This is generated from a Jupyter notebook and shows the actual output of the code rather than something that has been copied and pasted for example:
import datetime
print(datetime.date.today())
2021-02-02
Is evaluated as Python code. This also means all of the documentation and examples can be opened in notebooks and the code inspected, changed and run.
You can install nnanno
via PyPI
:
pip install nnanno
This will install all of the code for sampling from Newspaper Navigator data.
If you want to use all of the parts of nnanno
you'll need to install some extra packages.
If you also want to use label-studio
for annotating you will need to install this too. There are various ways in which you may want to install label-studio. See the label studio documentation for various options for setting this up.
If you want to use the experimental inference functionality you will need to install fastai. See the fastai docs for options for doing this.
The documentation can be viewed as rendered pages with the option of opening as notebooks in Google Colab.
You can find an 'end-to-end' example in examples folder of the documentation.
This example goes through the process of sampling, annotating, training a model and predicting this against the newspaper navigator data.
The three main areas of nnanno
are shown below. The examples in the documentation and in the "end-to-end" example shows this functionality in more detail.
nnanno
can be used to create samples from the Newspaper Navigator data:
from nnanno.sample import *
sampler = nnSampler()
df = sampler.create_sample(1,
'photos',
start_year=1850,
end_year=1855,
step=1)
This returns a dataframe containing samples from the Newspaper Navigator data (loaded via JSON) into a Pandas DataFrame.
df.columns
Index(['filepath', 'pub_date', 'page_seq_num', 'edition_seq_num', 'batch',
'lccn', 'box', 'score', 'ocr', 'place_of_publication',
'geographic_coverage', 'name', 'publisher', 'url', 'page_url'],
dtype='object')
The annotation part of nnanno is mainly a little bit of documentation and a few functions to help setup annotation of a sample from Newspaper Navigator using IIIF urls and the label studio annotation tool. Since we can annotate via IIIF this offers a way of annotating without having to download large amounts of data locally.
from nnanno.annotate import create_label_studio_json
The inference section of nnanno attempts to show one possible way to use IIIF to run inference against samples of Newspaper Navigator using a trained fastai model. This will allow you to make predictions against larger parts of the Newspaper Navigator data.
from nnanno.inference import *
from fastai.vision.all import *
dls = ImageDataLoaders.from_csv('../ph/ads/',
'ads_upsampled.csv',
folder='images',
fn_col='file',
label_col='label',
item_tfms=Resize(64,ResizeMethod.Squish),
num_workers=0)
learn = cnn_learner(dls, resnet18, metrics=F1Score())
learn.fit(1)
epoch | train_loss | valid_loss | f1_score | time |
---|---|---|---|---|
0 | 0.924742 | 0.963872 | 0.645570 | 00:12 |
With a trained fastai model we can predict on a sample from Newspaper Navigator
predictor = nnPredict(learn, try_gpu=False)
predictor.predict_sample('ads','testinference',0.01,end_year=1850)
This returns a json
file for each year from the sample containing the original newspaper navigator data plus the predictions from your model
df = pd.read_json('testinference/1850.json')
We can access the 'decoded' predictions
df['pred_decoded'].value_counts()
text-only 50
illustrations 38
Name: pred_decoded, dtype: int64
or work with the probabilities directly
df.iloc[:5,-2]
0 0.152435
1 0.027183
2 0.096673
3 0.395800
4 0.957422
Name: illustrations_prob, dtype: float64
This work was support by Living with Machines. This project, funded by the UK Research and Innovation (UKRI) Strategic Priority Fund, is a multidisciplinary collaboration delivered by the Arts and Humanities Research Council (AHRC), with The Alan Turing Institute, the British Library and the Universities of Cambridge, East Anglia, Exeter, and Queen Mary University of London.