

Purpose

This is experimental python machine-learning code, exploring the use of a vision-model to create embeddings for clustering related Hall-Hoag images in the Brown Digital Repository.

Preliminary results

These are very preliminary results; this is an outside-of-work side-project.

Nevertheless, the results indicate that this embedding approach could be a very useful tool to group related images.

This google-spreadsheet shows clustering results for 50 of the 587 images in the AFL-CIO Hall-Hoag organization. It's useful to skip back and forth from the spreadsheet to the AFL-CIO page (the "Has Parts" section) to better understand the groupings. The important parts are the last three columns.

The columns "01-groups" and "02-groups" both show the predicted grouping output, from my code. The difference between the two is from a single change to one setting. (The technique I used, described below, offers the ability to adjust lots of settings.) The column "real-groups" shows the actual proper grouping -- from a manual inspection of the images.

The alternating light-yellow and light-red cells highlight groupings of the "real" images for comparison with the 2 prediction columns. (A clarification note: Look at the last grouping... Note that it doesn't matter that the "real" column has the number "10" for all the related images while the other two prediction-columns have the numbers "15" and "12". The important thing is the grouping, not the actual numbers.)
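
For illustration, here's a minimal sketch of the kind of clustering step that produces those group numbers -- not this project's actual code. It assumes numpy and scikit-learn, and the distance_threshold value stands in for the sort of single setting whose change yields different groupings:

    # Hedged sketch: assumes numpy and scikit-learn; not the project's actual code.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    # Pretend each row is the embedding vector for one scanned image.
    embeddings = np.random.rand(50, 512)

    def cluster(vectors: np.ndarray, threshold: float) -> np.ndarray:
        """Groups embedding vectors; returns an arbitrary integer label per image."""
        model = AgglomerativeClustering(
            n_clusters=None,            # let the threshold decide how many groups
            metric='cosine',
            linkage='average',
            distance_threshold=threshold,
        )
        return model.fit_predict(vectors)

    labels_a = cluster(embeddings, threshold=0.10)  # one setting...
    labels_b = cluster(embeddings, threshold=0.15)  # ...and a single change to it

    print(labels_a[:10])
    print(labels_b[:10])

The label numbers the clusterer returns are arbitrary, which is why only the groupings themselves (not the specific numbers) matter when comparing the spreadsheet columns.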

Problem being addressed

Context: a Hall-Hoag organization, like this AFL-CIO organization, can have hundreds or thousands of scanned-images. There might be two pages for a flyer (front and back), followed by four pages for a letter, followed by 15 pages for a report. But all of those pages are just single-page images with zero item-level-metadata (not even a title). They're not "grouped" as a flyer, a letter, a report -- in any way.

For very good reasons we're ingesting these this way -- but obviously we'd eventually like to be able to make improvements. Because we'll end up having some 900,000 scanned images, we'll be looking at programmatic ways to create metadata -- and to group related images. Thus this preliminary experiment, to get a feel for how "embeddings" can be used to cluster related images.

Embeddings -- an overview

I've mentioned that this clustering technique uses a vision-model to create "embeddings". An explanation...

Imagine you're filling out a detailed checklist for a series of images of dogs. Imagine the checklist includes things like:

  • Height (short, medium, tall)
  • Fur length (short, medium, long)
  • Ear shape (pointy, floppy, round)
  • Color (black, brown, white, spotted, etc.)
  • Snout length (short, medium, long)

Each dog-picture gets a unique set of answers, and you can compare dogs by how many checklist items match.

You could say things like "Height" and "Fur" are "categories" or "features", and a "value" is entered for each feature.

It makes sense that the more features you track, the more detailed your understanding of the dog image.

Now, instead of entering words for the checklist-values, imagine that each of these features gets a number instead.
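
To make the checklist-as-numbers idea concrete, here's a toy python sketch (the numbers are made up for illustration, not the output of any model):

    # Each dog is represented as a short list of numbers, one per feature:
    # height, fur length, ear shape, color, snout length.
    dog_a = [0.2, 0.9, 0.1, 0.4, 0.7]
    dog_b = [0.3, 0.8, 0.1, 0.5, 0.6]   # numbers close to dog_a -- a similar dog
    dog_c = [0.9, 0.1, 0.8, 0.1, 0.2]   # numbers far from dog_a -- a very different dog

    def distance(x, y):
        """Smaller distance means the two 'checklists' are more alike."""
        return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

    print(distance(dog_a, dog_b))  # small number
    print(distance(dog_a, dog_c))  # large number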

This is similar to what a vision-model does:

  • Instead of a few features like color and ear-shape, a vision-model can track hundreds or even thousands of subtle details. (Those "features" are called "dimensions" in machine-learning jargon.)

  • Instead of assigning words to a category, a vision-model assigns a number, and stores its evaluation as a long list of numbers. (Called an "embedding vector" in machine-learning jargon.)

  • Instead of checking which words match, we can examine those numbers. By comparing them mathematically, we can determine which images are most similar -- closer numbers in the same dimensions mean more similarity.
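
As a rough sketch of those steps in code (assuming the sentence-transformers package and its "clip-ViT-B-32" vision-model, plus two hypothetical image filenames -- not necessarily the model or files this project uses):

    # Hedged sketch: assumes sentence-transformers and Pillow are installed.
    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('clip-ViT-B-32')

    # Each image becomes a long list of numbers (an embedding vector).
    embeddings = model.encode(
        [Image.open('page_001.jpg'), Image.open('page_002.jpg')]
    )

    # Cosine similarity: values closer to 1.0 mean the images are more alike.
    similarity = util.cos_sim(embeddings[0], embeddings[1])
    print(float(similarity))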

Future work

Regarding this embedding technique, there's lots more experimenting that could be done:

  • I could try different vision-models.
  • I could try different versions of a given vision model (I chose a somewhat smaller one favoring speed over accuracy).
  • There are different settings I could change in the process of creating the numeric-representations of each image.
  • There are different settings I could change in the process of evaluating similarity.
  • For speed of processing, I'm using images that are only one-fourth the size of the full images. Perhaps the higher-resolution images would yield better results.
  • I'm only passing images to the vision-model. I could integrate a parallel process of passing OCRed text to a model (or instructing a multi-modal model to read the text) -- and create embeddings from that, as well.

Likely the best clustering results will involve using a variety of techniques, not only embeddings. There will be many situations where programmatic grouping is likely to prove challenging. A typical example would be where the colors and design of a "cover-page" are very different from the more predictable "article-pages" that follow. Still, this could be a very valuable "initial-pass" that, by itself and certainly with human followup, could facilitate useful grouping.

Usage

Assuming the wonderful python package uv is installed (info link):

$ uv run ./cc__hh_ml_code.py

Other

  • One oddity of the code is that it stores all the vector-arrays (lists of numbers) to an sqlite database -- but then reads all that data into memory at the "determine-similarity" step. That's because sqlite doesn't natively support vector-searching. For the number of images I was dealing with in this experiment, that's not a problem -- but it'd be worth investigating sqlite extensions that would allow direct db-vector-searches, or using a db that supports them. (A small sketch of this store-then-load pattern appears after this list.)

  • A sincere thanks to Brendan Quinn and Michael B. Klein of the Northwestern University Library for their terrific full-day code4lib-con-2024 LLM-workshop that really solidified my sense of the concept of embeddings, as well as how to work with them programmatically.
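
For illustration, here's a minimal sketch of that store-then-load pattern (the table and column names are hypothetical, not from this project's code; it assumes numpy and python's built-in sqlite3 module):

    # Hedged sketch: store embedding vectors as blobs in sqlite, then read them all
    # back into memory for the similarity step. Names here are hypothetical.
    import sqlite3
    import numpy as np

    conn = sqlite3.connect('embeddings.db')
    conn.execute(
        'CREATE TABLE IF NOT EXISTS embeddings (image_id TEXT PRIMARY KEY, vector BLOB)'
    )

    def save_embedding(image_id: str, vector: np.ndarray) -> None:
        """Serializes the vector to bytes and stores it as a blob."""
        conn.execute(
            'INSERT OR REPLACE INTO embeddings (image_id, vector) VALUES (?, ?)',
            (image_id, vector.astype(np.float32).tobytes()),
        )
        conn.commit()

    def load_all_embeddings() -> dict[str, np.ndarray]:
        """Reads every stored vector back into memory for the similarity step."""
        rows = conn.execute('SELECT image_id, vector FROM embeddings').fetchall()
        return {image_id: np.frombuffer(blob, dtype=np.float32) for image_id, blob in rows}

    save_embedding('page_001', np.random.rand(512))
    print(len(load_all_embeddings()))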


(end)