Bayesian format obsolescence modeling #116

Open: 1 commit to merge into base: main

@ross-spencer commented Mar 8, 2021

Introduces a new digital preservation plot which, over a long enough period of time, may offer a mechanism for predicting format obsolescence across a repository.

The plot represents the distribution of file-format instance creation dates. Thicker lines indicate a greater number of format instances recorded for a single date, and the less dense portions of the distribution on either side of the rug plot show fewer instances being recorded. Given a long enough gap between today and the last lines on the right, the plot may begin to show a file format in the process of becoming obsolete.
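To make the reading of the plot concrete, here is a minimal sketch of the two quantities described above: how many instances share a date (the "thickness" of a line) and how long it has been since the last recorded instance. The function name and the sample dates are hypothetical and are not taken from the AIPscan code in this PR.

```python
from collections import Counter
from datetime import date


def summarize_format_dates(creation_dates, today):
    """Return per-date instance counts and the days since the latest date."""
    # More instances on one date corresponds to a thicker line in the rug plot.
    counts = Counter(creation_dates)
    # The gap between today and the last line is the possible obsolescence signal.
    last_seen = max(counts)
    gap_days = (today - last_seen).days
    return counts, gap_days


# Hypothetical creation dates for a single file format.
dates = [
    date(2009, 5, 1),
    date(2009, 5, 1),
    date(2011, 3, 14),
    date(2012, 7, 30),
]
counts, gap_days = summarize_format_dates(dates, today=date(2021, 3, 8))
print(counts[date(2009, 5, 1)], gap_days)
```

How large a gap has to be before it suggests obsolescence is not something this PR defines, so the sketch reports the gap rather than applying a threshold.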

This PR also introduces a new plotting library to AIPscan, Seaborn, which offers the type of rug plot required to approximate @nkrabben's original work (also here), on which this PR is based.

Seaborn produces PNG images of plots. To render these in AIPscan, we write them into memory and then encode the byte stream as Base64, which can be embedded in HTML. This approach might be useful for other plot types and libraries in AIPscan in the future. Seaborn also appears to be a powerful visualization tool.
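The in-memory rendering step can be sketched roughly as follows. This is not the code from the PR: the helper name is hypothetical, and the buffer is filled with stand-in bytes so that the example needs only the standard library. With a real plot, the buffer would instead be filled by something like matplotlib's `figure.savefig(buffer, format="png")`.

```python
import base64
import io


def png_buffer_to_img_tag(buffer, alt="plot"):
    """Encode the PNG bytes held in an in-memory buffer as an HTML <img> tag."""
    encoded = base64.b64encode(buffer.getvalue()).decode("ascii")
    # A data URI lets the template embed the image without writing a file to disk.
    return f'<img src="data:image/png;base64,{encoded}" alt="{alt}">'


# Stand-in content: only the 8-byte PNG signature, not a complete image.
buffer = io.BytesIO(b"\x89PNG\r\n\x1a\n")
tag = png_buffer_to_img_tag(buffer)
print(tag)
```

Because the encoded string is ordinary text, it can be passed to a template as a variable like any other report data.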

With only a short period of time to polish this off, it is unlikely that it meets all of the code quality requirements needed to merge. Hopefully the work is close for whoever picks it up: some tests are included, and most of the basic principles followed so far in AIPscan have been observed.

For testing, the cURL command for the API endpoint is: curl -X GET "http://127.0.0.1:5000/api/report-data/bayesian-format-modeling/1" -H "accept: application/json" | python -m json.tool
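For anyone who would rather test from Python than from the shell, a rough standard-library equivalent of the cURL command might look like the sketch below. The function names are hypothetical, and the trailing `1` in the URL is treated as an opaque identifier because the PR does not say what it refers to.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5000"


def report_url(identifier, base_url=BASE_URL):
    """Build the endpoint URL used in the cURL command above."""
    return f"{base_url}/api/report-data/bayesian-format-modeling/{identifier}"


def fetch_report(identifier, base_url=BASE_URL):
    """GET the report data as parsed JSON (needs a running AIPscan instance)."""
    request = urllib.request.Request(
        report_url(identifier, base_url),
        headers={"accept": "application/json"},
        method="GET",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a local instance running, `print(json.dumps(fetch_report(1), indent=4))` gives output comparable to piping the cURL response through `python -m json.tool`.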

Approximate time to write report code: 9 hrs + added time to learn Seaborn.

A note on distributions: The data I have available is not authentic enough to demonstrate realistic distribution patterns. Because it was largely downloaded from sources such as other GitHub repositories, the dates tend to clump on the day that something was downloaded, with a few organic outliers on either side. I am still investigating methods for simulating synthetic data, but ultimately I am looking forward to seeing charts like these generated from a real repository.
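As one possible starting point for simulated data, the sketch below draws creation dates from a triangular distribution so that instances rise toward a peak and then tail off, rather than clumping on a single download day. This is an illustrative assumption only, not a method from this PR: the function name, the choice of distribution, and all of the dates are hypothetical.

```python
import random
from datetime import date, timedelta


def simulate_creation_dates(count, start, end, peak, seed=0):
    """Draw sorted creation dates between start and end, densest around peak."""
    rng = random.Random(seed)  # seeded so that the sample is reproducible
    span = (end - start).days
    mode = (peak - start).days
    return sorted(
        start + timedelta(days=round(rng.triangular(0, span, mode)))
        for _ in range(count)
    )


# Hypothetical format lifetime: first seen in 2000, peaking in 2005, last seen in 2015.
sample = simulate_creation_dates(
    500, start=date(2000, 1, 1), end=date(2015, 1, 1), peak=date(2005, 1, 1)
)
print(sample[0], sample[-1])
```

A real repository is unlikely to follow so tidy a shape, so data like this is only useful for exercising the plot, not for drawing conclusions about obsolescence.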
