This repository has been archived by the owner on Nov 21, 2024. It is now read-only.

The Scroll routine appears to return incorrect data #4

Open
righthalfplane opened this issue Sep 28, 2024 · 7 comments

Comments

@righthalfplane

I ran a simple example:

import vesuvius

scroll = vesuvius.Volume("Scroll1")
img = scroll[1000, 5000:5256, 5000:5256]  # one 256x256 slice
with open("file1p.raw", "wb") as binary_file:
    binary_file.write(img)

and I computed a histogram of the byte values in "file1p.raw" (16 bins per row, 256 bins total):

4704 58 41 46 56 37 57 53 0 0 0 0 0 0 0 0
57 45 47 58 53 52 60 63 0 0 0 0 0 0 0 0
66 72 45 73 58 75 68 75 0 0 0 0 0 0 0 0
95 79 84 102 83 112 103 106 0 0 0 0 0 0 0 0
137 141 148 162 181 176 216 202 0 0 0 0 0 0 0 0
379 328 341 418 405 435 478 526 0 0 0 0 0 0 0 0
891 908 964 1006 1077 1134 1190 1247 0 0 0 0 0 0 0 0
1783 1716 1741 1722 1774 1857 1917 1919 0 0 0 0 0 0 0 0
1852 1982 1911 1806 1816 1811 1709 1614 0 0 0 0 0 0 0 0
1336 1248 1219 1072 1131 1067 987 969 0 0 0 0 0 0 0 0
608 577 530 532 490 510 459 440 0 0 0 0 0 0 0 0
272 246 250 230 225 207 192 170 0 0 0 0 0 0 0 0
113 134 96 109 97 99 96 81 0 0 0 0 0 0 0 0
64 50 47 43 41 33 47 38 0 0 0 0 0 0 0 0
26 27 29 31 28 22 28 28 0 0 0 0 0 0 0 0
12 14 26 18 14 19 13 243 0 0 0 0 0 0 0 0

Note that only 128 non-zero values appear, in runs of 8 separated by gaps of 8 zeros.
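For reference, a histogram table like the one above can be reproduced with a short script. This is a sketch that assumes "file1p.raw" holds raw uint8 bytes; the synthetic-write line at the top only exists to make the example self-contained:

```python
import numpy as np

# Stand-in for the file written above (synthetic bytes, for a runnable demo)
np.random.default_rng(0).integers(0, 256, 65536, dtype=np.uint8).tofile("file1p.raw")

img = np.fromfile("file1p.raw", dtype=np.uint8)
hist = np.bincount(img, minlength=256)   # counts for byte values 0..255
for row in hist.reshape(16, 16):         # print 16 bins per line
    print(" ".join(str(v) for v in row))
```

With the real file, the zero-gap pattern shows up as runs of 8 empty bins in every row.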

The vesuvius-c routines show the same problem.

@bostelk

bostelk commented Oct 24, 2024

The Zarr volume was normalized and quantized (16 bits to 8 bits) to reduce its size on disk. The zeros could be empty air that was clipped outside of the value range representing other, denser materials like papyrus. To my knowledge the precision loss hasn't affected ink detection or other downstream applications, though retraining on the new dataset was required. (I'm paraphrasing from the Discord channel here.)

So it's not an issue in how the API accesses the data. There are zeros in the original dataset too, but they are less frequent.
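For illustration, a normalize-and-quantize step like the one described might look something like the sketch below. The function name, clip range, and rounding choice are assumptions for the example, not the actual conversion pipeline:

```python
import numpy as np

def quantize_16_to_8(vol16, lo, hi):
    """Clip 16-bit values to [lo, hi], rescale to 0..255, truncate to uint8.

    Everything at or below `lo` (e.g. empty air) collapses to 0, which is
    one plausible source of the extra zeros in the 8-bit volume.
    """
    clipped = np.clip(vol16.astype(np.float64), lo, hi)
    return ((clipped - lo) / (hi - lo) * 255).astype(np.uint8)

# Toy example with an assumed clip range of [20000, 60000]
v16 = np.array([0, 15000, 20000, 40000, 60000, 65535], dtype=np.uint16)
print(quantize_16_to_8(v16, 20000, 60000))
```

Note that a straightforward rescale like this maps values densely onto 0..255; it would not by itself leave periodic gaps in the histogram.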

Comparison

Here's a quick comparison between the two data sources in the region you highlighted.

[comparison image]

@righthalfplane
Author

I do not know what your chart is showing. The returned data set has 65536 (256x256) pixels in it, not 306912 points, and it looks like the included plot.

[Screen Shot 2024-10-26 at 7 28 48 AM]

@righthalfplane
Author

@bostelk - What version of Python are you using? On Ubuntu 24.10, python2 would not completely install. I was able to do the install with python3, but only after putting in some fake links to the python2 components. The data returned had the same holes as I found on macOS 12.7.3. Your histogram of the returned data looks very much like mine, except that the holes are filled in and it has too many points.

@bostelk

bostelk commented Oct 29, 2024

I'm sorry, my earlier comparison was a bad illustration. I had copied the image into an editor, and its size was larger, hence the wrong number of points and the interpolated graph. I'm using Python 3.

I opened the volume in a different viewer (https://dl.ash2txt.org/view/Scroll1) and the issue (???) is apparent there too. So I don't think it's caused by the reader but rather in how the volume was created/converted from a higher-precision dataset. I don't have a concrete explanation, only speculation.

New Comparison

[Figure_1]

@jrudolph

Good catch! It seems that for some reason the quantization process ended up setting (only?) bit 3 to zero, so that all resulting numbers look like xxxx0xxx in binary. If the intention was to quantize to 4 bits, the data set ended up with more precision than intended. But it also misrepresents the original data: blanking a middle bit shifts the affected values down by 8 into the next lower bucket, which introduces an unintended bias.
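This hypothesis is easy to check numerically: clearing bit 3 of every byte reproduces exactly the reported pattern. The sketch below uses synthetic data, not the actual conversion code:

```python
import numpy as np

# Synthetic 8-bit data standing in for the quantized volume
rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=65536, dtype=np.uint8)

masked = data & 0b11110111          # clear bit 3, as hypothesized
hist = np.bincount(masked, minlength=256)

# Every bin whose index has bit 3 set is empty: 128 usable values,
# in runs of 8 non-zero bins alternating with runs of 8 zeros.
print(sum(1 for i in range(256) if hist[i] == 0))  # 128
```

The alternating 8-on/8-off structure of this histogram matches the table righthalfplane posted above.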

@righthalfplane
Author

righthalfplane commented Oct 29, 2024

I have been complaining about this problem for a month, and I see in the general discussion that people are going to fix the problem and redo some work - will the fix get into the C and Python routines?

@stephenrparsons
Member

Yes, and thanks for bringing attention to it. Those libraries pull from the same data source, so as soon as the volumes are updated on the server, both libraries will have the revised data. You may wish to clear the local cache, if you have one, to make sure you get the new versions.
