This repository has been archived by the owner on Nov 21, 2024. It is now read-only.

The Scroll routine appears to return incorrect data #4

Open
righthalfplane opened this issue Sep 28, 2024 · 7 comments

Comments

@righthalfplane

I ran a simple example:

import vesuvius

scroll = vesuvius.Volume("Scroll1")
img = scroll[1000, 5000:5256, 5000:5256]  # one 256x256 slice
with open("file1p.raw", "wb") as binary_file:
    binary_file.write(img)

and I computed a histogram of the byte values in "file1p.raw" (16 bins per row, 256 bins total):

4704 58 41 46 56 37 57 53 0 0 0 0 0 0 0 0
57 45 47 58 53 52 60 63 0 0 0 0 0 0 0 0
66 72 45 73 58 75 68 75 0 0 0 0 0 0 0 0
95 79 84 102 83 112 103 106 0 0 0 0 0 0 0 0
137 141 148 162 181 176 216 202 0 0 0 0 0 0 0 0
379 328 341 418 405 435 478 526 0 0 0 0 0 0 0 0
891 908 964 1006 1077 1134 1190 1247 0 0 0 0 0 0 0 0
1783 1716 1741 1722 1774 1857 1917 1919 0 0 0 0 0 0 0 0
1852 1982 1911 1806 1816 1811 1709 1614 0 0 0 0 0 0 0 0
1336 1248 1219 1072 1131 1067 987 969 0 0 0 0 0 0 0 0
608 577 530 532 490 510 459 440 0 0 0 0 0 0 0 0
272 246 250 230 225 207 192 170 0 0 0 0 0 0 0 0
113 134 96 109 97 99 96 81 0 0 0 0 0 0 0 0
64 50 47 43 41 33 47 38 0 0 0 0 0 0 0 0
26 27 29 31 28 22 28 28 0 0 0 0 0 0 0 0
12 14 26 18 14 19 13 243 0 0 0 0 0 0 0 0

Note that only 128 non-zero values appear, in runs of 8 separated by gaps of 8 zeros.
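For reference, a histogram table like the one above can be reproduced with a short script. This is a sketch that assumes "file1p.raw" holds raw uint8 bytes; the synthetic-write line at the top only exists to make the example self-contained:

```python
import numpy as np

# Stand-in for the file written above (synthetic bytes, for a runnable demo)
np.random.default_rng(0).integers(0, 256, 65536, dtype=np.uint8).tofile("file1p.raw")

img = np.fromfile("file1p.raw", dtype=np.uint8)
hist = np.bincount(img, minlength=256)   # counts for byte values 0..255
for row in hist.reshape(16, 16):         # print 16 bins per line
    print(" ".join(str(v) for v in row))
```

With the real file, the zero-gap pattern shows up as runs of 8 empty bins in every row.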

The vesuvius-c routines show the same problem.

@bostelk

bostelk commented Oct 24, 2024

The Zarr volume was normalized and quantized (16 bits to 8 bits) to reduce its size on disk. The zeros could be empty air that was clipped outside of the value range representing other, denser materials like papyrus. To my knowledge the precision loss hasn't affected ink detection or other downstream applications, though retraining on the new dataset was required. (I'm paraphrasing from the Discord channel here.)

So it's not an issue in how the API accesses the data. There are zeros in the original dataset too, but they are less frequent.
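For illustration, a normalize-and-quantize step like the one described might look something like the sketch below. The function name, clip range, and rounding choice are assumptions for the example, not the actual conversion pipeline:

```python
import numpy as np

def quantize_16_to_8(vol16, lo, hi):
    """Clip 16-bit values to [lo, hi], rescale to 0..255, truncate to uint8.

    Everything at or below `lo` (e.g. empty air) collapses to 0, which is
    one plausible source of the extra zeros in the 8-bit volume.
    """
    clipped = np.clip(vol16.astype(np.float64), lo, hi)
    return ((clipped - lo) / (hi - lo) * 255).astype(np.uint8)

# Toy example with an assumed clip range of [20000, 60000]
v16 = np.array([0, 15000, 20000, 40000, 60000, 65535], dtype=np.uint16)
print(quantize_16_to_8(v16, 20000, 60000))
```

Note that a straightforward rescale like this maps values densely onto 0..255; it would not by itself leave periodic gaps in the histogram.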

Comparison

Here's a quick comparison between the two data sources in the region you highlighted.

[comparison image]

@righthalfplane
Author

I do not know what your chart is showing. The returned data set has 65536 (256x256) pixels in it, not 306912 points, and it looks like the included plot.

[Screen Shot 2024-10-26 at 7 28 48 AM]

@righthalfplane
Author

@bostelk - What version of Python are you using? On Ubuntu 24.10, python2 would not completely install. I was able to do the install with python3, but only after putting in some fake links to the python2 components. The data returned had the same holes as I found on macOS 12.7.3. Your histogram of the returned data looks very much like mine, except that the holes are filled in and it has too many points.

@bostelk

bostelk commented Oct 29, 2024

I'm sorry, my earlier comparison was a bad illustration. I had copied the image into an editor, and its size was larger, hence the wrong number of points and the interpolated graph. I'm using Python 3.

I opened the volume in a different viewer (https://dl.ash2txt.org/view/Scroll1) and the issue (???) is apparent there too. So I don't think it's caused by the reader but rather in how the volume was created/converted from a higher-precision dataset. I don't have a concrete explanation, only speculation.

New Comparison

[Figure_1]

@jrudolph

Good catch! It seems that for some reason the quantization process ended up setting (only?) bit 3 to zero, so that all resulting numbers look like xxxx0xxx in binary. If the intention was to quantize to 4 bits, the data set ended up with more precision than intended. But it also misrepresents the original data: blanking a middle bit shifts the affected values down by 8 into the next lower bucket, which introduces an unintended bias.
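This hypothesis is easy to check numerically: clearing bit 3 of every byte reproduces exactly the reported pattern. The sketch below uses synthetic data, not the actual conversion code:

```python
import numpy as np

# Synthetic 8-bit data standing in for the quantized volume
rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=65536, dtype=np.uint8)

masked = data & 0b11110111          # clear bit 3, as hypothesized
hist = np.bincount(masked, minlength=256)

# Every bin whose index has bit 3 set is empty: 128 usable values,
# in runs of 8 non-zero bins alternating with runs of 8 zeros.
print(sum(1 for i in range(256) if hist[i] == 0))  # 128
```

The alternating 8-on/8-off structure of this histogram matches the table righthalfplane posted above.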

@righthalfplane
Author

righthalfplane commented Oct 29, 2024

I have been complaining about this problem for a month, and I see in the general discussion that people are going to fix the problem and redo some work - will the fix get into the C and Python routines?

@stephenrparsons
Member

Yes, and thanks for bringing attention to it. Those libraries pull from the same data source, so as soon as the volumes are updated on the server, both libraries will have the revised data. You may wish to clear the local cache, if you have one, to make sure you get the new versions.
