Compute histograms for numeric attributes #3

joaquinvanschoren · 2018-01-29T17:04:10Z

Currently the website shows a box plot for numeric attributes. This does not always look good, and it hides a lot of information about the shape of the distribution.

It would be better to store a histogram of the distribution, which can be computed beforehand.
For example, something like this: https://www.mathworks.com/help/examples/matlab/win64/AdjustHistogramPropertiesExample_01.png

For categorical targets we could also compute it per class value: https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2014/03/histograms.png

Looking at the code, we could extend models.AttributeStatistics with a new function that returns something of the form
[[b1,b2,b3],[123],[234],[354]], where b1, b2, b3 are the bucket values and each following entry holds the count for one bucket.

For categorical targets, we could compute something like
[[b1,b2,b3],[123,12,23],[234,23,34],[354,34,45]] for a 3-class dataset.
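A minimal sketch of how such a structure could be computed, written in Python with NumPy purely for illustration. The function name `attribute_histogram`, the use of left bin edges as the "bucket values", the equal-width binning, and the reading of each inner list as one count per class are all assumptions, not existing code in models.AttributeStatistics.

```python
import numpy as np


def attribute_histogram(values, n_bins=3, targets=None):
    """Hypothetical helper: returns [[b1, ..., bk], counts_1, ..., counts_k].

    Assumptions: the bucket values b_i are the left edges of equal-width
    bins, and counts_i lists one count per class (a single count when no
    targets are given).
    """
    values = np.asarray(values, dtype=float)
    keep = ~np.isnan(values)  # ignore missing values
    edges = np.histogram_bin_edges(values[keep], bins=n_bins)

    if targets is None:
        counts, _ = np.histogram(values[keep], bins=edges)
        rows = [[int(c)] for c in counts]
    else:
        targets = np.asarray(targets)
        classes = sorted(set(targets[keep].tolist()))
        # One histogram per class, all sharing the same bin edges.
        per_class = [
            np.histogram(values[keep & (targets == c)], bins=edges)[0]
            for c in classes
        ]
        # Transpose so that each row describes one bucket.
        rows = [[int(col[i]) for col in per_class] for i in range(n_bins)]

    return [[float(e) for e in edges[:-1]]] + rows
```

Sharing one set of bin edges across classes keeps the per-class counts directly comparable, and the counts within a bucket sum to the count of the unconditional histogram.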

What do you think would be the best way to implement this?

janvanrijn · 2018-10-02T10:07:41Z

If you really want to take this to the next level, please consider a CDF (cumulative distribution function) rather than a histogram.

https://www.andata.at/en/software-blog-reader/why-we-love-the-cdf-and-do-not-like-histograms-that-much.html
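To make the comparison concrete, here is a rough sketch of a stored CDF summary, again in Python with NumPy as an illustration only. The function name `empirical_cdf`, the choice of evenly spaced probabilities, and the two-list output format are assumptions. Unlike a histogram, this summary needs no bin width or bin edges.

```python
import numpy as np


def empirical_cdf(values, n_points=5):
    """Hypothetical helper: returns [[x_1, ..., x_n], [p_1, ..., p_n]].

    Each x_i is the empirical quantile at probability p_i, so the pairs
    (x_i, p_i) trace the cumulative distribution function.
    """
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing values
    probs = np.linspace(0.0, 1.0, n_points)
    xs = np.quantile(values, probs)  # linear interpolation by default
    return [[float(x) for x in xs], [float(p) for p in probs]]
```

Plotting the second list against the first gives a step-free approximation of the CDF, and a larger `n_points` gives a finer curve.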

There are two ways to implement this: (1) at the Evaluation Engine level, or (2) at the ES level.

My preference goes to (2), and I can also add a reason when I have a bit more time, but what is your opinion and why?
