Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binary Classification Confusion Matrix and AUC Aggregators #633

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

richwhitjr
Copy link

Some of this works is derived from similar functions in the Spark library. For Binary Classifications tasks it will compute a confusion matrix. You can also aggregator over these confusions matrices with different thresholds to compute an Area under the Curve Stat for both PR and ROC.

@codecov-io
Copy link

codecov-io commented May 30, 2017

Codecov Report

Merging #633 into develop will increase coverage by 0.25%.
The diff coverage is 92.15%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #633      +/-   ##
===========================================
+ Coverage       82%   82.25%   +0.25%     
===========================================
  Files          111      113       +2     
  Lines         5156     5207      +51     
  Branches       457      479      +22     
===========================================
+ Hits          4228     4283      +55     
+ Misses         928      924       -4
Impacted Files Coverage Δ
...com/twitter/algebird/BinaryClassificationAUC.scala 91.66% <91.66%> (ø)
...algebird/BinaryClassificationConfusionMatrix.scala 92.59% <92.59%> (ø)
...n/scala/com/twitter/algebird/SuccessibleLaws.scala 85.71% <0%> (-7.15%) ⬇️
...witter/algebird/util/summer/AsyncListMMapSum.scala 96.15% <0%> (-3.85%) ⬇️
.../main/scala/com/twitter/algebird/BloomFilter.scala 94.32% <0%> (-0.44%) ⬇️
.../main/scala/com/twitter/algebird/HyperLogLog.scala 92.4% <0%> (-0.4%) ⬇️
.../main/scala/com/twitter/algebird/Approximate.scala 90.32% <0%> (+1.61%) ⬆️
...ain/scala/com/twitter/algebird/AdaptiveCache.scala 77.14% <0%> (+5.71%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aa98422...4aaeb1c. Read the comment docs.

@sritchie-stripe
Copy link
Contributor

Hey @richwhitjr, thank you for this! We should have some time to look today.

*/
object AreaUnderCurve {
private def trapezoid(points: Seq[(Double, Double)]): Double = {
require(points.length == 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want to take the two points directly instead of using this assertion, head, last combo?

* with different thresholds
*/
case object CurveMonoid extends Monoid[Curve] {
def zero = Curve(Nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looked like we don't have a test yet that exercises the zero.

*/
case object CurveMonoid extends Monoid[Curve] {
def zero = Curve(Nil)
override def plus(left: Curve, right: Curve): Curve = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a sumOption implementation that would be more efficient that we should add here as well, for adding multiple curves at once?

extends Aggregator[BinaryPrediction, Curve, Double]
with Serializable {

private def linspace(a: Double, b: Double, length: Int = 100): Array[Double] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we never call this without default args - should we remove the default 100?

}

property("Curve is associative") {
isAssociative[Curve]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use the monoid laws?

*
* @param matrices List of Matrices
*/
case class Curve(matrices: List[ConfusionMatrix])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you add an object with an implicit def monoid: Monoid[Curve] = CurveMonoid you won't need to declare the same in the tests.

@sritchie-stripe
Copy link
Contributor

@richwhitjr this is getting there - I think it's looking good.

One ask that we have for all new data structures is that you add a section with a few examples to the documentation site.

here are the instructions on how to do this: https://github.com/twitter/algebird/blob/develop/CONTRIBUTING.md#contributing-documentation

All data structure examples are compiled by CI, so it's extremely helpful to get a mini tutorial in at the same time the data structure is merged.

here are a few examples of existing pages:

@richwhitjr
Copy link
Author

I looked at adding sumOption but couldn't think of a way to make it more efficient without a lot of complexity and not sure if it is entirely worth it. It would help with object creation but on the other side we would have to aggregate the confusion matrix variables outside into some type of dictionary and reconstruct it at the end.

@CLAassistant
Copy link

CLAassistant commented Jul 18, 2019

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Richard Whitcomb seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants