Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closes #163 #193

Merged
merged 108 commits into from
Nov 8, 2017
Merged

Closes #163 #193

merged 108 commits into from
Nov 8, 2017

Conversation

deaktator
Copy link
Contributor

Summary

Adds multi-label model, a VW plugin for the multi-label model, a row creator for multi-label VW input. Adds and EitherAuditor and fixes stack overflows in tut when generating documentation.

Resolves

Bug Fixes/New Features

How to Verify

Look at the tests. I would also build locally (sbt clean test package makeMicrosite) to make sure nothing is left out.

Tests

  • aloha-core/com.eharmony.aloha.dataset.vw.multilabel
  • aloha-vw-jni/com.eharmony.aloha.models.vw.jni.multilabel

Code Reviewer(s)

@jmorra

deaktator and others added 30 commits August 30, 2017 12:32
…on't matter because there are no sub models or super models.
Copy link
Collaborator

@jmorra jmorra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really nice, left some comments.

val (missingAndErrs, features) = featuresFunction(a)

// TODO: Should this be sci.BitSet?
val positiveIndices: Set[Int] =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to consider BitSet but only if the performance improvement is drastic. If we care about performance shouldn't we consider not changing collection types? Isn't .toSeq an O(N) operation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.toSeq is on an option. It could be removed without effect due to the implicit function that changes options to iterables.


object VwMultilabelRowCreator {

/**
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love these descriptions.

*
* '''NOTE''': The trailing space should be here.
*/
private[this] val SharedFeatureIndicator = "shared" + " "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just write this as "shared "?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emphasis.

*
* The goal of this function is to try to use characters in literature used to denote a
* dependent variable. If that isn't possible (because the characters are already used by some
* other namespace, just find the first possible characters.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Close your parenthesis.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

private[multilabel] def nssToFirstCharBitSet(ss: Set[String]): sci.BitSet =
ss.collect { case s if s != "" =>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s.nonEmpty is more legible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requires class instantiation. Changing to case s if s.length != 0 => .


labelAndDummyLabelNss match {
case Some(LabelNamespaces(labelNs, _)) =>
// TODO: Should we remove this. If not, it must contain the --ring_size [training labels + 10].
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so.

val substr = str.substring(0, JsonErrStrLength)
s"JSON object expected. Found " + substr + (if (str.length != substr.length) " ..." else "")
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newline

labelsInTrainingSet: Vector[K],
labelsOfInterest: Option[String] = None) = {

// implicit val vecWriter = vectorFormat(DefaultJsonProtocol.lift(implicitly[JsonWriter[K]]))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead code.

}
}

object VwMultilabelParamAugmentationTest {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

ringSize.toInt
)
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newline

Copy link
Contributor

@ryan-deak-zefr ryan-deak-zefr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmorra. What do you think about this?

positiveLabelsFunction: GenAggFunc[A, sci.IndexedSeq[K]],
classNs: Char,
dummyClassNs: Char,
includeZeroValues: Boolean = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I like that. Let me think on that for a few and come up with something.

positiveLabelsFunction: GenAggFunc[A, sci.IndexedSeq[K]],
classNs: Char,
dummyClassNs: Char,
includeZeroValues: Boolean = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmorra I think the requirement that "negative labels added need to change with EVERY line" is important, but I don't think it fits into the current architecture. It seems like you are (at least implicitly) advocating two things:

  1. Intentionally making apply "more" impure, and definitely not idempotent or a function (in the math sense).
  2. Maintaining state inside VwMultilabelRowCreator (by virtue of (1))

What if we have a training set that is distributed across multiple nodes. We apply the VwMultilabelRowCreator to the elements on each node and combine the results? If we want to do what I think you are suggesting, I think we'll need to make some pretty big design changes. I suggest the following:

  1. We'll need a constructor seed parameter. This could be a value, but if so, the calling code would be responsible for varying the seed for each node on which the VwMultilabelRowCreator is run. It could also be an impure and side-effecting function is idempotent (in context), meaning it would produce the same result every time it is run on the same node. This would allow the seed generator parameter to be set once and called on each node to produce the seed for that node.
    seed: () => Long // Not necessarily a long but something to seed pseudo-randomness.
  2. We'll need a new interface like StatefulRowCreator. This interface will use something like the State monad in cats or scalaz. I would recommend stack safety because its implementation stack-safe (remember the tut bug in 0.5.3?).
  3. We'll have to use the StatefulRowCreator differently most likely, we could expose two methods, one that operates on single items and requires the caller the maintain the state and pass it in. The other method could operate on multiple elements (collections, iterators) and hide the state information.

What do you think about this stuff? I think if we try to jam this type of mutability into the current interface, it'll most likely produce disastrous results.

Copy link
Contributor Author

@deaktator deaktator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmorra I think this addresses all of your PR comments. I updated the downsampling to operate on 231 - 1 labels.

* Created by ryan.deak on 9/7/17.
*/
trait PluginInfo[K] {
def features: ListMap[String, Spec]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ❤️ the performance of get and apply on ListMap! Also, it provides a way to index on key-value pairs and maintain iteration order based on insertion order. It also allows us to read the features as a map in the JSON. For instance:

{
  "modelType": "multilabel-sparse",
  "modelId": { "id": 1, "name": "NONE" },
  "numMissingThreshold": 0,
  "features": {
    "feature": "1"                // <== JSON Map maintains order in code.
  },
  "labelsInTrainingSet": [4, 8],
  "underlying": {
    "type": "vw",
    "modelSource": { "model": "..." },
    "namespaces": { "X": [ "feature" ] }
  }
}

protected[this] case class Plugin(`type`: String)

/**
* Data for the
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* A sparse multi-label predictor takes:
*
- features
- labels for which a prediction should be produced
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will not be in the output but will be in LabelsAndInfo.labelsNotInTrainingSet returned by MultilabelModel.labelsForPrediction. This LabelsAndInfo instance is accepted by the error functions and success functions and these are reported in the Aloha audit trail.

* This definition is "lazy" because we can't guarantee that the underlying predictor is
* `Serializable` so we pass around a function that can be cached in a ''transient''
* `lazy val`. This function should however be `Serializable` and testing should be done
* to ensure that each predictor producer is `Serializable`.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VwSparseMultilabelPredictorTest.testSerializability

/**
* Created by ryan.deak on 9/6/17.
*/
trait RuntimeClasspathScanning {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I'm doing in that class. What do you mean?

private val SharedPrefix = "shared "

private val DummyLabels = List(
s"$NegDummyClass:$NegVal |y N",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's a good idea. This is testing the constants are the expected values. This is important, especially for the PosVal and NegVal. I will extract it though, like I did with the other variables.


val vwParams = vw.vw.params.fold("")(_.fold(_.mkString(" "), x => x)).trim
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One exists for com.eharmony.aloha.models.vw.jni.multilabel.VwMultlabelJsonCreator. Do you think that most of that code should be refactored to be placed in MultlabelJsonCreator and make it accept an implicit way of parameterizing it based on PredictorProducer?

// Wikipedia does a good job describing this in the proof by induction in the
// explanation for Algorithm R.
if (reservoirSwapInd < k)
reservoir(reservoirSwapInd) = (i - 1).toShort
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to remove "toShort"

Copy link
Contributor Author

@deaktator deaktator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about refactoring StatefulRowCreator to make it useful elsewhere?

* `(MissingAndErroneousFeatureInfo, Option[B])` along with the resulting
* state that is created in the process.
*/
def mapIteratorWithState(as: Iterator[A], state: S): Iterator[((MissingAndErroneousFeatureInfo, Option[B]), S)] = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmorra, do you think I should extract this to an object:

package com.eharmony.aloha.util

object MappingUtils {
  def mapIteratorWithState[A, B, S](as: Iterator[A], state: S)(f: (A, S) => (B, S)): Iterator[(B, S)] = {
    // implementation filled in
  }

  def mapSeqWithState[A, B, S](as: Seq[A], state: S)(f: (A, S) => (B, S)): (Vector[B], S) = {
    // implementation filled in
  }
}

And then have a trait:

trait MapWithState[A, B, S] { self: ((A, S) => (B, S)) => 
  def mapIteratorWithState(as: Iterator[A], state: S): Iterator[(B, S)] = 
    MappingUtils.mapIteratorWithState(as, state)(apply)

  def mapSeqWithState(as: Seq[A], state: S): (Vector[B], S) = 
    MappingUtils.mapSeqWithState(as, state)(apply)
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why you don't model both of these functions together with a common interface such as TraversableOnce.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmorra what do you think about something like this:

import scala.collection.SeqLike
import scala.collection.generic.CanBuildFrom
import scala.collection.breakOut
import scala.collection.{immutable => sci}

def statefulMap[A, B, S, Repr <: sci.Seq[A], That]
    (as: SeqLike[A, Repr], s: S)
    (f: (A, S) => (B, S))
    (implicit cbf: CanBuildFrom[Repr, (B, S), That]): That = {
  if (as.isEmpty)
    cbf().result()
  else {
    val initEl = f(as.head, s)
    as.tail.scanLeft(initEl){ case ((_, newS), a) => f(a, newS) }(breakOut)
  }
}

Then we can use any immutable sequence like you suggested and we can maintain the sequence's structure by not supplying a scala.collection.generic.CanBuildFrom or we can change the structure by supplying a CanBuildFrom. See below:

val sCycleModulo4 = Stream.iterate(0)(s => (s + 1) % 4)
val vCycleModulo4 = Vector.iterate(0, 10)(s => (s + 1) % 4)
val addAndIncrement = (a: Int, s: Int) => ((a + s).toDouble, s + 1)

val stream = statefulMap(sCycleModulo4, 0)(addAndIncrement)

val list: List[(Double, Int)] = 
  statefulMap(vCycleModulo4, 0)(addAndIncrement)(breakOut)

We also get safety guarantees for things that are not immutable (e.g. Array). For instance:

val aCycleModulo4 = Array.iterate(0, 10)(s => (s + 1) % 4)
statefulMap(aCycleModulo4, 0)(addAndIncrement)

Returns:

<console>:21: error: inferred type arguments [Int,Nothing,Int,Array[Int],Nothing] 
do not conform to method statefulMap's type parameter bounds 
[A,B,S,Repr <: scala.collection.immutable.Seq[A],That]

statefulMap(aCycleModulo4, 0)(addAndIncrement)
^

I chose sc.SeqLike and sci.Seq over TraversableLike and Traversable or GenTraversableLike and GenTraversable for a few reasons. The first is that there are no iteration order guarantees for Traversables so the function wouldn't be guaranteed to be idempotent (which I didn't like), whereas sequences do. The reason I chose Repr <: sci.Seq[A] is that I like the idea that of immutability. A third reason to choose immutable is that it precludes arrays. If we want arrays, there should be a separate method because we would have to do an O(N) array copy to get the tail and another O(N) copy to produce the result. This is both slow in terms of runtime and excessive in terms of memory usage.

Copy link
Collaborator

@jmorra jmorra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to approve this with some comments that may not require any code changes. Let me know

@@ -28,7 +28,8 @@ class ModelTypesTest {
"ModelDecisionTree",
"Regression",
"Segmentation",
"VwJNI"
"VwJNI",
"multilabel-sparse"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the casing different here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I want opinions on the name here. I am neither set on, nor particularly enamored with the name name "multilabel-sparse". Suggestions?

* `(MissingAndErroneousFeatureInfo, Option[B])` along with the resulting
* state that is created in the process.
*/
def mapIteratorWithState(as: Iterator[A], state: S): Iterator[((MissingAndErroneousFeatureInfo, Option[B]), S)] = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why you don't model both of these functions together with a common interface such as TraversableOnce.

* machines, the `seedCreator` should produce a unique value for each
* machine. If row creation is parallelized across machines and threads on
* each machine, the `seedCreator` should create unique values for each
* thread on each machine. Otherwise, randomness will be striped which
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finish your thought.

// array. Next allLabelsInTrainingSet.indices and the positiveIndices array could
// be iterated over simultaneously and if the current index in
// allLabelsInTrainingSet.indices isn't in the positiveIndices array, then it must
// be in the negativeIndices array. Then then offsets into the two arrays are
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then then

object VwDownsampledMultilabelRowCreator extends Rand {

// Expose initSeedScramble to companion class.
private def scramble(initSeed: Long): Long = initSeedScramble(initSeed)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why bother to have a function that is just a pass through?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs it because of the way that Scala does mix-in composition. If you removed it and tried to call initSeedScramble directly, the calling code wouldn't be able to access it. Rand could have been mixed into both the class and companion object but I didn't like that either.

else {
// Determine the weight for the downsampled negatives.
// If the cost of negative examples is positive, then the weight will be
// strictly greater than .
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

greater than ???

// This code is written with knowledge of the constant's value.
// TODO: Write calling code to abstract over costs so knowledge of the costs isn't necessary.
val negWt =
if (numDownsampledLabels == negLabels) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be == or <=?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

==. Since numDownsampledLabels == math.min(numNegLabelsTarget, negLabels), it is certain that numDownsampledLabels <= negLabels. We only don't downsample when numDownsampledLabels == negLabels.

// NOTE: This isn't idiomatic Scala code but it will operate in hot loops in minimizing
// object creation is important.

if (n <= k) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, this is very difficult to read, but I'm going to assume that you did your research here judging by the comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You did say you wanted "fast" and that hashing would be too slow. This is what happens when making deals with the devil. And the devil is always in the details. Muah ha ha ha!


val DefaultNumLabels = 0
val DefaultRingSize =
s"--ring_size ${DefaultNumLabels + VwSparseMultilabelPredictor.AddlVwRingSize}"
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newline


import VwMultilabelDownsampledModelTest.EndToEndDatasetSpec

@Test def test(): Unit = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a test for the actual VW input. Is that here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. Sort of. The VW input is generated implicitly and the prediction output is produced and tested against the marginal probabilities. So, it works implicitly.

@deaktator deaktator merged commit c4c5502 into development Nov 8, 2017
@deaktator deaktator deleted the 163-multilabel branch November 8, 2017 22:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants