-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closes #163 #193
Closes #163 #193
Conversation
…on't matter because there are no sub models or super models.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really nice, left some comments.
val (missingAndErrs, features) = featuresFunction(a) | ||
|
||
// TODO: Should this be sci.BitSet? | ||
val positiveIndices: Set[Int] = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to consider BitSet
but only if the performance improvement is drastic. If we care about performance shouldn't we consider not changing collection types? Isn't .toSeq
an O(N) operation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.toSeq
is on an option. It could be removed without effect due to the implicit function that changes options to iterables.
|
||
object VwMultilabelRowCreator { | ||
|
||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love these descriptions.
* | ||
* '''NOTE''': The trailing space should be here. | ||
*/ | ||
private[this] val SharedFeatureIndicator = "shared" + " " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just write this as "shared "
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Emphasis.
* | ||
* The goal of this function is to try to use characters in literature used to denote a | ||
* dependent variable. If that isn't possible (because the characters are already used by some | ||
* other namespace, just find the first possible characters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Close your parenthesis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
|
||
private[multilabel] def nssToFirstCharBitSet(ss: Set[String]): sci.BitSet = | ||
ss.collect { case s if s != "" => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s.nonEmpty
is more legible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requires class instantiation. Changing to case s if s.length != 0 =>
.
|
||
labelAndDummyLabelNss match { | ||
case Some(LabelNamespaces(labelNs, _)) => | ||
// TODO: Should we remove this. If not, it must contain the --ring_size [training labels + 10]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so.
val substr = str.substring(0, JsonErrStrLength) | ||
s"JSON object expected. Found " + substr + (if (str.length != substr.length) " ..." else "") | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
newline
labelsInTrainingSet: Vector[K], | ||
labelsOfInterest: Option[String] = None) = { | ||
|
||
// implicit val vecWriter = vectorFormat(DefaultJsonProtocol.lift(implicitly[JsonWriter[K]])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dead code.
} | ||
} | ||
|
||
object VwMultilabelParamAugmentationTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove
ringSize.toInt | ||
) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
newline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmorra. What do you think about this?
positiveLabelsFunction: GenAggFunc[A, sci.IndexedSeq[K]], | ||
classNs: Char, | ||
dummyClassNs: Char, | ||
includeZeroValues: Boolean = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I like that. Let me think on that for a few and come up with something.
positiveLabelsFunction: GenAggFunc[A, sci.IndexedSeq[K]], | ||
classNs: Char, | ||
dummyClassNs: Char, | ||
includeZeroValues: Boolean = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmorra I think the requirement that "negative labels added need to change with EVERY line" is important, but I don't think it fits into the current architecture. It seems like you are (at least implicitly) advocating two things:
- Intentionally making
apply
"more" impure, and definitely not idempotent or a function (in the math sense). - Maintaining state inside
VwMultilabelRowCreator
(by virtue of (1))
What if we have a training set that is distributed across multiple nodes. We apply the VwMultilabelRowCreator
to the elements on each node and combine the results? If we want to do what I think you are suggesting, I think we'll need to make some pretty big design changes. I suggest the following:
- We'll need a constructor
seed
parameter. This could be a value, but if so, the calling code would be responsible for varying the seed for each node on which theVwMultilabelRowCreator
is run. It could also be an impure and side-effecting function is idempotent (in context), meaning it would produce the same result every time it is run on the same node. This would allow the seed generator parameter to be set once and called on each node to produce the seed for that node.seed: () => Long // Not necessarily a long but something to seed pseudo-randomness.
- We'll need a new interface like
StatefulRowCreator
. This interface will use something like theState
monad in cats or scalaz. I would recommend stack safety because its implementation stack-safe (remember thetut
bug in 0.5.3?). - We'll have to use the
StatefulRowCreator
differently most likely, we could expose two methods, one that operates on single items and requires the caller the maintain the state and pass it in. The other method could operate on multiple elements (collections, iterators) and hide the state information.
What do you think about this stuff? I think if we try to jam this type of mutability into the current interface, it'll most likely produce disastrous results.
…ed StatefulRowCreator API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmorra I think this addresses all of your PR comments. I updated the downsampling to operate on 231 - 1 labels.
* Created by ryan.deak on 9/7/17. | ||
*/ | ||
trait PluginInfo[K] { | ||
def features: ListMap[String, Spec] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ❤️ the performance of get
and apply
on ListMap
! Also, it provides a way to index on key-value pairs and maintain iteration order based on insertion order. It also allows us to read the features as a map in the JSON. For instance:
{
"modelType": "multilabel-sparse",
"modelId": { "id": 1, "name": "NONE" },
"numMissingThreshold": 0,
"features": {
"feature": "1" // <== JSON Map maintains order in code.
},
"labelsInTrainingSet": [4, 8],
"underlying": {
"type": "vw",
"modelSource": { "model": "..." },
"namespaces": { "X": [ "feature" ] }
}
}
protected[this] case class Plugin(`type`: String) | ||
|
||
/** | ||
* Data for the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
* A sparse multi-label predictor takes: | ||
* | ||
- features | ||
- labels for which a prediction should be produced |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will not be in the output but will be in LabelsAndInfo.labelsNotInTrainingSet
returned by MultilabelModel.labelsForPrediction
. This LabelsAndInfo
instance is accepted by the error functions and success functions and these are reported in the Aloha audit trail.
* This definition is "lazy" because we can't guarantee that the underlying predictor is | ||
* `Serializable` so we pass around a function that can be cached in a ''transient'' | ||
* `lazy val`. This function should however be `Serializable` and testing should be done | ||
* to ensure that each predictor producer is `Serializable`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VwSparseMultilabelPredictorTest.testSerializability
/** | ||
* Created by ryan.deak on 9/6/17. | ||
*/ | ||
trait RuntimeClasspathScanning { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I'm doing in that class. What do you mean?
private val SharedPrefix = "shared " | ||
|
||
private val DummyLabels = List( | ||
s"$NegDummyClass:$NegVal |y N", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's a good idea. This is testing the constants are the expected values. This is important, especially for the PosVal
and NegVal
. I will extract it though, like I did with the other variables.
|
||
val vwParams = vw.vw.params.fold("")(_.fold(_.mkString(" "), x => x)).trim |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One exists for com.eharmony.aloha.models.vw.jni.multilabel.VwMultlabelJsonCreator. Do you think that most of that code should be refactored to be placed in MultlabelJsonCreator and make it accept an implicit way of parameterizing it based on PredictorProducer?
// Wikipedia does a good job describing this in the proof by induction in the | ||
// explanation for Algorithm R. | ||
if (reservoirSwapInd < k) | ||
reservoir(reservoirSwapInd) = (i - 1).toShort |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to remove "toShort"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about refactoring StatefulRowCreator
to make it useful elsewhere?
* `(MissingAndErroneousFeatureInfo, Option[B])` along with the resulting | ||
* state that is created in the process. | ||
*/ | ||
def mapIteratorWithState(as: Iterator[A], state: S): Iterator[((MissingAndErroneousFeatureInfo, Option[B]), S)] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmorra, do you think I should extract this to an object:
package com.eharmony.aloha.util
object MappingUtils {
def mapIteratorWithState[A, B, S](as: Iterator[A], state: S)(f: (A, S) => (B, S)): Iterator[(B, S)] = {
// implementation filled in
}
def mapSeqWithState[A, B, S](as: Seq[A], state: S)(f: (A, S) => (B, S)): (Vector[B], S) = {
// implementation filled in
}
}
And then have a trait:
trait MapWithState[A, B, S] { self: ((A, S) => (B, S)) =>
def mapIteratorWithState(as: Iterator[A], state: S): Iterator[(B, S)] =
MappingUtils.mapIteratorWithState(as, state)(apply)
def mapSeqWithState(as: Seq[A], state: S): (Vector[B], S) =
MappingUtils.mapSeqWithState(as, state)(apply)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why you don't model both of these functions together with a common interface such as TraversableOnce
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmorra what do you think about something like this:
import scala.collection.SeqLike
import scala.collection.generic.CanBuildFrom
import scala.collection.breakOut
import scala.collection.{immutable => sci}
def statefulMap[A, B, S, Repr <: sci.Seq[A], That]
(as: SeqLike[A, Repr], s: S)
(f: (A, S) => (B, S))
(implicit cbf: CanBuildFrom[Repr, (B, S), That]): That = {
if (as.isEmpty)
cbf().result()
else {
val initEl = f(as.head, s)
as.tail.scanLeft(initEl){ case ((_, newS), a) => f(a, newS) }(breakOut)
}
}
Then we can use any immutable sequence like you suggested and we can maintain the sequence's structure by not supplying a scala.collection.generic.CanBuildFrom
or we can change the structure by supplying a CanBuildFrom
. See below:
val sCycleModulo4 = Stream.iterate(0)(s => (s + 1) % 4)
val vCycleModulo4 = Vector.iterate(0, 10)(s => (s + 1) % 4)
val addAndIncrement = (a: Int, s: Int) => ((a + s).toDouble, s + 1)
val stream = statefulMap(sCycleModulo4, 0)(addAndIncrement)
val list: List[(Double, Int)] =
statefulMap(vCycleModulo4, 0)(addAndIncrement)(breakOut)
We also get safety guarantees for things that are not immutable (e.g. Array
). For instance:
val aCycleModulo4 = Array.iterate(0, 10)(s => (s + 1) % 4)
statefulMap(aCycleModulo4, 0)(addAndIncrement)
Returns:
<console>:21: error: inferred type arguments [Int,Nothing,Int,Array[Int],Nothing]
do not conform to method statefulMap's type parameter bounds
[A,B,S,Repr <: scala.collection.immutable.Seq[A],That]
statefulMap(aCycleModulo4, 0)(addAndIncrement)
^
I chose sc.SeqLike
and sci.Seq
over TraversableLike
and Traversable
or GenTraversableLike
and GenTraversable
for a few reasons. The first is that there are no iteration order guarantees for Traversable
s so the function wouldn't be guaranteed to be idempotent (which I didn't like), whereas sequences do. The reason I chose Repr <: sci.Seq[A]
is that I like the idea that of immutability. A third reason to choose immutable is that it precludes arrays. If we want arrays, there should be a separate method because we would have to do an O(N) array copy to get the tail and another O(N) copy to produce the result. This is both slow in terms of runtime and excessive in terms of memory usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to approve this with some comments that may not require any code changes. Let me know
@@ -28,7 +28,8 @@ class ModelTypesTest { | |||
"ModelDecisionTree", | |||
"Regression", | |||
"Segmentation", | |||
"VwJNI" | |||
"VwJNI", | |||
"multilabel-sparse" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the casing different here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I want opinions on the name here. I am neither set on, nor particularly enamored with the name name "multilabel-sparse"
. Suggestions?
* `(MissingAndErroneousFeatureInfo, Option[B])` along with the resulting | ||
* state that is created in the process. | ||
*/ | ||
def mapIteratorWithState(as: Iterator[A], state: S): Iterator[((MissingAndErroneousFeatureInfo, Option[B]), S)] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why you don't model both of these functions together with a common interface such as TraversableOnce
.
* machines, the `seedCreator` should produce a unique value for each | ||
* machine. If row creation is parallelized across machines and threads on | ||
* each machine, the `seedCreator` should create unique values for each | ||
* thread on each machine. Otherwise, randomness will be striped which |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finish your thought.
// array. Next allLabelsInTrainingSet.indices and the positiveIndices array could | ||
// be iterated over simultaneously and if the current index in | ||
// allLabelsInTrainingSet.indices isn't in the positiveIndices array, then it must | ||
// be in the negativeIndices array. Then then offsets into the two arrays are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then then
object VwDownsampledMultilabelRowCreator extends Rand { | ||
|
||
// Expose initSeedScramble to companion class. | ||
private def scramble(initSeed: Long): Long = initSeedScramble(initSeed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why bother to have a function that is just a pass through?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It needs it because of the way that Scala does mix-in composition. If you removed it and tried to call initSeedScramble
directly, the calling code wouldn't be able to access it. Rand
could have been mixed into both the class and companion object but I didn't like that either.
else { | ||
// Determine the weight for the downsampled negatives. | ||
// If the cost of negative examples is positive, then the weight will be | ||
// strictly greater than . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
greater than ???
// This code is written with knowledge of the constant's value. | ||
// TODO: Write calling code to abstract over costs so knowledge of the costs isn't necessary. | ||
val negWt = | ||
if (numDownsampledLabels == negLabels) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be ==
or <=
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
==
. Since numDownsampledLabels == math.min(numNegLabelsTarget, negLabels)
, it is certain that numDownsampledLabels <= negLabels
. We only don't downsample when numDownsampledLabels == negLabels
.
// NOTE: This isn't idiomatic Scala code but it will operate in hot loops in minimizing | ||
// object creation is important. | ||
|
||
if (n <= k) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, this is very difficult to read, but I'm going to assume that you did your research here judging by the comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You did say you wanted "fast" and that hashing would be too slow. This is what happens when making deals with the devil. And the devil is always in the details. Muah ha ha ha!
|
||
val DefaultNumLabels = 0 | ||
val DefaultRingSize = | ||
s"--ring_size ${DefaultNumLabels + VwSparseMultilabelPredictor.AddlVwRingSize}" | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
newline
|
||
import VwMultilabelDownsampledModelTest.EndToEndDatasetSpec | ||
|
||
@Test def test(): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a test for the actual VW input. Is that here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. Sort of. The VW input is generated implicitly and the prediction output is produced and tested against the marginal probabilities. So, it works implicitly.
…ltilabel'. Generalized StatefulRowCreator.
Summary
Adds multi-label model, a VW plugin for the multi-label model, a row creator for multi-label VW input. Adds and
EitherAuditor
and fixes stack overflows intut
when generating documentation.Resolves
Bug Fixes/New Features
How to Verify
Look at the tests. I would also build locally (
sbt clean test package makeMicrosite
) to make sure nothing is left out.Tests
aloha-core/com.eharmony.aloha.dataset.vw.multilabel
aloha-vw-jni/com.eharmony.aloha.models.vw.jni.multilabel
Code Reviewer(s)
@jmorra