Support multilabel problems #163
**Output Types**

There are some things here that need to be fleshed out regarding the output type. The output type should be a Map-like structure, so we have to determine the output boundary. Who's going to be responsible for mapping the ID space to the class (class in the machine learning sense) instances? If this is going to be Aloha's responsibility and the natural output type is `Map[K, Double]`, then we'll need an `Auditor[U, Map[K, Double], B]`. This works fine for things like `Auditor[Option[_], Map[K, Double], Option[Map[K, Double]]]`, but we will need to update `AvroScoreAuditor` and the protobuf `ScoreAuditor`.

**Parallel Arrays output**

If we make the output type something like `case class ClassInfo(id: Long, name: String, description: String)`, we will have problems since libraries like Avro would need this type encoded in the protocol, but the protocol wouldn't know anything about a user's problem-specific codomain type. So what do we do here?
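To make the auditor shapes above concrete, here's a dependency-free sketch. This is not Aloha's actual `Auditor` API (for instance, the real `success` also takes a model ID); it's a simplified stand-in that just shows how the `U` / natural-type / `B <: U` parameters line up for the `Option`-based case discussed above:

```scala
// Simplified stand-in for Aloha's Auditor: U is the upper type bound of the
// audited output, N is the model's natural output type, and B <: U is the
// concrete audited output type.
trait Auditor[U, N, B <: U] {
  def success(natural: N): B
  def failure(): B
}

// An Option-based auditor for a multilabel natural type Map[K, Double],
// mirroring Auditor[Option[_], Map[K, Double], Option[Map[K, Double]]].
final class OptionAuditor[K]
  extends Auditor[Option[_], Map[K, Double], Option[Map[K, Double]]] {
  def success(natural: Map[K, Double]): Option[Map[K, Double]] = Option(natural)
  def failure(): Option[Map[K, Double]] = None
}

val aud = new OptionAuditor[String]
val out = aud.success(Map("sports" -> 0.9, "movies" -> 0.1))
```

The point of the `B <: U` bound is that auditors for serialization formats (Avro, protobuf) can fix `U` to their own record type without knowing the user's label type `K` up front.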
The parallel arrays approach would return the input from the user with the corresponding values?
After speaking about this yesterday, I think my original comment is probably not true and we might want to support LDF right out of the box based on PR #105.
This is a little tricky. Here's a basic skeleton of the class and companion. A few things to notice:

```scala
case class MultilabelModel[U, K, -A, +B <: U](
  modelId: ModelIdentity,
  labels: collection.immutable.IndexedSeq[K],
  learnerProducer: () => (Seq[(String, Double)]) => Map[Int, Double],
  auditor: Auditor[U, Map[K, Double], B]
) extends SubmodelBase[U, Map[K, Double], A, B] {

  @transient private[this] lazy val learner = learnerProducer()

  {
    // Force creation eagerly somehow.
    learner
  }

  override def subvalue(a: A): Subvalue[B, Map[K, Double]] = {
    val features: Seq[(String, Double)] = getFeatures(a)
    // Might need another param to specify which labels / predictions to output.
    val unmappedValues = learner(features)
    val natural = unmappedValues.map { case (k, v) => labels(k) -> v }
    // Obviously need to handle any failure cases too.
    val aud = auditor.success(modelId, natural)
    Subvalue(aud, Option(natural))
  }
}

object MultilabelModel extends ParserProviderCompanion {

  object Parser extends ModelSubmodelParsingPlugin {
    override val modelType: String = "multilabel"

    override def modelJsonReader[U, N, A, B <: U](
      factory: SubmodelFactory[U, A],
      semantics: Semantics[A],
      auditor: Auditor[U, N, B]
    )(implicit r: RefInfo[N], jf: JsonFormat[N]): Option[JsonReader[Model[A, B]]] = {
      if (!RefInfoOps.isSubType[N, Map[_, Double]])
        None
      else {
        // Here's the necessary reflection and JSON information needed to proceed.
        val refInfoK = RefInfoOps.typeParams(r).head
        // Returns an Option[JsonFormat[Any]].
        val jsonFormatK = factory.jsonFormat(refInfoK)
        ???
      }
    }
  }

  override def parser: ModelParser = Parser
}
```
Given the `modelJsonReader(...)(implicit r: RefInfo[N])` signature, we can use `r.typeArguments.head` to get the type information for the key type. This should give us enough type information from the Scala perspective, but we still need a JSON reader for the label type.
@amir-ziai-zefr Does it make sense why we need a `spray.json.JsonReader[K]`? Otherwise, we won't be able to read the labels needed by the model.
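To illustrate why a reader for the label type `K` is needed, here is a dependency-free sketch. The `Reader` typeclass below is a hypothetical stand-in for `spray.json.JsonReader`, and `readLabels` is a made-up helper; the point is only that a parser generic in `K` can't turn raw JSON tokens into labels unless a reader for the concrete `K` is supplied implicitly:

```scala
// Hypothetical stand-in for spray.json.JsonReader[K]: turns a raw JSON
// scalar (modeled here as a String token) into a value of label type K.
trait Reader[K] { def read(raw: String): K }

object Reader {
  implicit val stringReader: Reader[String] = new Reader[String] {
    def read(raw: String): String = raw
  }
  implicit val intReader: Reader[Int] = new Reader[Int] {
    def read(raw: String): Int = raw.toInt
  }
}

// The model parser doesn't know K concretely; it can only parse a "labels"
// field if a Reader[K] is available, which mirrors why the Aloha parser
// needs a JsonReader[K] to read the model's labels.
def readLabels[K](rawLabels: Seq[String])(implicit r: Reader[K]): IndexedSeq[K] =
  rawLabels.map(r.read).toIndexedSeq

val intLabels = readLabels[Int](Seq("1", "2", "3"))
val strLabels = readLabels[String](Seq("sports", "movies"))
```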
@deaktator we'll get a
Here's the model spec that @deaktator proposed
@deaktator: I think maybe we should support exactly one of `Label => Seq[Iterable[(String, Double)]]` OR `Label => Seq[Double]`
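One way to enforce "exactly one of" the two featurization shapes is a sealed ADT. This is a hypothetical sketch (the names below are made up and the sparse variant is simplified to `Label => Seq[(String, Double)]`), not the Aloha API:

```scala
// Hypothetical sketch: exactly one featurization style per model.
sealed trait LabelFeaturizer[Label]

// Sparse: each label maps to named feature/value pairs (SVM-light style).
final case class SparseFeaturizer[Label](
  f: Label => Seq[(String, Double)]
) extends LabelFeaturizer[Label]

// Dense: each label maps to a fixed-length vector of values.
final case class DenseFeaturizer[Label](
  f: Label => Seq[Double]
) extends LabelFeaturizer[Label]

// Exhaustive match: the compiler knows these are the only two cases.
def featureCount[Label](lf: LabelFeaturizer[Label], label: Label): Int = lf match {
  case SparseFeaturizer(f) => f(label).size
  case DenseFeaturizer(f)  => f(label).size
}

val sparse = SparseFeaturizer[String](l => Seq(s"title=$l" -> 1.0))
val dense  = DenseFeaturizer[String](_ => Seq(0.0, 1.0, 0.0))
```

Because the trait is sealed, a model spec can carry one `LabelFeaturizer[Label]` and can never supply both shapes at once.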
After some thinking I think there's a definite path forward. There will be three stages.
**Multi-label Base Class**

The skeleton for the base class is above. We need to nail down whether we want to support dense and sparse features. VW and libraries supporting SVM-light-like input are obviously sparse. H2O is dense. I assume most deep learning libraries will also be dense. (Let's figure this out ASAP.) The implementations provided to the base class parser should adhere to a plugin-based architecture. This can be done via class path scanning using Reflections. We can test the functionality of the class and companion object separately. This class will be easier to test, I suspect.

**Implementations (VW CSOAA and CSOAA LDF)**

Once we nail down the details related to the plugin architecture, I think CSOAA should be fairly straightforward because we know the number of classes. I am actually more concerned about CSOAA LDF. How do we create the mapping from the ID space to the labels?

**Auditors and RefInfoToJsonFormat**

I think these are pretty easy. We'll need to define a `RefInfoToJsonFormat`.
One other question: if we decide to allow dense features, how do we deal with missing feature values? We can't simply flat map them out because we need the indices to be consistent. We could leave the values as zeroes, but that seems like a bad idea too.
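To make the trade-off concrete, here is a hypothetical sketch (not a proposal for the final API): extracted dense features are represented as `Option[Double]` so missing values stay distinguishable from true zeroes, and the imputation policy is explicit and index-preserving:

```scala
// Hypothetical sketch: dense features with explicit missing values.
// flatMap-ing the Nones out would shift indices; imputation preserves them.
def impute(features: IndexedSeq[Option[Double]], missingValue: Double): IndexedSeq[Double] =
  features.map(_.getOrElse(missingValue))

val extracted: IndexedSeq[Option[Double]] =
  IndexedSeq(Some(1.5), None, Some(0.0), None)

// Index 2 is a real zero; indices 1 and 3 are missing. With zero-imputation
// the two cases become indistinguishable, which is the concern raised above.
val zeroImputed = impute(extracted, 0.0)

// Mean-imputation over the present values keeps them distinguishable here.
val meanImputed = {
  val present = extracted.flatten
  impute(extracted, present.sum / present.size)
}
```

Either way the output length equals the input length, so feature indices stay aligned with the model's expectations.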
@deaktator Regarding sparsity, are you concerned with features or labels? My understanding of TensorFlow (and by extension Keras) is that sparse labels are not supported but sparse features (sparse tensors) are supported. The same is probably true for other deep learning libraries.
@amir-ziai-zefr Concerned with sparse features.
I was having trouble thinking about how to do label-dependent features. @amir-ziai-zefr suggested passing the label-dependent features to the model as a `Map[K, Option[com.eharmony.aloha.dataset.density.Sparse]]`, or rather a

```
Map[K, Option[sci.IndexedSeq[com.eharmony.aloha.dataset.density.Sparse]]]
```

I think this is pretty slick as it avoids the problem of having to construct functions whose domain is the label type. For example:

```json
{
  "modelType": "multilabel-sparse",
  "modelId": { "id": 0, "name": "my multilabel model" },
  "other_stuff": "...",
  "labels": [
    {
      "label": {
        "id": 1,
        "title": "sports",
        "desc": "sports videos for sports"
      },
      "label_dep_features": [
        ["desc=sports", 2],
        ["desc=for", 1],
        ["desc=videos", 1]
      ]
    },
    {
      "label": {
        "id": 2,
        "title": "movies"
      },
      "label_dep_features": [
        ["movies", 1]
      ]
    },
    {
      "label": {
        "id": 3,
        "title": "misc",
        "desc": "doesn't have label-dependent features"
      }
    }
  ]
}
```

This is much easier from a coding perspective than trying to produce a `Seq[GenAggFunc[K, Sparse]]`, and it is much less error prone. But it requires label-dependent featurization to be done ahead of time and copied into the Aloha model specification. What does everyone think?
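As a sanity check on the proposed structure, here's a small dependency-free sketch of the in-memory shape that JSON would deserialize into. The `Label` case class is hypothetical, and a plain `(String, Double)` tuple stands in for `com.eharmony.aloha.dataset.density.Sparse`:

```scala
import scala.collection.{immutable => sci}

// Hypothetical stand-ins for the real Aloha types.
type Sparse = (String, Double)
final case class Label(id: Long, title: String, desc: Option[String])

// The proposed shape: label-dependent features are precomputed and attached
// to each label; None means the label has no label-dependent features.
val labelDepFeatures: Map[Label, Option[sci.IndexedSeq[Sparse]]] = Map(
  Label(1, "sports", Some("sports videos for sports")) ->
    Some(sci.IndexedSeq("desc=sports" -> 2.0, "desc=for" -> 1.0, "desc=videos" -> 1.0)),
  Label(2, "movies", None) ->
    Some(sci.IndexedSeq("movies" -> 1.0)),
  Label(3, "misc", Some("doesn't have label-dependent features")) ->
    None
)
```

Note that the model never needs a function from `K` to features at score time; it just looks up the precomputed entries.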
This seems like an OK thing to do. I'm not convinced I understand all the ramifications at this time. My biggest concern is that this looks burdensome for the model creators (data scientists). Can we avoid placing a large burden on model creators?
@jon-morra-zefr This should be generated programmatically and probably changes very infrequently. The alternative is computationally heavy and leaves a lot of room for things to go wrong, from what I understand.
If it's generated programmatically then it shouldn't be a burden for the DS in which case I don't see a problem with it. |
After talking to Jon, the first stab at this will not have label dependent features. I'll leave some comments on some approaches that could be used to implement LDF in future versions. |
It has become clear that Aloha needs to support true multilabel problems. Before any library is integrated we need to decide upon the interfaces to expose to get this work done. This ticket is the discussion area for those interfaces. To get the ball rolling, here is a preliminary suggestion.

The input for multilabel problems will be a set of features and an N-dimensional binary vector indicating membership. In order to facilitate easier guarantees, I think that the number of labels should be an input to the spec at train time. There should then be a `labels` field which contains a function (`A => Iterable[Int]`) that produces the indices of the labels to be activated. This is clearly a sparse format. The featurization should fail if an index is greater than the number of labels.

At score time, a multilabel model should have a field called `labels` as well that is the same function mentioned above. This function will be used to identify which labels to score. The output of the model should be a `Map[Int, Double]` that indicates the scores of all the labels that are passed in.

At this time I would NOT support any featurization of labels, such as label-dependent features.
This is meant to get the discussion going and not a requirement.
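The preliminary suggestion above can be sketched as follows. All names here are hypothetical (this is plain Scala, not the eventual Aloha API): a fixed label count, a `labels` function of type `A => Iterable[Int]` selecting which labels to score, failure on out-of-range indices, and a `Map[Int, Double]` output:

```scala
// Hypothetical sketch of the proposed score-time interface.
final case class MultilabelScorer[A](
  numLabels: Int,
  labels: A => Iterable[Int],   // indices of the labels to score
  score: (A, Int) => Double     // per-label scoring function
) {
  // Returns None (a failure) if any requested index is out of range,
  // matching "the featurization should fail if an index is greater than
  // the number of labels".
  def apply(a: A): Option[Map[Int, Double]] = {
    val idx = labels(a).toSeq
    if (idx.exists(i => i < 0 || i >= numLabels)) None
    else Some(idx.map(i => i -> score(a, i)).toMap)
  }
}

val scorer = MultilabelScorer[String](
  numLabels = 4,
  labels = s => s.split(",").map(_.trim.toInt).toSeq,
  score = (_, i) => i.toDouble / 10
)

val ok  = scorer("0, 2")   // Some(Map(0 -> 0.0, 2 -> 0.2))
val bad = scorer("0, 9")   // index 9 >= numLabels, so None
```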