
Support multilabel problems #163

Closed
jon-morra-zefr opened this issue Aug 14, 2017 · 17 comments
@jon-morra-zefr

jon-morra-zefr commented Aug 14, 2017

It has become clear that Aloha needs to support true multilabel problems. Before any library is integrated, we need to decide upon the interfaces to expose to get this work done. This ticket is the discussion area for those interfaces. To get the ball rolling, here is a preliminary suggestion.

The input for multilabel problems will be a set of features and an N-dimensional binary vector indicating membership. To make guarantees easier to enforce, I think the number of labels should be an input to the spec at train time. There should then be a labels field containing a function that produces the indices of the labels to be activated [A => Iterable[Int]]. This is clearly a sparse format. The featurization should fail if an index is greater than the number of labels.
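A minimal sketch of that train-time contract (names are hypothetical, not Aloha's actual API; indices assumed 0-based):

```scala
// Hypothetical sketch of the proposed train-time contract: the number of
// labels is fixed up front, and a labels function produces sparse indices.
case class MultilabelSpec[A](numLabels: Int, labels: A => Iterable[Int]) {
  // Featurization fails (modeled here as Left) on an out-of-range index.
  def activeLabels(a: A): Either[String, Set[Int]] = {
    val idx = labels(a).toSet
    val bad = idx.filter(i => i < 0 || i >= numLabels)
    if (bad.nonEmpty) Left(s"label indices out of range: $bad")
    else Right(idx)
  }
}
```

For example, `MultilabelSpec[Seq[Int]](3, identity).activeLabels(Seq(0, 2))` succeeds, while an index of 3 or more would fail.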

At score time, a multilabel model should also have a labels field, containing the same function mentioned above. This function will be used to identify which labels to score. The output of the model should be a Map[Int, Double] giving the score of each label passed in.

At this time I would NOT support any featurization of labels, such as label-dependent features.

This is meant to get the discussion going and not a requirement.

@deaktator
Contributor

Output Types

There are some things here that need to be fleshed out regarding the output type.

The output type should be a Map-like structure, but Map[Int, Double] is tricky. In Protocol Buffers v3, "any integral or string type" is acceptable as a map key type; in Avro, "map keys are assumed to be strings".

So, we have to determine the output boundary. Who's going to be responsible for mapping the ID space to the class (class in the machine learning sense) instances? If this is going to be Aloha's responsibility,
we'll need to make the natural type (type parameter N) of the model something like

Map[K, Double]

Then we'll need to have a mapping from N to B, like with BasicDecisionTree. This is the job of the Auditor. So we should be fine as long as we can produce an Auditor of type:

Auditor[U, Map[K, Double], B]

This works fine for things like Option where we would have:

Auditor[Option[_], Map[K, Double], Option[Map[K, Double]]]

but we will need to update AvroScoreAuditor and the protobuf ScoreAuditor.
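Using a simplified stand-in for the Auditor trait (the real Aloha trait has more structure than this), the Option case might sketch out as:

```scala
// Simplified stand-in for Aloha's Auditor; this only models the success
// path discussed above, not the full trait.
trait Auditor[U, N, +B <: U] {
  def success(natural: N): B
}

// An Option-based auditor whose natural type is Map[K, Double], i.e.
// Auditor[Option[_], Map[K, Double], Option[Map[K, Double]]].
final class OptionMapAuditor[K]
    extends Auditor[Option[_], Map[K, Double], Option[Map[K, Double]]] {
  def success(natural: Map[K, Double]): Option[Map[K, Double]] = Option(natural)
}
```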

Parallel Arrays output

If we make the output type B based on parallel sequences, we will have more flexibility and the classes of type K could be returned directly. This might be more appealing. The problem with this approach is encoding the class information in a serialization framework. If we want to encode some type like

case class ClassInfo(id: Long, name: String, description: String)

we will have problems since libraries like Avro would need this type encoded in the protocol. But the protocol wouldn't know anything about a user's problem-specific codomain type.

So what do we do here?

@amir-ziai-zefr

So the parallel-arrays approach would return the user's input labels alongside the corresponding values? Say:

["class1", "class2", ...]
[1.0, 0.0, ...]
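In code, that shape might look like this (hypothetical names, not an Aloha type):

```scala
// Illustrative parallel-arrays output: labels(i) lines up with values(i).
final case class MultilabelOutput[K](labels: Vector[K], values: Vector[Double]) {
  require(labels.size == values.size, "labels and values must be parallel")
  def toMap: Map[K, Double] = labels.zip(values).toMap
}
```

e.g. `MultilabelOutput(Vector("class1", "class2"), Vector(1.0, 0.0)).toMap` yields `Map("class1" -> 1.0, "class2" -> 0.0)`.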

@jon-morra-zefr
Author

After speaking about this yesterday, I think my original comment is probably not true and we might want to support LDF right out of the box based on PR #105.

@deaktator
Contributor

deaktator commented Aug 29, 2017

This is a little tricky. Here's a basic skeleton of the class and companion object. A few things to notice:

  • We can't assume the learner is Serializable, so we need a Serializable producer that produces the possibly-unserializable learner in the class body. (This comes from previous experience integrating VW.)
  • The definition is currently for libraries that support sparse learning (Seq[(String, Double)]). We need to figure out if we want to support dense libraries here. If so, we'll need a separate type parameter and we'll need to parametrize learnerProducer's type.
case class MultilabelModel[U, K, -A, +B <: U](
    modelId: ModelIdentity,
    labels: collection.immutable.IndexedSeq[K],
    learnerProducer: () => (Seq[(String, Double)]) => Map[Int, Double],
    auditor: Auditor[U, Map[K, Double], B]
)
extends SubmodelBase[U, Map[K, Double], A, B] {
  @transient private[this] lazy val learner = learnerProducer()

  {
    // Force creation eagerly somehow
    learner 
  }

  override def subvalue(a: A): Subvalue[B, Map[K, Double]] = {
    val features: Seq[(String, Double)] = getFeatures(a)

    // Might need another param to specify which labels / predictions to output.
    val unmappedValues = learner(features)
    val natural = unmappedValues.map { case (k, v) => labels(k) -> v }

    // Obviously need to handle any failure cases too.
    val aud = auditor.success(modelId, natural)
    Subvalue(aud, Option(natural))
  }
}

object MultilabelModel extends ParserProviderCompanion {

  object Parser extends ModelSubmodelParsingPlugin {
    override val modelType: String = "multilabel"
    override def modelJsonReader[U, N, A, B <: U](
        factory: SubmodelFactory[U, A], 
        semantics: Semantics[A], 
        auditor: Auditor[U, N, B]
    )(implicit r: RefInfo[N], jf: JsonFormat[N]): Option[JsonReader[Model[A, B]]] = {
      if (!RefInfoOps.isSubType[N, Map[_, Double]])
        None
      else {
        // Here's the necessary reflection and JSON information needed to proceed.
        val refInfoK = RefInfoOps.typeParams(r).head
        // Returns an Option[JsonFormat[Any]]
        val jsonFormatK = factory.jsonFormat(refInfoK) 
        ???
      }
    }
  }

  override def parser: ModelParser = Parser
}

@deaktator
Contributor

Given the RefInfo[N] in

modelJsonReader(...)(implicit r: RefInfo[N])

we can use the following to get the RefInfo for K (but it will be untyped):

r.typeArguments.head

This should give us enough type information from the Scala perspective, but we still need a JsonReader[K], which I don't yet know how to get.
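For illustration, the same type-argument extraction can be written with standard Scala runtime reflection in place of RefInfo (an assumed analogy, not Aloha's code):

```scala
import scala.reflect.runtime.universe._

// Pull the key type K out of Map[K, Double] at runtime. Like
// r.typeArguments.head, the result is untyped: a reflection Type value,
// not a statically known type.
def keyType[N: TypeTag]: Type = typeOf[N].typeArgs.head

// keyType[Map[String, Double]] =:= typeOf[String]  // true
```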

@deaktator deaktator self-assigned this Aug 29, 2017
@deaktator
Contributor

@amir-ziai-zefr Does it make sense why we need a spray.json.JsonReader[K]? Otherwise, we won't be able to read the labels needed by the model.

@amirziai
Contributor

@deaktator we'll get a JsonReader that knows how to parse the labels given the type information provided to modelJsonReader?

@amirziai
Contributor

amirziai commented Aug 29, 2017

Here's the model spec that @deaktator proposed

{
  "type": "Multilabel",
  "labels": [ "class_one", "class_two", "class_three" ],
  "label_extractor": A => Set[Label],
  "ldf": Seq[Label => Seq[(String, Double)]],
  "features": { },
  "submodel": {
    "vw": {
      "model": "..."
    }
  }
}

@deaktator: I think we should support exactly one of labels and label_extractor at a time. If labels is provided, we could apply ldf once. If label_extractor is provided, we'll have to apply ldf on each prediction. Maybe these could be wrapped and contained in an ADT. If we choose to support different feature densities, we'll have to parametrize ldf so it could be one of:

Label => Seq[Iterable[(String, Double)]]

OR

Label => Seq[Double]
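The "exactly one of labels / label_extractor" idea could be wrapped in an ADT along these lines (names are illustrative, not Aloha's):

```scala
// Exactly one source of labels: a static list, or an extractor over A.
sealed trait LabelSource[A, Label]
final case class StaticLabels[A, Label](labels: Vector[Label])
    extends LabelSource[A, Label]
final case class ExtractedLabels[A, Label](extract: A => Set[Label])
    extends LabelSource[A, Label]

def labelsFor[A, Label](src: LabelSource[A, Label], a: A): Iterable[Label] =
  src match {
    case StaticLabels(ls)   => ls    // ldf can be applied once, up front
    case ExtractedLabels(f) => f(a)  // ldf must be applied per prediction
  }
```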

@deaktator
Contributor

deaktator commented Aug 30, 2017

After some thought, I think there's a definite path forward. There will be three stages.

  1. Write the multi-label base class enabling primitive types as labels. Test the class (but not parsing), without any implementations, using OptionAuditor.
  2. Write VW plugin for CSOAA and/or CSOAA LDF. See Hal Daume's On Multiclass Classification in VW page for some details.
    • This will also involve writing the dataset generation code. That is not going to be trivial.
  3. Write custom Auditors and RefInfoToJsonFormat to support custom label types for multi-label problems. Write a custom "default factory" to support these new types. This might require creating a facade around ModelFactoryImpl. If we want the new model factory to be an actual ModelFactory, we can remove sealed or change ModelFactoryImpl to a non-case class.

Multi-label Base Class

The skeleton for MultilabelModel (updated above) now includes how, in the factory method, we can get the reflection and JSON schema information necessary to construct the model with a desired label type. To deal with more exotic types not supported by RefInfoToJsonFormat and Auditor, we'll need to make new ones. For now, I think we may only need to update the protocol-based auditors (avro, protobuf).

We need to nail down whether we want to support dense and sparse features. VW and libraries supporting SVM-light like input are obviously sparse. H2O is dense. I assume most deep learning libraries will also be dense. (Let's figure this out ASAP)

The implementations provided to the base class parser should adhere to a plugin-based architecture. This can be done via class path scanning using Reflections like I do in ModelFactory. This will allow us to cleverly define holes in the JSON to be filled in via the plugins.

We can test the functionality of the class and companion object separately. I suspect the class will be the easier of the two to test.

Implementations (VW CSOAA and CSOAA LDF)

Once we nail down the details related to the plugin architecture, I think CSOAA should be fairly straightforward because we know the number of classes.

I am actually more concerned about CSOAA LDF. How do we create the mapping from ID space to K? We might need to make the labels parameter to MultilabelModel a Map[Int, K]. (Let's discuss this).

Auditors and RefInfoToJsonFormat

I think these are pretty easy. We'll need to define a JsonFormat for our custom label types (which should be easy). We'll also need one or more custom Auditors (at least if we want to return Avro records). This leads to a problem of changing the Avro schema which might affect consumers downstream. (How do we deal with this?)

@deaktator
Contributor

One other question: if we decide to allow dense features, how do we deal with missing feature values? We can't simply flatMap them out, because the indices need to stay consistent. We could leave the values as zeroes, but that seems like a bad idea too.

@amirziai
Contributor

@deaktator regarding sparsity are you concerned with features or labels? My understanding of Tensorflow (and by extension Keras) is that sparse labels are not supported but sparse features (Sparse Tensors) are supported. The same is probably true for other deep learning libraries.

@deaktator
Contributor

@amir-ziai-zefr Concerned with sparse features.

@deaktator
Contributor

deaktator commented Aug 31, 2017

I was having trouble thinking about how to do label-dependent features. @amir-ziai-zefr suggested passing one of the following to the MultilabelModel constructor:

Map[K, Option[com.eharmony.aloha.dataset.density.Sparse]]
Map[K, Option[sci.IndexedSeq[com.eharmony.aloha.dataset.density.Sparse]]]

I think this is pretty slick, as it avoids the problem of having to construct functions with domain K. Constructing such functions is problematic because we don't have a Semantics[K] available to the parser; we only have a Semantics[A]. But we do have a RefInfo[K] and a JsonFormat[K]. We also have a JsonFormat[Option[Iterable[(String, Double)]]], so we could totally do this. The JSON would look something like:

{
  "modelType": "multilabel-sparse",
  "modelId": { "id": 0, "name": "my multilabel model" },

  "other_stuff": "...",

  "labels": [
    { 
      "label": { 
        "id": 1,
        "title": "sports",
        "desc": "sports videos for sports" 
      }, 
      "label_dep_features": [
        ["desc=sports", 2], 
        ["desc=for", 1], 
        ["desc=videos", 1]
      ]
    },
    { 
      "label": {
        "id": 2,
        "title": "movies"
      }, 
      "label_dep_features": [
        ["movies", 1]
      ] 
    },
    { 
      "label": {
        "id": 3,
        "title": "misc",
        "desc": "doesn't have label-dependent features"
      }
    }
  ]
}

This is much easier from a coding perspective than trying to produce a Semantics[K] from a Semantics[A] and then producing a bunch of label-dependent feature functions like:

Seq[GenAggFunc[K, Sparse]]

This solution is much less error prone, but it requires label-dependent featurization to be done up front and copied into the Aloha model specification.
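A sketch of how the proposed per-label table could be consumed, approximating Sparse as Iterable[(String, Double)] (the real type is com.eharmony.aloha.dataset.density.Sparse):

```scala
// Per-label LDF table, keyed here by label title for illustration.
type Sparse = Iterable[(String, Double)]

val ldfTable: Map[String, Option[Sparse]] = Map(
  "sports" -> Some(Seq("desc=sports" -> 2.0, "desc=for" -> 1.0, "desc=videos" -> 1.0)),
  "movies" -> Some(Seq("movies" -> 1.0)),
  "misc"   -> None  // no label-dependent features, like "misc" in the JSON above
)

// A missing or absent entry simply contributes no label-dependent features.
def ldfFor(label: String): Sparse =
  ldfTable.get(label).flatten.getOrElse(Iterable.empty)
```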

What does everyone think?

@jon-morra-zefr
Author

This seems like an OK thing to do. I'm not convinced I understand all the ramifications at this time. My biggest concern is that this looks burdensome for model creators (data scientists). Can we avoid placing a large burden on them?

@amirziai
Contributor

@jon-morra-zefr this should be generated programmatically and probably changes very infrequently. The alternative is computationally heavy and leaves a lot of room for things to go wrong from what I understand.

@jon-morra-zefr
Author

If it's generated programmatically then it shouldn't be a burden for the DS in which case I don't see a problem with it.

@deaktator
Contributor

After talking to Jon, the first stab at this will not have label-dependent features. I'll leave some comments on approaches that could be used to implement LDF in future versions.

This was referenced Oct 31, 2017