# Evaluation Measure

The evaluation measure reflects typical use cases at XING: users are presented with their top-k personalized recommendations, and a user interaction with one of the top-k is counted as a success.

The task of the challenge is to compute 30 recommendations (or fewer) for each of the 150,000 target users. In particular, the algorithms have to predict the items that a user will interact with. Interpreted as a ranking problem, the task reads as follows: given a user, the recommender algorithm has to compute a ranked list of items such that the items the user is most likely to interact with appear at the top of the ranking.

The algorithms can exploit all of the training data. The submission system measures the quality of the recommendations that your algorithm generated for each of the target users using the scoring function score(S, T). Here, S is the solution that you upload to the system and T is the test set that specifies which items are relevant to which user. T is, of course, not accessible to the teams.

Input:

- Test set: T = the test set consists of 150,000 (user, relevantItems)-tuples. For a given user (contained in the target users), relevantItems is the set of items that are considered relevant:
  - Relevant items are those items on which the user performed a "positive interaction" during the test period.
  - We count clicks, bookmarks and clicks on the reply button as positive interactions.
- Solution: S = the task is to compile, for each user, an ordered list of items/recommendations that has a high overlap with the relevantItems for that user. Teams should predict up to 30 items for each of the 150k target users (a small illustration of T and S follows this list).
  - S(u) = recommendedItems = the ordered list of items that are recommended to user u (may be empty if no items are recommended to the user)
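
For concreteness, here is a hypothetical sketch of the two inputs as Python data structures (the user and item IDs are invented purely for illustration):

```python
# T: test set mapping each target user to their set of relevant items.
T = {
    101: {7, 42, 99},  # user 101 interacted positively with items 7, 42 and 99
    102: {13},
    103: set(),        # a user may have no relevant items in the test period
}

# S: solution mapping each target user to an ordered list of up to 30
# recommended items, most confident first (the list may be empty).
S = {
    101: [42, 7, 55, 3],
    102: [8, 21],
    103: [],
}
```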

score(S, T):

```
// main scoring function:
function score(S, T) = {
  score = 0.0
  foreach (u, t) in T: // t = set of relevant items for user u
    r = S(u) // r = ordered list of recommended items for user u
    score +=   20 * (precisionAtK(r, t, 2) + precisionAtK(r, t, 4) + recall(r, t) + userSuccess(r, t))
             + 10 * (precisionAtK(r, t, 6) + precisionAtK(r, t, 20))

  return score
}

// precision within the first k items:
function precisionAtK(recommendedItems, relevantItems, k) = {
  topK = recommendedItems.take(k) // takes the first k items from the list of recommendedItems
  return intersect(topK, relevantItems).size / k
}

// recall = fraction of relevant items that were retrieved (at most 30 items
// may be submitted per user):
function recall(recommendedItems, relevantItems) = {
  if (relevantItems.size > 0)
    return intersect(recommendedItems.take(30), relevantItems).size / relevantItems.size
  else
    return 0.0
}

// user success = was at least one relevant item recommended for the given user?
function userSuccess(recommendedItems, relevantItems) = {
  if (intersect(recommendedItems.take(30), relevantItems).size > 0)
    return 1.0
  else
    return 0.0
}
```

Additional remarks:

- intersect(r, t) returns the intersection between the (top-k) list of recommended items r and the set of relevant items t
- intersect(r, t).size returns the size of that intersection
- For each user, one can thus earn at most 100 points (given that there exist 20 relevant items for the user).
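
The pseudocode translates almost line by line into Python. The following is an unofficial sketch of the evaluator (the real submission system may differ in details such as how duplicate recommendations are handled):

```python
def precision_at_k(recommended, relevant, k):
    """Precision within the first k recommended items."""
    return len(set(recommended[:k]) & relevant) / k

def recall(recommended, relevant):
    """Fraction of relevant items that appear among the top 30 recommendations."""
    if not relevant:
        return 0.0
    return len(set(recommended[:30]) & relevant) / len(relevant)

def user_success(recommended, relevant):
    """1.0 if at least one relevant item was recommended, else 0.0."""
    return 1.0 if set(recommended[:30]) & relevant else 0.0

def score(S, T):
    """Main scoring function: S maps users to ordered recommendation lists,
    T maps users to their sets of relevant items."""
    total = 0.0
    for user, relevant in T.items():
        r = S.get(user, [])  # ordered list of recommended items for this user
        total += (20 * (precision_at_k(r, relevant, 2)
                        + precision_at_k(r, relevant, 4)
                        + recall(r, relevant)
                        + user_success(r, relevant))
                  + 10 * (precision_at_k(r, relevant, 6)
                          + precision_at_k(r, relevant, 20)))
    return total

# Sanity check: a user with exactly 20 relevant items, all ranked at the top,
# earns the maximum of 100 points: 20 * (1 + 1 + 1 + 1) + 10 * (1 + 1) = 100.
assert score({0: list(range(20))}, {0: set(range(20))}) == 100.0
```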