Why use MAP for evaluation in the offline challenge?

In previous publications we have shown that recommending given names is a difficult task: for many users, many recommenders did not place the test names in the top positions of their recommendation lists. Measures like precision@k therefore make it hard to distinguish between results, especially for small cut-off thresholds k. MAP (Mean Average Precision) is a measure that is suited to (arbitrarily long) ordered lists of recommendations. Like NDCG (Normalized Discounted Cumulative Gain) or AUC (Area Under the Curve), it evaluates a recommendation list based on the positions of the left-out items within that list. It yields high scores when the test items appear at the top of the recommendations and lower scores when they are ranked further down (unlike precision@k, where lower-ranked items are cut off and thus do not contribute to the score at all).
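To make the difference concrete, here is a minimal Python sketch (not the challenge's evaluation code; names and numbers are purely illustrative) that computes precision@k and average precision for one user's ranked recommendation list. MAP is then simply the mean of the average precision values over all test users.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are relevant (left-out) names."""
    hits = sum(1 for name in ranked[:k] if name in relevant)
    return hits / k

def average_precision(ranked, relevant):
    """Average of precision@rank over the ranks at which relevant names occur."""
    hits, score = 0, 0.0
    for rank, name in enumerate(ranked, start=1):
        if name in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

# Example: both left-out names appear, but only at ranks 4 and 9.
ranked = ["Anna", "Paul", "Emma", "Lena", "Max", "Ben", "Mia", "Leon", "Finn"]
relevant = {"Lena", "Finn"}
print(precision_at_k(ranked, relevant, 3))  # 0.0 -- precision@3 cannot separate such runs
print(average_precision(ranked, relevant))  # (1/4 + 2/9) / 2 ≈ 0.236
```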
In the offline challenge, two names were left out for each test user and thus have to be predicted by the recommender systems, so the score depends on the ranks of these two items. The main difference between the measures is how they incorporate these ranks into the score: AUC yields a linear combination of the two ranks, MAP a linear combination of their reciprocal ranks, and NDCG a linear combination of the (logarithmically) smoothed reciprocal ranks. Among these measures, MAP discriminates most strongly between higher and lower ranks and was therefore chosen for the challenge. The sketch below illustrates this.
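The following sketch expresses the three measures as functions of the ranks r1 &lt; r2 of the two left-out names within a recommendation list of length n (binary relevance, no ties); the exact evaluation code of the challenge may differ in such details. It shows how much more sharply MAP reacts when the two names slip down the list.

```python
from math import log2

def auc(r1, r2, n):
    """Fraction of (left-out, other) pairs ordered correctly -- linear in the ranks."""
    correct_pairs = (n - r1 - 1) + (n - r2)
    return correct_pairs / (2 * (n - 2))

def ap_from_ranks(r1, r2):
    """Average precision: a linear combination of the reciprocal ranks."""
    return (1.0 / r1 + 2.0 / r2) / 2

def ndcg_from_ranks(r1, r2):
    """NDCG: logarithmically smoothed reciprocal ranks, normalized by the ideal ranking."""
    dcg = 1 / log2(r1 + 1) + 1 / log2(r2 + 1)
    ideal = 1 / log2(2) + 1 / log2(3)
    return dcg / ideal

# Moving the two names from ranks (1, 2) to (10, 20) in a list of 1000 candidates
# barely changes AUC, lowers NDCG moderately, and drops MAP from 1.0 to 0.1.
for r1, r2 in [(1, 2), (10, 20)]:
    print(r1, r2,
          round(auc(r1, r2, 1000), 3),
          round(ap_from_ranks(r1, r2), 3),
          round(ndcg_from_ranks(r1, r2), 3))
```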
