Why use MAP for evaluation during offline challenge?

In previous publications we have shown that recommending given names is a difficult task. For many users, many recommenders did not place the test names in top positions. Thus, measures like precision@k make it hard to distinguish between results, especially for low cut-off thresholds k. MAP (Mean Average Precision) is a measure that is suitable for (arbitrarily long) ordered lists of recommendations. Like NDCG (Normalized Discounted Cumulative Gain) or AUC (Area Under the Curve), it evaluates the recommendations based on the positions of the left-out items within the list of recommendations. It yields high scores when test items appear at the top positions of the recommendations and lower scores when they are ranked further down (unlike precision@k, where lower-ranked items are cut off and thus do not contribute to the score).
In the offline challenge, two names have been left out for each test user and thus have to be predicted by the recommender systems. The score depends on the ranks of those two items. The main difference between the measures is how they incorporate these ranks into the score: AUC yields a linear combination of the two ranks, MAP a linear combination of the reciprocal ranks, and NDCG a linear combination of the (logarithmically) smoothed reciprocal ranks. Among these measures, MAP discriminates the strongest between higher and lower ranks and was therefore chosen for the challenge.
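For a test user with exactly two left-out names at ranks r1 < r2, the three measures can be sketched as follows. This is a rough Python sketch for illustration only; the AUC formula below is one common pairwise-ranking definition (an assumption), and the exact variants used elsewhere may differ in detail.

```python
from math import log2

# Sketch for a test user with exactly two left-out names at ranks r1 < r2
# within a recommendation list of n names.

def avg_precision(r1, r2):
    """MAP contribution: linear combination of the reciprocal ranks."""
    return (1 / r1 + 2 / r2) / 2

def auc(r1, r2, n=1000):
    """AUC: linear in the ranks themselves (assumed pairwise definition)."""
    return 1 - ((r1 - 1) + (r2 - 2)) / (2 * (n - 2))

def ndcg(r1, r2):
    """NDCG: logarithmically smoothed reciprocal ranks."""
    ideal = 1 / log2(2) + 1 / log2(3)  # both left-out names on top
    return (1 / log2(r1 + 1) + 1 / log2(r2 + 1)) / ideal

print(avg_precision(10, 1000))  # 0.051
# Moving a hit from rank 1 down to rank 10 hurts MAP far more than NDCG or AUC:
print(avg_precision(1, 100) / avg_precision(10, 100))  # ~8.5
print(ndcg(1, 100) / ndcg(10, 100))                    # ~2.6
print(auc(1, 100) / auc(10, 100))                      # ~1.005
```

The ratios in the last three lines illustrate why MAP was chosen: it penalizes a drop in rank much more strongly than the smoothed or linear alternatives.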

Which data will be used for the leaderboard?

The leaderboard gives you an overview of the latest results from all participating teams. Once every week we check for new submissions and recalculate the board. However, only the latest submission of each team is used.
This is especially important to prevent cheating: otherwise, a team could generate many result files (e.g., by guessing) and submit them all, becoming the best team through a brute-force-like approach.

Why use k = 1,000 for MAP@k evaluation?

Although we have argued against cut-off measures like precision@k (in the recommender literature often precision@5 or precision@10), it is reasonable to cut off lists at some point. Compared to many other areas where recommendations are used (e.g., people, movie, book, or website recommenders), the time needed to evaluate a recommendation is very short (if you like a name, just click it), and the cost in terms of money or time for following a recommendation that turns out bad is very low. At the same time, finding the perfect name for a child is often a process of months rather than minutes (as for finding the next book to read or deciding which movie to watch) or seconds (deciding which tag to use or which website to visit). Thus it is reasonable to assume that parents-to-be are willing to scroll through lists of names longer than the usual top 10. Additionally, consider that one of the traditional ways of searching for names is to go through first-name dictionaries, where names are listed unpersonalized, in alphabetical order. Such books usually contain far more than 1,000 names that have to be read, so it seems reasonable that their readers won't mind studying longer name lists on the web.

When I submit an empty list, the evaluation script returns 0.0014975…

Shouldn’t it return 0?

The script doesn’t return 0 because we adapted the MAP score to address the restriction of the evaluation to the top 1,000 names.

Firstly, consider a recommender A which recommends one relevant name at position 10 and a second relevant name at position 1000. Its MAP score is (1/10 + 2/1000)/2 = 0.051.

Now consider another recommender B which recommends only one relevant name, at position 10. Without the adaptation, its MAP score would be (1/10)/1 = 0.1.

That is, recommender B would achieve a MAP score nearly double recommender A’s, although A recommends two relevant names and B only one.

We tackled this case by virtually placing the names which a recommender fails to recommend at positions 1000 + i. In the case of the empty list, none of the relevant names is recommended, yielding a MAP score of (1/1001 + 2/1002)/2 = 0.001497504.
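The adapted computation can be sketched as follows. This is a minimal Python sketch for two withheld names per user, not the organizers’ actual evaluation script:

```python
# Minimal sketch of the adapted MAP@1000: each left-out name missing from the
# submitted list is treated as if it were recommended at the virtual position
# 1000 + i (i = 1, 2, ...).

def map_at_1000(recommended, left_out):
    """recommended: ordered list of at most 1,000 names;
    left_out: the withheld test names of one user."""
    ranks, missing = [], 0
    for name in left_out:
        if name in recommended[:1000]:
            ranks.append(recommended.index(name) + 1)  # 1-based rank
        else:
            missing += 1
            ranks.append(1000 + missing)               # virtual penalty position
    ranks.sort()
    # average precision: the i-th relevant name is found at rank ranks[i-1]
    return sum((i + 1) / r for i, r in enumerate(ranks)) / len(ranks)

print(map_at_1000([], ["andreas", "robert"]))  # ≈ 0.001497504 (empty list)
```

With this sketch, recommender A from above scores (1/10 + 2/1000)/2 = 0.051, while the empty list reproduces the 0.0014975… from the question.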

Is it a problem to submit the result file with names in lower case?

The case of a name is ignored during evaluation, using the following UTF-8 aware lower-case conversion in Perl:

use Encode qw(decode encode);

$enc = "utf-8";
$inputline = decode($enc, $inputline);
$inputline = encode($enc, lc($inputline));

To answer the question: It is perfectly fine to submit the result file with strings in lower case!

Can you give an example of how the test data is derived from the full usage log?

Sure! Roughly, we selected users who entered at least five different names and withheld the last two entered names for evaluation (see the description of the offline challenge for the details of the selection process).

If you want to obtain comparable training and test scenarios from the public data yourself, you can use the Perl script on the download page, which was used to split the data into training and test sets.

And if you are still interested, have a look at the following example for a user with id 23, which illustrates some of the gritty details. Firstly, consider a fictitious full user profile within nameling’s query logs:

userId  activity        name       POSIX_time
23      ADD_FAVORITE    max        1361099013
23      ENTER_SEARCH    carsten    1361099014
23      ENTER_SEARCH    jan        1361099015
23      ENTER_SEARCH    carsten    1361099016
23      ENTER_SEARCH    stephan    1361099017
23      ENTER_SEARCH    andreas    1361099018
23      ENTER_SEARCH    alromano   1361099019
23      LINK_SEARCH     carsten    1361099020
23      ENTER_SEARCH    andreas    1361099021
23      ENTER_SEARCH    robert     1361099022
23      ENTER_SEARCH    max        1361099023
23      LINK_SEARCH     oscar      1361099024
23      NAME_DETAILS    oscar      1361099025

According to the selection of test names from the user’s full profile, the following part is contained in the training data set:

userId  activity        name       POSIX_time
23      ADD_FAVORITE    max        1361099013
23      ENTER_SEARCH    carsten    1361099014
23      ENTER_SEARCH    jan        1361099015
23      ENTER_SEARCH    carsten    1361099016
23      ENTER_SEARCH    stephan    1361099017

The test data set contains:

userId  name_1    name_2
23      andreas   robert

Note that alromano is not contained in nameling’s list of known names. For the evaluation, andreas and robert are selected, while all other activities (after 23 ENTER_SEARCH andreas 1361099021) are discarded.
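The split illustrated above can be sketched as follows. This is a simplified Python sketch inferred from the example; the Perl script on the download page is authoritative and covers further corner cases. The assumed rule: a test name must be known to nameling and must first appear in the profile as an ENTER_SEARCH (so max, first seen via ADD_FAVORITE, is excluded); the two most recent such names are withheld, and the training profile is cut directly before the first activity involving either of them.

```python
# Simplified sketch of the per-user train/test split (assumption: the real
# Perl script handles additional corner cases not shown in the example).

def split_profile(activities, known_names):
    """activities: chronologically ordered (activity, name, posix_time) rows
    of one user; returns (training_rows, [name_1, name_2])."""
    first = {}  # name -> (index, activity) of its first occurrence
    for i, (act, name, _) in enumerate(activities):
        first.setdefault(name, (i, act))
    # candidates: known names first seen via ENTER_SEARCH, most recent first
    candidates = [n for n, (i, act) in
                  sorted(first.items(), key=lambda kv: -kv[1][0])
                  if n in known_names and act == "ENTER_SEARCH"]
    test = sorted(candidates[:2], key=lambda n: first[n][0])  # chronological
    cut = min(first[n][0] for n in test)
    return activities[:cut], test

log = [
    ("ADD_FAVORITE", "max",      1361099013),
    ("ENTER_SEARCH", "carsten",  1361099014),
    ("ENTER_SEARCH", "jan",      1361099015),
    ("ENTER_SEARCH", "carsten",  1361099016),
    ("ENTER_SEARCH", "stephan",  1361099017),
    ("ENTER_SEARCH", "andreas",  1361099018),
    ("ENTER_SEARCH", "alromano", 1361099019),
    ("LINK_SEARCH",  "carsten",  1361099020),
    ("ENTER_SEARCH", "andreas",  1361099021),
    ("ENTER_SEARCH", "robert",   1361099022),
    ("ENTER_SEARCH", "max",      1361099023),
    ("LINK_SEARCH",  "oscar",    1361099024),
    ("NAME_DETAILS", "oscar",    1361099025),
]
known = {"max", "carsten", "jan", "stephan", "andreas", "robert", "oscar"}
train, test = split_profile(log, known)
print(test)        # ['andreas', 'robert']
print(len(train))  # 5 rows, up to and including stephan
```

On the example profile this reproduces the split shown above: training keeps the first five rows, and andreas and robert become the test names.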

Can you give an example of how the result files should look?

The tab-separated line for user 23 may look like this:

23    john    folke    andreas    stephan    jack     robert    ...
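Writing such a file can be sketched as follows; this is a hypothetical Python sketch (the recommendation dictionary and file name are made up for illustration), emitting one tab-separated line per test user, the userId followed by up to 1,000 recommended names:

```python
# Hypothetical sketch: recommendations per test user, best name first.
recommendations = {
    23: ["john", "folke", "andreas", "stephan", "jack", "robert"],
}

with open("result.tsv", "w", encoding="utf-8") as out:
    for user_id, names in sorted(recommendations.items()):
        # one line per user: userId, then at most 1,000 names, tab-separated
        out.write("\t".join([str(user_id)] + names[:1000]) + "\n")
```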

How can I derive my own training and test set from the public challenge data?

Due to several constraints, the choice of evaluation data is a bit complicated; it is described in detail on the offline challenge description page.

But you don’t have to implement this process yourself. Instead, you can use our Perl script from the download page.

The script is not very user friendly (sorry), but it should do the job. It assumes the list of known names in file /path/to/namelist.txt and the public training data in file /path/to/nameling.trainData:

$ cat /path/to/nameling.trainData | ./process_activitylog.pl /path/to/namelist.txt
...writing out name statistics to /tmp/nameling-public.names
...writing out category usage statistics to /tmp/nameling-public.categories
...writing out training data to /tmp/nameling-public.trainData and evaluation data to /tmp/nameling-public.evalData

As indicated, you will find your personal training data in the file /tmp/nameling-public.trainData and your evaluation data in /tmp/nameling-public.evalData.