Can you give an example of how the test data is derived from the full usage log?

Sure! Roughly, we selected users with at least five different names and withheld the last two entered names for evaluation (see the description of the offline challenge for the details of the selection process).

If you want to obtain comparable training and test scenarios from the public data yourself, you can use the Perl script that was used to split the data into training and test sets; it is available on the download page.

And if you are still interested, have a look at the following example for a user with id 23, which illustrates some of the nitty-gritty details. First, consider a fictitious full user profile within nameling’s query logs:

userId  activity        name       POSIX_time
23      ADD_FAVORITE    max        1361099013
23      ENTER_SEARCH    carsten    1361099014
23      ENTER_SEARCH    jan        1361099015
23      ENTER_SEARCH    carsten    1361099016
23      ENTER_SEARCH    stephan    1361099017
23      ENTER_SEARCH    andreas    1361099018
23      ENTER_SEARCH    alromano   1361099019
23      LINK_SEARCH     carsten    1361099020
23      ENTER_SEARCH    andreas    1361099021
23      ENTER_SEARCH    robert     1361099022
23      ENTER_SEARCH    max        1361099023
23      LINK_SEARCH     oscar      1361099024
23      NAME_DETAILS    oscar      1361099025

According to the selection of test names from the user’s full profile, the following part is contained in the training data set:

userId  activity        name       POSIX_time
23      ADD_FAVORITE    max        1361099013
23      ENTER_SEARCH    carsten    1361099014
23      ENTER_SEARCH    jan        1361099015
23      ENTER_SEARCH    carsten    1361099016
23      ENTER_SEARCH    stephan    1361099017

The test data set contains:

userId  name_1    name_2
23      andreas   robert

Note that alromano is not contained in nameling’s list of known names, so it cannot be selected as a test name. For the evaluation, andreas and robert are selected, while all other activities (after 23 ENTER_SEARCH andreas 1361099021) are discarded.
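
For readers who prefer code, the following Python sketch reproduces the example for user 23. It is not the Perl script from the download page, and the exact rules it applies are assumptions inferred from this example only: names the user had already added as a favorite (here max) are skipped as test names, the training part ends just before the first occurrence of either test name, and the "at least five different names" check counts all names in the profile. The helper names (`parse_log`, `split_user`) and the small `KNOWN_NAMES` set are hypothetical; the description of the offline challenge remains the authoritative reference.

```python
from typing import List, NamedTuple


class Activity(NamedTuple):
    user_id: int
    activity: str
    name: str
    time: int


def parse_log(text: str) -> List[Activity]:
    """Parse whitespace-separated log lines, skipping the header row."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        if not fields or fields[0] == "userId":
            continue
        rows.append(Activity(int(fields[0]), fields[1], fields[2], int(fields[3])))
    return rows


def split_user(log, known_names, min_names=5, n_test=2):
    """Return (training_rows, test_names), or None if the user is not eligible."""
    log = sorted(log, key=lambda a: a.time)
    # ASSUMPTION: the minimum is counted over all distinct names in the profile.
    if len({a.name for a in log}) < min_names:
        return None

    test_names = []
    # Walk backwards through the log to find the last entered names.
    for i in range(len(log) - 1, -1, -1):
        a = log[i]
        if a.activity != "ENTER_SEARCH" or a.name not in known_names:
            continue
        if a.name in test_names:
            continue
        # ASSUMPTION: names the user added as a favorite earlier are skipped.
        if any(b.activity == "ADD_FAVORITE" and b.name == a.name for b in log[:i]):
            continue
        test_names.append(a.name)
        if len(test_names) == n_test:
            break
    if len(test_names) < n_test:
        return None

    # ASSUMPTION: training ends just before the first occurrence of a test name.
    cut = next(i for i, a in enumerate(log) if a.name in test_names)
    return log[:cut], test_names[::-1]


FULL_LOG = """
userId  activity        name       POSIX_time
23      ADD_FAVORITE    max        1361099013
23      ENTER_SEARCH    carsten    1361099014
23      ENTER_SEARCH    jan        1361099015
23      ENTER_SEARCH    carsten    1361099016
23      ENTER_SEARCH    stephan    1361099017
23      ENTER_SEARCH    andreas    1361099018
23      ENTER_SEARCH    alromano   1361099019
23      LINK_SEARCH     carsten    1361099020
23      ENTER_SEARCH    andreas    1361099021
23      ENTER_SEARCH    robert     1361099022
23      ENTER_SEARCH    max        1361099023
23      LINK_SEARCH     oscar      1361099024
23      NAME_DETAILS    oscar      1361099025
"""

# Hypothetical stand-in for nameling's list of known names (alromano is absent).
KNOWN_NAMES = {"max", "carsten", "jan", "stephan", "andreas", "robert", "oscar"}

training, test_names = split_user(parse_log(FULL_LOG), KNOWN_NAMES)
print(test_names)  # ['andreas', 'robert']
for row in training:
    print(row.user_id, row.activity, row.name, row.time)
```

Under these assumptions, the five printed training rows end with the stephan entry at 1361099017, matching the training excerpt shown above.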

15th Discovery Challenge

organized in conjunction with ECML PKDD 2013
