How can I derive my own training and test set from the public challenge data?

Due to several constraints, the choice of evaluation data is a bit complicated. It is described in detail on the offline challenge description page.

But you don’t have to implement this process by yourself. Instead, you can use our Perl script from the download page.

The script is not very user friendly (sorry), but it should do the job. Assuming the list of known names is in file /path/to/namelist.txt and the public training data is in file /path/to/nameling.trainData, run:

$ cat /path/to/nameling.trainData | ./process_activitylog.pl /path/to/namelist.txt
...writing out name statistics to /tmp/nameling-public.names
...writing out category usage statistics to /tmp/nameling-public.categories
...writing out training data to /tmp/nameling-public.trainData and evaluation data to /tmp/nameling-public.evalData

As indicated, you will find your personal training data in file /tmp/nameling-public.trainData and your evaluation data in file /tmp/nameling-public.evalData.
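If you want to confirm that both files were actually written before you start training, a quick sanity check along the following lines may help. This is only a sketch: check_outputs is a hypothetical helper, not part of the challenge tooling, and it makes no assumptions about the file format beyond counting lines.

```shell
# Hypothetical helper (not part of process_activitylog.pl): print the line
# count of each given file, and return non-zero if any is missing or empty.
check_outputs() {
    status=0
    for f in "$@"; do
        if [ -s "$f" ]; then
            # tr strips the padding some wc implementations add
            printf '%s: %s lines\n' "$f" "$(wc -l < "$f" | tr -d ' ')"
        else
            printf '%s: missing or empty\n' "$f" >&2
            status=1
        fi
    done
    return "$status"
}

# Example call, using the output paths reported by the script:
#   check_outputs /tmp/nameling-public.trainData /tmp/nameling-public.evalData
```

A non-zero return value means at least one of the files was not produced, in which case it is worth re-checking the two input paths given to the script.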

15th Discovery Challenge

organized in conjunction with ECML PKDD 2013
