Evaluation

Here we provide details about the evaluation of your results for the offline Tasks 1 and 2.

Test data

The test dataset will consist of a bibtex, a bookmark, and a tas file in the same format as the training dataset. However, the tas file does not contain tags: instead of a tag it contains the string null, and each post has exactly one TAS (= line). You can use a sample file created from the cleaned dump training data to test reading the file.
We will release this dataset 48 hours before the end of the competition. We expect every participant to submit a file containing one line per prediction: the content_id of the post (bibtex or bookmark), followed by the list of recommended tags (the tags are space-separated, and the two columns content_id and tags are separated by a tab). We consider only the first five tags. Here is an example of the expected format:

content_id	tags
123456778	hello world

We also provide an example result file matching the cleaned dump training data.
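For illustration, one such result line could be produced as follows (a minimal sketch; the class and method names are ours, not part of any provided tooling):

```java
import java.util.List;

public class ResultWriter {

    // Build one result line: the content_id, a tab, then the recommended
    // tags separated by spaces. Only the first five tags are evaluated,
    // so we truncate the list here.
    static String formatLine(String contentId, List<String> tags) {
        List<String> firstFive = tags.subList(0, Math.min(5, tags.size()));
        return contentId + "\t" + String.join(" ", firstFive);
    }

    public static void main(String[] args) {
        System.out.println(formatLine("123456778", List.of("hello", "world")));
    }
}
```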

Evaluation Criterion

We will use the F1-measure, common in Information Retrieval, to evaluate the recommendations. To this end, we first compute precision and recall for each post in the test data by comparing the recommended tags against the tags the user originally assigned to that post. We then average precision and recall over all posts in the test data and use the resulting averaged precision and recall to compute the F1-measure as f1m = (2 * precision * recall) / (precision + recall). For details, we refer to the paper Tag Recommendations in Social Bookmarking Systems.
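The per-post measures and the final combination can be sketched as follows (a simplified illustration, not the official evaluator; class and method names are ours):

```java
import java.util.HashSet;
import java.util.Set;

public class F1Measure {

    // Per-post precision: fraction of recommended tags that are correct.
    static double precision(Set<String> recommended, Set<String> truth) {
        if (recommended.isEmpty()) return 0.0;
        Set<String> hits = new HashSet<>(recommended);
        hits.retainAll(truth);
        return (double) hits.size() / recommended.size();
    }

    // Per-post recall: fraction of the true tags that were recommended.
    static double recall(Set<String> recommended, Set<String> truth) {
        if (truth.isEmpty()) return 0.0;
        Set<String> hits = new HashSet<>(truth);
        hits.retainAll(recommended);
        return (double) hits.size() / truth.size();
    }

    // F1 computed from precision and recall that have already been
    // averaged over all posts.
    static double f1(double precision, double recall) {
        if (precision + recall == 0.0) return 0.0;
        return 2 * precision * recall / (precision + recall);
    }
}
```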

The number of tags one can recommend is not restricted. However, we will regard the first five tags only.

The comparison of the recommended tags to the true tags of a post will be done according to the following Java expression

trueTag.replaceAll("[^0-9\\p{L}]+", "").equalsIgnoreCase(
                  recommendedTag.replaceAll("[^0-9\\p{L}]+", ""));

which means we ignore the case of tags and remove all characters that are neither numbers nor letters (see also java.util.regex.Pattern). Since we expect all files to be UTF-8 encoded, the expression above will NOT remove umlauts and other non-Latin characters! We will also apply Unicode normalization to Normalization Form KC (NFKC).
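Putting the regex and the NFKC normalization together, the comparison can be sketched like this (TagMatcher is a hypothetical helper of ours, not the official evaluator):

```java
import java.text.Normalizer;

public class TagMatcher {

    // Mirror of the challenge comparison: NFKC-normalize both tags,
    // strip every character that is neither a digit nor a Unicode
    // letter, then compare case-insensitively.
    static boolean sameTag(String trueTag, String recommendedTag) {
        return clean(trueTag).equalsIgnoreCase(clean(recommendedTag));
    }

    static String clean(String tag) {
        String nfkc = Normalizer.normalize(tag, Normalizer.Form.NFKC);
        return nfkc.replaceAll("[^0-9\\p{L}]+", "");
    }
}
```

Note that \p{L} matches any Unicode letter, so umlauts and other non-Latin letters survive the cleaning step.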
Additionally, the test set will not contain the following tags:

imported
public
system:imported
nn
system:unfiled
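Since these tags are guaranteed never to occur in the test data, it may be worth filtering them out of your recommendations before submission. A minimal sketch (TagFilter is a hypothetical helper, not part of the challenge code):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class TagFilter {

    // Tags that never appear in the test set; recommending them can
    // only hurt precision.
    static final Set<String> EXCLUDED = Set.of(
            "imported", "public", "system:imported", "nn", "system:unfiled");

    static List<String> filter(List<String> tags) {
        return tags.stream()
                .filter(t -> !EXCLUDED.contains(t))
                .collect(Collectors.toList());
    }
}
```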

Sample Evaluation Program

This JAR file contains a Java program to calculate precision, recall, and F1-measure for given result files. The program takes three arguments: the maximal number of tags to regard for recommendation (5 is the default for the challenge), tas_original — the path to the original tas file which includes the tags (this is the file you won't get, of course), and tas_result — the path to the file which contains your recommendations in the format described above. We provide a (zipped) version of the tas file from the cleaned dump training data which you can use to test the evaluator against the training data.
The output is then written to the file tas_result.eval, which looks like this:

1       0.4877296962320823      1.0     0.6556697731682538
2       0.7431330239829573      1.0     0.8526406347175292
3       0.8598491823591262      1.0     0.9246439878188943
4       0.918164381330041       1.0     0.957336493437953
5       0.9497167202873263      1.0     0.9742099561492895
For each number of regarded tags (up to 5), the file contains a line with the following columns:

  1. number of regarded tags
  2. recall
  3. precision
  4. F1-measure

Note that, due to the Unicode normalization, you need at least Java 6 to run this program. The source code is also available in a source JAR.