ECML PKDD Discovery Challenge 2008

Introduction

This year's discovery challenge presents two tasks in the young area of social bookmarking: one covers spam detection, the other tag recommendation. Since we host the social bookmark and publication sharing system BibSonomy, we are able to provide a BibSonomy dataset for the challenge. A training dataset for both tasks is provided at the beginning of the competition; the test dataset will be released 48 hours before the final deadline. Due to a very tight schedule we cannot grant any deadline extensions. The results will be presented at the ECML/PKDD workshop, where the top teams are invited to present their approaches and results.

To get started with the tasks we suggest that you make yourself familiar with BibSonomy. A more formal description of the underlying structure, called a folksonomy, is given in this paper (pdf here), which also describes the BibSonomy components. The next step is to subscribe to the rsdc08 mailing list. We will use the list to distribute news about the challenge and other important information, and it can also be used to clarify questions about the dataset and the two tasks. Since the welcome message on the list contains the information needed to access the dataset, subscribing to this list is essential for participating in the challenge. You can participate in one of the tasks or in both.

Tasks

1. Spam Detection in Social Bookmarking Systems

With the growing popularity of social bookmarking systems, spammers have discovered this kind of service as a playground for their activities. Usually they pursue two goals: on the one hand, they place links in the system to attract people to advertising sites; on the other hand, they increase the PageRank of their sites by placing links on as many popular Web 2.0 sites as possible, in order to increase their visibility in Google and other search engines. Usual counter-measures like CAPTCHAs are not efficient enough to effectively prevent misuse of the system. Over the last year, we collected data on more than 2,000 active users and more than 25,000 spammers by manually labeling spammers and non-spammers. The provided dataset consists of these users and all of their posts, including all public information such as the URL, the description, and all tags of each post. The goal of this challenge is to learn a model which predicts whether a user is a spammer or not. In order to detect spammers as early as possible, the model should already make good predictions when a user submits their first post.

Dataset description

A general description of the dataset can be found here. For the spam detection task all provided files are relevant.

Evaluation

All participants can use the training dataset to fit their model. The training dataset contains flags that identify users as spammers or non-spammers. The test dataset will have the same format as the training dataset and can be downloaded two days before the end of the competition; it will contain users from a later period. Each participant must send a sorted file containing one line per user, consisting of the user ID and a confidence value separated by a tab. The higher the confidence value, the higher the predicted probability that the user is a spammer. The highest confidence values should come first.

			user	spam
			1234	1
			1235	0.987
			1236	0.765
			1239	0

If no prediction is provided for a user, we assume the user is not a spammer. The evaluation criterion is the AUC (area under the ROC curve). We compare the submitted spammer predictions of the participants with the manually assigned labels on a per-user basis.
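To illustrate the required format, here is a minimal sketch (the class and method names are ours, not part of the challenge tooling) that sorts predictions by descending confidence and renders the tab-separated lines:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch: render spam predictions in the required
// submission format - one "user<TAB>confidence" line per user,
// sorted so that the highest confidence comes first.
public class SpamSubmission {
    public static List<String> format(Map<Integer, Double> confidenceByUser) {
        return confidenceByUser.entrySet().stream()
                .sorted(Map.Entry.<Integer, Double>comparingByValue().reversed())
                .map(e -> e.getKey() + "\t" + e.getValue())
                .collect(Collectors.toList());
    }
}
```

Writing the returned lines to a file, one per line, yields a submission in the format shown above.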

Script to calculate AUC

This script (updated 2008-07-25) provides an example of how we will calculate the AUC value. As input it needs two files: one with the user_id and the true class, and one sorted file with the user_id and the confidence value.
With the test files true containing the true classes and pred containing the confidences, the script should output an AUC value of approximately 0.5417.
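For reference, the AUC can be computed from the ranked confidence values alone. The following sketch is our own code, not the official script, and for simplicity it does not handle ties in the confidence values (which the official evaluation may treat differently):

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative AUC sketch using the rank-sum (Mann-Whitney U)
// formulation: AUC = (R_pos - pos*(pos+1)/2) / (pos * neg), where
// R_pos is the sum of the ranks of the positive (spammer) examples
// when all examples are sorted by ascending confidence.
public class AucSketch {
    // labels[i] is 1 for spammer, 0 for non-spammer;
    // conf[i] is the submitted confidence for the same user.
    public static double auc(int[] labels, double[] conf) {
        int n = labels.length;
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> conf[i]));
        long pos = 0, neg = 0, rankSumPos = 0;
        for (int rank = 1; rank <= n; rank++) {
            if (labels[idx[rank - 1]] == 1) { pos++; rankSumPos += rank; }
            else neg++;
        }
        return (rankSumPos - pos * (pos + 1) / 2.0) / (pos * neg);
    }
}
```

A perfect ranking (all spammers above all non-spammers) gives 1.0; a random ranking tends toward 0.5.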

2. Tag Recommendation in Social Bookmarking Systems

To support users during the tagging process and to facilitate tagging, BibSonomy includes a tag recommender: when a user finds an interesting web page (or publication) and posts it to BibSonomy, the system offers up to ten recommended tags on the posting page. Have a look at: Post in BibSonomy (you need a BibSonomy account to test it ;-)). The goal is to learn a model which effectively predicts the tags a user will use to describe a web page (or publication).

Dataset description

A general description of the dataset can be found here. For the tag recommendation task only the tas, bookmark, and bibtex files are relevant.

Evaluation

For this task, only the non-spammer part of the dataset should be used to fit a model. The test dataset will consist of a bibtex, a bookmark, and a tas file (the tas file, however, does not contain tags), as these files contain all information about posts entered into the system. We will release this dataset 48 hours before the end of the competition. We expect from every participant a file containing one line per prediction, with the content_id of the post (bibtex or bookmark) followed by the list of recommended tags; the tags are space-separated, and the two columns content_id and tags are separated by a tab. We consider only the first ten tags. Here is an example of the expected format:

 
			content_id	tags
			123456778	hello world

The F-measure will be the evaluation criterion. We compute the F-measure on a post basis by comparing the recommended tags with the tags the user originally assigned to the post, averaging over all posts of the test dataset.
The comparison of a recommended tag to a true tag of a post will be done according to the following Java expression:

trueTag.replaceAll("[^0-9\\p{L}]+", "").equalsIgnoreCase(
                  recommendedTag.replaceAll("[^0-9\\p{L}]+", ""))
which means we ignore the case of tags and remove all characters which are neither numbers nor letters (see also java.util.regex.Pattern). Since we expect all files to be UTF-8 encoded, the above expression will NOT remove umlauts and other non-Latin characters! We will also apply Unicode normalization to normal form KC (NFKC).
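Putting these pieces together, the comparison including the NFKC normalization could look like the following sketch (the class and method names are ours; the order of normalization and stripping is our assumption):

```java
import java.text.Normalizer;

// Sketch of the tag comparison described above: NFKC-normalize,
// strip everything that is neither a digit nor a Unicode letter,
// then compare ignoring case. Umlauts and other non-Latin letters
// survive the stripping because \p{L} matches any Unicode letter.
public class TagMatcher {
    static String canonical(String tag) {
        return Normalizer.normalize(tag, Normalizer.Form.NFKC)
                .replaceAll("[^0-9\\p{L}]+", "");
    }

    public static boolean matches(String trueTag, String recommendedTag) {
        return canonical(trueTag).equalsIgnoreCase(canonical(recommendedTag));
    }
}
```

Under this scheme, for example, "Web-2.0" and "web20" count as the same tag, while "über" and "uber" do not.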
Additionally, the test set will not contain the following tags:
  • imported
  • public
  • system:imported
  • nn
  • system:unfiled

Program to calculate the F1-Measure

This JAR file contains a Java program to calculate the precision, recall, and F1-measure for given result files. The program is used as follows:

usage:
  java -jar kddchallenge2008-0.0.1.jar\
      maxNoOfTags tas_original resultFile1 [resultFile2...resultFileN]
  The output will be written to resultFile*.eval.
where maxNoOfTags is the maximal number of tags to consider for recommendation (10 in this challenge), tas_original is the path to the original tas file which includes the tags (this is the file you won't get, of course), and the remaining arguments are paths to files which contain the recommendations in the format described above.
Each output file contains, for each number of tags up to maxNoOfTags, a line with the following columns:

  1. number of regarded tags
  2. recall
  3. precision
  4. f1-measure

Note that, due to the Unicode normalization, you need Java 6 to run this program. The source code is also available as a source JAR.
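The per-post computation behind those columns can be sketched as follows (our own simplified code, not the evaluation program itself; tag normalization is omitted here):

```java
import java.util.Set;

// Sketch of the post-wise evaluation: precision, recall, and F1 of
// the recommended tags against the user's true tags for one post.
// The challenge averages these values over all posts of the test set.
public class PostF1 {
    public static double[] precisionRecallF1(Set<String> trueTags, Set<String> recommended) {
        long hits = recommended.stream().filter(trueTags::contains).count();
        double p = recommended.isEmpty() ? 0.0 : (double) hits / recommended.size();
        double r = trueTags.isEmpty() ? 0.0 : (double) hits / trueTags.size();
        double f1 = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[]{p, r, f1};
    }
}
```

For instance, recommending four tags of which two are among the user's three true tags gives precision 1/2, recall 2/3, and F1 = 4/7.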

Organization

Important Dates

May 5, 2008         Tasks and datasets available online.
July 30, 2008       Test dataset released (by midnight CEST).
August 1, 2008      Result submission deadline (by midnight CEST).
August 4, 2008      Workshop paper submission deadline.
August 7, 2008      Notification of winners, publication of results on the webpage, notification of acceptance.
August 11, 2008     Workshop proceedings (camera-ready) deadline.
September 15, 2008  ECML/PKDD 2008 Discovery Challenge Workshop.

We are pleased to announce that a discovery challenge will be organized in conjunction with the Web 2.0 Mining workshop; the joint workshop and challenge will be on September 15th.

Dataset

To access the challenge dataset please subscribe to the rsdc08 mailing list. The welcome message will contain all information to access the dataset (Dataset description here).

Test Datasets

The test datasets for the challenge are now online. The corresponding true classifications:
  • tas_original.gz - contains the original tas file with all the tag assignments as entered by the users.
  • user_spam.gz - contains the classification for each user of the spam dataset.

Results

More than 150 participants registered on the mailing list and thus had access to the dataset. We received 18 result submissions: 13 for the spam detection task and 5 for the tag recommendation task. 13 participants additionally submitted a paper, 11 of which were accepted. We computed the AUC and F1-measure values with the programs described above. Below you can find the results, including the team name for the three best teams of each task.

Spam Detection Task
Submission ID   AUC       Team
39014           0.97961   A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems by A. Gkanogiannis and T. Kalamboukis
83234           0.97032   Rank for spam detection - ECML Discovery Challenge by P. Gramme and J.-F. Chevalier
15076           0.93899   Naive Bayes Classifier Learning with Feature Selection for Spam Detection in Social Bookmarking by C. Kim and K.-B. Hwang
97510           0.9364
44293           0.93259
55409           0.91365
69806           0.88366
75540           0.87847
28752           0.84684
21710           0.84684
85695           0.70553
70358           0.47069
56347           0.35898
Tag Recommendation Task
Submission ID   F1-Measure   Team
72209           0.19325      RSDC'08: Tag Recommendations using Bookmark Content by M. Tatu, M. Srikanth and T. D'Silva
89760           0.18674      Tag Recommendation for Folksonomies Oriented towards Individual Users by M. Lipczak
27845           0.0284       Multilabel Text Classification for Automated Tag Suggestion by I. Katakis, G. Tsoumakas and I. Vlahavas
27876           0.02203
68481           0.01406

Submission instructions

To submit your result files, use our submission form.

Papers must be submitted to the EasyChair submission system in PDF format. Although not required for the initial submission, we recommend following the ECML/PKDD format guidelines (Springer LNCS -- LaTeX style file), as this will be the required format for accepted papers.

The workshop proceedings will be distributed during the workshop. We plan a post-workshop publication of selected papers in the Springer Lecture Notes series.

Workshop Chairs

To contact us please send a mail to rsdc08-info@cs.uni-kassel.de.

The discovery challenge is supported by the European Project Tagora - Semiotic Dynamics in Online Social Communities.