ECML PKDD Discovery Challenge 2008

Introduction

This year's discovery challenge presents two tasks in the young area of social bookmarking: one covers spam detection, the other tag recommendation. Since we host the social bookmark and publication sharing system BibSonomy, we are able to provide a BibSonomy dataset for the challenge. A training dataset for both tasks is provided at the beginning of the competition; the test dataset will be released 48 hours before the final deadline. Due to a very tight schedule we cannot grant any deadline extensions. The results will be presented at the ECML/PKDD workshop, where the top teams are invited to present their approaches and results.

To get started with the tasks we suggest that you make yourself familiar with BibSonomy. A more formal description of the underlying structure, called a folksonomy, is given in this paper (pdf here), which also describes the BibSonomy components. The next step is to subscribe to the rsdc08 mailing list. We will use the list to distribute news about the challenge and other important information, and it can also be used to clarify questions about the dataset and the two tasks. Since the welcome message on the list contains the information needed to access the dataset, subscribing to this list is essential for participating in the challenge. You can participate in one of the tasks or in both.

Tasks

1. Spam Detection in Social Bookmarking Systems

With the growing popularity of social bookmarking systems, spammers have discovered this kind of service as a playground for their activities. Usually they pursue two goals: on the one hand, they place links in the system to attract people to advertising sites; on the other hand, they increase the PageRank of their sites by placing links on as many popular Web 2.0 sites as possible, in order to increase their visibility in Google and other search engines. Usual counter-measures like CAPTCHAs are not efficient enough to effectively prevent misuse of the system. Over the last year, we collected data on more than 2,000 active users and more than 25,000 spammers by manually labeling spammers and non-spammers. The provided dataset consists of these users and all of their posts, including all public information such as the URL, the description, and all tags of each post. The goal of this challenge is to learn a model which predicts whether a user is a spammer or not. In order to detect spammers as early as possible, the model should already make good predictions when a user submits their first post.

Dataset description

A general description of the dataset can be found here. For the spam detection task all provided files are relevant.

Evaluation

All participants can use the training dataset to fit their model. The training dataset contains flags that identify users as spammers or non-spammers. The test dataset will have the same format as the training dataset and can be downloaded two days before the end of the competition; it will contain users from a later period. Each participant must send a sorted file containing one line per user, consisting of the user ID and a confidence value separated by a tab. The higher the confidence value, the higher the predicted probability that the user is a spammer. The highest confidence values should come first.

			user	spam
			1234	1
			1235	0.987
			1236	0.765
			1239	0

If no prediction is provided for a user, we assume the user is not a spammer. The evaluation criterion is the AUC (area under the ROC curve). We compare the submitted spammer predictions of the participants with the manually assigned labels on a per-user basis.
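To illustrate the required format, here is a minimal sketch (the class and method names are ours, not part of the challenge tooling) that sorts predictions by descending confidence and renders the tab-separated lines:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch: render spam predictions in the required
// submission format - one "user<TAB>confidence" line per user,
// sorted so that the highest confidence comes first.
public class SpamSubmission {
    public static List<String> format(Map<Integer, Double> confidenceByUser) {
        return confidenceByUser.entrySet().stream()
                .sorted(Map.Entry.<Integer, Double>comparingByValue().reversed())
                .map(e -> e.getKey() + "\t" + e.getValue())
                .collect(Collectors.toList());
    }
}
```

Writing the returned lines to a file, one per line, yields a submission in the format shown above.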

Script to calculate AUC

This script (updated 2008-07-25) provides an example of how we will calculate the AUC value. As input it needs two files: one with the user_id and the true class, and one sorted file with the user_id and the confidence value.
With the test files true containing the true classes and pred containing the confidences, the script should output an AUC value of approximately 0.5417.
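For reference, the AUC can be computed from the ranked confidence values alone. The following sketch is our own code, not the official script, and for simplicity it does not handle ties in the confidence values (which the official evaluation may treat differently):

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative AUC sketch using the rank-sum (Mann-Whitney U)
// formulation: AUC = (R_pos - pos*(pos+1)/2) / (pos * neg), where
// R_pos is the sum of the ranks of the positive (spammer) examples
// when all examples are sorted by ascending confidence.
public class AucSketch {
    // labels[i] is 1 for spammer, 0 for non-spammer;
    // conf[i] is the submitted confidence for the same user.
    public static double auc(int[] labels, double[] conf) {
        int n = labels.length;
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> conf[i]));
        long pos = 0, neg = 0, rankSumPos = 0;
        for (int rank = 1; rank <= n; rank++) {
            if (labels[idx[rank - 1]] == 1) { pos++; rankSumPos += rank; }
            else neg++;
        }
        return (rankSumPos - pos * (pos + 1) / 2.0) / (pos * neg);
    }
}
```

A perfect ranking (all spammers above all non-spammers) gives 1.0; a random ranking tends toward 0.5.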

2. Tag Recommendation in Social Bookmarking Systems

To support users during the tagging process and to facilitate tagging, BibSonomy includes a tag recommender: when a user finds an interesting web page (or publication) and posts it to BibSonomy, the system offers up to ten recommended tags on the posting page. Have a look at: Post in BibSonomy (you need a BibSonomy account to test it ;-)). The goal is to learn a model which effectively predicts the tags a user will use to describe a web page (or publication).

Dataset description

A general description of the dataset can be found here. For the tag recommendation task only the tas, bookmark, and bibtex files are relevant.

Evaluation

For this task, only the non-spammer part of the dataset should be used to fit a model. The test dataset will consist of a bibtex, a bookmark, and a tas file (the tas file, however, does not contain tags), as these files contain all information about posts entered into the system. We will release this dataset 48 hours before the end of the competition. We expect from every participant a file containing one line per prediction, with the content_id of the post (bibtex or bookmark) followed by the list of recommended tags; the tags are space-separated, and the two columns content_id and tags are separated by a tab. We consider only the first ten tags. Here is an example of the expected format:

 
			content_id	tags
			123456778	hello world

The F-measure will be the evaluation criterion. We compute the F-measure on a post basis by comparing the recommended tags with the tags the user originally assigned to the post, averaging over all posts of the test dataset.
The comparison of a recommended tag to a true tag of a post will be done according to the following Java expression:

trueTag.replaceAll("[^0-9\\p{L}]+", "").equalsIgnoreCase(
                  recommendedTag.replaceAll("[^0-9\\p{L}]+", ""))
which means we ignore the case of tags and remove all characters which are neither numbers nor letters (see also java.util.regex.Pattern). Since we expect all files to be UTF-8 encoded, the above expression will NOT remove umlauts and other non-Latin characters! We will also apply Unicode normalization to normal form KC (NFKC).
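Putting these pieces together, the comparison including the NFKC normalization could look like the following sketch (the class and method names are ours; the order of normalization and stripping is our assumption):

```java
import java.text.Normalizer;

// Sketch of the tag comparison described above: NFKC-normalize,
// strip everything that is neither a digit nor a Unicode letter,
// then compare ignoring case. Umlauts and other non-Latin letters
// survive the stripping because \p{L} matches any Unicode letter.
public class TagMatcher {
    static String canonical(String tag) {
        return Normalizer.normalize(tag, Normalizer.Form.NFKC)
                .replaceAll("[^0-9\\p{L}]+", "");
    }

    public static boolean matches(String trueTag, String recommendedTag) {
        return canonical(trueTag).equalsIgnoreCase(canonical(recommendedTag));
    }
}
```

Under this scheme, for example, "Web-2.0" and "web20" count as the same tag, while "über" and "uber" do not.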
Additionally, the test set will not contain the following tags:
  • imported
  • public
  • system:imported
  • nn
  • system:unfiled

Program to calculate the F1-Measure

This JAR file contains a Java program to calculate the precision, recall, and F1-measure for given result files. The program is used as follows:

usage:
  java -jar kddchallenge2008-0.0.1.jar\
      maxNoOfTags tas_original resultFile1 [resultFile2...resultFileN]
  The output will be written to resultFile*.eval.
where maxNoOfTags is the maximal number of tags to consider for recommendation (10 in this challenge), tas_original is the path to the original tas file which includes the tags (this is the file you won't get, of course), and the remaining arguments are paths to files which contain the recommendations in the format described above.
Each output file contains, for each number of tags up to maxNoOfTags, a line with the following columns:

  1. number of regarded tags
  2. recall
  3. precision
  4. f1-measure

Note that, due to the Unicode normalization, you need Java 6 to run this program. The source code is also available as a source JAR.
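The per-post computation behind those columns can be sketched as follows (our own simplified code, not the evaluation program itself; tag normalization is omitted here):

```java
import java.util.Set;

// Sketch of the post-wise evaluation: precision, recall, and F1 of
// the recommended tags against the user's true tags for one post.
// The challenge averages these values over all posts of the test set.
public class PostF1 {
    public static double[] precisionRecallF1(Set<String> trueTags, Set<String> recommended) {
        long hits = recommended.stream().filter(trueTags::contains).count();
        double p = recommended.isEmpty() ? 0.0 : (double) hits / recommended.size();
        double r = trueTags.isEmpty() ? 0.0 : (double) hits / trueTags.size();
        double f1 = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[]{p, r, f1};
    }
}
```

For instance, recommending four tags of which two are among the user's three true tags gives precision 1/2, recall 2/3, and F1 = 4/7.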

Organization

Important Dates

May 5, 2008         Tasks and datasets available online.
July 30, 2008       Test dataset released (by midnight CEST).
August 1, 2008      Result submission deadline (by midnight CEST).
August 4, 2008      Workshop paper submission deadline.
August 7, 2008      Notification of winners, publication of results on the webpage, notification of acceptance.
August 11, 2008     Workshop proceedings (camera-ready) deadline.
September 15, 2008  ECML/PKDD 2008 Discovery Challenge Workshop.

We are pleased to announce that a discovery challenge will be organized in conjunction with the Web 2.0 Mining workshop; the joint workshop and challenge will be on September 15th.

Dataset

To access the challenge dataset please subscribe to the rsdc08 mailing list. The welcome message will contain all information to access the dataset (Dataset description here).

Test Datasets

The test datasets for the challenge are now online. The corresponding true classifications:
  • tas_original.gz - contains the original tas file with all the tag assignments as entered by the users.
  • user_spam.gz - contains the classification for each user of the spam dataset.

Results

More than 150 participants registered on the mailing list and thus had access to the dataset. We received 18 result submissions: 13 for the spam detection task and 5 for the tag recommendation task. 13 participants additionally submitted a paper, 11 of which were accepted. We computed the AUC and F1-measure values with the programs described above. Below you can find the results, including the team name for the three best teams of each task.

Spam Detection Task
Submission ID   AUC       Team
39014           0.97961   A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems by A. Gkanogiannis and T. Kalamboukis
83234           0.97032   Rank for spam detection - ECML Discovery Challenge by P. Gramme and J.-F. Chevalier
15076           0.93899   Naive Bayes Classifier Learning with Feature Selection for Spam Detection in Social Bookmarking by C. Kim and K.-B. Hwang
97510           0.9364
44293           0.93259
55409           0.91365
69806           0.88366
75540           0.87847
28752           0.84684
21710           0.84684
85695           0.70553
70358           0.47069
56347           0.35898
Tag Recommendation Task
Submission ID   F1-Measure   Team
72209           0.19325      RSDC'08: Tag Recommendations using Bookmark Content by M. Tatu, M. Srikanth and T. D'Silva
89760           0.18674      Tag Recommendation for Folksonomies Oriented towards Individual Users by M. Lipczak
27845           0.0284       Multilabel Text Classification for Automated Tag Suggestion by I. Katakis, G. Tsoumakas and I. Vlahavas
27876           0.02203
68481           0.01406

Submission instructions

To submit your result files, use our submission form.

Papers must be submitted to the EasyChair submission system in PDF format. Although not required for the initial submission, we recommend following the ECML/PKDD format guidelines (Springer LNCS -- LaTeX style file), as this will be the required format for accepted papers.

The workshop proceedings will be distributed during the workshop. We plan a post-workshop publication of selected papers in the Springer Lecture Notes series.

Workshop Chairs

To contact us please send a mail to rsdc08-info@cs.uni-kassel.de.

The discovery challenge is supported by the European Project Tagora - Semiotic Dynamics in Online Social Communities.