Dataset

To download the dataset, subscribe to the dc09 mailing list, follow the instructions on how to get a BibSonomy dump, and download the challenge datasets from there (the links here worked only for participants during the challenge). We will use the list to distribute news about the challenge, and you can use it to clarify questions about the dataset and the different tasks. The welcome message of the list contains information about how to access the dataset.

For the different tasks of this year's Discovery Challenge we provide two different training datasets (2009-04-08): the cleaned dump and the post-core at level 2, both described below.

We released updated datasets on April 8th, since the old ones were not properly cleaned.

Files

Each dataset consists of three table files: tas, bookmark, and bibtex (each described in detail below).

Both datasets can be loaded into a MySQL database. The CREATE TABLE statements for the corresponding tables (each file corresponds to one table) can be found in the file tables.sql. The files are tab-separated: each line represents a row, and the fields within a row are delimited by tab characters. Please note that the fields themselves can contain line breaks, which are escaped by MySQL. The easiest way to load the data into a MySQL database is the LOAD DATA statement. The tables.sql script already contains LOAD DATA statements in which you only need to adapt the path to the extracted table files.
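As an illustration, a statement along the lines of those in tables.sql could look as follows (a sketch only; the exact options are defined in tables.sql, and the path /tmp/tas is just an assumed location of the extracted file):

-- load the tas file into the tas table (tab-separated, UTF-8)
LOAD DATA INFILE '/tmp/tas' INTO TABLE tas
  CHARACTER SET utf8
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n';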

The fields of each row in the table files correspond to the following columns:

tas

Tag ASsignments: Fact table; who attached which tag to which resource/content

  1. user (number; user names are anonymized)
  2. tag
  3. content_id (matches bookmark.content_id or bibtex.content_id)
  4. content_type (1 = bookmark, 2 = bibtex)
  5. date

bookmark

Dimension table for bookmark data

  1. content_id (matches tas.content_id)
  2. url_hash (the URL as md5 hash)
  3. url
  4. description
  5. extended description
  6. date

bibtex

Dimension table for BibTeX data

  1. content_id (matches tas.content_id)
  2. journal
  3. volume
  4. chapter
  5. edition
  6. month
  7. day
  8. booktitle
  9. howPublished
  10. institution
  11. organization
  12. publisher
  13. address
  14. school
  15. series
  16. bibtexKey (the bibtex key (in the @... line))
  17. url
  18. type
  19. description
  20. annote
  21. note
  22. pages
  23. bKey (the "key" field)
  24. number
  25. crossref
  26. misc
  27. bibtexAbstract
  28. simhash0 (hash for duplicate detection within a user -- strict -- (obsolete))
  29. simhash1 (hash for duplicate detection among users -- sloppy --)
  30. simhash2 (hash for duplicate detection within a user -- strict --)
  31. entrytype
  32. title
  33. author
  34. editor
  35. year

Dataset Description

Remarks regarding the identity of resources

Two bookmarks or two BibTeX references are regarded as equal when their inter hashes are equal. For bookmarks, the inter hash is contained in the url_hash column of the bookmark file and is simply the MD5 hash of the URL. For BibTeX references, the inter hash is contained in the simhash1 column and is computed from the bibliographic metadata (resources with the same title, author, and year share the same simhash1; see the FAQ below). Further information, examples, and an online form to compute the BibTeX hashes can be found here.

Please note: the content_id columns are used to match posts of users, i.e., each post of a user consists of a resource and all the tags the user assigned to that resource. A post is identified by its content_id in the three tables tas, bookmark, and bibtex. Since each post contains exactly one resource, content_ids are unique in the bookmark and bibtex tables. Furthermore, each post (and thus each content_id) belongs to exactly one user.
If you want to find overlap between resources, you have to use the inter hashes mentioned above.
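For example, to find URLs that occur in more than one bookmark post (and thus provide overlap between users), you can group the bookmark table by its inter hash (a sketch using the tables described above):

-- URLs that were bookmarked in more than one post
SELECT url_hash, COUNT(*) AS posts
FROM bookmark
GROUP BY url_hash
HAVING COUNT(*) > 1;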

Cleaned Dump

The dump contains all public bookmark and publication posts of BibSonomy until (but not including) 2009-01-01. Posts from the user dblp (a mirror of the DBLP Computer Science Bibliography) as well as all posts from users who have been flagged as spammers have been excluded.

Tag Cleansing

Furthermore, we cleaned the tags according to the Java method

// requires java.text.Normalizer
public static String cleanTag(final String tag) {
   // lower-case the tag, strip all characters that are neither digits nor
   // letters (umlauts etc. are kept), then apply Unicode normal form KC
   return Normalizer.normalize(tag.toLowerCase()
      .replaceAll("[^0-9\\p{L}]+", ""), Normalizer.Form.NFKC);
}

and removed those tags which were empty after cleansing or matched one of the tags imported, public, systemimported, nn, systemunfiled. The cleanTag method effectively removes from tags all characters which are neither digits nor letters (see also java.util.regex.Pattern). Since we expect all result files to be UTF-8 encoded, the method does NOT remove umlauts and other non-Latin characters! We also apply Unicode normalization to normal form KC.
Please note that the removal of tags also caused some posts, resources, and users to disappear from the dump.
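As a quick sanity check (just a suggestion, not part of the official setup), the removed system tags should no longer occur in the tas table:

-- should return 0
SELECT COUNT(*) FROM tas
WHERE tag IN ('imported', 'public', 'systemimported', 'nn', 'systemunfiled');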

Statistics

Some SQL commands and their output follow. You can repeat the commands to roughly check the validity of your data.

statement | count | info
SELECT COUNT(*) FROM tas; | 1,401,104 | #tag assignments
SELECT COUNT(*) FROM bookmark; | 263,004 | #bookmark posts
SELECT COUNT(*) FROM bibtex; | 158,924 | #BibTeX posts
SELECT COUNT(*) FROM (SELECT * FROM tas GROUP BY user) AS u; | 3,617 | #users
SELECT COUNT(*) FROM (SELECT * FROM bookmark GROUP BY url_hash) AS u; | 235,328 | #URLs
SELECT COUNT(*) FROM (SELECT * FROM bibtex GROUP BY simhash1) AS b; | 143,050 | #BibTeXs
SELECT COUNT(*) FROM (SELECT * FROM tas GROUP BY tag) AS t; | 93,756 | #tags

Post-Core

For the post-core at level 2 we used the cleaned dump described above and removed all users, tags, and resources which appear in only one post. We iterated this process until convergence and obtained a core in which each user, tag, and resource occurs in at least two posts. For more information regarding post-cores, have a look at the paper Tag Recommendations in Folksonomies or the paper Generalized Cores by V. Batagelj and M. Zaversnik.
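To illustrate the idea, one pruning iteration over the tas table could look roughly like this (a sketch only: it assumes that a post is identified by its content_id, resources would have to be pruned analogously via their inter hashes, and the whole procedure is repeated until no rows are deleted anymore; the nested subquery is a workaround for MySQL's restriction on selecting from the table that is being deleted from):

-- remove tag assignments whose tag occurs in only one post
DELETE FROM tas WHERE tag IN (
  SELECT tag FROM (
    SELECT tag FROM tas GROUP BY tag HAVING COUNT(DISTINCT content_id) < 2
  ) AS rare_tags);

-- remove tag assignments whose user has only one post
DELETE FROM tas WHERE user IN (
  SELECT user FROM (
    SELECT user FROM tas GROUP BY user HAVING COUNT(DISTINCT content_id) < 2
  ) AS rare_users);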

Statistics

Some SQL commands and their output follow. You can repeat the commands to roughly check the validity of your data.

statement | count | info
SELECT COUNT(*) FROM tas; | 253,615 | #tag assignments
SELECT COUNT(*) FROM bookmark; | 41,268 | #bookmark posts
SELECT COUNT(*) FROM bibtex; | 22,852 | #BibTeX posts
SELECT COUNT(*) FROM (SELECT * FROM tas GROUP BY user) AS u; | 1,185 | #users
SELECT COUNT(*) FROM (SELECT * FROM bookmark GROUP BY url_hash) AS u; | 14,443 | #URLs
SELECT COUNT(*) FROM (SELECT * FROM bibtex GROUP BY simhash1) AS b; | 7,946 | #BibTeXs
SELECT COUNT(*) FROM (SELECT * FROM tas GROUP BY tag) AS t; | 13,276 | #tags

FAQ

How can I import the datasets into my MySQL database?

You can use the provided SQL script:

mysql -u <username> -p -D <databasename> < tables.sql

This script assumes that the corresponding data files are located in the '/tmp' directory and are readable by everyone, and that the MySQL user has the appropriate privileges.

When I try to import the training data using the script 'tables.sql', I get a 'Permission denied' error.

Ensure that '/tmp/bibtex', '/tmp/bookmark', and '/tmp/tas' are readable by everyone and that the MySQL user has the FILE privilege:

GRANT FILE ON *.* TO '<username>'@'localhost' IDENTIFIED BY '<password>';

(FILE is a global privilege, so it has to be granted ON *.* rather than on a single database.)

How can I get all information for a given post?

Assume you want to get all information for the post with content_id 42. First, get the user, all tags, content_type and date from the tas table:

SELECT * from tas where content_id='42';

Now, depending on the post's content_type, get further details from the bibtex or the bookmark table. In our case (content_type = 1, i.e., a bookmark):

SELECT * from bookmark where content_id='42';
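Equivalently, you can join the two tables on content_id in a single query (just a convenience; the result contains the same information as the two queries above):

SELECT t.user, t.tag, t.date, b.url, b.description
FROM tas AS t JOIN bookmark AS b ON t.content_id = b.content_id
WHERE t.content_id = '42';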

If I participate in the content-based task, can I also use a graph-based method?

You can use whichever method you prefer for all tasks. It is just that the core dataset might be more suitable for graph-based methods than the plain dataset. This will also hold for the test data.

I have problems loading the data into a MySQL database. I get errors like 'ERROR 1406 (22001): Data too long for column 'annote' at row 43542'

Ensure that the character set of your database, tables, and connection is UTF-8. We recently modified the tables.sql script to use UTF-8 wherever possible. However, it might be necessary to adjust your database server configuration as well.
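For example, an existing database and its tables can be switched to UTF-8 with statements along these lines (a sketch; <databasename> is a placeholder as above, and depending on your setup you may additionally have to set character-set-server in the server configuration):

ALTER DATABASE <databasename> CHARACTER SET utf8 COLLATE utf8_general_ci;
ALTER TABLE tas CONVERT TO CHARACTER SET utf8;
ALTER TABLE bookmark CONVERT TO CHARACTER SET utf8;
ALTER TABLE bibtex CONVERT TO CHARACTER SET utf8;
SET NAMES utf8;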

Is each content_id in the test data associated with at most one user_id?

Yes! The content_ids represent posts, and each post belongs to exactly one user. The content_ids do not represent resources; that is done by the hashes (url_hash for bookmarks, simhash[0-2] for publication references). So if you need some overlap between posts (i.e., to find posts with the same resource), use the hashes. For publication references there are two relevant hashes: simhash2 (the intra hash), which is unique per user (i.e., each user has at most one post with a given simhash2) and rather strict (changing the journal name changes the hash); and simhash1 (the inter hash), which is rather sloppy and provides overlap between resources (resources with the same title, author, and year have the same simhash1).
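For example, to find publication resources that were posted by more than one user, group the bibtex posts by their inter hash (a sketch; tas is joined only to recover the posting user of each content_id):

SELECT b.simhash1, COUNT(DISTINCT t.user) AS users
FROM bibtex AS b JOIN tas AS t ON t.content_id = b.content_id
GROUP BY b.simhash1
HAVING COUNT(DISTINCT t.user) > 1;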