Dataset

To download the dataset, subscribe to the dc09 mailing list, follow the instructions on how to get a BibSonomy dump, and download the challenge datasets from there (the links here worked only for participants during the challenge). We will use the list to distribute news about the challenge, and you can use it to clarify questions about the dataset and the different tasks. The welcome message of the list contains information about how to access the dataset.

For the different tasks of this year's Discovery Challenge we provide two different training datasets (2009-04-08): the cleaned dump and the post-core at level 2, both described below.

We released updated datasets on April 8th, since the old ones were not properly cleaned.

Files

Each dataset consists of three table files: tas, bookmark, and bibtex (each described in detail below).

Both datasets can be loaded into a MySQL database. The CREATE TABLE statements for the corresponding tables (each file corresponds to one table) can be found in the file tables.sql. The files are tab-separated: each line represents a row, and the fields within a row are delimited by tab characters. Please note that the fields themselves can contain line breaks, which are escaped by MySQL. The easiest way to load the data into a MySQL database is the LOAD DATA statement. The tables.sql script already contains LOAD DATA statements in which you only need to adapt the path to the extracted table files.
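As an illustration, a statement along the lines of those in tables.sql could look as follows (a sketch only; the exact options are defined in tables.sql, and the path /tmp/tas is just an assumed location of the extracted file):

-- load the tas file into the tas table (tab-separated, UTF-8)
LOAD DATA INFILE '/tmp/tas' INTO TABLE tas
  CHARACTER SET utf8
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n';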

The fields of each row in the table files correspond to the following columns:

tas

Tag ASsignments: Fact table; who attached which tag to which resource/content

  1. user (number; user names are anonymized)
  2. tag
  3. content_id (matches bookmark.content_id or bibtex.content_id)
  4. content_type (1 = bookmark, 2 = bibtex)
  5. date

bookmark

Dimension table for bookmark data

  1. content_id (matches tas.content_id)
  2. url_hash (the URL as md5 hash)
  3. url
  4. description
  5. extended description
  6. date

bibtex

Dimension table for BibTeX data

  1. content_id (matches tas.content_id)
  2. journal
  3. volume
  4. chapter
  5. edition
  6. month
  7. day
  8. booktitle
  9. howPublished
  10. institution
  11. organization
  12. publisher
  13. address
  14. school
  15. series
  16. bibtexKey (the bibtex key (in the @... line))
  17. url
  18. type
  19. description
  20. annote
  21. note
  22. pages
  23. bKey (the "key" field)
  24. number
  25. crossref
  26. misc
  27. bibtexAbstract
  28. simhash0 (hash for duplicate detection within a user -- strict -- (obsolete))
  29. simhash1 (hash for duplicate detection among users -- sloppy --)
  30. simhash2 (hash for duplicate detection within a user -- strict --)
  31. entrytype
  32. title
  33. author
  34. editor
  35. year

Dataset Description

Remarks regarding the identity of resources

Two bookmarks or two BibTeX references are regarded as equal when their inter hashes are equal. For bookmarks, the inter hash is contained in the url_hash column of the bookmark file and is simply the MD5 hash of the URL. For BibTeX references, the inter hash is contained in the simhash1 column and is computed from the bibliographic metadata (resources with the same title, author, and year share the same simhash1; see the FAQ below). Further information, examples, and an online form to compute the BibTeX hashes can be found here.

Please note: the content_id columns are used to match posts of users, i.e., each post of a user consists of a resource and all the tags the user assigned to that resource. A post is identified by its content_id in the three tables tas, bookmark, and bibtex. Since each post contains exactly one resource, content_ids are unique in the bookmark and bibtex tables. Furthermore, each post (and thus each content_id) belongs to exactly one user.
If you want to find overlap between resources, you have to use the inter hashes mentioned above.
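For example, to find URLs that occur in more than one bookmark post (and thus provide overlap between users), you can group the bookmark table by its inter hash (a sketch using the tables described above):

-- URLs that were bookmarked in more than one post
SELECT url_hash, COUNT(*) AS posts
FROM bookmark
GROUP BY url_hash
HAVING COUNT(*) > 1;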

Cleaned Dump

The dump contains all public bookmark and publication posts of BibSonomy until (but not including) 2009-01-01. Posts from the user dblp (a mirror of the DBLP Computer Science Bibliography) as well as all posts from users who have been flagged as spammers have been excluded.

Tag Cleansing

Furthermore, we cleaned the tags according to the Java method

// requires java.text.Normalizer
public static String cleanTag(final String tag) {
   // lower-case the tag, strip all characters that are neither digits nor
   // letters (umlauts etc. are kept), then apply Unicode normal form KC
   return Normalizer.normalize(tag.toLowerCase()
      .replaceAll("[^0-9\\p{L}]+", ""), Normalizer.Form.NFKC);
}

and removed those tags which were empty after cleansing or matched one of the tags imported, public, systemimported, nn, systemunfiled. The cleanTag method effectively removes from tags all characters which are neither digits nor letters (see also java.util.regex.Pattern). Since we expect all result files to be UTF-8 encoded, the method does NOT remove umlauts and other non-Latin characters! We also apply Unicode normalization to normal form KC.
Please note that the removal of tags also caused some posts, resources, and users to disappear from the dump.
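As a quick sanity check (just a suggestion, not part of the official setup), the removed system tags should no longer occur in the tas table:

-- should return 0
SELECT COUNT(*) FROM tas
WHERE tag IN ('imported', 'public', 'systemimported', 'nn', 'systemunfiled');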

Statistics

Some SQL commands and their output follow. You can repeat the commands to roughly check the validity of your data.

statement | count | info
SELECT COUNT(*) FROM tas; | 1,401,104 | #tag assignments
SELECT COUNT(*) FROM bookmark; | 263,004 | #bookmark posts
SELECT COUNT(*) FROM bibtex; | 158,924 | #BibTeX posts
SELECT COUNT(*) FROM (SELECT * FROM tas GROUP BY user) AS u; | 3,617 | #users
SELECT COUNT(*) FROM (SELECT * FROM bookmark GROUP BY url_hash) AS u; | 235,328 | #URLs
SELECT COUNT(*) FROM (SELECT * FROM bibtex GROUP BY simhash1) AS b; | 143,050 | #BibTeXs
SELECT COUNT(*) FROM (SELECT * FROM tas GROUP BY tag) AS t; | 93,756 | #tags

Post-Core

For the post-core at level 2 we used the cleaned dump described above and removed all users, tags, and resources which appear in only one post. We iterated this process until convergence and obtained a core in which each user, tag, and resource occurs in at least two posts. For more information regarding post-cores, have a look at the paper Tag Recommendations in Folksonomies or the paper Generalized Cores by V. Batagelj and M. Zaversnik.
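To illustrate the idea, one pruning iteration over the tas table could look roughly like this (a sketch only: it assumes that a post is identified by its content_id, resources would have to be pruned analogously via their inter hashes, and the whole procedure is repeated until no rows are deleted anymore; the nested subquery is a workaround for MySQL's restriction on selecting from the table that is being deleted from):

-- remove tag assignments whose tag occurs in only one post
DELETE FROM tas WHERE tag IN (
  SELECT tag FROM (
    SELECT tag FROM tas GROUP BY tag HAVING COUNT(DISTINCT content_id) < 2
  ) AS rare_tags);

-- remove tag assignments whose user has only one post
DELETE FROM tas WHERE user IN (
  SELECT user FROM (
    SELECT user FROM tas GROUP BY user HAVING COUNT(DISTINCT content_id) < 2
  ) AS rare_users);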

Statistics

Some SQL commands and their output follow. You can repeat the commands to roughly check the validity of your data.

statement | count | info
SELECT COUNT(*) FROM tas; | 253,615 | #tag assignments
SELECT COUNT(*) FROM bookmark; | 41,268 | #bookmark posts
SELECT COUNT(*) FROM bibtex; | 22,852 | #BibTeX posts
SELECT COUNT(*) FROM (SELECT * FROM tas GROUP BY user) AS u; | 1,185 | #users
SELECT COUNT(*) FROM (SELECT * FROM bookmark GROUP BY url_hash) AS u; | 14,443 | #URLs
SELECT COUNT(*) FROM (SELECT * FROM bibtex GROUP BY simhash1) AS b; | 7,946 | #BibTeXs
SELECT COUNT(*) FROM (SELECT * FROM tas GROUP BY tag) AS t; | 13,276 | #tags

FAQ

How can I import the datasets into my MySQL database?

You can use the provided SQL script:

mysql -u <username> -p -D <databasename> < tables.sql

This script assumes that the corresponding data files are located in the '/tmp' directory and are readable by everyone, and that the MySQL user has the appropriate privileges.

When I try to import the training data using the script 'tables.sql', I get a 'Permission denied' error.

Ensure that '/tmp/bibtex', '/tmp/bookmark', and '/tmp/tas' are readable by everyone and that the MySQL user has the FILE privilege:

GRANT FILE ON *.* TO '<username>'@'localhost' IDENTIFIED BY '<password>';

(FILE is a global privilege, so it has to be granted ON *.* rather than on a single database.)

How can I get all information for a given post?

Assume you want to get all information for the post with content_id 42. First, get the user, all tags, content_type and date from the tas table:

SELECT * from tas where content_id='42';

Now, depending on the post's content_type, get further details from the bibtex or the bookmark table. In our case (content_type = 1, i.e., a bookmark):

SELECT * from bookmark where content_id='42';
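Equivalently, you can join the two tables on content_id in a single query (just a convenience; the result contains the same information as the two queries above):

SELECT t.user, t.tag, t.date, b.url, b.description
FROM tas AS t JOIN bookmark AS b ON t.content_id = b.content_id
WHERE t.content_id = '42';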

If I participate in the content-based task, can I also use a graph-based method?

You can use whichever method you prefer for all tasks. It is just that the core dataset might be more suitable for graph-based methods than the plain dataset. This will also hold for the test data.

I have problems loading the data into a MySQL database. I get errors like 'ERROR 1406 (22001): Data too long for column 'annote' at row 43542'

Ensure that the character set of your database, tables, and connection is UTF-8. We recently modified the tables.sql script to use UTF-8 wherever possible. However, it might be necessary to adjust your database server configuration as well.
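For example, an existing database and its tables can be switched to UTF-8 with statements along these lines (a sketch; <databasename> is a placeholder as above, and depending on your setup you may additionally have to set character-set-server in the server configuration):

ALTER DATABASE <databasename> CHARACTER SET utf8 COLLATE utf8_general_ci;
ALTER TABLE tas CONVERT TO CHARACTER SET utf8;
ALTER TABLE bookmark CONVERT TO CHARACTER SET utf8;
ALTER TABLE bibtex CONVERT TO CHARACTER SET utf8;
SET NAMES utf8;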

Is each content_id in the test data associated with at most one user_id?

Yes! The content_ids represent posts, and each post belongs to exactly one user. The content_ids do not represent resources; that is done by the hashes (url_hash for bookmarks, simhash[0-2] for publication references). So if you need some overlap between posts (i.e., to find posts with the same resource), use the hashes. For publication references there are two relevant hashes: simhash2 (the intra hash), which is unique per user (i.e., each user has at most one post with a given simhash2) and rather strict (changing the journal name changes the hash); and simhash1 (the inter hash), which is rather sloppy and provides overlap between resources (resources with the same title, author, and year have the same simhash1).
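For example, to find publication resources that were posted by more than one user, group the bibtex posts by their inter hash (a sketch; tas is joined only to recover the posting user of each content_id):

SELECT b.simhash1, COUNT(DISTINCT t.user) AS users
FROM bibtex AS b JOIN tas AS t ON t.content_id = b.content_id
GROUP BY b.simhash1
HAVING COUNT(DISTINCT t.user) > 1;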