About

For research purposes we offer interested researchers a dataset of the BibSonomy database in the form of an SQL dump. Before you get access to the dataset, you have to sign our license agreement and send it as a scanned file (in PDF, JPG, or PNG format) via email to our office. Alternatively, you may send the document via fax; see the number on our contact page.
Additionally, we ask you to subscribe to the BibSonomy-Research mailing list. Upon receipt of your signed license agreement, we will approve the subscription request, and the welcome mail will contain instructions on how to access the dataset.

On this page you can download the dumps as compressed tar archives. A README describing the format of the files is contained in each archive. Please note that the easiest way to work with the dumps is to use a MySQL database. Detailed information on the table structure can be found below on this page.

We are very interested in the results you obtain with the help of this dataset, so please inform us about your publications. When citing this data in publications, please use the following reference (adapting the date):

Knowledge and Data Engineering Group, University of Kassel: Benchmark Folksonomy Data from BibSonomy, version of June 30th, 2007.

If you want to refer to the system, please use the following publication:

Dominik Benz, Andreas Hotho, Robert Jäschke, Beate Krause, Folke Mitzlaff, Christoph Schmitz, and Gerd Stumme. The Social Bookmark and Publication Management System BibSonomy. The VLDB Journal, 19(6):849-875, Dec. 2010.

Datasets

file size description
2006-06-30.tgz 5.05 MB
2006-12-31.tgz 10.0 MB
2007-04-30_post-core-5.tgz 43.8 KB This is the 2007-04-30 BibSonomy post-core at level 5, used for evaluation in: Robert Jäschke, Leandro Balby Marinho, Andreas Hotho, Lars Schmidt-Thieme, and Gerd Stumme. Tag Recommendations in Social Bookmarking Systems. AI Communications, 21(4):231-247, 2008.
2007-06-30.tgz 17.0 MB
2007-10-31.tgz 25.6 MB
2007-12-31.tgz 29.9 MB
2008-06-30.tgz 69.9 MB
2008-09-30.tgz 80.0 MB
2009-01-01.tgz 85.6 MB
2009-07-01.tgz 115 MB
2010-01-01.tgz 155 MB
2010-07-01.tgz 186 MB
2011-01-01.tgz 188 MB
2011-07-01.tgz 164 MB
2012-01-01.tgz 174 MB
2012-07-01.tgz 182 MB
2013-01-01.tgz 191 MB
2013-07-01.tgz 199 MB
2014-01-01.tgz 204 MB
2014-07-01.tgz 209 MB
2015-01-01.tgz 222 MB
2015-07-01.tgz 227 MB
2016-01-01.tgz 237 MB
2016-07-01.tgz 242 MB
2017-01-01.tgz 247 MB
2017-07-01.tgz 259 MB
2018-01-01.tgz 266 MB
2018-07-01.tgz 278 MB
2019-01-01.tgz 287 MB
2020-01-01.tgz 307 MB
2020-07-01.tgz 312 MB
2021-01-01.tgz 312 MB
2022-01-01.tgz 329 MB
2022-07-01.tgz 334 MB
2023-01-01.tgz 339 MB
2023-07-01.tgz 341 MB
BibSonomy_Agreement.pdf 62.0 KB
dc09test.tar 11.8 MB ECML PKDD Discovery Challenge 2009 (DC09) Test Dataset
dc09train.tar 75.7 MB ECML PKDD Discovery Challenge 2009 (DC09) Training Dataset
rsdc08train.tar.gz 256 MB ECML PKDD Discovery Challenge 2008 (RSDC08) Training Dataset
rsdctest.zip 46.3 MB ECML PKDD Discovery Challenge 2008 (RSDC08) Test Dataset

Dataset description

The dataset has been created with the mysqldump command of a MySQL database. The CREATE statements for the corresponding tables (each file = one table) can be found in the file tables.sql, together with the LOAD DATA statements that insert the data into the database. For the latter to work, you must adapt the paths to the data files at the end of tables.sql.
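
For example, such a LOAD DATA statement could look like the following sketch. The actual statements in tables.sql may use additional options, and the '/tmp/tas' path is only an example that must be adapted to the location of your extracted files:

        -- load the tab-separated file into the tas table (path is an example)
        LOAD DATA INFILE '/tmp/tas' INTO TABLE tas;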

The dataset consists of four files, tas, bookmark, bibtex, and relation, which are described below.

These are tab-separated files: each line represents a row, and the fields of each row are delimited by tabs. Please note that the fields themselves can contain line breaks, which are escaped by MySQL. The best way to load the data into a MySQL database is to use the LOAD DATA statement. If you have problems reading or understanding the data, please have a look at our FAQ.

The fields of each row correspond to the following columns:

File tas

Tag Assignments: Fact table; who attached which tag to which resource/content

  1. user (number; user names are anonymized)
  2. tag
  3. content_id (matches bookmark.content_id or bibtex.content_id)
  4. content_type (1 = bookmark, 2 = bibtex)
  5. date
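
As a quick sanity check after the import, you can, for instance, list the most frequently used tags. The following query is only a sketch and assumes that the table is named tas as in tables.sql:

        -- the ten most frequently used tags across all posts
        SELECT tag, COUNT(*) AS cnt
        FROM tas
        GROUP BY tag
        ORDER BY cnt DESC
        LIMIT 10;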

File bookmark

Dimension table for bookmark data

  1. content_id (matches tas.content_id)
  2. url_hash (the URL as MD5 hash)
  3. url
  4. description
  5. extended description
  6. date
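
To reconstruct complete bookmark posts, tas and bookmark can be joined on content_id, restricted to content_type 1. Again, this is only a sketch assuming the table names from tables.sql:

        -- tag assignments together with the bookmarked URL and its description
        SELECT t.user, t.tag, b.url, b.description, t.date
        FROM tas t JOIN bookmark b ON t.content_id = b.content_id
        WHERE t.content_type = 1;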

File bibtex

Dimension table for BibTeX data

  1. content_id (matches tas.content_id)
  2. journal
  3. volume
  4. chapter
  5. edition
  6. month
  7. day
  8. booktitle
  9. howPublished
  10. institution
  11. organization
  12. publisher
  13. address
  14. school
  15. series
  16. bibtexKey (the bibtex key (in the @... line))
  17. url
  18. type
  19. description
  20. annote
  21. note
  22. pages
  23. bKey (the "key" field)
  24. number
  25. crossref
  26. misc
  27. bibtexAbstract
  28. simhash0 (hash for duplicate detection within a user, strict; obsolete)
  29. simhash1 (hash for duplicate detection among users, sloppy; the "inter hash")
  30. simhash2 (hash for duplicate detection within a user, strict; the "intra hash")
  31. entrytype
  32. title
  33. author
  34. editor
  35. year
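
Analogously, publication posts can be assembled by joining tas with bibtex on content_id, restricted to content_type 2. This is again only a sketch using the table names from tables.sql:

        -- tag assignments together with basic metadata of the publication
        SELECT t.user, t.tag, b.title, b.author, b.year, b.entrytype
        FROM tas t JOIN bibtex b ON t.content_id = b.content_id
        WHERE t.content_type = 2;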

File relation

Tag-tag relations of users

  1. user (number, no user names available; matches tas.user)
  2. sub-tag
  3. super-tag
  4. date
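
For example, the tag relations defined by a single (anonymized) user could be listed as follows. This is only a sketch: the user id 1234 is made up, and the table name relation is taken from the file name:

        -- all sub-/super-tag pairs defined by one user
        SELECT * FROM relation WHERE user = 1234;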

Server Log Data

We also offer a dataset containing the HTTP requests recorded in our web server logs. For information regarding this data, please contact us.

Frequently Asked Questions

How can I import the datasets into my MySQL database?

You can use the provided SQL script:

	    mysql -u <username> -p -D <databasename> < tables.sql
	  

This script assumes that the corresponding data files are located in the /tmp directory, that they are readable by everyone, and that the MySQL user has the appropriate privileges.

When I try to import the training data using the script 'tables.sql', I get a 'Permission denied' error.

Make sure that '/tmp/bibtex', '/tmp/bookmark', and '/tmp/tas' are readable by everyone and that the MySQL user has the FILE privilege (note that FILE is a global privilege and is therefore granted on *.*):

	    GRANT FILE ON *.* TO '<username>'@'localhost' IDENTIFIED BY '<password>';
	  

How can I get all information for a given post?

Assume you want to get all information for the post with content_id 42. First, get the user, all tags, content_type and date from the tas table:

	    SELECT * from tas where content_id='42';
	  

Now, depending on the post's content_type, get further details from the bibtex or the bookmark table. In our case (content_type = 1, i.e., a bookmark):

	    SELECT * from bookmark where content_id='42';
	  

I have problems loading the data into a MySQL database.

- I get errors like 'ERROR 1406 (22001): Data too long for column 'annote' at row 43542'

Make sure that the character set of your database, tables, and connections is UTF-8. We recently modified the tables.sql script to use UTF-8 wherever possible; however, it might still be necessary to adjust your database server configuration.
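
If the defaults are not UTF-8, the character set can be changed explicitly, for example as follows. This is only a sketch: replace <databasename> with the name of your database, and note that recent MySQL versions may prefer utf8mb4 over utf8:

        -- switch the database default and the current connection to UTF-8
        ALTER DATABASE <databasename> CHARACTER SET utf8 COLLATE utf8_general_ci;
        SET NAMES utf8;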

Is at most one user_id associated with each content_id?

Yes! The content_ids represent posts, and each post belongs to exactly one user. The content_ids do not represent resources; resources are identified by the hashes (url_hash for bookmarks, simhash[0-2] for publication references). So if you need overlap between posts (i.e., you want to find posts with the same resource), use the hashes. For publication references there are two relevant hashes: simhash2 (the intra hash), which is unique per user (i.e., each user has at most one post with a given simhash2) and fairly strict (changing the journal name changes the hash), and simhash1 (the inter hash), which is rather sloppy and provides overlap between resources (resources with the same title, author, and year have the same simhash1).
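
For example, publication references that occur in more than one post can be found by grouping the bibtex table by the inter hash. This is only a sketch using the column names described above:

        -- inter hashes shared by more than one publication post
        SELECT simhash1, COUNT(*) AS num_posts
        FROM bibtex
        GROUP BY simhash1
        HAVING num_posts > 1;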