A blue social bookmark and publication sharing system.

For research purposes we offer interested people a dataset of the BibSonomy database in the form of an SQL dump. Before you get access to the dataset, you have to sign our license agreement and send it to our office by fax.
Additionally, we would like to ask you to subscribe to the BibSonomy-Research mailing list. Upon receipt of your faxed license agreement, we will approve the subscription request. The welcome mail then contains instructions on how to access the dataset.

On this page you can download the dumps as compressed tar archives. Each archive contains a README describing the format of the files. Please note that the easiest way to work with the dumps is to use a MySQL database. Detailed information on the table structure can be found further down on this page.

We are very interested in results obtained with the help of this dataset, so please inform us about your publications. When citing this data in publications, please use the following reference (adapting the date to the version you used):

Knowledge and Data Engineering Group, University of Kassel: Benchmark Folksonomy Data from BibSonomy, version of June 30th, 2007.

If you want to refer to the system, please use the following publication:

Dominik Benz, Andreas Hotho, Robert Jäschke, Beate Krause, Folke Mitzlaff, Christoph Schmitz, and Gerd Stumme. The Social Bookmark and Publication Management System BibSonomy. In: The VLDB Journal, Vol. 19, No. 6, Berlin / Heidelberg: Springer, Dec. 2010, pp. 849-875.

Datasets

file                        size     description
2006-06-30.tgz              5.05 MB
2006-12-31.tgz              10.0 MB
2007-04-30_post-core-5.tgz  43.8 KB  This is the 2007-04-30 BibSonomy post-core at level 5, used for evaluation in: Robert Jäschke, Leandro Balby Marinho, Andreas Hotho, Lars Schmidt-Thieme and Gerd Stumme. Tag Recommendations in Social Bookmarking Systems. AI Communications, 21(4):231-247, 2008.
2007-06-30.tgz              17.0 MB
2007-10-31.tgz              25.6 MB
2007-12-31.tgz              29.9 MB
2008-06-30.tgz              69.9 MB
2008-09-30.tgz              80.0 MB
2009-01-01.tgz              85.6 MB
2009-07-01.tgz              115 MB
2010-01-01.tgz              155 MB
2010-07-01.tgz              186 MB
2011-01-01.tgz              188 MB
2011-07-01.tgz              164 MB
2012-01-01.tgz              174 MB
BibSonomy_Agreement.pdf     62.0 KB
BibSonomy_Agreement.tex     2.51 KB
BibSonomy_Agreement.odt     12.4 KB
dc09test.tar                11.8 MB  ECML PKDD Discovery Challenge 2009 (DC09) Test Dataset
dc09train.tar               75.7 MB  ECML PKDD Discovery Challenge 2009 (DC09) Training Dataset
rsdc08train.tar.gz          256 MB   ECML PKDD Discovery Challenge 2008 (RSDC08) Training Dataset
rsdctest.zip                46.3 MB  ECML PKDD Discovery Challenge 2008 (RSDC08) Test Dataset

Dataset description

The dataset was created with the mysqldump command from a MySQL database. The CREATE statements for the corresponding tables (each file = one table) can be found in the file tables.sql, together with the LOAD DATA statements that insert the data into the database. For the latter to work, you must adapt the paths to the data files at the end of tables.sql so that they point to the location of the files on your machine.
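To illustrate what adapting the paths amounts to, here is a minimal sketch that prints LOAD DATA statements for a given directory. It assumes that the data files are named tas, bookmark, bibtex and relation (as in the file descriptions further down this page) and that each table has the same name as its file. The function name load_statements and the example directory are hypothetical, and the statements actually shipped in tables.sql may carry additional options, so treat the shipped file as authoritative.

```python
from pathlib import PurePosixPath
from typing import List

# Assumption: one table per data file, with identical names.
TABLES = ("tas", "bookmark", "bibtex", "relation")


def load_statements(data_dir: str) -> List[str]:
    """Build one LOAD DATA statement per table for files in data_dir.

    The path is inserted verbatim, so it must not contain single quotes.
    """
    base = PurePosixPath(data_dir)
    return [
        f"LOAD DATA LOCAL INFILE '{base / table}' INTO TABLE {table};"
        for table in TABLES
    ]


if __name__ == "__main__":
    # Hypothetical location of the unpacked archive.
    for statement in load_statements("/data/bibsonomy"):
        print(statement)
```

The LOCAL keyword makes the MySQL client read the file from the client machine. Without it the server reads the file itself, which requires the file to be accessible on the server host.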

The dataset consists of four files: tas, bookmark, bibtex and relation.

These are tab-separated files: each line represents a row, and the fields of each row are delimited by a tab character. Please note that the fields themselves can contain line breaks, which MySQL escapes, so splitting the files naively at every line break will break such rows apart. The best way to load the data into a MySQL database is the LOAD DATA statement. If you have problems reading or understanding the data, please have a look at this FAQ.
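If you want to read the files without MySQL, the escaped line breaks need special care. The following is a minimal sketch of a reader, assuming the files follow the default conventions of mysqldump's tab-separated output: tab between fields, newline between rows, backslash as the escape character, and \N for NULL. The name parse_rows is illustrative and not part of any BibSonomy tooling. Please check the README in your archive before relying on these assumptions.

```python
from typing import Iterator, List, Optional

# Escape sequences with a special meaning under the default ESCAPED BY '\\'.
# Any other escaped character (including a literal tab, newline or
# backslash) stands for itself.
_ESCAPES = {"0": "\0", "b": "\b", "n": "\n", "r": "\r", "t": "\t", "Z": "\x1a"}


def parse_rows(text: str) -> Iterator[List[Optional[str]]]:
    """Yield rows as lists of fields; a field that is exactly \\N becomes None."""
    row: List[Optional[str]] = []
    field: List[str] = []
    is_null = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            at_field_end = i + 2 == len(text) or text[i + 2] in "\t\n"
            if nxt == "N" and not field and at_field_end:
                is_null = True  # the whole field is \N, i.e. NULL
            else:
                field.append(_ESCAPES.get(nxt, nxt))
            i += 2
            continue
        if ch == "\t" or ch == "\n":
            # An unescaped tab ends the field, an unescaped newline ends the row.
            row.append(None if is_null else "".join(field))
            field, is_null = [], False
            if ch == "\n":
                yield row
                row = []
        else:
            field.append(ch)
        i += 1
    if field or row or is_null:
        # Last row without a trailing newline.
        row.append(None if is_null else "".join(field))
        yield row
```

When reading a file for this function, open it with newline="" so that Python does not translate line endings, for example open(path, encoding="utf-8", newline="").read(). The encoding is an assumption as well. The sketch keeps the whole file in memory, which is acceptable for the smaller dumps but may need a streaming variant for the larger ones.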

The fields of each row correspond to the following columns:

File tas

Tag Assignments: Fact table; who attached which tag to which resource/content

  1. user (number; user names are anonymized)
  2. tag
  3. content_id (matches bookmark.content_id or bibtex.content_id)
  4. content_type (1 = bookmark, 2 = bibtex)
  5. date
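As an illustration of how these five columns can be used, the following sketch maps a row to a small record type and groups tag assignments into posts, that is, the set of tags one user attached to one resource. The names TagAssignment, to_tas and group_posts are hypothetical, and the sample values in the usage part are made up. The date is kept as a string because its exact format is not specified here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class TagAssignment:
    user: int
    tag: str
    content_id: int
    content_type: int  # 1 = bookmark, 2 = bibtex
    date: str


def to_tas(fields: List[str]) -> TagAssignment:
    """Convert the five fields of one tas row, in column order."""
    user, tag, content_id, content_type, date = fields
    return TagAssignment(int(user), tag, int(content_id), int(content_type), date)


def group_posts(rows: Iterable[TagAssignment]) -> Dict[Tuple[int, int], Set[str]]:
    """Group tags by (user, content_id)."""
    posts: Dict[Tuple[int, int], Set[str]] = defaultdict(set)
    for row in rows:
        posts[(row.user, row.content_id)].add(row.tag)
    return dict(posts)


if __name__ == "__main__":
    # Made-up sample rows, not taken from the dataset.
    sample = [
        to_tas(["7", "web", "100", "1", "2006-01-01 00:00:00"]),
        to_tas(["7", "python", "100", "1", "2006-01-01 00:00:00"]),
        to_tas(["7", "ml", "101", "2", "2006-01-02 00:00:00"]),
    ]
    print(group_posts(sample))
```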

File bookmark

Dimension table for bookmark data

  1. content_id (matches tas.content_id)
  2. url_hash (the URL as MD5 hash)
  3. url
  4. description
  5. extended description
  6. date
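The url_hash column can be compared against a hash you compute yourself, for example to look up a URL in the dump. The sketch below assumes that the hash is the hexadecimal MD5 digest of the UTF-8 encoded URL exactly as stored in the url column. This is an assumption: if the system normalized the URL or used a different encoding before hashing, the values will not match. The function names url_md5 and hash_matches are hypothetical.

```python
import hashlib


def url_md5(url: str) -> str:
    """Hexadecimal MD5 digest of the UTF-8 encoded URL (assumed convention)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def hash_matches(url: str, url_hash: str) -> bool:
    """Check one bookmark row: does url hash to the stored url_hash?"""
    return url_md5(url) == url_hash.strip().lower()
```

Running hash_matches over a handful of rows from the bookmark file is a quick way to find out whether the assumption holds for your version of the dump.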

File bibtex

Dimension table for BibTeX data

  1. content_id (matches tas.content_id)
  2. journal
  3. volume
  4. chapter
  5. edition
  6. month
  7. day
  8. booktitle
  9. howPublished
  10. institution
  11. organization
  12. publisher
  13. address
  14. school
  15. series
  16. bibtexKey (the bibtex key (in the @... line))
  17. url
  18. type
  19. description
  20. annote
  21. note
  22. pages
  23. bKey (the "key" field)
  24. number
  25. crossref
  26. misc
  27. bibtexAbstract
  28. simhash0 (hash for duplicate detection within a user -- strict -- (obsolete))
  29. simhash1 (hash for duplicate detection among users -- sloppy --)
  30. simhash2 (hash for duplicate detection within a user -- strict --)
  31. entrytype
  32. title
  33. author
  34. editor
  35. year
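Since content_id in tas matches either bookmark.content_id or bibtex.content_id, the content_type column decides which of the two dimension tables to consult. A minimal sketch of that lookup, assuming both tables have already been loaded into dictionaries keyed by content_id (the function name resolve and the dictionary layout are illustrative):

```python
from typing import Any, Dict, Optional

BOOKMARK = 1  # content_type value for bookmarks
BIBTEX = 2    # content_type value for BibTeX entries

Row = Dict[str, Any]


def resolve(
    content_id: int,
    content_type: int,
    bookmarks: Dict[int, Row],
    bibtex: Dict[int, Row],
) -> Optional[Row]:
    """Return the bookmark or BibTeX row a tag assignment refers to.

    Returns None if the content_id is missing from the selected table.
    """
    if content_type == BOOKMARK:
        return bookmarks.get(content_id)
    if content_type == BIBTEX:
        return bibtex.get(content_id)
    raise ValueError(f"unknown content_type: {content_type}")
```

In SQL the same lookup corresponds to joining tas with bookmark for rows with content_type = 1 and with bibtex for rows with content_type = 2.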

File relation

Tag-tag relations of users

  1. user (number, no user names available; matches tas.user)
  2. sub-tag
  3. super-tag
  4. date
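The relation rows can be turned into a per-user tag hierarchy. The following sketch assumes that each row has already been split into its four fields and that a row states "sub-tag is a narrower tag of super-tag" for the given user. The function name build_hierarchy is hypothetical and the sample rows are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

RelationRow = Tuple[int, str, str, str]  # (user, sub_tag, super_tag, date)


def build_hierarchy(rows: Iterable[RelationRow]) -> Dict[int, Dict[str, Set[str]]]:
    """Map each user to {super_tag: set of sub_tags}."""
    hierarchy: Dict[int, Dict[str, Set[str]]] = defaultdict(lambda: defaultdict(set))
    for user, sub_tag, super_tag, _date in rows:
        hierarchy[user][super_tag].add(sub_tag)
    # Convert to plain dicts so that missing keys raise instead of being created.
    return {user: dict(tags) for user, tags in hierarchy.items()}


if __name__ == "__main__":
    # Made-up sample rows, not taken from the dataset.
    sample = [
        (7, "python", "programming", "2006-01-01 00:00:00"),
        (7, "java", "programming", "2006-01-01 00:00:00"),
        (9, "svm", "ml", "2006-01-02 00:00:00"),
    ]
    print(build_hierarchy(sample))
```

Because the user numbers match tas.user, the resulting hierarchy can be combined with the tag assignments of the same user.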