BibSonomy Dataset :: dumps for research purposes

About

For research purposes, we offer interested researchers a dataset of the BibSonomy database in the form of an SQL dump. Before you get access to the dataset, you have to sign our license agreement and send it as a scanned file (in PDF, JPG, or PNG format) via email to our office. Alternatively, you may send the document by fax; see the number on our contact page.
Additionally, we would like to ask you to subscribe to the BibSonomy-Research mailing list. Upon receipt of your signed license agreement, we will approve the subscription request, and the welcome mail will contain instructions on how to access the dataset.

On this page you can download the dumps as compressed tar archives. A README describing the format of the files is contained in each archive. Please note that the easiest way to work with the dumps is by using a MySQL database. Detailed information on the table structure can be found below on this page.
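To look at the README without unpacking a whole archive, something like the following Python sketch may help. It uses only the standard tarfile module. The function names are illustrative, and the sketch assumes the README is a regular file whose name starts with "readme" (in any case), which this page does not specify.

```python
import tarfile
from pathlib import PurePosixPath
from typing import List, Optional


def list_members(archive_path: str) -> List[str]:
    """Return the names of all entries in a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()


def read_readme(archive_path: str) -> Optional[str]:
    """Return the text of the first README-like file, or None if there is none.

    Assumption: the README's file name starts with "readme" (case-insensitive).
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            base_name = PurePosixPath(member.name).name
            if member.isfile() and base_name.lower().startswith("readme"):
                with archive.extractfile(member) as handle:
                    return handle.read().decode("utf-8", errors="replace")
    return None
```

For example, read_readme("2007-06-30.tgz") would return the README text of that dump, provided the archive has been downloaded to the current directory.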

We are very interested in the results you obtain with the help of this dataset, so please inform us about your publications. To cite this data in publications, please use the following reference (adapting the date):

Knowledge and Data Engineering Group, University of Kassel: Benchmark Folksonomy Data from BibSonomy, version of June 30th, 2007.

If you want to refer to the system, please use the following publication:

Dominik Benz, Andreas Hotho, Robert Jäschke, Beate Krause, Folke Mitzlaff, Christoph Schmitz, and Gerd Stumme. The Social Bookmark and Publication Management System BibSonomy. The VLDB Journal, 19(6):849-875, Dec. 2010.

Datasets

file	size	description
2006-06-30.tgz	5.05 MB
2006-12-31.tgz	10.0 MB
2007-04-30_post-core-5.tgz	43.8 KB	This is the 2007-04-30 BibSonomy post-core at level 5, used for evaluation in Robert Jäschke, Leandro Balby Marinho, Andreas Hotho, Lars Schmidt-Thieme and Gerd Stumme. Tag Recommendations in Social Bookmarking Systems. AI Communications, 21(4):231-247, 2008.
2007-06-30.tgz	17.0 MB
2007-10-31.tgz	25.6 MB
2007-12-31.tgz	29.9 MB
2008-06-30.tgz	69.9 MB
2008-09-30.tgz	80.0 MB
2009-01-01.tgz	85.6 MB
2009-07-01.tgz	115 MB
2010-01-01.tgz	155 MB
2010-07-01.tgz	186 MB
2011-01-01.tgz	188 MB
2011-07-01.tgz	164 MB
2012-01-01.tgz	174 MB
2012-07-01.tgz	182 MB
2013-01-01.tgz	191 MB
2013-07-01.tgz	199 MB
2014-01-01.tgz	204 MB
2014-07-01.tgz	209 MB
2015-01-01.tgz	222 MB
2015-07-01.tgz	227 MB
2016-01-01.tgz	237 MB
2016-07-01.tgz	242 MB
2017-01-01.tgz	247 MB
2017-07-01.tgz	259 MB
2018-01-01.tgz	266 MB
2018-07-01.tgz	278 MB
2019-01-01.tgz	287 MB
2020-01-01.tgz	307 MB
2020-07-01.tgz	312 MB
2021-01-01.tgz	312 MB
2022-01-01.tgz	329 MB
2022-07-01.tgz	334 MB
2023-01-01.tgz	339 MB
2023-07-01.tgz	341 MB
2024-01-01.tgz	346 MB
2024-07-01.tgz	350 MB
2025-01-01.tgz	347 MB
2025-07-01.tgz	350 MB
BibSonomy_Agreement.pdf	62.0 KB
dc09test.tar	11.8 MB	ECML PKDD Discovery Challenge 2009 (DC09) Test Dataset
dc09train.tar	75.7 MB	ECML PKDD Discovery Challenge 2009 (DC09) Training Dataset
rsdc08train.tar.gz	256 MB	ECML PKDD Discovery Challenge 2008 (RSDC08) Training Dataset
rsdctest.zip	46.3 MB	ECML PKDD Discovery Challenge 2008 (RSDC08) Test Dataset

Dataset description

The dataset has been created using the mysqldump command of a MySQL database. The CREATE statements for the corresponding tables (each file = one table) can be found in the file tables.sql, together with the LOAD DATA statements which insert the data into the database. For the latter to work, you must adapt the paths to the data files at the end of tables.sql.

The dataset consists of four files:

tas
bookmark
bibtex
relation (since 2009-07-01)

These are tab-separated files: each line represents a row, and the fields of each row are delimited by a tab character. Please note that the fields themselves can contain line breaks, which are escaped by MySQL. The best way to load the data into a MySQL database is the LOAD DATA statement. If you have problems reading or understanding the data, please have a look at our FAQ.
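If you want to read the files without a MySQL server, the escaped line breaks need some care, because splitting on every newline would cut such fields apart. The following Python sketch is one possible reader. It assumes the files use MySQL's default export escaping (a backslash before tabs, newlines and backslashes inside a field, and \N for NULL); check the README in your archive if the output looks wrong. The function names are illustrative and not part of the dataset.

```python
from typing import Iterator, List, Optional

# Escape sequences that MySQL's LOAD DATA interprets by default.
_UNESCAPE = {"\\0": "\0", "\\b": "\b", "\\n": "\n",
             "\\r": "\r", "\\t": "\t", "\\Z": "\x1a"}


def _decode(parts: List[str]) -> Optional[str]:
    """Turn the raw pieces of one field into its value (None for NULL)."""
    if parts == ["\\N"]:
        return None
    # Two-character pieces are escape pairs; for pairs not listed above,
    # the escaped character itself is kept (e.g. backslash + newline).
    return "".join(_UNESCAPE.get(p, p[1:]) if len(p) == 2 else p for p in parts)


def parse_mysql_tsv(text: str) -> Iterator[List[Optional[str]]]:
    """Yield one list of field values per row of a tab-separated dump."""
    row: List[Optional[str]] = []
    raw: List[str] = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            raw.append(text[i:i + 2])  # keep the escape pair together
            i += 2
            continue
        if ch == "\t" or ch == "\n":
            row.append(_decode(raw))   # an unescaped tab or newline ends a field
            raw = []
            if ch == "\n":             # an unescaped newline also ends the row
                yield row
                row = []
        else:
            raw.append(ch)
        i += 1
    if raw or row:                     # last row without a trailing newline
        row.append(_decode(raw))
        yield row


# Two made-up tas rows; the second tag contains an escaped line break.
sample = ("7\tweb\t42\t1\t2007-06-30 12:00:00\n"
          "8\tfirst line\\\nsecond line\t43\t2\t\\N\n")
rows = list(parse_mysql_tsv(sample))
```

For clarity the sketch takes the whole text in memory; for the larger dumps a streaming variant would be preferable. When reading a real file, open it with encoding="utf-8" and newline="" so that Python does not translate line endings before the parser sees them.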

The fields of each row correspond to the following columns:

File tas

Tag Assignments: Fact table; who attached which tag to which resource/content

user (number; user names are anonymized)
tag
content_id (matches bookmark.content_id or bibtex.content_id)
content_type (1 = bookmark, 2 = bibtex)
date
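Since every row of tas is a single tag assignment, a post with several tags is spread over several rows that share the same content_id. A minimal Python sketch of how the rows could be regrouped into posts, assuming each row is a 5-tuple in the column order listed above (the function name and the dictionary layout are illustrative):

```python
from typing import Dict, Iterable, Tuple

TasRow = Tuple[str, str, str, str, str]  # user, tag, content_id, content_type, date


def posts_from_tas(rows: Iterable[TasRow]) -> Dict[str, dict]:
    """Group tag assignments by content_id into one record per post."""
    posts: Dict[str, dict] = {}
    for user, tag, content_id, content_type, date in rows:
        post = posts.setdefault(
            content_id,
            {"user": user, "content_type": content_type, "date": date, "tags": set()},
        )
        post["tags"].add(tag)
    return posts


# Made-up example rows: post 42 has two tags, post 43 has one.
example_rows = [
    ("7", "web", "42", "1", "2007-06-30"),
    ("7", "python", "42", "1", "2007-06-30"),
    ("8", "folksonomy", "43", "2", "2007-06-29"),
]
posts = posts_from_tas(example_rows)
```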

File bookmark

Dimension table for bookmark data

content_id (matches tas.content_id)
url_hash (the URL as MD5 hash)
url
description
extended description
date

File bibtex

Dimension table for BibTeX data

content_id (matches tas.content_id)
journal
volume
chapter
edition
month
day
booktitle
howPublished
institution
organization
publisher
address
school
series
bibtexKey (the bibtex key (in the @... line))
url
type
description
annote
note
pages
bKey (the "key" field)
number
crossref
misc
bibtexAbstract
simhash0 (hash for duplicate detection within a user -- strict -- (obsolete))
simhash1 (hash for duplicate detection among users -- sloppy --)
simhash2 (hash for duplicate detection within a user -- strict --)
entrytype
title
author
editor
year

File relation

Tag-tag relations of users

user (number, no user names available; matches tas.user)
sub-tag
super-tag
date
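The relation rows can be turned into a per-user lookup from each super-tag to its sub-tags. The Python sketch below assumes each row is a 4-tuple in the column order listed above; the function name and the nested-dictionary layout are illustrative choices, not something the dataset prescribes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

RelationRow = Tuple[str, str, str, str]  # user, sub-tag, super-tag, date


def build_tag_relations(rows: Iterable[RelationRow]) -> Dict[str, Dict[str, Set[str]]]:
    """Return {user: {super_tag: {sub_tag, ...}}} from relation rows."""
    relations: Dict[str, Dict[str, Set[str]]] = defaultdict(lambda: defaultdict(set))
    for user, sub_tag, super_tag, _date in rows:
        relations[user][super_tag].add(sub_tag)
    # Convert to plain dicts so missing keys raise KeyError instead of being created.
    return {user: dict(supers) for user, supers in relations.items()}


# Made-up example rows.
example_relations = [
    ("7", "python", "programming", "2009-07-01"),
    ("7", "java", "programming", "2009-07-01"),
    ("8", "python", "snakes", "2009-07-01"),
]
tag_relations = build_tag_relations(example_relations)
```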

Server Log Data

We also offer a new dataset containing the HTTP requests recorded in our web server logs. For information about this data, please contact us.

Frequently Asked Questions

How can I import the datasets into my MySQL database?

You can use the provided SQL script:

	    mysql -u <username> -p -D <databasename> < tables.sql

This script assumes that the corresponding data files are located in the '/tmp' directory and are readable by everyone, and that the MySQL user has the required privileges.

When I try to import the training data using the script 'tables.sql', I get a 'Permission denied' error.

Ensure that '/tmp/bibtex', '/tmp/bookmark' and '/tmp/tas' are readable by everyone and that the MySQL user has the FILE privilege. FILE is a global privilege, so it has to be granted on *.* rather than on a single database:

	    GRANT FILE ON *.* TO '<username>'@'localhost';

How can I get all information for a given post?

Assume you want to get all information for the post with content_id 42. First, get the user, all tags, content_type and date from the tas table:

	    SELECT * from tas where content_id='42';

Now, depending on the post's content_type, get further details from the bibtex or the bookmark table. In our case:

	    SELECT * from bookmark where content_id='42';
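The two lookups can also be combined in a small script. The Python sketch below uses an in-memory SQLite database purely as a stand-in so that it runs without a MySQL server. The helper names are illustrative, and the demo tables contain only a few of the real columns and made-up rows. With a MySQL driver, the placeholder syntax would typically be %s instead of ?.

```python
import sqlite3

# content_type values as documented for the tas file.
TABLE_FOR_TYPE = {1: "bookmark", 2: "bibtex"}


def fetch_post(conn, content_id):
    """Return (user, tags, content_type, details) for one post, or None."""
    tas_rows = conn.execute(
        "SELECT user, tag, content_type FROM tas WHERE content_id = ?",
        (content_id,),
    ).fetchall()
    if not tas_rows:
        return None
    user, _, content_type = tas_rows[0]
    tags = sorted(row[1] for row in tas_rows)
    # The table name comes from the fixed mapping above, never from user input.
    table = TABLE_FOR_TYPE[content_type]
    details = conn.execute(
        f"SELECT * FROM {table} WHERE content_id = ?", (content_id,)
    ).fetchone()
    return user, tags, content_type, details


def demo_connection():
    """Build a tiny in-memory database with made-up rows (reduced columns)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tas (user INTEGER, tag TEXT, content_id INTEGER,
                          content_type INTEGER, date TEXT);
        CREATE TABLE bookmark (content_id INTEGER, url_hash TEXT,
                               url TEXT, description TEXT);
        CREATE TABLE bibtex (content_id INTEGER, title TEXT,
                             author TEXT, year TEXT);
        INSERT INTO tas VALUES (7, 'web', 42, 1, '2007-06-30');
        INSERT INTO tas VALUES (7, 'python', 42, 1, '2007-06-30');
        INSERT INTO bookmark VALUES (42, 'abc', 'http://example.org/', 'Example page');
    """)
    return conn
```

Calling fetch_post(demo_connection(), 42) returns the user, the sorted tag list, the content_type, and the matching bookmark row for the demo post.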

I have problems loading the data into a MySQL database.

- I get errors like 'ERROR 1406 (22001): Data too long for column 'annote' at row 43542'

Ensure that the charset of your database, tables, and connections is UTF-8. We recently modified the tables.sql script to use UTF-8 wherever possible. However, it might be necessary to modify your database server configuration.

Is each content_id associated with at most one user?

Yes! The content_ids represent posts, and each post belongs to exactly one user. The content_ids do not represent resources; that is done by the hashes (url_hash for bookmarks, simhash[0-2] for publication references). So if you need overlap between posts (i.e., you want to find posts with the same resource), use the hashes. For publication references there are two relevant hashes: simhash2 (intra hash), which is unique per user (i.e., each user has at most one post with a given simhash2) and pretty strict (changing the journal name changes the hash); and simhash1 (inter hash), which is pretty sloppy and provides overlap between resources (resources with the same title, author, and year have the same simhash1).
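As a rough illustration of using the hashes to find overlap, the following Python sketch groups posts by a resource hash and keeps only the hashes shared by more than one post. It assumes you have already extracted (content_id, hash) pairs, for example content_id with url_hash from bookmark or content_id with simhash1 from bibtex. The function name and the example values are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_posts_by_hash(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each resource hash to its posts, keeping hashes with several posts."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for content_id, resource_hash in pairs:
        groups[resource_hash].append(content_id)
    return {h: ids for h, ids in groups.items() if len(ids) > 1}


# Made-up (content_id, simhash1) pairs: posts 1 and 3 refer to the same resource.
example_pairs = [("1", "aaa"), ("2", "bbb"), ("3", "aaa")]
shared = group_posts_by_hash(example_pairs)
```

Here shared contains only the hash "aaa" with the posts "1" and "3"; post "2" is dropped because no other post has its hash.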