For research purposes, we offer a dataset of the BibSonomy database in the form of an SQL dump to interested people. Before you get access to the dataset, you have to sign our license agreement and send it via fax to our office.
Additionally, we would like to ask you to subscribe to the BibSonomy-Research mailing list. Upon receipt of your faxed license agreement, we will approve the subscription request, and the welcome mail will contain instructions on how to access the dataset.
On this page you can download the dumps as compressed tar archives. Each archive contains a README describing the format of the files. Please note that the easiest way to work with the dumps is to load them into a MySQL database. Detailed information on the table structure can be found further down this page.
We are very interested in the results you obtain with the help of this dataset. Therefore, please inform us about your publications. When citing this data in publications, please use the following reference (adapting the date):
Knowledge and Data Engineering Group, University of Kassel: Benchmark Folksonomy Data from BibSonomy, version of June 30th, 2007.
If you want to refer to the system, please use the following publication:
Dominik Benz, Andreas Hotho, Robert Jäschke, Beate Krause, Folke Mitzlaff, Christoph Schmitz, and Gerd Stumme. The Social Bookmark and Publication Management System BibSonomy. In: The VLDB Journal, Vol. 19, No. 6. Berlin / Heidelberg: Springer, Dec (2010), p. 849--875.
Datasets
Dataset description
The dataset was created with the mysqldump command from a MySQL database. The CREATE statements for the corresponding tables (one table per file) can be found in the file tables.sql, together with the LOAD DATA statements which insert the data into the database. For the latter to work, you must adapt the paths to the data files at the end of tables.sql.
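As a rough illustration of what adapting the paths involves, the following Python sketch builds one LOAD DATA statement per table. The directory /data/bibsonomy, the helper name load_statements, and the assumption that each data file is named exactly like its table are hypothetical; the statements actually shipped in tables.sql may look different.

```python
from pathlib import PurePosixPath

# The four tables of the dump (one data file per table).
TABLES = ["tas", "bookmark", "bibtex", "relation"]


def load_statements(data_dir):
    """Build one LOAD DATA statement per table (hypothetical helper)."""
    statements = []
    for table in TABLES:
        # Assumption: the data file carries the same name as its table.
        path = PurePosixPath(data_dir) / table
        statements.append(
            f"LOAD DATA LOCAL INFILE '{path}' INTO TABLE {table};"
        )
    return statements


# /data/bibsonomy is a placeholder for wherever the archive was unpacked.
for statement in load_statements("/data/bibsonomy"):
    print(statement)
```

The sketch relies on the LOAD DATA defaults (tab-separated fields, newline-terminated rows, backslash as escape character), which match the file format described on this page.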
The dataset consists of four files: tas, bookmark, bibtex, and relation.
These are tab-separated files: each line represents a row, and the fields of each row are delimited by a tab character. Please note that the fields themselves can contain line breaks, which are escaped by MySQL. The best way to load the data into a MySQL database is the LOAD DATA statement. If you have problems reading or understanding the data, please have a look at this FAQ.
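If you want to read the files without MySQL, the escaped line breaks are the main pitfall: splitting on newlines alone breaks rows apart. The following Python sketch shows one way to handle this. It assumes that the files use MySQL's default export conventions (backslash as escape character, \N for NULL); the function name parse_rows and the sample text are made up for illustration.

```python
# Escape sequences that MySQL's LOAD DATA recognises by default.
ESCAPES = {"0": "\0", "b": "\b", "n": "\n", "r": "\r", "t": "\t", "Z": "\x1a"}


def parse_rows(text):
    """Yield rows as lists of fields; a field is a str, or None for NULL."""
    row, buf, is_null = [], [], False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            if nxt == "N" and not buf:
                # \N at the start of a field marks a NULL value.
                is_null = True
            else:
                # Known escapes are translated; any other escaped character
                # (including a literal tab or newline) is taken as itself.
                buf.append(ESCAPES.get(nxt, nxt))
            i += 2
            continue
        if ch in "\t\n":
            # An unescaped tab ends the field, an unescaped newline the row.
            row.append(None if is_null and not buf else "".join(buf))
            buf, is_null = [], False
            if ch == "\n":
                yield row
                row = []
        else:
            buf.append(ch)
        i += 1
    if row or buf or is_null:
        # Last row without a trailing newline.
        row.append(None if is_null and not buf else "".join(buf))
        yield row


# Invented sample: the third field of the first row contains a line break.
sample = "1\tfoo\tline one\\\nline two\n2\t\\N\tbar\n"
for fields in parse_rows(sample):
    print(fields)
```

The parser works on a single string for brevity; for the full dumps, a streaming variant that reads the file in chunks would be more memory-friendly.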
The fields of each row correspond to the following columns:
File tas
Tag Assignments: Fact table; who attached which tag to which resource/content
- user (number; user names are anonymized)
- tag
- content_id (matches bookmark.content_id or bibtex.content_id)
- content_type (1 = bookmark, 2 = bibtex)
- date
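The content_type column decides which dimension table a content_id refers to. A minimal Python sketch of that lookup follows; the names TagAssignment and resolve, the in-memory dictionaries, and all sample values are invented for illustration, and every field is kept as a string because the exact column types are not stated here.

```python
from typing import NamedTuple


class TagAssignment(NamedTuple):
    """One row of the tas file, in the column order listed above."""
    user: str
    tag: str
    content_id: str
    content_type: str
    date: str


# content_type 1 refers to the bookmark table, 2 to the bibtex table.
TABLE_FOR_TYPE = {"1": "bookmark", "2": "bibtex"}


def resolve(tas_row, bookmarks, bibtex):
    """Return (table name, matching dimension row or None) for a tas row."""
    assignment = TagAssignment(*tas_row)
    table = TABLE_FOR_TYPE[assignment.content_type]
    lookup = bookmarks if table == "bookmark" else bibtex
    return table, lookup.get(assignment.content_id)


# Invented sample data, keyed by content_id.
bookmarks = {"10": {"url": "http://www.example.org/"}}
bibtex = {"20": {"title": "An Example Publication"}}

print(resolve(["7", "web", "10", "1", "2007-01-01 12:00:00"], bookmarks, bibtex))
print(resolve(["7", "paper", "20", "2", "2007-01-02 09:30:00"], bookmarks, bibtex))
```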
File bookmark
Dimension table for bookmark data
- content_id (matches tas.content_id)
- url_hash (the URL as MD5 hash)
- url
- description
- extended description
- date
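To compare your own URLs against the url_hash column, a sketch like the following may help. It assumes that the hash is the hexadecimal MD5 digest of the UTF-8 encoded URL without any prior normalisation; this page does not confirm these details, so check the result against a few rows of the dump first. The function name url_hash is chosen for illustration only.

```python
import hashlib


def url_hash(url):
    """Hex MD5 digest of a URL (assumes UTF-8 and no URL normalisation)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


# Placeholder URL; compare the output with the url_hash of a known row.
print(url_hash("http://www.example.org/"))
```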
File bibtex
Dimension table for BibTeX data
- content_id (matches tas.content_id)
- journal
- volume
- chapter
- edition
- month
- day
- booktitle
- howPublished
- institution
- organization
- publisher
- address
- school
- series
- bibtexKey (the BibTeX key, as given in the @... line)
- url
- type
- description
- annote
- note
- pages
- bKey (the "key" field)
- number
- crossref
- misc
- bibtexAbstract
- simhash0 (hash for duplicate detection within a user -- strict -- (obsolete))
- simhash1 (hash for duplicate detection among users -- sloppy --)
- simhash2 (hash for duplicate detection within a user -- strict --)
- entrytype
- title
- author
- editor
- year
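Since simhash1 is meant for duplicate detection among users, grouping rows by this column is one way to find posts that presumably describe the same publication. The sketch below assumes that each row has already been turned into a dictionary keyed by the column names above; the function name group_by_simhash1 and the sample values are invented.

```python
from collections import defaultdict


def group_by_simhash1(rows):
    """Map each simhash1 value shared by several rows to their content_ids."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["simhash1"]].append(row["content_id"])
    # Keep only hashes that occur more than once.
    return {h: ids for h, ids in groups.items() if len(ids) > 1}


# Invented sample rows; real rows carry all columns listed above.
rows = [
    {"content_id": "20", "simhash1": "abc"},
    {"content_id": "21", "simhash1": "abc"},
    {"content_id": "22", "simhash1": "def"},
]
print(group_by_simhash1(rows))
```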
File relation
Tag-tag relations of users
- user (number, no user names available; matches tas.user)
- sub-tag
- super-tag
- date
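The relation rows can be collected into a per-user tag hierarchy. The following Python sketch maps each pair of user and super-tag to the set of its sub-tags; the function name build_hierarchy and the sample rows are invented, and rows are assumed to arrive as four-field sequences in the column order listed above.

```python
from collections import defaultdict


def build_hierarchy(relation_rows):
    """Map each (user, super-tag) pair to the set of its sub-tags."""
    hierarchy = defaultdict(set)
    for user, sub_tag, super_tag, _date in relation_rows:
        hierarchy[(user, super_tag)].add(sub_tag)
    return dict(hierarchy)


# Invented sample rows: user, sub-tag, super-tag, date.
rows = [
    ["7", "python", "programming", "2007-01-01 12:00:00"],
    ["7", "java", "programming", "2007-01-03 08:15:00"],
    ["8", "python", "snakes", "2007-02-10 17:45:00"],
]
for key, sub_tags in build_hierarchy(rows).items():
    print(key, sorted(sub_tags))
```

Keying by user keeps the hierarchies separate, since the relations are defined per user and two users may relate the same tags differently.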