Knowledge and Data Engineering
Uni Kassel

Visualising BibSonomy

Viszards Session at Sunbelt 2009


This year's Viszards Session takes place in the new area of social bookmarking and is about visualising the content of the publication sharing system BibSonomy, which is hosted by the Knowledge and Data Engineering Group of the University of Kassel.

To get started with the tasks, we suggest that you familiarise yourself with BibSonomy. A more formal description of the underlying structure, called a folksonomy, is given in this paper (pdf here), which also describes the BibSonomy components. Your next step is to subscribe to the mailing list viszards09. We will use the list to distribute news about the data and other relevant information. The list can also be used to clarify questions about the dataset and the different tasks. Because the welcome message on the list explains how to access the dataset, subscribing to the list is essential for participating in the Viszards session.


  1. Ulrik Brandes: About Viszards session
  2. Jurgen Pfeffer: Measures
  3. Vlado Batagelj: Basic analyses
  4. Ann McCranie: BibSonomy Anatomy
  5. Ulrik Brandes: Dynamics of tags
  6. Lothar Krempel: URL and tags


To access the dataset, please subscribe to the viszards09 mailing list. The welcome message contains all the information needed to access the dataset.

The dataset was created with the mysqldump command from a MySQL database. The CREATE statements for the corresponding tables (each file = one table) can be found in the file tables.sql, together with the LOAD DATA statements which insert the data into the database. For the latter to work, you must adapt the paths to the data files at the end of tables.sql.
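Adapting the paths by hand is straightforward, but it can also be scripted. The following is a minimal sketch, assuming that the statements in tables.sql use the standard `LOAD DATA [LOCAL] INFILE '...'` form with the file name in single quotes. The function name `rewrite_infile_paths` and the example statement are hypothetical and not taken from tables.sql.

```python
import re

# Matches the quoted file name in "LOAD DATA [LOCAL] INFILE '...'" statements.
# Assumption: tables.sql quotes the path with single quotes.
INFILE_PATTERN = re.compile(r"(INFILE\s+')([^']*)(')", re.IGNORECASE)


def rewrite_infile_paths(sql_text: str, data_dir: str) -> str:
    """Point every INFILE path at data_dir, keeping the original file name."""

    def replace(match: re.Match) -> str:
        # Keep only the last path component (handles / and \ separators).
        file_name = re.split(r"[\\/]", match.group(2))[-1]
        new_path = data_dir.rstrip("/") + "/" + file_name
        return match.group(1) + new_path + match.group(3)

    return INFILE_PATTERN.sub(replace, sql_text)


if __name__ == "__main__":
    # Hypothetical statement; the real ones are at the end of tables.sql.
    example = "LOAD DATA LOCAL INFILE '/old/path/tas' INTO TABLE tas;"
    print(rewrite_infile_paths(example, "/data/bibsonomy"))
```

The rewritten text can then be saved as a new SQL file and run with the mysql client as usual.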

The dataset consists of seven files:

These are tab-separated files: each line represents a row, and the fields of each row are delimited by a tab character. Please note that the fields themselves can contain line breaks, which are escaped by MySQL. The best way to load the data into a MySQL database is to use the LOAD DATA statement.
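If you want to read the files without a MySQL server, line breaks inside fields mean that a plain line-by-line split on tabs is not enough. The sketch below assumes the files use MySQL's default text export conventions: tab as field separator, newline as row terminator, backslash as escape character, and `\N` for NULL. The function name `parse_mysql_dump` is hypothetical, and the escaping options actually used for this dataset should be checked against the files.

```python
from typing import Iterator, List, Optional

# Backslash escape sequences used by MySQL's default text export format.
_ESCAPES = {"0": "\0", "b": "\b", "n": "\n", "r": "\r", "t": "\t", "Z": "\x1a"}


def parse_mysql_dump(text: str) -> Iterator[List[Optional[str]]]:
    """Yield rows (lists of fields) from backslash-escaped, tab-separated text."""
    row: List[Optional[str]] = []
    field: List[str] = []
    is_null = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            # A field consisting only of \N stands for NULL.
            if nxt == "N" and not field and text[i + 2:i + 3] in ("", "\t", "\n"):
                is_null = True
            else:
                # Known sequences are translated; any other escaped character
                # (including an escaped tab or newline) is taken literally.
                field.append(_ESCAPES.get(nxt, nxt))
            i += 2
            continue
        if ch == "\t" or ch == "\n":
            row.append(None if is_null else "".join(field))
            field, is_null = [], False
            if ch == "\n":
                yield row
                row = []
        else:
            field.append(ch)
        i += 1
    # Flush a final row that is not terminated by a newline.
    if field or row or is_null:
        row.append(None if is_null else "".join(field))
        yield row
```

For the larger files, reading the whole text into memory may be impractical; the same state machine can be fed character by character from a buffered file object instead.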

The fields of each row correspond to the following columns:

Files tas and tas_spam

Tag Assignments: Fact table; who attached which tag to which resource/content

  1. user (number; user names are anonymized)
  2. tag
  3. content_id (matches bookmark.content_id or bibtex.content_id)
  4. content_type (1 = bookmark, 2 = bibtex)
  5. date
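As an illustration of how the tag assignment rows can be turned into a structure for analysis, the following sketch groups them by (user, content_id), so that each entry holds the set of tags one user attached to one item. The names `TagAssignment` and `group_into_posts` are hypothetical, the sample values in the usage lines are invented, and the date is kept as a plain string because its format is not specified here.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class TagAssignment(NamedTuple):
    """One row of tas or tas_spam, in the column order listed above."""
    user: int
    tag: str
    content_id: int
    content_type: int  # 1 = bookmark, 2 = bibtex
    date: str          # kept as text; the format is an open assumption


def group_into_posts(
    rows: Iterable[TagAssignment],
) -> Dict[Tuple[int, int], Set[str]]:
    """Map each (user, content_id) pair to the set of tags assigned to it."""
    posts: Dict[Tuple[int, int], Set[str]] = defaultdict(set)
    for row in rows:
        posts[(row.user, row.content_id)].add(row.tag)
    return dict(posts)


if __name__ == "__main__":
    # Invented sample rows, for illustration only.
    sample = [
        TagAssignment(1, "web", 10, 1, "2008-01-01 00:00:00"),
        TagAssignment(1, "social", 10, 1, "2008-01-01 00:00:00"),
        TagAssignment(2, "web", 11, 2, "2008-01-02 00:00:00"),
    ]
    for key, tags in group_into_posts(sample).items():
        print(key, sorted(tags))
```

From this mapping it is a small step to user-tag, tag-resource, or tag-tag co-occurrence networks, depending on what you want to visualise.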

Files bookmark and bookmark_spam

Dimension table for bookmark data

  1. content_id (matches tas.content_id)
  2. url_hash (the URL as md5 hash)
  3. url
  4. description
  5. extended description
  6. date
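The url_hash column makes it possible to compare URLs without comparing the full strings. A minimal sketch of computing such a hash with Python's standard library follows. It assumes the hash is the hexadecimal MD5 digest of the UTF-8 encoded URL; whether BibSonomy normalises the URL before hashing is not stated here, so compare the result against a few rows of the data before relying on it. The function name `url_md5` is hypothetical.

```python
import hashlib


def url_md5(url: str) -> str:
    """Return the hexadecimal MD5 digest of a URL (UTF-8 encoded)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def hash_matches(url: str, url_hash: str) -> bool:
    """Check whether a stored url_hash agrees with the recomputed digest."""
    return url_md5(url) == url_hash.lower()
```

If `hash_matches` fails for rows taken from the bookmark file, the assumption about encoding or normalisation does not hold for this dataset.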

Files bibtex and bibtex_spam

Dimension table for BibTeX data

  1. content_id (matches tas.content_id)
  2. journal volume
  3. chapter
  4. edition
  5. month
  6. day
  7. booktitle
  8. howPublished
  9. institution
  10. organization
  11. publisher
  12. address
  13. school
  14. series
  15. bibtexKey (the bibtex key (in the @... line))
  16. url
  17. type
  18. description
  19. annote
  20. note
  21. pages
  22. bKey (the "key" field)
  23. number
  24. crossref
  25. misc
  26. bibtexAbstract
  27. simhash0 (hash for duplicate detection within a user -- strict -- (obsolete))
  28. simhash1 (hash for duplicate detection among users -- sloppy --)
  29. simhash2 (hash for duplicate detection within a user -- strict --)
  30. entrytype
  31. title
  32. author
  33. editor
  34. year

File user

Maps each user to a spam label (non-spammer or spammer). This file can be used for spam classification.

  1. user (matches tas.user)
  2. spam flag (0 = non-spammer, 1 = spammer)
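To show how the spam flag can serve as a label for classification, the sketch below joins tag assignment rows with the user file and derives two simple per-user features. The features (number of assignments and number of distinct tags) are illustrative choices, not part of the dataset, and the function name `build_labelled_examples` is hypothetical. Rows are expected as sequences of strings in the column orders listed above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Set, Tuple


def build_labelled_examples(
    tas_rows: Iterable[Sequence[str]],
    user_rows: Iterable[Sequence[str]],
) -> List[Tuple[str, int, int, int]]:
    """Return (user, assignments, distinct_tags, spam_flag) per labelled user."""
    # user file: column 1 is the user, column 2 the flag (0 or 1).
    spam_flag: Dict[str, int] = {row[0]: int(row[1]) for row in user_rows}

    assignments: Dict[str, int] = defaultdict(int)
    tags: Dict[str, Set[str]] = defaultdict(set)
    for row in tas_rows:
        user, tag = row[0], row[1]
        assignments[user] += 1
        tags[user].add(tag)

    # Keep only users that have both a label and at least one assignment.
    return [
        (user, assignments[user], len(tags[user]), flag)
        for user, flag in sorted(spam_flag.items())
        if user in assignments
    ]
```

The resulting tuples can be passed to any classifier of your choice. Remember to feed in the rows of both tas and tas_spam, otherwise only one class will have features.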

Size of Files

Number of lines in files:

  1. tas 1,376,048 / tas_spam 22,288,129
  2. bookmark 254,146 / bookmark_spam 3,479,567
  3. bibtex 556,357 / bibtex_spam 5,813
  4. user_spam 57,803


To contact us please send a mail to

The viszards session is supported by the European Project Tagora - Semiotic Dynamics in Online Social Communities.