Sunbelt Viszards Session 2009

Introduction

This year's Viszards Session takes place in the new area of social bookmarking and is about visualising the content of the publication sharing system BibSonomy, which is hosted by the Knowledge and Data Engineering Group of the University of Kassel.

To get started with the tasks, we suggest that you familiarise yourself with BibSonomy. A more formal description of the underlying structure -- called a folksonomy -- is given in this paper (pdf here), which also describes the BibSonomy components. Your next step is to subscribe to the mailing list viszards09. We will use the list to distribute news about the data and other relevant information. The list can also be used to clarify questions about the dataset and the different tasks. Because the welcome message on the list explains how to access the dataset, subscribing to the list is essential for participating in the Viszards Session.

Program

Dataset

To access the dataset please subscribe to the viszards09 mailing list. The welcome message will contain all information to access the dataset.

The dataset was created from a MySQL database using the mysqldump command. Each file corresponds to one table. The CREATE statements for the tables can be found in the file tables.sql, together with the LOAD DATA statements that insert the data into the database. For the LOAD DATA statements to work, you must adapt the paths to the data files at the end of tables.sql.

The dataset consists of seven files:

tas, tas_spam
bookmark, bookmark_spam
bibtex, bibtex_spam
user

These are tab-separated files: each row is terminated by a line break and the fields of each row are delimited by a tab character. Please note that the fields themselves can contain line breaks, which are escaped by MySQL, so a single row may span several physical lines. The best way to load the data into a MySQL database is to use the LOAD DATA statement.
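If you prefer to read the files without a MySQL server, the following Python sketch shows one way to split the text into rows and fields. It assumes that the files use MySQL's default export escaping (backslash as the escape character, \N for NULL), which is not confirmed here; please check against tables.sql. The function name parse_mysql_tsv is our own invention.

```python
from typing import Iterator, List, Optional

# Escape sequences that MySQL interprets by default when reading such files.
# Any other escaped character (e.g. an escaped tab or line break) stands for itself.
_ESCAPES = {"0": "\0", "b": "\b", "n": "\n", "r": "\r", "t": "\t", "Z": "\x1a"}


def _finish(chars: List[str], saw_null: bool) -> Optional[str]:
    """Turn the collected characters of one field into its value."""
    if saw_null and not chars:
        return None  # the field consisted of \N only, i.e. NULL
    return ("N" if saw_null else "") + "".join(chars)


def parse_mysql_tsv(text: str) -> Iterator[List[Optional[str]]]:
    """Yield one list of field values per row (assumed default MySQL escaping)."""
    row: List[Optional[str]] = []
    chars: List[str] = []
    saw_null = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == "\\" and i + 1 < n:
            nxt = text[i + 1]
            if nxt == "N" and not chars:
                saw_null = True  # possible NULL marker at the start of a field
            else:
                chars.append(_ESCAPES.get(nxt, nxt))
            i += 2
            continue
        if ch == "\t" or ch == "\n":
            # An unescaped tab ends the field; an unescaped line break ends the row.
            row.append(_finish(chars, saw_null))
            chars, saw_null = [], False
            if ch == "\n":
                yield row
                row = []
        else:
            chars.append(ch)
        i += 1
    if chars or row or saw_null:
        # Last row without a trailing line break.
        row.append(_finish(chars, saw_null))
        yield row
```

When reading from disk, open the file with newline="" so that Python does not translate line endings inside fields; the character encoding of the files is not stated here, so treat any choice (for example UTF-8) as an assumption.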

The fields of each row correspond to the following columns:

Files tas and tas_spam

Tag Assignments: Fact table; who attached which tag to which resource/content

user (number; user names are anonymized)
tag
content_id (matches bookmark.content_id or bibtex.content_id)
content_type (1 = bookmark, 2 = bibtex)
date
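
As an illustration of how these columns can be used, the Python sketch below turns one already-split row into a typed record and collects the tags attached to each content_id. The names TagAssignment, to_tag_assignment and tags_by_content are hypothetical, and the date is kept as plain text because its format is not specified here.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Sequence, Set

# Meaning of the content_type column as documented above.
CONTENT_TYPES = {1: "bookmark", 2: "bibtex"}


class TagAssignment(NamedTuple):
    user: int          # anonymised user number
    tag: str
    content_id: int    # matches bookmark.content_id or bibtex.content_id
    content_type: int  # 1 = bookmark, 2 = bibtex
    date: str          # left as text; the date format is an open assumption


def to_tag_assignment(fields: Sequence[str]) -> TagAssignment:
    """Convert the five fields of a tas row into a typed record."""
    user, tag, content_id, content_type, date = fields
    return TagAssignment(int(user), tag, int(content_id), int(content_type), date)


def tags_by_content(rows: Iterable[TagAssignment]) -> Dict[int, Set[str]]:
    """Group tags by the resource (content_id) they were attached to."""
    grouped: Dict[int, Set[str]] = defaultdict(set)
    for row in rows:
        grouped[row.content_id].add(row.tag)
    return dict(grouped)
```

The resulting mapping can then be joined with the bookmark or bibtex rows via content_id, using CONTENT_TYPES to decide which of the two files to look in.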

Files bookmark and bookmark_spam

Dimension table for bookmark data

content_id (matches tas.content_id)
url_hash (the URL as md5 hash)
url
description
extended description
date
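
The relation between url and url_hash can be checked with a few lines of Python. This is only a sketch: it assumes that the hash is the hexadecimal MD5 digest of the UTF-8 encoded URL string exactly as stored, which is not confirmed here. The function names are hypothetical.

```python
import hashlib


def url_md5(url: str) -> str:
    """Hex MD5 digest of the URL (assumption: computed over UTF-8 bytes)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def hash_matches(url: str, url_hash: str) -> bool:
    """Compare a recomputed digest with the stored url_hash, ignoring case."""
    return url_md5(url) == url_hash.strip().lower()
```

If the recomputed digests do not match the stored values, the URLs were probably normalised or encoded differently before hashing, and url_hash should then simply be treated as an opaque identifier.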

Files bibtex and bibtex_spam

Dimension table for BibTeX data

content_id (matches tas.content_id)
journal
volume
chapter
edition
month
day
booktitle
howPublished
institution
organization
publisher
address
school
series
bibtexKey (the BibTeX key, given in the @... line)
url
type
description
annote
note
pages
bKey (the "key" field)
number
crossref
misc
bibtexAbstract
simhash0 (hash for duplicate detection within a user -- strict -- (obsolete))
simhash1 (hash for duplicate detection among users -- sloppy --)
simhash2 (hash for duplicate detection within a user -- strict --)
entrytype
title
author
editor
year

File user

Mapping of non-spammer / spammer for each user. This file can be used for spam classification.

user (matches tas.user)
spam flag (0 = non-spammer, 1 = spammer)
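
As a starting point for spam classification, the following Python sketch combines rows from tas and tas_spam with the user file to produce, for every user, the set of tags used together with the spam label. The function name labelled_tag_profiles is hypothetical, and the rows are assumed to be already split into lists of field strings in the column order given above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Set, Tuple


def labelled_tag_profiles(
    tas_rows: Iterable[Sequence[str]],
    user_rows: Iterable[Sequence[str]],
) -> Dict[int, Tuple[Set[str], int]]:
    """Map each user to (set of tags used, spam flag).

    tas_rows:  rows from tas and/or tas_spam (user, tag, content_id, ...)
    user_rows: rows from the user file (user, spam flag)
    """
    # Spam flag per user: 0 = non-spammer, 1 = spammer.
    flags = {int(user): int(flag) for user, flag in user_rows}

    tags: Dict[int, Set[str]] = defaultdict(set)
    for fields in tas_rows:
        tags[int(fields[0])].add(fields[1])

    # Keep only users for which a label is available.
    return {user: (used, flags[user]) for user, used in tags.items() if user in flags}
```

The tag sets are only one possible feature; counts of posts, URLs or dates could be collected in the same way.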

Size of Files

Number of lines in files:

tas 1,376,048 / tas_spam 22,288,129
bookmark 254,146 / bookmark_spam 3,479,567
bibtex 556,357 / bibtex_spam 5,813
user_spam 57,803

Contact

To contact us please send a mail to viszards09-info@cs.uni-kassel.de.

Robert Jäschke, University of Kassel
Gerd Stumme, University of Kassel

The viszards session is supported by the European Project Tagora - Semiotic Dynamics in Online Social Communities.
