Description of the dataset

The dataset can be loaded into a MySQL database. The CREATE statements for the corresponding tables (one table per file) are in the file tables.sql.

The dataset consists of seven tab-separated files. Their columns are listed below, file by file.

Files tas and tas_spam

Tag ASsignments: Fact table; who attached which tag to which resource/content

  1. user (number; user names are anonymized)
  2. tag
  3. content_id (matches bookmark.content_id or bibtex.content_id)
  4. content_type (1 = bookmark, 2 = bibtex)
  5. date
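
As an illustration of the five-column layout above, here is a minimal Python sketch that reads tas rows from a tab-separated file. The names TagAssignment and read_tas are hypothetical helpers, not part of the dataset. The sketch assumes that the file has no header line, that fields are not quoted, and that the date can be kept as raw text, since its format is not specified here.

```python
import csv
from typing import Iterator, NamedTuple, TextIO


class TagAssignment(NamedTuple):
    """One line of tas or tas_spam (hypothetical helper type)."""
    user: int
    tag: str
    content_id: int
    content_type: int  # 1 = bookmark, 2 = bibtex
    date: str          # kept as raw text; the date format is an open assumption


def read_tas(handle: TextIO) -> Iterator[TagAssignment]:
    """Yield one TagAssignment per well-formed line of a tas file."""
    # QUOTE_NONE: assume fields are plain text separated by tabs, never quoted.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 5:
            # Assumption: lines without exactly five fields are skipped.
            continue
        user, tag, content_id, content_type, date = row
        yield TagAssignment(int(user), tag, int(content_id),
                            int(content_type), date)
```

A typical call would be `with open("tas", encoding="utf-8", newline="") as f: rows = list(read_tas(f))`, where both the path and the encoding are assumptions.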

Files bookmark and bookmark_spam

Dimension table for bookmark data

  1. content_id (matches tas.content_id)
  2. url_hash (the URL as md5 hash)
  3. url
  4. description
  5. extended description
  6. date
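
The url_hash column is described as the MD5 hash of the URL. The Python sketch below shows one way such a hash could be computed. It assumes that the URL is hashed as UTF-8 bytes without any prior normalization and that the hash is stored as a hexadecimal string; neither detail is confirmed by this description. The function name url_md5 is hypothetical.

```python
import hashlib


def url_md5(url: str) -> str:
    """Return the MD5 hex digest of a URL (assumed: UTF-8, no normalization)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()
```

If the values computed this way do not match the url_hash column for known URLs, the dataset probably normalizes or encodes URLs differently before hashing.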

Files bibtex and bibtex_spam

Dimension table for BibTeX data

  1. content_id (matches tas.content_id)
  2. journal volume
  3. chapter
  4. edition
  5. month
  6. day
  7. booktitle
  8. howPublished
  9. institution
  10. organization
  11. publisher
  12. address
  13. school
  14. series
  15. bibtexKey (the BibTeX key, as given in the @... line)
  16. url
  17. type
  18. description
  19. annote
  20. note
  21. pages
  22. bKey (the "key" field)
  23. number
  24. crossref
  25. misc
  26. bibtexAbstract
  27. simhash0 (hash for duplicate detection within a user -- strict -- (obsolete))
  28. simhash1 (hash for duplicate detection among users -- sloppy --)
  29. simhash2 (hash for duplicate detection within a user -- strict --)
  30. entrytype
  31. title
  32. author
  33. editor
  34. year

File user

Maps each user to a spam flag (non-spammer or spammer). This file can be used for spam classification.

  1. user (matches tas.user)
  2. spam flag (0 = non-spammer, 1 = spammer)
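
To show how the spam flag can be combined with the tas table, here is a minimal Python sketch that loads the user file into a dictionary and keeps only the rows of non-spammers. The function names are hypothetical. The sketch assumes a headerless two-column file, and it drops rows whose user is missing from the mapping, which is a conservative choice rather than something the dataset prescribes.

```python
from typing import Dict, Iterable, Iterator, Sequence, TextIO


def read_spam_flags(handle: TextIO) -> Dict[int, bool]:
    """Map user number -> True if spammer (flag 1), False otherwise (flag 0)."""
    flags: Dict[int, bool] = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore empty lines
        user, flag = line.split("\t")
        flags[int(user)] = (flag == "1")
    return flags


def keep_non_spammers(rows: Iterable[Sequence[str]],
                      flags: Dict[int, bool]) -> Iterator[Sequence[str]]:
    """Yield tas rows (user in column 1) whose user is flagged as non-spammer."""
    for row in rows:
        # Assumption: users missing from the mapping are dropped as well.
        if flags.get(int(row[0])) is False:
            yield row
```

The same mapping can also serve as the label source when training a spam classifier on the tas, bookmark, or bibtex data.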

Size of Files

Number of lines in files:

  1. tas 816,197 / tas_spam 13,258,759
  2. bookmark 181,833 / bookmark_spam 2,059,991
  3. bibtex 219,417 / bibtex_spam 716
  4. user_spam 31,715

Additional Files

For the tag recommender competition, the tas table of the test dataset will not contain tags, because predicting them is the task. Instead, it contains exactly one line per post, with the tag set to null. No information about the actual number of tag assignments will be given.
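
The one-line-per-post format can be illustrated with a short Python sketch that groups tas rows into posts and picks out the posts whose tags still have to be predicted. Two assumptions are made here that this description does not confirm: a post is identified by the combination of user, content_id, and content_type, and the placeholder tag is the literal string "null". The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Set, Tuple

PostKey = Tuple[str, str, str]  # (user, content_id, content_type), an assumption


def group_posts(rows: Iterable[Sequence[str]]) -> Dict[PostKey, Set[str]]:
    """Collect the set of tags for each post from five-column tas rows."""
    posts: Dict[PostKey, Set[str]] = defaultdict(set)
    for user, tag, content_id, content_type, _date in rows:
        posts[(user, content_id, content_type)].add(tag)
    return dict(posts)


def posts_to_predict(posts: Dict[PostKey, Set[str]]) -> List[PostKey]:
    """Return the posts whose only tag is the placeholder "null"."""
    # Assumption: the placeholder appears as the literal text "null".
    return [key for key, tags in posts.items() if tags == {"null"}]
```

Applied to a training file, group_posts yields the full tag set per post; applied to the test file, every post should appear in the result of posts_to_predict.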

You can download a version of the training tas file converted to the described format here: tas_testing_recommender.gz.