2nd Leaderboard

Nothing changed this week since there were no new result submissions. We are waiting for your recommender result submissions for the next leaderboard, which will be online on April 12th.

Pos Diff Team Name Score

TeamUFCG 0,0262

TomFu 0,0158


Our first baby name heroes are the teams TeamUFCG and TomFu. We are waiting for your recommender result submissions for the next leaderboard, which will be online on April 5th.

Pos Diff Team Name Score

TeamUFCG 0,0262

TomFu 0,0158

Teaser #3: I follow whom my name is alike

Once again, it’s time for some number crunching fun… Today, I looked at the interrelationship of first names within Twitter’s Follower graph and got some beautiful results.

For the analysis, I used an excerpt of the Follower graph, consisting of 1,486,403 users and 72,590,619 links (as described here), as well as the name co-occurrence graph based on the English Wikipedia corpus, which is used for calculating name similarities in Nameling (as described in the Nameling papers). The first names of Twitter users were extracted from the users’ profile data, where a user may provide her or his full name. Of course, many users just entered some fantasy name. Accordingly, the first token of the provided name string that matched our list of known names was chosen as the user’s first name. This process introduces some noise into the data, but given the vast number of user pairs considered, the effect should be negligible.
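The first-name extraction described above can be sketched roughly as follows. This is only an illustration, not the code used for the analysis: the function name `extract_first_name`, the tokenization rule (split on anything that is not a letter) and the lowercase matching are all assumptions, and the tiny `known` set stands in for the real list of known names.

```python
import re


def extract_first_name(display_name, known_names):
    """Return the first token of display_name that is a known name, else None.

    Assumption: tokens are maximal runs of letters, compared in lowercase.
    """
    # Split on non-letters (punctuation, whitespace, digits, underscores).
    for token in re.split(r"[\W\d_]+", display_name):
        candidate = token.lower()
        if candidate in known_names:
            return candidate
    return None


# Hypothetical stand-in for the list of known names.
known = {"anna", "maria", "tom"}

print(extract_first_name("Dr. Anna-Maria Schmidt", known))
print(extract_first_name("xX_tom_Xx", known))
print(extract_first_name("c0ffee lover", known))
```

Profiles for which no token matches simply yield `None` and would be left out of the name similarity statistics.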

Now, relative to 3,078 randomly chosen users (our Linux cluster is still crunching on more), I calculated the average name similarity of direct neighbours in the Follower graph, the average name similarity between pairs of users at a (shortest path) distance of two, …of three, and so on. For reference, I also added the total average name similarity for all considered pairs of users, as depicted by the grey dashed line. Finally, the error bars correspond to the 95% confidence interval.
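For readers who want to try something similar on their own data, here is a minimal sketch of the computation: a breadth-first search from each sampled user, name similarities grouped by shortest path distance, and a mean with a 95% confidence interval per distance. Several things here are assumptions rather than details of the actual analysis: the graph is a plain adjacency dictionary followed in the direction of its edges, `similarity` is a placeholder for the Nameling similarity, and the interval uses the normal approximation (1.96 standard errors).

```python
import math
from collections import defaultdict, deque


def bfs_distances(graph, source):
    """Shortest path distance (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def similarity_by_distance(graph, first_name, similarity, sources):
    """Map distance -> (mean similarity, 95% CI half-width, number of pairs)."""
    samples = defaultdict(list)
    for s in sources:
        if s not in first_name:
            continue  # no first name could be extracted for this user
        for v, d in bfs_distances(graph, s).items():
            if d > 0 and v in first_name:
                samples[d].append(similarity(first_name[s], first_name[v]))

    result = {}
    for d, values in sorted(samples.items()):
        n = len(values)
        mean = sum(values) / n
        if n > 1:
            variance = sum((x - mean) ** 2 for x in values) / (n - 1)
            half_width = 1.96 * math.sqrt(variance / n)  # normal approximation
        else:
            half_width = float("nan")  # no interval from a single pair
        result[d] = (mean, half_width, n)
    return result


# Toy data, purely hypothetical.
toy_graph = {"a": ["b"], "b": ["c"], "c": []}
toy_names = {"a": "anna", "b": "anne", "c": "tom"}


def toy_similarity(x, y):
    """Placeholder: 1.0 if the names share an initial, else 0.0."""
    return 1.0 if x[0] == y[0] else 0.0


for d, (mean, half, n) in similarity_by_distance(
        toy_graph, toy_names, toy_similarity, ["a", "b"]).items():
    print(d, mean, half, n)
```

On a graph with millions of users, a full breadth-first search per sampled user is expensive, which is presumably one reason to start from a random sample of users rather than from all of them.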

As we can see, users who are closer to each other in the follower graph tend to have more similar names than distant users. Moreover, the average name similarity decreases monotonically with the shortest path distance in the follower graph. Users at a distance of up to three tend to have more similar names than average, whereas users at shortest path distances above three tend to have less similar names than average.

Stay tuned for more results (e.g. considering the ReTweet graph and ReTweet frequency) and happy number crunching!

Participants from 16 countries!

Aloha! Привет! Hola! Shalom! السلام علیکم Olá! Salut! Hallo! Cześć! Hi! 你好 Pozdravljeni! …

Participants from 16 countries have already registered for the challenge. Don’t miss the chance and join! First results will be published on April 1st. We are currently preparing the leaderboard, which will then be updated every Friday.

As deviating naming habits emerge from different cultural contexts, we are also looking forward to inspiring conversations at the workshop!

Script for splitting training and test data updated

Today we were made aware of some inconveniences in the Perl script for splitting your training and test set from the public challenge data:

  • the output file names were not consistent with the description (fixed)
  • the anonymous user ids were anonymized another time (fixed)
  • different date representation was expected (fixed)

The script is still not user-friendly, but at least it prints out some messages now. We added an FAQ entry which exemplifies the process of splitting the public data.

Teaser #2: Given Names and the Co-Authorship Relation

Science is universal, independent of cultural prejudices and political boundaries – and of course, independent of an author’s name. Or not? Try calculating the average similarity of your name with all of your co-authors’ names, and the average name similarity with your co-authors’ co-authors, and so on.

Read more on our analysis of Paul Erdős’ Collaboration network.

Recommending Given Names

We are pleased to organize this year’s ECML PKDD Discovery Challenge, tackling the task of recommending given names. The challenge comprises two phases:

  1. First, there will be an offline competition, where participants predict future search activities based on a training data set which is derived from the name search website nameling.
  2. Then, there will also be an online competition, where participants integrate their recommender systems into the nameling website.

Of course, there will be prizes, which we will announce later!