Following Trackbacks

It's been a really long time since I wrote an article, but here I am again. For the last month, I've been working on something awesome: trying to understand how information spreads on the internet. This basically involves a lot of graph theory, a little sociology and a huge amount of intuition.

What I call information diffusion is basically a tree whose nodes are web pages/articles. Let u and v be two nodes of the tree: if v is a child of u, then the information posted on v was first read on u. The root is the original source of the information (or at least its first observable occurrence).
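To make the definition concrete, here is a minimal sketch of such a tree stored as a child-to-parent mapping. The article names are hypothetical placeholders, not real data, and this is only one possible representation.

```python
# Hypothetical article identifiers; none of these come from real data.
# Each key is an article; its value is the article where the author first
# read the information. None marks the root (the original source).
first_read_on = {
    "source": None,
    "blog_a": "source",
    "blog_b": "source",
    "blog_c": "blog_a",
}


def path_to_root(article, parents):
    """Return the chain of articles from `article` back to the original source."""
    path = [article]
    while parents[path[-1]] is not None:
        path.append(parents[path[-1]])
    return path


print(path_to_root("blog_c", first_read_on))
```

Walking up the parent links from any article retraces the path the information supposedly took to reach it.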

The difficult part is building such a tree. There are several papers on the subject, but most of them reconstruct the tree using semantic wizardry: take a blog network, identify subjects (whatever that may mean) and, using statistics and temporal information, try to guess the path the rumor/buzz/information took. While this method is OK and gives results, I'm not convinced it is the best way of doing things. I strongly believe that whatever results one wants to see, there is a way to reconstruct the diffusion to fit those specific results, and this might influence the measurement.

Then I thought that on blogs we have this wonderful concept called the trackback. A couple of weeks ago I firmly believed that you could actually track the diffusion of a piece of information just by looking at the trackbacks. The idea was pretty simple: I made the assumption that if blogger B reads something on blogger A's website, he'll create a trackback to it.

So I wrote a few lines of Python and, using the Technorati API, I built a few trees. In each of those, the nodes are the articles, and there is a (directed) edge between two nodes if the target has made a trackback to the source.
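The construction can be sketched roughly as follows. This is not the original script: it assumes the trackbacks have already been fetched and are available as (source, target) pairs, and it makes no Technorati API calls. The function name and the sample pairs are hypothetical.

```python
from collections import defaultdict


def build_trackback_graph(trackbacks):
    """Build adjacency lists from (source, target) pairs.

    Each pair means that `target` made a trackback to `source`, so the
    directed edge goes from `source` to `target`.
    """
    children = defaultdict(list)
    for source, target in trackbacks:
        children[source].append(target)
    return dict(children)


# Hypothetical sample data, for illustration only.
sample_trackbacks = [
    ("source", "blog_a"),
    ("source", "blog_b"),
    ("blog_a", "blog_c"),
]

graph = build_trackback_graph(sample_trackbacks)
print(graph)
```

Articles that nobody tracks back to simply do not appear as keys, so they are the leaves of the tree.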

Here is an example: I looked at how the announcement of YouTube's acquisition by Google spread from two different sources. The obvious thing is that the tree is very wide and not deep at all. And this is not an artifact: I have quite a lot of data on trackbacks for other pieces of information, and they all follow the same pattern: almost everybody makes a trackback to the source, and trackbacks to any other article are extremely rare.
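One way to put numbers on this shape is to compute the depth of a tree and the size of its widest level. The sketch below is only an illustration on a made-up tree, not the measured data; the function name and the node names are hypothetical.

```python
def depth_and_max_width(children, root):
    """Walk the tree level by level from `root`.

    Returns (depth, max_width), where depth is the number of edges on the
    longest path from the root and max_width is the largest number of
    nodes found on a single level.
    """
    depth = -1
    max_width = 0
    seen = {root}
    level = [root]
    while level:
        depth += 1
        max_width = max(max_width, len(level))
        next_level = []
        for node in level:
            for child in children.get(node, []):
                if child not in seen:  # guard against duplicate edges or cycles
                    seen.add(child)
                    next_level.append(child)
        level = next_level
    return depth, max_width


# Made-up tree: five articles track back to the source, and only one
# article tracks back to something other than the source.
toy_tree = {
    "source": ["a", "b", "c", "d", "e"],
    "a": ["f"],
}

print(depth_and_max_width(toy_tree, "source"))
```

A tree where almost everybody tracks back to the source gives a small depth and a large width, which is the pattern described above.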

I started wondering where I had made a mistake, and after some reflection I came to the conclusion that there is a very simple yet enlightening explanation: things do not work that way at all, and trackbacks do not capture information diffusion.

I'll give you a real-life illustration: let's say you have a friend and this guy tells you something like "Hey, I saw an awesome video on YouTube yesterday, blablabla". Let's assume that you're interested: you'll go and watch the video, and when you talk about it, you'll never say "One of my pals told me he saw this video on YouTube, check it out"; you'll directly cite the original source. This is exactly what happens on the internet.

Moreover, there is another important thing to consider: when blog A tracks back to blog B, there is usually a link on blog B to blog A. And if blog B is TechCrunch (or any other huge blog), it's fair to assume that a little traffic will leak to blog A. This is an incentive to track back to highly visited blogs (which happen to be, in most cases, the source of the information).

You might wonder what I'll be doing now that I've realized that the whole trackback stuff doesn't work out so well… I'm planning on launching something quite interesting in the coming weeks, which might help in understanding how information flows from blog to blog, so stay tuned ;)
