The Data Science of Human Behavior

Burt Kaliski | Jun 15, 2012

One of my favorite data scientists of the “next web” generation is Hilary Mason, chief scientist of Bitly.

Hilary was the presenter at the Verisign Labs Distinguished Speakers Series in late May, and brought a fascinating perspective on what Bitly is learning about human behavior through its URL shortening service.  The company is asking questions like: What links are users shortening so they can share them with other users?  And what shortened links are they clicking on?  It turns out that there’s a difference:  users tend to shorten more links that make them seem intelligent – such as world news – yet click on more links that arguably would make them seem less so – like celebrity gossip.  The links that are shared and clicked on more are the lowest common denominator among the ones that were initially shortened.

The pattern of clicks on a newly shortened link, Hilary explained, is fairly consistent: an initial increase as user interest grows and the link propagates through a social network, followed by a decay as attention dissipates. The same pattern applies whether the link goes to breaking news or “evergreen content.” However, the time constant (how long it takes until the link reaches its “half-life”) can vary significantly. The half-life also varies by medium. A YouTube link takes much longer to propagate than a Twitter or Facebook link, which is shared almost immediately, presumably because of the time it takes for a user to settle down, view a video, and share its link.
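To make the half-life idea concrete, here is a minimal sketch in Python. The talk summary above does not give a formal definition, so this assumes one: the half-life is the number of time intervals after the peak by which the link has collected half of all its post-peak clicks. The function name `half_life` and the click counts are hypothetical, made up purely for illustration, and are not Bitly data or Bitly's method.

```python
from typing import Sequence


def half_life(clicks_per_interval: Sequence[int]) -> int:
    """Intervals after the peak needed to collect half of the post-peak clicks.

    Assumed definition for illustration only; the source does not specify one.
    """
    if not clicks_per_interval or sum(clicks_per_interval) == 0:
        raise ValueError("need at least one click")

    # Index of the busiest interval (the first one, if there is a tie).
    peak = max(range(len(clicks_per_interval)),
               key=clicks_per_interval.__getitem__)

    # Everything after the peak is the decay phase.
    tail = clicks_per_interval[peak + 1:]
    total = sum(tail)
    if total == 0:
        return 0  # no clicks after the peak, so there is no decay to measure

    running = 0
    for offset, count in enumerate(tail, start=1):
        running += count
        if 2 * running >= total:
            return offset
    return len(tail)


# Hypothetical hourly click counts: a sharp spike and a slow burn.
breaking_news = [10, 80, 40, 20, 10, 5, 5]
evergreen = [2, 5, 8, 7, 6, 6, 5, 5, 4, 4]

print(half_life(breaking_news))  # 1
print(half_life(evergreen))      # 3
```

Both made-up series share the rise-then-decay shape, but the spiky one burns through half of its remaining attention in a single interval while the slow one takes three, which is the kind of difference in time constant described above.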

These kinds of observations about what users share and click, Hilary said, are part of observing the “greatest theater of humanity.” While the picture is at first somewhat discouraging to data scientists (“Is that what people are really spending their time on?”), it’s eventually inspiring (“This is people connecting with each other!”).

The availability of quality data facilitates all kinds of insights about usage patterns, whether it’s social sharing at the application layer, or infrastructure-layer observations around IPv6 based on DNS queries – something the Verisign Labs team is spending a good amount of time studying, particularly as World IPv6 Launch has just occurred.

I particularly liked Hilary’s four steps of machine learning at scale:

  1. Research offline
  2. Do fancy math – find the shortcuts
  3. Design infrastructure
  4. Re-design to run at scale

It’s a good model for research in general: get an idea right on paper, put it into prototype form, then put the idea into practice. There can even be a little of the Jugaad principle in the shortcut process.

Bitly, it turns out, was just a feature of another product that failed, and the data feeds, now the most valuable part of the service, were initially just a side effect. Innovation proceeds with surprises like this for those who are paying attention.

The final point I took away from the presentation was Hilary’s principle that “simple math is better than fancy math.”

There’s beauty in simplicity.  Fancy methods may get the ideal answer, but simple ones are a lot easier to work with – and usually good enough, especially for breaking new ground in the huge and mostly still unexplored space of data about what people are doing on the Internet.

What are the ways you use the Internet most? Do they involve short links?