Languages Magazine

How Not to Analyze Lots of Social Data

By Sdlong

The Atlantic reports on research from the Vermont Complex Systems Center. The research team attempted to rate states’ happiness levels by coding 10 million geo-tagged tweets according to their happiness quotient. How did they do that?

The researchers coded each tweet for its happiness content, based on the appearance and frequency of words determined by Mechanical Turk workers to be happy (rainbow, love, beauty, hope, wonderful, wine) or sad (damn, boo, ugly, smoke, hate, lied). While the researchers admit their technique ignores context, they say that for large datasets, simply counting the words and averaging their happiness content produces “reliable” results.

Ignoring context isn’t the problem. The problem lies in what they coded for, which, if the article is accurate, was an extremely limited word set. If the researchers designed algorithms to search tweets for very specific Happy Words, like rainbow and wine, then of course places like Hawaii and Napa will be high on their list of Happy Places. What the researchers created is not a map of the happiest states, or even a map of the states that have the happiest tweets, but a map of states where tweeters are most likely to tweet about things like rainbows and wine. The fact that Napa not only tops their Happiest Cities list but far outranks the other cities strongly suggests to me that their results are seriously skewed by the word set for which they coded. (I actually think the study would have been more valuable had it stuck to coding for swear words, emoticons, and acronyms.)
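To make the objection concrete, here is a minimal sketch of word-count scoring as the article describes it: look up each tweet's words in a rated list, average the ratings, then average across tweets for a place. The word ratings, function names, and sample tweets below are all invented for illustration; the article gives neither the study's actual lexicon nor its scores, so this is not the researchers' code.

```python
# Hypothetical ratings on a made-up 1-9 scale. The words come from the
# article's examples; the numbers are assumptions, not the study's values.
WORD_SCORES = {
    "rainbow": 8.0, "love": 8.5, "wine": 7.0, "hope": 7.5,
    "damn": 3.0, "hate": 2.0, "ugly": 2.5, "lied": 2.5,
}

def tweet_score(text, scores=WORD_SCORES):
    """Average the ratings of the listed words in a tweet (None if none appear)."""
    words = [w.strip(".,!?#\"'").lower() for w in text.split()]
    hits = [scores[w] for w in words if w in scores]
    return sum(hits) / len(hits) if hits else None

def region_score(tweets, scores=WORD_SCORES):
    """Average the per-tweet scores, skipping tweets with no listed words."""
    vals = [s for s in (tweet_score(t, scores) for t in tweets) if s is not None]
    return sum(vals) / len(vals) if vals else None

# Invented tweets: the "Napa" ones are emotionally neutral shop talk.
napa = [
    "Pouring wine for the tour group",
    "Wine tasting at noon",
    "New wine list is up",
]
elsewhere = [
    "I love this town",
    "Damn traffic",
    "Hope the game goes well",
]

print(region_score(napa))       # every tweet scores on the topic word "wine"
print(region_score(elsewhere))  # a mix of happy and unhappy words
```

With these made-up numbers, the region whose tweets merely mention wine outscores the one whose tweets express actual feelings. That is the skew in miniature: the score tracks vocabulary, not mood.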

Perhaps the study had a much more robust word set than the article lets on. But as described, the study clearly suffers from a problematic methodology, which is unfortunate, because 10 million geo-tagged tweets is a highly valuable data set. I hope the Vermont Complex Systems Center makes the data publicly available or designs a tighter study with the help of some linguists, rhetoricians, semanticists, or Bayesian statisticians.

