
How Google Research Revisited Mayzner, 743 Billion Words Later

Posted on the 22 February 2013 by Expectlabs @ExpectLabs

In 1965, Mark Mayzner meticulously analyzed 20,000 words from books, magazines, and newspapers using an IBM card-sorting machine to build a more complete picture of the word and letter frequencies that characterize the English language. Mayzner recently contacted Peter Norvig, Google’s head of research, to ask whether he could update the experiment using the far larger Google Books Ngram Corpus. Norvig took up the challenge and updated Mayzner’s study by analyzing 97,565 distinct words, which were mentioned more than 743 billion times in the Google data collection. That is roughly 37 million times as many word occurrences as the 20,000-word sample Mayzner used (743 billion divided by 20,000 is about 37 million).

Norvig’s chart below visualizes letter counts by word position, with the length of each bar proportional to that letter’s frequency. The results show that the most common first letter in English is T, while the most common second letter is O.

[Chart: letter counts by word position]
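For readers who want to try this kind of tally themselves, here is a minimal Python sketch of counting letters by word position. It is not Norvig’s code: the function name `letter_counts_by_position` and the toy word counts are assumptions made up for illustration, and the sketch simply weights each letter by how often its word is mentioned.

```python
from collections import Counter, defaultdict


def letter_counts_by_position(word_counts):
    """Map each 1-based letter position to a Counter of letters.

    Each letter is weighted by the number of mentions of its word.
    """
    by_position = defaultdict(Counter)
    for word, mentions in word_counts.items():
        for position, letter in enumerate(word.upper(), start=1):
            by_position[position][letter] += mentions
    return by_position


# Toy counts invented for illustration; these are NOT corpus figures.
toy_counts = {"the": 50, "of": 30, "to": 20, "on": 5}

positions = letter_counts_by_position(toy_counts)

# Most common letter (and its weighted count) in the first two positions.
first_letter, first_count = positions[1].most_common(1)[0]
second_letter, second_count = positions[2].most_common(1)[0]
print(first_letter, first_count)
print(second_letter, second_count)
```

With only four toy words the result says nothing about English as a whole; the point is just the shape of the computation, which scales to any table of words and mention counts.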

Here are some more highlights from the report:

  • Popular Letters: E, T, and A are the most common letters in English.
  • Word Counts: The most common words in English are The, Of, And, To, In, and A.
  • Word Length: The average word length is 7.6 letters.
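The three highlights above can each be computed from a single table of words and mention counts. The Python sketch below is a hedged illustration rather than Norvig’s method: the `summarize` function and the toy data are hypothetical. It also shows that an "average word length" can be computed two ways, weighted by mentions or per distinct word, and this post does not say which one its figure uses.

```python
from collections import Counter


def summarize(word_counts, top_n=3):
    """Return letter totals, top words, and two average word lengths."""
    total_mentions = sum(word_counts.values())

    # Letter frequencies, weighted by how often each word is mentioned.
    letter_totals = Counter()
    for word, mentions in word_counts.items():
        for letter in word.upper():
            letter_totals[letter] += mentions

    # The most frequently mentioned words.
    top_words = [w for w, _ in Counter(word_counts).most_common(top_n)]

    # Average length weighted by mentions vs. counted once per distinct word.
    weighted_avg = sum(len(w) * m for w, m in word_counts.items()) / total_mentions
    distinct_avg = sum(len(w) for w in word_counts) / len(word_counts)
    return letter_totals, top_words, weighted_avg, distinct_avg


# Toy counts invented for illustration; these are NOT corpus figures.
toy_counts = {"the": 50, "of": 30, "and": 15, "letter": 5}

letter_totals, top_words, weighted_avg, distinct_avg = summarize(toy_counts)
print(top_words)
print(weighted_avg, distinct_avg)
```

In the toy data the weighted average is shorter than the per-distinct-word average, because the short words are the ones mentioned most often.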

Take a look at Norvig’s extensive findings here, along with detailed explanations of his methodologies. For those of us interested in language, Mayzner’s updated study is a complete treasure trove.


