"Culturomics" via the Google Ngram Viewer

January 11, 2011

From the recently published Science (PDF) article and the Culturomics website:

The Google Labs N-gram Viewer is the first tool of its kind, capable of precisely and rapidly quantifying cultural trends based on massive quantities of data. It is a gateway to culturomics! The browser is designed to enable you to examine the frequency of words (banana) or phrases ('United States of America') in books over time. You'll be searching through over 5.2 million books: ~4% of all books ever published!

In the paper itself the authors hop around a bit, to show the potential of this kind of research. They look at the evolving size of the English lexicon and grammatical trends such as the "regularization" of irregular verbs over time. By searching for specific years (e.g. "1945", "1960") they investigate when and how often published books address the past, the present and the future, as mentions of specific years appear and disappear with time.

Carefully constructed queries can provide insights into the ways in which culture and knowledge have changed over time, and the results could conceivably be used as avenues of entry into more in-depth examinations of the particular issues or factors surrounding the query terms. As Google continues to accumulate and sort through the dataset of human written knowledge, the answers that may be found will likely become more and more robust.

(Of course, only until the day one of their servers achieves sentience and destroys us all.)

It would be remiss to omit the fact that scholars have been working on this sort of thing for years, as mentioned here. The Google effort is just a more comprehensive digital approach made possible by their scouring of humanity's data.

*

I was trying to think of words relating to a topic that may have changed drastically over the years in its treatment in published books, and for now I settled on terms about homosexuality. I want to use these terms to illustrate the basic features and limitations of this service.

Searching for a term yields a graph with the frequency of occurrence of your term along the Y-axis, and year of publication along the X-axis. Here is the result for the search term "gay":

Search for "gay"
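To make the Y-axis concrete, here is a minimal sketch of how such a frequency could be computed, assuming it is simply the term's share of all words printed in a given year, expressed as a percentage. The function name and all the counts are made up for illustration; they are not real Ngram data, and this is not necessarily how Google computes it.

```python
def relative_frequency(term_counts, total_counts):
    """Return {year: term share in percent} for every year with a nonzero total."""
    return {
        year: 100.0 * term_counts.get(year, 0) / total
        for year, total in sorted(total_counts.items())
        if total > 0
    }

# Made-up counts for illustration only; not real Ngram data.
term_counts = {1800: 120, 1850: 300, 1900: 250}
total_counts = {1800: 1_000_000, 1850: 4_000_000, 1900: 10_000_000}

freqs = relative_frequency(term_counts, total_counts)
for year, pct in freqs.items():
    print(f"{year}: {pct:.4f}%")
```

In this toy data the raw count rises between 1800 and 1850 while the share falls, because the total number of printed words grows faster. That is why a relative frequency is more informative than a raw count when the number of books per year changes so much.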

Seeing as this word was commonly used in the past to mean "happy" or "full of joy" (albeit with a tinge of promiscuity), it's not surprising to see its frequent appearance. ("Frequent" is a relative term, since the numbers are actually quite low, but this seems par for the course for any individual word.) What's interesting to a layman such as myself are the many peaks and valleys that appear in the general upward trend, followed by the steady decline from the early 1800s. For example, what's with the giant spike past 1760?

This brings us to one of many caveats that come with the dataset. For instance, one would have to dig deeper to figure out whether some of the peaks are due to erroneous metadata:

Errors in the date assigned to a book can sometimes lead to little peaks from out of nowhere; for instance if a book from 1985 gets misdated as 1885 and brings all of its 1985 lingo with it. This is especially common in 1899 and 1905 for reasons described in the question about metadata quality. This can also happen because a book first published in 1885 is reprinted much later, with a preface written much later, and the new edition is scanned and assigned a date of 1885.

What's also interesting to me is the lack of any noticeable spike around the 1890s, or just past it -- that decade was known as the Gay Nineties. But a quick look at how that terminology came about reveals why:

The term itself began to be used in the 1920s and is believed to have been created by the artist Richard V. Culter, who first released a series of drawings in Life magazine entitled "the Gay Nineties" and later published a book of drawings with the same name.

And sure enough, you can see the upward trend starting in 1920.

*

I wonder if there are questions of statistical significance involved in the analysis of such data. For instance, assuming other questions about sources of data and quality of metadata are appropriately addressed, does the drop in frequency of about 0.008% from the peak circa 1810 to just a few years later have any significance? Do issues of random variation affect the frequency of words in publications? Given the nature of the words we use and why we use them (i.e. words don't occur randomly, but have specific meanings causing them to appear non-randomly in texts with subjects appropriate to specific words; statistically, I'd be more likely to find the word "phytoplankton" in a body of text about the ocean than in one about airplanes), I would guess not, but I don't know enough about this subject of study.
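One rough way to frame the question above is to ask how much a frequency would wobble from sampling alone. The sketch below treats every printed word as an independent draw, which is exactly the assumption the paragraph doubts, so the result is likely an underestimate of the real noise. The frequency and the yearly word totals are hypothetical values chosen for illustration, not figures from the dataset.

```python
import math

def binomial_standard_error(p, n_words):
    """Standard error of an observed proportion p estimated from n_words independent draws."""
    return math.sqrt(p * (1.0 - p) / n_words)

# Hypothetical term frequency of 0.008% (as a proportion), for illustration only.
p = 0.00008

# Hypothetical total words printed in a year; real totals vary widely by year.
for n_words in (1_000_000, 100_000_000, 10_000_000_000):
    se = binomial_standard_error(p, n_words)
    print(f"{n_words:>14,} words/year -> standard error {100 * se:.6f}%")
```

Under this simplified model the noise shrinks as the yearly corpus grows, so a change of a given size is more believable in a year with many scanned books than in a year with few. Because words cluster by book and subject, a serious analysis would need a model that accounts for that clustering.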

*

Now let's compare the trend for "gay" with the trend for "homosexual":

Search for "gay" vs. "homosexual"

What's interesting here is that the increase in the use of "homosexual" appears to accompany the decrease in the use of "gay", until the early 1980s.

Finally, let's add the word "queer". Originally meaning "strange" or "unusual", it was then used as a derogatory term, but eventually it was reappropriated by the LGBT community -- although surely its use still varies. Does the graph reflect these cultural changes?

Search for "gay" vs. "homosexual" vs. "queer"

I am not a social historian, I just play one on the Internet. Therefore I don't know why there is a peak in the use of the word "queer" around the 1930s -- does this reflect its use as "strange" or "unusual", or had it already switched meanings? Does the upward trend in the 1990s represent the period of reappropriation? Does the drop in usage around World War II represent a reaction to the word's renewed use as a derogatory epithet?

According to George Chauncey's doctoral dissertation "Gay New York: Urban Culture and the Making of a Gay Male World, 1890-1940" (Yale U, 1989), what we would now term as gay men were using the word "queer" to identify themselves, and each other, in New York City by the early part of the 20th century.
It was only around 1940 that homosexual men, especially younger men, began to replace the word "queer" with the word "gay". The evidence generally suggests that it was after World War II, in the late 1940s and the early 1950s, that gay men's dislike of the word "queer", and the derogatory use of that term to describe gay men, became common.

These are rudimentary answers, obviously -- I haven't put in the time to research this thoroughly; but I think this is a fascinating way to start asking questions about a given cultural topic. Of course, this depends on the existence of printed books using your search terms in the appropriate context; while the usage of the term "queer" may have varied as described in the previous quotes, I have no direct evidence that it was being used as such in printed materials.

*

I should note that these graphs are being generated using the English dataset. There are also subsets for American English (books published in the US), British English, and English fiction, as well as datasets in other languages. This link describes the characteristics of each data set. The English set is characterized as such:

Similar to the Google Million, but not filtered by subject and with no per-year caps.

And the Google Million is:

All are in English with dates ranging from 1500 to 2008. No more than about 6000 books were chosen from any one year, which means that all of the scanned books from early years are present, and books from later years are randomly sampled. The random samplings reflect the subject distributions for the year (so there are more computer books in 2000 than 1980). Books with low OCR quality were removed, and serials were removed.
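The per-year cap described in that quote can be sketched as follows. This is only my reading of the description, not Google's actual procedure: the function name, the seed, and the toy book lists are assumptions for illustration. Plain random sampling is used here, which preserves the subject mix of a year only on average.

```python
import random

def sample_with_yearly_cap(books_by_year, cap=6000, seed=0):
    """Keep every book for years at or under the cap; randomly sample `cap` books otherwise."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sampled = {}
    for year, books in books_by_year.items():
        if len(books) <= cap:
            sampled[year] = list(books)             # early years: keep everything
        else:
            sampled[year] = rng.sample(books, cap)  # later years: random subset
    return sampled

# Toy input: a sparse early year and a crowded recent year (made-up identifiers).
books_by_year = {
    1650: [f"book-1650-{i}" for i in range(40)],
    2000: [f"book-2000-{i}" for i in range(20_000)],
}

corpus = sample_with_yearly_cap(books_by_year)
print({year: len(books) for year, books in corpus.items()})
```

The effect is that early years are represented in full while recent years are thinned out, which keeps any single year from dominating the set.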

So my impression is that the English set includes the American English and the British English sets, but there's probably something wrong with that assumption, given that some searches yielded no results under "English" yet were successful under either "American English" or "British English". The supplementary materials (PDF) for the Science paper have more detail about methodology; I should read them at some point.

I searched for other terms which have been used in the past, but most of them were found at very low frequencies, such that they would just look like flat lines along the x-axis of the graph under the scale used above.

One final observation: the content of these millions of books was digitized using OCR technology, which is not always accurate. For instance, this is the graph I obtained when I searched for the derogatory term "fag". (CAVEATS: I know it's an awful word -- except when reappropriated, I suppose -- and I know it's a contraction of "faggot", which has a non-LGBT meaning as well, but basically I wanted to use it to illustrate the OCR issue shown below. I can also imagine that even if used in context, its frequency surely couldn't ever have been that high in printed books? But I don't know for sure.)

Search for "fag"

The frequency numbers are low, but it certainly seems that this graph shows the word being used more frequently in the 1700s than in later years. At first I thought perhaps this was representative of some archaic meaning. The results pages give you links at the bottom to search for your term within specified windows of time. By following that trail, the mystery is quickly resolved:

OCR mistake

For some reason, the OCR software misread the "p" as "f" in a number of books from the 18th century. As addressed in the FAQ at culturomics.org:

By and large the OCR, the dating accuracy, and the volume of data are all much bigger issues before 1800 than after. That's why our paper doesn't use any data before 1800.

So that's another thing to keep in mind when thinking about the results of your searches.

In conclusion, I think this is a pretty interesting tool for studies of all sorts -- epistemological, sociological, you name it -- at least in terms of sparking questions. If you're willing to take the time to investigate them, every peak and valley in the results could indicate something interesting about social, cultural, and technological changes.


*    *    *