Google Ngrams Data, 1800-2000

Despite the importance of trend data for understanding key substantive and theoretical questions on American culture and religion, almost no such data exist. By searching the massive Google Books collection, however, the Google Ngram Viewer provides quantitative data on cultural and religious trends over time. The Ngram Viewer searches the entire collection of Google Books and reports, for each year, how often an Ngram appears in the books published that year. Ngrams are most commonly single words, but an Ngram can be any sequence of text. In an effort to democratize access to these trend data, the ARDA has created a dataset with more than 400 Ngram variables generated by the Ngram Viewer and more than 20 historical trend variables taken from the Historical Statistics of the United States and other sources. When available, we also included measures on education and clergy training.

The Ngram variables included in this file were generated from both specific terms and composite measures, in which a scale is created by combining similar words (e.g., Atheist scale = atheist + Atheist + atheism + Atheism). These Ngram data were drawn from Google's American English corpus, which contains more than 3 million books. The Ngram variables were calculated as rates and can be interpreted as how often "xyz" is used, as a proportion of the total words in Google's American English corpus. We caution, however, that the Ngram data included in this file are based on very simple searches. The Ngram Viewer also allows users to customize measures by using wildcard searches, inflection searches, case-insensitive searches, part-of-speech tags, and Ngram compositions. For many research projects, users will want to refine the searches so that they better capture the desired measure. See the Finke and McClure working paper for more details.
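To make the rate calculation concrete, the following is a minimal Python sketch of how a composite scale such as the Atheist scale could be computed from raw yearly counts. It is an illustration under stated assumptions, not the ARDA's actual procedure: the function name composite_rate, the dictionary layout, and all counts are hypothetical, and the sketch assumes the scale is the sum of the component terms' counts divided by the total words for the same year.

```python
def composite_rate(term_counts, total_words):
    """Return {year: rate}, where rate = summed counts of all terms / total words.

    term_counts: {term: {year: count}}  (hypothetical layout)
    total_words: {year: total word count in the corpus for that year}
    """
    rates = {}
    for year, total in total_words.items():
        if total <= 0:
            continue  # no usable denominator for this year
        combined = sum(counts.get(year, 0) for counts in term_counts.values())
        rates[year] = combined / total
    return rates


# Made-up counts for illustration only; these are not real Ngram values.
term_counts = {
    "atheist": {1900: 30, 1901: 45},
    "Atheist": {1900: 10, 1901: 5},
    "atheism": {1900: 50, 1901: 40},
    "Atheism": {1900: 10, 1901: 10},
}
total_words = {1900: 1_000_000, 1901: 2_000_000}

atheist_scale = composite_rate(term_counts, total_words)
```

Because every term in a given year shares the same denominator, summing the counts and then dividing gives the same result as summing the individual rates.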

Data File
Cases: 201
Variables: 434
Weight Variable: None
Data Collection
Date Collected: 2014-2015
Original Survey (Instrument)
Google Ngram Viewer
Funded By
The John Templeton Foundation
Collection Procedures
Ngram counts were gathered from the Google Ngrams American English 2012 corpus. These data can be viewed at http://books.google.com/ngrams and downloaded from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html.

The historical data were collected from the following sources and organizations: Historical Statistics of the United States; Yearbook of American and Canadian Churches; National Center for Education Statistics; Association of Theological Schools. Some historical measures did not have data for every year; values for the missing years were estimated by linear interpolation.
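As an illustration of the interpolation step, here is a short Python sketch that fills the years between two observed values along a straight line. The function name interpolate_years and the sample values are hypothetical, and the sketch assumes that only gaps between observed years are filled, with no extrapolation beyond the first or last observation; the source does not say how such edge cases were handled.

```python
def interpolate_years(observed):
    """Fill every year between observed years by linear interpolation.

    observed: {year: value} for the years that have data.
    Returns {year: value} covering the first through last observed year.
    """
    years = sorted(observed)
    filled = {}
    for start, end in zip(years, years[1:]):
        v0, v1 = observed[start], observed[end]
        for year in range(start, end):
            # Fraction of the way from the earlier to the later observation.
            fraction = (year - start) / (end - start)
            filled[year] = v0 + fraction * (v1 - v0)
    if years:
        filled[years[-1]] = observed[years[-1]]
    return filled


# Made-up values for illustration only.
observed = {1850: 100.0, 1860: 150.0, 1870: 170.0}
series = interpolate_years(observed)
```

In this example the value for 1855 falls halfway between the 1850 and 1860 observations, and the observed years keep their original values.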
Principal Investigators
The Association of Religion Data Archives
Roger Finke, Director
Jennifer M. McClure, Research Associate
Nathaniel D. Porter, Research Associate
Related Publications
Finke, Roger and Jennifer M. McClure. "A Quick Review of Millions of Books: Charting Cultural and Religious Trends with Google’s Ngram Viewer."
Note on Pre-1840 Data
The data from before 1840 can contain erratic fluctuations due to the small number of books in this period. In some cases, it may be best to omit data before 1840.
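A minimal Python sketch of omitting the early years, assuming a series stored as a year-to-rate dictionary. The function name drop_early_years, the parameter first_year, and the sample values are hypothetical.

```python
def drop_early_years(series, first_year=1840):
    """Return a copy of the series keeping only years >= first_year."""
    return {year: value for year, value in series.items() if year >= first_year}


# Made-up rates for illustration only.
rates = {1838: 2.0e-6, 1839: 9.0e-6, 1840: 4.0e-6, 1841: 4.1e-6}
trimmed = drop_early_years(rates)
```

Keeping the cutoff as a parameter makes it straightforward to check whether a result is sensitive to the choice of starting year.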
