Democratizing access to the best data on religion since 1997
DATA ARCHIVE

Google Ngrams Data, 1800-2000

DOI

10.17605/OSF.IO/BNDZK

Citation

Finke, R., McClure, J. M., & Porter, N. D. (2020, April 25). Google Ngrams Data, 1800-2000.

Summary

Despite the importance of trend data for understanding key substantive and theoretical questions on American culture and religion, almost no such data exist. By searching the massive Google Books collection, however, the Google Ngram Viewer provides quantitative data on cultural and religious trends over time. The Ngram Viewer searches the entire collection of Google Books and reports how often an Ngram appears in the books published each year. Ngrams are most commonly words, but can be any given sequence of text. In an effort to democratize access to these trend data, the ARDA has created a dataset with more than 400 Ngram variables generated by the Ngram Viewer and more than 20 historical trend variables taken from the Historical Statistics of the United States and other sources. When available, we also included measures on education and clergy training.

The Ngram variables included in this file were generated from both specific terms and composite scales, which combine similar words (e.g., Atheist scale = atheist + Atheist + atheism + Atheism). These Ngram data were drawn from Google's American English corpus, which contains more than 3 million books. The Ngram variables were calculated as rates and can be interpreted as how often "xyz" is used, as a proportion of the total words in Google's American English corpus. We would caution, however, that the Ngram data included in this file are based on very simple searches. The Ngram Viewer also allows users to customize measures by using wildcard searches, inflection searches, case-insensitive searches, part-of-speech tags, and ngram compositions. For many research projects, users will want to refine the searches so that they better capture the intended measure. See the Finke and McClure working paper for more details.
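As a rough illustration of how a composite rate of this kind could be computed, the sketch below sums the yearly counts of the four Atheist-scale terms and divides by the total number of words for that year. The function name, the counts, and the total are hypothetical and chosen only for illustration; the exact procedure used to build the file is not specified here.

```python
from typing import Dict, Iterable


def composite_rate(counts: Dict[str, int], terms: Iterable[str], total_words: int) -> float:
    """Return the share of all words in one year's corpus taken up by `terms`."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    # Terms missing from `counts` contribute zero occurrences.
    return sum(counts.get(term, 0) for term in terms) / total_words


# Hypothetical counts for a single year (illustrative values, not real Ngram data).
year_counts = {"atheist": 120, "Atheist": 30, "atheism": 200, "Atheism": 50}
ATHEIST_TERMS = ["atheist", "Atheist", "atheism", "Atheism"]

atheist_scale = composite_rate(year_counts, ATHEIST_TERMS, total_words=10_000_000)
print(f"Atheist scale rate: {atheist_scale:.2e}")
```

Because every term shares the same denominator, summing the counts and then dividing gives the same result as summing the individual term rates.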

Data File

Cases: 201
Variables: 434
Weight Variable: None

Data Collection

2014-2015

Original Survey (Instrument)

Google Ngram Viewer

Funded By

The John Templeton Foundation

Collection Procedures

Ngram counts were gathered from the Google Ngrams American English 2012 corpus. These data are available for viewing at https://books.google.com/ngrams and for download at http://storage.googleapis.com/books/ngrams/books/datasetsv2.html.

The historical data were collected from the following sources and organizations: Historical Statistics of the United States; Yearbook of American and Canadian Churches; National Center for Education Statistics; Association of Theological Schools. Some historical measures did not have data for every year, and linear interpolation was used to estimate values for the missing years.
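For readers unfamiliar with linear interpolation, the following minimal sketch fills the years between two observed values along a straight line. The function name and the sample figures are hypothetical, and this is not necessarily the exact routine used to prepare the file.

```python
from typing import Dict


def interpolate_years(observed: Dict[int, float]) -> Dict[int, float]:
    """Fill every year between the first and last observation by linear interpolation."""
    years = sorted(observed)
    filled = dict(observed)
    # Walk over each pair of neighboring observed years and fill the gap between them.
    for left, right in zip(years, years[1:]):
        step = (observed[right] - observed[left]) / (right - left)
        for year in range(left + 1, right):
            filled[year] = observed[left] + step * (year - left)
    return filled


# Hypothetical measure observed only in 1900, 1905, and 1910.
observed = {1900: 100.0, 1905: 150.0, 1910: 140.0}
filled = interpolate_years(observed)

for year in sorted(filled):
    print(year, filled[year])
```

Years before the first observation or after the last one are left untouched, since interpolation only estimates values between known points.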

Principal Investigators

The Association of Religion Data Archives
Roger Finke, Director
Jennifer M. McClure, Research Associate
Nathaniel D. Porter, Research Associate

Related Publications

Finke, Roger and Jennifer M. McClure. "A Quick Review of Millions of Books: Charting Cultural and Religious Trends with Google's Ngram Viewer."

Note on Pre-1840 Data

The data from before 1840 can contain erratic fluctuations due to the small number of books in this period. In some cases, it may be best to omit data before 1840.
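One simple way to follow this advice is to filter out the early years before analysis. The sketch below assumes each case is a record with a `year` field; that field name and the placeholder rows are assumptions made for illustration, so check the codebook for the actual variable name.

```python
from typing import Dict, List


def drop_early_years(rows: List[Dict[str, float]], cutoff: int = 1840) -> List[Dict[str, float]]:
    """Keep only the rows whose year is at or after `cutoff`."""
    return [row for row in rows if row["year"] >= cutoff]


# Placeholder rows covering 1800-2000, one per year, matching the 201 cases in the file.
all_rows = [{"year": year, "rate": 0.0} for year in range(1800, 2001)]
recent_rows = drop_early_years(all_rows)

print(len(all_rows), len(recent_rows))
```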

© 2023 The Association of Religion Data Archives. All rights reserved.