Corpus of Contemporary American English

The Corpus of Contemporary American English (COCA) is a one-billion-word corpus of contemporary American English. It was created by Mark Davies, retired professor of corpus linguistics at Brigham Young University.

Website: english-corpora.org/coca

Content

As of November 2021, the Corpus of Contemporary American English comprises about one billion words. The corpus has grown steadily: in 2009 it contained more than 385 million words; in 2010 it reached 400 million words; and by March 2019 it had grown to 560 million words.
As of November 2021, the corpus comprises 485,202 texts. According to the corpus website, these texts amount to 24–25 million words for each year from 1990 to 2019.
For each year, the corpus is evenly divided among six registers (genres): TV/movies, spoken, fiction, magazine, newspaper, and academic. In addition to these six registers, COCA contains 125,496,215 words from blogs and 129,899,426 words from websites, which extends its coverage to contemporary web-based English.
The texts come from a variety of sources:

TV/Movies subtitles: Texts taken from the OpenSubtitles collection of American TV shows and movies.
Spoken: Transcripts of unscripted conversation from nearly 150 TV and radio programs.
Fiction: Short stories and plays, first chapters of books 1990–present, and movie scripts.
Popular magazines: Nearly 100 magazines, from a range of domains such as news, health, home and gardening, women's, financial, religion, and sports.
Newspapers: Ten newspapers from across the US, with text from different sections of the newspapers, such as local news, opinion, sports, and the financial section.
Academic journals: Nearly 100 peer-reviewed journals. These were selected to cover the entire range of the Library of Congress Classification system.
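The size figures quoted in this section can be cross-checked with simple arithmetic. The sketch below is only a rough consistency check, not an official count: it assumes that the 24–25 million words per year cover the six main registers only, and that the blog and website words are counted in addition to them. The constant and function names are illustrative.

```python
# Rough consistency check of the corpus size figures quoted in the text.
# Assumption: the per-year figure excludes the blog and website texts.

YEARS = range(1990, 2020)            # 1990–2019 inclusive, i.e. 30 years
WORDS_PER_YEAR_LOW = 24_000_000      # lower bound quoted for each year
WORDS_PER_YEAR_HIGH = 25_000_000     # upper bound quoted for each year
BLOG_WORDS = 125_496_215             # words from blogs
WEB_WORDS = 129_899_426              # words from websites


def estimate_total_words():
    """Return a (low, high) estimate of the total corpus size in words."""
    n_years = len(YEARS)
    extra = BLOG_WORDS + WEB_WORDS
    low = n_years * WORDS_PER_YEAR_LOW + extra
    high = n_years * WORDS_PER_YEAR_HIGH + extra
    return low, high


low, high = estimate_total_words()
print(f"Estimated size: {low:,} to {high:,} words")
```

Under these assumptions the estimate brackets the one-billion-word figure given for November 2021.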

Availability

The Corpus of Contemporary American English is free to search for registered users.

Queries

The interface is the same as the BYU-BNC interface for the 100-million-word British National Corpus, the 100-million-word Time Magazine Corpus, and the 400-million-word Corpus of Historical American English (1810s–2000s). Its features include:
Queries by word, phrase, alternates, substring, part of speech, lemma, synonyms, and customized lists
The corpus is tagged by CLAWS, the same part of speech tagger that was used for the BNC and the Time corpus
Chart listings and table listings
Full collocates searching
Re-sortable concordances, showing the most common words/strings to the left and right of the searched word
Comparisons between genres or time periods
One-step comparisons of collocates of related words, to study semantic or cultural differences between words
Users can include semantic information from a 60,000 entry thesaurus directly as part of the query syntax
Users can also create their own 'customized' word lists, and then re-use these as part of subsequent queries
Note that the corpus is available only through the web interface, due to copyright restrictions.
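To illustrate what a collocates search computes, here is a minimal sketch that counts the words occurring within a fixed window to the left and right of a search word. It runs on a made-up sample sentence, not on COCA data, and it is not the corpus's actual implementation or query syntax; the function names, the tokenizer, and the `window` parameter are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def collocates(tokens, node, window=2):
    """Count words appearing up to `window` tokens left and right of `node`."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left.update(tokens[max(0, i - window):i])     # words before the node
        right.update(tokens[i + 1:i + 1 + window])    # words after the node
    return left, right


# Made-up sample text, used only to demonstrate the counting.
sample = ("Strong tea is popular. She drinks strong coffee, "
          "not weak tea. Strong tea again.")
tokens = tokenize(sample)
left, right = collocates(tokens, "tea", window=1)
print("Left collocates:", left.most_common())
print("Right collocates:", right.most_common())
```

Sorting the two counters by frequency gives the kind of left-context and right-context ranking that a re-sortable concordance display presents.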

Related

The Corpus of Global Web-Based English (GloWbE) contains about 1.9 billion words of text from twenty countries. This makes it about 100 times as large as other corpora such as the International Corpus of English, and it allows for many types of searches that would not otherwise be possible. In addition to its online interface, full-text data from GloWbE can also be downloaded.
GloWbE is distinctive in allowing comparisons between different varieties of English, and it is related to many other corpora of English.
