Published 2002 – 2021 | Version v1
Dataset Open

English Corpora

  • 1. Brigham Young University

Description

English Corpora is a collection of 17 text corpora compiled by Mark Davies. The corpora have been used in linguistics, language learning, natural language processing, and a wide range of other disciplines. The online interface contains extensive descriptions, guides, and tools for using each corpus. See https://www.corpusdata.org/corpora.asp for more information on the corpora.

In addition to the English-Corpora web interface, NYU users have access to the archive of raw files for computational analysis. The corpora in this package include: iWeb, the Intelligent Web Corpus; NOW, News on the Web; the Coronavirus Corpus; COCA, the Corpus of Contemporary American English; GloWbE, Global Web-based English; the Wikipedia Corpus; COHA, the Corpus of Historical American English; the TV Corpus; the Movies Corpus; the SOAP Corpus; the Corpus del Español; and the Corpus do Português. These are complete, with the exception of NOW (updated on the web to the present day, but available here only through November 2020) and the Coronavirus Corpus (available here through August 2021, but on the web through December 2022).

Each corpus is provided in three formats: relational database files usable with SQL; text data broken up line by line into individual numbered words, each annotated with word, lemma, and part of speech; and linear text data (every text within the corpus provided as a single unbroken line). Each also includes a zipped text file called "sources," which lists the sources for all the material included in the corpus, and a zipped file called "lexicon," which provides a lexicon for the corpus.
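As a rough illustration of the word/lemma/part-of-speech format described above, the following Python sketch reads such lines into records and counts lemmas. The tab delimiter, the assumption that the last three fields on each line are word, lemma, and part of speech, the sample rows, and the names `Token` and `parse_wlp` are all hypothetical; the actual column layout should be checked against the documentation for each corpus.

```python
import io
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    pos: str


def parse_wlp(lines: Iterable[str], delimiter: str = "\t") -> Iterator[Token]:
    """Yield one Token per line that has at least three fields.

    Assumption: the LAST three fields are word, lemma, part of speech,
    so any leading numbering columns are ignored.
    """
    for line in lines:
        fields = line.rstrip("\r\n").split(delimiter)
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        word, lemma, pos = fields[-3:]
        yield Token(word, lemma, pos)


# Made-up sample data (not taken from any corpus): number, word, lemma, PoS.
SAMPLE = (
    "1\tThe\tthe\tat\n"
    "2\tcorpora\tcorpus\tnn2\n"
    "3\twere\tbe\tvbdr\n"
    "4\tcompiled\tcompile\tvvn\n"
)

tokens = list(parse_wlp(io.StringIO(SAMPLE)))
lemma_counts = Counter(token.lemma for token in tokens)
```

Because `parse_wlp` accepts any iterable of lines, an open file handle can be passed in place of the in-memory sample, which avoids loading a large corpus file into memory at once.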

Access Information

NYU researchers with a valid NetID can mount the cloud-based ds_collections Research Workspace share on any local computer on the NYU network (i.e., on campus, or on the NYU VPN if off campus). Follow the instructions here for how to access Research Workspace, using ds_collections as the project name. The English-Corpora collection will be found at ds_collections/english-corpora. Usage of this collection is subject to the restrictions contained in the license.txt file included in that folder.

Files

english_corpora_documentation.zip
Files (45.5 MB)
md5:b292ab94ebac02a19a0e41e4890f5ec4 (2.5 kB)
md5:496496e2933403fdd93f87b9a0c645f7 (45.5 MB)
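
The MD5 checksums listed above can be used to confirm that a downloaded file arrived intact. The sketch below uses Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_listed_md5` and any local file path passed to them are hypothetical, not part of this record.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns empty bytes at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path: str, listed: str) -> bool:
    """Compare a file's MD5 with a listed value such as 'md5:<hex digest>'."""
    # Drop an optional 'md5:' prefix, then normalize whitespace and case.
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

To use it, call `matches_listed_md5` with the path of the local copy and the checksum string listed for that file; a result of `False` suggests the download is incomplete or corrupted.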

Additional details

Created:
October 14, 2024
Modified:
October 14, 2024