Resource Guides: Linguistics: Datasets and corpora

Datasets and corpora for linguistics research

The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations and government research laboratories. It is hosted by the University of Pennsylvania, where it is a center within the University’s School of Arts and Sciences. LDC creates, collects and distributes speech and text databases, lexicons, and other resources for linguistics research.

Georgia Tech is an LDC member organization. To license data and download datasets, Georgia Tech users can create an account with the following steps:

Go to the Linguistic Data Consortium homepage.
Under Members, choose User Login.
Create an account with your GT email address.
Check your email for approval.

Wikipedia maintains a list of text corpora in many languages, most of them freely available.

List of text corpora from Wikipedia
