A guide to linguistics resources at the Georgia Tech Library

Datasets and corpora for linguistics research

The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations and government research laboratories. It is hosted by the University of Pennsylvania, where it is a center within the School of Arts and Sciences. LDC creates, collects and distributes speech and text databases, lexicons and other resources for linguistics research.

Georgia Tech is an LDC member organization. Georgia Tech users can create an LDC account to license and download datasets:

  • Under Members, choose User Login
  • Create an account with your GT email address
  • You will be upgraded from guest user to organization user upon approval

Wikipedia maintains a list of text corpora in many languages, most of them freely available.