A guide to linguistics resources at the Georgia Tech Library

Datasets and corpora for linguistics research

The Linguistic Data Consortium (LDC) is an open consortium of universities, libraries, corporations and government research laboratories. It is hosted by the University of Pennsylvania, where it is a center within the School of Arts and Sciences. LDC creates, collects and distributes speech and text databases, lexicons, and other resources for linguistics research.

Georgia Tech is an LDC member organization. Georgia Tech users can create an account on the LDC website to license and download datasets:

  • Under Members, choose User Login
  • Create an account with your GT email address
  • You will be upgraded from guest user to organization user upon approval

Wikipedia maintains a database of text corpora in many languages, most of them freely available.