5 Ways to Open Up Corpora for Language Learning

Corpora developed by linguists to study languages are a promising source of authentic materials to employ in the development of OER for language learning. Recently, COERLL’s SpinTX Corpus-to-Classroom project launched a new open resource that seeks to make it easy to search and adapt materials from a video corpus.

The SpinTX video archive provides a pedagogically friendly web interface to search hundreds of videos from the Spanish in Texas Corpus. Each of the videos is accompanied by synchronized closed captions and a transcript that has been annotated with thematic, grammatical, functional and metalinguistic information. Educators using the site can also tag videos for features that match their interests, and share favorite videos in playlists.

A collaboration among educators, professional linguists, and technologists, the SpinTX project leverages different aspects of the “openness” movement including open research, open data, open source software, and open education. It is our hope that by opening up this corpus, and by sharing the strategies and tools we used to develop it, others may be able to replicate and build on our work in other contexts.

So, how do we make a corpus open and beneficial across communities? Here are 5 ways:

1. Create an open and accessible search interface

Minimize barriers to your content. Searching the SpinTX video archive requires no registration, passwords or fees. To maximize accessibility, think about your audience’s context and needs. The SpinTX video archive offers a corpus interface specifically for educators, and plans to create a different interface for researchers.

2. Use open content licenses

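Add a Creative Commons license to your corpus materials. The SpinTX video archive uses a CC BY-NC-SA license, which requires attribution and allows others to reuse and adapt the materials in different contexts for noncommercial purposes, provided that adaptations are shared under the same license.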

3. Make your data open and share content

Allow others to easily embed or download your content and data. The SpinTX video archive provides social sharing buttons for each video, and gives access to the source data (tagged transcripts) through Google Fusion Tables.

4. Embrace open source development

When possible, use and build upon open source tools. The SpinTX project was developed using a combination of open source software (e.g. TreeTagger, Drupal) and open APIs (e.g. YouTube Captioning API). Custom code developed for the project is openly shared through a GitHub repository.

5. Make project documentation open

Make it easy for others to replicate and build on your work. The SpinTX team is publishing its research protocols, development processes and methodologies, and other project documentation on the SpinTX Corpus-to-Classroom blog.

Openly sharing language corpora may have wide-ranging benefits for diverse communities of researchers, educators, and language learners, as well as for the public interest. The SpinTX team is interested in starting a conversation across these communities. Have you ever used a corpus before? What did you use it for? If you have never used a corpus, how do you find and use authentic videos in the classroom? How can we make video corpora more accessible and useful for teachers and learners?

Rachael Gilg is the Project Manager and Lead Developer for COERLL’s Spanish in Texas Corpus project and the SpinTX Corpus-to-Classroom project. She has acted as project manager, designer, and developer on a diverse set of projects, including educational websites and online courses, video and interactive media, digital archives, and social/community websites.


  1. Carl Blyth says

    Great post, Rachael. And what a fantastic OER! Kudos to the SpinTX team. I can’t wait for Spanish teachers to discover this amazing resource.

    I’d like to mention that COERLL will be presenting SpinTX at several conferences this summer (CALICO in Honolulu, AATSP in San Antonio) as well as a free webinar on June 26th. Register for the June webinar by going here:

    Btw, I want to draw readers’ attention to two Canadian video archives for learning French:
    1. “Francotoile” from the University of Victoria (BC)

    2. “Vidéotech” (Carleton University, Ontario)

    Francotoile presents French as an international language with clips of native speakers from 5 continents. You can easily download all videos to your hard drive!
    Vidéotech contains NS videos and NNS videos. It also allows you to create activities based on easy-to-use templates.

    Both resources feature many of the elements that Rachael cites for “opening up foreign language learning,” including Creative Commons licenses.

  2. Seems like a worthy endeavour. I would like to add that it is important to be mindful of the learning process and let that drive any developments. Too often developments in this field lose track of the fundamentals of what a language learner needs in order to learn. One practical suggestion is to consider game theory (what drives the phenomenal addictive power of video games) as a way to integrate what we know about successful use of this medium into learning.

    • Excellent point, Andrew! Since educators and language learners are our primary stakeholders, their needs should drive the design process. We started our project by interviewing educators and developing a needs assessment that informed our design. In addition, we are trying to follow a lean startup approach, which means getting a simple version of the tool launched early and adding new features incrementally based on user observation and feedback.
