This site holds the conversational corpora assembled by the former ESRC Centre for Research on Bilingualism in Theory & Practice at University of Wales Bangor.
We are seeking to gain a greater understanding of how bilingual individuals in a variety of communities manage both their languages within the same conversation.
The questions we consider include:
To date, we have assembled three corpora:
Summary data for each corpus:
Corpus | Welsh | English | Spanish | indeterminate | Total (words) |
---|---|---|---|---|---|
Siarad | 84% | 4% | --- | 13% | 447507 |
Patagonia | 78% | <0.5% | 17% | 5% | 193102 |
Miami | --- | 63% | 34% | 3% | 235871 |
The material on this site is all available under the Free Software Foundation's GNU General Public License v3 (or later). This means it can be freely used, adapted and extended as the user requires, provided that any derived version that is distributed is released under the same GPLv3 (or later) licence. We would be grateful, however, if derived versions could acknowledge the ESRC Centre.
Siarad v1.5 (the autoglossed version on this website) is additionally licensed under the CC-BY-SA licence, requiring attribution, with the same licence being used for derived versions.
Patagonia, Miami, and Siarad v1.0 (the original version with manual glosses only, downloadable here) are additionally licensed under the CC-BY licence, requiring attribution.
The equivalent material on Talkbank is all available under the CC-BY-NC-SA licence, requiring attribution, no commercial use, and the same licence being used for derived versions.
The choice of licence for each corpus (GPLv3 or later, Creative Commons) is left to the user.
The Siarad corpus was originally published as a CD in 2009, under the GPL2 licence. In this version, Siarad v1.0, the transcripts contained only manual glosses. As of August 2019, this manually-glossed version (download) is being made available under a dual licence, GPLv3 (or later) or CC-BY, with the choice of licence left to the user.
The version on this website, Siarad v1.5, contains both manual glosses and autoglosses. As of August 2019, this autoglossed version is being made available under a dual licence, GPLv3 (or later) or CC-BY-SA, with the choice of licence left to the user.
As of August 2019, the Patagonia and Miami corpora are being made available under a dual licence, GPLv3 (or later) or CC-BY, with the choice of licence left to the user.
The menu page for each conversation now includes two new download links giving access to tab-separated files and compressed PostgreSQL table dumps for the word data.
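As a rough illustration of how one of the tab-separated word files might be explored, the Python sketch below tallies the values in one column. It assumes the file has a header row; the column names used here (`utterance_id`, `surface`, `langid`) and the sample rows are hypothetical stand-ins, not the actual layout of the downloads, so check the header of a real file before adapting it.

```python
import csv
import io
from collections import Counter


def count_column_values(tsv_text, column):
    """Tally the values in one column of tab-separated text that has a header row."""
    # QUOTE_NONE keeps quotation marks inside transcribed words as literal text.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in header: {reader.fieldnames}")
    return Counter(row[column] for row in reader)


# Invented sample data; the column names are assumptions for illustration only.
sample = (
    "utterance_id\tsurface\tlangid\n"
    "1\tmae\tcym\n"
    "1\tparty\teng\n"
    "2\thi\tcym\n"
)

counts = count_column_values(sample, "langid")
for value, n in counts.most_common():
    print(value, n)
```

For a downloaded file, the same function can be given the result of `open(path, encoding="utf-8").read()`; for very large files it would be better to pass the open file object to `csv.DictReader` directly.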
Building and Using the Siarad Corpus: Bilingual conversations in Welsh and English (Margaret Deuchar, Peredur Webb-Davies, and Kevin Donnelly) has been published by John Benjamins. The first part of the book describes the methods used to build the first sizeable corpus of informal conversational data collected from bilingual speakers of Welsh and English: Siarad. The second part describes the linguistic analysis of data from this corpus.
The .cha format is tiered, with each type of line in the file reflecting a different attribute of the text. To help those using simple concordancers, a linear version of the files is now available, in three flavours:
A new search page is available, which returns 20 instances of a word from all conversations in the Siarad or Patagonia corpora. The conversations are combined into one file, but some information such as glosses and (optionally) transcription marking is removed.
A number of publications and presentations have resulted from mining the corpora for the linguistic information they contain.
The researchers have received input and assistance from a variety of collaborators around the world. We have also received help in translating the Miami corpus from a number of people, listed on this page.
Our corpus material is transcribed and annotated using the CHAT and CLAN applications developed by Prof Brian MacWhinney and Leonid Spektor at Carnegie Mellon University. Our Siarad data is also available via the Talkbank portal (although the version there differs slightly from the one on this website).
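For readers who have not worked with CHAT transcripts, the Python sketch below shows one way the tiered layout could be read: header lines begin with `@`, each utterance begins with `*` followed by a speaker code, dependent tiers begin with `%`, and continuation lines begin with a tab. It is a minimal sketch rather than a replacement for CLAN; the sample transcript, speaker codes and tier names are invented for illustration, and real files contain further markup that is not handled here.

```python
def parse_chat(text):
    """Group the tiers of a CHAT-style transcript under their utterances."""
    utterances = []
    last = None  # (container, key) of the most recent tier, for continuation lines
    for raw in text.splitlines():
        if not raw.strip():
            continue
        if raw.startswith("@"):
            # Header lines are skipped in this sketch.
            last = None
            continue
        if raw.startswith("\t"):
            # A leading tab continues the previous tier.
            if last is not None:
                container, key = last
                container[key] += " " + raw.strip()
            continue
        if raw.startswith("*"):
            speaker, _, content = raw[1:].partition(":")
            utterance = {"speaker": speaker, "text": content.strip(), "tiers": {}}
            utterances.append(utterance)
            last = (utterance, "text")
        elif raw.startswith("%") and utterances:
            name, _, content = raw[1:].partition(":")
            tiers = utterances[-1]["tiers"]
            tiers[name] = content.strip()
            last = (tiers, name)
    return utterances


# Invented sample; not taken from any of the corpora.
sample = (
    "@Begin\n"
    "*ABC:\tmae hi (y)n braf .\n"
    "%gls:\tbe.3S.PRES she PRT fine\n"
    "%eng:\tit's fine .\n"
    "*DEF:\tyeah .\n"
    "@End\n"
)

utterances = parse_chat(sample)
# One line per utterance, similar in spirit to a linear layout.
linear = [f"{u['speaker']}\t{u['text']}" for u in utterances]
print("\n".join(linear))
```

Keeping the dependent tiers in a dictionary per utterance makes it straightforward to emit different linear layouts, for example with or without the gloss tier.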
To gloss the Miami and Patagonia corpora we are using autoglossing software we have developed in-house. To mine all three corpora we are using a variety of techniques, including the output from the autoglosser.
The ESRC Centre has collected these materials following the ethical guidelines set out in the Talkbank Code of Ethics.
bilingualism@bangor.ac.uk
The Siarad corpus
The Patagonia corpus
The Miami corpus
The support of the Arts and Humanities Research Council (AHRC), the Economic and Social Research Council (ESRC), the Higher Education Funding Council for Wales (HEFCW) and the Welsh Government is gratefully acknowledged.