Project

Parsed Corpus of Spoken Dutch Dialects + (GCND+)

Code

I002124N

Duration

01 May 2024 → 30 April 2028

Funding

Research Foundation - Flanders (FWO)

Promotor-spokesperson

Anne Breitbarth

Research disciplines

Humanities and the arts
- Computational linguistics
- Corpus linguistics
- Dialectology
- Syntax
Engineering and technology
- Audio and speech processing

Keywords

automatic speech recognition, parsed dialect corpus, NLP for non-standard speech

Project description

The current application proposes the construction of an electronic annotated corpus of spontaneous
dialect speech, which fulfils two urgent desiderata: (1) Annotated corpora of spontaneous speech are
still rare compared to corpora of written texts. Parsed dialect corpora in particular are practically non-existent; notable exceptions are AAPCAppE and CorDial-Sin, with ca. 1 million tokens each. Yet,
spoken (dialect) corpora are indispensable for a better understanding of language structure,
language change, interaction, and the limits of variation in human language. Given rapidly
progressing dialect loss, the transcription of existing audio collections is urgent, as is their linguistic
annotation. The proposed infrastructure represents a significant geographical extension of the parsed
corpus of spoken Dutch dialects (GCND) currently under construction at Ghent University, to cover
the entire European Dutch dialect area. The complete infrastructure will comprise ca. 10 million tokens.
(2) Dialects and regional speech still hamper the accuracy of language and speech technologies,
which are increasingly used in all areas of daily life. The current project will use the existing audio-aligned transcriptions and annotations from the first stage of the GCND to retrain ASR and NLP tools,
which will speed up the transcription and annotation of the new data. Improved ASR tools and more robust
NLP pipelines are, moreover, important for greater inclusion in an increasingly digital society.