Project

Parsed Corpus of Spoken Dutch Dialects + (GCND+)

Code
I002124N
Duration
01 May 2024 → 30 April 2028
Funding
Research Foundation - Flanders (FWO)
Promotor-spokesperson
Research disciplines
  • Humanities and the arts
    • Computational linguistics
    • Corpus linguistics
    • Dialectology
    • Syntax
  • Engineering and technology
    • Audio and speech processing
Keywords
automatic speech recognition, parsed dialect corpus, NLP for non-standard speech
 
Project description

The current application proposes the construction of an electronic annotated corpus of spontaneous
dialect speech, which fulfils two urgent desiderata: (1) Annotated corpora of spontaneous speech are
still rare compared to corpora of written texts. Parsed dialect corpora in particular are practically nonexistent; notable exceptions are AAPCAppE and CorDial-Sin, with ca. 1 million tokens each. Yet,
spoken (dialect) corpora are indispensable for a better understanding of language structure,
language change, interaction, and the limits of variation in human language. Given rapidly
progressing dialect loss, the transcription of existing audio collections is urgent, as is their linguistic
annotation. The proposed infrastructure represents a significant geographical extension of the parsed
corpus of spoken Dutch dialects (GCND) currently under construction at Ghent University, to cover
the entire European Dutch dialect area. In total, the infrastructure will comprise ca. 10 million tokens.
(2) Dialects and regional speech still hamper the accuracy of language and speech technologies,
which are increasingly used in all areas of daily life. The current project will use the existing audio-aligned transcriptions and annotations from the first stage of the GCND to re-train ASR and NLP tools
to speed up the transcription and annotation of the new data. Improved ASR tools and more robust
NLP pipelines are furthermore important for greater inclusion in an increasingly digital society.