Project

Machine translation augmented with automatically extracted similar translations

Code
DOCT/002980
Duration
16 November 2023 → 21 September 2025 (Ongoing)
Doctoral researcher
Research disciplines
  • Natural sciences
    • Natural language processing
  • Humanities and the arts
    • Computational linguistics
    • Translation studies
    • Interpreting studies
  • Social sciences
    • Artificial intelligence
Keywords
Machine Translation, Artificial Intelligence, Natural Language Processing, Retrieval-based Machine Translation, Large Language Models (LLMs), Computational Linguistics
 
Project description

This project aims to improve the accuracy and efficiency of machine translation (MT) by integrating Large Language Models (LLMs) with retrieval-based MT techniques and synthetic data augmentation. The approach generates synthetic bilingual and monolingual datasets from existing corpora such as DGT, ParaCrawl, and news crawls, and then enhances these synthetic datasets through neural fuzzy repair and back-translation.
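As an illustration of the retrieval step behind neural fuzzy repair, the following minimal sketch retrieves the most similar source sentence from a small translation memory and appends its translation to the input sentence. It is not the project's implementation: the function names, the similarity measure (token-level matching via Python's difflib), the 0.5 threshold, and the " @@@ " separator are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def fuzzy_score(a: str, b: str) -> float:
    """Token-level similarity in [0, 1]; a stand-in for a fuzzy match score."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def best_match(source, memory, threshold=0.5):
    """Return the (source, target) pair most similar to `source`,
    or None if no pair reaches the (hypothetical) threshold."""
    if not memory:
        return None
    scored = [(fuzzy_score(source, tm_src), (tm_src, tm_tgt))
              for tm_src, tm_tgt in memory]
    score, pair = max(scored, key=lambda item: item[0])
    return pair if score >= threshold else None


def augment(source, memory, threshold=0.5, sep=" @@@ "):
    """Append the target side of the best fuzzy match to the source sentence.
    The separator token is an arbitrary choice for this sketch."""
    match = best_match(source, memory, threshold)
    if match is None:
        return source  # no sufficiently similar translation found
    return source + sep + match[1]


# Toy translation memory (invented example sentences).
memory = [
    ("the committee adopted the report", "le comité a adopté le rapport"),
    ("the weather is nice today", "il fait beau aujourd'hui"),
]

print(augment("the committee adopted the proposal", memory))
```

The augmented string would then serve as input to an MT system trained on similarly augmented data. Sentences with no match above the threshold are passed through unchanged.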

Expected outcomes include MT systems that achieve improved translation quality by leveraging the capabilities of LLMs. By combining retrieval-based methods with synthetic data generation and augmentation, the project seeks to contribute to more accurate and efficient MT systems and, in turn, to better global communication.