# Data and tools for machine translation

### Parallel corpora for training machine translation models

<table data-view="cards"><thead><tr><th>Languages</th><th>No. of sentences</th><th>Data source</th><th>Available at</th><th>Corpus name</th></tr></thead><tbody>
<tr><td>Multilingual</td><td>453,783,349</td><td>OPUS + other public sources + synthetic corpus</td><td><a href="https://huggingface.co/datasets/BSC-LT/ALIA_mixed_authentic_synthetic_MT">https://huggingface.co/datasets/BSC-LT/ALIA_mixed_authentic_synthetic_MT</a></td><td>ALIA_mixed_authentic_synthetic_MT</td></tr>
<tr><td>Catalan-Galician</td><td>33,668,599</td><td>NOS + AINA</td><td><a href="https://huggingface.co/datasets/projecte-aina/CA-GL_Parallel_Corpus">https://huggingface.co/datasets/projecte-aina/CA-GL_Parallel_Corpus</a></td><td>CA-GL_Parallel_Corpus</td></tr>
<tr><td>Catalan-Basque</td><td>10,471,139</td><td>GAITU + AINA</td><td><a href="https://huggingface.co/datasets/projecte-aina/CA-EU_Parallel_Corpus">https://huggingface.co/datasets/projecte-aina/CA-EU_Parallel_Corpus</a></td><td>CA-EU_Parallel_Corpus</td></tr>
<tr><td>Catalan-Aranese</td><td>539,110</td><td>Various parallel data sources + synthetic</td><td><a href="https://huggingface.co/datasets/BSC-LT/Catalan-Aranese_Parallel_Corpus">https://huggingface.co/datasets/BSC-LT/Catalan-Aranese_Parallel_Corpus</a></td><td>Catalan-Aranese Parallel Corpus</td></tr>
<tr><td>Spanish-Aragonese</td><td>47,521</td><td>Synthetic corpus + OPUS</td><td><a href="https://huggingface.co/datasets/projecte-aina/ES-AN_Parallel_Corpus">https://huggingface.co/datasets/projecte-aina/ES-AN_Parallel_Corpus</a></td><td>ES-AN Parallel Corpus</td></tr>
<tr><td>Spanish-Asturian</td><td>704,378</td><td>Synthetic corpus + OPUS</td><td><a href="https://huggingface.co/datasets/projecte-aina/ES-AST_Parallel_Corpus">https://huggingface.co/datasets/projecte-aina/ES-AST_Parallel_Corpus</a></td><td>ES-AST Parallel Corpus</td></tr>
<tr><td>Spanish-Aranese</td><td>419,908</td><td>Synthetic corpus + OPUS</td><td><a href="https://huggingface.co/datasets/projecte-aina/ES-OC_Parallel_Corpus">https://huggingface.co/datasets/projecte-aina/ES-OC_Parallel_Corpus</a></td><td>ES-OC Parallel Corpus</td></tr>
<tr><td>Spanish-Valencian</td><td>2,162,451</td><td>BOUA + DOGV + BOUMH + Generalitat Valenciana + Les Corts Valencianes</td><td><a href="https://huggingface.co/datasets/BSC-LT/Spanish-Valencian_Catalan_Parallel_Corpus">https://huggingface.co/datasets/BSC-LT/Spanish-Valencian_Catalan_Parallel_Corpus</a></td><td>Spanish-Valencian Catalan Parallel Corpus</td></tr>
<tr><td>Valencian-Spanish</td><td>15,697</td><td>Universitat de València</td><td><a href="https://huggingface.co/datasets/gplsi/uv_parallel_va_es">https://huggingface.co/datasets/gplsi/uv_parallel_va_es</a></td><td>UV_PARALLEL_VA_ES</td></tr>
<tr><td>Valencian-English</td><td>6,494</td><td>Universitat de València</td><td><a href="https://huggingface.co/datasets/gplsi/uv_parallel_va_en">https://huggingface.co/datasets/gplsi/uv_parallel_va_en</a></td><td>UV_PARALLEL_VA_EN</td></tr>
<tr><td>Valencian-Spanish</td><td>120,281</td><td>Universitat Jaume I</td><td><a href="https://huggingface.co/datasets/gplsi/uji_parallel_va_es">https://huggingface.co/datasets/gplsi/uji_parallel_va_es</a></td><td>UJI_PARALLEL_VA_ES Dataset</td></tr>
<tr><td>Valencian-English</td><td>43,107</td><td>Universitat Jaume I</td><td><a href="https://huggingface.co/datasets/gplsi/uji_parallel_va_en">https://huggingface.co/datasets/gplsi/uji_parallel_va_en</a></td><td>UJI_PARALLEL_VA_EN Dataset</td></tr>
<tr><td>Valencian-Spanish</td><td>8,759,238</td><td>Diari Oficial de la Generalitat Valenciana</td><td><a href="https://huggingface.co/datasets/gplsi/dogv_parallel">https://huggingface.co/datasets/gplsi/dogv_parallel</a></td><td>DOGV_PARALLEL Dataset</td></tr>
<tr><td>Valencian-Spanish</td><td>738,777</td><td>Associació de Mitjans d'Informació i Comunicació</td><td><a href="https://huggingface.co/datasets/gplsi/amic_parallel">https://huggingface.co/datasets/gplsi/amic_parallel</a></td><td>AMIC_PARALLEL Dataset</td></tr>
<tr><td>Valencian-Spanish</td><td>357,518</td><td>Boletín Oficial de la Universidad de Alicante</td><td><a href="https://huggingface.co/datasets/gplsi/boua_parallel">https://huggingface.co/datasets/gplsi/boua_parallel</a></td><td>BOUA_PARALLEL Dataset</td></tr>
<tr><td>Spanish-Catalan</td><td>1,958</td><td>Common Voice</td><td><a href="https://huggingface.co/datasets/gplsi/ES-CA_translation_test">https://huggingface.co/datasets/gplsi/ES-CA_translation_test</a></td><td>ES-CA_alignment_test Dataset</td></tr>
<tr><td>Spanish-Valencian</td><td>1,958</td><td>Common Voice</td><td><a href="https://huggingface.co/datasets/gplsi/ES-VA_translation_test">https://huggingface.co/datasets/gplsi/ES-VA_translation_test</a></td><td>ES-VA_alignment_test Dataset</td></tr>
<tr><td>Catalan-Valencian</td><td>1,958</td><td>Common Voice</td><td><a href="https://huggingface.co/datasets/gplsi/CA-VA_alignment_test">https://huggingface.co/datasets/gplsi/CA-VA_alignment_test</a></td><td>CA-VA_alignment_test Dataset</td></tr>
<tr><td>Spanish-Valencian-Catalan</td><td>802 (words/expressions)</td><td></td><td><a href="https://huggingface.co/datasets/gplsi/es_vaca">https://huggingface.co/datasets/gplsi/es_vaca</a></td><td>es_vaca</td></tr>
<tr><td>Spanish-English (MWEs)</td><td>235 (multiword expressions)</td><td>Films by Pedro Almodóvar</td><td><a href="https://huggingface.co/datasets/gplsi/almo-mwe">https://huggingface.co/datasets/gplsi/almo-mwe</a></td><td>ALMO-MWE</td></tr>
<tr><td>English-Spanish</td><td>35,753,765</td><td>Various sources from the legal-administrative, biomedical, and heritage domains</td><td><a href="https://huggingface.co/datasets/SINAI/ALIA-parallel-translation">https://huggingface.co/datasets/SINAI/ALIA-parallel-translation</a></td><td>ALIA-parallel-translation</td></tr>
<tr><td>English-Spanish</td><td>288,955 documents</td><td>Various heritage-domain sources</td><td><a href="https://huggingface.co/datasets/SINAI/ALIA-heritage-parallel-translation">https://huggingface.co/datasets/SINAI/ALIA-heritage-parallel-translation</a></td><td>ALIA-heritage-parallel-translation</td></tr>
<tr><td>English-Spanish-Basque</td><td>137,726</td><td>Berria (synthetic)</td><td><a href="https://huggingface.co/datasets/HiTZ/ALIA_syntethic_MT">https://huggingface.co/datasets/HiTZ/ALIA_syntethic_MT</a></td><td>ALIA synthetic MT</td></tr>
<tr><td>Spanish-Galician</td><td>8,800 bilingual sentence pairs</td><td>CORPES, CORGA + synthetic</td><td><a href="https://huggingface.co/datasets/proxectonos/corpus_paralelo_idioms">https://huggingface.co/datasets/proxectonos/corpus_paralelo_idioms</a></td><td>Spanish–Galician Idiom Parallel Corpus</td></tr>
<tr><td>Spanish-Galician and English-Galician</td><td>300,000 aligned sentences</td><td>SCIELO</td><td><a href="https://huggingface.co/datasets/proxectonos/SciELO-GL">https://huggingface.co/datasets/proxectonos/SciELO-GL</a></td><td>SCIELO corpus</td></tr>
<tr><td>Spanish-Galician</td><td>320,000 aligned sentence pairs</td><td>Directorate-General for Translation</td><td><a href="https://huggingface.co/datasets/proxectonos/DGT-GL">https://huggingface.co/datasets/proxectonos/DGT-GL</a></td><td>DGT corpus</td></tr>
<tr><td>Galician, Portuguese, Spanish, Catalan, Basque, English</td><td>190,000 aligned sentence pairs</td><td>TowerBlocks</td><td><a href="https://huggingface.co/datasets/proxectonos/Finetuning-MT">https://huggingface.co/datasets/proxectonos/Finetuning-MT</a></td><td>Finetuning-MT</td></tr>
</tbody></table>
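
Every link in the table above points to a Hugging Face dataset page whose path has the form `datasets/<owner>/<name>`. As a minimal sketch, the helper below extracts that repository id from such a URL and hands it to `load_dataset` from the third-party `datasets` library. The function names `hub_repo_id` and `load_corpus` are illustrative and not part of any published API. Some corpora may also require a configuration name, a split, or accepted access terms; this sketch leaves those to the caller through keyword arguments.

```python
from urllib.parse import urlparse


def hub_repo_id(url: str) -> str:
    """Return the 'owner/name' repository id from a Hugging Face dataset URL."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected path shape: datasets/<owner>/<name>
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return "/".join(parts[1:])


def load_corpus(url: str, **kwargs):
    """Load a corpus listed in the table (hypothetical helper).

    Assumes the third-party `datasets` package is installed
    (pip install datasets) and that the repository is reachable.
    Extra keyword arguments are passed through to `load_dataset`.
    """
    from datasets import load_dataset  # imported lazily: optional dependency

    return load_dataset(hub_repo_id(url), **kwargs)
```

For example, `load_corpus("https://huggingface.co/datasets/projecte-aina/CA-GL_Parallel_Corpus")` would request the Catalan-Galician corpus. Check each dataset card first for the available configurations, splits, and license terms.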

***

### Corpora for adapting and evaluating machine translation models

<table data-view="cards"><thead><tr><th>Languages</th><th data-type="number">No. of sentences</th><th>Data source</th><th>Available at</th><th>Name</th></tr></thead><tbody>
<tr><td>Multilingual</td><td>742,183</td><td>European academic repositories</td><td><a href="https://huggingface.co/datasets/BSC-LT/ACAData">https://huggingface.co/datasets/BSC-LT/ACAData</a></td><td>ACAData</td></tr>
<tr><td>Spanish, Catalan, Basque, English</td><td>518</td><td>FLORES</td><td><a href="https://huggingface.co/datasets/HiTZ/flores_plus_gender">https://huggingface.co/datasets/HiTZ/flores_plus_gender</a></td><td>FLORES+G</td></tr>
<tr><td>Basque</td><td>1,827</td><td>WinoMT</td><td><a href="https://huggingface.co/datasets/HiTZ/winomteus">https://huggingface.co/datasets/HiTZ/winomteus</a></td><td>WinoMTeus</td></tr>
<tr><td>Spanish-Galician</td><td>13,198</td><td></td><td><a href="https://huggingface.co/datasets/proxectonos/corpus_paralelo_idioms">https://huggingface.co/datasets/proxectonos/corpus_paralelo_idioms</a></td><td>Idiom parallel corpus</td></tr>
<tr><td>Spanish-Galician</td><td>13.6</td><td></td><td><a href="https://huggingface.co/datasets/proxectonos/erros_sistematicos_traducion_es_gl">https://huggingface.co/datasets/proxectonos/erros_sistematicos_traducion_es_gl</a></td><td>Systematic translation errors</td></tr>
</tbody></table>

### Tools for machine translation models

* Optimized tool for aligning sentences, paragraphs, and documents: <https://github.com/gplsi/translation-alignment>

