# Data for machine translation

### Parallel corpora for training machine translation models

<table data-view="cards"><thead><tr><th>Languages</th><th>No. of sentences</th><th>Data source</th><th>Available at</th><th>Corpus name</th></tr></thead><tbody>
<tr><td>Multilingual</td><td>453,783,349</td><td>OPUS + other public sources + synthetic corpus</td><td><a href="https://huggingface.co/datasets/BSC-LT/ALIA_mixed_authentic_synthetic_MT">https://huggingface.co/datasets/BSC-LT/ALIA_mixed_authentic_synthetic_MT</a></td><td>ALIA_mixed_authentic_synthetic_MT</td></tr>
<tr><td>Catalan-Galician</td><td>33,668,599</td><td>NOS + AINA</td><td><a href="https://huggingface.co/datasets/projecte-aina/CA-GL_Parallel_Corpus">https://huggingface.co/datasets/projecte-aina/CA-GL_Parallel_Corpus</a></td><td>CA-GL_Parallel_Corpus</td></tr>
<tr><td>Catalan-Basque</td><td>10,471,139</td><td>GAITU + AINA</td><td><a href="https://huggingface.co/datasets/projecte-aina/CA-EU_Parallel_Corpus">https://huggingface.co/datasets/projecte-aina/CA-EU_Parallel_Corpus</a></td><td>CA-EU_Parallel_Corpus</td></tr>
<tr><td>Catalan-Aranese</td><td>539,110</td><td>Various parallel data sources + synthetic</td><td><a href="https://huggingface.co/datasets/BSC-LT/Catalan-Aranese_Parallel_Corpus">https://huggingface.co/datasets/BSC-LT/Catalan-Aranese_Parallel_Corpus</a></td><td>Catalan-Aranese Parallel Corpus</td></tr>
<tr><td>Spanish-Aragonese</td><td>47,521</td><td>Synthetic corpus + OPUS</td><td><a href="https://huggingface.co/datasets/projecte-aina/ES-AN_Parallel_Corpus">https://huggingface.co/datasets/projecte-aina/ES-AN_Parallel_Corpus</a></td><td>ES-AN Parallel Corpus</td></tr>
<tr><td>Spanish-Asturian</td><td>704,378</td><td>Synthetic corpus + OPUS</td><td><a href="https://huggingface.co/datasets/projecte-aina/ES-AST_Parallel_Corpus">https://huggingface.co/datasets/projecte-aina/ES-AST_Parallel_Corpus</a></td><td>ES-AST Parallel Corpus</td></tr>
<tr><td>Spanish-Aranese</td><td>419,908</td><td>Synthetic corpus + OPUS</td><td><a href="https://huggingface.co/datasets/projecte-aina/ES-OC_Parallel_Corpus">https://huggingface.co/datasets/projecte-aina/ES-OC_Parallel_Corpus</a></td><td>ES-OC Parallel Corpus</td></tr>
<tr><td>Spanish-Valencian</td><td>2,162,451</td><td>BOUA + DOGV + BOUMH + Generalitat Valenciana + Les Corts Valencianes</td><td><a href="https://huggingface.co/datasets/BSC-LT/Spanish-Valencian_Catalan_Parallel_Corpus">https://huggingface.co/datasets/BSC-LT/Spanish-Valencian_Catalan_Parallel_Corpus</a></td><td>Spanish-Valencian Catalan Parallel Corpus</td></tr>
<tr><td>Valencian-Spanish</td><td>120,281</td><td>Universitat Jaume I</td><td><a href="https://huggingface.co/datasets/gplsi/uji_parallel_va_es">https://huggingface.co/datasets/gplsi/uji_parallel_va_es</a></td><td>UJI_PARALLEL_VA_ES Dataset</td></tr>
<tr><td>Valencian-Spanish</td><td>8,759,238</td><td>Diari Oficial de la Generalitat Valenciana</td><td><a href="https://huggingface.co/datasets/gplsi/dogv_parallel">https://huggingface.co/datasets/gplsi/dogv_parallel</a></td><td>DOGV_PARALLEL Dataset</td></tr>
<tr><td>Valencian-Spanish</td><td>738,777</td><td>Associació de Mitjans d'Informació i Comunicació</td><td><a href="https://huggingface.co/datasets/gplsi/amic_parallel">https://huggingface.co/datasets/gplsi/amic_parallel</a></td><td>AMIC_PARALLEL Dataset</td></tr>
<tr><td>Valencian-Spanish</td><td>357,518</td><td>Boletín Oficial de la Universidad de Alicante</td><td><a href="https://huggingface.co/datasets/gplsi/boua_parallel">https://huggingface.co/datasets/gplsi/boua_parallel</a></td><td>BOUA_PARALLEL Dataset</td></tr>
<tr><td>Valencian-English</td><td>43,107</td><td>Universitat Jaume I</td><td><a href="https://huggingface.co/datasets/gplsi/uji_parallel_va_en">https://huggingface.co/datasets/gplsi/uji_parallel_va_en</a></td><td>UJI_PARALLEL_VA_EN Dataset</td></tr>
<tr><td>Spanish-Catalan</td><td>1,958</td><td>Common Voice</td><td><a href="https://huggingface.co/datasets/gplsi/ES-CA_translation_test">https://huggingface.co/datasets/gplsi/ES-CA_translation_test</a></td><td>ES-CA_alignment_test Dataset</td></tr>
<tr><td>Spanish-Valencian</td><td>1,958</td><td>Common Voice</td><td><a href="https://huggingface.co/datasets/gplsi/ES-VA_translation_test">https://huggingface.co/datasets/gplsi/ES-VA_translation_test</a></td><td>ES-VA_alignment_test Dataset</td></tr>
<tr><td>Catalan-Valencian</td><td>1,958</td><td>Common Voice</td><td><a href="https://huggingface.co/datasets/gplsi/CA-VA_alignment_test">https://huggingface.co/datasets/gplsi/CA-VA_alignment_test</a></td><td>CA-VA_alignment_test Dataset</td></tr>
<tr><td>English-Spanish</td><td>35,753,765</td><td>Various sources from the legal-administrative, biomedical, and heritage domains</td><td><a href="https://huggingface.co/datasets/SINAI/ALIA-parallel-translation">https://huggingface.co/datasets/SINAI/ALIA-parallel-translation</a></td><td>ALIA-parallel-translation</td></tr>
<tr><td>English-Spanish</td><td>288,955 documents</td><td>Various heritage-domain sources</td><td><a href="https://huggingface.co/datasets/SINAI/ALIA-heritage-parallel-translation">https://huggingface.co/datasets/SINAI/ALIA-heritage-parallel-translation</a></td><td>ALIA-heritage-parallel-translation</td></tr>
<tr><td>English-Spanish-Basque</td><td>137,726</td><td>Berria (synthetic)</td><td><a href="https://huggingface.co/datasets/HiTZ/ALIA_syntethic_MT">https://huggingface.co/datasets/HiTZ/ALIA_syntethic_MT</a></td><td>ALIA synthetic MT</td></tr>
<tr><td>Spanish-Galician</td><td>8,800 bilingual sentence pairs</td><td>CORPES, CORGA + synthetic</td><td><a href="https://huggingface.co/datasets/proxectonos/corpus_paralelo_idioms">https://huggingface.co/datasets/proxectonos/corpus_paralelo_idioms</a></td><td>Spanish–Galician Idiom Parallel Corpus</td></tr>
<tr><td>Spanish-Galician and English-Galician</td><td>300,000 aligned sentences</td><td>SCIELO</td><td><a href="https://huggingface.co/datasets/proxectonos/SciELO-GL">https://huggingface.co/datasets/proxectonos/SciELO-GL</a></td><td>SCIELO corpus</td></tr>
<tr><td>Spanish-Galician</td><td>320,000 aligned sentence pairs</td><td>Directorate-General for Translation</td><td><a href="https://huggingface.co/datasets/proxectonos/DGT-GL">https://huggingface.co/datasets/proxectonos/DGT-GL</a></td><td>DGT corpus</td></tr>
<tr><td>Galician, Portuguese, Spanish, Catalan, Basque, English</td><td>190,000 aligned sentence pairs</td><td>TowerBlocks</td><td><a href="https://huggingface.co/datasets/proxectonos/Finetuning-MT">https://huggingface.co/datasets/proxectonos/Finetuning-MT</a></td><td>Finetuning-MT</td></tr>
</tbody></table>
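
Every corpus in the table is published as a Hugging Face dataset, so one plausible way to pull a corpus into a training pipeline is the `datasets` library. The sketch below is illustrative only and is not taken from the ALIA documentation: `repo_id_from_url` and `load_corpus` are hypothetical helper names, the `train` split is an assumption, and some repositories may require a configuration name or access approval that this page does not describe.

```python
from urllib.parse import urlparse


def repo_id_from_url(url: str) -> str:
    """Extract the '<org>/<name>' repository ID from a Hugging Face dataset URL."""
    path = urlparse(url).path
    prefix = "/datasets/"
    if not path.startswith(prefix):
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return path[len(prefix):].strip("/")


def load_corpus(url: str, split: str = "train", streaming: bool = True):
    """Load one corpus from the table by its URL (hypothetical helper).

    Assumption: the split is called 'train'; check the dataset card first.
    Streaming avoids downloading multi-million-sentence corpora up front.
    """
    # Imported lazily so the URL helper works without the third-party package.
    from datasets import load_dataset

    return load_dataset(repo_id_from_url(url), split=split, streaming=streaming)


if __name__ == "__main__":
    url = "https://huggingface.co/datasets/projecte-aina/CA-GL_Parallel_Corpus"
    print(repo_id_from_url(url))  # projecte-aina/CA-GL_Parallel_Corpus
```

With `datasets` installed and network access, `next(iter(load_corpus(url)))` should yield the first record. Field names differ between corpora, so inspect one record before writing any preprocessing code.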

***

### Corpora for adapting and evaluating machine translation models

<table data-view="cards"><thead><tr><th>Languages</th><th data-type="number">No. of sentences</th><th>Data source</th><th>Available at</th><th>Name</th></tr></thead><tbody>
<tr><td>Multilingual</td><td>742,183</td><td>European academic repositories</td><td><a href="https://huggingface.co/datasets/BSC-LT/ACAData">https://huggingface.co/datasets/BSC-LT/ACAData</a></td><td>ACAData</td></tr>
<tr><td>Spanish, Catalan, Basque, English</td><td>518</td><td>FLORES</td><td><a href="https://huggingface.co/datasets/HiTZ/flores_plus_gender">https://huggingface.co/datasets/HiTZ/flores_plus_gender</a></td><td>FLORES+G</td></tr>
<tr><td>Basque</td><td>1,827</td><td>WinoMT</td><td><a href="https://huggingface.co/datasets/HiTZ/winomteus">https://huggingface.co/datasets/HiTZ/winomteus</a></td><td>WinoMTeus</td></tr>
<tr><td>Spanish-Galician</td><td>13,198</td><td></td><td><a href="https://huggingface.co/datasets/proxectonos/corpus_paralelo_idioms">https://huggingface.co/datasets/proxectonos/corpus_paralelo_idioms</a></td><td>Corpus paralelo idioms</td></tr>
<tr><td>Spanish-Galician</td><td>13.6</td><td></td><td><a href="https://huggingface.co/datasets/proxectonos/erros_sistematicos_traducion_es_gl">https://huggingface.co/datasets/proxectonos/erros_sistematicos_traducion_es_gl</a></td><td>Errores sistemáticos traducción</td></tr>
</tbody></table>

