ALIA 40B pre-training datasets

List of datasets, with references, used for pre-training the ALIA 40B model.

The training corpora are listed below:

Corpus

Languages

Link

Colossal OSCAR 1.0

bg, ca, cs, cy, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sh, sk, sl, sr, sv, uk

Aya Dataset (w/o Evaluation Suite)

eu, hr, nl, fi, ka, hu, lt, nn, ro, sk, lv, cy, bg, cs, en, fr, de, ga, mt, pl, ru, sl, sv, ca, da, et, gl, el, it, no, pt, sr, es, uk

Wikimedia dumps

bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, pl, pt, ro, sh, sk, sl, sr, uk

OpenSubtitles v2016

bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, gl, hr, it, lt, lv, nl, no, pl, pt, ro, sk, sl, sr, sv, uk

EurLEX-Resources

bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv

MC4-Legal

bg, cs, da, de, el, en, es, et, fi, fr, ga, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv

Parlamint

at, bg, cz, dk, ee, es, es-ga, fi, fr, gb, gr, hr, hu, it, lv, nl, no, pl, pt, rs, se, si

MaCoCu

bg, ca, el, hr, mt, sl, sr, uk

CURLICAT

bg, hr, hu, pl, ro, sk, sl

Brazilian Portuguese Web as Corpus (BrWaC)

pt

Bulgarian National Corpus (BulNC)

bg

Croatian Web as Corpus 2.1 (hrWaC)

hr

Dolmino-mix-1124 (subset without synthetically generated data or proprietary licenses)

en

DK-CLARIN Reference Corpus of General Danish

da

EusCrawl (w/o Wikipedia or NC-licenses)

eu

Fineweb2 (ad hoc subset of 178BT)

ar, as, bg, ca, cs, cy, da, de, el, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sk, sl, sr, sv, uk

French Public Domain Books (French-PD)

fr

French Public Domain Newspapers (French-PD)

fr

Irish Universal Dependencies (Ga-UD)

ga

Korpus slovenských právnych predpisov v1.9 (SK-Laws)

sk

Laws and legal acts of Ukraine (UK-Laws)

uk

NKJP National Corpus of Polish v1.2 (NKJP)

pl

Occitan Corpus (IEA-AALO)

oc

Data downloaded from the institute's website under an agreement; not published.

https://www.institutestudisaranesi.cat/

Open Legal Data - German court decisions and laws

de

https://openlegaldata.io/

(download link outdated)

Polish Parliamentary Corpus (PPC)

pl

Spanish Legal Domain Corpora (Spanish-Legal)

es

State-related content from the Latvian Web (State-Latvian-Web)

lv

The Danish Parliament Corpus 2009 - 2017, v1

da

The Swedish Culturomics Gigaword Corpus (Swedish-Gigaword)

sv
