ALIA 40B pre-training datasets
List of datasets, with references, used in pre-training the ALIA 40B model.
The training corpora are listed below:
| Corpus | Languages | Link |
| --- | --- | --- |
| Colossal OSCAR 1.0 | bg, ca, cs, cy, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sh, sk, sl, sr, sv, uk | |
| Aya Dataset (w/o Evaluation Suite) | eu, hr, nl, fi, ka, hu, lt, nn, ro, sk, lv, cy, bg, cs, en, fr, de, ga, mt, pl, ru, sl, sv, ca, da, et, gl, el, it, no, pt, sr, es, uk | |
| Wikimedia dumps | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, pl, pt, ro, sh, sk, sl, sr, uk | |
| OpenSubtitles v2016 | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, gl, hr, it, lt, lv, nl, no, pl, pt, ro, sk, sl, sr, sv, uk | |
| EurLEX-Resources | bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv | |
| MC4-Legal | bg, cs, da, de, el, en, es, et, fi, fr, ga, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv | |
| Parlamint | at, bg, cz, dk, ee, es, es-ga, fi, fr, gb, gr, hr, hu, it, lv, nl, no, pl, pt, rs, se, si | |
| MaCoCu | bg, ca, el, hr, mt, sl, sr, uk | |
| CURLICAT | bg, hr, hu, pl, ro, sk, sl | |
| Norwegian Colossal Corpus (NCC) | nn, no | |
| Academic Slovene KAS 2.0 | sl | |
| BIGPATENT | en | |
| Biomedical-ES | es | |
| Brazilian Portuguese Web as Corpus (BrWaC) | pt | |
| Bulgarian National Corpus (BulNC) | bg | |
| CaBeRnet | fr | |
| CATalog 1.0 | ca | |
| CorpusNÓS | gl | |
| Croatian Web as Corpus 2.1 (hrWaC) | hr | |
| DaNewsroom | da | |
| Danish GigaWord | da | |
| Dolmino-mix-1124 (subset without synthetically generated data and proprietary licenses) | en | |
| DK-CLARIN Reference Corpus of General Danish | da | |
| Estonian National Corpus 2021 (ENC) | et | |
| Estonian Reference Corpus (ERC) | et | |
| EusCrawl (w/o Wikipedia or NC-licenses) | eu | |
| FineWeb-Edu (350BT subset) | en | |
| Fineweb2 (ad hoc subset of 178BT) | ar, as, bg, ca, cs, cy, da, de, el, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sk, sl, sr, sv, uk | |
| French Public Domain Books (French-PD) | fr | |
| French Public Domain Newspapers (French-PD) | fr | |
| German Web as Corpus (DeWaC) | de | |
| Greek Legal Code (GLC) | el | |
| Greek Web Corpus (GWC) | el | |
| HPLT v1 - Spanish | es | |
| HPLT v1.1 - Spanish | es | |
| Irish Universal Dependencies (Ga-UD) | ga | |
| Italian Web as Corpus (ItWaC) | it | |
| Korpus Malti | mt | |
| Korpus slovenských právnych predpisov v1.9 (SK-Laws) | sk | |
| Latxa Corpus v1.1 (GAITU) | eu | |
| Laws and legal acts of Ukraine (UK-Laws) | uk | |
| Legal-ES | es | |
| MARCELL Romanian legislative subcorpus v2 | ro | |
| Math AMPS | en | |
| NKPJ National Corpus of Polish v1.2 (NKPJ) | pl | |
| Occitan Corpus (IEA-AALO) | oc | Data downloaded from the institute's website under an agreement; not published. |
| Open Legal Data - German court decisions and laws | de | Download link outdated. |
| ParlamentoPT | pt | |
| peS2o | en | |
| PG-19 | en | |
| Pile of Law (selected subsets) | en | |
| Polish Parliamentary Corpus (PPC) | pl | |
| Proof Pile | en | |
| RedPajama-Data T1 (StackExchange subset) | en | |
| Scientific-ES | es | |
| SK Court Decisions v2.0 (OD-Justice) | sk | |
| Slovene Web as Corpus (slWaC) | sl | |
| SoNaR Corpus NC 1.2 | nl | |
| Spanish Legal Domain Corpora (Spanish-Legal) | es | |
| SrpKorSubset: news, legal, academic, conversation, literary (SrpKor) | sr | |
| Starcoder | code | |
| State-related content from the Latvian Web (State-Latvian-Web) | lv | |
| SYN v9: large corpus of written Czech | cs | |
| Tagesschau Archive Article | de | |
| The Danish Parliament Corpus 2009 - 2017, v1 | da | |
| The Gaois bilingual corpus of English-Irish legislation (Ga-Legislation) | ga | |
| The Pile (PhilPapers) | en | |
| The Swedish Culturomics Gigaword Corpus (Swedish-Gigaword) | sv | |
| Welsh-GOV | cy | |
| Yle Finnish News Archive (Yle-News) | fi | |
Last updated