ALIA 40B pre-training datasets

List of datasets, with references, used to pre-train the ALIA 40B model.

The training corpora are listed below:

| Corpus | Languages | Link |
| --- | --- | --- |
| Colossal OSCAR 1.0 | bg, ca, cs, cy, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sh, sk, sl, sr, sv, uk | https://huggingface.co/datasets/oscar-corpus/colossal-oscar-1.0 |
| Aya Dataset (w/o Evaluation Suite) | eu, hr, nl, fi, ka, hu, lt, nn, ro, sk, lv, cy, bg, cs, en, fr, de, ga, mt, pl, ru, sl, sv, ca, da, et, gl, el, it, no, pt, sr, es, uk | https://huggingface.co/datasets/CohereForAI/aya_dataset |
| Wikimedia dumps | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, pl, pt, ro, sh, sk, sl, sr, uk | https://dumps.wikimedia.org/ |
| OpenSubtitles v2016 | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, gl, hr, it, lt, lv, nl, no, pl, pt, ro, sk, sl, sr, sv, uk | https://huggingface.co/datasets/Helsinki-NLP/open_subtitles |
| EurLEX-Resources | bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv | https://huggingface.co/datasets/joelniklaus/eurlex_resources |
| MC4-Legal | bg, cs, da, de, el, en, es, et, fi, fr, ga, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv | https://huggingface.co/datasets/joelniklaus/legal-mc4 |
| ParlaMint | at, bg, cz, dk, ee, es, es-ga, fi, fr, gb, gr, hr, hu, it, lv, nl, no, pl, pt, rs, se, si | https://clarin-eric.github.io/ParlaMint/ |
| MaCoCu | bg, ca, el, hr, mt, sl, sr, uk | https://macocu.eu/ |
| CURLICAT | bg, hr, hu, pl, ro, sk, sl | https://curlicat-project.eu/ |
| Norwegian Colossal Corpus (NCC) | nn, no | https://github.com/NbAiLab/notram/blob/master/guides/corpus_description.md |
| Academic Slovene KAS 2.0 | | https://www.clarin.si/repository/xmlui/handle/11356/1448 |
| BIGPATENT | | https://huggingface.co/datasets/NortheasternUniversity/big_patent |
| Biomedical-ES | | https://zenodo.org/records/4561971 |
| Brazilian Portuguese Web as Corpus (BrWaC) | | https://huggingface.co/datasets/dominguesm/brwac |
| Bulgarian National Corpus (BulNC) | | http://old.dcl.bas.bg/dataset/BulNC.7z |
| CaBeRnet | | https://aclanthology.org/2020.cmlc-1.3/ |
| CATalog 1.0 | | https://huggingface.co/datasets/projecte-aina/CATalog |
| CorpusNÓS | | https://zenodo.org/records/11655219 |
| Croatian Web as Corpus 2.1 (hrWaC) | | https://clarin.si/repository/xmlui/handle/11356/1064 |
| DaNewsroom | | https://github.com/danielvarab/da-newsroom |
| Danish GigaWord | | https://huggingface.co/datasets/danish-foundation-models/danish-gigaword |
| Dolmino-mix-1124 (subset excluding synthetically generated data and proprietary licenses) | | https://huggingface.co/datasets/allenai/dolmino-mix-1124 |
| DK-CLARIN Reference Corpus of General Danish | | https://korpus.dsl.dk/clarin/ |
| Estonian National Corpus 2021 (ENC) | | https://metashare.ut.ee/repository/search/?q=estonian%20national%20corpus |
| Estonian Reference Corpus (ERC) | | https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-1068 |
| EusCrawl (w/o Wikipedia or NC-licenses) | | https://huggingface.co/datasets/HiTZ/euscrawl |
| FineWeb-Edu (350BT subset) | | https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu |
| FineWeb2 (ad hoc subset of 178BT) | ar, as, bg, ca, cs, cy, da, de, el, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sk, sl, sr, sv, uk | https://huggingface.co/datasets/HuggingFaceFW/fineweb-2 |
| French Public Domain Books (French-PD) | | https://huggingface.co/datasets/PleIAs/French-PD-Books |
| French Public Domain Newspapers (French-PD) | | https://huggingface.co/datasets/PleIAs/French-PD-Newspapers |
| German Web as Corpus (DeWaC) | | https://wacky.sslmit.unibo.it/doku.php?id=seed_urls |
| Greek Legal Code (GLC) | | https://huggingface.co/datasets/AI-team-UoA/greek_legal_code |
| Greek Web Corpus (GWC) | | http://nlp.polytechnique.fr/resources-greek |
| HPLT v1 - Spanish | | https://hplt-project.org/datasets/v1 |
| HPLT v1.1 - Spanish | | https://hplt-project.org/datasets/v1.1 |
| Irish Universal Dependencies (Ga-UD) | | https://universaldependencies.org/ga/ |
| Italian Web as Corpus (ItWaC) | | https://wacky.sslmit.unibo.it/doku.php?id=seed_urls |
| Korpus Malti | | https://huggingface.co/datasets/MLRS/korpus_malti |
| Korpus slovenských právnych predpisov v1.9 (SK-Laws) | | https://www.juls.savba.sk/data.html |
| Latxa Corpus v1.1 (GAITU) | | https://huggingface.co/datasets/HiTZ/latxa-corpus-v1.1 |
| Laws and legal acts of Ukraine (UK-Laws) | | https://lang.org.ua/en/corpora/#anchor7 |
| Legal-ES | | https://aclanthology.org/2020.lt4gov-1.6/ |
| MARCELL Romanian legislative subcorpus v2 | | https://elrc-share.eu/repository/browse/marcell-romanian-legislative-subcorpus-v2/2da548428b9d11eb9c1a00155d026706ce94a6b59ffc4b0e9fb5cd9cebe6889e/ |
| Math AMPS | | https://github.com/hendrycks/math |
| NKJP National Corpus of Polish v1.2 (NKJP) | | https://nkjp.pl/index.php?page=0&lang=1 |
| Occitan Corpus (IEA-AALO) | | https://www.institutestudisaranesi.cat/ (data downloaded from the institute's website under an agreement; not published) |
| Open Legal Data - German court decisions and laws | | https://openlegaldata.io/ (download link outdated) |
| ParlamentoPT | | https://huggingface.co/datasets/PORTULAN/parlamento-pt |
| peS2o | | https://huggingface.co/datasets/allenai/peS2o |
| PG-19 | | https://huggingface.co/datasets/deepmind/pg19 |
| Pile of Law (selected subsets) | | https://huggingface.co/datasets/pile-of-law/pile-of-law |
| Polish Parliamentary Corpus (PPC) | | https://clip.ipipan.waw.pl/PPC |
| Proof Pile | | https://huggingface.co/datasets/hoskinson-center/proof-pile |
| RedPajama-Data T1 (StackExchange subset) | | https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T |
| Scientific-ES | | Compilation of open-access repositories (https://dialnet.unirioja.es/, https://scielo.isciii.es/scielo.php, https://revistas.csic.es/, https://www.tesisenred.net/, https://docta.ucm.es/home) |
| SK Court Decisions v2.0 (OD-Justice) | | https://www.juls.savba.sk/data/od-justice/od-justice-2.0.ver.xz |
| Slovene Web as Corpus (slWaC) | | https://www.sketchengine.eu/slwac-slovenian-corpus-from-the-web/ |
| SoNaR Corpus NC 1.2 | | https://elrc-share.eu/repository/browse/sonar-corpus/9735a54f1f9111e7bfe700155d020502b917ac3b8c8844e19665914d110e94d1/ |
| Spanish Legal Domain Corpora (Spanish-Legal) | | https://zenodo.org/records/5495529 |
| SrpKorSubset: news, legal, academic, conversation, literary (SrpKor) | | http://metashare.elda.org/repository/browse/corpus-of-contemporary-serbian/00cc41168bdf11e29c9e0015171445924cdac8693bf840f780418187133495b8/ |
| Starcoder | code | https://huggingface.co/datasets/bigcode/starcoderdata |
| State-related content from the Latvian Web (State-Latvian-Web) | | https://catalog.elra.info/en-us/repository/browse/ELRA-W0169/ |
| SYN v9: large corpus of written Czech | | https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-4635 |
| Tagesschau Archive Article | | https://huggingface.co/datasets/bjoernp/tagesschau-2018-2023 |
| The Danish Parliament Corpus 2009 - 2017, v1 | | https://repository.clarin.dk/repository/xmlui/handle/20.500.12115/8 |
| The Gaois bilingual corpus of English-Irish legislation (Ga-Legislation) | | https://portulanclarin.net/repository/browse/the-gaois-bilingual-corpus-of-english-irish-legislation-processed/daeac17c9e3511ea9b7f02420a000407b83de243dc0b469aab41084386c5b80f/ |
| The Pile (PhilPapers) | | https://github.com/thoppe/The-Pile-PhilPapers |
| The Swedish Culturomics Gigaword Corpus (Swedish-Gigaword) | | https://spraakbanken.gu.se/en/resources/gigaword |
| Welsh-GOV | | Crawled from https://www.llyw.cymru/ |
| Yle Finnish News Archive (Yle-News) | | https://www.kielipankki.fi/download/YLE/fi/2019-2020-src/ |

