
ALIA 40B pre-training datasets

List of datasets, with references, used in the pre-training of the ALIA 40B model.

The training corpora are listed below:

Corpus

Languages

Link

Colossal OSCAR 1.0

bg, ca, cs, cy, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sh, sk, sl, sr, sv, uk

Aya Dataset (w/o Evaluation Suite)

eu, hr, nl, fi, ka, hu, lt, nn, ro, sk, lv, cy, bg, cs, en, fr, de, ga, mt, pl, ru, sl, sv, ca, da, et, gl, el, it, no, pt, sr, es, uk

Wikimedia dumps

bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, pl, pt, ro, sh, sk, sl, sr, uk

OpenSubtitles v2016

bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, gl, hr, it, lt, lv, nl, no, pl, pt, ro, sk, sl, sr, sv, uk

EurLEX-Resources

bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv

MC4-Legal

bg, cs, da, de, el, en, es, et, fi, fr, ga, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv

ParlaMint

at, bg, cz, dk, ee, es, es-ga, fi, fr, gb, gr, hr, hu, it, lv, nl, no, pl, pt, rs, se, si

MaCoCu

bg, ca, el, hr, mt, sl, sr, uk

CURLICAT

bg, hr, hu, pl, ro, sk, sl

Norwegian Colossal Corpus (NCC)

nn, no

Academic Slovene KAS 2.0

sl

BIGPATENT

en

Biomedical-ES

es

Brazilian Portuguese Web as Corpus (BrWaC)

pt

Bulgarian National Corpus (BulNC)

bg

CaBeRnet

fr

CATalog 1.0

ca

CorpusNÓS

gl

Croatian Web as Corpus 2.1 (hrWaC)

hr

DaNewsroom

da

Danish GigaWord

da

Dolmino-mix-1124 (subset without synthetically generated data or proprietary licenses)

en

DK-CLARIN Reference Corpus of General Danish

da

Estonian National Corpus 2021 (ENC)

et

Estonian Reference Corpus (ERC)

et

EusCrawl (w/o Wikipedia or NC-licenses)

eu

FineWeb-Edu (350BT subset)

en

Fineweb2 (ad hoc subset of 178BT)

ar, as, bg, ca, cs, cy, da, de, el, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sk, sl, sr, sv, uk

French Public Domain Books (French-PD)

fr

French Public Domain Newspapers (French-PD)

fr

German Web as Corpus (DeWaC)

de

Greek Legal Code (GLC)

el

Greek Web Corpus (GWC)

el

HPLT v1 - Spanish

es

HPLT v1.1 - Spanish

es

Irish Universal Dependencies (Ga-UD)

ga

Italian Web as Corpus (ItWaC)

it

Korpus Malti

mt

Korpus slovenských právnych predpisov v1.9 (SK-Laws)

sk

Latxa Corpus v1.1 (GAITU)

eu

Laws and legal acts of Ukraine (UK-Laws)

uk

Legal-ES

es

MARCELL Romanian legislative subcorpus v2

ro

Math AMPS

en

NKJP National Corpus of Polish v1.2 (NKJP)

pl

Occitan Corpus (IEA-AALO)

oc

Data downloaded from the institute's website under an agreement; not published.

Open Legal Data - German court decisions and laws

de

Download link is outdated.

ParlamentoPT

pt

peS2o

en

PG-19

en

Pile of Law (selected subsets)

en

Polish Parliamentary Corpus (PPC)

pl

Proof Pile

en

RedPajama-Data T1 (StackExchange subset)

en

Scientific-ES

es

SK Court Decisions v2.0 (OD-Justice)

sk

Slovene Web as Corpus (slWaC)

sl

SoNaR Corpus NC 1.2

nl

Spanish Legal Domain Corpora (Spanish-Legal)

es

SrpKorSubset: news, legal, academic, conversation, literary (SrpKor)

sr

Starcoder

code

State-related content from the Latvian Web (State-Latvian-Web)

lv

SYN v9: large corpus of written Czech

cs

Tagesschau Archive Article

de

The Danish Parliament Corpus 2009 - 2017, v1

da

The Gaois bilingual corpus of English-Irish legislation (Ga-Legislation)

ga

The Pile (PhilPapers)

en

The Swedish Culturomics Gigaword Corpus (Swedish-Gigaword)

sv

Welsh-GOV

cy

Yle Finnish News Archive (Yle-News)
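
Because each corpus is listed with a comma-separated set of language codes, the table can be inverted to answer "which corpora cover language X". The following is a minimal sketch under stated assumptions: the `rows` dictionary copies only four rows from the table by hand, and the helper name `index_by_language` is hypothetical, not part of any ALIA tooling.

```python
from collections import defaultdict

# A hand-copied subset of the table above (corpus name -> language codes).
rows = {
    "MaCoCu": "bg, ca, el, hr, mt, sl, sr, uk",
    "CURLICAT": "bg, hr, hu, pl, ro, sk, sl",
    "Norwegian Colossal Corpus (NCC)": "nn, no",
    "CATalog 1.0": "ca",
}


def index_by_language(rows):
    """Invert corpus -> codes into code -> list of corpora (hypothetical helper)."""
    index = defaultdict(list)
    for corpus, codes in rows.items():
        for code in codes.split(","):
            index[code.strip()].append(corpus)
    return dict(index)


by_lang = index_by_language(rows)
print(by_lang["ca"])
print(by_lang["sl"])
```

Note that some rows, such as ParlaMint, list country codes rather than language codes, so a full index would need those rows handled separately.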


Last updated 1 month ago

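Links, listed in the same order as the corpora above. Scientific-ES is a compilation of open-access repositories and accounts for five consecutive links (dialnet.unirioja.es through docta.ucm.es):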
https://huggingface.co/datasets/oscar-corpus/colossal-oscar-1.0
https://huggingface.co/datasets/CohereForAI/aya_dataset
https://dumps.wikimedia.org/
https://huggingface.co/datasets/Helsinki-NLP/open_subtitles
https://huggingface.co/datasets/joelniklaus/eurlex_resources
https://huggingface.co/datasets/joelniklaus/legal-mc4
https://clarin-eric.github.io/ParlaMint/
https://macocu.eu/
https://curlicat-project.eu/
https://github.com/NbAiLab/notram/blob/master/guides/corpus_description.md
https://www.clarin.si/repository/xmlui/handle/11356/1448
https://huggingface.co/datasets/NortheasternUniversity/big_patent
https://zenodo.org/records/4561971
https://huggingface.co/datasets/dominguesm/brwac
http://old.dcl.bas.bg/dataset/BulNC.7z
https://aclanthology.org/2020.cmlc-1.3/
https://huggingface.co/datasets/projecte-aina/CATalog
https://zenodo.org/records/11655219
https://clarin.si/repository/xmlui/handle/11356/1064
https://github.com/danielvarab/da-newsroom
https://huggingface.co/datasets/danish-foundation-models/danish-gigaword
https://huggingface.co/datasets/allenai/dolmino-mix-1124
https://korpus.dsl.dk/clarin/
https://metashare.ut.ee/repository/search/?q=estonian%20national%20corpus
https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-1068
https://huggingface.co/datasets/HiTZ/euscrawl
https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
https://huggingface.co/datasets/HuggingFaceFW/fineweb-2
https://huggingface.co/datasets/PleIAs/French-PD-Books
https://huggingface.co/datasets/PleIAs/French-PD-Newspapers
https://wacky.sslmit.unibo.it/doku.php?id=seed_urls
https://huggingface.co/datasets/AI-team-UoA/greek_legal_code
http://nlp.polytechnique.fr/resources-greek
https://hplt-project.org/datasets/v1
https://hplt-project.org/datasets/v1.1
https://universaldependencies.org/ga/
https://wacky.sslmit.unibo.it/doku.php?id=seed_urls
https://huggingface.co/datasets/MLRS/korpus_malti
https://www.juls.savba.sk/data.html
https://huggingface.co/datasets/HiTZ/latxa-corpus-v1.1
https://lang.org.ua/en/corpora/#anchor7
https://aclanthology.org/2020.lt4gov-1.6/
https://elrc-share.eu/repository/browse/marcell-romanian-legislative-subcorpus-v2/2da548428b9d11eb9c1a00155d026706ce94a6b59ffc4b0e9fb5cd9cebe6889e/
https://github.com/hendrycks/math
https://nkjp.pl/index.php?page=0&lang=1
https://www.institutestudisaranesi.cat/
https://openlegaldata.io/
https://huggingface.co/datasets/PORTULAN/parlamento-pt
https://huggingface.co/datasets/allenai/peS2o
https://huggingface.co/datasets/deepmind/pg19
https://huggingface.co/datasets/pile-of-law/pile-of-law
https://clip.ipipan.waw.pl/PPC
https://huggingface.co/datasets/hoskinson-center/proof-pile
https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
https://dialnet.unirioja.es/
https://scielo.isciii.es/scielo.php
https://revistas.csic.es/
https://www.tesisenred.net
https://docta.ucm.es/home
https://www.juls.savba.sk/data/od-justice/od-justice-2.0.ver.xz
https://www.sketchengine.eu/slwac-slovenian-corpus-from-the-web/
https://elrc-share.eu/repository/browse/sonar-corpus/9735a54f1f9111e7bfe700155d020502b917ac3b8c8844e19665914d110e94d1/
https://zenodo.org/records/5495529
http://metashare.elda.org/repository/browse/corpus-of-contemporary-serbian/00cc41168bdf11e29c9e0015171445924cdac8693bf840f780418187133495b8/
https://huggingface.co/datasets/bigcode/starcoderdata
https://catalog.elra.info/en-us/repository/browse/ELRA-W0169/
https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-4635
https://huggingface.co/datasets/bjoernp/tagesschau-2018-2023
https://repository.clarin.dk/repository/xmlui/handle/20.500.12115/8
https://portulanclarin.net/repository/browse/the-gaois-bilingual-corpus-of-english-irish-legislation-processed/daeac17c9e3511ea9b7f02420a000407b83de243dc0b469aab41084386c5b80f/
https://github.com/thoppe/The-Pile-PhilPapers
https://spraakbanken.gu.se/en/resources/gigaword
Crawled from: https://www.llyw.cymru/
https://www.kielipankki.fi/download/YLE/fi/2019-2020-src/
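
Many of the links above point to dataset repositories on the Hugging Face Hub, where the `org/name` part of the URL serves as the repository identifier. As a minimal sketch, assuming only that URL layout, the identifier can be extracted with the standard library; the helper name `hf_dataset_id` is hypothetical, and the three sample links are copied from the list above.

```python
from typing import Optional
from urllib.parse import urlparse


def hf_dataset_id(url: str) -> Optional[str]:
    """Return 'org/name' for a huggingface.co dataset URL, or None otherwise."""
    parsed = urlparse(url)
    if parsed.netloc != "huggingface.co":
        return None
    # Path looks like /datasets/<org>/<name>; drop empty segments.
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) >= 3 and parts[0] == "datasets":
        return f"{parts[1]}/{parts[2]}"
    return None


# Three links copied from the list above.
links = [
    "https://huggingface.co/datasets/oscar-corpus/colossal-oscar-1.0",
    "https://dumps.wikimedia.org/",
    "https://huggingface.co/datasets/HiTZ/latxa-corpus-v1.1",
]

hub_ids = [i for i in map(hf_dataset_id, links) if i is not None]
print(hub_ids)
```

Whether a given repository can be downloaded without extra configuration or access approval is not stated in this listing, so the sketch stops at extracting identifiers.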