DOSMo-7B: A Large Language Model Trained Exclusively on German Text

18 Sept 2024, 16:50
1h 30m
Emmy-Noether-Saal

Speaker

Maximilian Idahl (L3S)

Description

We introduce DOSMo-7B, an open 7-billion-parameter large language model (LLM) trained on 1T tokens of exclusively German text. DOSMo-7B uses the same architecture as Mistral-7B, paired with a custom tokenizer to maximize encoding efficiency for German text. In contrast to existing approaches, which typically improve the German skills of LLMs through continued pretraining, we pretrain from scratch to explore the potential of training LLMs on German text alone. In this technical report, we describe our approach to dataset creation, training, and evaluation of DOSMo-7B. A brief sketch of how tokenizer encoding efficiency can be compared on German text follows below.
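
The following is a minimal, illustrative sketch (not taken from the report) of one common way to compare tokenizer encoding efficiency on German text: counting the average number of tokens produced per word, where fewer tokens per word indicates a more efficient encoding. The model name and sample sentence are placeholder assumptions, and the Hugging Face transformers library is assumed to be available.

# Sketch: compare tokens-per-word ("fertility") of tokenizers on German text.
# Lower values mean the tokenizer encodes German more efficiently.
from transformers import AutoTokenizer

def tokens_per_word(tokenizer, text: str) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_words = len(text.split())
    n_tokens = len(tokenizer.encode(text, add_special_tokens=False))
    return n_tokens / max(n_words, 1)

# Placeholder German sample; in practice one would use a larger held-out corpus.
sample = ("Die Würde des Menschen ist unantastbar. Sie zu achten und zu "
          "schützen ist Verpflichtung aller staatlichen Gewalt.")

# Hypothetical comparison: substitute the tokenizers actually under study.
for name in ["mistralai/Mistral-7B-v0.1"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, round(tokens_per_word(tok, sample), 2))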

Primary author

Maximilian Idahl (L3S)

Presentation materials