Speaker
Maximilian Idahl
(L3S)
Description
We introduce DOSMo-7B, an open 7-billion-parameter large language model (LLM) trained on 1T tokens of exclusively German text. DOSMo-7B uses the same architecture as Mistral-7B, paired with a custom tokenizer that maximizes encoding efficiency for German text. In contrast to existing approaches, which typically improve the German capabilities of LLMs through continued pretraining, we pretrain from scratch to explore the potential of training LLMs on German text alone. In this technical report, we describe our approach to dataset creation, training, and evaluation of DOSMo-7B.
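To make the notion of encoding efficiency concrete, the following sketch measures a tokenizer's fertility (average tokens per whitespace-separated word) on a German sample. It is an illustration only: it uses the public Mistral-7B tokenizer as a baseline via the Hugging Face transformers library; the DOSMo-7B tokenizer itself is not referenced, and the sample sentence is an arbitrary example.

# Sketch: fertility = tokens produced per whitespace-separated word.
# Lower fertility means fewer tokens per word, i.e. more efficient encoding.
from transformers import AutoTokenizer

german_sample = (
    "Die Sprachmodelle werden ausschliesslich auf deutschsprachigen Texten "
    "trainiert, um die Kodierungseffizienz zu untersuchen."
)

def fertility(tokenizer_name: str, text: str) -> float:
    """Average number of tokens per whitespace-separated word for a given tokenizer."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    num_tokens = len(tokenizer.encode(text, add_special_tokens=False))
    num_words = len(text.split())
    return num_tokens / num_words

# Public baseline tokenizer; a German-specific tokenizer would be compared the same way.
baseline = fertility("mistralai/Mistral-7B-v0.1", german_sample)
print(f"Mistral-7B tokenizer fertility on German sample: {baseline:.2f}")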
Primary author
Maximilian Idahl
(L3S)