Open Sourcing German BERT
Insights into pre-training BERT from scratch
Today we are excited to open-source our German BERT model, trained from scratch, which significantly outperforms Google's multilingual model on 4 out of 5 downstream NLP tasks. The model is available for download (TF version, PyTorch version, vocab) and can be used directly from Hugging Face's PyTorch repository.
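For example, loading the model with Hugging Face's transformers library takes just a few lines. This is a minimal sketch; the model identifier below is the one the model was later published under on the model hub, so adjust it if your library version uses a different naming scheme.

```python
# Sketch: loading German BERT via Hugging Face's transformers library.
# The identifier "bert-base-german-cased" is the published model-hub id;
# treat it as an assumption if you are on an older library release.
MODEL_ID = "bert-base-german-cased"

def load_german_bert():
    # Imported inside the function so the sketch can be read without
    # transformers installed; calling it downloads the weights.
    from transformers import BertModel, BertTokenizer
    tokenizer = BertTokenizer.from_pretrained(MODEL_ID)
    model = BertModel.from_pretrained(MODEL_ID)
    return tokenizer, model
```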

In this post we compare the performance of our German model against the multilingual model, and share insights we gained along the way.
Why a German BERT Model?
Although the multilingual models released by Google have large vocabularies (> 100k tokens) and cover quite a lot of German text, we noticed their limitations. In particular, when words are chunked into many small subword pieces, we believe the model has a harder time making sense of the individual chunks. Check out the following hand-picked example.
Comparison of multilingual (cased + uncased) vs German tokenization in BERT
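To see why chunking matters, here is a minimal greedy longest-match-first WordPiece tokenizer run against two toy vocabularies: a "multilingual-like" one that lacks full German words, and a "German-like" one that contains them. Both vocabularies are invented for illustration and are not the real BERT vocabs.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece tokenization of a single word.
    Continuation pieces are prefixed with '##', as in BERT."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:          # no piece matched at all
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces

# Toy vocabularies, invented purely for illustration.
multilingual_vocab = {"Ta", "##g", "##es", "##ord", "##nung"}
german_vocab = {"Tagesordnung", "Tag", "##es", "##ordnung"}

print(wordpiece_tokenize("Tagesordnung", multilingual_vocab))
# → ['Ta', '##g', '##es', '##ord', '##nung']
print(wordpiece_tokenize("Tagesordnung", german_vocab))
# → ['Tagesordnung']
```

With the German-like vocabulary the compound survives as a single meaningful token, while the multilingual-like vocabulary shatters it into five fragments the model must reassemble.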
Pre-training details
  • We trained using Google's TensorFlow code on a single cloud TPU v2 with standard settings.
  • We trained 810k steps with a batch size of 1024 at sequence length 128, followed by 30k steps at sequence length 512. Training took about 9 days.
  • As training data we used the latest German Wikipedia dump (6 GB of raw txt files), the OpenLegalData dump (2.4 GB) and news articles (3.6 GB).
  • We cleaned the data dumps with tailored scripts and segmented sentences with spaCy v2.1. We used the recommended SentencePiece library to create the WordPiece vocabulary and TensorFlow scripts to convert the text into data usable by BERT.
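The self-supervision objective BERT is pre-trained with is masked language modelling: 15% of token positions are selected, and of those 80% are replaced with [MASK], 10% with a random vocabulary token, and 10% left unchanged. A minimal sketch of that corruption step (the example sentence and vocabulary are invented for illustration):

```python
import random

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """BERT-style masking: pick 15% of positions; of those,
    80% -> [MASK], 10% -> random vocab token, 10% -> unchanged.
    Returns the corrupted sequence and the prediction targets."""
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), n_to_mask)
    corrupted, targets = list(tokens), {}
    for pos in positions:
        targets[pos] = tokens[pos]      # model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[pos] = "[MASK]"
        elif roll < 0.9:
            corrupted[pos] = rng.choice(vocab)
        # else: leave the token unchanged (but still predict it)
    return corrupted, targets

tokens = "die Tagesordnung der heutigen Sitzung ist sehr lang".split()
corrupted, targets = mask_tokens(tokens, vocab=["der", "die", "das"],
                                 rng=random.Random(0))
```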

Evaluating performance on downstream tasks
There does not yet seem to be any consensus in the community about when to stop pre-training or how to interpret the loss from BERT's self-supervision. We took the approach of BERT's original authors and evaluated model performance on downstream tasks, for which we gathered the following German datasets:

  • germEval18Fine: Macro F1 score for multiclass sentiment classification
  • germEval18coarse: Macro F1 score for binary sentiment classification
  • germEval14: Seq F1 score for NER (file names deuutf.*)
  • CONLL03: Seq F1 score for NER
  • 10kGNAD: Accuracy for document classification
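The "Seq F1" scores above are entity-level F1: a predicted entity counts as correct only if both its span boundaries and its type exactly match a gold entity. A minimal sketch over BIO-tagged sequences (the tag sequences below are invented for illustration):

```python
def extract_spans(tags):
    """Collect (type, start, end) entity spans from a BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):   # sentinel "O" flushes the last span
        if tag.startswith("B-") or tag == "O":
            if etype is not None:
                spans.add((etype, start, i - 1))
                etype = None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
        elif tag.startswith("I-") and etype is None:
            start, etype = i, tag[2:]        # tolerate I- without a preceding B-
    return spans

def entity_f1(gold_tags, pred_tags):
    """Entity-level F1: exact span and type match required."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "I-PER", "O", "B-ORG", "O"]
print(entity_f1(gold, pred))   # one of two entities matches exactly -> 0.5
```

Note how strict this is: mislabeling the second entity's type costs both a false positive and a false negative, so the score drops to 0.5 even though the span boundaries were right.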

Even without thorough hyperparameter tuning, we observed quite stable learning, especially for our German model. Multiple restarts with different seeds produced quite similar results.

While our model outperformed on 4 out of 5 tasks, we wondered why the German BERT model did not outperform on CONLL03-de. So we compared English BERT with the multilingual model on CONLL03-en and found them to perform similarly as well.

We further evaluated the model at different points during the 9 days of pre-training and were astonished by how fast it converges to its maximum reachable performance. We ran all 5 downstream tasks on 7 different model checkpoints, starting with a completely untrained model. Most checkpoints were taken from early in training, where we expected the largest performance changes. Surprisingly, even a completely random BERT can be trained from scratch on small downstream datasets and nearly reach top performance (blue line, GermEval 2018 Coarse task, 795 kB train set size).

Evaluation on different model checkpoints. Note the non-linear x-axis scaling.
In the coming weeks we will also release a more in-depth article about our various experiments and insights, as well as code for running all downstream tasks. We are also working on training another model with more data - so stay tuned.

If you have any questions, want to share your own insights into pre-training from scratch, or are using our BERT model, feel free to reach out to us at engage@deepset.ai.
This research was conducted in equal parts by Branden Chan, Timo Möller, Malte Pietsch, Tanay Soni and Chin Man Yeung. Thumbs up for all the good discussions and productive coding sessions!