Open Sourcing German BERT
Insights into pre-training BERT from scratch
Today we are excited to open source our German BERT model, trained from scratch, which significantly outperforms the Google multilingual model on 4 of 5 downstream NLP tasks. The model is publicly available in several formats: a TensorFlow version, a PyTorch version, and the vocab file.

In this post we compare the performance of our German model against the multilingual model, and share insights we gained along the way.
Update: We also released our new transfer learning framework FARM. Check it out for a simple one-click evaluation and adaptation of GermanBERT: https://github.com/deepset-ai/FARM
Why a German BERT Model?
Although the multilingual models released by Google have larger vocabularies (> 100k tokens) and cover quite a lot of German text, we noticed their limitations. In particular, when words are chunked into small parts, we believe the model will have a difficult time making sense of the individual chunks. Check out the following hand-picked example.
Comparison of multilingual (cased + uncased) vs German tokenization in BERT
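To make the chunking effect concrete, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece-style tokenizers apply to a single word. The two vocabularies are tiny hypothetical examples invented for illustration. They are not taken from the multilingual or the German model, and the function is a simplification of what a real BERT tokenizer does.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies (assumptions, not the real model vocabs).
toy_multilingual_vocab = {"Re", "##cht", "##san", "##wal", "##t"}
toy_german_vocab = {"Rechts", "##anwalt"}

print(wordpiece_tokenize("Rechtsanwalt", toy_multilingual_vocab))
# ['Re', '##cht', '##san', '##wal', '##t']
print(wordpiece_tokenize("Rechtsanwalt", toy_german_vocab))
# ['Rechts', '##anwalt']
```

With the toy German vocabulary the compound splits at a meaningful boundary, while the other toy vocabulary yields five short fragments that carry little meaning on their own.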
Pre-training details
  • We trained using Google's TensorFlow code on a single Cloud TPU v2 with standard settings.
  • We trained for 810k steps with a batch size of 1024 at sequence length 128, followed by 30k steps at sequence length 512. Training took about 9 days.
  • As training data we used the latest German Wikipedia dump (6 GB of raw text files), the OpenLegalData dump (2.4 GB) and news articles (3.6 GB).
  • We cleaned the data dumps with tailored scripts and segmented sentences with spaCy v2.1. We used the recommended SentencePiece library to create the word piece vocabulary, and the TensorFlow scripts to convert the text into TFRecords usable by BERT.
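As a rough illustration of the data preparation step: Google's BERT pre-training script expects plain text with one sentence per line and an empty line between documents. The sketch below produces that layout. The regex-based splitter is only a naive stand-in for the spaCy segmentation we actually used, and the function names are hypothetical.

```python
import re


def naive_split_sentences(text):
    """Very rough stand-in for a real sentence segmenter such as spaCy."""
    # Split on whitespace that follows sentence-final punctuation.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def to_pretraining_text(documents):
    """Format documents as one sentence per line, blank line between docs."""
    blocks = []
    for doc in documents:
        sentences = naive_split_sentences(doc)
        if sentences:  # skip empty documents
            blocks.append("\n".join(sentences))
    return "\n\n".join(blocks) + "\n"


docs = [
    "Das ist ein Satz. Hier ist noch einer!",
    "Zweites Dokument.",
]
print(to_pretraining_text(docs), end="")
```

A real pipeline would swap in a proper segmenter, since a regex like this breaks on abbreviations, dates and similar cases that are common in legal and news text.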

Evaluating performance on downstream tasks
There does not seem to be any consensus in the community about when to stop pre-training or how to interpret the loss coming from BERT's self-supervision. We followed the approach of BERT's original authors and evaluated model performance on downstream tasks. As tasks we gathered the following German datasets:

  • germEval18Fine: Macro F1 score for multiclass sentiment classification
  • germEval18Coarse: Macro F1 score for binary sentiment classification
  • germEval14: Seq F1 score for NER (file names deuutf.*)
  • CONLL03: Seq F1 score for NER
  • 10kGNAD: Accuracy for document classification
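
For readers less familiar with the classification metrics above, here is a minimal pure-Python sketch of macro F1: the F1 score is computed per class and then averaged without weighting by class frequency. The labels and predictions are made-up toy data, not results from our experiments, and this is not the evaluation code we ran.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: a classifier that always predicts the majority class "B".
y_true = ["A", "B", "B", "B"]
y_pred = ["B", "B", "B", "B"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                  # 0.75
print(macro_f1(y_true, y_pred))  # (0 + 6/7) / 2, roughly 0.43
```

The toy case shows why macro F1 is a useful choice for imbalanced label sets: always predicting the majority class still reaches 75% accuracy, but the missed minority class pulls macro F1 well below that.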

Even without thorough hyperparameter tuning, we observed quite stable learning, especially for our German model. Multiple restarts with different seeds produced very similar results.

While our German BERT model outperformed the multilingual model on 4 out of 5 tasks, we wondered why it did not do so on CONLL03-de. So we compared English BERT with the multilingual model on CONLL03-en and found that they perform similarly there as well.

We further evaluated checkpoints from different points during the 9 days of pre-training and were astonished by how fast the model converges to the maximally reachable performance. We ran all 5 downstream tasks on 7 different model checkpoints, taken between 0 and 840k training steps (x-axis in the figure below). Most checkpoints are taken from early training, where we expected most of the performance changes. Surprisingly, even a randomly initialized BERT can be trained only on labeled downstream datasets and reach good performance (blue line, GermEval 2018 Coarse task, 795 kB trainset size).

Evaluation on different model checkpoints. Note the non-linear x-axis scaling.
If you have any questions, want to share your own insights into pre-training from scratch, or are using our BERT model, feel free to reach out to us at engage@deepset.ai
This research was conducted in equal parts by Branden Chan, Timo Möller, Malte Pietsch, Tanay Soni and Chin Man Yeung. Thumbs up for all the good discussions and productive coding sessions!