German Word Embeddings

Pretrained and dockerized GloVe, Word2Vec & fastText

We are sharing our code for easily training word embeddings, together with the German embeddings derived from the Wikipedia corpus.

At deepset, we are passionate supporters and active members of the open-source community. We believe that in the field of machine learning, being open and transparent about our findings is the only way to go, and that this in turn leads toward innovative, transparent, and responsible AI.

To the best of our knowledge, these are the first German GloVe embeddings to be published.

Features

  • Dockerized models with straightforward configuration via docker-compose.yml, allowing simple training on EC2 (see the sketch after this list)
  • Preprocessing of the Wikipedia corpus
  • Crawling, preprocessing, and mixing of different corpora (coming soon)
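
As a rough illustration, a single training service in docker-compose.yml could be configured as below; the service name, build context, mounted paths, and environment variables are hypothetical placeholders, not the repository's actual configuration:

```yaml
# Hypothetical docker-compose.yml sketch for one embedding-training service.
# All names and variables here are illustrative assumptions.
version: "3"
services:
  glove-training:
    build: ./glove              # assumed build context containing the GloVe Dockerfile
    volumes:
      - ./data:/data            # mount the preprocessed corpus and output directory
    environment:
      - CORPUS=/data/wiki.txt   # plain-text training corpus
      - VECTOR_SIZE=300         # embedding dimensionality
```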

Trained embeddings (German Wikipedia)

GloVe, Word2Vec, and fastText embeddings

Dockerized training of German embeddings using the latest Wikipedia corpus or other preprocessed plain-text files.
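
As a usage sketch, the trained vectors can be loaded with gensim, assuming they are distributed in the standard word2vec text format; the file name below is a placeholder, not the actual download name:

```python
# Minimal sketch: load German embeddings with gensim.
# "german_vectors.txt" is a placeholder file name, not the actual artifact.
from gensim.models import KeyedVectors

# Load vectors stored in the plain-text word2vec format (one header line,
# then one word and its vector per line).
vectors = KeyedVectors.load_word2vec_format("german_vectors.txt", binary=False)

# Query nearest neighbours by cosine similarity.
print(vectors.most_similar("berlin", topn=5))
```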