We at deepset are passionate supporters and active members of the open-source community. Especially in the field of machine learning, we value openness and believe it is the path towards innovative, transparent, and responsible AI.
As a small contribution, we are sharing our code for easily training word embeddings. In addition, we are publishing German embeddings trained on the Wikipedia corpus. As far as we know, these are the first published German GloVe embeddings.
Enjoy!

Code for Models
- Dockerized models with straightforward configuration via docker-compose.yml, allowing simple training on EC2
- Preprocessing of the Wikipedia corpus
- Crawling, preprocessing, and mixing of different corpora (coming soon)

Trained embeddings (German Wikipedia):
- GloVe: Vectors
- Word2Vec: Vectors (needs fastText installed)
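As a quick illustration of how the downloaded GloVe vectors could be used: they are stored in GloVe's plain-text format (one word per line, followed by its vector components), which can be parsed with the standard library alone. The sketch below is a minimal, hedged example; the three-word sample and its values are made up for demonstration and stand in for the real German Wikipedia vectors.

```python
import io
import math

def load_glove(file_obj):
    """Parse GloVe's plain-text format: one token per line,
    followed by its space-separated vector components."""
    vectors = {}
    for line in file_obj:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Tiny made-up sample standing in for the downloaded German vectors.
sample = io.StringIO(
    "hund 0.1 0.9 0.0\n"
    "katze 0.2 0.8 0.1\n"
    "auto 0.9 0.0 0.2\n"
)
vecs = load_glove(sample)

# Semantically close words should score higher than unrelated ones.
print(cosine(vecs["hund"], vecs["katze"]) > cosine(vecs["hund"], vecs["auto"]))
```

For real workloads, a library such as gensim can load the same file far more efficiently, but the format itself is simple enough that no dependency is strictly required.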