We at deepset are passionate supporters and active members of the open-source community. Especially in the field of machine learning, we value openness and believe it is the path towards innovative, transparent, and responsible AI.
As a small contribution, we are sharing our code for easily training word embeddings. In addition, we are publishing German embeddings trained on the Wikipedia corpus. As far as we know, these are the first published German GloVe embeddings.
Enjoy!

Code for Models
- Dockerized models with straightforward configuration via docker-compose.yml, allowing simple training on EC2
- Preprocessing of the Wikipedia corpus
- Crawling, preprocessing, and mixing of different corpora (coming soon)

Trained embeddings (German Wikipedia):
- GloVe: Vectors
- Word2Vec: Vectors (needs fastText installed)
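As a quick illustration of how the downloaded GloVe vectors could be used: they are stored in GloVe's plain-text format (one word per line, followed by its vector components), which can be parsed with the standard library alone. The sketch below is a minimal, hedged example; the three-word sample and its values are made up for demonstration and stand in for the real German Wikipedia vectors.

```python
import io
import math

def load_glove(file_obj):
    """Parse GloVe's plain-text format: one token per line,
    followed by its space-separated vector components."""
    vectors = {}
    for line in file_obj:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Tiny made-up sample standing in for the downloaded German vectors.
sample = io.StringIO(
    "hund 0.1 0.9 0.0\n"
    "katze 0.2 0.8 0.1\n"
    "auto 0.9 0.0 0.2\n"
)
vecs = load_glove(sample)

# Semantically close words should score higher than unrelated ones.
print(cosine(vecs["hund"], vecs["katze"]) > cosine(vecs["hund"], vecs["auto"]))
```

For real workloads, a library such as gensim can load the same file far more efficiently, but the format itself is simple enough that no dependency is strictly required.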