word2vec | RARE Technologies

Machine learning benchmarks: Hardware providers (part 1)

Shiva Manne 2017-11-26 Machine Learning, Open Source, Student Incubator 13 Comments

The rise of machine learning as a discipline brings new demands for number crunching and computing power. With easily accessible and cheap hardware resources, one has to pick the right platform to run the experiments and model training on. Should you use Amazon’s AWS EC2 instances? Or go with IBM’s Softlayer, Google’s Compute Engine, Microsoft’s Azure? How about a real …

Translation Matrix: how to connect “embeddings” in different languages?

Ji Xiaohong 2017-09-13 gensim, Student Incubator

This is a blog post by one of our Incubator students, Ji Xiaohong. Ji worked on the problem of aligning differently trained word embeddings (such as word2vec), which is useful in applications such as machine translation or tracking language evolution within the same language.

WordRank embedding: “crowned” is most similar to “king”, not word2vec’s “Canute”

Parul Sethi 2017-01-23 gensim, Student Incubator

Comparisons to Word2Vec and FastText with TensorBoard visualizations. With various embedding models coming up recently, it can be difficult to choose one. Should you simply go with the ones widely used in the NLP community, such as Word2Vec, or is it possible that some other model could be more accurate for your use case? There are some evaluation metrics …

Gensim word2vec on CPU faster than Word2veckeras on GPU (Incubator Student Blog)

Šimon Pavlík 2016-10-12 gensim

Word2Vec became so popular mainly thanks to huge improvements in training speed, producing high-quality word vectors of much higher dimensionality compared to the then widely used neural network language models. Word2Vec is an unsupervised method that can process potentially huge amounts of data without the need for manual labeling. There is really no limit to the size of a dataset that can …

FastText and Gensim word embeddings

Jayant Jain 2016-08-31 gensim

Facebook Research open-sourced a great project recently – fastText, a fast (no surprise) and effective method to learn word representations and perform text classification. I was curious about comparing these embeddings to other commonly used embeddings, so word2vec seemed like the obvious choice, especially considering fastText embeddings are an extension of word2vec. The main goal of the fastText …

Does Python Stand a Chance in Today’s World of Data Science? [video]

Tony DiLoreto 2015-08-30 gensim, Machine Learning 4 Comments

Earlier this summer, our director, Radim Řehůřek, gave a talk about the state of Python in today’s world of Data Science. The talk covers how businesses are using Python for commercial success, Python vs Java, and an interesting comparison of the popular latent semantic analysis (SVD) and word2vec algorithms running on different platforms: Spark MLlib, gensim, scikit-learn …

Making sense of word2vec

Radim Řehůřek 2014-12-23 gensim, programming 50 Comments

One year ago, Tomáš Mikolov (together with his colleagues at Google) made some ripples by releasing word2vec, an unsupervised algorithm for learning the meaning behind words. In this blog post, I’ll evaluate some extensions that have appeared over the past year, including GloVe and matrix factorization via SVD.

Doc2vec tutorial

Radim Řehůřek 2014-12-15 gensim, programming 89 Comments

The latest gensim release, 0.10.3, has a new class named Doc2Vec. All credit for this class, which is an implementation of Quoc Le & Tomáš Mikolov: “Distributed Representations of Sentences and Documents”, as well as for this tutorial, goes to the illustrious Tim Emerick. Doc2vec (aka paragraph2vec, aka sentence embeddings) modifies the word2vec algorithm for unsupervised learning of continuous …

Word2vec Tutorial

Radim Řehůřek 2014-02-02 gensim, programming 158 Comments

I never got round to writing a tutorial on how to use word2vec in gensim. It’s simple enough and the API docs are straightforward, but I know some people prefer more verbose formats. Let this post be a tutorial and a reference example.

Parallelizing word2vec in Python

Radim Řehůřek 2013-10-04 gensim, programming 21 Comments

The final instalment on optimizing word2vec in Python: how to make use of multicore machines. You may want to read Part One and Part Two first.