Google Summer of Code 2017 – Week 1 of Integrating Gensim with scikit-learn and Keras

Chinmaya Pancholi gensim, Student Incubator

This is my first post as part of Google Summer of Code 2017 working with Gensim. I would be working on the project ‘Gensim integration with scikit-learn and Keras‘ this summer.

I stumbled upon Gensim while working on a project which utilized the Word2Vec model. I was looking for a functionality to suggest words semantically similar to the given input word and Gensim’s similar_by_word function did it for me! After this, I started to dig into Gensim’s codebase further and found the library to be slick, robust and well-documented. When I came to know that Gensim was participating in GSoC 2017, I was stoked as I believed this was a chance for me to impactfully use my background with Natural Language Processing and Machine Learning by contributing to and improving a popular library like Gensim. In the past, I have undertaken some relevant coursework which includes courses like Deep Learning, Machine Learning and Information Retrieval to name a few. These courses not only helped me to get a strong theoretical understanding of the domains of NLP and ML in general but the associated lab components also gave me hands-on development experience to work on a real-world task. As a sophomore, I was also an intern at the Language Technologies Research Center at International Institute of Information Technology, Hyderabad where I worked on developing an open-domain question-answering system using deep learning.

My first substantial contribution to Gensim was PR #1207. The PR helped to ensure that a Word2Vec model could be trained again if the function _minimize_model did not actually modify the model. After this, I worked on PR #1209 which fixed issue #863. This PR added a function predict_output_word to Word2Vec class which gave the probability distribution of the central word given the context words as input. Another task that I worked on was issue #1082 which was resolved in PR #1327. The PR fixed the backward-incompatibility arising because of the attribute random_state added to LDA model in the Gensim’s 0.13.2 version.

Apart from this, I have already worked to some extent on the integration of Gensim with scikit-learn and Keras in PR #1244 and PR #1248 respectively. In PR #1244, I worked on adding a scikit-learn wrapper for Gensim’s LSI (Latent Semantic Indexing) Model. This enabled us to use “sklearn-like” API for Gensim’s LSI Model using functions like fit, transform, partial_fit, get_params and set_params. PR #1248 added a function get_embedding_layer to Gensim’s KeyedVectors class which simplified incorporating a pre-trained Word2Vec model in one’s Keras model. Hopefully, the learnings from both these pull-requests would be helpful while coding up the wrappers for the remaining models as well. Currently, I am working towards wrapping up PR #1201 which enables one to keep track of the training loss for Word2Vec model.

All these previous contributions to Gensim have helped me immensely in getting comfortable with Gensim’s codebase as well as the community’s coding standards and practices.

Gensim is a Python library for topic modeling, document indexing and similarity retrieval with large corpora. The package is designed mainly for unsupervised learning tasks and thus, to usefully apply it to a real business problem, the output generated by Gensim models should go to a supervised learning system. Presently, the most popular choices for supervised learning libraries are scikit-learn (for simpler data analysis) and Keras (for artificial neural networks). The objective of my project is to create wrappers around all Gensim models to seamlessly integrate Gensim with these libraries. You could take a look at my detailed proposal here.

This work would be a joint project with the shorttext package. shorttext is a collection of algorithms for multi-class classification for short texts using Python. shorttext already has scikit-learn wrappers for some of Gensim’s models such as Latent Dirichlet Allocation model, Latent Semantic Analysis model and Random Projections model. Similarly, shorttext also has wrapper implementations for integration of various neural network algorithms in Keras with Gensim. However, there are certain differences in the implementation of the Keras wrappers in shorttext with the implementation planned in Gensim. For instance, for the wrapper of any Gensim model using KeyedVectors class, shorttext uses a matrix for converting the training input data into a format suitable for training the neural network (see here), while Gensim would be using a Keras Embedding layer. This not just simplifies the way in which we create our Keras model (simply create the first layer of the model as the Keras Embedding layer and then the remaining Keras model) but also uses less memory since we are not using any extra matrix. In any case, we can take several cues from the wrappers implemented in shorttext while developing wrappers in Gensim as well. So, a big shout-out to Stephen for creating this useful package! 🙂

I would like to thank Radim, Lev, Stephen, Ivan and Gordon who have all helped me tremendously to learn and improve through their valuable suggestions and feedback. The Gensim community has been really forthcoming right from the start and on several occasions, I have been guided in the right direction by the members. I am exhilarated to be working with Gensim and I really hope that the work that I do this summer would be useful for Gensim users.