Chinmaya’s Google Summer of Code 2017 Live-Blog: A Chronicle of Integrating Gensim with scikit-learn and Keras

Chinmaya Pancholi gensim, Student Incubator

20th June, 2017 During the last week, I continued working on scikit-learn wrappers for Gensim’s LDA (PR #1398), LSI (PR #1398), RandomProjections (PR #1395) and LDASeq (PR #1405) models. After several changes, including updating the wrapper class methods, adding unit tests for features such as model persistence, integration with sklearn’s Pipeline and NotFittedError handling, and fixing some of the older unit tests, these PRs have now been accepted and merged. 🙂 I also created PR …
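To give a flavour of what such wrappers enable, here is a minimal sketch assuming the gensim 3.x layout, where the sklearn-style wrappers are exposed under gensim.sklearn_api; exact class names and locations may differ from what the PRs above eventually merged as.

```python
# Minimal sketch (gensim 3.x, wrappers under gensim.sklearn_api; names may differ).
from gensim.corpora import Dictionary
from gensim.sklearn_api import LdaTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

texts = [["human", "interface", "computer"],
         ["graph", "trees", "minors"],
         ["graph", "minors", "survey"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words input for the wrapper

# fit/transform like any other sklearn estimator
lda = LdaTransformer(num_topics=2, id2word=dictionary, iterations=20, random_state=1)
doc_topics = lda.fit(corpus).transform(corpus)   # dense (n_docs, num_topics) matrix

# ...and drop it straight into an sklearn Pipeline
pipe = Pipeline([("topics", lda), ("clf", LogisticRegression())])
pipe.fit(corpus, [0, 1, 1])                      # topic features feed the classifier
```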

Google Summer of Code 2017: Training and Topic Visualizations

Parul Sethi gensim, Student Incubator

21st June 2017 In the previous week, I worked on visualizing document-topic distributions in the TensorBoard projector (PR #1396). It uses each document’s topic distribution as its embedding vector, so documents belonging to the same topics end up forming clusters. Then, to understand and interpret the themes of those topics, I used pyLDAvis to explore the …
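The underlying idea is easy to sketch: treat each document’s topic distribution as its “embedding” and write it out in the TSV format the TensorBoard projector loads. The file names below are only illustrative; the PR itself generates these files differently.

```python
# Sketch of the idea behind the PR: document-topic distributions as projector embeddings.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["human", "interface", "computer"],
         ["graph", "trees", "minors"],
         ["graph", "minors", "survey"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=1)

with open("doc_lda_tensor.tsv", "w") as tensors, open("doc_lda_metadata.tsv", "w") as metadata:
    for doc_id, bow in enumerate(corpus):
        # minimum_probability=0 keeps a value for every topic, so all rows have equal length
        topic_dist = lda.get_document_topics(bow, minimum_probability=0)
        tensors.write("\t".join(str(prob) for _, prob in topic_dist) + "\n")
        metadata.write("doc_%d\n" % doc_id)
```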

Google Summer of Code 2017 – Performance improvement in Gensim and fastText

Prakhar Pratyush gensim, Student Incubator

June 21, 2017 In the last blog, I mentioned trading memory for speed by applying unicode-to-utf8 conversion (any2utf8) only before saving, rather than on every incoming word. But memory turns out to be more critical here, so to handle this speed bottleneck we now apply the conversion to an entire sentence in one go (using a delimiter), and …
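A toy illustration of the trick (not the actual patch): join the tokens with a delimiter assumed not to occur inside them, convert the whole sentence with one any2utf8 call, then split back.

```python
# Illustration only: batch unicode-to-utf8 conversion per sentence instead of per word.
from gensim.utils import any2utf8

sentence = ["human", "interface", "computer"]

# per-word conversion -- one any2utf8 call per token
per_word = [any2utf8(w) for w in sentence]

# one-shot conversion -- join, convert once, split back on the encoded delimiter
DELIM = "\n"  # assumed not to appear inside tokens
one_shot = any2utf8(DELIM.join(sentence)).split(any2utf8(DELIM))

assert per_word == one_shot
```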

Google Summer of Code 2017 – Week 1 of Integrating Gensim with scikit-learn and Keras

Chinmaya Pancholi gensim, Student Incubator

This is my first post as part of Google Summer of Code 2017 working with Gensim. I will be working on the project ‘Gensim integration with scikit-learn and Keras’ this summer. I stumbled upon Gensim while working on a project that used the Word2Vec model. I was looking for functionality to suggest words semantically similar to a given input word, and Gensim’s …
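The “similar words” functionality mentioned above is Word2Vec’s most_similar query; a minimal sketch is below (toy corpus, so the neighbours are not meaningful; on real data they are).

```python
# Minimal sketch of querying semantically similar words with Word2Vec
# (gensim 3.x and earlier, where the vector dimensionality parameter is `size`).
from gensim.models import Word2Vec

sentences = [["human", "interface", "computer"],
             ["survey", "user", "computer", "system", "response", "time"],
             ["graph", "trees", "minors"]]
model = Word2Vec(sentences, size=50, min_count=1, seed=1)
print(model.wv.most_similar("computer", topn=3))
```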


Text Summarization in Python: Extractive vs. Abstractive techniques revisited

Pranay, Aman and Aayush gensim, Student Incubator, summarization

This blog is a gentle introduction to text summarization and can serve as a practical summary of the current landscape. It describes how we, a team of three students in the RaRe Incubator programme, have experimented with existing algorithms and Python tools in this domain. We compare modern extractive methods like LexRank, LSA, Luhn and Gensim’s existing TextRank summarization module …
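For a quick taste of the Gensim module compared in the post, here is a sketch using the gensim 3.x API (gensim.summarization was later removed in 4.0); the text below is only a stand-in for a real, longer document.

```python
# TextRank-based extractive summarization and keyword extraction (gensim 3.x API).
from gensim.summarization import summarize, keywords

text = (
    "Automatic summarization is the process of shortening a text document with software. "
    "The goal is to create a summary that retains the most important points of the original document. "
    "Extractive methods select a subset of existing words, phrases or sentences to form the summary. "
    "Abstractive methods build an internal semantic representation and then generate new sentences. "
    "Research to date has focused primarily on extractive methods, which are simpler and more robust. "
    "Graph-based approaches such as TextRank and LexRank rank sentences by their similarity to the rest of the text. "
    "Latent semantic analysis can also be used to pick sentences that cover the main concepts. "
    "Commercial systems often combine several of these techniques. "
    "Recent work explores neural sequence-to-sequence models for abstractive summarization. "
    "Evaluation is usually done with ROUGE scores against human-written reference summaries."
)

print(summarize(text, ratio=0.3))   # extractive summary: the top-ranked original sentences
print(keywords(text, words=5))      # TextRank keywords from the same module
```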


Gensim switches to semantic versioning

Lev Konstantinovskiy gensim, Open Source

Starting with release 1.0.0, Gensim adopts semantic versioning. The time went in a flash, but Gensim has reached maturity. It has been cited in nearly 500 academic papers, is used commercially in dozens of companies, has hosted many coding sprints and meetups, and has generally withstood the test of time. Between the continued Gensim support by our parent company, rare-technologies.com, and our open Student …

WordRank embedding: “crowned” is most similar to “king”, not word2vec’s “Canute”

Parul Sethi gensim, Student Incubator

Comparisons to Word2Vec and FastText, with TensorBoard visualizations. With various embedding models coming out recently, choosing one can be a difficult task. Should you simply go with the ones widely used in the NLP community, such as Word2Vec, or could some other model be more accurate for your use case? There are some evaluation metrics …
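One common metric in such comparisons is accuracy on word-analogy questions (“king” − “man” + “woman” ≈ “queen”). The sketch below assumes you already have trained vectors in word2vec text format and a local copy of the standard questions-words.txt analogy file; both file names here are placeholders.

```python
# Word-analogy evaluation of an embedding model (file names are hypothetical).
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("wordrank_vectors.txt")  # placeholder path
sections = wv.accuracy("questions-words.txt")                   # per-section analogy results
for section in sections:
    correct, incorrect = len(section["correct"]), len(section["incorrect"])
    total = correct + incorrect
    if total:
        print("%s: %.1f%%" % (section["section"], 100.0 * correct / total))
```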


New Gensim feature: Author-topic modeling. LDA with metadata.

Ólavur Mortensen gensim

The author-topic model is an extension of Latent Dirichlet Allocation that allows data scientists to build topic representations of attached author labels. These author labels can represent any kind of discrete metadata attached to documents, for example, tags on posts on the web. In December of 2016, I wrote a blog post explaining that a Gensim implementation was on its …
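In Gensim, the feature is exposed as AuthorTopicModel, which takes an author2doc mapping (label to document indices) alongside the corpus; a minimal sketch with toy data is shown below.

```python
# Minimal sketch: AuthorTopicModel learns a topic distribution per author label.
from gensim.corpora import Dictionary
from gensim.models import AuthorTopicModel

texts = [["human", "interface", "computer"],
         ["graph", "trees", "minors"],
         ["graph", "minors", "survey"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

author2doc = {"alice": [0], "bob": [1, 2]}   # labels can be any discrete metadata, e.g. tags
model = AuthorTopicModel(corpus, author2doc=author2doc, id2word=dictionary,
                         num_topics=2, random_state=1)
print(model.get_author_topics("bob"))        # topic distribution for one author
```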


Topic Modelling with Latent Dirichlet Allocation: How to pre-process data and tune your model. New tutorial.

Ólavur Mortensen gensim, Machine Learning, Open Source, programming, Student Incubator

If you’ve learned how to train topic models in Gensim but aren’t able to get satisfying results, we have a new tutorial on GitHub that will help you get on the right track. Primarily, you will learn some things about pre-processing text data for the LDA model. You will also get some tips on how to set the parameters …
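As a condensed sketch of the kind of pre-processing and parameter choices the tutorial walks through (the stopword list and thresholds below are purely illustrative, not the tutorial’s exact settings):

```python
# Illustrative pre-processing + LDA training pipeline.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The generation of random binary unordered trees",
]
stopwords = {"a", "of", "the", "for"}                      # toy stopword list
tokens = [[w for w in doc.lower().split() if w not in stopwords] for doc in docs]

dictionary = Dictionary(tokens)
dictionary.filter_extremes(no_below=1, no_above=0.9)       # drop too-rare / too-common words
corpus = [dictionary.doc2bow(t) for t in tokens]

# passes/iterations control convergence; alpha/eta control topic sparsity
lda = LdaModel(corpus, id2word=dictionary, num_topics=2,
               passes=10, iterations=100, alpha="auto", eta="auto", random_state=1)
print(lda.print_topics())
```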


Author-topic models: why I am working on a new implementation

Ólavur Mortensen gensim, Machine Learning, Open Source, programming, Student Incubator

Author-topic models promise to give data scientists a tool to simultaneously gain insight about authorship and content in terms of latent topics. The model is closely related to Latent Dirichlet Allocation (LDA). Basically, each author can be associated with multiple documents, and each document can be attributed to multiple authors. The model learns topic representations for each author, so that …