open source | RARE Technologies

Gensim Survey 2018

Radim Řehůřek 2018-04-30 gensim, Machine Learning, Open Source

Last month, we ran a survey among Gensim users to get a better idea of what delights and annoys you. The ~7 minute survey was completed by 448 people. That’s a great juicy sample, big thanks to all who participated! Full detailed statistics here; in this post I’ll summarize what we found and what it means for Gensim.

Counting Efficiently with Bounter pt. 2: CountMinSketch

Filip Štefaňák 2018-01-31 Machine Learning, Open Source

In my previous post on the new open source Python Bounter library we discussed how we can use its HashTable to quickly count approximate item frequencies in very large item sequences. Now we turn our attention to the second algorithm in Bounter, CountMinSketch (CMS), which is also optimized in C for top performance.
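To give a feel for the idea behind a Count-Min Sketch, here is a minimal pure-Python sketch. It is not Bounter's implementation (which is written in C) and does not use Bounter's API. The class name `TinyCountMinSketch`, the `width` and `depth` parameters, and the choice of salted BLAKE2 hashing are all assumptions made for illustration.

```python
import hashlib


class TinyCountMinSketch:
    """Illustrative Count-Min Sketch; a hypothetical toy, not Bounter's code."""

    def __init__(self, width=1024, depth=4):
        self.width = width  # number of counters per row
        self.depth = depth  # number of rows, one hash function per row
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # Derive one bucket per row by salting the hash with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def increment(self, item, count=1):
        # Add the count to one counter in every row.
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate counters, so the minimum across rows
        # is the tightest available upper bound on the true count.
        return min(self.table[row][col] for row, col in self._buckets(item))


sketch = TinyCountMinSketch(width=64, depth=3)
for word in ["a", "b", "a", "c", "a"]:
    sketch.increment(word)
print(sketch.estimate("a"))
```

The estimate never falls below the true count but may exceed it when items collide. A wider table makes collisions rarer, and more rows make it less likely that an item collides in every row at once.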

The Mummy Effect: Bridging the gap between academia and industry (PyData keynote)

Radim Řehůřek 2017-11-19 Machine Learning, Open Source, Student Incubator

Last month, I gave a keynote at PyData Warsaw about the existing (and growing) gap between academia and industry, specifically when it comes to machine learning / data science. This is a topic close to my heart, since we’ve made our living for 7 years now in that no-man’s land where academia and industry collide. Between running our Student …

Counting Efficiently with Bounter pt. 1: HashTable

Filip Štefaňák 2017-11-10 Machine Learning, Open Source

Have you heard about the new open source Bounter Python library in town? In case you can’t wait to use it in practice but are wary of its “frequency estimation”, and what kind of results you can expect, this series of blog posts will help you develop the right intuition. It is split into two parts, one for each of …

Dealing mergeytocin: how to run an open source sprint. Based on 8 gensim sprints in 5 countries in 12 months.

Lev Konstantinovskiy 2017-05-24 Open Source

In this blog post I want to tell you what it takes to organize an open source coding sprint: finding a venue, setting an agenda and then actually running it.

New Gensim feature: Author-topic modeling. LDA with metadata.

Ólavur Mortensen 2017-01-18 gensim

The author-topic model is an extension of Latent Dirichlet Allocation that allows data scientists to build topic representations of attached author labels. These author labels can represent any kind of discrete metadata attached to documents, for example, tags on posts on the web. In December of 2016, I wrote a blog post explaining that a Gensim implementation was on its …