Counting Efficiently with Bounter pt. 2: CountMinSketch

Filip Štefaňák Machine Learning, Open Source 2 Comments

In my previous post on the new open source Python Bounter library we discussed how we can use its HashTable to quickly count approximate item frequencies in very large item sequences. Now we turn our attention to the second algorithm in Bounter, CountMinSketch (CMS), which is also optimized in C for top performance.

The Mummy Effect: Bridging the gap between academia and industry (PyData keynote)

Radim Řehůřek Machine Learning, Open Source, Student Incubator

Last month, I gave a keynote at PyData Warsaw about the existing (and growing) gap between academia and industry, specifically when it comes to machine learning / data science. This is a topic close to my heart, since we’ve operated in that no-man’s land where academia and industry collide for a living for 7 years now. Between running our Student …

Counting Efficiently with Bounter pt. 1: HashTable

Filip Štefaňák Machine Learning, Open Source Leave a Comment

Have you heard about the new open source Bounter Python library in town? In case you can’t wait to use it in practice but are wary of its “frequency estimation”, and what kind of results you can expect, this series of blog posts will help you develop the right intuition. It is split into two parts, one for each of …

New Gensim feature: Author-topic modeling. LDA with metadata.

Ólavur Mortensen gensim

The author-topic model is an extension of Latent Dirichlet Allocation that allows data scientists to build topic representations of attached author labels. These author labels can represent any kind of discrete metadata attached to documents, for example, tags on posts on the web. In December of 2016, I wrote a blog post explaining that a Gensim implementation was on its …