My second Google Summer of Code blog post is going to be a wee bit more technical – I’m going to briefly describe what topic models do, before linking to a tutorial I wrote which will teach you how to do some cool stuff with Topic Models and gensim. Very, very briefly – given a collection of documents, topic models …
2016 Student Data Science Programs with RaRe Technologies
RaRe Technologies is deeply rooted in the open source community and we are always seeking out opportunities to dedicate our experience and time to the next generation of computer scientists. Often the first step is to connect ambitious students to the resources they need to truly make an impact with hands-on projects and mentorship. These up-and-coming students have …
RaRe Technologies Announces New Growth Team and Pycon Participation for 2016
As the demand for solid Machine Learning software development increases, RaRe has realized a need for an equally solid internal growth team and recently added two new hires into the mix. Chris Lakatos has joined the company as the Director of Marketing, while Jeff Hoey has joined to head up Business Development. Each brings with them ample experience in technical fields …
2016 Student Incubator – Week 1 Implementing Topic Coherence Metrics in Gensim
Here’s my first post as part of the RaRe Technologies Incubator Programme! Over the course of this summer I will be working on (and hopefully improving) the functionality of gensim, an open source library for topic modelling. My interest in machine learning and natural language processing started when I took an online course on machine learning by BerkeleyX. I was …
Google Summer of Code 2016 – Week 1 on Dynamic Topic Models
It’s been around a month since being selected to participate in Google Summer of Code 2016 with NumFOCUS and Gensim, and it’s been quite exhilarating. My tryst with Gensim started when I was looking for ways to model evolution of topics in Software Engineering research, and Dynamic Topic Models was an obvious choice. While I initially worked with the original …
Go, Games, Strategy and Life: The Big Picture
Does Python Stand a Chance in Today’s World of Data Science? [video]
Earlier this summer, our director, Radim Řehůřek, gave a talk about the state of Python in today’s world of Data Science. The talk covers how businesses are using Python for commercial success, Python vs Java, and an interesting comparison of the popular latent semantic analysis (SVD) and word2vec algorithms running on different platforms: Spark MLlib, gensim, scikit-learn …
Text Summarization with Gensim
Text summarization is one of the newest and most exciting fields in NLP, allowing developers to quickly find meaning and extract key words and phrases from documents. RaRe Technologies’ newest intern, Ólavur Mortensen, walks the user through the text summarization features in Gensim.
Making sense of word2vec
One year ago, Tomáš Mikolov (together with his colleagues at Google) made some ripples by releasing word2vec, an unsupervised algorithm for learning the meaning behind words. In this blog post, I’ll evaluate some extensions that have appeared over the past year, including GloVe and matrix factorization via SVD.
Doc2vec tutorial
The latest gensim release, 0.10.3, has a new class named Doc2Vec. All credit for this class, which is an implementation of Quoc Le & Tomáš Mikolov: “Distributed Representations of Sentences and Documents”, as well as for this tutorial, goes to the illustrious Tim Emerick. Doc2vec (aka paragraph2vec, aka sentence embeddings) modifies the word2vec algorithm to unsupervised learning of continuous …