Gensim, the machine learning library for unsupervised learning I started in late 2008, will be celebrating its fifth anniversary this November. Time to reminisce and mull over its successes and failures 🙂
Gensim started off as a collection of sundry scripts to generate similarities (“gensim”) across documents in the DML-CZ project. It has since been released as open source and extended. I’ve written tutorials, answered thousands of support threads on the mailing list, and issued 31 official releases. In 2011, I moved gensim to GitHub, which reports 1,111 project commits from 21 contributors. Several companies have used gensim. The workshop paper we encourage researchers to cite when they use gensim has been cited 62 times. Gensim has been used as a teaching tool at universities.
Would I do things differently if I were to start gensim today? You bet. Partly because I’ve learned a lot in the past five years, both about programming and project management. But mostly because the field itself has matured tremendously.
There’s no comparing the machine learning landscape in 2008 vs now, both with respect to the quality of the available open source tools and the theoretical advances in the field. Machine learning has finally reached the mainstream, with automated classification, prediction, fraud detection, recommendation systems etc. getting on company radars. With machine learning startups popping up like mushrooms, these are great times to be working in this domain.
Having said that, how did some of my initial design decisions pan out?
Distributed computing. I rolled my own system for dispatching and processing worker jobs across heterogeneous computer clusters, down to thread locking and naive job management. This seems totally insane in retrospect. I’d definitely go with a dedicated library like Disco, joblib or indeed anything but rolling my own. Big data and distributed computation are all the rage now, but keep in mind the actual tools were not nearly as mature (or existent) as they are today. And I needed to get my similarities faster, so what is a poor programmer to do? It was a good exercise though.
In fact, in retrospect I should have dropped cluster computing altogether and focused simply on multicore machines. Gensim occupies a niche that targets academia/small-medium businesses: it can process data that is too big for naive in-memory implementations (the gigabyte-terabyte range), but there’s no way for it to tempt the enterprise. That will always go to the Java ecosystem and complexity, whether they need it or not. Targeting single, powerful machines in gensim instead of hypothetical clusters would probably make more sense.
Another part I feel guilty about is the (lack of) community building. While I offer support on the public mailing list, I never really tried promoting gensim. The development has always been “When I need something, I add it. You need something, you add it.” Which is perfectly fine for open source, but other packages managed to catch and ride the machine learning popularity wave much better. For example, I tried entering gensim into NumFOCUS, a non-profit whose sole purpose is to “promote the use of accessible and reproducible computing in science and technology” in Python. Right up gensim’s street, you say? Turned down. I suck at/didn’t have time to enter this circle of “boys who talk and sponsor each other”, to the detriment of gensim.
The rest went surprisingly well 🙂 The scalable, online algorithms in gensim have been useful to lots of people, both in academia and commerce. They still hold their ground, and if anything are even more relevant now than they were five years ago.
I’m also happy with the initial API direction I took: gensim uses standard Python generators and iterators to represent corpora as document streams. This is a somewhat unique proposition, as it avoids loading the entire dataset into RAM. It also avoids constraining data with a fixed API (such as NumPy arrays) and means gensim can process data sets of unlimited size. As long as
for document in my_corpus: ...
yields your data points one after another, you’re good to go. It doesn’t matter how my_corpus is implemented internally, whether you read your data from RAM, disk, network or compute it on the fly. Some algorithms in gensim even support once-only (non-repeatable) data streams, so that the processing is fully online.
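To make the iteration pattern concrete, here is a minimal sketch in plain Python. It does not use gensim itself: the LineCorpus class, the count_tokens helper and the one-document-per-line file layout are all made up for illustration. It shows a corpus that streams documents from disk and can be iterated repeatedly, next to a plain generator that is a once-only stream.

```python
import os
import tempfile


class LineCorpus:
    """Hypothetical corpus: one document per line, streamed from disk."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the stream is repeatable
        # and only one line is held in memory at a time.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.lower().split()


def count_tokens(corpus):
    """Toy consumer (an assumption for this sketch); it relies only on
    `for document in corpus` and knows nothing about the storage."""
    return sum(len(document) for document in corpus)


# Write a tiny two-document sample file to stream from.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Human machine interface\nGraph minors survey\n")
    sample_path = tmp.name

my_corpus = LineCorpus(sample_path)
first_pass = count_tokens(my_corpus)   # iterates the file once
second_pass = count_tokens(my_corpus)  # iterates it again from the start

# A bare generator is a once-only stream: it is exhausted after one pass.
once_only = (document for document in my_corpus)
once_first = count_tokens(once_only)
once_second = count_tokens(once_only)  # nothing left to read

os.remove(sample_path)
```

The consumer never asks how big the corpus is or where it lives, which is the whole point. An algorithm that needs several passes over the data wants something like LineCorpus; one that makes a single pass can also take the once-only generator.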
But the luckiest choice of all was sticking with Python. It’s a joy to work with, its ecosystem is fantastic and getting better all the time. I’m also glad I came across NumPy and SciPy when I did, because I would have written gensim in C++ otherwise (my primary language back then). The natural language processing ecosystem of C++ doesn’t compare to Python’s.
Gensim is still my go-to library for many of my own projects (aka dogfooding). Not only because I know it well, but mostly because it’s so simple to use. I’m pretty bad at remembering things (even things I created myself!), so I have to KISS a lot. With gensim I don’t have to remember much, except the iteration pattern above. It’s efficient, robust and field tested. Method parameters default to sane values, so most of the time I don’t have to care or RTFM.
Thanks to everyone who contributed to, used and abused gensim over the years 🙂 It’s been a pleasure.