MALLET, “MAchine Learning for LanguagE Toolkit”, is a brilliant software tool. Unlike gensim, “topic modelling for humans”, which uses Python, MALLET is written in Java and spells “topic modeling” with a single “l”. Dandy.
MALLET’s implementation of Latent Dirichlet Allocation has lots of things going for it.
It’s based on sampling, which is generally a more accurate fitting method than variational Bayes. Variational methods, such as the online VB inference implemented in gensim, are easier to parallelize and guaranteed to converge… but they essentially solve an approximation of the original problem, so the resulting fit is less accurate.
MALLET is not “yet another midterm assignment implementation of Gibbs sampling”. It contains cleverly optimized code, is threaded to support multicore computers and, importantly, is battle-scarred by legions of humanities majors applying MALLET to literary studies.
Plus, the LDA code was written directly by David Mimno, a top expert in the field.
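For contrast, here is roughly what the “midterm assignment” version of collapsed Gibbs sampling for LDA looks like. This is a minimal sketch and not MALLET’s code: the function name `gibbs_lda`, the toy corpus and the hyperparameter defaults are all made up for illustration, and none of MALLET’s optimizations or threading are present.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iterations=200, seed=0):
    """Naive collapsed Gibbs sampler for LDA (illustrative only).

    docs is a list of documents, each a list of integer word ids
    in range(vocab_size). Returns (doc_topic, topic_word) count tables.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n_dk
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n_kw
    topic_total = [0] * num_topics                              # n_k
    assignments = []

    # start from a random topic assignment for every token
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        assignments.append(zs)
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # take the token out of the counts
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # p(z = k | everything else) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
                weights = [
                    (doc_topic[d][k] + alpha) * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                # put the token back under the newly sampled topic
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# toy corpus: word ids 0-2 and 3-4 occur in different documents
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 4], [0, 1, 2, 2], [3, 3, 4]]
doc_topic, topic_word = gibbs_lda(docs, num_topics=2, vocab_size=5)
print(doc_topic)  # per-document topic counts
```

Every token gets resampled in every sweep, and each resampling touches all topics. That is fine for a toy corpus and painfully slow for a real one, which is exactly where a carefully optimized implementation earns its keep.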
I’ve wanted to include a similarly efficient sampling implementation of LDA in gensim for a long time, but never found the time/motivation. Ben Trahan, the author of the recent LDA hyperparameter optimization patch for gensim, is on the job.
In the meantime, I’ve added a simple wrapper around MALLET so it can be used directly from Python, following gensim’s API:
    model = gensim.models.LdaMallet(path_to_mallet, corpus, num_topics=10, id2word=dictionary)
    print(model[corpus])  # calculate & print topics of all documents in the corpus
And that’s it. The API is identical to the LdaModel class already in gensim, except you must specify the path to the MALLET executable as its first parameter.
Check the LdaMallet API docs for setting other parameters such as threading (faster training, but consumes more memory), sampling iterations etc.
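As a rough illustration of what setting those parameters might look like, here is a sketch. The keyword names `workers`, `iterations` and `optimize_interval` are my recollection of the wrapper's signature and should be treated as assumptions; verify them against the API docs for your gensim version. The helper `train_lda_mallet` itself is hypothetical.

```python
def train_lda_mallet(mallet_path, corpus, dictionary, num_topics=10):
    """Train LdaMallet with non-default threading and iteration settings."""
    # imported lazily; needs a gensim version that ships LdaMallet under gensim.models
    from gensim import models

    return models.LdaMallet(
        mallet_path,
        corpus,
        num_topics=num_topics,
        id2word=dictionary,
        workers=2,             # assumed name: number of training threads
        iterations=500,        # assumed name: number of sampling iterations
        optimize_interval=10,  # assumed name: optimize hyperparameters every N iterations
    )
```

More threads should speed up training at the cost of memory, while fewer sampling iterations trade model quality for time.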
MALLET on Reuters
Let’s run a full end-to-end example.
NLTK includes several datasets we can use as our training corpus. In particular, the following assumes that the NLTK dataset “Reuters” can be found under /Users/kofola/nltk_data/corpora/reuters/training/:
    # set up logging so we see what's going on
    import logging
    import os
    from gensim import corpora, models, utils
    logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)

    def iter_documents(reuters_dir):
        """Iterate over Reuters documents, yielding one document at a time."""
        for fname in os.listdir(reuters_dir):
            # read each document as one big string
            with open(os.path.join(reuters_dir, fname)) as fin:
                document = fin.read()
            # parse document into a list of utf8 tokens
            yield utils.simple_preprocess(document)

    class ReutersCorpus(object):
        def __init__(self, reuters_dir):
            self.reuters_dir = reuters_dir
            self.dictionary = corpora.Dictionary(iter_documents(reuters_dir))
            self.dictionary.filter_extremes()  # remove stopwords etc

        def __iter__(self):
            for tokens in iter_documents(self.reuters_dir):
                yield self.dictionary.doc2bow(tokens)

    # set up the streamed corpus
    corpus = ReutersCorpus('/Users/kofola/nltk_data/corpora/reuters/training/')
    # INFO : adding document #0 to Dictionary(0 unique tokens: )
    # INFO : built Dictionary(24622 unique tokens: ['mdbl', 'fawc', 'degussa', 'woods', 'hanging']...) from 7769 documents (total 938238 corpus positions)
    # INFO : keeping 7203 tokens which were in no less than 5 and no more than 3884 (=50.0%) documents
    # INFO : resulting dictionary: Dictionary(7203 unique tokens: ['yellow', 'four', 'resisted', 'cyprus', 'increase']...)

    # train 10 LDA topics using MALLET
    mallet_path = '/Users/kofola/Downloads/mallet-2.0.7/bin/mallet'
    model = models.LdaMallet(mallet_path, corpus, num_topics=10, id2word=corpus.dictionary)
    # ...
    # 0 5 spokesman ec government tax told european today companies president plan added made commission time statement chairman state national union
    # 1 5 oil prices price production gas coffee crude market brazil international energy opec world petroleum bpd barrels producers day industry
    # 2 5 trade japan japanese foreign economic officials united countries states official dollar agreement major told world yen bill house international
    # 3 5 bank market rate stg rates exchange banks money interest dollar central week today fed term foreign dealers currency trading
    # 4 5 tonnes wheat sugar mln export department grain corn agriculture week program year usda china soviet exports south sources crop
    # 5 5 april march corp record cts dividend stock pay prior div board industries split qtly sets cash general share announced
    # 6 5 pct billion year february january rose rise december fell growth compared earlier increase quarter current months month figures deficit
    # 7 5 dlrs company mln year earnings sale quarter unit share gold sales expects reported results business canadian canada dlr operating
    # 8 5 shares company group offer corp share stock stake acquisition pct common buy merger investment tender management bid outstanding purchase
    # 9 5 mln cts net loss dlrs shr profit qtr year revs note oper sales avg shrs includes gain share tax
    #
    # <1000> LL/token: -7.5002
    #
    # Total time: 34 seconds

    # now use the trained model to infer topics on a new document
    doc = "Don't sell coffee, wheat nor sugar; trade gold, oil and gas instead."
    bow = corpus.dictionary.doc2bow(utils.simple_preprocess(doc))
    print(model[bow])  # print list of (topic id, topic weight) pairs
    # [[(0, 0.0903954802259887),
    #   (1, 0.13559322033898305),
    #   (2, 0.11299435028248588),
    #   (3, 0.0847457627118644),
    #   (4, 0.11864406779661017),
    #   (5, 0.0847457627118644),
    #   (6, 0.0847457627118644),
    #   (7, 0.10357815442561205),
    #   (8, 0.09981167608286252),
    #   (9, 0.0847457627118644)]]
Apparently topics #1 (oil&co) and #4 (wheat&co) got the highest weights, so it passes the sniff test.
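Rather than eyeballing the list, the same check can be done by sorting the pairs by weight. The snippet below simply pastes in the output printed above, so it runs without gensim or MALLET; the variable names are mine.

```python
# output of model[bow] from the run above: a list holding one document's topics
inferred = [[(0, 0.0903954802259887),
             (1, 0.13559322033898305),
             (2, 0.11299435028248588),
             (3, 0.0847457627118644),
             (4, 0.11864406779661017),
             (5, 0.0847457627118644),
             (6, 0.0847457627118644),
             (7, 0.10357815442561205),
             (8, 0.09981167608286252),
             (9, 0.0847457627118644)]]

doc_topics = inferred[0]
# sort (topic id, weight) pairs by weight, heaviest first
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
top_two = [topic_id for topic_id, _ in ranked[:2]]
print(top_two)  # [1, 4]
```

Note how flat the distribution is: even the winning topic only gets about 0.14, which is to be expected for a single short sentence.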
Note this MALLET wrapper is new in gensim version 0.9.0, and is extremely rudimentary for the time being. It serializes the input (training corpus) into a file, calls the Java process to run MALLET, then parses the output from the files that MALLET produces. Not very efficient, not very robust.
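To make that serialize, call, parse round trip concrete, here is a minimal sketch of the pattern. It is not the wrapper's actual code. All function names are hypothetical, the one-document-per-line input layout is an assumption, and the parser assumes the doc-topics layout of MALLET 2.0.x (document number, name, then topic/proportion pairs), which may differ in other versions. The command-line flags are the ones I know from MALLET's `import-file` and `train-topics` commands.

```python
import os
import subprocess
import tempfile


def write_mallet_input(tokenized_docs, path):
    """Serialize documents as '<doc number> <label> <tokens...>', one per line."""
    with open(path, "w", encoding="utf8") as fout:
        for docno, tokens in enumerate(tokenized_docs):
            fout.write("%d 0 %s\n" % (docno, " ".join(tokens)))


def build_commands(mallet_path, txt_path, binary_path, doctopics_path, num_topics):
    """Build the two command lines: convert the text file, then train topics."""
    import_cmd = [mallet_path, "import-file", "--keep-sequence",
                  "--input", txt_path, "--output", binary_path]
    train_cmd = [mallet_path, "train-topics", "--input", binary_path,
                 "--num-topics", str(num_topics),
                 "--output-doc-topics", doctopics_path]
    return import_cmd, train_cmd


def run_mallet(commands):
    """Run each command in turn; raises CalledProcessError if MALLET fails."""
    for cmd in commands:
        subprocess.check_call(cmd)


def parse_doc_topics(lines):
    """Parse lines of the assumed form '<doc> <name> <topic> <weight> ...'."""
    for line in lines:
        if line.startswith("#"):  # skip the header line
            continue
        pairs = line.split()[2:]
        yield sorted((int(pairs[i]), float(pairs[i + 1]))
                     for i in range(0, len(pairs) - 1, 2))


# demo of the serialization and parsing steps only; MALLET itself is not run here
tmp_dir = tempfile.mkdtemp()
txt_path = os.path.join(tmp_dir, "corpus.txt")
write_mallet_input([["oil", "gas"], ["wheat"]], txt_path)
import_cmd, train_cmd = build_commands(
    "mallet", txt_path, os.path.join(tmp_dir, "corpus.mallet"),
    os.path.join(tmp_dir, "doctopics.txt"), num_topics=10)
parsed = list(parse_doc_topics(["#doc name topic proportion ...",
                                "0 doc0 1 0.75 0 0.25"]))
```

Every training run pays for writing the whole corpus to disk and for starting a Java process, and any change in MALLET's output format breaks the parser. Hence “not very efficient, not very robust”.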
Depending on how this wrapper is used/received, I may extend it in the future.
Or even better, try your hand at improving it yourself.