Tutorial on Mallet in Python

Radim Rehurek · gensim, programming

MALLET, “MAchine Learning for LanguagE Toolkit”, is a brilliant software tool. Unlike gensim, “topic modelling for humans”, which uses Python, MALLET is written in Java and spells “topic modeling” with a single “l”. Dandy.

MALLET’s LDA

MALLET’s implementation of Latent Dirichlet Allocation has lots of things going for it.

It’s based on sampling, which is a more accurate fitting method than variational Bayes. Variational methods, such as the online VB inference implemented in gensim, are easier to parallelize and guaranteed to converge… but they essentially solve an approximate, aka more inaccurate, problem.

MALLET is not “yet another midterm assignment implementation of Gibbs sampling”. It contains cleverly optimized code, is threaded to support multicore computers and, importantly, battle-scarred by legions of humanities majors applying MALLET to literary studies.

Plus, it’s written by David Mimno, a top expert in the field.

Gensim wrapper

I’ve wanted to include a similarly efficient sampling implementation of LDA in gensim for a long time, but never found the time/motivation. Ben Trahan, the author of the recent LDA hyperparameter optimization patch for gensim, is on the job.

In the meantime, I’ve added a simple wrapper around MALLET so it can be used directly from Python, following gensim’s API:

model = gensim.models.LdaMallet(path_to_mallet, corpus, num_topics=10, id2word=dictionary)
print model[corpus]  # calculate & print topics of all documents in the corpus

And that’s it. The API is identical to the LdaModel class already in gensim, except that you must specify the path to the MALLET executable as its first parameter.

Check the LdaMallet API docs for other parameters, such as threading (faster training, but consumes more memory) and the number of sampling iterations.
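For example, to train with several worker threads and extra sampling iterations (the parameter names below follow the LdaMallet docs; check the docs of your gensim version for the exact names and defaults):

model = gensim.models.LdaMallet(
    path_to_mallet, corpus, num_topics=10, id2word=dictionary,
    workers=4,         # spread Gibbs sampling over 4 threads
    iterations=1000,   # number of sampling iterations
    prefix='/tmp/my_mallet_')  # where MALLET's intermediate files are kept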

MALLET on Reuters

Let’s run a full end-to-end example.

NLTK includes several datasets we can use as our training corpus. In particular, the following assumes that the NLTK dataset “Reuters” can be found under /Users/kofola/nltk_data/corpora/reuters/training/:

# set up logging so we see what's going on
import logging
import os
from gensim import corpora, models, utils
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)

def iter_documents(reuters_dir):
    """Iterate over Reuters documents, yielding one document at a time."""
    for fname in os.listdir(reuters_dir):
        # read each document as one big string
        with open(os.path.join(reuters_dir, fname)) as fin:
            document = fin.read()
        # parse document into a list of utf8 tokens
        yield utils.simple_preprocess(document)

class ReutersCorpus(object):
    def __init__(self, reuters_dir):
        self.reuters_dir = reuters_dir
        self.dictionary = corpora.Dictionary(iter_documents(reuters_dir))
        self.dictionary.filter_extremes()  # remove stopwords etc

    def __iter__(self):
        for tokens in iter_documents(self.reuters_dir):
            yield self.dictionary.doc2bow(tokens)

# set up the streamed corpus
corpus = ReutersCorpus('/Users/kofola/nltk_data/corpora/reuters/training/')
# INFO : adding document #0 to Dictionary(0 unique tokens: [])
# INFO : built Dictionary(24622 unique tokens: ['mdbl', 'fawc', 'degussa', 'woods', 'hanging']...) from 7769 documents (total 938238 corpus positions)
# INFO : keeping 7203 tokens which were in no less than 5 and no more than 3884 (=50.0%) documents
# INFO : resulting dictionary: Dictionary(7203 unique tokens: ['yellow', 'four', 'resisted', 'cyprus', 'increase']...)

# train 10 LDA topics using MALLET
mallet_path = '/Users/kofola/Downloads/mallet-2.0.7/bin/mallet'
model = models.LdaMallet(mallet_path, corpus, num_topics=10, id2word=corpus.dictionary)
# ...
# 0	5	spokesman ec government tax told european today companies president plan added made commission time statement chairman state national union
# 1	5	oil prices price production gas coffee crude market brazil international energy opec world petroleum bpd barrels producers day industry
# 2	5	trade japan japanese foreign economic officials united countries states official dollar agreement major told world yen bill house international
# 3	5	bank market rate stg rates exchange banks money interest dollar central week today fed term foreign dealers currency trading
# 4	5	tonnes wheat sugar mln export department grain corn agriculture week program year usda china soviet exports south sources crop
# 5	5	april march corp record cts dividend stock pay prior div board industries split qtly sets cash general share announced
# 6	5	pct billion year february january rose rise december fell growth compared earlier increase quarter current months month figures deficit
# 7	5	dlrs company mln year earnings sale quarter unit share gold sales expects reported results business canadian canada dlr operating
# 8	5	shares company group offer corp share stock stake acquisition pct common buy merger investment tender management bid outstanding purchase
# 9	5	mln cts net loss dlrs shr profit qtr year revs note oper sales avg shrs includes gain share tax
# 
# <1000> LL/token: -7.5002
# 
# Total time: 34 seconds

# now use the trained model to infer topics on a new document
doc = "Don't sell coffee, wheat nor sugar; trade gold, oil and gas instead."
bow = corpus.dictionary.doc2bow(utils.simple_preprocess(doc))
print model[bow]  # print list of (topic id, topic weight) pairs
# [[(0, 0.0903954802259887),
#   (1, 0.13559322033898305),
#   (2, 0.11299435028248588),
#   (3, 0.0847457627118644),
#   (4, 0.11864406779661017),
#   (5, 0.0847457627118644),
#   (6, 0.0847457627118644),
#   (7, 0.10357815442561205),
#   (8, 0.09981167608286252),
#   (9, 0.0847457627118644)]]

Apparently topics #1 (oil&co) and #4 (wheat&co) got the highest weights, so it passes the sniff test.

Note that this MALLET wrapper is new in gensim version 0.9.0, and is extremely rudimentary for the time being. It serializes the input (training corpus) into a file, calls the Java process to run Mallet, then parses the output files that Mallet produces. Not very efficient, not very robust.
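Roughly, the wrapper boils down to something like this (a simplified sketch, not the actual gensim code; the /tmp paths are illustrative, and mallet_path and corpus are reused from the example above):

import subprocess

# 1. serialize the training corpus into a plain text file, one document per line,
#    in the "name label tokens" format that MALLET's import-file command expects
with open('/tmp/corpus.txt', 'w') as fout:
    for docno, doc in enumerate(corpus):
        tokens = [corpus.dictionary[tokenid] for tokenid, cnt in doc for _ in xrange(cnt)]
        fout.write("%i\t0\t%s\n" % (docno, ' '.join(tokens)))

# 2. shell out to the Java process: convert the text file into MALLET's binary
#    format, then train the topic model on it
subprocess.check_call([mallet_path, 'import-file', '--input', '/tmp/corpus.txt',
    '--output', '/tmp/corpus.mallet', '--keep-sequence'])
subprocess.check_call([mallet_path, 'train-topics', '--input', '/tmp/corpus.mallet',
    '--num-topics', '10', '--output-doc-topics', '/tmp/doctopics.txt'])

# 3. parse the files MALLET wrote back into Python structures
with open('/tmp/doctopics.txt') as fin:
    for line in fin:
        if line.startswith('#'):
            continue  # skip the header line
        parts = line.split()
        print parts[0], parts[2:]  # document id, then its topic proportions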

Depending on how this wrapper is used/received, I may extend it in the future.

Or even better, try your hand at improving it yourself.


Comments (20)

  1. Joris

    Another nice update! Keep ’em coming! Suggestion: Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic Compositionality Through Recursive Matrix-Vector Spaces. In Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2012. 😉

    1. Radim (Post author)
      1. Artyom

        Ya, decided to clean it up a bit first and put my local version into a forked gensim. Will be ready in next couple of days

        I am also thinking about chancing a direct port of Blei’s DTM implementation, but not sure about it yet.

  2. Alex Simes

    Great! Thanks for putting this together 🙂

    Is there a way to save the model to allow documents to be tested on it without retraining the whole thing?

    Cheers!

    1. Radim (Post author)

      You’re welcome 🙂

      The best way to “save the model” is to specify the `prefix` parameter to LdaMallet constructor:
      http://radimrehurek.com/gensim/models/wrappers/ldamallet.html#gensim.models.wrappers.ldamallet.LdaMallet

      By default, the data files for Mallet are stored in temp under a randomized name, so you’ll lose them after a restart. But when you say `prefix="/my/directory/mallet/"`, all Mallet files are stored there instead. Then you can continue using the model even after reload.

      The Python model itself is saved/loaded using the standard `load()`/`save()` methods, like all models in gensim.
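      A minimal sketch, reusing the objects from the tutorial above (the directory and file names are just placeholders):

      # keep MALLET's working files in a stable directory instead of temp
      model = models.LdaMallet(mallet_path, corpus, num_topics=10,
          id2word=corpus.dictionary, prefix='/my/directory/mallet/')

      # save/load the Python-side model like any other gensim model
      model.save('/my/directory/reuters_mallet.model')
      model = models.LdaMallet.load('/my/directory/reuters_mallet.model')
      print model[bow]  # inference on new documents, no retraining needed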

      Hope that helps,
      Radim

  3. Kevin

    Do you know why I am getting the output this way?

    [[(0, 0.10000000000000002),
    (1, 0.10000000000000002),
    (2, 0.10000000000000002),
    (3, 0.10000000000000002),
    (4, 0.10000000000000002),
    (5, 0.10000000000000002),
    (6, 0.10000000000000002),
    (7, 0.10000000000000002),
    (8, 0.10000000000000002),
    (9, 0.10000000000000002)],
    [(0, 0.10000000000000002),
    (1, 0.10000000000000002),
    (2, 0.10000000000000002),
    (3, 0.10000000000000002),
    (4, 0.10000000000000002),
    (5, 0.10000000000000002),
    (6, 0.10000000000000002),
    (7, 0.10000000000000002),
    (8, 0.10000000000000002),
    (9, 0.10000000000000002)],

    1. Radim Rehurek (Post author)

      Are you using the same input as in the tutorial?

      Maybe you passed in two queries, so you got two outputs?

      Send more info (versions of gensim, mallet, input, gist your logs, etc).

      1. Kevin

        texts = ["Human machine interface enterprise resource planning quality processing management. , ",
        " management processing quality enterprise resource planning systems is user interface management.",
        "human engineering testing of enterprise resource planning interface processing quality management",
        "nasty food dry desert poor staff good service cheap price bad location restaurant recommended",
        "amazing service good food excellent desert kind staff bad service high price good location highly recommended",
        "restaurant poor service bad food desert not recommended kind staff bad service high price good location"
        ]

        #adapted from Gensim tutorial

        id2word = corpora.Dictionary(texts)
        corpus = [id2word.doc2bow(text) for text in texts]

        path_to_mallet = "/Mallet/bin/mallet"

        model = gensim.models.wrappers.LdaMallet(path_to_mallet, corpus, num_topics=2, id2word=id2word)
        print model[corpus]

        #output
        [[(0, 0.5), (1, 0.5)], [(0, 0.5), (1, 0.5)], [(0, 0.5), (1, 0.5)], [(0, 0.5), (1, 0.5)], [(0, 0.5), (1, 0.5)], [(0, 0.5), (1, 0.5)]]

        I don’t think this output is accurate. Can you identify the issue here? Thanks.

        1. Kevin

          Before creating the dictionary, I did tokenization (of course).
          # tokenize
          texts = [[word for word in document.lower().split() ] for document in texts]

  4. Stefan

    Is this supposed to work with Python 3? After making your sample compatible with Python 2/3, it runs under Python 2, but it throws an exception under Python 3.

    Traceback (most recent call last):
    File "demo.py", line 56, in
    print(model[bow]) # print list of (topic id, topic weight) pairs
    File "/…/python3.4/site-packages/gensim/models/wrappers/ldamallet.py", line 173, in __getitem__
    result = list(self.read_doctopics(self.fdoctopics() + '.infer'))
    File "/…/python3.4/site-packages/gensim/models/wrappers/ldamallet.py", line 254, in read_doctopics
    if lineno == 0 and line.startswith("#doc "):
    TypeError: startswith first arg must be bytes or a tuple of bytes, not str

