Tutorial on Mallet in Python

Radim Řehůřek gensim, programming 28 Comments

MALLET, “MAchine Learning for LanguagE Toolkit” is a brilliant software tool. Unlike gensim, “topic modelling for humans”, which uses Python, MALLET is written in Java and spells “topic modeling” with a single “l”. Dandy.

MALLET’s LDA

MALLET’s implementation of Latent Dirichlet Allocation has lots of things going for it.

It’s based on sampling, which is a more accurate fitting method than variational Bayes. Variational methods, such as the online VB inference implemented in gensim, are easier to parallelize and guaranteed to converge… but they essentially solve an approximate, aka more inaccurate, problem.

MALLET is not “yet another midterm assignment implementation of Gibbs sampling”. It contains cleverly optimized code, is threaded to support multicore computers and, importantly, is battle-scarred by legions of humanities majors applying MALLET to literary studies.

Plus, it was written directly by David Mimno, a top expert in the field.
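To make “based on sampling” a little more concrete, here is a toy collapsed Gibbs sampler for LDA in plain Python. It is only a sketch of the general technique, not MALLET's optimized sampler: the function name `gibbs_lda`, the hyperparameter values and the tiny corpus of integer word ids are all made up for illustration.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; `docs` is a list of lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # tokens per (document, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic
    assignments = []

    # start from a random topic assignment for every token
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # take the token out of the counts...
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # ...and resample its topic from the (unnormalized) conditional
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # smoothed per-document topic proportions
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in row]
        for row, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word

# made-up corpus: word ids 0-1 and 2-3 tend to co-occur
toy_docs = [[0, 0, 1, 1], [2, 3, 3, 2], [0, 1, 2, 3]]
toy_theta, toy_topic_word = gibbs_lda(toy_docs, num_topics=2, vocab_size=4)
print(toy_theta)
```

Every token's topic is resampled one at a time, conditioned on all the other assignments; production samplers such as MALLET's add a lot of engineering on top of this basic loop.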

Gensim wrapper

I’ve wanted to include a similarly efficient sampling implementation of LDA in gensim for a long time, but never found the time/motivation. Ben Trahan, the author of the recent LDA hyperparameter optimization patch for gensim, is on the job.

In the meantime, I’ve added a simple wrapper around MALLET so it can be used directly from Python, following gensim’s API:

# note: later gensim releases expose this class as gensim.models.wrappers.LdaMallet
model = gensim.models.LdaMallet(path_to_mallet, corpus, num_topics=10, id2word=dictionary)
print(model[corpus])  # calculate & print topics of all documents in the corpus

And that’s it. The API is identical to the LdaModel class already in gensim, except you must specify the path to the MALLET executable as its first parameter.

Check the LdaMallet API docs for setting other parameters such as threading (faster training, but consumes more memory), sampling iterations etc.
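A hedged sketch of passing such parameters follows. The keyword names (`workers`, `iterations`, `optimize_interval`) and the `gensim.models.wrappers` import path are assumed from the wrapper's documented signature and may differ between gensim versions, so check them against the docs for your install. The helper name `train_mallet` and the values are arbitrary placeholders.

```python
# Illustrative settings only; tune them for your corpus and machine.
MALLET_OPTIONS = {
    "workers": 2,             # number of training threads
    "iterations": 500,        # number of sampling iterations
    "optimize_interval": 10,  # optimize hyperparameters every 10 iterations
}

def train_mallet(mallet_path, corpus, id2word, num_topics=10, **overrides):
    """Train an LdaMallet model with MALLET_OPTIONS, letting callers override any of them."""
    # Imported lazily so this sketch loads even without gensim installed.
    # Assumption: a gensim release that ships the models.wrappers package.
    from gensim.models.wrappers import LdaMallet

    options = dict(MALLET_OPTIONS, **overrides)
    return LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics,
                     id2word=id2word, **options)
```

Calling `train_mallet(path_to_mallet, corpus, dictionary, workers=4)` would then train with four threads while keeping the other settings.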

MALLET on Reuters

Let’s run a full end-to-end example.

NLTK includes several datasets we can use as our training corpus. In particular, the following assumes that the NLTK dataset “Reuters” can be found under /Users/kofola/nltk_data/corpora/reuters/training/:

# set up logging so we see what's going on
import logging
import os
from gensim import corpora, models, utils
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)

def iter_documents(reuters_dir):
    """Iterate over Reuters documents, yielding one document at a time."""
    for fname in os.listdir(reuters_dir):
        # read each document as one big byte string, closing the file when done
        with open(os.path.join(reuters_dir, fname), 'rb') as fin:
            document = fin.read()
        # parse document into a list of utf8 tokens
        yield utils.simple_preprocess(document)

class ReutersCorpus(object):
    def __init__(self, reuters_dir):
        self.reuters_dir = reuters_dir
        self.dictionary = corpora.Dictionary(iter_documents(reuters_dir))
        self.dictionary.filter_extremes()  # remove stopwords etc

    def __iter__(self):
        for tokens in iter_documents(self.reuters_dir):
            yield self.dictionary.doc2bow(tokens)

# set up the streamed corpus
corpus = ReutersCorpus('/Users/kofola/nltk_data/corpora/reuters/training/')
# INFO : adding document #0 to Dictionary(0 unique tokens: [])
# INFO : built Dictionary(24622 unique tokens: ['mdbl', 'fawc', 'degussa', 'woods', 'hanging']...) from 7769 documents (total 938238 corpus positions)
# INFO : keeping 7203 tokens which were in no less than 5 and no more than 3884 (=50.0%) documents
# INFO : resulting dictionary: Dictionary(7203 unique tokens: ['yellow', 'four', 'resisted', 'cyprus', 'increase']...)

# train 10 LDA topics using MALLET
mallet_path = '/Users/kofola/Downloads/mallet-2.0.7/bin/mallet'
# note: later gensim releases expose this class as models.wrappers.LdaMallet
model = models.LdaMallet(mallet_path, corpus, num_topics=10, id2word=corpus.dictionary)
# ...
# 0	5	spokesman ec government tax told european today companies president plan added made commission time statement chairman state national union
# 1	5	oil prices price production gas coffee crude market brazil international energy opec world petroleum bpd barrels producers day industry
# 2	5	trade japan japanese foreign economic officials united countries states official dollar agreement major told world yen bill house international
# 3	5	bank market rate stg rates exchange banks money interest dollar central week today fed term foreign dealers currency trading
# 4	5	tonnes wheat sugar mln export department grain corn agriculture week program year usda china soviet exports south sources crop
# 5	5	april march corp record cts dividend stock pay prior div board industries split qtly sets cash general share announced
# 6	5	pct billion year february january rose rise december fell growth compared earlier increase quarter current months month figures deficit
# 7	5	dlrs company mln year earnings sale quarter unit share gold sales expects reported results business canadian canada dlr operating
# 8	5	shares company group offer corp share stock stake acquisition pct common buy merger investment tender management bid outstanding purchase
# 9	5	mln cts net loss dlrs shr profit qtr year revs note oper sales avg shrs includes gain share tax
# 
# <1000> LL/token: -7.5002
# 
# Total time: 34 seconds

# now use the trained model to infer topics on a new document
doc = "Don't sell coffee, wheat nor sugar; trade gold, oil and gas instead."
bow = corpus.dictionary.doc2bow(utils.simple_preprocess(doc))
print(model[bow])  # print list of (topic id, topic weight) pairs
# [[(0, 0.0903954802259887),
#   (1, 0.13559322033898305),
#   (2, 0.11299435028248588),
#   (3, 0.0847457627118644),
#   (4, 0.11864406779661017),
#   (5, 0.0847457627118644),
#   (6, 0.0847457627118644),
#   (7, 0.10357815442561205),
#   (8, 0.09981167608286252),
#   (9, 0.0847457627118644)]]

Apparently topics #1 (oil&co) and #4 (wheat&co) got the highest weights, so it passes the sniff test.
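Rather than eyeballing the list, the strongest topics can be picked out by sorting on the weight. This small sketch reuses the numbers printed above; the variable names are just for illustration.

```python
# (topic id, topic weight) pairs copied from the inference output above
doc_topics = [
    (0, 0.0903954802259887), (1, 0.13559322033898305),
    (2, 0.11299435028248588), (3, 0.0847457627118644),
    (4, 0.11864406779661017), (5, 0.0847457627118644),
    (6, 0.0847457627118644), (7, 0.10357815442561205),
    (8, 0.09981167608286252), (9, 0.0847457627118644),
]

# sort by weight, highest first, and keep the two strongest topics
top_two = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)[:2]
print(top_two)
```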

Note that this MALLET wrapper is new in gensim version 0.9.0, and is extremely rudimentary for the time being. It serializes the input (training corpus) into a file, calls the Java process to run MALLET, then parses the output from the files that MALLET produces. Not very efficient, not very robust.
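The three steps can be sketched roughly as below. This is a simplified illustration, not the wrapper's actual code: the helper names are hypothetical, the one-document-per-line layout (`docno 0 tokens...`) and the `topic weight` pair format of the doc-topics file are assumptions based on MALLET 2.0.7, and the commands are only built here, not executed.

```python
def corpus_to_mallet_lines(corpus, id2word):
    """Step 1: serialize bag-of-words documents as 'docno 0 token token ...' lines."""
    for docno, bow in enumerate(corpus):
        tokens = []
        for word_id, count in bow:
            tokens.extend([id2word[word_id]] * int(count))
        yield "%d 0 %s" % (docno, " ".join(tokens))

def mallet_commands(mallet_path, text_file, mallet_file, doctopics_file, num_topics):
    """Step 2: command lines to hand to the Java process (e.g. via subprocess)."""
    import_cmd = [mallet_path, "import-file", "--keep-sequence",
                  "--input", text_file, "--output", mallet_file]
    train_cmd = [mallet_path, "train-topics", "--input", mallet_file,
                 "--num-topics", str(num_topics),
                 "--output-doc-topics", doctopics_file]
    return import_cmd, train_cmd

def parse_doctopics(lines, num_topics):
    """Step 3: parse 'docno name topic weight topic weight ...' lines."""
    for line in lines:
        if line.startswith("#doc "):
            continue  # skip the header line
        fields = line.split()[2:]  # drop the document number and name
        weights = [0.0] * num_topics
        for topic, weight in zip(fields[::2], fields[1::2]):
            weights[int(topic)] = float(weight)
        yield list(enumerate(weights))

# tiny made-up example
example_lines = list(corpus_to_mallet_lines([[(0, 2), (1, 1)]], {0: "oil", 1: "gas"}))
print(example_lines)
```

Every round trip goes through files on disk and a fresh Java process, which is where the inefficiency comes from.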

Depending on how this wrapper is used/received, I may extend it in the future.

Or even better, try your hand at improving it yourself.

Comments 28

  1. Joris

    Another nice update! Keep ’em coming! Suggestion: Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic Compositionality Through Recursive Matrix-Vector Spaces. In Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2012. 😉

    1. Radim Post
      Author
    1. Radim Post
      Author
      1. Artyom

        Ya, decided to clean it up a bit first and put my local version into a forked gensim. Will be ready in next couple of days

        I am also thinking about chancing a direct port of Blei’s DTM implementation, but not sure about it yet.

  2. Alex Simes

    Great! Thanks for putting this together 🙂

    Is there a way to save the model to allow documents to be tested on it without retraining the whole thing?

    Cheers!

    1. Radim Post
      Author
      Radim

      You’re welcome 🙂

      The best way to “save the model” is to specify the `prefix` parameter to LdaMallet constructor:
      http://radimrehurek.com/gensim/models/wrappers/ldamallet.html#gensim.models.wrappers.ldamallet.LdaMallet

      By default, the data files for Mallet are stored in temp under a randomized name, so you’ll lose them after a restart. But when you say `prefix="/my/directory/mallet/"`, all Mallet files are stored there instead. Then you can continue using the model even after reload.

      The Python model itself is saved/loaded using the standard `load()`/`save()` methods, like all models in gensim.

      Hope that helps,
      Radim

        1. Radim Post
          Author
  3. Kevin

    Do you know why I am getting the output this way?

    [[(0, 0.10000000000000002),
    (1, 0.10000000000000002),
    (2, 0.10000000000000002),
    (3, 0.10000000000000002),
    (4, 0.10000000000000002),
    (5, 0.10000000000000002),
    (6, 0.10000000000000002),
    (7, 0.10000000000000002),
    (8, 0.10000000000000002),
    (9, 0.10000000000000002)],
    [(0, 0.10000000000000002),
    (1, 0.10000000000000002),
    (2, 0.10000000000000002),
    (3, 0.10000000000000002),
    (4, 0.10000000000000002),
    (5, 0.10000000000000002),
    (6, 0.10000000000000002),
    (7, 0.10000000000000002),
    (8, 0.10000000000000002),
    (9, 0.10000000000000002)],

    1. Radim Rehurek Post
      Author
      Radim Rehurek

      Are you using the same input as in tutorial?

      Maybe you passed in two queries, so you got two outputs?

      Send more info (versions of gensim, mallet, input, gist your logs, etc).

      1. Kevin

        texts = ["Human machine interface enterprise resource planning quality processing management. , ",
        " management processing quality enterprise resource planning systems is user interface management.",
        "human engineering testing of enterprise resource planning interface processing quality management",
        "nasty food dry desert poor staff good service cheap price bad location restaurant recommended",
        "amazing service good food excellent desert kind staff bad service high price good location highly recommended",
        "restaurant poor service bad food desert not recommended kind staff bad service high price good location"
        ]

        #adapted from Gensim tutorial

        id2word = corpora.Dictionary(texts)
        corpus = [id2word.doc2bow(text) for text in texts]

        path_to_mallet = "/Mallet/bin/mallet"

        model = gensim.models.wrappers.LdaMallet(path_to_mallet, corpus, num_topics=2, id2word=id2word)
        print model[corpus]

        #output
        [[(0, 0.5), (1, 0.5)], [(0, 0.5), (1, 0.5)], [(0, 0.5), (1, 0.5)], [(0, 0.5), (1, 0.5)], [(0, 0.5), (1, 0.5)], [(0, 0.5), (1, 0.5)]]

        I don’t think this output is accurate. Can you identify the issue here? Thanks.

        1. Kevin

          Before creating the dictionary, I did tokenization (of course).
          # tokenize
          texts = [[word for word in document.lower().split() ] for document in texts]

  4. Stefan

    Is this supposed to work with Python 3? After making your sample compatible with Python2/3, it will run under Python 2, but it will throw an exception under Python 3.

    Traceback (most recent call last):
      File "demo.py", line 56, in <module>
        print(model[bow]) # print list of (topic id, topic weight) pairs
      File "/…/python3.4/site-packages/gensim/models/wrappers/ldamallet.py", line 173, in __getitem__
        result = list(self.read_doctopics(self.fdoctopics() + '.infer'))
      File "/…/python3.4/site-packages/gensim/models/wrappers/ldamallet.py", line 254, in read_doctopics
        if lineno == 0 and line.startswith("#doc "):
    TypeError: startswith first arg must be bytes or a tuple of bytes, not str

    1. Radim Řehůřek Post
      Author
  5. Sandy

    Hi Radim, thanks for the article .

    I am new to topic modelling and mallet.

    May I ask whether the “Gensim wrapper” and “MALLET on Reuters” sections go together, or are they two different things in this tutorial?

    1. Radim Řehůřek Post
      Author
    2. Sandy

      Sorry, I meant: do I need to run them as two different files, or should I put the two together and run them as a whole?

      1. Sandy

        I run this python file, which i took from your post.

        # set up logging so we see what's going on
        import logging
        import os
        from gensim import corpora, models, utils
        logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)

        def iter_documents(reuters_dir):
            """Iterate over Reuters documents, yielding one document at a time."""
            for fname in os.listdir(reuters_dir):
                # read each document as one big string
                document = open(os.path.join(reuters_dir, fname)).read()
                # parse document into a list of utf8 tokens
                yield utils.simple_preprocess(document)

        class ReutersCorpus(object):
            def __init__(self, reuters_dir):
                self.reuters_dir = reuters_dir
                self.dictionary = corpora.Dictionary(iter_documents(reuters_dir))
                self.dictionary.filter_extremes() # remove stopwords etc

            def __iter__(self):
                for tokens in iter_documents(self.reuters_dir):
                    yield self.dictionary.doc2bow(tokens)

        # set up the streamed corpus
        corpus = ReutersCorpus('/Users/kofola/nltk_data/corpora/reuters/training/')
        # INFO : adding document #0 to Dictionary(0 unique tokens: [])
        # INFO : built Dictionary(24622 unique tokens: ['mdbl', 'fawc', 'degussa', 'woods', 'hanging']...) from 7769 documents (total 938238 corpus positions)
        # INFO : keeping 7203 tokens which were in no less than 5 and no more than 3884 (=50.0%) documents
        # INFO : resulting dictionary: Dictionary(7203 unique tokens: ['yellow', 'four', 'resisted', 'cyprus', 'increase']...)

        # train 10 LDA topics using MALLET
        mallet_path = '/Users/kofola/Downloads/mallet-2.0.7/bin/mallet'
        model = models.LdaMallet(mallet_path, corpus, num_topics=10, id2word=corpus.dictionary)
        # ...
        # 0 5 spokesman ec government tax told european today companies president plan added made commission time statement chairman state national union
        # 1 5 oil prices price production gas coffee crude market brazil international energy opec world petroleum bpd barrels producers day industry
        # 2 5 trade japan japanese foreign economic officials united countries states official dollar agreement major told world yen bill house international
        # 3 5 bank market rate stg rates exchange banks money interest dollar central week today fed term foreign dealers currency trading
        # 4 5 tonnes wheat sugar mln export department grain corn agriculture week program year usda china soviet exports south sources crop
        # 5 5 april march corp record cts dividend stock pay prior div board industries split qtly sets cash general share announced
        # 6 5 pct billion year february january rose rise december fell growth compared earlier increase quarter current months month figures deficit
        # 7 5 dlrs company mln year earnings sale quarter unit share gold sales expects reported results business canadian canada dlr operating
        # 8 5 shares company group offer corp share stock stake acquisition pct common buy merger investment tender management bid outstanding purchase
        # 9 5 mln cts net loss dlrs shr profit qtr year revs note oper sales avg shrs includes gain share tax
        #
        # <1000> LL/token: -7.5002
        #
        # Total time: 34 seconds

        # now use the trained model to infer topics on a new document
        doc = "Don't sell coffee, wheat nor sugar; trade gold, oil and gas instead."
        bow = corpus.dictionary.doc2bow(utils.simple_preprocess(doc))
        print model[bow] # print list of (topic id, topic weight) pairs
        # [[(0, 0.0903954802259887),
        # (1, 0.13559322033898305),
        # (2, 0.11299435028248588),
        # (3, 0.0847457627118644),
        # (4, 0.11864406779661017),
        # (5, 0.0847457627118644),
        # (6, 0.0847457627118644),
        # (7, 0.10357815442561205),
        # (8, 0.09981167608286252),
        # (9, 0.0847457627118644)]]

      2. Sandy

        And I got this error. So I am not sure: do I include the gensim wrapper in the same Python file, or what should I do next?

        C:\Python27\lib\site-packages\gensim\utils.py:1167: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
          warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
        2018-02-28 23:08:15,959 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
        2018-02-28 23:08:15,984 : INFO : built Dictionary(1131 unique tokens: [u'stock', u'all', u'concept', u'managed', u'forget']...) from 20 documents (total 4006 corpus positions)
        2018-02-28 23:08:15,986 : INFO : discarding 1050 tokens: [(u'ad', 2), (u'add', 3), (u'agains', 1), (u'always', 4), (u'and', 14), (u'annual', 1), (u'ask', 3), (u'bad', 2), (u'bar', 1), (u'before', 3)]...
        2018-02-28 23:08:15,987 : INFO : keeping 81 tokens which were in no less than 5 and no more than 10 (=50.0%) documents
        2018-02-28 23:08:15,989 : INFO : resulting dictionary: Dictionary(81 unique tokens: [u'all', u'since', u'help', u'just', u'then']...)
        Traceback (most recent call last):
          File "Topic.py", line 37, in <module>
            model = models.LdaMallet(mallet_path, corpus, num_topics=10, id2word=corpus.dictionary)
        AttributeError: 'module' object has no attribute 'LdaMallet'

        1. Joshua

          Sandy,
          First to answer your question:
          I had the same error (AttributeError: 'module' object has no attribute 'LdaMallet'). I looked in gensim/models and found that ldamallet.py is in the wrappers directory (https://github.com/RaRe-Technologies/gensim/tree/develop/gensim/models/wrappers). So, instead use the following:
          from gensim.models import wrappers
          (I used `from gensim.models.wrappers import LdaMallet`.)

          Next, I noticed that your number of kept tokens is very small (81), since you’re using a small corpus. This may be appropriate since those would be the most confident distinctive words, but I’d use a lower no_below (to keep infrequent tokens) and possibly a higher no_above ratio. .filter_extremes(no_below=1, no_above=.7)

          Finally, use self.model.save(model_filename) to save the model (you can then use load()) and self.model.show_topics(num_topics=-1) to get a list of all topics so that you can see what each number corresponds to, and what words represent the topics.

  6. Raniem

    Hello.

    I would like to thank you for your great efforts.

    I have a question, if you don’t mind: is it normal that I get completely different topic models when using Mallet LDA and gensim LDA?

    I expect differences but they seem to be very different when I tried them on my corpus. I have also compared with the Reuters corpus and below are my models definitions and the top 10 topics for each model.

    model = models.wrappers.LdaMallet(mallet_path, corpus, num_topics=10, id2word=corpus.dictionary)
    gensim_model= gensim.models.ldamodel.LdaModel(corpus,num_topics=10,id2word=corpus.dictionary)

    there are some different parameters like alpha I guess, but I am not sure if there is any other parameter that I have missed and made the results so different?!

    ======================Mallet Topics====================

    [(0, '0.176*"dlr" + 0.041*"sale" + 0.041*"mln" + 0.032*"april" + 0.030*"march" + 0.027*"record" + 0.027*"quarter" + 0.026*"year" + 0.024*"earn" + 0.023*"dividend"'),
    (1, '0.016*"spokesman" + 0.014*"sai" + 0.013*"franc" + 0.012*"report" + 0.012*"state" + 0.012*"govern" + 0.011*"plan" + 0.011*"union" + 0.010*"offici" + 0.010*"todai"'),
    (2, '0.125*"pct" + 0.078*"billion" + 0.062*"year" + 0.030*"februari" + 0.030*"januari" + 0.024*"rise" + 0.021*"rose" + 0.019*"month" + 0.016*"increas" + 0.015*"compar"'),
    (3, '0.045*"trade" + 0.020*"japan" + 0.017*"offici" + 0.014*"countri" + 0.013*"meet" + 0.011*"japanes" + 0.011*"agreement" + 0.011*"import" + 0.011*"industri" + 0.010*"world"'),
    (4, '0.047*"compani" + 0.036*"corp" + 0.029*"unit" + 0.018*"sell" + 0.016*"approv" + 0.016*"acquisit" + 0.015*"complet" + 0.015*"busi" + 0.014*"merger" + 0.013*"agreement"'),
    (5, '0.076*"share" + 0.040*"stock" + 0.037*"offer" + 0.028*"group" + 0.027*"compani" + 0.016*"board" + 0.016*"sharehold" + 0.016*"common" + 0.016*"invest" + 0.015*"pct"'),
    (6, '0.056*"oil" + 0.043*"price" + 0.028*"product" + 0.014*"ga" + 0.013*"barrel" + 0.012*"crude" + 0.012*"gold" + 0.011*"year" + 0.011*"cost" + 0.010*"increas"'),
    (7, '0.041*"tonn" + 0.032*"export" + 0.023*"price" + 0.017*"produc" + 0.016*"wheat" + 0.013*"agricultur" + 0.013*"sugar" + 0.012*"grain" + 0.011*"week" + 0.011*"coffe"'),
    (8, '0.221*"mln" + 0.117*"ct" + 0.092*"net" + 0.087*"loss" + 0.067*"shr" + 0.056*"profit" + 0.044*"oper" + 0.038*"dlr" + 0.033*"qtr" + 0.033*"rev"'),
    (9, '0.067*"bank" + 0.039*"rate" + 0.030*"market" + 0.023*"dollar" + 0.017*"stg" + 0.016*"exchang" + 0.014*"currenc" + 0.013*"monei" + 0.011*"yen" + 0.011*"reserv"')]

    =======================Gensim Topics====================
    [(0, '0.028*"oil" + 0.015*"price" + 0.011*"meet" + 0.010*"dlr" + 0.008*"mln" + 0.008*"opec" + 0.008*"stock" + 0.007*"tax" + 0.007*"bpd" + 0.007*"product"'),
    (1, '0.062*"ct" + 0.031*"april" + 0.031*"record" + 0.023*"div" + 0.022*"pai" + 0.021*"qtly" + 0.021*"dividend" + 0.019*"prior" + 0.015*"march" + 0.014*"set"'),
    (2, '0.066*"mln" + 0.061*"dlr" + 0.060*"loss" + 0.051*"ct" + 0.049*"net" + 0.038*"shr" + 0.030*"year" + 0.028*"profit" + 0.026*"pct" + 0.020*"rev"'),
    (3, '0.032*"mln" + 0.031*"dlr" + 0.022*"compani" + 0.012*"bank" + 0.012*"stg" + 0.011*"year" + 0.010*"sale" + 0.010*"unit" + 0.009*"corp" + 0.008*"market"'),
    (4, '0.049*"bank" + 0.025*"rate" + 0.022*"pct" + 0.011*"billion" + 0.010*"reserv" + 0.009*"market" + 0.008*"central" + 0.008*"gold" + 0.008*"monei" + 0.007*"februari"'),
    (5, '0.023*"share" + 0.022*"dlr" + 0.015*"compani" + 0.015*"stock" + 0.011*"offer" + 0.011*"trade" + 0.009*"billion" + 0.008*"pct" + 0.006*"agreement" + 0.006*"debt"'),
    (6, '0.016*"trade" + 0.015*"pct" + 0.011*"year" + 0.009*"price" + 0.009*"export" + 0.008*"market" + 0.007*"japan" + 0.007*"industri" + 0.007*"govern" + 0.006*"import"'),
    (7, '0.109*"mln" + 0.048*"billion" + 0.028*"net" + 0.025*"year" + 0.025*"dlr" + 0.020*"ct" + 0.017*"shr" + 0.013*"profit" + 0.011*"sale" + 0.009*"pct"'),
    (8, '0.030*"mln" + 0.029*"pct" + 0.024*"share" + 0.024*"tonn" + 0.011*"dlr" + 0.010*"year" + 0.010*"stock" + 0.010*"offer" + 0.009*"tender" + 0.009*"corp"'),
    (9, '0.010*"grain" + 0.010*"tonn" + 0.010*"corn" + 0.009*"year" + 0.009*"ton" + 0.008*"strike" + 0.008*"union" + 0.008*"report" + 0.008*"compani" + 0.008*"wheat"')]

  7. Shiks

    hey, i am getting an error:

    “Error: Could not find or load main class cc.mallet.classify.tui.Csv2Vectors.java”

    how to correct this error? please help me out with it. thank you.
