Topic Modelling: 1-Day Intensive


The “Topic Modelling” 1-Day Intensive teaches teams how to extract information from unstructured plain-text documents using Python’s powerful data ecosystem.

Teams learn smart, efficient practices for building, improving and deploying scalable natural language processing (NLP) systems in Python, using existing software libraries rather than reinventing the wheel.

All course materials can be customised to focus on your business’s real challenges and products in development.

A combination of teaching and hands-on programming exercises will give learners the opportunity to apply, test and refine their knowledge, improving retention and building confidence through real-time feedback.

Who Should Attend?

The course will appeal to programmers and scientists who want to improve their proficiency in Python and streamline their NLP workflows.


By the end of this 1-day intensive, participants will have the necessary skills to: 

  • Write robust document processing pipelines using best industry practices
  • Understand the capabilities and limitations of existing NLP tools and algorithms for semantic analysis
  • Apply algorithms for entity extraction, chunking, semantic indexing and document retrieval
  • Communicate modelling results to stakeholders and provide meaningful insights into the data
  • Optimise the CPU and memory usage of existing or in-development Topic Modelling systems
  • Integrate with non-Python data mining tools and services


Attendees are expected to be familiar with basic programming concepts and terminology (command line, shell, filesystem navigation, basic data structures such as lists and dictionaries, basic algorithms, and basic Python syntax).

In addition…

  • Each participant must have their own laptop, with a system that supports Python (OS X, Linux, Windows…), to participate in the interactive exercises throughout training.
  • Every participant is expected to have downloaded and installed the necessary software libraries in advance, as instructed in RaRe’s “Before You Arrive – Setup Sheet”. Delays due to installation issues on-site may affect the day’s training schedule.


Please note: This syllabus can be customised. We are happy to tailor course content and exercises to your specific needs, projects or areas of focus.

DAY 1 

  • Course Introduction
    • Administration, setup and course materials distribution
    • Course structure and agenda
    • Participants and trainer introductions
  • Session 1: Text Processing
    • NLP ecosystem: gensim, NLTK, spaCy
    • Streamed corpora: generators, lazy processing
    • Semantic text transformations: LSI, LDA, word2vec, doc2vec
    • Named entity extraction, entity linking, knowledge bases (KBs)
    • Model quality evaluation and tuning
    • Performance gotchas and tips
  • Interactive Programming Exercise
  • Session 2: Indexing and Retrieval
    • Indexing documents
    • Retrieving related documents with semantic queries
    • Scaling up, approximate document search
    • Searching Wikipedia
  • Interactive Programming Exercise
  • Session 3: Integration, APIs
    • Presenting text collections to stakeholders: pyLDAvis, D3.js
    • Microservices, web APIs: Flask, CherryPy
    • Interactive charts & graphs: matplotlib, seaborn, bokeh
    • Non-Python ecosystems: Spark, AWS (EC2, S3, EMR), HDF5, Elasticsearch
  • Interactive Programming Exercise
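
As a flavour of the "Streamed corpora: generators, lazy processing" topic in Session 1, here is a minimal sketch using only the Python standard library. It is an illustration rather than actual course material: the function names, the tokenising rule and the sample sentences are all assumptions made for this example.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, List

# Illustrative tokeniser: runs of ASCII letters only (an assumption,
# real pipelines usually need something more careful).
TOKEN_RE = re.compile(r"[a-z]+")


def stream_tokens(lines: Iterable[str]) -> Iterator[List[str]]:
    """Lazily yield one tokenised document per non-empty input line."""
    for line in lines:
        tokens = TOKEN_RE.findall(line.lower())
        if tokens:
            yield tokens


def count_terms(docs: Iterable[List[str]]) -> Counter:
    """Consume a document stream one document at a time and count terms."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts


# Hypothetical sample corpus: one document per line.
sample_lines = [
    "Topic models summarise large text collections.",
    "",
    "Streaming keeps memory use flat as collections grow.",
]

term_counts = count_terms(stream_tokens(sample_lines))
```

Because `stream_tokens` is a generator, only one document is held in memory at a time. An open file handle can be passed in place of `sample_lines`, since Python file objects iterate line by line.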


Contact us today to discuss available dates and how we can customise this training for your team.