In [1]:
import pandas as pd

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

We start by loading the well-known 20 newsgroups data set. The data has already been split into train and test subsets.

In [2]:
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

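Convert both the train and test data to TF-IDF vectors using TfidfVectorizer. The vectorizer is fitted on the train set only and then reused to transform the test set, so both sets share the same vocabulary.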

In [3]:
vectorizer = TfidfVectorizer()
vectors_train = vectorizer.fit_transform(newsgroups_train.data)
vectors_test = vectorizer.transform(newsgroups_test.data)
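
To make the difference between fit_transform and transform concrete, here is a minimal sketch on a made-up three-sentence corpus (the sentences and the variable names train_docs, test_docs, X_train and X_test are assumptions for illustration, not part of the notebook). Words that appear only in the test text are simply ignored, and both matrices end up with the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration only.
train_docs = ["the cat sat", "the dog barked"]
test_docs = ["the bird sat"]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the train texts.
X_train = vectorizer.fit_transform(train_docs)
# transform reuses that vocabulary; "bird" was never seen, so it is dropped.
X_test = vectorizer.transform(test_docs)

print(sorted(vectorizer.vocabulary_))  # the learned terms
print(X_train.shape, X_test.shape)     # same number of columns in both
```

Calling fit_transform on the test data instead would build a second, incompatible vocabulary, and the classifier trained on the first one could not use it.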

Train a Naive Bayes classifier on the train set and get the predictions for the test set.

In [4]:
clf = MultinomialNB()
clf.fit(vectors_train, newsgroups_train.target)
predicted = clf.predict(vectors_test)
expected = newsgroups_test.target
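
Before visualizing anything, a quick numeric check can be useful. The sketch below uses scikit-learn's accuracy_score and confusion_matrix on small made-up ID arrays (toy_expected and toy_predicted are hypothetical stand-ins for the expected and predicted arrays above). The same two calls should work on the real arrays.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up category IDs standing in for `expected` and `predicted`.
toy_expected = [0, 1, 2, 2, 1]
toy_predicted = [0, 2, 2, 2, 1]

# Fraction of records where the prediction matches the expected ID.
accuracy = accuracy_score(toy_expected, toy_predicted)
# Rows are expected IDs, columns are predicted IDs.
matrix = confusion_matrix(toy_expected, toy_predicted)

print(accuracy)
print(matrix)
```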

The predicted and expected arrays now contain the numeric IDs of the categories. We'll use a simple helper function that converts the numeric IDs to labels.

In [5]:
def name_targets(target_names, targets):
    return [target_names[t] for t in targets]
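
As a quick illustration of what the helper does, here it is applied to a made-up list of three category names (the short target_names list and the IDs are assumptions for the example; the real list comes from newsgroups_test.target_names).

```python
def name_targets(target_names, targets):
    return [target_names[t] for t in targets]

# A toy subset of category names, for illustration only.
target_names = ['alt.atheism', 'comp.graphics', 'rec.autos']

# Each numeric ID is used as an index into target_names.
labels = name_targets(target_names, [2, 0, 2, 1])
print(labels)  # ['rec.autos', 'alt.atheism', 'rec.autos', 'comp.graphics']
```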

Now that we have the classification results, we want to visualize them. We start by loading the data into a pandas DataFrame, which lets us quickly inspect and sanity-check the results. It also makes it easy to get the data into Facets Dive in the next step (though that can of course also be done without pandas).

In [6]:
df = pd.DataFrame({
    'expected': name_targets(newsgroups_test.target_names, expected),
    'predicted': name_targets(newsgroups_test.target_names, predicted),
    # shorten the texts to 1000 chars to reduce the volume of data to be sent to Facets Dive;
    # the '...' marker is added only to texts that were actually cut
    'data': [(text[:1000] + '...') if len(text) > 1000 else text for text in newsgroups_test.data],
    # add text lengths
    'length': [len(text) for text in newsgroups_test.data],
}, columns=['expected', 'predicted', 'length', 'data'])

In [7]:
# sample of records to be visualized
df.head()
Out[7]:
   expected               predicted              length  data
0  rec.autos              rec.autos                 695  From: [email protected] (NEIL B. ...
1  comp.windows.x         sci.crypt                 939  From: Rick Miller <[email protected]>\nSubject: ...
2  alt.atheism            alt.atheism               453  From: mathew <[email protected]>\nSubject: R...
3  talk.politics.mideast  talk.politics.mideast    5239  From: [email protected] (Dave Bakken)\nSub...
4  talk.religion.misc     alt.atheism              1007  From: [email protected] (Jon Livesey...
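
As one possible sanity check, the sketch below rebuilds just the expected and predicted columns of the five rows shown above and looks at them with plain pandas (the names sample, misclassified and confusion are made up for the example). The same expressions should work unchanged on the full df.

```python
import pandas as pd

# The expected/predicted pairs from the five sample rows above.
sample = pd.DataFrame({
    'expected': ['rec.autos', 'comp.windows.x', 'alt.atheism',
                 'talk.politics.mideast', 'talk.religion.misc'],
    'predicted': ['rec.autos', 'sci.crypt', 'alt.atheism',
                  'talk.politics.mideast', 'alt.atheism'],
})

# Share of rows where the prediction matches the expected label.
accuracy = (sample['expected'] == sample['predicted']).mean()

# Rows the classifier got wrong.
misclassified = sample[sample['expected'] != sample['predicted']]

# Counts of each (expected, predicted) combination.
confusion = pd.crosstab(sample['expected'], sample['predicted'])

print(accuracy)
print(misclassified)
print(confusion)
```

Three of the five sample rows match, so these rows alone say little about overall quality; the point is only to show the kind of checks the data frame makes easy.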

The code snippet below is all that's needed to load the classification results into Facets Dive.

A few comments about the snippet:

The <style>.container { width:100% !important; }</style> rule is optional, but I recommend it: it makes Facets Dive fill the full width of the browser window, which makes it easier to use.

The height of the widget can be customized by setting <facets-dive height="1234">.

The following code presets the view. This is also optional and included only for illustration:

fd['verticalFacet'] = 'predicted';
fd['verticalBuckets'] = 8;
fd['horizontalFacet'] = 'expected';
fd['horizontalBuckets'] = 8;
fd['colorBy'] = 'expected';

In [8]:
# IPython.display is the public module; importing display from IPython.core.display is deprecated
from IPython.display import display, HTML

HTML_TEMPLATE = """<link rel="import" href="/nbextensions/facets-dist/facets-jupyter.html">
        <facets-dive id="fd" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          var fd = document.querySelector("#fd");
          fd.data = data;
          fd['verticalFacet'] = 'predicted';
          fd['verticalBuckets'] = 8;
          fd['horizontalFacet'] = 'expected';
          fd['horizontalBuckets'] = 8;
          fd['colorBy'] = 'expected';
        </script>
        <style>.container {{ width:100% !important; }}</style>"""
html = HTML_TEMPLATE.format(jsonstr=df.to_json(orient='records'))
display(HTML(html))
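
If you want to reuse the snippet with a different height or different presets, the template can be wrapped in a small function. The sketch below is one way to do it, using only the standard library. build_dive_html and its parameters are hypothetical names, not part of Facets. It also shows two details that are easy to miss: literal CSS braces must be doubled when the template goes through str.format, and json.dumps does not escape "</", so a text field containing "</script>" would otherwise end the script block early.

```python
import json


def build_dive_html(records, height=600, presets=None):
    """Build the Facets Dive HTML for a list of dicts (hypothetical helper)."""
    # "<\/" is still valid JSON and JavaScript, but it can no longer be read
    # by the browser as the start of a closing </script> tag.
    json_str = json.dumps(records).replace("</", "<\\/")
    # One JavaScript assignment per preset, e.g. fd["colorBy"] = "expected";
    preset_lines = "\n  ".join(
        "fd[{}] = {};".format(json.dumps(key), json.dumps(value))
        for key, value in (presets or {}).items()
    )
    # Doubled braces in the <style> rule survive str.format as single braces.
    template = """<link rel="import" href="/nbextensions/facets-dist/facets-jupyter.html">
<facets-dive id="fd" height="{height}"></facets-dive>
<script>
  var data = {jsonstr};
  var fd = document.querySelector("#fd");
  fd.data = data;
  {presets}
</script>
<style>.container {{ width:100% !important; }}</style>"""
    return template.format(height=height, jsonstr=json_str, presets=preset_lines)


html = build_dive_html(
    [{"expected": "rec.autos", "predicted": "rec.autos", "data": "a </script> b"}],
    height=400,
    presets={"verticalFacet": "predicted", "colorBy": "expected"},
)
print(html)
```

In the notebook, the result would be shown with display(HTML(html)), and the records could come from df.to_dict(orient='records').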