import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
We start by loading the well-known 20 newsgroups data set. The data has already been split into train and test subsets.
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
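Convert both the train and test data to vectors using the TfidfVectorizer. The vectorizer is fitted on the train data only; the test data is then transformed with the same vocabulary.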
vectorizer = TfidfVectorizer()
vectors_train = vectorizer.fit_transform(newsgroups_train.data)
vectors_test = vectorizer.transform(newsgroups_test.data)
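To illustrate why the vectorizer is fitted on the train data and only applied to the test data, here is a minimal sketch in plain Python. It uses raw word counts rather than TF-IDF weighting, so it is a simplified stand-in and not how TfidfVectorizer is implemented. The function names fit_vocabulary and to_count_vector and the toy documents are made up for this example.

```python
def fit_vocabulary(docs):
    """Assign a column index to every distinct token seen in the training docs."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def to_count_vector(doc, vocab):
    """Count tokens of a document using a fixed vocabulary; unseen tokens are dropped."""
    vec = [0] * len(vocab)
    for token in doc.lower().split():
        idx = vocab.get(token)
        if idx is not None:
            vec[idx] += 1
    return vec


# Hypothetical toy documents, for illustration only.
train_docs = ['the rocket launch', 'the graphics card']
vocab = fit_vocabulary(train_docs)

# 'shader' never appeared in the training docs, so it has no column.
test_vec = to_count_vector('rocket graphics shader rocket', vocab)
print(vocab)
print(test_vec)
```

Because the vocabulary is fixed after fitting, train and test vectors have the same number of columns, which is what the classifier needs.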
Train a Naive Bayes classifier on the train set and get the predictions for the test set.
clf = MultinomialNB()
clf.fit(vectors_train, newsgroups_train.target)
predicted = clf.predict(vectors_test)
expected = newsgroups_test.target
The predicted and expected variables now contain the numeric IDs of the categories. We'll use a simple helper function that converts the numeric IDs to labels.
def name_targets(target_names, targets):
    return [target_names[t] for t in targets]
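As a quick illustration of what the helper does, here is a minimal sketch that applies it to a made-up list of three category names and a few made-up IDs. The toy values are assumptions for the example and do not reflect the actual ID order in the data set.

```python
def name_targets(target_names, targets):
    """Map numeric category IDs to their human-readable labels."""
    return [target_names[t] for t in targets]


# Hypothetical category names and IDs, for illustration only.
toy_names = ['alt.atheism', 'comp.graphics', 'sci.space']
toy_ids = [2, 0, 0, 1]

toy_labels = name_targets(toy_names, toy_ids)
print(toy_labels)
```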
Now that we have the classification results, we want to visualize them. We start by loading the data into a pandas DataFrame, which lets us quickly inspect and sanity check the results. It also makes it easy to get the data into Facets Dive in the next step (though this can of course be done without pandas).
df = pd.DataFrame({
    'expected': name_targets(newsgroups_test.target_names, expected),
    'predicted': name_targets(newsgroups_test.target_names, predicted),
    # shorten the texts to 1000 chars to reduce the volume of data to be sent to Facets Dive
    'data': [text[:1000] + '...' for text in newsgroups_test.data],
    # add text lengths
    'length': [len(text) for text in newsgroups_test.data],
}, columns=['expected', 'predicted', 'length', 'data'])
# sample of records to be visualized
df.head()
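One possible sanity check before visualizing is to compute the overall accuracy and the most frequent confusions. The sketch below does this in plain Python on made-up label lists; with the real results, the expected and predicted label lists could be passed in instead. The function name summarize_predictions is hypothetical, and the sketch assumes the lists are non-empty and of equal length.

```python
from collections import Counter


def summarize_predictions(expected_labels, predicted_labels):
    """Return (accuracy, Counter of (expected, predicted) pairs that disagree)."""
    pairs = list(zip(expected_labels, predicted_labels))
    correct = sum(1 for e, p in pairs if e == p)
    accuracy = correct / len(pairs)  # assumes at least one record
    confusions = Counter((e, p) for e, p in pairs if e != p)
    return accuracy, confusions


# Made-up labels, for illustration only.
toy_expected = ['sci.space', 'sci.space', 'comp.graphics', 'alt.atheism']
toy_predicted = ['sci.space', 'comp.graphics', 'comp.graphics', 'alt.atheism']

accuracy, confusions = summarize_predictions(toy_expected, toy_predicted)
print(accuracy)
print(confusions.most_common(3))
```

The confusion pairs correspond to the off-diagonal cells you would see when faceting by expected and predicted.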
The code snippet below is all that's needed to load the classification results into Facets Dive.
A few comments about the snippet:
<style>.container { width:100% !important; }</style> is optional, but I recommend it. It makes Facets Dive fill the full width of the browser window, which makes it easier to use.
The height of the widget can be customized by setting <facets-dive height="1234">.
The following code presets the view. This is also optional and included only for illustration:
fd['verticalFacet'] = 'predicted';
fd['verticalBuckets'] = 8;
fd['horizontalFacet'] = 'expected';
fd['horizontalBuckets'] = 8;
fd['colorBy'] = 'expected';
from IPython.display import display, HTML
HTML_TEMPLATE = """<link rel="import" href="/nbextensions/facets-dist/facets-jupyter.html">
<facets-dive id="fd" height="600"></facets-dive>
<script>
var data = {jsonstr};
var fd = document.querySelector("#fd");
fd.data = data;
fd['verticalFacet'] = 'predicted';
fd['verticalBuckets'] = 8;
fd['horizontalFacet'] = 'expected';
fd['horizontalBuckets'] = 8;
fd['colorBy'] = 'expected';
</script>
<style>.container {{ width:100% !important; }}</style>"""
html = HTML_TEMPLATE.format(jsonstr=df.to_json(orient='records'))
display(HTML(html))
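If you want to reuse the snippet with a different widget height, one option is to wrap the template in a small function. The sketch below is an assumption-laden variant, not part of Facets: the function name build_facets_dive_html and the toy record are made up, and it builds the JSON with the standard json module instead of pandas. Since json.dumps leaves "/" unescaped, the sketch also rewrites "</" as "<\/" so that a text containing a closing script tag cannot end the script element early.

```python
import json

# Reduced version of the template above, with the height as a placeholder.
TEMPLATE = """<link rel="import" href="/nbextensions/facets-dist/facets-jupyter.html">
<facets-dive id="fd" height="{height}"></facets-dive>
<script>
  var fd = document.querySelector("#fd");
  fd.data = {jsonstr};
</script>"""


def build_facets_dive_html(records, height=600):
    """Render the widget HTML for a list of dict records (hypothetical helper)."""
    # "\/" is a valid JSON (and JavaScript) escape for "/", so the data is unchanged.
    jsonstr = json.dumps(records).replace('</', '<\\/')
    return TEMPLATE.format(height=int(height), jsonstr=jsonstr)


# Made-up record, for illustration only.
toy_records = [{
    'expected': 'sci.space',
    'predicted': 'sci.space',
    'data': 'text with </script> inside',
}]
toy_html = build_facets_dive_html(toy_records, height=800)
print(toy_html)
```

The resulting string can be passed to HTML(...) and display(...) in the same way as the snippet above.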