
Introduction to Text Analysis with sklearn

Introduction to pandas and sklearn

Recommendation System

We live in a world surrounded by recommendation systems - our shopping habits, our reading habits and our political opinions are heavily influenced by recommendation algorithms. So let's take a closer look at how to build a basic recommendation system.

Simply put, a recommendation system learns from your previous behavior and tries to recommend items that are similar to your previous choices. While there are a multitude of approaches for building recommendation systems, we will take a simple approach that is easy to understand and performs reasonably well.

For this exercise we will build a recommendation system that predicts which talks you'll enjoy at a conference - specifically our favorite conference Pycon!

Before you proceed

This project is still in the alpha stage. Bugs, typos, spelling and grammar mistakes, inconsistent terminology - there is plenty of room for errors. If you find one, open an issue on GitHub. Pull requests with corrections, fixes and enhancements will be received with open arms! Don't forget to add yourself to the list of contributors to this project.

Recommendation for Pycon talks

Take a look at the 2018 schedule. With 32 tutorials, 12 sponsor workshops, 16 talks at the education summit, and 95 talks at the main conference - Pycon has a lot to offer. Reading through all the talk descriptions and filtering out the ones that you should go to is a tedious process. Let's build a recommendation system that recommends talks from Pycon 2018, based on the ones that a person went to in 2017. This way the attendee does not waste any time deciding which talk to go to and can spend more time making friends on the hallway track!

We will be using pandas and scikit-learn to build the recommendation system from the text descriptions of the talks.



In our example the talk descriptions make up the documents


We have two classes to classify our documents

  • The talks that the attendee would like to see "in person". Denoted by 1
  • The talks that the attendee would watch "later online". Denoted by 0

A talk description labeled 0 means the user has chosen to watch the talk later, and a label of 1 means the user has chosen to watch it in person.

Supervised Learning

In supervised learning we inspect each observation in a given dataset and manually label it. This manually labeled data is used to construct a model that can predict the labels on new data. We will use a supervised learning technique called Support Vector Machines.

In unsupervised learning we do not need any manual labeling. The recommendation system finds the pattern in the data to build a model that can be used for recommendation.


The dataset contains the talk description and speaker details from Pycon 2017 and 2018. All the 2017 talk data has been labeled by a user who has been to Pycon 2017.

Required packages installation

The following packages are needed for this project. Execute the cell below to install them.

In [ ]:
!pip install -r requirements.txt

Exercise A: Load the data

The data directory contains a snapshot of one such user's labeling - let's load that up and start with our analysis.

In [3]:
import pandas as pd
import numpy as np
  | id | title | description | presenters | date_created | date_modified | location | talk_dt | year | label
0 | 1 | 5 ways to deploy your Python web app in 2017 | You’ve built a fine Python web application and... | Andrew T. Baker | 2018-04-19 00:59:20.151875 | 2018-04-19 00:59:20.151875 | Portland Ballroom 252–253 | 2017-05-08 15:15:00.000000 | 2017 | 0.0
1 | 2 | A gentle introduction to deep learning with Te... | Deep learning's explosion of spectacular resul... | Michelle Fullwood | 2018-04-19 00:59:20.158338 | 2018-04-19 00:59:20.158338 | Oregon Ballroom 203–204 | 2017-05-08 16:15:00.000000 | 2017 | 0.0
2 | 3 | aiosmtpd - A better asyncio based SMTP server | has been in the standard library for ... | Barry Warsaw | 2018-04-19 00:59:20.161866 | 2018-04-19 00:59:20.161866 | Oregon Ballroom 203–204 | 2017-05-08 14:30:00.000000 | 2017 | 1.0
3 | 4 | Algorithmic Music Generation | Music is mainly an artistic act of inspired cr... | Padmaja V Bhagwat | 2018-04-19 00:59:20.165526 | 2018-04-19 00:59:20.165526 | Portland Ballroom 251 & 258 | 2017-05-08 17:10:00.000000 | 2017 | 0.0
4 | 5 | An Introduction to Reinforcement Learning | Reinforcement learning (RL) is a subfield of m... | Jessica Forde | 2018-04-19 00:59:20.169075 | 2018-04-19 00:59:20.169075 | Portland Ballroom 252–253 | 2017-05-08 13:40:00.000000 | 2017 | 0.0

Here is a brief description of the interesting fields.

variable | description
title | Title of the talk
description | Description of the talk
year | Is it a 2017 talk or 2018
label | 1 indicates the user preferred seeing the talk in person, 0 indicates they would schedule it for later.

Note that the labels of all 2018 talks are set to 1. These are only placeholders and are not used in training the model. We will use the 2017 data for training and predict the labels of the 2018 talks.

Let's start by selecting the 2017 talk descriptions that were labeled by the user for watching in person.

df[(df.year==2017) & (df.label==1)]['description']

Print the description of the talks that the user preferred watching in person. How many such talks are there?
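As a small illustration of the boolean-mask selection used above, the sketch below applies the same pattern to a made-up DataFrame. The frame `toy` and its rows are invented for this example and are not taken from the talk dataset.

```python
import pandas as pd

# Hypothetical stand-in for the talk data; all values are made up.
toy = pd.DataFrame({
    "description": ["async web servers", "intro to deep learning",
                    "packaging basics", "type hints"],
    "year": [2017, 2017, 2017, 2018],
    "label": [1, 0, 1, 1],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than ==.
in_person_2017 = toy[(toy.year == 2017) & (toy.label == 1)]["description"]

print(in_person_2017.tolist())
print(len(in_person_2017))  # number of matching talks
```

Calling len() on the selection is one way to answer the "how many" question; the shape attribute would work as well.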

Exercise 1: Exploring the dataset

Exercise 1.1: Select 2017 talk description and labels from the Pandas dataframe. How many of them are present? Do the same for 2018 talks.

In [ ]:

The 2017 talks will be used for training and the 2018 talks will be used for predicting. Set year_labeled and year_predict to the appropriate values and print out description_labeled and description_predict.

In [ ]:
description_labeled = df[df.year==year_labeled]['description']
description_predict = df[df.year==year_predict]['description']

Quick Introduction to Text Analysis


Let's have a quick overview of text analysis. Our end goal is to train a machine learning algorithm by making it go through enough documents from each class to recognize the distinguishing characteristics of documents from a particular class.

  1. Labeling - This is the step where the user (i.e. a human) reviews a set of documents and manually classifies them. For our problem, a Pycon attendee labels each talk description from 2017 as "watch later" (0) or "watch now" (1).
  2. Training/Testing split - In order to test our algorithm, we split our labeled data into a training set (used to train the algorithm) and a testing set (used to test the algorithm).
  3. Vectorization & feature extraction - Since machine learning algorithms deal with numbers rather than words, we vectorize our documents - i.e. we split the documents into individual unique words and count the frequency of their occurrence across documents. Different kinds of data normalization are possible at this stage, such as stop word removal and lemmatization - but we will skip them for now. Each individual token occurrence frequency (normalized or not) is treated as a feature.
  4. Model training - This is where we build the model.
  5. Model testing - Here we run the model on the previously set aside test set and compare its predictions against the known labels to see how it is performing.
  6. Tweak and train - If our measures are not satisfactory, we change the parameters that define different aspects of the machine learning algorithm and train the model again.
  7. Once satisfied with the results from the previous step, we are ready to deploy the model and have it classify new unlabeled documents.

Exercise 2: Vectorize and Feature Extraction

In this step we build the feature set by tokenizing, counting and normalizing the unigrams and bi-grams from the text descriptions of the talks:

  • tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators
  • counting the occurrences of tokens in each document
  • normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents

You can find more information on text feature extraction here and TfidfVectorizer here.

In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
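
To make tokenizing, counting and weighting more concrete, here is a minimal sketch on three made-up documents. The documents and the names `toy_docs`, `toy_vectorizer`, `toy_matrix` and `toy_vocab` are assumptions for illustration only; the vectorizer settings mirror the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, used only to illustrate the three steps.
toy_docs = [
    "python web frameworks",
    "python data analysis",
    "testing python code",
]

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_docs)  # rows: documents, columns: n-grams

# Tokenizing: every unigram and bi-gram is mapped to an integer column id.
toy_vocab = toy_vectorizer.vocabulary_
print(sorted(toy_vocab))

# Weighting: "python" occurs in every document, so inside the first document
# it gets a lower weight than "web", which occurs in that document only.
first_doc = toy_matrix[0].toarray().ravel()
print(first_doc[toy_vocab["python"]] < first_doc[toy_vocab["web"]])
```
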
Extra Credit

Note that we are keeping the default values for most parameters of TfidfVectorizer. While this is a starting point, for better results we would want to come back and tune them to reduce noise. You can try that after you have taken a first pass through all the exercises. You might consider using spaCy to fine-tune the input to TfidfVectorizer.

Exercise 2.1 Fit_transform

We will use the fit_transform method to learn the vocabulary dictionary and return the document-term matrix. What should be the input to fit_transform?

In [32]:
vectorized_text_labeled = vectorizer.fit_transform( ... )

Exercise 2.2 Inspect the vocabulary

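Take a look at the vocabulary dictionary, which is accessible through the vocabulary_ attribute of the vectorizer. The effective stop word list can be retrieved with the get_stop_words() method. The stop_words_ attribute, where available, holds something different: the terms that were ignored because of max_df, min_df or max_features.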

In [ ]:

Use the get_feature_names_out method of the Tfidf vectorizer (get_feature_names in scikit-learn releases before 1.0) to get the features (terms).

In [ ]:
occurrences = np.asarray(vectorized_text_labeled.sum(axis=0)).ravel()
terms = ( ... )
counts_df = pd.DataFrame({'terms': terms, 'occurrences': occurrences}).sort_values('occurrences', ascending=False)

Exercise 2.3 Transform documents for prediction into document-term matrix

For the data on which we will do our predictions, we will use the transform method to get the document-term matrix. We will use this later, once we have our model ready. What should be the input to the transform function?

In [29]:
vectorized_text_predict = vectorizer.transform( ... )
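
The difference between fit_transform and transform can be seen on a tiny made-up corpus. This is a sketch under assumed toy data, not the notebook's solution: `labeled_docs` and `new_docs` are invented stand-ins for the labeled and unlabeled descriptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up text standing in for labeled and unlabeled talk descriptions.
labeled_docs = ["python web frameworks", "python data analysis"]
new_docs = ["quantum data pipelines"]

toy_vectorizer = TfidfVectorizer()
labeled_matrix = toy_vectorizer.fit_transform(labeled_docs)  # learns vocabulary and idf weights
new_matrix = toy_vectorizer.transform(new_docs)              # reuses them unchanged

# Both matrices have one column per term learned during fitting.
print(labeled_matrix.shape, new_matrix.shape)

# "quantum" and "pipelines" were never seen during fitting, so they are
# dropped and only "data" contributes a non-zero entry.
print(new_matrix.nnz)
```

Because the columns line up, a model trained on the first matrix can be applied directly to the second.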

Exercise 3: Split into training and testing set

Next we split our data into a training set and a testing set. Holding out a test set lets us check how the model performs on data it has not seen during training, which helps us detect overfitting. Use the train_test_split function from sklearn.model_selection to split vectorized_text_labeled into training and testing sets, with the test set being 30% (0.3) of the labeled data.

Here is the documentation for the function. The example usage should be helpful for understanding what the X_train, X_test, y_train, y_test tuple represents.

In [ ]:
from sklearn.model_selection import train_test_split
labels = df[df.year == 2017]['label']
test_size= ...
X_train, X_test, y_train, y_test = train_test_split(vectorized_text_labeled, labels, test_size=test_size, random_state=1)
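
For a feel of what the four outputs look like, here is a minimal sketch on made-up arrays. The names `toy_X`, `toy_y` and the `_tr`/`_te` suffixes are chosen for this example only, so they do not clash with the exercise variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, plus alternating labels.
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=1
)

# test_size=0.3 on 10 rows holds out 3 rows and keeps 7 for training.
print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```

The feature matrices and the label arrays are split with the same shuffled row order, so row i of X_tr still belongs to entry i of y_tr.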

Exercise 3.1 Inspect the shape of each output of train_test_split

For each of the output above, get the shape of the matrices.

In [ ]:

Exercise 4: Train the model

Finally we get to the stage of training the model. We are going to use a linear support vector classifier and check its accuracy by using the classification_report function. Note that we have not done any parameter tuning yet, so your model might not give you the best results. As with TfidfVectorizer, you can come back and tune these parameters later.

In [49]:
import sklearn.metrics  # makes sklearn.metrics.classification_report available for Exercise 5
from sklearn.svm import LinearSVC
classifier = LinearSVC(verbose=1)
classifier.fit(X_train, y_train)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=1)
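
To see what fitting a linear support vector classifier produces, the sketch below trains one on a handful of made-up 2D points rather than on the talk data. The points, labels and the name `toy_classifier` are assumptions for illustration.

```python
from sklearn.svm import LinearSVC

# Two well-separated made-up clusters: class 0 near the origin, class 1 far away.
toy_points = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_labels = [0, 0, 0, 1, 1, 1]

toy_classifier = LinearSVC(random_state=0)
toy_classifier.fit(toy_points, toy_labels)  # fit returns the estimator itself

# coef_ and intercept_ describe the learned separating line w.x + b = 0.
print(toy_classifier.coef_.shape)

# One point from each side of the line.
print(toy_classifier.predict([[0, 0], [6, 6]]))
```

With text data the idea is the same, except that each document is a point with one coordinate per term in the vocabulary.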

Exercise 5: Evaluate the model

Evaluate the model by using the classification_report function from sklearn.metrics. What are the values of precision, recall and f1-score? They are defined here.

In [ ]:
y_pred = classifier.predict( ... )
report = sklearn.metrics.classification_report( ... , ... )
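
As a reminder of what the report measures, the sketch below computes precision, recall and f1-score by hand for the positive class on made-up labels. The lists `y_true_toy` and `y_pred_toy` are invented for this example.

```python
# Made-up true and predicted labels (1 = "in person", 0 = "later online").
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # correctly recommended
fp = sum(t == 0 and p == 1 for t, p in pairs)  # recommended, but not wanted
fn = sum(t == 1 and p == 0 for t, p in pairs)  # wanted, but not recommended

precision = tp / (tp + fp)  # share of recommended talks that were wanted
recall = tp / (tp + fn)     # share of wanted talks that were recommended
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```

classification_report prints these quantities for every class, together with the number of samples per class.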

Exercise 6: Make Predictions

Use the model to predict which 2018 talks the user should go to. Pass vectorized_text_predict from Exercise 2.3 to the predict function to get predicted_talks_vector.

In [ ]:
predicted_talks_vector = classifier.predict( ... )

Using predicted_talk_indexes, get the talk id, description, presenters, title, location and talk date. How many talks should the user go to according to your model?

In [53]:
df_2018 = df[df.year==2018]
predicted_talk_indexes = predicted_talks_vector.nonzero()[0] + len(df[df.year==2017])

df_2018_talks = df_2018.loc[predicted_talk_indexes]
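
The offset in the cell above appears to rely on the DataFrame having its default 0..n-1 index with all 2017 rows first. An alternative sketch that selects rows by position with a boolean mask is shown below on made-up data; `toy`, `toy_2018` and `toy_predictions` are hypothetical names, and the prediction values are invented.

```python
import numpy as np
import pandas as pd

# Made-up frame: two 2017 rows followed by three 2018 rows.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e"],
    "year": [2017, 2017, 2018, 2018, 2018],
})
toy_2018 = toy[toy.year == 2018]

# Hypothetical predictions for the three 2018 rows, in row order.
toy_predictions = np.array([1, 0, 1])

# Offset approach, as in the cell above: positions shifted to index labels.
offset_indexes = toy_predictions.nonzero()[0] + len(toy[toy.year == 2017])
by_offset = toy_2018.loc[offset_indexes]

# Mask approach: selects by position, so the index labels do not matter.
by_mask = toy_2018[toy_predictions == 1]

print(by_offset.title.tolist())
print(by_mask.title.tolist())
```

Both approaches agree here; the mask version would keep working if the rows were reordered or the index reset.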

Next Steps:

You might not be very happy with the results. You might want to reduce the manual steps for tuning the parameters. So where do you go from here? There are three specific next steps that can make this better.

  • spaCy - This is an industrial-strength natural language processing library that has a friendly API. This would be useful in your feature extraction steps.
  • Try using a different algorithm. There are many to choose from.
  • Pipeline and GridSearchCV together make a great combination for automating the process of searching for the best models and parameters that accurately represent the patterns in your data.
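
As a rough idea of how Pipeline and GridSearchCV fit together, here is a minimal sketch on made-up descriptions. The documents, labels, step names ("tfidf", "svc") and grid values are all assumptions chosen for illustration, not recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up descriptions and labels, only to keep the sketch self-contained.
toy_docs = [
    "python web frameworks", "deploying web apps",
    "async web servers", "web api design",
    "deep learning models", "neural network training",
    "machine learning basics", "learning with tensors",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline chains vectorization and classification into one estimator.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", LinearSVC(random_state=0)),
])

# Parameter names follow "<step name>__<parameter>"; the values are arbitrary.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "svc__C": [0.1, 1.0, 10.0],
}

# Every combination is scored with 2-fold cross validation, then the best
# one is refit on all of the data.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(toy_docs, toy_labels)

print(search.best_params_)
print(search.predict(["web frameworks in python"]))
```

Because the vectorizer sits inside the pipeline, it is refit on each training fold, so the held-out fold never influences the vocabulary.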
In [ ]:

