Chipmunks Data Science

April 17, 2019

Project Night Purpose¶

Many people assume data scientists spend all day visualizing data and making impressive predictive models. While this isn’t untrue, the luckiest and most productive data scientists spend a lot of their time communicating. They communicate their model results - as well as their assumptions and limitations when making their models and doing analysis - in a way that is digestible to their stakeholders and colleagues.

Tonight’s project is aimed towards that aspect of communication. You will be asked to make assumptions as a team - particularly as they pertain to this problem and what the stakeholders need. There are no exactly correct assumptions or answers for this project night. There may be assumptions and answers that clearly don’t have evidence to support them, but do not feel bogged down by getting the “right” answer.

Most importantly - have fun. While this project night covers serious concepts, it is ridiculously silly and meant to be taken with a bit of lighthearted exploration and plenty of opportunities to make mistakes.

Oh, no! We've had a data crash.¶

As ChiPy leadership was preparing for PyCon at the end of this month, they found that the dataset on our infamous ChiPy chipmunks has disappeared. While they transition from Oracle to Postgres, the leadership team has enlisted your help as data scientists to analyze some salvaged chipmunk data. The PyCon organizers had a few questions about coding in Chicago, ChiPy, and chipmunks that need answers. We will get to those questions shortly, but first let's get to the data.

Reading in the Data¶

The salvaged chipmunk dataset is chipmunks.csv. The wonderful pandas library, built on numpy, will let the team read in the data.

ChiPy Check-in

Now is a good time to check in with the team. Is anyone familiar with pandas and numpy? Discuss with your team what these libraries are, what they allow data scientists to do, and then decide on what pandas function will read in our data.

Setting up your environment¶

This project is contained in a Jupyter notebook and assumes you have Python 3 installed on your machine. If this is your first project night, we recommend creating a folder for the project night repo: mkdir chipy_projects && cd chipy_projects. If you already have the project night repository on your machine, go to that directory and pull from master.

If you are using Linux or OS X, run the following to create a new virtualenv:

python3 -m venv chipmunk
source chipmunk/bin/activate

On Windows, run the following:

python3 -m venv chipmunk 
chipmunk\Scripts\activate

Getting the project¶

The project is in the ChiPy project night repo. If you do not have the repository already, run

git clone https://github.com/chicagopython/CodingWorkshops.git

Now we will:

Go to the project:

cd CodingWorkshops/problems/data_science/chipmunks

Install the packages we need into our environment:

pip install -r requirements.txt

Run our jupyter notebook server for the project:

jupyter notebook

The dataset is in the csv file chipmunks.csv.

In [ ]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [ ]:

# Read in the data
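
As a hedged sketch of the read step: pandas can load a CSV with pd.read_csv. The real chipmunks.csv is not reproduced here, so the example below reads a tiny in-memory stand-in instead. Only ChiPy, chicago, and coding_enjoyment are column names mentioned in this notebook; the species column, the toy_df name, and every value are assumptions made up for illustration.

```python
import io

import pandas as pd

# Stand-in for chipmunks.csv so the sketch runs anywhere.
# The species column and all values are hypothetical.
toy_csv = io.StringIO(
    "species,chicago,ChiPy,coding_enjoyment\n"
    "eastern,1,1,8.5\n"
    "siberian,0,0,6.0\n"
    "eastern,1,0,\n"
)

# pd.read_csv accepts a file path or any file-like object.
# In the notebook this would be: df = pd.read_csv("chipmunks.csv")
toy_df = pd.read_csv(toy_csv)

print(toy_df.shape)
print(toy_df)
```

The empty field at the end of the last row is read as a missing value (NaN), which is the kind of gap the later sections ask about.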

Exploring the data¶

We need to be familiar with our data before we can answer questions about ChiPy and our chipmunks. Let's start with some questions we would ask of any dataset:

How many rows are in this dataset? What does each row represent?
What does the data look like? Check the first 5 rows
Is there missing data? If so, how much is missing?
What columns are categorical?
How many unique values does each column have?

In [ ]:

## Check the number of rows

In [ ]:

## See first 5 rows of data

In [ ]:

## Check for missing data

In [ ]:

## Check for categorical data and unique number of values
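
One possible way to answer the exploration questions above, sketched on a toy frame rather than the real data. The species column and all values are hypothetical; the pandas calls (len, head, isna, dtypes, nunique) are the part that carries over.

```python
import numpy as np
import pandas as pd

# Toy stand-in; the species column and all values are hypothetical.
toy_df = pd.DataFrame({
    "species": ["eastern", "siberian", "eastern", "least"],
    "chicago": [1, 0, 1, 0],
    "ChiPy": [1, 0, 0, 1],
    "coding_enjoyment": [8.5, 6.0, np.nan, 7.2],
})

n_rows = len(toy_df)                      # number of rows
first_rows = toy_df.head()                # first 5 rows (fewer if the frame is short)
missing_per_column = toy_df.isna().sum()  # count of missing values per column
column_types = toy_df.dtypes              # object or string dtype often signals categorical text
unique_per_column = toy_df.nunique()      # distinct non-missing values per column

print(n_rows)
print(first_rows)
print(missing_per_column)
print(column_types)
print(unique_per_column)
```

Note that 0/1 columns such as ChiPy and chicago are stored as numbers but behave as categories, so dtype alone does not settle which columns are categorical.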

Was there missing data?¶

We will keep exploring the data and start answering questions soon, but first let's address missing data (if there is any). What columns have missing data? What kind of data is missing?

ChiPy Check-in

This is a great point for discussion. If there is missing data - why might it be missing? Discuss some possible reasons with your team and decide on a reason that makes sense.

Imputation is the process of replacing missing data with some estimated value. The process can be as complicated (or simple) as you would like it to be! Given the possible reason for our missing data, what is an acceptable imputation?

Impute any missing data in your dataset and note what assumptions you made as a team. If you are not sure how to replace data in pandas, feel free to use google like a proper data scientist.

In [ ]:

# Replace any missing data here

In [ ]:

# Check your data for missing values to see if it worked!
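
A minimal sketch of one simple imputation, assuming the missing values are in a numeric column and are missing at random. It fills the gaps with the column median using fillna. The median is only one choice among many, and the toy values are made up; your team's story about why the data is missing may call for something else.

```python
import numpy as np
import pandas as pd

# Toy stand-in with two missing enjoyment scores (values are hypothetical).
toy_df = pd.DataFrame({
    "chicago": [1, 0, 1, 0, 1],
    "coding_enjoyment": [8.0, 6.0, np.nan, 7.0, np.nan],
})

# Assumption: values are missing at random, so the column median
# is a reasonable stand-in for the unknown scores.
median_enjoyment = toy_df["coding_enjoyment"].median()
imputed_df = toy_df.assign(
    coding_enjoyment=toy_df["coding_enjoyment"].fillna(median_enjoyment)
)

# Re-check for missing values to confirm the fill worked.
print(imputed_df.isna().sum())
```

Using assign returns a new frame and leaves the original untouched, which makes it easy to compare before and after.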

Stakeholder Questions¶

Question #1¶

The great folks at PyCon want to know all about ChiPy and our chipmunks. They have heard that ChiPy is an inclusive and open community. Can we support that claim with our data? Given that the ChiPy column takes a value of 1 for a ChiPy chipmunk and a value of 0 for chipmunks not in ChiPy, start to explore this question.

Some ideas to get you started:

Are chipmunks of different species represented in ChiPy?
Are chipmunks of different sizes represented in ChiPy?
Are chipmunks of different careers represented in ChiPy?
Are spotted and not spotted chipmunks represented in ChiPy?

ChiPy Check-in

There are no right or wrong answers here, only well supported or poorly supported ones! Discuss as a group the aspects of the data you have looked at and if it constitutes enough evidence to justify an answer.

In [ ]:

## Exploration of species

In [ ]:

## Exploration of size

In [ ]:

## Exploration of careers

In [ ]:

## Exploration of spotted vs non-spotted
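
As a sketch of how representation could be explored for any one characteristic, the example below counts members and non-members per group with pd.crosstab and computes a membership rate with groupby. The species column name and all values are assumptions; the same pattern would apply to whatever the size, career, and spotted columns are called in the real data.

```python
import pandas as pd

# Toy stand-in; the species column name and all values are hypothetical.
toy_df = pd.DataFrame({
    "species": ["eastern", "eastern", "siberian", "siberian", "least", "least"],
    "ChiPy": [1, 0, 1, 1, 0, 1],
})

# Counts of non-members (0) and members (1) within each species.
counts = pd.crosstab(toy_df["species"], toy_df["ChiPy"])

# Because ChiPy is coded 0/1, its mean within a group is the share of members.
membership_rate = toy_df.groupby("species")["ChiPy"].mean()

print(counts)
print(membership_rate)
```

Whether a given spread of rates counts as "inclusive" is a judgment for the team to make; the table only shows who is present.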

Question #2¶

The word on the street at PyCon is that chipmunks that live in Chicago enjoy coding more than those that don't. Is this true? Given that the chicago column takes a value of 1 for chipmunks that live in Chicago and a value of 0 for chipmunks that do not, explore this question.

Visualize the distributions of coding_enjoyment for chipmunks that do and do not live in Chicago.
Come up with a way to test our question.

ChiPy Check-in

Coming up with a proper way to test stakeholder questions can be an art form as well as a science. We have imported a few statistical tests below that may (or may not) be appropriate for our question. First consider a way to frame our question as something to disprove (those familiar with jargon, let's construct a null hypothesis) - then conduct a test that may disprove it. Reading the documentation for the imported tests below may prove to be very helpful!

In [ ]:

from scipy.stats import ttest_ind, levene, chisquare

In [ ]:

## Beautiful plot

In [ ]:

## Statistical Test
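
One way the test might be framed, as a sketch rather than the intended answer: take the null hypothesis to be "mean coding_enjoyment is the same for Chicago and non-Chicago chipmunks" and compare the two groups with ttest_ind. The scores below are made up. Using equal_var=False (Welch's t-test) is an assumption on our part; levene can be used to check whether the group variances look similar.

```python
import numpy as np
from scipy.stats import levene, ttest_ind

# Hypothetical enjoyment scores for the two groups.
chicago_scores = np.array([7.0, 8.0, 9.0, 8.5, 7.5])
other_scores = np.array([5.0, 6.0, 5.5, 6.5, 6.0])

# Levene's test: null hypothesis is that both groups have equal variance.
levene_stat, levene_p = levene(chicago_scores, other_scores)

# Welch's t-test: null hypothesis is that both groups have the same mean.
# equal_var=False avoids assuming equal variances.
t_stat, p_value = ttest_ind(chicago_scores, other_scores, equal_var=False)

print("levene:", levene_stat, levene_p)
print("t-test:", t_stat, p_value)
```

A positive t statistic here means the first group (Chicago) has the higher sample mean; a small p-value means a difference this large would be unlikely if the null hypothesis were true.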

Question #2, Continued¶

We have now compared two groups of chipmunks - those that live in Chicago and those that do not - and have either rejected or failed to reject a null hypothesis. What values did the statistical test return and what do they mean? Can we be confident in our results? How confident?

Regardless of our test results, what are the limitations of the test? One limitation is that we have information in our data that is related to being in Chicago and might also have an effect on enjoyment of coding. Regression analysis will allow us to examine the relationship between living in Chicago and enjoyment of coding while controlling for membership in ChiPy. Use the statsmodels package to regress coding_enjoyment on chicago and ChiPy. See this example for assistance.

ChiPy Check-in

This regression model still has limitations, and there could be an entire project night on this task alone. What steps would need to be taken if we controlled for more characteristics of our data?

This is also a good time to discuss what kind of information we are looking for in our regression model. What are coefficients and what do they mean? What is a p-value? Is it similar to a p-value from the statistical tests above?

Lastly, modeling is fun, but don't forget the original question! Do chipmunks that live in Chicago enjoy coding more than those that don't?

In [ ]:

import statsmodels.api as sm
import statsmodels.formula.api as smf

In [ ]:

# Regression model and summary
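
A hedged sketch of the regression using the statsmodels formula API, with coding_enjoyment as the outcome and chicago and ChiPy as predictors. The eight toy rows are made up and perfectly balanced, so the coefficients are easy to check by hand; the real data will not be this tidy.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy stand-in: two chipmunks for each chicago/ChiPy combination (values are hypothetical).
toy_df = pd.DataFrame({
    "chicago": [1, 1, 1, 1, 0, 0, 0, 0],
    "ChiPy": [1, 1, 0, 0, 1, 1, 0, 0],
    "coding_enjoyment": [9.0, 8.5, 7.0, 7.5, 7.5, 8.0, 5.5, 6.0],
})

# Outcome on the left of "~", predictors on the right.
model = smf.ols("coding_enjoyment ~ chicago + ChiPy", data=toy_df).fit()

# Each coefficient is the estimated change in coding_enjoyment for a
# one-unit change in that predictor, holding the other predictor fixed.
print(model.params)
print(model.pvalues)
# model.summary() gives the full regression table in a notebook.
```

The chicago coefficient is the quantity that speaks to the stakeholder question: the estimated difference in enjoyment between Chicago and non-Chicago chipmunks with the same ChiPy membership.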

Question #3¶

ChiPy leadership wants to send 20 lucky ChiPy chipmunks to cheer the lovely folks at PyCon. However, it's unlikely that the data recovery efforts will be able to recover who is/isn't a member of ChiPy! ChiPy leadership has asked us to develop a predictive model to identify members as part of the process to allocate the 20 free tickets. To do this we will:

Make a train/test split to evaluate our model
Scale our data
Fit several models
Decide on an evaluation metric
Select the best model

ChiPy Check-in

The cell below transforms our data so that every feature (jargon for column) is numeric. Discuss with your team why this could be an important step. Engineering features could also be an entire project night!

In [ ]:

wide_data = pd.get_dummies(df, drop_first=True)

In [ ]:

wide_data.head()

In [ ]:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier

In [ ]:

X_train, X_test, y_train, y_test = train_test_split(wide_data.drop('ChiPy', axis=1), 
                                                    wide_data.ChiPy, 
                                                    test_size=0.33, 
                                                    random_state=42)

In [ ]:

### Scale data

In [ ]:

### Train models
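
A minimal sketch of the scale-and-train steps on synthetic data, assuming StandardScaler as the scaler (the notebook does not name one). The key idea is to fit the scaler on the training split only and reuse it on the test split, so no information from the test set leaks into training. The toy_X and toy_y arrays are randomly generated stand-ins for the features and the ChiPy label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 200 chipmunks, 3 numeric features, binary label.
rng = np.random.default_rng(42)
toy_X = rng.normal(size=(200, 3))
toy_y = (toy_X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.33, random_state=42
)

# Fit the scaler on training data only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    "logistic_regression": LogisticRegression(),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    print(name, model.score(X_test_scaled, y_test))
```

Scaling matters most for distance-based models such as k-nearest neighbors, where a feature with a large range would otherwise dominate the distance.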

ChiPy Check-in

Choosing the proper evaluation metric is one of the most important steps in predictive modeling. Below we have imported accuracy, precision, and recall. What are each of these metrics and when should they be used? Given that we want to give 20 PyCon tickets to only ChiPy chipmunks, which metric is most appropriate here? Black box evaluation methods like classification_report will not be helpful here given the constraint of only having 20 tickets.

In [ ]:

from sklearn.metrics import precision_score, recall_score, accuracy_score, confusion_matrix

In [ ]:

### Get predictions...

In [ ]:

### Evaluate models, optimizing your predictions for 20 chipmunks!
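
One way to respect the 20-ticket constraint, offered as a sketch under our own assumptions: rank chipmunks by predicted probability of membership, keep only the top k, and measure what share of those picks are real members (precision among the top k). The precision_at_k helper is hypothetical, not part of scikit-learn, and the labels and scores below are made up with k=3 so the result can be checked by hand.

```python
import numpy as np


def precision_at_k(y_true, scores, k=20):
    """Share of true members among the k highest-scoring chipmunks."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    # Indices of the k largest scores, highest first.
    top_k = np.argsort(scores)[::-1][:k]
    return y_true[top_k].mean()


# Hypothetical true labels and predicted membership probabilities.
toy_y_true = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

# The top 3 scores belong to chipmunks with labels 1, 0, 1.
result = precision_at_k(toy_y_true, toy_scores, k=3)
print(result)
```

In the notebook, the scores could come from something like model.predict_proba(X_test)[:, 1] for a fitted classifier, with k=20 to match the number of tickets.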
