Project Night Purpose¶
Many people assume data scientists spend all day visualizing data and making impressive predictive models. While there is some truth to that, the most effective and productive data scientists spend a lot of their time communicating. They communicate their model results - as well as the assumptions and limitations behind their models and analysis - in a way that is digestible to their stakeholders and colleagues.
Tonight’s project is aimed at that aspect of communication. You will be asked to make assumptions as a team - particularly as they pertain to this problem and what the stakeholders need. There are no single correct assumptions or answers for this project night. There may be assumptions and answers that clearly lack evidence to support them, but do not get bogged down in finding the “right” answer.
Most importantly - have fun. While this project night covers serious concepts, it is ridiculously silly and meant to be taken with a bit of lighthearted exploration and plenty of opportunities to make mistakes.
Oh, no! We've had a data crash.¶
As ChiPy leadership was preparing for PyCon at the end of this month, they found that the dataset on our infamous ChiPy chipmunks has disappeared. While they transition from Oracle to Postgres, the leadership team has enlisted your help as data scientists to analyze some salvaged chipmunk data. The PyCon organizers had a few questions about coding in Chicago, ChiPy, and chipmunks that need answers. We will get to those questions shortly, but first let's get to the data.
Reading in the Data¶
Now is a good time to check in with the team. Is anyone familiar with numpy or pandas? Discuss with your team what these libraries are and what they allow data scientists to do, then decide which pandas function will read in our data.
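As a sketch of what that might look like: pd.read_csv is the usual choice for comma-separated data. The real filename is unknown here, so this example reads from an in-memory string instead; with the salvaged file you would pass its path (e.g. pd.read_csv("chipmunks.csv") - that filename is hypothetical).

```python
import io
import pandas as pd

# Read a small CSV from an in-memory string; with the real file you would
# pass its path instead (the "chipmunks.csv" name above is hypothetical).
sample = io.StringIO("ChiPy,chicago,coding_enjoyment\n1,1,9\n0,0,4\n")
df = pd.read_csv(sample)
print(df.shape)  # (2, 3)
```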
Setting up your environment¶
This project is contained in a jupyter notebook and assumes you have Python 3.+ installed on your machine. If this is your first project night, we recommend creating a folder for the project night repo:
mkdir chipy_projects && cd chipy_projects. If you already have the project night repository on your machine, go to that directory and pull from master.
If you are using Linux or OS X, run the following to create a new virtualenv:
python3 -m venv chipmunk
source chipmunk/bin/activate
On Windows, run the following:
python3 -m venv chipmunk
chipmunk\Scripts\activate
Getting the project¶
The project is in the ChiPy project night repo. If you do not have the repository already, run
git clone https://github.com/chicagopython/CodingWorkshops.git
Now we will:
- Go to the project directory.
- Install the packages we need into our environment:
pip install -r requirements.txt
- Run our jupyter notebook server for the project:
jupyter notebook
The dataset is in the
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Read in the data
Exploring the data¶
We need to be familiar with our data before we can answer questions about ChiPy and our chipmunks. Let's start with some questions we would ask of any dataset:
- How many rows are in this dataset? What does each row represent?
- What does the data look like? Check the first 5 rows
- Is there missing data? If so, how much is missing?
- What columns are categorical?
- How many unique values does each column have?
## Check the number of rows
## See first 5 rows of data
## Check for missing data
## Check for categorical data and unique number of values
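One way the checklist above could look in pandas, using a tiny stand-in frame (the column names other than ChiPy are hypothetical, not necessarily what the salvaged data contains):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the chipmunk data; "species" and "coding_enjoyment"
# are assumed column names for illustration.
df = pd.DataFrame({
    "species": ["eastern", "least", "eastern", None],
    "ChiPy": [1, 0, 1, 0],
    "coding_enjoyment": [9.0, 4.0, np.nan, 7.0],
})

print(len(df))            # number of rows
print(df.head())          # first rows of data
print(df.isnull().sum())  # missing values per column
print(df.nunique())       # unique values per column; low counts hint at categoricals
```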
Was there missing data?¶
We will keep exploring the data and start answering questions soon, but first let's address missing data (if there is any). What columns have missing data? What kind of data is missing?
This is a great point for discussion. If there is missing data - why might it be missing? Discuss some possible reasons with your team and decide on a reason that makes sense.
Imputation is the process of replacing missing data with some estimated value. The process can be as complicated (or simple) as you would like it to be! Given the possible reason for our missing data, what is an acceptable imputation?
Impute any missing data in your dataset and note what assumptions you made as a team. If you are not sure how to replace data in pandas, feel free to use Google like a proper data scientist.
# Replace any missing data here
# Check your data for missing values to see if it worked!
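A minimal sketch of one imputation strategy, on made-up data with assumed column names: median for a numeric column, most frequent value for a categorical one. This strategy is itself an assumption, not the only valid answer.

```python
import numpy as np
import pandas as pd

# Made-up frame with missing values; column names are assumptions.
df = pd.DataFrame({
    "coding_enjoyment": [9.0, np.nan, 4.0, np.nan],
    "species": ["eastern", None, "least", "eastern"],
})

# Fill the numeric column with its median and the categorical column
# with its most frequent value (mode).
df["coding_enjoyment"] = df["coding_enjoyment"].fillna(df["coding_enjoyment"].median())
df["species"] = df["species"].fillna(df["species"].mode()[0])

print(df.isnull().sum().sum())  # 0 -- nothing missing anymore
```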
The great folks at PyCon want to know all about ChiPy and our chipmunks. They have heard that ChiPy is an inclusive and open community. Can we support that claim with our data? Given that the ChiPy column takes a value of 1 for a ChiPy chipmunk and a value of 0 for chipmunks not in ChiPy, start to explore this question.
Some ideas to get you started:
- Are chipmunks of different species represented in ChiPy?
- Are chipmunks of different sizes represented in ChiPy?
- Are chipmunks of different careers represented in ChiPy?
- Are spotted and not spotted chipmunks represented in ChiPy?
There are no right or wrong answers here, only well supported or poorly supported ones! Discuss as a group the aspects of the data you have looked at and if it constitutes enough evidence to justify an answer.
## Exploration of species
## Exploration of size
## Exploration of careers
## Exploration of spotted vs non-spotted
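One way to start any of these comparisons is a crosstab of a categorical column against the ChiPy indicator. Here is a sketch with made-up data; the "species" column name is an assumption:

```python
import pandas as pd

# Made-up data; "species" is a hypothetical column, ChiPy is from the dataset.
df = pd.DataFrame({
    "species": ["eastern", "eastern", "least", "least", "siberian"],
    "ChiPy":   [1, 0, 1, 0, 1],
})

# Counts of each species outside (0) and inside (1) ChiPy;
# pass normalize="columns" to get proportions instead of raw counts.
ct = pd.crosstab(df["species"], df["ChiPy"])
print(ct)
```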
The word on the street at PyCon is that chipmunks that live in Chicago enjoy coding more than those that don't. Is this true? Given that the chicago column takes a value of 1 for chipmunks that live in Chicago and a value of 0 for chipmunks that do not, explore this question.
- Visualize the distributions of coding_enjoyment for chipmunks that do and do not live in Chicago.
- Come up with a way to test our question.
Coming up with a proper way to test stakeholder questions can be an art form as well as a science. We have imported a few statistical tests below that may (or may not) be appropriate for our question. First consider a way to frame our question as something to disprove (for those familiar with the jargon, let's construct a null hypothesis) - then conduct a test that may disprove it. Reading the documentation for the imported tests below may prove to be very helpful!
from scipy.stats import ttest_ind, levene, chisquare
## Beautiful plot
## Statistical Test
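As a sketch of how ttest_ind could be applied once coding_enjoyment is split into the two groups - the data below is simulated, not the salvaged dataset, and a t-test is only one of the imported options:

```python
import numpy as np
from scipy.stats import ttest_ind

# Simulated enjoyment scores for the two groups (not the real data).
rng = np.random.default_rng(0)
chicago = rng.normal(8, 1, 50)
elsewhere = rng.normal(7, 1, 50)

# Null hypothesis: both groups have the same mean coding_enjoyment.
stat, p = ttest_ind(chicago, elsewhere)
print(stat, p)  # a small p-value is evidence against the null
```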
Question #2, Continued¶
We have now compared two groups of chipmunks - those that live in Chicago and those that do not - and have either rejected or failed to reject a null hypothesis. What values did the statistical test return and what do they mean? Can we be confident in our results? How confident?
Regardless of our test results, what are the limitations of the test? One limitation is that we have information in our data that is related to being in Chicago and might also have an effect on enjoyment of coding. Regression analysis will allow us to examine the relationship between living in Chicago and enjoyment of coding while controlling for membership in ChiPy. Use the statsmodels package to regress coding_enjoyment on chicago and ChiPy. See this example for assistance.
This regression model still has limitations, and there could be an entire project night on this task alone. What steps would need to be taken if we controlled for more characteristics of our data?
This is also a good time to discuss what kind of information we are looking for in our regression model. What are coefficients and what do they mean? What is a p-value? Is it similar to a p-value from the statistical tests above?
Lastly, modeling is fun, but don't forget the original question! Do chipmunks that live in Chicago enjoy coding more than those that don't?
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Regression model and summary
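A sketch of the regression step with statsmodels' formula interface, on simulated data with the columns named in the text (the true coefficients here are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: enjoyment depends on both indicators, plus noise.
rng = np.random.default_rng(1)
n = 100
df = pd.DataFrame({
    "chicago": rng.integers(0, 2, n),
    "ChiPy": rng.integers(0, 2, n),
})
df["coding_enjoyment"] = 5 + 1.5 * df["chicago"] + 0.5 * df["ChiPy"] + rng.normal(0, 1, n)

# OLS via the formula interface: the chicago coefficient is the estimated
# effect of living in Chicago while holding ChiPy membership fixed.
model = smf.ols("coding_enjoyment ~ chicago + ChiPy", data=df).fit()
print(model.summary())
```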
ChiPy leadership wants to send 20 lucky ChiPy chipmunks to cheer the lovely folks at PyCon. However, it's unlikely that the data recovery efforts will be able to recover who is/isn't a member of ChiPy! ChiPy leadership has asked us to develop a predictive model to identify members as part of the process to allocate the 20 free tickets. To do this we will:
- Make a train/test split to evaluate our model
- Scale our data
- Fit several models
- Decide on an evaluation metric
- Select the best model
The cell below transforms our data so that every feature (jargon for column) is numeric. Discuss with your team why this could be an important step. Engineering features could also be an entire project night!
wide_data = pd.get_dummies(df, drop_first=True)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, y_train, y_test = train_test_split(wide_data.drop('ChiPy', axis=1), wide_data.ChiPy, test_size=0.33, random_state=42)
### Scale data
### Train models
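A sketch of the scale-then-fit steps on synthetic data. StandardScaler is one common choice (not the only one), and only logistic regression is fit here; the data below is made up, not the chipmunk dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and the ChiPy label.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Fit the scaler on the training set only, then apply the same transform to
# the test set -- fitting on all the data would leak test information.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

model = LogisticRegression().fit(X_train_s, y_train)
print(model.score(X_test_s, y_test))
```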
Choosing the proper evaluation metric is one of the most important steps in predictive modeling. Below we have imported accuracy, precision, and recall. What are each of these metrics and when should they be used? Given that we want to give 20 PyCon tickets only to ChiPy chipmunks, which metric is most appropriate here? Black-box evaluation methods like classification_report will not be helpful here given the constraint of only having 20 tickets.
from sklearn.metrics import precision_score, recall_score, accuracy_score, confusion_matrix
### Get predictions...
### Evaluate models, optimizing your predictions for 20 chipmunks!
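One way to respect the 20-ticket constraint is to rank chipmunks by predicted probability of ChiPy membership, hand tickets to the 20 most likely, and score precision on just those picks. A sketch on synthetic data (this framing is one reasonable reading of the constraint, not the only one):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score

# Synthetic stand-in: 100 chipmunks, binary ChiPy label.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)

# Rank every chipmunk by predicted probability of ChiPy membership and
# mark only the 20 most likely as positive predictions.
proba = model.predict_proba(X)[:, 1]
top20 = np.argsort(proba)[-20:]
preds = np.zeros(len(y), dtype=int)
preds[top20] = 1

# Precision on the 20 picks: of the chipmunks we'd send, how many are ChiPy?
print(precision_score(y, preds))
```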