This tutorial was presented by Kevin Markham at PyCon on May 10, 2018.
The tutorial code is available as a Jupyter notebook. The notebook includes 4 additional exercises that were not covered during the tutorial.
The pandas library is a powerful tool for multiple phases of the data science workflow, including data cleaning, visualization, and exploratory data analysis. However, proper data science requires careful coding, and pandas will not stop you from creating misleading plots, drawing incorrect conclusions, ignoring relevant data, including misleading data, or executing incorrect calculations.
In this tutorial, you'll perform a variety of data science tasks on a handful of real-world datasets using pandas. With each task, you'll learn how to avoid either a pandas pitfall or a data science pitfall. By the end of the tutorial, you'll be more confident that you're using pandas for good rather than evil!
You will get the most out of this tutorial if you are an intermediate pandas user, since the tutorial will not cover pandas basics. If you are new to pandas or just need a refresher, I recommend watching some videos from my free pandas course. Alternatively, you can review all of the code from my pandas course in this Jupyter notebook.
Here are three options that will work equally well:
Move the CSV files into your working directory. (This is usually the directory where you create Python scripts or notebooks.)
Open the Python environment of your choice.
If you're using Jupyter notebook, run the following code:
If you're using any other Python environment, run the following code:
If you don't get any error messages, and a plot appears on your screen, then it's very likely that pandas and matplotlib are installed correctly.
Kevin Markham is the founder of Data School, an online school for learning data science with Python. He is passionate about teaching data science to people who are new to the field, regardless of their educational and professional backgrounds. Previously, Kevin was the lead data science instructor for General Assembly in Washington, DC. Currently, he teaches machine learning and data analysis to over 10,000 students each month through the Data School YouTube channel. He has a degree in Computer Engineering from Vanderbilt University and lives in Asheville, North Carolina with his wife and son.
Sure! You can email kevin@dataschool.io.