This is a guide to many pandas tutorials, geared mainly for new users.
Internal Guides: pandas' own 10 Minutes to pandas. More complex recipes are in the Cookbook. A handy pandas cheat sheet.
Community Guides: pandas Cookbook by Julia Evans.
Quick and Dirty Pandas Cheat Sheet
I don't regularly write new scripts in pandas. This repo is to help me when I need to tweak a script I haven't changed in a while.
Table of Contents¶
- unique - find unique rows
- Working with time series data
unique - find unique rows¶
Find unique rows in dataset
| | Company | Person | Sales |
---|---|---|---|
0 | GOOG | Sam | 200 |
1 | GOOG | Charlie | 120 |
2 | MSFT | Amy | 340 |
3 | MSFT | Vanessa | 124 |
4 | FB | Carl | 243 |
5 | FB | Sarah | 350 |
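The code cell for this step is not shown, so here is a minimal sketch of how it might look. It assumes the table above lives in a DataFrame named `df` (the name is an assumption) and uses `Series.unique()` for the distinct values of one column and `DataFrame.drop_duplicates()` for distinct whole rows.

```python
import pandas as pd

# Rebuild the sample table shown above (the variable name df is an assumption).
df = pd.DataFrame({
    'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
    'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
    'Sales': [200, 120, 340, 124, 243, 350],
})

# Distinct values of a single column, in order of first appearance.
unique_companies = df['Company'].unique()
print(list(unique_companies))

# Distinct whole rows; every row here is already unique, so nothing is dropped.
unique_rows = df.drop_duplicates()
print(len(unique_rows))
```

Note that `unique()` is a Series method, so it is called on one column at a time.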
nunique - find number of unique rows¶
More concise than finding the unique array and then taking its length. By default, missing values are not counted.
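A small sketch of the comparison described above, again assuming the sample table is held in a DataFrame named `df`:

```python
import pandas as pd

# Sample data from the table above (df is an assumed name).
df = pd.DataFrame({
    'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
    'Sales': [200, 120, 340, 124, 243, 350],
})

# Count the distinct companies directly.
n_companies = df['Company'].nunique()

# The longer route: build the unique array, then take its length.
n_companies_long = len(df['Company'].unique())

print(n_companies, n_companies_long)
```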
value_counts - find unique values and number of occurrences¶
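No example survives for this heading, so the following is only a sketch of how `value_counts()` could be used on the sample data (the name `df` is an assumption):

```python
import pandas as pd

# Sample data from the earlier table (df is an assumed name).
df = pd.DataFrame({
    'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
    'Sales': [200, 120, 340, 124, 243, 350],
})

# Each distinct company becomes an index label; the value is its row count.
company_counts = df['Company'].value_counts()
print(company_counts)
```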
apply - batch process column values¶
Calling apply() is similar to calling map() in Python: it applies an operation to every record of a selected column. For instance, to find the squared sales, do the following:
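A minimal sketch of the squared-sales step, assuming the DataFrame is named `df` and the new column is called `sq_sales` as in the table that follows:

```python
import pandas as pd

# Sample data from the earlier table (df is an assumed name).
df = pd.DataFrame({
    'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
    'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
    'Sales': [200, 120, 340, 124, 243, 350],
})

# Square every value in the Sales column and store it as a new column.
df['sq_sales'] = df['Sales'].apply(lambda x: x ** 2)
print(df)
```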
| | Company | Person | Sales | sq_sales |
---|---|---|---|---|
0 | GOOG | Sam | 200 | 40000 |
1 | GOOG | Charlie | 120 | 14400 |
2 | MSFT | Amy | 340 | 115600 |
3 | MSFT | Vanessa | 124 | 15376 |
4 | FB | Carl | 243 | 59049 |
5 | FB | Sarah | 350 | 122500 |
We can also define a function and call it within the apply() method. The function can accept values of one or more columns to calculate a new column.
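A sketch of both variants, under the assumption that the cubed column in the table below came from a named function. The helpers `cube` and `describe_row` and the column `label` are hypothetical names used only for illustration.

```python
import pandas as pd

# Sample data from the earlier table (df is an assumed name).
df = pd.DataFrame({
    'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
    'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
    'Sales': [200, 120, 340, 124, 243, 350],
})

def cube(value):
    """Return the cube of a single cell value (hypothetical helper)."""
    return value ** 3

# One input column: apply() calls cube once per value in Sales.
df['cu_sales'] = df['Sales'].apply(cube)

def describe_row(row):
    """Combine two columns of one row into a label (hypothetical helper)."""
    return f"{row['Person']} ({row['Company']})"

# Several input columns: axis=1 passes each row to the function as a Series.
df['label'] = df.apply(describe_row, axis=1)
print(df)
```

Row-wise `apply` with `axis=1` is convenient but slow on large frames, so it is best kept for logic that cannot be written as plain column arithmetic.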
| | Company | Person | Sales | sq_sales | cu_sales |
---|---|---|---|---|---|
0 | GOOG | Sam | 200 | 40000 | 8000000 |
1 | GOOG | Charlie | 120 | 14400 | 1728000 |
2 | MSFT | Amy | 340 | 115600 | 39304000 |
3 | MSFT | Vanessa | 124 | 15376 | 1906624 |
4 | FB | Carl | 243 | 59049 | 14348907 |
5 | FB | Sarah | 350 | 122500 | 42875000 |
The rows can also be sorted by a column, here by Sales in ascending order:
| | Company | Person | Sales | sq_sales | cu_sales |
---|---|---|---|---|---|
1 | GOOG | Charlie | 120 | 14400 | 1728000 |
3 | MSFT | Vanessa | 124 | 15376 | 1906624 |
0 | GOOG | Sam | 200 | 40000 | 8000000 |
4 | FB | Carl | 243 | 59049 | 14348907 |
2 | MSFT | Amy | 340 | 115600 | 39304000 |
5 | FB | Sarah | 350 | 122500 | 42875000 |
Note how the index remains attached to the original rows. Ordering by Company and, within each company, by Sales gives:
| | Company | Person | Sales | sq_sales | cu_sales |
---|---|---|---|---|---|
4 | FB | Carl | 243 | 59049 | 14348907 |
5 | FB | Sarah | 350 | 122500 | 42875000 |
1 | GOOG | Charlie | 120 | 14400 | 1728000 |
0 | GOOG | Sam | 200 | 40000 | 8000000 |
3 | MSFT | Vanessa | 124 | 15376 | 1906624 |
2 | MSFT | Amy | 340 | 115600 | 39304000 |
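The two orderings shown above can be reproduced with `sort_values()`. This is a sketch under the assumption that the second table was sorted on Company and then Sales; the original call is not shown.

```python
import pandas as pd

# Sample data from the earlier table (df is an assumed name).
df = pd.DataFrame({
    'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
    'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
    'Sales': [200, 120, 340, 124, 243, 350],
})

# Sort by a single column; the original index labels travel with their rows.
by_sales = df.sort_values('Sales')

# Sort by several columns: Company first, then Sales within each company.
by_company_then_sales = df.sort_values(['Company', 'Sales'])

print(by_sales)
print(by_company_then_sales)
```

If a fresh 0..n-1 index is wanted after sorting, `reset_index(drop=True)` can be chained onto the result.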
isnull - finding null values throughout the DataFrame¶
| | Company | Person | Sales | sq_sales | cu_sales |
---|---|---|---|---|---|
0 | False | False | False | False | False |
1 | False | False | False | False | False |
2 | False | False | False | False | False |
3 | False | False | False | False | False |
4 | False | False | False | False | False |
5 | False | False | False | False | False |
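A sketch of how the boolean mask above might be produced and summarised, assuming the DataFrame is named `df`:

```python
import pandas as pd

# Sample data from the earlier table (df is an assumed name).
df = pd.DataFrame({
    'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
    'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
    'Sales': [200, 120, 340, 124, 243, 350],
})

# Same shape as df: True wherever a cell is missing, False otherwise.
null_mask = df.isnull()

# Number of missing cells per column, handy on larger frames.
nulls_per_column = null_mask.sum()

print(null_mask)
print(nulls_per_column)
```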
Working with time series data¶
This section explains how to specify the data types of columns while reading data, and how to define column converters for values that need custom parsing.
| | Unnamed: 0 | Registration Date | Country | Organization | Current customer? | What would you like to learn? |
---|---|---|---|---|---|---|
0 | 0 | 11/08/2019 06:09 PM EST | Jamaica | The University of the West Indies | NaN | NaN |
1 | 1 | 11/08/2019 06:09 PM EST | Japan | iLand6 Co.,Ltd. | no | I am interested ArcGIS. |
2 | 2 | 11/08/2019 05:56 PM EST | Canada | Safe Software Inc | yes | data science workflos |
3 | 3 | 11/08/2019 05:51 PM EST | Canada | Le Groupe GeoInfo Inc | yes | general information |
4 | 4 | 11/08/2019 05:26 PM EST | Canada | Safe Software Inc. | NaN | NaN |
The Registration Date should be of type datetime and the Current customer? should be of type bool. However, are they?
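One way to check is to look at `df.dtypes` after a plain read. The sketch below uses a small in-memory CSV in place of the original file, whose name is not given; the `Unnamed: 0` column is left out for brevity.

```python
import io
import pandas as pd

# Stand-in for the original file: three rows copied from the table above.
csv_text = '''Registration Date,Country,Organization,Current customer?,What would you like to learn?
11/08/2019 06:09 PM EST,Jamaica,The University of the West Indies,,
11/08/2019 06:09 PM EST,Japan,"iLand6 Co.,Ltd.",no,I am interested ArcGIS.
11/08/2019 05:56 PM EST,Canada,Safe Software Inc,yes,data science workflos
'''

# Read with no type hints, letting pandas infer everything.
df = pd.read_csv(io.StringIO(csv_text))

# Text columns come back as generic object (or string) dtype, not datetime or bool.
print(df.dtypes)
```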
Everything is a generic object. Let us re-read, this time knowing what their data types should be.
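A sketch of the re-read, not necessarily the original code. The converters `strip_timezone` and `yes_to_bool` are hypothetical helpers, the `category` dtype for Country is just an illustrative choice, and dropping the `EST` suffix assumes all timestamps share one time zone.

```python
import io
import pandas as pd

# Stand-in for the original file: three rows copied from the table above.
csv_text = '''Registration Date,Country,Organization,Current customer?,What would you like to learn?
11/08/2019 06:09 PM EST,Jamaica,The University of the West Indies,,
11/08/2019 06:09 PM EST,Japan,"iLand6 Co.,Ltd.",no,I am interested ArcGIS.
11/08/2019 05:56 PM EST,Canada,Safe Software Inc,yes,data science workflos
'''

def strip_timezone(text):
    """Drop the trailing time zone abbreviation, e.g. ' EST' (hypothetical helper)."""
    return text.rsplit(' ', 1)[0]

def yes_to_bool(text):
    """Map 'yes' to True; anything else, including blanks, to False (hypothetical helper)."""
    return text.strip().lower() == 'yes'

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'Country': 'category'},  # plain dtype hint for a simple column
    converters={
        'Registration Date': strip_timezone,  # converters receive the raw string
        'Current customer?': yes_to_bool,
    },
)

# Parse the cleaned strings into real timestamps with an explicit format.
df['Registration Date'] = pd.to_datetime(
    df['Registration Date'], format='%m/%d/%Y %I:%M %p'
)
print(df.dtypes)
```

Converters run on the raw text of each cell, so a blank field arrives as an empty string rather than NaN; that is why blanks end up as False here.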
Plotting time series¶
Now that Registration Date is a datetime column, we can plot the number of registrants over time. But before that, we need to set it as the index.
Registration Date | Country | Organization | Current customer? | What would you like to learn? |
---|---|---|---|---|
2019-11-08 18:09:00 | Jamaica | The University of the West Indies | False | NaN |
2019-11-08 18:09:00 | Japan | iLand6 Co.,Ltd. | False | I am interested ArcGIS. |
2019-11-08 17:56:00 | Canada | Safe Software Inc | True | data science workflos |
2019-11-08 17:51:00 | Canada | Le Groupe GeoInfo Inc | True | general information |
2019-11-08 17:26:00 | Canada | Safe Software Inc. | False | NaN |
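A sketch of setting the index and counting registrants, with timestamps copied from the table above. Grouping by calendar day is an assumption, since the original plot's granularity is not shown, and the plotting call is left as a comment because it needs matplotlib.

```python
import pandas as pd

# Timestamps and countries copied from the table above (df is an assumed name).
df = pd.DataFrame({
    'Registration Date': pd.to_datetime([
        '2019-11-08 18:09', '2019-11-08 18:09', '2019-11-08 17:56',
        '2019-11-08 17:51', '2019-11-08 17:26',
    ]),
    'Country': ['Jamaica', 'Japan', 'Canada', 'Canada', 'Canada'],
})

# Move the datetime column into the index so pandas treats rows as a time series.
df = df.set_index('Registration Date')

# Count registrants per calendar day.
registrants_per_day = df.groupby(df.index.date).size()
print(registrants_per_day)

# With matplotlib installed, registrants_per_day.plot() would draw the series.
```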
Add a counter column to the dataframe¶
Note: It is important to count up only after sorting. Otherwise the numbers are going to be all over the place.
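A sketch of the sort-then-count order described in the note. The column name `counter` is an assumption, and the timestamps are copied from the earlier table.

```python
import pandas as pd

# Registrations indexed by time, deliberately not in chronological order.
df = pd.DataFrame(
    {'Country': ['Jamaica', 'Japan', 'Canada', 'Canada', 'Canada']},
    index=pd.to_datetime([
        '2019-11-08 18:09', '2019-11-08 18:09', '2019-11-08 17:56',
        '2019-11-08 17:51', '2019-11-08 17:26',
    ]),
)

# Sort first, so the counter follows chronological order.
df = df.sort_index()

# Then number the rows 1..n; this gives a running total of registrants.
df['counter'] = range(1, len(df) + 1)
print(df)
```

With the index sorted, plotting `df['counter']` against the index would show cumulative registrations over time.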