Stats Python in a Hurry Part 1: Data Wrangling

Transferring my R data-munging knowledge back to my first language.
Author

Thadryan

Published

December 11, 2022

Update (12-17-22): added sorting and table.

Data Manipulation

import pandas as pd
import numpy as np
import statsmodels.api as sm
# get a practice dataset
df = sm.datasets.get_rdataset("mtcars", "datasets", cache = True).data

df.head()
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2

Basics

How to get basic information about the df.

Dimensions of a df

This is like dim in R.

# rows by columns 
df.shape
(32, 11)

Number of rows

This is like nrow in R.

# just the rows
len(df)
32

Number of columns

This is like ncol in R.

# just the columns
len(df.columns)
11

You could also just index the results of df.shape.
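For example, since shape is a (rows, columns) tuple, you can index it directly (a quick sketch on a small stand-in frame):

```python
import pandas as pd

# a small stand-in frame
df = pd.DataFrame({"mpg": [21.0, 22.8, 21.4], "cyl": [6, 4, 6]})

# shape is a (rows, columns) tuple, so index it directly
n_rows = df.shape[0]   # like nrow in R
n_cols = df.shape[1]   # like ncol in R
print(n_rows, n_cols)
```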

Data types

# a summary of types by column
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 32 entries, Mazda RX4 to Volvo 142E
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   mpg     32 non-null     float64
 1   cyl     32 non-null     int64  
 2   disp    32 non-null     float64
 3   hp      32 non-null     int64  
 4   drat    32 non-null     float64
 5   wt      32 non-null     float64
 6   qsec    32 non-null     float64
 7   vs      32 non-null     int64  
 8   am      32 non-null     int64  
 9   gear    32 non-null     int64  
 10  carb    32 non-null     int64  
dtypes: float64(5), int64(6)
memory usage: 3.0+ KB

Summary

# summary of distributional information
df.describe()
mpg cyl disp hp drat wt qsec vs am gear carb
count 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000 32.0000
mean 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750 0.437500 0.406250 3.687500 2.8125
std 6.026948 1.785922 123.938694 68.562868 0.534679 0.978457 1.786943 0.504016 0.498991 0.737804 1.6152
min 10.400000 4.000000 71.100000 52.000000 2.760000 1.513000 14.500000 0.000000 0.000000 3.000000 1.0000
25% 15.425000 4.000000 120.825000 96.500000 3.080000 2.581250 16.892500 0.000000 0.000000 3.000000 2.0000
50% 19.200000 6.000000 196.300000 123.000000 3.695000 3.325000 17.710000 0.000000 0.000000 4.000000 2.0000
75% 22.800000 8.000000 326.000000 180.000000 3.920000 3.610000 18.900000 1.000000 1.000000 4.000000 4.0000
max 33.900000 8.000000 472.000000 335.000000 4.930000 5.424000 22.900000 1.000000 1.000000 5.000000 8.0000

Selections

How to specify which column you want.

Select a column

Note that I’m using head() to get the first few rows to keep the output small. It’s not part of the selection of columns.

# access a single column like an object property
df.mpg.head()
Mazda RX4            21.0
Mazda RX4 Wag        21.0
Datsun 710           22.8
Hornet 4 Drive       21.4
Hornet Sportabout    18.7
Name: mpg, dtype: float64

Or…

# select a column (as a series)
df["mpg"].head()
Mazda RX4            21.0
Mazda RX4 Wag        21.0
Datsun 710           22.8
Hornet 4 Drive       21.4
Hornet Sportabout    18.7
Name: mpg, dtype: float64
# select column (as a dataframe)
df[["mpg"]].head()
mpg
Mazda RX4 21.0
Mazda RX4 Wag 21.0
Datsun 710 22.8
Hornet 4 Drive 21.4
Hornet Sportabout 18.7

Select several columns by string name

# select several columns by name
df[["mpg", "wt"]].head()
mpg wt
Mazda RX4 21.0 2.620
Mazda RX4 Wag 21.0 2.875
Datsun 710 22.8 2.320
Hornet 4 Drive 21.4 3.215
Hornet Sportabout 18.7 3.440

Select from one column through another with names

# select a range by name of start and end: "mpg" through "hp"
df.loc[:, "mpg":"hp"].head()
mpg cyl disp hp
Mazda RX4 21.0 6 160.0 110
Mazda RX4 Wag 21.0 6 160.0 110
Datsun 710 22.8 4 108.0 93
Hornet 4 Drive 21.4 6 258.0 110
Hornet Sportabout 18.7 8 360.0 175

Select arbitrarily

How to select based on a condition of your choosing. We might want to get only the columns that start with “d”, for example.

# get a list
starts_with_d = [i for i in list(df.columns) if i.startswith("d")]

# pass the list to pandas
df[starts_with_d].head()
disp drat
Mazda RX4 160.0 3.90
Mazda RX4 Wag 160.0 3.90
Datsun 710 108.0 3.85
Hornet 4 Drive 258.0 3.08
Hornet Sportabout 360.0 3.15

Filtering

Operations where we’d use dplyr in R. In Pandas, filtering refers to operations by index, so what I’m thinking of is more like “querying” in Pandas terms.

Single condition

Pass logic in strings to the query method.

df.query("cyl == 6")
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6

Multiple conditions

df.query("cyl == 6 & hp > 105")
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
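The same queries can also be written with boolean masks, the more traditional pandas idiom. A sketch on a small stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({
    "cyl": [6, 6, 4, 8, 6],
    "hp":  [110, 100, 93, 175, 123],
})

# single condition: index with a boolean Series
six_cyl = df[df["cyl"] == 6]

# multiple conditions: use & / | and parenthesize each comparison
six_cyl_hp = df[(df["cyl"] == 6) & (df["hp"] > 105)]
print(six_cyl_hp)
```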

Sorting

Ascending

df.mpg.sort_values().head(10)
Lincoln Continental    10.4
Cadillac Fleetwood     10.4
Camaro Z28             13.3
Duster 360             14.3
Chrysler Imperial      14.7
Maserati Bora          15.0
Merc 450SLC            15.2
AMC Javelin            15.2
Dodge Challenger       15.5
Ford Pantera L         15.8
Name: mpg, dtype: float64

Descending

There is no ‘descending’ method per se; just set ascending to False.

df.mpg.sort_values(ascending = False).head(10)
Toyota Corolla    33.9
Fiat 128          32.4
Lotus Europa      30.4
Honda Civic       30.4
Fiat X1-9         27.3
Porsche 914-2     26.0
Merc 240D         24.4
Datsun 710        22.8
Merc 230          22.8
Toyota Corona     21.5
Name: mpg, dtype: float64
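Sorting a whole DataFrame by one of its columns works the same way via the by argument (sketched here on a small stand-in frame):

```python
import pandas as pd

df = pd.DataFrame(
    {"mpg": [21.0, 10.4, 33.9], "wt": [2.620, 5.250, 1.835]},
    index=["Mazda RX4", "Cadillac Fleetwood", "Toyota Corolla"],
)

# sort all rows by one column, heaviest car first
by_weight = df.sort_values(by="wt", ascending=False)
print(by_weight)
```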

Applying functions

Map a function with no extra arguments to a single column

# some function to apply
def increment(x):
    return x+1

# could also be done as a lambda expression
df["ApplyResult"] = df.mpg.map(increment)

df[["mpg", "ApplyResult"]].head()
mpg ApplyResult
Mazda RX4 21.0 22.0
Mazda RX4 Wag 21.0 22.0
Datsun 710 22.8 23.8
Hornet 4 Drive 21.4 22.4
Hornet Sportabout 18.7 19.7
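As noted in the comment above, the same thing works as a lambda expression (a sketch on a stand-in Series):

```python
import pandas as pd

mpg = pd.Series([21.0, 22.5, 18.5])

# the named function and the inline lambda are equivalent here
incremented = mpg.map(lambda x: x + 1)
print(incremented)
```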

Apply a function with arguments using kwargs

# some function that takes an extra argument
def incrementBy(x, by):
    return x + by

# could also be done as a lambda expression
df["ApplyResult"] = df.mpg.apply(incrementBy, by = 5)

df[["mpg", "ApplyResult"]].head()
mpg ApplyResult
Mazda RX4 21.0 26.0
Mazda RX4 Wag 21.0 26.0
Datsun 710 22.8 27.8
Hornet 4 Drive 21.4 26.4
Hornet Sportabout 18.7 23.7

Missing values

Expanding on an example from here. Pandas uses np.nan instead of NA.

df2 = pd.DataFrame({
    "x": [1,2,3,4,5,np.nan,7,8,np.nan,10,11,12,np.nan],
    "y": [1,np.nan,3,4,5,6,np.nan,8,9,10,11,12,13]
})
df2
x y
0 1.0 1.0
1 2.0 NaN
2 3.0 3.0
3 4.0 4.0
4 5.0 5.0
5 NaN 6.0
6 7.0 NaN
7 8.0 8.0
8 NaN 9.0
9 10.0 10.0
10 11.0 11.0
11 12.0 12.0
12 NaN 13.0

Missing in the whole df

# check for NAs across the whole thing
df2.isnull()
x y
0 False False
1 False True
2 False False
3 False False
4 False False
5 True False
6 False True
7 False False
8 True False
9 False False
10 False False
11 False False
12 True False

Missing in a single column

# just the x column
df2.x.isnull()
0     False
1     False
2     False
3     False
4     False
5      True
6     False
7     False
8      True
9     False
10    False
11    False
12     True
Name: x, dtype: bool
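Since True counts as 1, chaining .sum() onto isnull() gives a count of missing values per column, which is often what you actually want (a sketch on a stand-in frame):

```python
import pandas as pd
import numpy as np

df2 = pd.DataFrame({
    "x": [1, 2, np.nan, 4, np.nan],
    "y": [1, np.nan, 3, 4, 5],
})

# booleans sum as 0/1, so this counts NAs per column
na_counts = df2.isnull().sum()
print(na_counts)
```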

Replace NAs in a column

# use the fillna function 
df2.x = df2.x.fillna("MISSING")
df2
x y
0 1.0 1.0
1 2.0 NaN
2 3.0 3.0
3 4.0 4.0
4 5.0 5.0
5 MISSING 6.0
6 7.0 NaN
7 8.0 8.0
8 MISSING 9.0
9 10.0 10.0
10 11.0 11.0
11 12.0 12.0
12 MISSING 13.0

Replace the NAs in the full dataset

# fills the other column even though I didn't specify one
df2.fillna("MISSING!!!")
x y
0 1.0 1.0
1 2.0 MISSING!!!
2 3.0 3.0
3 4.0 4.0
4 5.0 5.0
5 MISSING 6.0
6 7.0 MISSING!!!
7 8.0 8.0
8 MISSING 9.0
9 10.0 10.0
10 11.0 11.0
11 12.0 12.0
12 MISSING 13.0

Drop NAs

The official docs are great on this.

# create another sample df
df3 = pd.DataFrame({
    "x": [1,np.nan,3,4,5,np.nan,np.nan,8,np.nan,10,11,12,np.nan],
    "y": [1,np.nan,3,4,5,6,np.nan,8,9,10,11,12,np.nan],
    "z": [1,2,np.nan,4,5,6,np.nan,8,9,10,11,12,np.nan]
})
df3
x y z
0 1.0 1.0 1.0
1 NaN NaN 2.0
2 3.0 3.0 NaN
3 4.0 4.0 4.0
4 5.0 5.0 5.0
5 NaN 6.0 6.0
6 NaN NaN NaN
7 8.0 8.0 8.0
8 NaN 9.0 9.0
9 10.0 10.0 10.0
10 11.0 11.0 11.0
11 12.0 12.0 12.0
12 NaN NaN NaN

Drop all across the df

# no NAs, period
df3.dropna()
x y z
0 1.0 1.0 1.0
3 4.0 4.0 4.0
4 5.0 5.0 5.0
7 8.0 8.0 8.0
9 10.0 10.0 10.0
10 11.0 11.0 11.0
11 12.0 12.0 12.0

Drop if they’re all NA in that row

# drop if the whole row is NA
df3.dropna(how = "all")
x y z
0 1.0 1.0 1.0
1 NaN NaN 2.0
2 3.0 3.0 NaN
3 4.0 4.0 4.0
4 5.0 5.0 5.0
5 NaN 6.0 6.0
7 8.0 8.0 8.0
8 NaN 9.0 9.0
9 10.0 10.0 10.0
10 11.0 11.0 11.0
11 12.0 12.0 12.0

Drop all in certain columns

This is so clutch! Much simpler than filtering across columns in R (though there might be a cleaner way to do that).

df3.dropna(subset = ["x", "y"])
x y z
0 1.0 1.0 1.0
2 3.0 3.0 NaN
3 4.0 4.0 4.0
4 5.0 5.0 5.0
7 8.0 8.0 8.0
9 10.0 10.0 10.0
10 11.0 11.0 11.0
11 12.0 12.0 12.0
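The docs also cover a thresh argument, which keeps rows that have at least that many non-NA values (a quick sketch on a stand-in frame):

```python
import pandas as pd
import numpy as np

df3 = pd.DataFrame({
    "x": [1.0, np.nan, np.nan],
    "y": [1.0, 2.0, np.nan],
    "z": [1.0, 2.0, np.nan],
})

# keep rows with at least 2 non-NA values
at_least_two = df3.dropna(thresh=2)
print(at_least_two)
```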

Group & Summarize

Group by a factor level

df[["mpg", "disp", "cyl"]].groupby(by = "cyl")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f2094b7cb10>

Group by a factor level and get the mean

df[["mpg", "disp", "cyl"]].groupby(by = "cyl").mean()
mpg disp
cyl
4 26.663636 105.136364
6 19.742857 183.314286
8 15.100000 353.100000
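Several summary statistics at once can be had with .agg, roughly like summarize with multiple functions in dplyr (a sketch on a stand-in frame):

```python
import pandas as pd

df = pd.DataFrame({
    "cyl": [4, 4, 6, 6, 8],
    "mpg": [26.0, 30.0, 20.0, 19.0, 15.0],
})

# one or more aggregations per group
summary = df.groupby("cyl")["mpg"].agg(["mean", "min", "max"])
print(summary)
```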

Melting a DataFrame

# use the melt function specifying ids and value vars
df_long = pd.melt(df, id_vars = "cyl", value_vars = ["mpg", "wt"])
df_long
cyl variable value
0 6 mpg 21.000
1 6 mpg 21.000
2 4 mpg 22.800
3 6 mpg 21.400
4 8 mpg 18.700
... ... ... ...
59 4 wt 1.513
60 8 wt 3.170
61 6 wt 2.770
62 8 wt 3.570
63 4 wt 2.780

64 rows × 3 columns

Misc

Counts of values (like table in R)

# shows how many there are of each factor
df.cyl.value_counts()
8    14
4    11
6     7
Name: cyl, dtype: int64

Formatting long chains

# a little ugly but it works
df[["mpg", "disp", "cyl"]].\
    query("cyl > 4").\
    groupby(by = "cyl").\
    mean()
mpg disp
cyl
6 19.742857 183.314286
8 15.100000 353.100000
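Wrapping the whole chain in parentheses avoids the trailing backslashes and is the style the pandas docs tend to use (sketched on a small stand-in frame):

```python
import pandas as pd

df = pd.DataFrame({
    "cyl":  [4, 6, 6, 8, 8],
    "mpg":  [26.0, 21.0, 19.0, 15.0, 14.0],
    "disp": [108.0, 160.0, 225.0, 360.0, 440.0],
})

# parentheses let the chain span lines without backslashes
result = (
    df[["mpg", "disp", "cyl"]]
    .query("cyl > 4")
    .groupby(by="cyl")
    .mean()
)
print(result)
```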

Iterating over a df

For mutating the dataframe, apply/map is recommended, but I’m showing how to iterate for completeness.

for i, j in df.iterrows():
    print(df["mpg"][i])
    # j would give you all the columns with just that row
21.0
21.0
22.8
21.4
18.7
18.1
14.3
24.4
22.8
19.2
17.8
16.4
17.3
15.2
10.4
10.4
14.7
32.4
30.4
33.9
21.5
15.5
15.2
13.3
19.2
27.3
26.0
30.4
15.8
19.7
15.0
21.4
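If you do need to loop, itertuples is generally faster than iterrows and gives attribute-style access to each row (a sketch on a stand-in frame):

```python
import pandas as pd

df = pd.DataFrame(
    {"mpg": [21.0, 22.8, 18.7]},
    index=["Mazda RX4", "Datsun 710", "Hornet Sportabout"],
)

# each row comes back as a named tuple; row.Index holds the row label
mpgs = []
for row in df.itertuples():
    mpgs.append(row.mpg)
print(mpgs)
```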

References:

https://www.statsmodels.org/stable/examples/notebooks/generated/glm.html
https://www.geeksforgeeks.org/linear-regression-in-python-using-statsmodels/
https://datagy.io/pandas-iterate-over-rows/
https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html