update: 12-17-22 - added sorting and table.

Data Manipulation

import pandas as pd
import numpy as np
import statsmodels.api as sm

# get a practice dataset
df = sm.datasets.get_rdataset("mtcars", "datasets", cache = True).data

df.head()

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360.0	175	3.15	3.440	17.02	0	0	3	2

Basics

How to get basic information about the df.

Dimensions of a df

This is like dim in R.

# rows by columns 
df.shape

(32, 11)

Number of rows

This is like nrow in R.

# just the rows
len(df)

Number of columns

This is like ncol in R.

# just the columns
len(df.columns)

You could also just index the results of df.shape.

Data types

# a summary of types by column
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32 entries, Mazda RX4 to Volvo 142E
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   mpg     32 non-null     float64
 1   cyl     32 non-null     int64  
 2   disp    32 non-null     float64
 3   hp      32 non-null     int64  
 4   drat    32 non-null     float64
 5   wt      32 non-null     float64
 6   qsec    32 non-null     float64
 7   vs      32 non-null     int64  
 8   am      32 non-null     int64  
 9   gear    32 non-null     int64  
 10  carb    32 non-null     int64  
dtypes: float64(5), int64(6)
memory usage: 3.0+ KB

Summary

# summary of distributional information
df.describe()

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
count	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.0000
mean	20.090625	6.187500	230.721875	146.687500	3.596563	3.217250	17.848750	0.437500	0.406250	3.687500	2.8125
std	6.026948	1.785922	123.938694	68.562868	0.534679	0.978457	1.786943	0.504016	0.498991	0.737804	1.6152
min	10.400000	4.000000	71.100000	52.000000	2.760000	1.513000	14.500000	0.000000	0.000000	3.000000	1.0000
25%	15.425000	4.000000	120.825000	96.500000	3.080000	2.581250	16.892500	0.000000	0.000000	3.000000	2.0000
50%	19.200000	6.000000	196.300000	123.000000	3.695000	3.325000	17.710000	0.000000	0.000000	4.000000	2.0000
75%	22.800000	8.000000	326.000000	180.000000	3.920000	3.610000	18.900000	1.000000	1.000000	4.000000	4.0000
max	33.900000	8.000000	472.000000	335.000000	4.930000	5.424000	22.900000	1.000000	1.000000	5.000000	8.0000

Selections

How to specify which column you want.

Select a column

Note that I’m using head() to get the first few rows to keep the output small. It’s not part of the selection of columns.

# access a single column like an object property
df.mpg.head()

Mazda RX4            21.0
Mazda RX4 Wag        21.0
Datsun 710           22.8
Hornet 4 Drive       21.4
Hornet Sportabout    18.7
Name: mpg, dtype: float64

Or…

# select a column (as a series)
df["mpg"].head()

Mazda RX4            21.0
Mazda RX4 Wag        21.0
Datsun 710           22.8
Hornet 4 Drive       21.4
Hornet Sportabout    18.7
Name: mpg, dtype: float64

# select column (as a dataframe)
df[["mpg"]].head()

	mpg
Mazda RX4	21.0
Mazda RX4 Wag	21.0
Datsun 710	22.8
Hornet 4 Drive	21.4
Hornet Sportabout	18.7

Select several columns by string name

# select a column by several names
df[["mpg", "wt"]].head()

	mpg	wt
Mazda RX4	21.0	2.620
Mazda RX4 Wag	21.0	2.875
Datsun 710	22.8	2.320
Hornet 4 Drive	21.4	3.215
Hornet Sportabout	18.7	3.440

Select from one column through another with names

# select a range by name of start and end "mph through hp"
df.loc[:, "mpg":"hp"].head()

	mpg	cyl	disp	hp
Mazda RX4	21.0	6	160.0	110
Mazda RX4 Wag	21.0	6	160.0	110
Datsun 710	22.8	4	108.0	93
Hornet 4 Drive	21.4	6	258.0	110
Hornet Sportabout	18.7	8	360.0	175

Select arbitrarily

How to get based on a condition of your choosing. We might want to get only columns that start with “d”, for example.

# get a list
starts_with_d = [i for i in list(df.columns) if i.startswith("d")]

# pass the list to pandas
df[starts_with_d].head()

	disp	drat
Mazda RX4	160.0	3.90
Mazda RX4 Wag	160.0	3.90
Datsun 710	108.0	3.85
Hornet 4 Drive	258.0	3.08
Hornet Sportabout	360.0	3.15

Filtering

Operations where we’d used dplyr in R. In Pandas, filtering refers to operations by index, so what I’m thinking of is more like “querying” in Pandas terms.

Single condition

Pass logic in strings to the query method.

df.query("cyl == 6")

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
Hornet 4 Drive	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1
Valiant	18.1	6	225.0	105	2.76	3.460	20.22	1	0	3	1
Merc 280	19.2	6	167.6	123	3.92	3.440	18.30	1	0	4	4
Merc 280C	17.8	6	167.6	123	3.92	3.440	18.90	1	0	4	4
Ferrari Dino	19.7	6	145.0	175	3.62	2.770	15.50	0	1	5	6

Multiple conditions

df.query("cyl == 6 & hp > 105")

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
Hornet 4 Drive	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1
Merc 280	19.2	6	167.6	123	3.92	3.440	18.30	1	0	4	4
Merc 280C	17.8	6	167.6	123	3.92	3.440	18.90	1	0	4	4
Ferrari Dino	19.7	6	145.0	175	3.62	2.770	15.50	0	1	5	6

Sorting

Ascending

df.mpg.sort_values().head(10)

Lincoln Continental    10.4
Cadillac Fleetwood     10.4
Camaro Z28             13.3
Duster 360             14.3
Chrysler Imperial      14.7
Maserati Bora          15.0
Merc 450SLC            15.2
AMC Javelin            15.2
Dodge Challenger       15.5
Ford Pantera L         15.8
Name: mpg, dtype: float64

Descending

There is no ‘descending’ per se, just set ascending to False.

df.mpg.sort_values(ascending = False).head(10)

Toyota Corolla    33.9
Fiat 128          32.4
Lotus Europa      30.4
Honda Civic       30.4
Fiat X1-9         27.3
Porsche 914-2     26.0
Merc 240D         24.4
Datsun 710        22.8
Merc 230          22.8
Toyota Corona     21.5
Name: mpg, dtype: float64

Applying functions

Map a function with no arguments to a single column

# some function to apply
def increment(x):
    return x+1

# could also be done as a lambda expression
df["ApplyResult"] = df.mpg.map(increment)

df[["mpg", "ApplyResult"]].head()

	mpg	ApplyResult
Mazda RX4	21.0	22.0
Mazda RX4 Wag	21.0	22.0
Datsun 710	22.8	23.8
Hornet 4 Drive	21.4	22.4
Hornet Sportabout	18.7	19.7

Apply a function with arguments using kwargs

# some function yadda
def incrementBy(x, by):
    return x + by

# could also be done as a lambda expression
df["ApplyResult"] = df.mpg.apply(incrementBy, by = 5)

df[["mpg", "ApplyResult"]].head()

	mpg	ApplyResult
Mazda RX4	21.0	26.0
Mazda RX4 Wag	21.0	26.0
Datsun 710	22.8	27.8
Hornet 4 Drive	21.4	26.4
Hornet Sportabout	18.7	23.7

Missing values

Expanding on an example from here. Pandas uses np.nan instead of NA.

df2 = pd.DataFrame({
    "x": [1,2,3,4,5,np.nan,7,8,np.nan,10,11,12,np.nan],
    "y": [1,np.nan,3,4,5,6,np.nan,8,9,10,11,12,13]
})
df2

	x	y
0	1.0	1.0
1	2.0	NaN
2	3.0	3.0
3	4.0	4.0
4	5.0	5.0
5	NaN	6.0
6	7.0	NaN
7	8.0	8.0
8	NaN	9.0
9	10.0	10.0
10	11.0	11.0
11	12.0	12.0
12	NaN	13.0

Missing in the whole df

# no NA the whole thing
df2.isnull()

	x	y
0	False	False
1	False	True
2	False	False
3	False	False
4	False	False
5	True	False
6	False	True
7	False	False
8	True	False
9	False	False
10	False	False
11	False	False
12	True	False

Missing in a single column

# just the x column
df2.x.isnull()

0     False
1     False
2     False
3     False
4     False
5      True
6     False
7     False
8      True
9     False
10    False
11    False
12     True
Name: x, dtype: bool

Replace NAs in a column

# use the fillna function 
df2.x = df2.x.fillna("MISSING")
df2

	x	y
0	1.0	1.0
1	2.0	NaN
2	3.0	3.0
3	4.0	4.0
4	5.0	5.0
5	MISSING	6.0
6	7.0	NaN
7	8.0	8.0
8	MISSING	9.0
9	10.0	10.0
10	11.0	11.0
11	12.0	12.0
12	MISSING	13.0

Replace the NAs in the full dataset

# files the other even though I didn't specify a column
df2.fillna("MISSING!!!")

	x	y
0	1.0	1.0
1	2.0	MISSING!!!
2	3.0	3.0
3	4.0	4.0
4	5.0	5.0
5	MISSING	6.0
6	7.0	MISSING!!!
7	8.0	8.0
8	MISSING	9.0
9	10.0	10.0
10	11.0	11.0
11	12.0	12.0
12	MISSING	13.0

Drop NAs

The official docs are great on this.

# create another sample df
df3 = pd.DataFrame({
    "x": [1,np.nan,3,4,5,np.nan,np.nan,8,np.nan,10,11,12,np.nan],
    "y": [1,np.nan,3,4,5,6,np.nan,8,9,10,11,12,np.nan],
    "z": [1,2,np.nan,4,5,6,np.nan,8,9,10,11,12,np.nan]
})
df3

	x	y	z
0	1.0	1.0	1.0
1	NaN	NaN	2.0
2	3.0	3.0	NaN
3	4.0	4.0	4.0
4	5.0	5.0	5.0
5	NaN	6.0	6.0
6	NaN	NaN	NaN
7	8.0	8.0	8.0
8	NaN	9.0	9.0
9	10.0	10.0	10.0
10	11.0	11.0	11.0
11	12.0	12.0	12.0
12	NaN	NaN	NaN

Drop all across the df

# no NAs period 
df3.dropna()

	x	y	z
0	1.0	1.0	1.0
3	4.0	4.0	4.0
4	5.0	5.0	5.0
7	8.0	8.0	8.0
9	10.0	10.0	10.0
10	11.0	11.0	11.0
11	12.0	12.0	12.0

Drop if they’re all na in that row

# drop if the whole row is NA
df3.dropna(how = "all")

	x	y	z
0	1.0	1.0	1.0
1	NaN	NaN	2.0
2	3.0	3.0	NaN
3	4.0	4.0	4.0
4	5.0	5.0	5.0
5	NaN	6.0	6.0
7	8.0	8.0	8.0
8	NaN	9.0	9.0
9	10.0	10.0	10.0
10	11.0	11.0	11.0
11	12.0	12.0	12.0

Drop all in certain columns

This is so clutch! Much simpler than filtering across in R (though there might be a cleaner way for that)

df3.dropna(subset = ["x", "y"])

	x	y	z
0	1.0	1.0	1.0
2	3.0	3.0	NaN
3	4.0	4.0	4.0
4	5.0	5.0	5.0
7	8.0	8.0	8.0
9	10.0	10.0	10.0
10	11.0	11.0	11.0
11	12.0	12.0	12.0

Group & Summarize

Group by a factor level

df[["mpg", "disp", "cyl"]].groupby(by = "cyl")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f2094b7cb10>

Groupby factor level and get the mean

df[["mpg", "disp", "cyl"]].groupby(by = "cyl").mean()

	mpg	disp
cyl
4	26.663636	105.136364
6	19.742857	183.314286
8	15.100000	353.100000

Melting a DataFrame

# use the melt function specifying ids and value vars
df_long = pd.melt(df, id_vars = "cyl", value_vars = ["mpg", "wt"])
df_long

	cyl	variable	value
0	6	mpg	21.000
1	6	mpg	21.000
2	4	mpg	22.800
3	6	mpg	21.400
4	8	mpg	18.700
...	...	...	...
59	4	wt	1.513
60	8	wt	3.170
61	6	wt	2.770
62	8	wt	3.570
63	4	wt	2.780

64 rows × 3 columns

Misc

Counts of values (like `table` in R)

# shows how many there are of each factor
df.cyl.value_counts()

8    14
4    11
6     7
Name: cyl, dtype: int64

Formatting long chains

# a little ugly but it works
df[["mpg", "disp", "cyl"]].\
    query("cyl > 4").\
    groupby(by = "cyl").\
    mean()

	mpg	disp
cyl
6	19.742857	183.314286
8	15.100000	353.100000

Iterating over a df

For mutating the dataframe apply/map is recommended, but I’m showing how to do this for completeness.

for i, j in df.iterrows():
    print(df["mpg"][i])
    # j would give you all the columns with just that row

21.0
21.0
22.8
21.4
18.7
18.1
14.3
24.4
22.8
19.2
17.8
16.4
17.3
15.2
10.4
10.4
14.7
32.4
30.4
33.9
21.5
15.5
15.2
13.3
19.2
27.3
26.0
30.4
15.8
19.7
15.0
21.4

https://www.statsmodels.org/stable/examples/notebooks/generated/glm.html https://www.geeksforgeeks.org/linear-regression-in-python-using-statsmodels/ https://datagy.io/pandas-iterate-over-rows/ https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html