# select column (as a dataframe)df[["mpg"]].head()
mpg
Mazda RX4
21.0
Mazda RX4 Wag
21.0
Datsun 710
22.8
Hornet 4 Drive
21.4
Hornet Sportabout
18.7
Select several columns by string name
# select a column by several namesdf[["mpg", "wt"]].head()
mpg
wt
Mazda RX4
21.0
2.620
Mazda RX4 Wag
21.0
2.875
Datsun 710
22.8
2.320
Hornet 4 Drive
21.4
3.215
Hornet Sportabout
18.7
3.440
Select from one column through another with names
# select a range by name of start and end "mph through hp"df.loc[:, "mpg":"hp"].head()
mpg
cyl
disp
hp
Mazda RX4
21.0
6
160.0
110
Mazda RX4 Wag
21.0
6
160.0
110
Datsun 710
22.8
4
108.0
93
Hornet 4 Drive
21.4
6
258.0
110
Hornet Sportabout
18.7
8
360.0
175
Select arbitrarily
How to get based on a condition of your choosing. We might want to get only columns that start with “d”, for example.
# get a liststarts_with_d = [i for i inlist(df.columns) if i.startswith("d")]# pass the list to pandasdf[starts_with_d].head()
disp
drat
Mazda RX4
160.0
3.90
Mazda RX4 Wag
160.0
3.90
Datsun 710
108.0
3.85
Hornet 4 Drive
258.0
3.08
Hornet Sportabout
360.0
3.15
Filtering
Operations where we’d used dplyr in R. In Pandas, filtering refers to operations by index, so what I’m thinking of is more like “querying” in Pandas terms.
Single condition
Pass logic in strings to the query method.
df.query("cyl == 6")
mpg
cyl
disp
hp
drat
wt
qsec
vs
am
gear
carb
Mazda RX4
21.0
6
160.0
110
3.90
2.620
16.46
0
1
4
4
Mazda RX4 Wag
21.0
6
160.0
110
3.90
2.875
17.02
0
1
4
4
Hornet 4 Drive
21.4
6
258.0
110
3.08
3.215
19.44
1
0
3
1
Valiant
18.1
6
225.0
105
2.76
3.460
20.22
1
0
3
1
Merc 280
19.2
6
167.6
123
3.92
3.440
18.30
1
0
4
4
Merc 280C
17.8
6
167.6
123
3.92
3.440
18.90
1
0
4
4
Ferrari Dino
19.7
6
145.0
175
3.62
2.770
15.50
0
1
5
6
Multiple conditions
df.query("cyl == 6 & hp > 105")
mpg
cyl
disp
hp
drat
wt
qsec
vs
am
gear
carb
Mazda RX4
21.0
6
160.0
110
3.90
2.620
16.46
0
1
4
4
Mazda RX4 Wag
21.0
6
160.0
110
3.90
2.875
17.02
0
1
4
4
Hornet 4 Drive
21.4
6
258.0
110
3.08
3.215
19.44
1
0
3
1
Merc 280
19.2
6
167.6
123
3.92
3.440
18.30
1
0
4
4
Merc 280C
17.8
6
167.6
123
3.92
3.440
18.90
1
0
4
4
Ferrari Dino
19.7
6
145.0
175
3.62
2.770
15.50
0
1
5
6
Sorting
Ascending
df.mpg.sort_values().head(10)
Lincoln Continental 10.4
Cadillac Fleetwood 10.4
Camaro Z28 13.3
Duster 360 14.3
Chrysler Imperial 14.7
Maserati Bora 15.0
Merc 450SLC 15.2
AMC Javelin 15.2
Dodge Challenger 15.5
Ford Pantera L 15.8
Name: mpg, dtype: float64
Descending
There is no ‘descending’ per se, just set ascending to False.
df.mpg.sort_values(ascending =False).head(10)
Toyota Corolla 33.9
Fiat 128 32.4
Lotus Europa 30.4
Honda Civic 30.4
Fiat X1-9 27.3
Porsche 914-2 26.0
Merc 240D 24.4
Datsun 710 22.8
Merc 230 22.8
Toyota Corona 21.5
Name: mpg, dtype: float64
Applying functions
Map a function with no arguments to a single column
# some function to applydef increment(x):return x+1# could also be done as a lambda expressiondf["ApplyResult"] = df.mpg.map(increment)df[["mpg", "ApplyResult"]].head()
mpg
ApplyResult
Mazda RX4
21.0
22.0
Mazda RX4 Wag
21.0
22.0
Datsun 710
22.8
23.8
Hornet 4 Drive
21.4
22.4
Hornet Sportabout
18.7
19.7
Apply a function with arguments using kwargs
# some function yaddadef incrementBy(x, by):return x + by# could also be done as a lambda expressiondf["ApplyResult"] = df.mpg.apply(incrementBy, by =5)df[["mpg", "ApplyResult"]].head()
mpg
ApplyResult
Mazda RX4
21.0
26.0
Mazda RX4 Wag
21.0
26.0
Datsun 710
22.8
27.8
Hornet 4 Drive
21.4
26.4
Hornet Sportabout
18.7
23.7
Missing values
Expanding on an example from here. Pandas uses np.nan instead of NA.