import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
from numpy.random import default_rng
from scipy import stats
Statistics
df = sm.datasets.get_rdataset("mtcars", "datasets", cache = True).data
Summary Stats
# get the data as a vector
v = df.mpg

np.mean(v)

# (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mode.html)
stats.mode(v, keepdims = False)

np.median(v)
np.var(v)

35.188974609374995
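One gotcha for R users: np.var() defaults to the population variance (ddof = 0), which is why the number above doesn't match R's var(). Pass ddof = 1 if you want the sample variance. A quick sketch:

# np.var divides by n by default (ddof = 0)
np.var(v)
# divide by n - 1 instead, matching R's var() (≈ 36.3241 for mtcars mpg)
np.var(v, ddof = 1)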
Simulating Data
After trying a few things, it looks like numpy is king here.
Create an RNG Object
To get started, create an RNG object that will make sure everything is initialized correctly.
rng = default_rng()
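By default this seeds itself from OS entropy, so runs won't be reproducible. If you want the same draws every time, default_rng() also accepts a seed (the 42 here is just an arbitrary choice):

# seed the generator for reproducible draws
rng = default_rng(seed = 42)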
We can now go through a list of common distributions, simulate from them, and visualize to make sure we’re on the right track.
Normal/Gaussian
n = 1000

# because calling the mean and sd the "mean" and "sd" would be too obvious
v_norm = rng.normal(loc = 5, scale = 2.5, size = n)
sns.kdeplot(v_norm)
<AxesSubplot: ylabel='Density'>
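The plot is the main check, but it's also easy to confirm numerically that the draws look right. These should land near the loc of 5 and scale of 2.5, give or take sampling noise:

# sample moments should be close to loc = 5 and scale = 2.5
np.mean(v_norm)
np.std(v_norm, ddof = 1)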
Uniform
# this makes sense lol
v_unif = rng.uniform(low = 0, high = 1, size = n)
sns.kdeplot(v_unif)
<AxesSubplot: ylabel='Density'>
Poisson
# lambda was too long I guess (it's also a reserved keyword in Python)
v_poisson = rng.poisson(lam = 1, size = n)
sns.histplot(v_poisson)
<AxesSubplot: ylabel='Count'>
Bernoulli
# bernoulli
# https://stackoverflow.com/questions/47012474/bernoulli-random-number-generator
v_bernoulli = rng.binomial(n = 1, p = 0.5, size = n)
sns.histplot(v_bernoulli)
<AxesSubplot: ylabel='Count'>
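As that Stack Overflow answer points out, a Bernoulli is just a binomial with n = 1, so nothing is lost here. If you'd rather name the distribution explicitly, scipy ships a dedicated one; this sketch assumes a scipy version whose random_state accepts a numpy Generator:

# equivalent draws via scipy's Bernoulli distribution
v_bernoulli_alt = stats.bernoulli.rvs(p = 0.5, size = n, random_state = rng)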
Binomial
# binomial
v_binomial = rng.binomial(n = 4, p = 0.15, size = n)
sns.histplot(v_binomial)
<AxesSubplot: ylabel='Count'>
Negative Binomial
# this makes sense
v_negative_binomial = rng.negative_binomial(n = 1, p = 0.25, size = n)
sns.histplot(v_negative_binomial)
<AxesSubplot: ylabel='Count'>
Modeling
Linear Regression
Another routine thing that will come up a lot is linear regression. It's not as obvious as in R, but it's pretty straightforward.
# you can use a formula if you use smf
linear_model = smf.ols(formula = 'wt ~ disp + mpg', data = df).fit()
We can get the results using the summary() method, and the output will look pretty familiar to R users.
linear_model.summary()
| Dep. Variable: | wt | R-squared: | 0.836 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.824 |
| Method: | Least Squares | F-statistic: | 73.65 |
| Date: | Fri, 27 Jan 2023 | Prob (F-statistic): | 4.31e-12 |
| Time: | 10:21:49 | Log-Likelihood: | -15.323 |
| No. Observations: | 32 | AIC: | 36.65 |
| Df Residuals: | 29 | BIC: | 41.04 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 3.5627 | 0.699 | 5.094 | 0.000 | 2.132 | 4.993 |
| disp | 0.0043 | 0.001 | 3.818 | 0.001 | 0.002 | 0.007 |
| mpg | -0.0663 | 0.023 | -2.878 | 0.007 | -0.113 | -0.019 |

| Omnibus: | 0.431 | Durbin-Watson: | 1.028 |
|---|---|---|---|
| Prob(Omnibus): | 0.806 | Jarque-Bera (JB): | 0.579 |
| Skew: | 0.187 | Prob(JB): | 0.749 |
| Kurtosis: | 2.458 | Cond. No. | 2.52e+03 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.52e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
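The summary table is for reading; for programmatic access, the fitted results object exposes the pieces directly through the standard statsmodels attributes:

# point estimates, like coef() in R
linear_model.params

# confidence intervals, like confint() in R
linear_model.conf_int()

# fitted values and residuals
linear_model.fittedvalues
linear_model.resid

# predictions, like predict() in R (here just on the original data)
linear_model.predict(df)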
Logistic Regression
The next logical step is logistic regression. No surprises here.
# fit the model
= smf.glm("am ~ wt + mpg", data = df,
logistic_model = sm.families.Binomial()).fit()
family
logistic_model.summary()
| Dep. Variable: | am | No. Observations: | 32 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 29 |
| Model Family: | Binomial | Df Model: | 2 |
| Link Function: | Logit | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | -8.5921 |
| Date: | Fri, 27 Jan 2023 | Deviance: | 17.184 |
| Time: | 10:21:49 | Pearson chi2: | 32.7 |
| No. Iterations: | 7 | Pseudo R-squ. (CS): | 0.5569 |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 25.8866 | 12.194 | 2.123 | 0.034 | 1.988 | 49.785 |
| wt | -6.4162 | 2.547 | -2.519 | 0.012 | -11.407 | -1.425 |
| mpg | -0.3242 | 0.239 | -1.354 | 0.176 | -0.794 | 0.145 |
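Because the logit link puts the coefficients on the log-odds scale, a common follow-up is to exponentiate them into odds ratios; the same transformation applies to the confidence interval endpoints:

# odds ratios from the log-odds coefficients
np.exp(logistic_model.params)

# and for the confidence intervals
np.exp(logistic_model.conf_int())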
ANOVA
The next basic analysis I wanted to recreate was ANOVA. This is handled nicely by statsmodels, looking more or less like the previous models.
# fit the initial model
= smf.ols("cyl ~ mpg + disp", data = df).fit()
anova_model
= sm.stats.anova_lm(anova_model)
anova anova
| | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| mpg | 1.0 | 71.801048 | 71.801048 | 132.393794 | 2.496891e-12 |
| disp | 1.0 | 11.346399 | 11.346399 | 20.921600 | 8.274021e-05 |
| Residual | 29.0 | 15.727553 | 0.542329 | NaN | NaN |
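One caveat worth knowing: anova_lm() defaults to Type I (sequential) sums of squares, so the order of terms in the formula matters. If you're used to Type II tests from R's car::Anova(), statsmodels supports that through the typ argument:

# Type II sums of squares, where term order doesn't matter
sm.stats.anova_lm(anova_model, typ = 2)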
Tukey Test
After an ANOVA, some sort of post-hoc test is usually performed. This isn't as obvious as the ones above, requiring us to pass the outcome and grouping vectors separately instead of using a formula.
# specify the groups without the formula
tukey_results = sm.stats.multicomp.pairwise_tukeyhsd(endog = df["mpg"],
                                                     groups = df["cyl"])
print(tukey_results)
Multiple Comparison of Means - Tukey HSD, FWER=0.05
=====================================================
group1 group2 meandiff p-adj lower upper reject
-----------------------------------------------------
4 6 -6.9208 0.0003 -10.7693 -3.0722 True
4 8 -11.5636 0.0 -14.7708 -8.3565 True
6 8 -4.6429 0.0112 -8.3276 -0.9581 True
-----------------------------------------------------
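The printed table is fine for reading, but if you want to filter or sort the results, you can coerce them into a DataFrame. The pattern below goes through the summary table's .data attribute, which I wouldn't treat as a stable API, so consider it a sketch:

# the first row of the summary data holds the column names
tukey_table = tukey_results.summary().data
tukey_df = pd.DataFrame(tukey_table[1:], columns = tukey_table[0])
tukey_df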