Using statistics¶
Anaconda Enterprise supports statistical work using the R language and Python libraries such as NumPy, SciPy, Pandas, Statsmodels, and scikit-learn.
The following Jupyter notebook Python examples show how to use these libraries to calculate correlations, distributions, regressions, and principal component analysis.
These examples also include plots produced with the libraries seaborn and Matplotlib.
We thank the sites from which we have adapted some of this code.
Start by importing necessary libraries and functions, including Pandas, SciPy, scikit-learn, Statsmodels, seaborn, and Matplotlib.
This code imports load_boston to provide the Boston housing dataset from the datasets included with scikit-learn.
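A possible import cell for these examples; note that load_boston was deprecated and then removed in scikit-learn 1.2, so this sketch assumes an older scikit-learn:

```python
# Core data and statistics libraries used throughout these examples.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy import stats

# scikit-learn pieces: the dataset, model selection, regression,
# metrics, scaling, and PCA. load_boston requires scikit-learn < 1.2.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
```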
Load the Boston housing data into a Pandas DataFrame:
In the Boston housing dataset, the target variable is MEDV, the median home value.
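A minimal sketch of the loading step, assuming the imports above:

```python
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target  # median home value: the target variable
```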
Print the dataset description:
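Assuming the boston object loaded above:

```python
print(boston.DESCR)
```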
Show the first five records of the dataset:
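Assuming the DataFrame df built above:

```python
df.head()
```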
Show summary statistics for each variable: count, mean, standard deviation, minimum, 25th, 50th, and 75th percentiles, and maximum.
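Pandas computes all of these with describe:

```python
df.describe()
```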
Correlation matrix¶
The correlation matrix lists the correlation of each variable with each other variable.
Positive correlations mean one variable tends to be high when the other is high, and negative correlations mean one variable tends to be high when the other is low.
Correlations close to zero are weak; a variable only weakly correlated with the target contributes little to the model. Correlations close to 1 or -1 are strong; a variable strongly correlated with the target has more influence in the model.
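Pandas computes the pairwise correlations directly:

```python
corr = df.corr()
corr
```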
Format with asterisks¶
Format the correlation matrix by rounding the numbers to two decimal places and adding asterisks to denote statistical significance:
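One way to produce such a matrix; the helper name corr_with_stars and the significance thresholds are a common convention (* p < 0.05, ** p < 0.01, *** p < 0.001), not taken from the original:

```python
def corr_with_stars(data):
    """Correlations rounded to two decimals, with asterisks for significance."""
    cols = data.columns
    out = pd.DataFrame(index=cols, columns=cols)
    for a in cols:
        for b in cols:
            r, p = stats.pearsonr(data[a], data[b])
            # Asterisks follow the usual convention noted above.
            stars = '***' if p < 0.001 else '**' if p < 0.01 else '*' if p < 0.05 else ''
            out.loc[a, b] = f'{r:.2f}{stars}'
    return out

corr_with_stars(df)
```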
Heatmap¶
Heatmap of the correlation matrix:
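A sketch using seaborn's heatmap, reusing the corr matrix from above; the figure size and colormap are arbitrary choices:

```python
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
```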
Pairwise distributions with seaborn¶
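A sketch using seaborn's pairplot, restricted to a few columns so the grid stays readable; the column selection here is illustrative:

```python
# Scatter plots for each pair of columns, with each column's
# distribution on the diagonal.
sns.pairplot(df[['RM', 'LSTAT', 'PTRATIO', 'MEDV']])
plt.show()
```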
Target variable distribution¶
Histogram showing the distribution of the target variable. In this dataset this is “Median value of owner-occupied homes in $1000’s”, abbreviated MEDV.
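A sketch using seaborn; histplot requires seaborn 0.11 or later (older versions used distplot):

```python
sns.histplot(df['MEDV'], bins=30, kde=True)
plt.xlabel('MEDV ($1000s)')
plt.show()
```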
Simple linear regression¶
The variable MEDV is the target that the model predicts. All other variables are used as predictors, also called features.
The target variable is continuous, so use a linear regression instead of a logistic regression.
Split the dataset into a training set and a test set:
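A sketch using scikit-learn's train_test_split; the 70/30 split and the random_state value are illustrative choices:

```python
X = df.drop('MEDV', axis=1)  # all features
y = df['MEDV']               # target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```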
A linear regression consists of a coefficient for each feature and one intercept.
To make a prediction, each feature is multiplied by its coefficient. The intercept and all of these products are added together. This sum is the predicted value of the target variable.
The residual sum of squares (RSS) is calculated to measure the difference between the prediction and the actual value of the target variable.
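In symbols, with intercept $b_0$ and a coefficient $b_i$ for each feature $x_i$ (a standard formulation of the two preceding paragraphs):

$$\hat{y} = b_0 + \sum_i b_i x_i \qquad \mathrm{RSS} = \sum_n \left( y_n - \hat{y}_n \right)^2$$

where $y_n$ and $\hat{y}_n$ are the actual and predicted target values for record $n$.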
The function fit calculates the coefficients and intercept that minimize the RSS when the regression is used on each record in the training set.
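Assuming the training split above:

```python
lr = LinearRegression()
lr.fit(X_train, y_train)  # solves for the intercept and coefficients
print(lr.intercept_)
print(lr.coef_)
```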
Now check the accuracy of this linear regression on new data that it was not trained on: the test set.
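A sketch of the prediction step and scatter plot, assuming the fitted model lr:

```python
y_pred = lr.predict(X_test)
plt.scatter(y_test, y_pred)
plt.xlabel('Actual MEDV')
plt.ylabel('Predicted MEDV')
plt.show()
```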
This scatter plot shows that the regression is a good predictor of the data in the test set.
The mean squared error quantifies this performance:
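Using scikit-learn's mean_squared_error:

```python
print(mean_squared_error(y_test, y_pred))
```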
Ordinary least squares (OLS) regression with Statsmodels¶
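A minimal Statsmodels fit on the same training data; statsmodels does not add an intercept automatically, so add_constant supplies one:

```python
ols = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(ols.summary())  # coefficients, p-values, R-squared, and more
```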
Principal component analysis¶
The initial dataset has a number of feature or predictor variables and one target variable to predict.
Principal component analysis (PCA) converts these features into a set of principal components, which are linearly uncorrelated variables.
The first principal component has the largest possible variance and therefore accounts for as much of the variability in the data as possible.
Each of the other principal components is orthogonal to all of its preceding components, but has the largest possible variance within that constraint.
Graphing a dataset by showing only the first two or three of the principal components effectively projects a complex dataset with high dimensionality into a simpler image that shows as much of the variance in the data as possible.
PCA is sensitive to the relative scaling of the original variables, so begin by scaling them:
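A sketch using scikit-learn's StandardScaler on the feature matrix X from the regression example:

```python
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # zero mean, unit variance per feature
```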
Calculate the first three principal components and show them for the first five rows of the housing dataset:
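A sketch using scikit-learn's PCA; the column labels match the table below:

```python
pca = PCA(n_components=3)
components = pca.fit_transform(X_scaled)
pc_df = pd.DataFrame(components, columns=[
    'principal component 1',
    'principal component 2',
    'principal component 3'])
pc_df.head()
```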
| row | principal component 1 | principal component 2 | principal component 3 |
|-----|-----------------------|-----------------------|-----------------------|
| 0   | -2.097842             | 0.777102              | 0.335076              |
| 1   | -1.456412             | 0.588088              | -0.701340             |
| 2   | -2.074152             | 0.602185              | 0.161234              |
| 3   | -2.611332             | -0.005981             | -0.101940             |
| 4   | -2.457972             | 0.098860              | -0.077893             |
Show a 2D graph of this data:
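A sketch with Matplotlib; coloring the points by the target MEDV is an illustrative choice:

```python
plt.scatter(pc_df['principal component 1'],
            pc_df['principal component 2'],
            c=y, cmap='viridis')
plt.xlabel('principal component 1')
plt.ylabel('principal component 2')
plt.colorbar(label='MEDV')
plt.show()
```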
Show a 3D graph of this data:
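A sketch using Matplotlib's 3D projection:

```python
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(pc_df['principal component 1'],
           pc_df['principal component 2'],
           pc_df['principal component 3'])
ax.set_xlabel('principal component 1')
ax.set_ylabel('principal component 2')
ax.set_zlabel('principal component 3')
plt.show()
```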
Measure how much of the variance is explained by each of the three components:
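PCA exposes this as explained_variance_ratio_:

```python
print(pca.explained_variance_ratio_)
```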
Each value will be less than or equal to the previous value, and each value will be in the range from 0 through 1.
The sum of these three values shows the fraction of the total variance explained by the three principal components, in the range from 0 (none) through 1 (all):
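For example:

```python
print(pca.explained_variance_ratio_.sum())
```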
Predict the target variable using only the three principal components:
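A sketch that refits the linear regression on the principal components; reusing the same test_size and random_state as before reproduces the earlier row split, which keeps the comparison fair:

```python
pc_train, pc_test, y_train2, y_test2 = train_test_split(
    components, y, test_size=0.3, random_state=42)
lr_pca = LinearRegression()
lr_pca.fit(pc_train, y_train2)
y_pred_pca = lr_pca.predict(pc_test)
```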
Plot the predictions from the linear regression in green again, and the new predictions in blue:
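A sketch, assuming the earlier predictions y_pred and the PCA-based predictions above:

```python
plt.scatter(y_test, y_pred, color='green', label='all features')
plt.scatter(y_test2, y_pred_pca, color='blue', label='3 principal components')
plt.xlabel('Actual MEDV')
plt.ylabel('Predicted MEDV')
plt.legend()
plt.show()
```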
The blue points are somewhat more widely scattered, but similar.
Calculate the mean squared error:
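For example:

```python
print(mean_squared_error(y_test2, y_pred_pca))
```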