
Descriptive and Inferential Statistics

1. Descriptive and Inferential Statistics

Welcome back! Statistical tools help us analyze our data for meaningful insights and interpret what our results mean.

2. Descriptive statistics

Descriptive statistics are the basic measures used to describe survey data, including the mean, median, mode, range, standard deviation, and more. One function that helps us compute them is the describe function.

3. .describe() function

The pandas describe function computes a summary of descriptive statistics for the numerical columns in a DataFrame. By default, it calculates the count of values, the mean, the standard deviation, the minimum and maximum values, and the 25th, 50th, and 75th percentiles. For non-numerical columns, we set the "include" parameter to the object data type (written as NumPy's np-dot-object in older code; the string "object" also works). In that case, describe calculates the count of values, the number of unique values, the top or most frequent value, and the frequency of that top value.
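As a minimal sketch of the two calls described above, the snippet below builds a small made-up DataFrame. The column names and values are assumptions for illustration only, not the Austin Energy data. It then summarizes the numerical column and the object column separately.

```python
import pandas as pd

# Hypothetical survey data; names and values are illustrative only.
survey = pd.DataFrame({
    "satisfaction_rating": [7, 8, 6, 9, 7, 10],
    "customer_type": pd.Series(
        ["Residential", "Residential", "Commercial",
         "Residential", "Commercial", "Residential"],
        dtype="object",  # explicit object dtype so include="object" selects it
    ),
})

# Default call: count, mean, std, min, 25%, 50%, 75%, max for numeric columns.
numeric_summary = survey.describe()

# Object columns: count, unique, top, freq.
categorical_summary = survey.describe(include="object")

print(numeric_summary)
print(categorical_summary)
```

Each summary is itself a DataFrame, so a single statistic can be read with dot-loc, for example numeric_summary.loc["mean", "satisfaction_rating"].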

4. Interpreting .describe()

When using dot-describe on numerical columns, there are some things we need to look out for and address when interpreting our data. The maximum is always at least as large as the mean and median, so on its own that tells us nothing. If the maximum value is far greater than the mean and median, though, outliers, or data points that are significantly different from the rest, may exist in the dataset. If a value doesn't make any sense, like a negative "Age" value, then the data needs to be double-checked or cleaned.
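The two checks above can be sketched as follows. The data is hypothetical: one extreme rating and one negative age were planted on purpose, and the comparison is a rough visual check rather than a formal outlier test.

```python
import pandas as pd

# Hypothetical data with one extreme rating and one impossible age.
df = pd.DataFrame({
    "Age": [34, 45, -2, 51, 29, 40],
    "satisfaction_rating": [7, 8, 6, 9, 7, 95],
})

summary = df.describe()

# Put the mean, the median (the 50% row), and the maximum side by side.
rating_check = summary.loc[["mean", "50%", "max"], "satisfaction_rating"]
print(rating_check)

# Rows with values that make no sense should be inspected or cleaned.
invalid_age = df[df["Age"] < 0]
print(invalid_age)
```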

5. Interpreting .describe()

When using dot-describe on categorical columns, we can infer useful information from the highest occurring class and the number of times it occurred.
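To see where the highest occurring class and its count come from, this sketch compares dot-describe with a plain value count on a made-up categorical column. The customer_type name and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical categorical column; values are illustrative only.
customer_type = pd.Series(
    ["Residential", "Residential", "Commercial",
     "Residential", "Commercial", "Residential"],
    dtype="object",
    name="customer_type",
)

summary = customer_type.describe()      # count, unique, top, freq
counts = customer_type.value_counts()   # occurrences per class, sorted descending

print(summary)
print(counts)
```

Here "top" matches the first entry of the value counts, and "freq" is that entry's count.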

6. .describe() on electric_satisfaction

Let's use the dot-describe function on a dataset from Austin Energy, based on a quarterly survey of customers' satisfaction with their electricity company. We will call dot-describe on the survey to get summary statistics for the numerical columns of the data.

7. .describe() on electric_satisfaction

Notice that our satisfaction_rating column shows the presence of outliers, since the maximum value is significantly greater than both the mean and the median. Remember that the 50th percentile is the same as the median of the data.

8. .describe() on electric_satisfaction

From the top and frequency values, we can infer that residential respondents were the most common group taking this survey.

9. Inferential statistics

Once we've summarized our data using descriptive statistics, inferential statistics help us draw conclusions about the larger population. Since a sample is smaller than the population it is drawn from, sampling error occurs. Sampling errors are statistical errors that arise because a sample does not perfectly represent the whole population. We don't know the real population parameters, such as the mean satisfaction rating for the population, but we can estimate them using inferential statistics. One way to make this estimate is through confidence intervals. The norm-dot-interval function helps us to calculate them.

10. The norm.interval() function

The norm-dot-interval function is suited to large samples, because the central limit theorem tells us that the sample mean is then approximately normally distributed. It is a method from the scipy-dot-stats library. It accepts the confidence level as alpha (named confidence in newer SciPy releases), the sample mean as loc, and the sample standard error as scale, and returns a confidence interval as a result.
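Here is a minimal sketch of the call described above, run on simulated ratings rather than the real survey. The data, the random seed, and the variable names are assumptions. The confidence level is passed positionally because the keyword for that first argument differs between SciPy versions.

```python
import numpy as np
from scipy import stats

# Hypothetical large sample of ratings, simulated for illustration only.
rng = np.random.default_rng(0)
ratings = rng.normal(loc=7.5, scale=1.5, size=500)

sample_mean = ratings.mean()
# Standard error of the mean: sample std (ddof=1) divided by sqrt(n).
standard_error = stats.sem(ratings)

# First argument is the confidence level, passed positionally so the call
# works whether the keyword is named alpha or confidence.
lower, upper = stats.norm.interval(0.99, loc=sample_mean, scale=standard_error)

print(f"99% confidence interval: ({lower:.3f}, {upper:.3f})")
```

The returned interval is symmetric around the sample mean, and it narrows as the sample grows because the standard error shrinks.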

11. Interpreting norm.interval() on electric_satisfaction

Let's practice this. On our electric_satisfaction survey, let's calculate the confidence interval for the true population mean satisfaction_rating, using a 99% confidence level. First, call the norm-dot-interval function with alpha equal to 0-point-99, loc equal to the mean satisfaction rating, and scale equal to the sample standard error. Generalizing to the larger population, we can be 99% confident that the interval from 6817-point-19 to 7568-point-53 contains the true population mean satisfaction rating during the years this survey was taken.
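For reference, a normal-based 99% interval of this kind corresponds to the formula below. The symbols are notation introduced here, not taken from the lesson: the sample mean is written as x-bar, the sample standard deviation as s, and the sample size as n.

```latex
\bar{x} \;\pm\; z_{0.995}\,\frac{s}{\sqrt{n}},
\qquad z_{0.995} \approx 2.576
```

The fraction is the sample standard error passed as scale, and the multiplier 2.576 is the standard normal quantile that leaves 0.5% in each tail.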

12. Let's practice!

Now it's your turn to describe and make inferences from survey data.