1. Inferential statistics in survey analysis
Let's now look at inferential statistics. Descriptive statistics use summary statistics, graphs, and tables to describe a data set, while inferential statistics use samples to draw conclusions about larger populations.
2. Introduction to inferential statistics in survey analysis
When we have data from a sample survey, we can use inferential statistics to understand the larger population from which the sample is taken.
Inferential statistics enable us to draw conclusions from the data. More specifically, in surveys, inferential statistics cover associations between variables, how well our sample represents a larger population, and cause-and-effect relationships.
3. Sample scenarios
If, for example, we want to estimate the mean driving age for all teens in the US, we use inferential statistics. Similarly, drawing conclusions based on the relationship between ranking and job satisfaction requires inferential statistics.
4. Z-score
One method we use to perform inferential statistics on survey data is the z-score. The z-score tells us how many standard deviations a value lies below or above the population mean. A positive z-score means the value is higher than the mean, and a negative z-score means it is lower than the mean. The z-score also allows us to easily compare data points for a record across features, especially when the different features have significantly different ranges. When the data are approximately normally distributed, we know that over 99% of values fall within three standard deviations of the mean. Therefore, if the absolute value of a z-score is larger than three, we can assume that the value is quite unusual.
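To make the definition concrete, here is a minimal sketch of the calculation done by hand: subtract the mean from each value and divide by the standard deviation. The array `values` is made-up illustration data, not taken from any survey, and the population standard deviation (NumPy's default) is assumed.

```python
import numpy as np

# Hypothetical values for illustration only; their mean is 5 and
# their population standard deviation is 2.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()
std = values.std()  # population standard deviation (ddof=0)

# z-score: number of standard deviations each value lies from the mean
z_scores = (values - mean) / std

print(z_scores)
```

Values below the mean come out negative, values above it come out positive, and a value equal to the mean gets a z-score of zero.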
In Python, the z-score can be calculated using the scipy-dot-stats module. The zscore function takes an array of values and returns an array containing their z-scores, implicitly calculating the mean and standard deviation.
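As a short sketch of that call, the example below passes a small made-up array (hypothetical values, chosen only for illustration) to `scipy.stats.zscore` and checks the result against the manual formula. By default the function uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from scipy import stats

# Hypothetical values for illustration only
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# zscore computes the mean and standard deviation internally
z_scores = stats.zscore(values)

# The same result computed by hand, for comparison
manual = (values - values.mean()) / values.std()

print(z_scores)
print(np.allclose(z_scores, manual))
```

If the sample standard deviation is preferred, `ddof=1` can be passed to `zscore` instead.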
5. Survey example on demographics
Let's see this with an example. Here we have a survey that explores the demographics of young Slovakians, aged 15-30 years.
6. Visualizing age column
Let's visualize the distribution of the Age column using a histogram. By calling the plot function on the Age column of the survey and setting kind to hist, we get bars that represent the frequencies of the age values in the Age column.
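A plotting call along these lines might look like the sketch below. The `survey` DataFrame here is a small synthetic stand-in, not the real survey data, and the non-interactive `Agg` backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import pandas as pd

# Synthetic stand-in for the survey: mostly 19- to 21-year-olds,
# plus one 29-year-old and one 30-year-old (made-up data).
survey = pd.DataFrame({"Age": [19] * 30 + [20] * 40 + [21] * 30 + [29, 30]})

# kind="hist" draws one bar per bin, with height equal to the bin's frequency
ax = survey["Age"].plot(kind="hist")
ax.set_xlabel("Age")
```

In an interactive session, calling `matplotlib.pyplot.show()` afterwards would display the figure.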
7. Calculating z-score on age column
If we want to find the age outliers of this group, we can create a new column called Age_zscore and use the scipy dot stats dot zscore function on the Age column. This will give us standardized values of the Age column.
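Here is a minimal sketch of that step, again using a synthetic stand-in for the survey rather than the real data. Only the column names `Age` and `Age_zscore` come from the example; the values are made up.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the survey (made-up ages)
survey = pd.DataFrame({"Age": [19] * 30 + [20] * 40 + [21] * 30 + [29, 30]})

# Standardize the Age column and store the result in a new column
survey["Age_zscore"] = np.asarray(stats.zscore(survey["Age"]))

print(survey.head())
```

After standardizing, the new column has a mean of approximately zero and a standard deviation of approximately one.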
8. Calculating z-score on age column
If we subset the survey data to show only respondents whose z-score has an absolute value greater than three, we see that our 29- and 30-year-old respondents are the unusual age group for this survey. Because this age group is an outlier for this particular survey, it is up to us, depending on the type of analysis being conducted, whether to include or exclude their data in further analysis.
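The subsetting step could be sketched as follows. The data is a synthetic stand-in constructed so that the 29- and 30-year-old respondents lie far from the rest; it is not the real survey, and the variable name `outliers` is chosen for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the survey (made-up ages)
survey = pd.DataFrame({"Age": [19] * 30 + [20] * 40 + [21] * 30 + [29, 30]})
survey["Age_zscore"] = np.asarray(stats.zscore(survey["Age"]))

# Keep only rows whose z-score is more than three standard deviations from the mean
outliers = survey[survey["Age_zscore"].abs() > 3]

print(outliers)
```

Flipping the comparison to `<= 3` would give the survey with these respondents excluded, should the analysis call for that.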
9. Z-score analysis
If, for example, we wanted to analyze this group's spending habits, it is likely that 29- and 30-year-olds have quite different spending habits from the rest of the survey population, potentially skewing our results, and thus we may consider excluding their data.
10. Let's practice!
Let's practice this on some surveys!