1. Visuals & Distributions
Let’s narrow our focus to the distribution of site-level information. Sites like Match.com study the frequency of users’ logins. Hours since last login is an important metric for both business viability & user behavior.
2. Distribution Stats
In the upcoming exercises you will calculate a variable’s variance, standard deviation & even the z-score. Variance is a measurement of a variable’s spread around its mean. It’s a simple formula: the mean is subtracted from each data point, each difference is squared, and finally the mean of those squared differences is calculated. However, a more intuitive statistic may be standard deviation, which measures the data’s dispersion in the original data’s scale. To calculate a standard deviation, just take the square root of the variance. Lastly, for any specific data point you may want to measure its distance from the mean in a standard unit of measure. This is where the z-score can be helpful. A z-score is the number of standard deviations that a data point lies from the mean.
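To make the three formulas concrete, here is a minimal Python sketch that follows the description above step by step. The function names and the `hours_since_login` values are made up for illustration. The sketch divides by the number of data points (the population form described above); some spreadsheet functions divide by one less than that for samples, so their results can differ slightly.

```python
import math

def variance(values):
    """Mean of the squared differences from the mean (population form)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def std_dev(values):
    """Square root of the variance, in the data's original units."""
    return math.sqrt(variance(values))

def z_score(x, values):
    """Number of standard deviations that x lies from the mean."""
    mean = sum(values) / len(values)
    return (x - mean) / std_dev(values)

# Hypothetical hours-since-last-login values, chosen so the math is easy to check.
hours_since_login = [2, 4, 4, 4, 5, 5, 7, 9]

print(variance(hours_since_login))    # 4.0
print(std_dev(hours_since_login))     # 2.0
print(z_score(9, hours_since_login))  # 2.0
```

Here the mean is 5 hours, so a user who last logged in 9 hours ago sits two standard deviations above the mean.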
3. Visualizing Distributions
People are excellent at identifying patterns visually. As a result, it’s almost always a good idea to visually inspect your data. In this case, creating a histogram of hours since last login makes sense. It can be hard to imagine the shape of a distribution from the statistics alone. The histogram is a helpful way to see the variable’s distribution. Remember, a histogram buckets data points into groups. Low values are grouped together, then middle values & so on. As a result, histograms can be misleading because the visual changes based on the number of bins or buckets used. As the practitioner you need to ensure your histogram represents the distribution fairly.
Other considerations when making histograms are whether you have enough observations to make the visual informative & whether you are working with a sample or the entire population.
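The effect of bin count can be seen without drawing anything. The sketch below is an illustration only: `bin_counts` is a hypothetical helper that uses equal-width bins, and the `hours` values are made up. It buckets the same data into two and then four bins.

```python
def bin_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    if width == 0:
        # All values are identical, so everything lands in the first bin.
        counts[0] = len(values)
        return counts
    for x in values:
        # The maximum value would index one past the end, so clamp it.
        idx = min(int((x - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical hours-since-last-login values.
hours = [1, 2, 2, 3, 3, 3, 4, 10, 11, 12, 12, 13, 23, 24]

print(bin_counts(hours, 2))  # [11, 3]
print(bin_counts(hours, 4))  # [7, 4, 1, 2]
```

With two bins the data looks like one large group and a small tail. With four bins a second cluster around 10 to 12 hours becomes visible, which is the kind of detail a poor bin choice can hide.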
4. Histogram-ing!
In spreadsheets, a histogram is a type of CHART. You will need to navigate to INSERT then click CHART to open the visualization dialogue. After that it’s straightforward: choose “Histogram chart” as the chart type and then declare the data range.
5. Comparing Distributions
You will also be visually exploring the important matching feature AGE. Constructing an AGE histogram will illustrate a different type of data distribution compared to hours since last login. Because AGE is a major matching factor, the company may want to ensure the distribution is “normal”. A non-normal distribution for an important customer characteristic may indicate the site over-indexes with young, middle-aged or old people. For some types of businesses, an overemphasis illustrated in a non-normal distribution could become a risk factor. For example, a mutual fund heavily indexed to large market cap stocks may expose the fund to additional risk where diversification could help. In our case, if AGE were non-normal, we could identify new age groups for marketing.
6. Normal Stats
Even though in many cases you can visually distinguish a normal from a non-normal distribution, you will still need to calculate the skew & kurtosis for AGE in this lesson. To review, the skew is a measure of a non-symmetrical "tail" to the right or left of the mean. The kurtosis measures how the distribution trails off on either side of a peak in the histogram. A general practice to test for normality is to check the skew & kurtosis values. As a rule of thumb, when both values are between -2 & 2 you can treat the distribution as approximately normal.
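As a rough illustration of the -2 to 2 check, the sketch below computes skew and kurtosis from simple moment-based formulas. Several assumptions apply: kurtosis is taken as excess kurtosis (a normal distribution scores 0), the helper names are hypothetical, and the data is made up. Spreadsheet skew and kurtosis functions typically apply small-sample adjustments, so their output may differ somewhat from these values.

```python
def central_moment(values, k):
    """Mean of the k-th power of the differences from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    """Moment-based skew: 0 for symmetric data, positive for a right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

def looks_roughly_normal(values, limit=2.0):
    """Rule of thumb from the lesson: both statistics between -limit and limit."""
    return (-limit < skewness(values) < limit
            and -limit < excess_kurtosis(values) < limit)

# Hypothetical data: symmetric ages versus login hours with one extreme value.
ages = [25, 30, 35, 40, 45]
login_hours = [1, 1, 1, 1, 1, 1, 1, 1, 1, 100]

print(looks_roughly_normal(ages))         # True
print(looks_roughly_normal(login_hours))  # False
```

Passing this check does not prove normality; it only flags distributions with an obvious tail or an unusual peak, so the histogram is still worth inspecting.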
7. Let's practice!
So close! Let's make those visuals!