
Exploring the data

1. Exploring the data

It's time to dive deeper into the case study and start exploring the data.

2. Exploratory data analysis

Exploratory data analysis is an important step in any data analysis and can offer valuable insights. We can use it to assess the main characteristics of the data, find relationships, patterns, or groups, and suggest hypotheses for future analysis. How can we do this? We can describe the data numerically or create visualizations to help us investigate the data. We'll discuss some examples in the following slides.

3. Example: age characteristics

Descriptive statistics such as those given in the table here help us understand the main characteristics of the respondents. For example, the mean and median age are both about 43; the youngest respondent is 21, and the oldest is 66.
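As a rough illustration of how a table like this could be produced, here is a minimal sketch using Python's standard library. The `describe_ages` helper and the `ages` list are hypothetical and made up for this example; they are not the survey data from the case study.

```python
import statistics

def describe_ages(ages):
    """Return basic descriptive statistics for a list of ages."""
    return {
        "count": len(ages),
        "mean": statistics.mean(ages),
        "median": statistics.median(ages),
        "min": min(ages),
        "max": max(ages),
    }

# Made-up ages for illustration only (assumption, not the real survey).
ages = [21, 35, 43, 43, 51, 66]

for name, value in describe_ages(ages).items():
    print(f"{name}: {value}")
```

Comparing the mean with the median is a quick check on skew: when the two are close, as in the table, the distribution is likely to be fairly symmetric.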

4. Are all age groups represented?

Plotting the counts of each age gives us the histogram of age. This plot tells us more about the distribution of the variable. This can help us check whether all age groups are reasonably well represented in the survey. We can see, for example, that the most common age matches the average age.
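The counting behind a histogram can be sketched in a few lines. The example below groups ages into ten-year bins and prints a simple text histogram. The `age_group_counts` helper, the bin width, and the `ages` list are all assumptions made for illustration, not part of the case study.

```python
from collections import Counter

def age_group_counts(ages, width=10):
    """Count respondents per age bin, keyed by the lower edge of each bin."""
    counts = Counter((age // width) * width for age in ages)
    return dict(sorted(counts.items()))

# Made-up ages for illustration only (assumption, not the real survey).
ages = [21, 25, 34, 38, 41, 43, 43, 47, 52, 66]

width = 10
for start, n in age_group_counts(ages, width).items():
    # One '#' per respondent in the bin.
    print(f"{start}-{start + width - 1}: {'#' * n}")
```

A bin with very few or no respondents would show up as a short or missing bar, which is the kind of gap this check is meant to reveal.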

5. Are all age groups represented?

You might have also noticed that ages below the average have slightly higher counts than ages above it. Is this problematic? That depends. If the company has more younger employees overall, it's normal for the survey to reflect this. However, if the company has an older workforce on average, it is possible that younger employees are over-represented compared to older employees. In that case, it would be wise to alert management and recommend involving older employees more so their voices can be heard.
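One way to make this comparison concrete is to divide each group's share of survey respondents by its share of the workforce. The sketch below assumes we have headcounts per age group for both the survey and the company. The `representation_ratio` helper, the group labels, and all the numbers are hypothetical.

```python
def representation_ratio(survey_counts, workforce_counts):
    """Ratio of survey share to workforce share for each group.

    A ratio above 1 suggests over-representation in the survey,
    below 1 under-representation. Assumes every workforce group
    has at least one employee.
    """
    survey_total = sum(survey_counts.values())
    workforce_total = sum(workforce_counts.values())
    ratios = {}
    for group, workforce_n in workforce_counts.items():
        survey_share = survey_counts.get(group, 0) / survey_total
        workforce_share = workforce_n / workforce_total
        ratios[group] = survey_share / workforce_share
    return ratios

# Made-up headcounts for illustration only (assumption).
survey = {"under 43": 60, "43 and over": 40}
workforce = {"under 43": 500, "43 and over": 500}

for group, ratio in representation_ratio(survey, workforce).items():
    print(f"{group}: {ratio:.2f}")
```

How far a ratio can drift from 1 before it is worth raising with management is a judgment call that depends on the survey's purpose.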

6. Example: remote frequency vs. preference

Visualizations are also very helpful in investigating how multiple variables are related. This plot shows the relationship between remote working preferences and how often people actually work remotely. The circle size reflects the number of people for each combination of remote working preference and actual remote working percentage. If both align, we expect to see the largest circles along the diagonal. If employees tend to work remotely less frequently than they prefer, larger circles would appear below the diagonal. Conversely, if they work from home more than they prefer, larger circles would appear above the diagonal.
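The circle sizes in a plot like this come from counting how many respondents fall into each combination. The sketch below tallies those combinations and summarizes how many respondents are on, below, or above their preference. The `alignment_summary` helper, the `(preferred, actual)` tuple layout, and the responses are assumptions made for illustration.

```python
from collections import Counter

def alignment_summary(responses):
    """Count (preferred, actual) pairs and summarize how they align.

    Each response is a tuple of remote working percentages:
    (preferred_pct, actual_pct).
    """
    pair_counts = Counter(responses)
    summary = Counter()
    for (preferred, actual), n in pair_counts.items():
        if actual == preferred:
            summary["aligned"] += n
        elif actual < preferred:
            summary["less remote than preferred"] += n
        else:
            summary["more remote than preferred"] += n
    return pair_counts, dict(summary)

# Made-up responses for illustration only (assumption).
responses = [(50, 50), (50, 25), (0, 25), (50, 25), (100, 100)]

pair_counts, summary = alignment_summary(responses)
for pair, n in sorted(pair_counts.items()):
    print(f"preferred={pair[0]}%, actual={pair[1]}%: {n}")
print(summary)
```

Each entry in `pair_counts` corresponds to one circle, and its count would set the circle's size.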

7. Example: remote frequency vs. preference

The plot shows that frequency and preference largely align in our case.

8. Example: remote frequency vs. preference

However, some larger dots appear in the 50% group where remote preference is higher than the actual frequency. Note that these are general tendencies in the data and might hide some more specific differences between subgroups. For example, maybe younger or less senior employees are less likely to work remotely according to their preference. We'll dive deeper into this during our main analysis.

9. Let's practice!

That's it for this video. Time for you to explore the data!