1. Welcome to the course!
Welcome! My name is Maarten, and I'll be guiding you through this course.
2. Exploratory Data Analysis (EDA)
In 1977, the statistician John Tukey highlighted the importance of exploratory data analysis, or EDA. EDA allows you to quickly determine the main characteristics of your data, spot extreme values, suggest hypotheses, and assess assumptions for statistical models. The goal of EDA is to get an idea of the overall structure of your data. Tableau has many features to help you with this process.
In this first chapter, we will focus on univariate EDA, or, simply put, looking at one variable at a time. To do this, we'll use tables, bar plots, histograms, and box plots.
3. Tables & bar plots
The simplest way to summarize a single categorical variable is to count the occurrences of each category. You can display the counts as a table or as a bar plot, as shown here. The choice between a table and a plot depends on the context.
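If you want to see the counting step outside Tableau, here is a minimal Python sketch. The category labels and the list of values are made up for illustration and are not the course's dataset.

```python
from collections import Counter

# Hypothetical data: one category label per observation.
categories = ["A", "B", "D", "D", "C", "D", "A", "D"]

# Count how often each category occurs.
counts = Counter(categories)

# Display the counts as a table, most frequent category first.
for category, count in counts.most_common():
    print(f"{category:<10}{count:>5}")

# Display the same counts as a rough text bar plot.
for category, count in sorted(counts.items()):
    print(f"{category} {'#' * count}")
```

Both outputs come from the same counts; only the presentation differs.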
4. When to use a table vs. a plot
Sometimes, a table is the best way to present your data: when your audience is only interested in a snapshot of a particular dataset, only wants to compare a few values, or needs the exact values because a small difference between them is crucial. A table can also be the most precise option if you present your data in printed form rather than as an interactive visualization.
Take the example on the right: the smaller values of categories A and B disappear when plotted against the much larger value of D. In this case, a table gives the most accurate and truthful result.
You will see more examples later where a plot is the better option.
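To see why small values can vanish in a bar plot, here is a rough Python sketch of the table-versus-plot example above. The numbers, the category C, and the 40-character plot width are all assumptions chosen for illustration.

```python
# Hypothetical values: A and B are tiny compared with D.
values = {"A": 3, "B": 5, "C": 1200, "D": 10000}

# Assume the plot is 40 characters wide; the largest bar fills the full width.
max_width = 40
largest = max(values.values())

# Scale every value to a bar length, rounded to whole characters.
bar_lengths = {
    name: round(value / largest * max_width) for name, value in values.items()
}

for name, value in values.items():
    # The table column keeps the exact value; the bar may round down to nothing.
    print(f"{name} {value:>6} {'#' * bar_lengths[name]}")
```

In this sketch, the bars for A and B have a length of zero, while the table column still shows their exact values.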
5. Histograms
When you're dealing with a single, continuous variable, there is one main way to look at the distribution of your data: plotting a histogram.
A histogram shows the data's lowest and highest values, and which values are most common. This is achieved by splitting the variable into several bins, or ranges of your data, with the height of each bar corresponding to the number of observations in that range.
In this example, the variable "Quantity", representing a fictional number of items ordered per customer, is split into bins of one item.
By creating a separate bin per item, you can see that the minimum number of items ordered is 1 and the maximum is 14. The height of each bar represents how many customers ordered that number of items. In this case, most customers ordered 2 or 3 items, and a minority of customers ordered 10 items or more.
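The same idea can be sketched in Python. The quantities below are invented to resemble the example and are not the data behind the slide.

```python
from collections import Counter

# Hypothetical number of items ordered per customer.
quantities = [1, 2, 2, 3, 3, 3, 2, 4, 5, 2, 3, 7, 10, 14, 3, 2]

# With a bin size of one, each distinct quantity gets its own bin.
bin_counts = Counter(quantities)
lowest, highest = min(quantities), max(quantities)

# Print one bar per bin, including empty bins between the extremes.
for value in range(lowest, highest + 1):
    print(f"{value:>2} | {'#' * bin_counts[value]}")
```

The first and last rows show the lowest and highest values, and the longest bars show which quantities are most common.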
6. Size of bins
How your histogram looks depends strongly on the size of the bins, or bin width. Both histograms use the same data as on the previous slide, but with different bin sizes. Choosing a bin width that is too narrow can make the distribution look noisy, while a bin width that is too wide removes detail. Tableau will suggest a bin size based on the number of values and the spread of the variable you are binning, but it is best practice to experiment with different bin sizes to find one that helps you answer your question.
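Here is a small Python sketch of how the bin width changes the picture. The `histogram` helper and the data are hypothetical, and the bins are simply aligned to multiples of the bin width, which is not necessarily how Tableau places them.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values per bin; each bin is labelled by its lower edge."""
    return Counter((v // bin_width) * bin_width for v in values)

# Hypothetical number of items ordered per customer.
quantities = [1, 2, 2, 3, 3, 3, 2, 4, 5, 2, 3, 7, 10, 14, 3, 2]

narrow = histogram(quantities, 1)  # one bin per item: detailed
wide = histogram(quantities, 5)    # five items per bin: coarse

for label, bins in (("bin width 1", narrow), ("bin width 5", wide)):
    print(label)
    for lower_edge in sorted(bins):
        print(f"{lower_edge:>3} | {'#' * bins[lower_edge]}")
```

With a bin width of five, the difference between customers who ordered 2 items and those who ordered 3 is no longer visible, because both end up in the same bar.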
7. Modality
Another characteristic of histograms is that they allow you to spot the modality, or the number of peaks. A distribution with one peak is called unimodal; a distribution with two peaks is called bimodal, and so on.
The mode of a distribution is the most frequently occurring value, or, in a histogram with a bin size of one, the value under the highest bar.
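As a rough illustration, the Python sketch below counts peaks and finds the mode for two made-up datasets. The `count_peaks` helper is a hypothetical simplification: it only counts a bar as a peak when it is strictly higher than both neighbours, so noisy data would produce extra peaks and flat-topped peaks would be missed.

```python
from collections import Counter

def count_peaks(values):
    """Count bars that are strictly higher than both neighbours (bin size one)."""
    counts = Counter(values)
    heights = [counts[v] for v in range(min(values), max(values) + 1)]
    # Pad with zeros so the first and last bars can also be peaks.
    padded = [0] + heights + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i - 1] < padded[i] > padded[i + 1]
    )

# Hypothetical integer data with one and two peaks.
unimodal = [1, 2, 2, 3, 3, 3, 4, 4, 5]
bimodal = [1, 2, 2, 2, 3, 6, 7, 7, 7, 7, 8]

print(count_peaks(unimodal))  # number of peaks in the first dataset
print(count_peaks(bimodal))   # number of peaks in the second dataset

# The mode is the value under the highest bar.
mode_value, mode_count = Counter(bimodal).most_common(1)[0]
print(mode_value, mode_count)
```

Note that a bimodal distribution has two peaks but still a single highest bar, unless both peaks happen to be equally high.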
8. Let's practice!
Let's put your knowledge to the test.