What is exploratory data analysis?

1. What is exploratory data analysis?

Welcome to the course! My name is Jacob and I'll be your instructor. We will be using the strengths and flexibility of Microsoft's Power BI to perform Exploratory Data Analysis, or simply EDA. By doing so, we will confidently understand new datasets and answer business questions.

2. What is exploratory data analysis?

Wikipedia describes EDA as "an approach of analyzing data sets to summarize main characteristics, often using statistical graphics and other data visualization methods." It is the foundation of any analytics project. Power BI is a great tool for EDA, especially because of its visualization capabilities.

3. Six steps to EDA

There are six steps to EDA: understanding data structure, identifying missing data, describing data with descriptive statistics and distributions, identifying outliers, examining and quantifying relationships between variables, and finally forming hypotheses. This final step will not be covered in this course.

4. Six steps to EDA

In this lesson, we will start with the first three steps.

5. 1. Understanding the data structure

Before diving into any dataset, it's important to know the data structure: the number of rows, the number of columns, and the data types. There are two basic types of variables. Continuous variables are numerical, taking on an infinite set of values. Examples are number of stars, click-through rates, and distance between cities. Categorical variables, or non-numerical variables, can have two or more groups. Some examples here would be house types, country, and company. Knowing this structure helps inform the next steps of EDA.
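Outside Power BI, the same structure check can be sketched in a few lines of plain Python. The records and field names below are invented for illustration and are not taken from the course dataset; the rule "all numeric values means continuous" is a simplifying assumption.

```python
# Hypothetical listing records; field names and values are made up.
listings = [
    {"city": "Paris", "room_type": "Entire place", "price": 120.0, "rating": 4.8},
    {"city": "Rome", "room_type": "Private room", "price": 65.5, "rating": 4.5},
    {"city": "Paris", "room_type": "Private room", "price": 80.0, "rating": 4.9},
]

n_rows = len(listings)
columns = list(listings[0].keys())


def variable_kind(values):
    """Label a column continuous if every value is numeric, else categorical."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "continuous" if numeric else "categorical"


# Map each column name to its (assumed) variable type.
kinds = {col: variable_kind([row[col] for row in listings]) for col in columns}

print(n_rows, "rows,", len(columns), "columns")
print(kinds)
```

In a real dataset a numeric column can still be categorical (an ID or a postal code, for example), so a check like this is only a starting point.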

6. 2. Identifying missing data

Missing data is expected and can have impacts on conclusions you can draw from the data. It is important to determine if values are missing at random or if clear patterns explain their absence. Analyzing missing values by variables in the dataset, or groups within a variable as shown here, will help reveal possible systemic patterns.
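To make the idea of analyzing missing values by group concrete, here is a minimal Python sketch. The cities, ratings, and the use of `None` to mark a missing value are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical records; None marks a missing rating.
listings = [
    {"city": "Paris", "rating": 4.8},
    {"city": "Paris", "rating": None},
    {"city": "Rome", "rating": None},
    {"city": "Rome", "rating": None},
    {"city": "Rome", "rating": 4.1},
    {"city": "Berlin", "rating": 4.6},
]

totals = defaultdict(int)
missing = defaultdict(int)
for row in listings:
    totals[row["city"]] += 1
    if row["rating"] is None:
        missing[row["city"]] += 1

# Share of missing ratings per city; large differences between groups
# hint that values may not be missing at random.
missing_share = {city: missing[city] / totals[city] for city in totals}
print(missing_share)
```

If one group shows a much higher share than the others, that is a cue to investigate why before deciding how to handle the gaps.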

7. 2. Addressing missing data

Pinpointing patterns of missing data can inform the best routes for possible remediation. There are two typical methods. The first is removing rows from the dataset. This may lead to a smaller sample and sometimes not enough data to perform further analysis. The other method is imputation, which is the process of replacing missing data points with another value (such as a median or average). This is especially helpful with small numbers of missing values.
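The two remediation methods can be compared side by side in a short Python sketch. The prices are invented, and the median is just one possible fill value.

```python
from statistics import median

# Hypothetical nightly prices; None marks a missing value.
prices = [100.0, None, 80.0, 120.0, None, 90.0]

# Method 1: remove rows with missing values (the sample shrinks).
dropped = [p for p in prices if p is not None]

# Method 2: impute missing values with the median of the observed values.
fill_value = median(dropped)
imputed = [fill_value if p is None else p for p in prices]

print(len(prices), "rows before,", len(dropped), "rows after dropping")
print("imputed with", fill_value, "->", imputed)
```

Dropping keeps only observed values but loses a third of the rows here, while imputation keeps every row at the cost of introducing values that were never observed.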

8. 3. Describing the data

Next is to describe the distribution of variables with descriptive statistics. Minimum and maximum are the smallest and largest values, respectively. Mean is the sum of all values divided by the number of observations. Median is the middle value when the observations are sorted. Standard deviation measures how far, on average, data points fall from the mean of a variable. We will dive deeper into this statistic in the next lesson.
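These statistics can be computed with Python's standard `statistics` module, as a rough sketch. The prices are made up, and the population standard deviation (`pstdev`) is used here as an assumption; a sample standard deviation would give a slightly larger number.

```python
from statistics import mean, median, pstdev

# Hypothetical nightly prices.
prices = [60.0, 80.0, 90.0, 100.0, 120.0]

summary = {
    "min": min(prices),
    "max": max(prices),
    "mean": mean(prices),        # sum of values / number of observations
    "median": median(prices),    # middle value of the sorted data
    "std_dev": pstdev(prices),   # square root of the mean squared deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```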

9. 3. Describing the data with distributions

The descriptive statistics are often visualized with graphs such as histograms and box plots. The shape of the resulting visualization tells a lot about the distribution - here we see median and mean are the same value. It is symmetrical in that nearly equal amounts of the distribution lie on both sides of the median.

10. 3. Describing the data with distributions

Here are two skewed distributions. The left histogram is right-skewed because of the long tail towards the right. The tail pulls the mean upward, so you can see the median is smaller than the mean. The right histogram is left-skewed because of the long tail towards the left. In this one, the median is larger than the mean.
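The relationship between skew, mean, and median can be checked numerically with a small Python sketch. Both samples are invented so that one has a long right tail and the other a long left tail.

```python
from statistics import mean, median

# Hypothetical samples: one with a long right tail, one with a long left tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
left_skewed = [-20, -4, -3, -3, -3, -2, -2, -1]

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    # The extreme value in the tail moves the mean much more than the median.
    print(name, "mean:", mean(data), "median:", median(data))
```

In the right-skewed sample the single large value drags the mean above the median, and the mirror image happens in the left-skewed sample.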

11. The dataset: AirBnB listings

During the exercises, you will put on the hat of an analyst at a well-known vacation rental platform. You will perform the first three EDA steps using a dataset of AirBnB listings. It contains information such as location, the host's start date, and ratings. You will aim to understand trends in new hosts coming onto the platform. To do so, first familiarize yourself with key variables using descriptive statistics. Then, check the data for completeness and address any missing information.

12. Let's practice!

Now it's your turn to get started!