1. How to visualize data in Python?
Great work grouping data! Now let's talk about what to do if you are asked how to visualize data in Python.
2. matplotlib
The usual way to proceed is to use the matplotlib module.
More precisely, to use its pyplot submodule. Usually, we abbreviate it as plt. We'll consider basic plots such as:
scatter plot
histogram
and boxplot.
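As a quick sketch of the naming convention, the snippet below imports pyplot under the plt alias and lists the pyplot function behind each of the three plot types. The plot_functions dictionary is just an illustrative helper, not part of matplotlib.

```python
# Conventional alias for the pyplot submodule
import matplotlib.pyplot as plt

# Each basic plot type has a matching pyplot function.
# (plot_functions is a hypothetical helper used only for this illustration.)
plot_functions = {
    "scatter plot": plt.scatter,
    "histogram": plt.hist,
    "boxplot": plt.boxplot,
}

for name, func in plot_functions.items():
    print(f"{name}: plt.{func.__name__}()")
```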
3. Dataset
We'll keep working with the diabetes dataset you've seen before. The last column indicates the result of the diabetes test.
4. Scatter plot
Let's start with the scatter plot. It's a simple representation of data points in two-dimensional space, given that each point has valid coordinates. A scatter plot is very useful for examining how two numeric variables relate to each other.
5. Create a scatter plot
Given a DataFrame,
we can create a scatter plot simply by passing the columns of interest to the scatter() function. The order is important: the first argument corresponds to the horizontal axis, and the second to the vertical one. To show the plot, we have to call the show() function. This call should always come last, after all other plotting commands. Now we see our scatter plot. Can you notice what's wrong with it? Right, it has neither a title nor labels, which is very bad practice!
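Here is a minimal sketch of those steps. The tiny DataFrame and its column names (glucose, bmi) are made up for illustration and are not necessarily those of the diabetes dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only; the column names "glucose" and "bmi"
# are assumptions, not necessarily those of the lesson's dataset.
df = pd.DataFrame({
    "glucose": [89, 137, 78, 166, 118],
    "bmi": [28.1, 43.1, 31.0, 25.8, 45.8],
})

# First argument -> horizontal axis, second argument -> vertical axis
points = plt.scatter(df["glucose"], df["bmi"])

# show() comes last, after all other plotting commands
plt.show()
```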
6. Create a scatter plot
To add a title, we can use the title() function.
7. Create a scatter plot
To add labels for the horizontal and vertical axes, we can use the xlabel() and ylabel() functions, respectively. It's OK to skip the title, but NEVER forget to label your axes!
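Putting the title and the axis labels together might look like the sketch below. The data, the column names, and the label texts are all assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only; column names are assumptions.
df = pd.DataFrame({
    "glucose": [89, 137, 78, 166, 118],
    "bmi": [28.1, 43.1, 31.0, 25.8, 45.8],
})

plt.scatter(df["glucose"], df["bmi"])

# Title is optional; axis labels are not
plt.title("BMI vs. glucose")
plt.xlabel("Glucose")
plt.ylabel("BMI")

# show() still comes last
plt.show()
```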
8. Histogram
Let's move on and meet the histogram! It's a special plot showing how our numerical data is distributed. The horizontal space is divided into so-called bins. The height of a bin indicates how many data points fall within the horizontal range it spans. Here, for example, we see that the majority of data points are concentrated around 0.
9. Create a histogram
Let's create a histogram showing the distribution of the BMI values in our diabetes data.
We need to call the hist() function with the chosen column as an argument.
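A minimal sketch of that call follows, assuming a column named bmi and using made-up values. When no bin count is given, matplotlib uses 10 bins by default.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up BMI values for illustration only; the column name is an assumption.
df = pd.DataFrame({
    "bmi": [21.5, 24.3, 25.8, 28.1, 29.7, 31.0, 32.4, 33.6, 35.2, 43.1],
})

# hist() returns the bin counts, the bin edges, and the drawn bars
counts, edges, bars = plt.hist(df["bmi"])

plt.xlabel("BMI")
plt.ylabel("Count")
plt.show()
```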
10. Create a histogram
We can also change the number of bins used to create a histogram. We just need to use the bins keyword argument.
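For instance, the sketch below asks for 5 bins instead of the default. The data and the column name bmi are again made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up BMI values for illustration only; the column name is an assumption.
df = pd.DataFrame({
    "bmi": [21.5, 24.3, 25.8, 28.1, 29.7, 31.0, 32.4, 33.6, 35.2, 43.1],
})

# bins=5 splits the range of the data into five equal-width bins
counts, edges, bars = plt.hist(df["bmi"], bins=5)

plt.xlabel("BMI")
plt.ylabel("Count")
plt.show()
```

Fewer bins give a coarser picture of the distribution, while more bins reveal finer detail at the cost of a noisier plot.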
11. Boxplot
Let's move on to boxplots! Like a histogram, a boxplot shows how our numerical data is distributed. Here, 50% of data points are located within the box, with the orange line indicating the median value. In turn, the whiskers show the spread of our data. What is outside this range is considered an outlier. As you can see, boxplots are great when we want to show whether there is a difference between groups.
12. Create a boxplot
To create a boxplot, it's much easier to use the seaborn module rather than matplotlib. Usually, we abbreviate it as sns.
To visualize the data, we have to use the boxplot() function. We have to specify the data source with the data keyword argument and the column names from this source. In this case, the first argument corresponds to the column with test results, which, as it is a factor, determines the number of boxes we see. The second argument corresponds to the column with BMI values, which represents the actual data for each box.
13. Create a boxplot
We can precisely define what is plotted against the horizontal x axis and against the vertical y axis with the corresponding keyword arguments.
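A minimal sketch with explicit x and y keyword arguments is shown below. The DataFrame, the column names (test, bmi), and the "positive"/"negative" labels are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values for illustration only; column names and the
# "negative"/"positive" labels are assumptions.
df = pd.DataFrame({
    "bmi": [21.5, 24.3, 25.8, 28.1, 29.7, 31.0, 33.6, 35.2, 43.1, 45.8],
    "test": ["negative"] * 5 + ["positive"] * 5,
})

# x: the grouping column (one box per group); y: the numeric data
ax = sns.boxplot(x="test", y="bmi", data=df)

plt.show()
```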
14. Create a boxplot
Swapping the columns assigned to the x and y keyword arguments rotates the boxplot. Finally, notice that with the seaborn module we don't need to specify our axis labels. It's done automatically!
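To illustrate the rotation, the sketch below assigns the numeric column to x and the grouping column to y, which draws horizontal boxes. The data and the column names (test, bmi) are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values for illustration only; column names and the
# "negative"/"positive" labels are assumptions.
df = pd.DataFrame({
    "bmi": [21.5, 24.3, 25.8, 28.1, 29.7, 31.0, 33.6, 35.2, 43.1, 45.8],
    "test": ["negative"] * 5 + ["positive"] * 5,
})

# Numeric data on x, groups on y -> horizontal boxes
ax = sns.boxplot(x="bmi", y="test", data=df)

# seaborn has already labeled the axes with the column names
print(ax.get_xlabel(), ax.get_ylabel())

plt.show()
```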
15. Let's practice!
Of course, there are many more plot types. For the moment though, let's practice with the ones we've reviewed in this lesson.