Relationships between continuous variables

1. Relationships between continuous variables

Continuing with the fourth step of the EDA process, we will now explore the relationships between two continuous variables.

2. What are scatter plots?

Scatter plots expose the relationships between two continuous variables. They are ubiquitous in data analysis and visualization.

3. What are scatter plots?

The basic elements are one continuous variable on the x-axis and another on the y-axis.

4. What are scatter plots?

Each dot in the chart area represents one observation within the dataset. The placement of the dot is determined by that observation's values for the two continuous variables on the x- and y-axes. Here, the further to the right a dot is, the larger the "Total Bill". The higher up it is on the scatter plot, the larger the "Tip".
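To make the mapping from observations to dots concrete, here is a minimal sketch in Python using matplotlib. The course itself does not prescribe a tool, so the library choice is an assumption, and the six data points are made up purely for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical observations (not from the course dataset):
# each position pairs one total bill with the tip left on it.
total_bill = [10.5, 16.0, 21.3, 24.8, 31.2, 40.1]
tip = [1.5, 2.5, 3.0, 3.8, 4.5, 6.0]

fig, ax = plt.subplots()
ax.scatter(total_bill, tip)  # draws one dot per (total_bill, tip) pair
ax.set_xlabel("Total Bill")  # further right = larger bill
ax.set_ylabel("Tip")         # higher up = larger tip
ax.set_title("Tip vs. Total Bill")
```

Calling `fig.savefig(...)` or `plt.show()` afterwards would render the chart; which one fits depends on your environment.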

5. Interpreting a scatter plot

Returning to the previous scatter plot showing tip amount vs. total bill, we would first describe the direction of the relationship between these two variables as positive. Said another way, as the total bill amount increases, so does the tip amount.

6. Interpreting a scatter plot

Trend lines are often helpful for making the direction of the relationship clearer. The goal of the line is to pass as centrally as possible through all of the points, i.e. to model the data.
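The transcript does not say how the trend line is fitted. One common choice is ordinary least squares, which picks the straight line that minimizes the squared vertical distances to the points. The sketch below assumes that method; the function name `fit_trend_line` and the data are hypothetical.

```python
def fit_trend_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and variation of x, around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations, for illustration only.
total_bill = [10.0, 20.0, 30.0, 40.0]
tip = [2.0, 3.0, 5.0, 6.0]

slope, intercept = fit_trend_line(total_bill, tip)
print(f"tip = {slope:.2f} * total_bill + {intercept:.2f}")
```

A positive slope corresponds to the upward-sloping line seen in a positive relationship; a negative slope would indicate a negative one.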

7. Interpreting a scatter plot

The strength of a relationship can be determined visually by looking at how dispersed the dots in the scatter plot are. Are they tightly packed as they move positively or negatively? Or is there more space, i.e. variation, among them? In this graph, there appears to be some slight dispersion. Therefore, we could call the relationship "moderately strong".

8. Interpreting a scatter plot

For learning purposes, let's take a look at a few more scatter plots. First, we see a strong, positive relationship between two variables. The cluster of points moves upward to the right: as x increases, so does y. There is also very little dispersion. Next, we see a strong, negative relationship between two variables. The cluster of points moves downward to the right: as x increases, y decreases. Again, there is very little dispersion. Here, we see a weak, positive relationship. The cluster seems to generally move upward to the right, but there is so much variation in the data points that it is difficult to say. Finally, there is a large blob of data points with no real direction. This would indicate little to no relationship.

9. Correlation coefficient

Articulating the relationship between two variables from a scatter plot is helpful, but being able to quantify this relationship can bring rigor to your analysis. The correlation coefficient is such a statistic. It ranges in value from -1 to 1, where values near -1 indicate a strong negative relationship, values near 0 indicate little to no relationship, and values near 1 indicate a strong positive relationship. The details of the calculation are beyond the scope of this course.
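Although the calculation is outside the scope of the course, a short sketch may help make the statistic concrete. It assumes the coefficient meant here is Pearson's correlation coefficient, which the transcript does not state explicitly; the function name `pearson_r` and the data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Dividing by the spread of both variables scales the result into [-1, 1].
    return sxy / math.sqrt(sxx * syy)


# Hypothetical observations, for illustration only.
total_bill = [10.0, 20.0, 30.0, 40.0]
tip = [2.0, 3.0, 5.0, 6.0]

r = pearson_r(total_bill, tip)
print(f"r = {r:.2f}")
```

Points lying exactly on an upward-sloping line give a value of 1, and points exactly on a downward-sloping line give -1; real data usually falls somewhere in between.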

10. Correlation coefficient and scatter plots

Returning to our example scatter plots, let's add the correlation coefficient for each: 0.9 for the strong positive relationship, -0.9 for the strong negative relationship, 0.35 for the weak positive relationship, and, of course, 0 for no relationship.

11. Adding context to a scatter plot

Finally, adding further context to scatter plots can help unlock the story being told by the data and the relationship. A common way to do this is to add other variables as visual elements: color, shape, size of dots, etc. Here we see the variable "Party Size" represented by different colors, where darker colors indicate smaller parties. Intuitively, smaller parties also have smaller bills and therefore smaller tip amounts.
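As a rough illustration of encoding a third variable as color, the sketch below uses matplotlib, which is an assumed tool rather than the one used in the course. The data is made up, and the "viridis" colormap is chosen only because it maps smaller values to darker colors, matching the description above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical observations, for illustration only.
total_bill = [10.5, 16.0, 21.3, 24.8, 31.2, 40.1]
tip = [1.5, 2.5, 3.0, 3.8, 4.5, 6.0]
party_size = [1, 2, 2, 3, 4, 6]

fig, ax = plt.subplots()
# c= maps each dot's party size onto the colormap;
# with "viridis", smaller values are drawn in darker colors.
points = ax.scatter(total_bill, tip, c=party_size, cmap="viridis")
fig.colorbar(points, ax=ax, label="Party Size")
ax.set_xlabel("Total Bill")
ax.set_ylabel("Tip")
```

Dot size or marker shape could carry the extra variable instead of color; color tends to work well for an ordered variable with few distinct values, such as party size.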

12. Let's practice!

Now it's your turn to build and interpret scatter plots.