1. Scatterplots
In this section we will transition from the histogram, which visualizes a single variable, to the scatterplot, which explores the relationship between two variables.
2. Don't be scatter brained
It's called a scatterplot because each dot represents a row in your data, and the points often appear to be scattered across the coordinates. A scatterplot is great because your brain can pick out patterns between the two variables. In tabular form, identifying patterns or relationships can be difficult, especially if you have hundreds or thousands of rows. This small data set shows the number of PhDs awarded in computer science and the annual revenue for gaming arcades in the US. Each row represents a year of data, sorted from smallest to largest.
3. Aha! There is a correlation!
Now let's visualize this data as a scatterplot. In this form, it's much easier to see the general shape running upward and to the right. To put it another way, as the number of computer science PhDs increases, so does annual revenue for arcades. I bet you can picture a straight line running from the bottom left to the upper right. Even if the line has to run in between dots, you can identify the positive slope of your imaginary line. The imaginary line, also called a trendline, represents the relationship: as degrees increase, so does revenue.
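To put a number on the "upward and to the right" shape, here is a minimal sketch that computes the Pearson correlation coefficient by hand. The values in `phds` and `revenue` are made up for illustration and are not the data from the slide, and `pearson_r` is a hypothetical helper name.

```python
from math import sqrt

# Hypothetical values for illustration only; NOT the data shown on the slide.
phds = [1, 2, 3, 4, 5]
revenue = [2, 4, 5, 4, 5]


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable spreads around its own mean.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / sqrt(spread_x * spread_y)


r = pearson_r(phds, revenue)
print(f"correlation: {r:.3f}")
```

A value above 0 matches an upward trend, a value below 0 matches a downward trend, and values closer to 1 or -1 mean the dots sit closer to a straight line.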
4. Making your imagination real
In fact, in Sheets you can add that imaginary line to a scatterplot to make it real. It's called a trendline because it helps you see the overall trend in the relationship. To add one, first click "Insert", then "Chart". Once the chart dialog opens, select "scatter chart" from the drop-down, then make sure your data range is set correctly. This adds a scatterplot to your sheet. By default the trendline is left out, so you have to click "customize" in the dialog and, under "series", check the "trendline" box. You just brought your imagination to life to help you understand variable relationships!
5. How did that work?
The trendline is added, or "fit", so that the overall distance between the line and the points, represented by the green arrows, is minimized. Other DataCamp courses go into much more detail about this type of line fitting. In fact, it's the principle behind the linear models used by data scientists, so check those out if you are interested. For now, we'll stick with stats.
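The fitting idea can be sketched in a few lines of code. This sketch assumes the usual least-squares approach, which minimizes the sum of the squared vertical gaps between the line and the points. The data values are made up for illustration, and `fit_line` and `sum_squared_gaps` are hypothetical helper names.

```python
# Hypothetical values for illustration only; NOT the data shown on the slide.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def fit_line(xs, ys):
    """Least-squares fit: return (slope, intercept) of the best straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_gaps(slope, intercept, xs, ys):
    """Total squared vertical distance between the line and every point."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


slope, intercept = fit_line(xs, ys)
best = sum_squared_gaps(slope, intercept, xs, ys)

# Nudging the fitted line in any direction makes the total gap larger.
steeper = sum_squared_gaps(slope + 0.2, intercept, xs, ys)
higher = sum_squared_gaps(slope, intercept + 0.5, xs, ys)

print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
print(f"gap for fitted line: {best:.2f}")
print(f"gap for steeper line: {steeper:.2f}, for higher line: {higher:.2f}")
```

Squaring the gaps keeps points above and below the line from cancelling each other out, which is why the fitted line ends up running through the middle of the dots.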
6. Stats about the trendline
Speaking of stats, in Sheets you can get the slope and intercept of the trendline using the LINEST, or line estimate, function. It accepts the two ranges, Y values first, and returns the slope, then the intercept. The intercept is the point at which the line touches, or intersects, the vertical Y axis. The slope can be interesting because it describes the trend relationship between the variables. For example, a slope of 1.5 means that as the X variable increases by 1, the Y variable increases by 1.5. In our previous example, a slope of 1 would mean that for each new computer science PhD, $1 billion more in arcade revenue would be realized. And a slope of -1 means that as the X variable increases by 1, the Y variable decreases by 1 (equivalently, as X decreases by 1, Y increases by 1). So negative slopes also tell you something about the relationship.
7. Let's practice!
Go make some scatterplots with trendlines! Good luck.