Background on modeling for explanation

1. Background on modeling for explanation

Hello, welcome to the next course in DataCamp's "Learn the tidyverse" track: "Modeling with data in the tidyverse". In this course, you'll leverage the data wrangling and visualization toolbox you developed in previous courses to learn about modeling. The ideas behind modeling are crucial to many fields, including statistics, causal inference, machine learning, and artificial intelligence.

2. Course overview

You'll start by equipping yourself with some theory and terminology related to modeling. In Chapters 2 and 3, you'll learn one of the most widely used techniques for modeling: linear regression. You'll end by assessing the quality of models. For example, how well does a model fit the given data? Or how good are a model's predictions?

3. General modeling framework formula

Let's start with the general modeling framework, as expressed by the following formula:

$y = f(\vec{x}) + \epsilon$

where you have:
- y, an outcome variable, the phenomenon you wish to model.
- x, a set of explanatory or predictor variables used to inform your model. The arrow on the x indicates that x can be a vector, in other words a series of values.
- f, a mathematical function making explicit the relationship between y and x. f(x) is also called the "signal".
- And finally epsilon, an unsystematic error component. epsilon is also called the "noise".

Let's first focus only on y and x, and revisit f and epsilon later.
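To make the "signal plus noise" idea concrete, here is a minimal sketch in base R that simulates data following this framework. Everything specific in it is an assumption made for illustration and does not come from the course: the linear form of f, its coefficients, the range of x, the normal noise, and the seed.

```r
# A hypothetical simulation of y = f(x) + epsilon (all values are made up)
set.seed(76)                            # arbitrary seed, only for reproducibility
n <- 100                                # number of simulated observations

x <- runif(n, min = 0, max = 10)        # a single explanatory/predictor variable
f <- function(x) 2 + 0.5 * x            # assumed signal: a straight line
epsilon <- rnorm(n, mean = 0, sd = 1)   # assumed noise: unsystematic error

y <- f(x) + epsilon                     # outcome = signal + noise

# Look at the first few rows, with the signal and noise shown separately
head(data.frame(x, y, signal = f(x), noise = epsilon))
```

In real data you only observe y and x; f and epsilon are unknown, which is why modeling is needed to separate the two.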

4. Two modeling scenarios

Previously I called x both explanatory and predictor variables. Which term you use depends roughly on which modeling scenario you're addressing:
- If you want to explain what factors are associated with or cause the outcome variable, you are "modeling for explanation" and thus x are "explanatory" variables.
- If you want to make predictions of the outcome variable, you are "modeling for prediction" and thus x are "predictor" variables.

Let's start with an example of modeling for explanation.

5. Modeling for explanation example

At the end of academic terms at many universities and colleges, instructors are given teaching evaluation scores by students. A study conducted at the University of Texas at Austin investigated whether differences in scores could be explained by differences in instructor attributes. The outcome variable is the average teaching score for different courses. Explanatory variables include:
- rank
- gender, which at the time of this study was recorded as a binary variable: male or female
- age
- And even the instructor's "beauty score" bty_avg. We'll talk more about that later.

6. Modeling for explanation example

The evals data frame included in the moderndive package contains this data. The moderndive package accompanies ModernDive.com, an open-source electronic textbook on statistical and data sciences that Chester Ismay of DataCamp and I have co-authored. This package includes other data and functions you'll be using in this course. Let's preview the data using the glimpse() function from the dplyr package. Observe that there are 463 instructors and 13 variables in this data frame.
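The preview described here can be sketched as follows, assuming the dplyr and moderndive packages are installed. The exact column listing you see depends on the installed version of moderndive.

```r
# Load the packages named in the narration
library(dplyr)
library(moderndive)

# Preview the evals data frame: one line per variable, showing its type
# and the first few values
glimpse(evals)
```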

7. Exploratory data analysis

A crucial first step is an exploratory data analysis, or EDA. EDA gives you a sense of your data and it can help inform model construction. There are three basic steps to an EDA:
- Most fundamentally, looking at the data, via a spreadsheet viewer or using glimpse() as I did earlier.
- Creating visualizations.
- Computing summary statistics.

Let's do this for the outcome variable score.

8. Exploratory data analysis

Since score is numerical, let's construct a histogram to visualize its distribution by using geom_histogram() from the ggplot2 package, where the x aesthetic is mapped to score. Let's also set a binwidth of 0.25.
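A minimal sketch of this histogram, assuming ggplot2 and moderndive are installed. The axis labels and the object name score_hist are illustrative choices, not taken from the course slides.

```r
library(ggplot2)
library(moderndive)

# Histogram of the outcome variable: score on the x-axis, bins 0.25 wide
score_hist <- ggplot(evals, aes(x = score)) +
  geom_histogram(binwidth = 0.25) +
  labs(x = "Teaching score", y = "Count")   # labels chosen for illustration

# Printing the object draws the plot
score_hist
```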

9. Exploratory data analysis

Observe that the largest score is 5 and most scores fall between about 3 and 5. But what's the average? Let's perform the third step in our EDA, computing summary statistics.

10. Exploratory data analysis

Summary statistics summarize many values with a single value called a statistic. Let's compute three such summary statistics using the summarize() function. The mean, or average, score is 4.17, whereas the median of 4.3 indicates about half the instructors had scores below 4.3 and about half above. The standard deviation, a measure of spread and variation, is 0.544.
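The three summary statistics above could be computed along these lines, assuming dplyr and moderndive are installed. The column names mean_score, median_score, and sd_score, and the object name score_summary, are illustrative.

```r
library(dplyr)
library(moderndive)

# Collapse all teaching scores into three single-value summaries
score_summary <- evals %>%
  summarize(
    mean_score   = mean(score),     # the average
    median_score = median(score),   # the middle value
    sd_score     = sd(score)        # spread around the mean
  )

score_summary
```

The result is a one-row data frame, so it can be piped into further dplyr steps if needed.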

11. Let's practice!

In our first exercise, you'll be performing an EDA on a different numerical variable, this time instructor age.