1
Introduction to R
In this first lab, you'll learn the basics of how to analyze data with R. We recommend this introductory lab if you are not yet familiar with this powerful open-source language.
2
Introduction to data
Some define Statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information - the data. In this lab, we will gain insight into public health by generating simple graphical and numerical summaries of a data set collected by the Centers for Disease Control and Prevention (CDC). As this is a large data set, along the way we'll also learn the indispensable skills of data processing and subsetting.
3
Probability
In this lab, we will investigate the phenomenon of hot hands in basketball, or specifically, whether Kobe Bryant has hot hands. We will make use of simulations in our investigation.
4
Foundations for inference: Sampling distributions
In this two-part lab we will investigate sampling distributions and the Central Limit Theorem as well as confidence intervals. We will use housing data from Ames, Iowa (a small town in the US) in our exploration.
5
Foundations for inference: Confidence intervals
In this two-part lab we will investigate sampling distributions and the Central Limit Theorem as well as confidence intervals. We will use housing data from Ames, Iowa (a small town in the US) in our exploration.
6
Inference for numerical data
In this two-part lab we will work on inference for numerical data. We will use a dataset on births from North Carolina as well as data from the General Social Survey.
7
Inference for categorical data
In this lab we will work on inference for categorical data using data from a world-wide survey on religiosity and atheism.
8
Introduction to linear regression
The movie Moneyball focuses on the "quest for the secret of success in baseball". It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player's ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. In this lab we'll be looking at data from all 30 Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics. Our aim will be to summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team's runs scored in a season.
9
Multiple linear regression
Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor. The article "Beauty in the classroom: instructors' pulchritude and putative pedagogical productivity" (Hamermesh and Parker, 2005) found that instructors who are viewed as better looking receive higher instructional ratings. In this lab we will analyze the data from this study in order to learn what goes into a positive professor evaluation.

Confidence intervals

Now let's return for a moment to the question that first motivated this lab: based on this sample, what can we infer about the population? Based only on this single sample, the best estimate of the average living area of houses sold in Ames would be the sample mean, usually denoted as $\bar{x}$ (here we're calling it sample_mean).

The sample mean serves as a good point estimate, but it would be useful to also communicate how uncertain we are of that estimate. This uncertainty can be captured with a confidence interval.

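We can calculate a 95% confidence interval for a sample mean by adding 1.96 standard errors to the point estimate and subtracting 1.96 standard errors from it. The value 1.96 is used because about 95% of a normal distribution lies within 1.96 standard deviations of its mean.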

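# Standard error of the mean for our sample of 60 houses
se <- sd(samp) / sqrt(60)
# 95% interval: point estimate plus or minus 1.96 standard errors
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)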
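The calculation above relies on samp and sample_mean already existing in the workspace. As a minimal self-contained sketch, the same steps can be run on made-up data. Here population is a hypothetical, synthetic stand-in (not the Ames housing data), and z_star and ci are illustrative names; qnorm(0.975) is used to show where the 1.96 multiplier comes from.

```r
# Synthetic, right-skewed stand-in for a population of living areas.
# These values are simulated for illustration; they are NOT the Ames data.
set.seed(42)
population <- rgamma(2930, shape = 9, rate = 0.006)

# Draw one sample of 60 and compute the point estimate
n <- 60
samp <- sample(population, n)
sample_mean <- mean(samp)

# Standard error of the sample mean, using the actual sample size
se <- sd(samp) / sqrt(length(samp))

# Critical value for a 95% interval; approximately 1.96
z_star <- qnorm(0.975)

# Lower and upper bounds of the 95% confidence interval
ci <- c(lower = sample_mean - z_star * se,
        upper = sample_mean + z_star * se)
ci
```

Using length(samp) instead of a hard-coded 60 keeps the standard error correct if the sample size is changed later.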

The interval supports an important inference: even though we don't know what the full population looks like, we're 95% confident that the true average size of houses in Ames lies between the values lower and upper.
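One way to see what "95% confident" means is to repeat the procedure many times on a population whose mean is known and count how often the interval captures it. The sketch below assumes a synthetic population (simulated values, not the Ames data); population, true_mean, n_intervals, covers and coverage are hypothetical names introduced only for this illustration.

```r
# Synthetic population with a known mean (simulated, NOT the Ames data)
set.seed(1)
population <- rgamma(2930, shape = 9, rate = 0.006)
true_mean <- mean(population)

n <- 60              # size of each sample
n_intervals <- 1000  # number of repeated samples

# For each repeated sample, build a 95% interval and record
# whether it contains the true population mean
covers <- replicate(n_intervals, {
  samp <- sample(population, n)
  se <- sd(samp) / sqrt(n)
  lower <- mean(samp) - 1.96 * se
  upper <- mean(samp) + 1.96 * se
  lower <= true_mean && true_mean <= upper
})

# Proportion of intervals that captured the true mean
coverage <- mean(covers)
coverage
```

Under these assumptions the proportion should come out close to 0.95, though it will vary somewhat with the seed and the shape of the population.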

Calculate the 95% confidence interval as described above.
