Data encoding

Encoding categorical data converts it into a numeric form that machine learning algorithms can use. R encodes factors internally (for example, inside lm()), but you need to encode them explicitly when you develop your own models.

In this exercise, you'll first build a linear model using lm() and then develop your own model step-by-step.

In one-hot encoding, a separate indicator (0/1) column is created for each level of the factor.
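As a minimal base R sketch of this idea, the snippet below builds one indicator column per level with model.matrix(). The data frame `df` and its levels "A", "B", "C" are hypothetical toy data chosen to match the names used in this exercise.

```r
# Hypothetical toy data; `df` and `category` mirror the names used in this exercise
df <- data.frame(category = factor(c("A", "B", "C", "A")))

# "- 1" removes the intercept, so every level keeps its own indicator column
one_hot <- model.matrix(~ category - 1, data = df)

# Use the plain level names as column names for readability
colnames(one_hot) <- levels(df$category)
one_hot
```

Each row contains exactly one 1, in the column of that observation's level.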

Note that any one of the columns can be derived from the others (e.g. 0s in columns "B" and "C" imply a 1 in column "A"). Because of this redundancy, the full set of columns is collinear with the model's intercept, so you can drop the first column for linear regression. We will review linear models in more detail in the next chapter.
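The redundancy can be checked directly. The sketch below again assumes a hypothetical toy `df` with levels "A", "B", "C", and uses base R only.

```r
# Hypothetical toy data with three levels
df <- data.frame(category = factor(c("A", "B", "C", "A")))
one_hot <- model.matrix(~ category - 1, data = df)
colnames(one_hot) <- levels(df$category)

# Column "A" is fully determined by the others: A = 1 - B - C
a_is_redundant <- all(one_hot[, "A"] == 1 - one_hot[, "B"] - one_hot[, "C"])
a_is_redundant

# Drop the first column and keep only the non-redundant indicators
reduced <- one_hot[, -1, drop = FALSE]
reduced
```

With the first column dropped, level "A" acts as the reference level: a row of all 0s means "A".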

For one hot encoding, you can use dummyVars() from the caret package.

To use it, first create the encoder and then transform the dataset:

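# Create the encoder from a one-sided formula naming the factor to encode
encoder <- dummyVars(~ category, data = df)
# Apply the encoder; returns a numeric matrix of indicator columns
predict(encoder, newdata = df)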
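A self-contained sketch of these two steps is shown here on a hypothetical toy `df`, since the snippet above does not define one. It also uses the `fullRank = TRUE` argument of dummyVars(), which drops the first level's column; whether the exercise intends that argument is an assumption.

```r
library(caret)

# Hypothetical toy data standing in for `df` above
df <- data.frame(category = factor(c("A", "B", "C", "A")))

# Step 1: create the encoder
encoder <- dummyVars(~ category, data = df)
# Step 2: transform the data; one indicator column per level
full <- predict(encoder, newdata = df)

# fullRank = TRUE omits the first level's column, avoiding the redundancy
encoder_fr <- dummyVars(~ category, data = df, fullRank = TRUE)
reduced <- predict(encoder_fr, newdata = df)

dim(full)
dim(reduced)
```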

The complete cases of the survey dataset from the MASS package are available as survey. The caret package has been preloaded.
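Outside the course environment, the `survey` object could be rebuilt roughly as follows. This is a sketch that assumes "complete cases" means the rows of MASS::survey with no missing values in any column; the course's own preprocessing is not shown here.

```r
# Assumption: keep only rows of MASS::survey without any missing values
survey <- MASS::survey[complete.cases(MASS::survey), ]

# Exer is the categorical predictor used in this exercise
levels(survey$Exer)
table(survey$Exer)
```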

This exercise is part of the course

Practicing Statistics Interview Questions in R

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Fit a linear model
lm(___ ~ Exer, data = ___)

Want to increase your odds of acing your job interview? If so, brush up on your knowledge of probability theory. In this chapter, we'll roll dice and shoot baskets to explain probabilities using real-life examples.

Exercise 1: Discrete distributions
Exercise 2: Probability functions
Exercise 3: Bernoulli trials
Exercise 4: Binomial distribution
Exercise 5: Continuous distributions
Exercise 6: Uniform distribution
Exercise 7: Shape of normal distribution
Exercise 8: Sample from normal distribution
Exercise 9: Central limit theorem
Exercise 10: Law of large numbers
Exercise 11: Simulating central limit theorem

If the job description appeals to you, review descriptive statistics before the interview. In this chapter, you will practice exploratory data analysis (EDA) using natural gas prices and data from a survey analysis.

Exercise 1: Descriptive statistics
Exercise 2: Centrality measures
Exercise 3: Variability measures
Exercise 4: Categorical data
Exercise 5: Survey analysis
Exercise 6: Data encoding (current exercise)
Exercise 7: Time series
Exercise 8: Time series object
Exercise 9: Wrangling time series
Exercise 10: Principal Component Analysis
Exercise 11: PCA - rotation
Exercise 12: PCA - dimension reduction

March confidently into your job interview after reviewing confidence intervals. We'll review the t-test, ANOVA, and normality tests to prepare you for statistics-based coding questions.

Exercise 1: Normality tests
Exercise 2: Shapiro-Wilk test
Exercise 3: Q-Q plot
Exercise 4: Inference for a mean
Exercise 5: Confidence interval
Exercise 6: One-sample t-test
Exercise 7: Comparing two means
Exercise 8: Two-sample t-test
Exercise 9: Paired test
Exercise 10: ANOVA
Exercise 11: Comparing groups
Exercise 12: ANOVA for plant growth

Is your potential employer planning to test your R skills? Make sure you’re prepared and practice model evaluation beforehand. In this chapter, we will fit and evaluate linear and logistic regression models using various biomedical datasets. By the end of this chapter, you’ll be fully prepared to answer any question the interviewer throws your way!

Exercise 1: Covariance and correlation
Exercise 2: Covariance by hand
Exercise 3: Linear relationship
Exercise 4: Nonlinear relationship
Exercise 5: Linear regression model
Exercise 6: Fitting linear models
Exercise 7: Predicting with linear models
Exercise 8: Logistic regression model
Exercise 9: Fitting logistic models
Exercise 10: Predicting with logistic models
Exercise 11: Model evaluation
Exercise 12: Validation set approach
Exercise 13: Regression evaluation
Exercise 14: Classification evaluation
Exercise 15: Wrapping up