
Anatomy of a Machine Learning Model

In this exercise, you will reinforce your understanding of how data influences model performance. You will work with the Airbnb booking dataset (in the file booking.csv), which is suited to classification: the task is to predict whether someone will cancel a booking. It contains several numerical and categorical columns. Using the split_dataset.py script, you will split the dataset into three mutually exclusive samples: train_A.csv, train_B.csv, and test.csv. Then, for each training dataset, you will use model_training.py to run the data processing and model training pipeline, which trains a Random Forest Classifier and evaluates it on the test set. The hyperparameters defined in params.json are the same in both runs, so the training data is the only input that changes between them.
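To make "three mutually exclusive samples" concrete, here is a minimal sketch of one way such a split could work. It is not the contents of split_dataset.py, which is not shown here; the function name three_way_split, the test_frac and seed parameters, and the booking_id field are all hypothetical, and the standard library is used instead of a data-frame library to keep the example self-contained.

```python
import random


def three_way_split(rows, test_frac=0.2, seed=42):
    """Shuffle rows once, then cut them into train_A, train_B, and test.

    Hypothetical helper: the real split_dataset.py may use different
    proportions or a different method entirely.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable

    n_test = int(len(rows) * test_frac)
    n_train_a = (len(rows) - n_test) // 2

    # Consecutive slices of one shuffled index list cannot overlap.
    test_idx = indices[:n_test]
    train_a_idx = indices[n_test:n_test + n_train_a]
    train_b_idx = indices[n_test + n_train_a:]

    def pick(idx):
        return [rows[i] for i in idx]

    return pick(train_a_idx), pick(train_b_idx), pick(test_idx)


# Stand-in data; booking_id is an assumed column name, not one from booking.csv.
rows = [{"booking_id": i} for i in range(10)]
train_a, train_b, test = three_way_split(rows)
print(len(train_a), len(train_b), len(test))
```

Because every row index lands in exactly one slice, no booking appears in more than one sample, which is what keeps the test set a fair measure for both training runs.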

The Python scripts accept command-line arguments and are run from the shell. Feel free to explore them to deepen your understanding.
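As an illustration of how a script can accept command-line arguments, the sketch below uses Python's standard argparse module. The actual arguments of split_dataset.py and model_training.py are not shown here, so the flag names --train, --test, and --params are assumptions chosen only to match the file names mentioned above.

```python
import argparse


def build_parser():
    """Build a parser with hypothetical flags for a training script."""
    parser = argparse.ArgumentParser(
        description="Train a model on one training file and score it on a test file."
    )
    # All three flag names are assumptions, not the real script's interface.
    parser.add_argument("--train", required=True, help="path to the training CSV")
    parser.add_argument("--test", default="test.csv", help="path to the test CSV")
    parser.add_argument("--params", default="params.json", help="hyperparameter file")
    return parser


# Passing an explicit list mimics: python model_training.py --train train_A.csv
args = build_parser().parse_args(["--train", "train_A.csv"])
print(args.train, args.test, args.params)
```

With an interface like this, the two runs would differ only in the value passed for the training file, while the test file and hyperparameters stay fixed.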

This exercise is part of the course

Introduction to Data Versioning with DVC

