Anatomy of a Machine Learning Model
Now, you will reinforce your understanding of how data influences model performance. You will be working with the Airbnb booking dataset (in the file booking.csv). The dataset is suited to classification tasks that predict whether someone will cancel a booking. It contains several numerical and categorical columns.
You will split the provided dataset into three mutually exclusive samples (train_A.csv, train_B.csv, and test.csv) using the split_dataset.py script. Then, for each training dataset, you'll run the data processing and model training pipeline with model_training.py to train a Random Forest Classifier and test its performance on the test set. The hyperparameters defined in params.json are the same in both runs, so any difference in performance comes from the training data.
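To make "mutually exclusive" concrete, here is a minimal sketch of a three-way split. It is not the code in split_dataset.py; the function name `split_three`, the 20% test fraction, the seed, and the `booking_id` field are all assumptions chosen for illustration, and a toy list stands in for the rows of booking.csv.

```python
import random


def split_three(rows, test_fraction=0.2, seed=42):
    """Shuffle rows and split them into train_A, train_B and test with no overlap.

    Hypothetical helper: the fraction and seed are illustrative defaults,
    not values taken from the exercise.
    """
    indices = list(range(len(rows)))
    # A seeded generator makes the shuffle, and so the split, reproducible.
    random.Random(seed).shuffle(indices)

    n_test = int(len(rows) * test_fraction)
    n_train_a = (len(rows) - n_test) // 2

    # Consecutive, non-overlapping slices of the shuffled index list
    # guarantee that every row lands in exactly one sample.
    test_idx = indices[:n_test]
    train_a_idx = indices[n_test:n_test + n_train_a]
    train_b_idx = indices[n_test + n_train_a:]

    def pick(idx):
        return [rows[i] for i in idx]

    return pick(train_a_idx), pick(train_b_idx), pick(test_idx)


# Toy stand-in for the rows of booking.csv ("booking_id" is a made-up column).
rows = [{"booking_id": i} for i in range(10)]
train_a, train_b, test = split_three(rows)
```

Because both training samples are evaluated against the same held-out test rows and the same hyperparameters, a gap in the resulting scores can be attributed to the training data rather than to the evaluation setup.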
The Python scripts are designed to accept command-line arguments and run from the shell. Feel free to explore these scripts to enrich your understanding.
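As a rough idea of how a script can accept command-line arguments, the sketch below uses Python's standard argparse module. The flag names `--train`, `--test`, and `--params` and their defaults are hypothetical; the actual interface of model_training.py may differ, so check the script itself before running it.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical training script.

    The flags below are illustrative assumptions, not the real
    arguments of model_training.py.
    """
    parser = argparse.ArgumentParser(
        description="Train a model on one training sample and score it on the test set."
    )
    parser.add_argument("--train", required=True,
                        help="path to the training CSV, e.g. train_A.csv")
    parser.add_argument("--test", default="test.csv",
                        help="path to the held-out test CSV")
    parser.add_argument("--params", default="params.json",
                        help="path to the JSON file with hyperparameters")
    return parser


# Passing an explicit list mimics what the shell would supply via sys.argv.
args = build_parser().parse_args(["--train", "train_A.csv"])
print(args.train, args.test, args.params)
```

With an interface like this, the same script can be run once per training sample by changing only the value passed for the training file, while the test set and hyperparameter file stay fixed.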
This exercise is part of the course Introduction to Data Versioning with DVC.
