Anatomy of a Machine Learning Model

In this exercise, you will reinforce your understanding of how data influences model performance. You will work with the Airbnb booking dataset (in the file booking.csv), which is suited to classification tasks: here, predicting whether someone will cancel a booking. It contains several numerical and categorical columns. You will split the dataset into three mutually exclusive samples (train_A.csv, train_B.csv, and test.csv) using the split_dataset.py script. Then, for each training dataset, you will use model_training.py to run the data processing and model training pipeline, which trains a Random Forest Classifier and evaluates it on the test set. The hyperparameters defined in params.json are the same in both runs, so differences in performance reflect the training data rather than the model configuration.
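The contents of split_dataset.py are not shown here, so the following is only a minimal sketch of what "three mutually exclusive samples" could look like in practice. The function name `split_three_ways`, the 20% test fraction, and the fixed seed are assumptions made for illustration, and a list of integers stands in for the rows of booking.csv.

```python
import random


def split_three_ways(rows, test_fraction=0.2, seed=42):
    """Shuffle rows and cut them into train_A, train_B, and test with no overlap.

    The test fraction and seed are illustrative defaults, not values
    taken from the course's split_dataset.py.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible

    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    remaining = shuffled[n_test:]

    half = len(remaining) // 2
    train_a = remaining[:half]
    train_b = remaining[half:]
    return train_a, train_b, test


rows = list(range(100))  # stand-in for the row indices of booking.csv
train_a, train_b, test = split_three_ways(rows)
print(len(train_a), len(train_b), len(test))  # prints: 40 40 20
```

Because each row lands in exactly one slice of the shuffled list, no booking can appear in both a training sample and the test set, which is what makes the comparison between the two training runs fair.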

The Python scripts accept command-line arguments and are run from the shell. Feel free to explore them to deepen your understanding.
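As a rough illustration of how a script can accept command-line arguments, here is a sketch using Python's standard argparse module. The flag names (`--input`, `--test-fraction`, `--seed`) and their defaults are hypothetical; the actual arguments of split_dataset.py and model_training.py may differ, so check the scripts themselves.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical dataset-splitting script."""
    parser = argparse.ArgumentParser(
        description="Hypothetical CLI for a dataset-splitting script"
    )
    # All flags below are illustrative assumptions, not the course's real interface.
    parser.add_argument("--input", default="booking.csv",
                        help="path to the full dataset")
    parser.add_argument("--test-fraction", type=float, default=0.2,
                        help="share of rows held out for testing")
    parser.add_argument("--seed", type=int, default=42,
                        help="random seed for a reproducible split")
    return parser


# Parse an explicit argument list here; a real script would call
# parse_args() with no arguments to read them from the shell.
args = build_parser().parse_args(["--input", "booking.csv", "--seed", "7"])
print(args.input, args.test_fraction, args.seed)  # prints: booking.csv 0.2 7
```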

This exercise is part of the course

Introduction to Data Versioning with DVC

This chapter provides a comprehensive introduction to Data Version Control (DVC), a tool essential for data versioning in machine learning. Learners will explore the motivation behind data versioning, understand its differences from code versioning, and experiment with a simple classification problem. They will review basic Git commands, learn about DVC, and practice setting up a repository. The chapter concludes with an overview of DVC’s features and use cases, including versioning data and models, CI/CD for machine learning, experiment tracking, pipelines, and more.

Exercise 1: Data Versioning Motivation
Exercise 2: Anatomy of a Machine Learning Model (current exercise)
Exercise 3: Differences Between Data and Code Versioning
Exercise 4: Understanding Hyperparameters
Exercise 5: Introduction to DVC
Exercise 6: Working with Git CLI
Exercise 7: Review DVC CLI
Exercise 8: DVC features and use cases
Exercise 9: DVC pipelines
Exercise 10: CI/CD for machine learning

This chapter covers DVC setup: installation, repository initialization, and use of the .dvcignore file. It then explores the DVC cache and staging files, showing how to add and remove files, manage caches, and understand the underlying mechanism based on MD5 hashes. The chapter also explains DVC remotes, distinguishes them from Git remotes, and shows how to add, list, and modify them. Finally, it teaches you how to interact with these remotes by pushing and pulling data, checking out specific versions, and fetching data to the cache.
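To make the role of the MD5 hash more concrete, the sketch below uses Python's standard hashlib module to show that identical content always yields the same hash, while any change in content yields a different one. This is the general idea that lets a tool identify a version of a file by its content. The helper name `md5_of_bytes` and the tiny CSV snippets are made up for illustration, and the sketch does not reproduce how DVC itself organizes its cache.

```python
import hashlib


def md5_of_bytes(data: bytes) -> str:
    """Return the MD5 digest of some bytes as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()


# Two made-up versions of a small dataset that differ in one value.
v1 = b"id,cancelled\n1,0\n2,1\n"
v2 = b"id,cancelled\n1,0\n2,0\n"

h1 = md5_of_bytes(v1)
h2 = md5_of_bytes(v2)

print(h1 == md5_of_bytes(v1))  # prints: True (same content, same hash)
print(h1 != h2)                # prints: True (changed content, different hash)
```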

Exercise 1: DVC Setup and Initialization
Exercise 2: Setting up DVC
Exercise 3: .dvcignore Patterns
Exercise 4: DVC Cache and Staging Files
Exercise 5: Working with DVC Cache
Exercise 6: Understanding .dvc Files
Exercise 7: Configuring DVC Remotes
Exercise 8: Purpose of DVC Remotes
Exercise 9: Setup a DVC Remote
Exercise 10: Interacting with DVC Remotes
Exercise 11: Versioning Data using DVC Remote
Exercise 12: Checking out Versioned Data