Designing a DVC pipeline

Designing a DVC pipeline, which takes the form of a directed acyclic graph (DAG), is fundamental to leveraging DVC in your machine learning workflows. A DAG lets us codify the inputs, outputs, and execution command of each step. The outputs of one step can serve as inputs to one or more later steps, which naturally establishes the dependencies between steps.

In this exercise, you'll design an ML workflow that contains four stages:

Data preprocessing (preprocess_stage)
Data splitting (split_stage)
Model training (train_stage)
Model evaluation (evaluate_stage)

We will work exclusively with dvc stage add commands. Scroll to the end of the shell script file (dvc_dag_stages_add.sh) if needed.
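
As a rough illustration of how the four stages could be chained, here is a minimal sketch. It is not the contents of dvc_dag_stages_add.sh: every script name (preprocess.py, split.py, train.py, evaluate.py) and every data or model path is a hypothetical placeholder. The DVC_BIN variable and the add_pipeline_stages function are also assumptions made for this example, so that the commands can be previewed with DVC_BIN=echo before running them for real. Only the -n (name), -d (dependency), -o (output), and -M (uncached metrics file) options of dvc stage add are used.

```shell
#!/usr/bin/env bash
# Sketch only: all script, data, and model file names are hypothetical.
# DVC_BIN defaults to the real dvc CLI; set DVC_BIN=echo for a dry run
# that prints each command's arguments instead of editing dvc.yaml.
DVC_BIN="${DVC_BIN:-dvc}"

add_pipeline_stages() {
  # Stage 1: turn the raw data into a processed dataset.
  "$DVC_BIN" stage add -n preprocess_stage \
    -d preprocess.py -d data/raw.csv \
    -o data/processed.csv \
    python preprocess.py

  # Stage 2: depends on the output of preprocess_stage.
  "$DVC_BIN" stage add -n split_stage \
    -d split.py -d data/processed.csv \
    -o data/train.csv -o data/test.csv \
    python split.py

  # Stage 3: depends on the training split produced by split_stage.
  "$DVC_BIN" stage add -n train_stage \
    -d train.py -d data/train.csv \
    -o model.pkl \
    python train.py

  # Stage 4: depends on both the test split and the trained model.
  "$DVC_BIN" stage add -n evaluate_stage \
    -d evaluate.py -d data/test.csv -d model.pkl \
    -M metrics.json \
    python evaluate.py
}
```

Calling add_pipeline_stages inside an initialized Git and DVC repository would record the four stages in dvc.yaml. Because each stage lists the previous stage's outputs as its dependencies, DVC can infer the execution order without it being stated explicitly.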

This exercise is part of the course

Introduction to Data Versioning with DVC

This chapter provides a comprehensive introduction to Data Version Control (DVC), a tool essential for data versioning in machine learning. Learners will explore the motivation behind data versioning, understand its differences from code versioning, and experiment with a simple classification problem. They will review basic Git commands, learn about DVC, and practice setting up a repository. The chapter concludes with an overview of DVC’s features and use cases, including versioning data and models, CI/CD for machine learning, experiment tracking, pipelines, and more.

Exercise 1: Data Versioning Motivation
Exercise 2: Anatomy of a Machine Learning Model
Exercise 3: Differences Between Data and Code Versioning
Exercise 4: Understanding Hyperparameters
Exercise 5: Introduction to DVC
Exercise 6: Working with Git CLI
Exercise 7: Review DVC CLI
Exercise 8: DVC features and use cases
Exercise 9: DVC pipelines
Exercise 10: CI/CD for machine learning

This chapter covers DVC setup: installation, repository initialization, and the .dvcignore file. It then explores the DVC cache and staging files, showing how to add and remove files, manage caches, and understand the underlying mechanism based on MD5 hashes. The chapter also explains DVC remotes, distinguishes them from Git remotes, and shows how to add, list, and modify them. Lastly, it teaches you how to interact with these remotes by pushing and pulling data, checking out specific versions, and fetching data to the cache.

Exercise 1: DVC Setup and Initialization
Exercise 2: Setting up DVC
Exercise 3: .dvcignore Patterns
Exercise 4: DVC Cache and Staging Files
Exercise 5: Working with DVC Cache
Exercise 6: Understanding .dvc Files
Exercise 7: Configuring DVC Remotes
Exercise 8: Purpose of DVC Remotes
Exercise 9: Setup a DVC Remote
Exercise 10: Interacting with DVC Remotes
Exercise 11: Versioning Data using DVC Remote
Exercise 12: Checking out Versioned Data

This chapter focuses on automating ML pipelines using DVC. Learners create a configuration file containing settings and hyperparameters. They also learn how pipelines are visualized as directed acyclic graphs and use DVC commands to declare each stage's dependencies, command, and outputs. The chapter then covers executing DVC pipelines, including local model training and how Git tracks DVC metadata. Additionally, learners explore metrics and plots tracking in DVC, including how to print metrics, create plot files, and compare metrics and plots across different pipeline stages.

Exercise 1: Code organization and refactoring
Exercise 2: Understanding parameter files in DVC
Exercise 3: Write a parameter file
Exercise 4: Writing and visualizing DVC pipelines
Exercise 5: Designing a DVC pipeline (current exercise)
Exercise 6: Visualizing a DVC pipeline
Exercise 7: Executing DVC pipelines
Exercise 8: DVC pipeline execution concepts
Exercise 9: Execute a ML model training pipeline
Exercise 10: Evaluation: Metrics and plots in DVC
Exercise 11: Tracking DVC Metrics
Exercise 12: Adding plots to dvc.yaml
Exercise 13: Congratulations!