Data job automation with cron

1. Data job automation with cron

We've built out an entire pipeline in previous lessons, from data processing using csvkit, to Python package installation using pip, to executing pre-written Python code via the command line. Now we are at the last step: automating the entire process so that we can set it and forget it.

2. What is a scheduler?

What we need is a scheduler. A scheduler runs jobs, such as Python models, on a predetermined schedule, which lets us automate our data pipeline. There are quite a few commercial and open-source solutions on the market; some of the popular ones are Airflow, Luigi, and Rundeck. However, cron is still a very strong contender among these options because it is simple, free, customizable, 100% command-line, and native to macOS and Linux systems.

3. What is cron?

cron is a time-based job scheduler that comes pre-installed on Unix-like operating systems. This means that cron already exists on macOS and Linux systems, but not on Windows. Windows users can use cron by installing Cygwin, or use the native Windows Task Scheduler instead. cron can automate a variety of jobs, from system maintenance tasks to bash scripts to Python files. This means we can use it to automate our bash scripts for pulling data using csvkit, and also to run our Python models.

4. What is crontab?

To schedule tasks, we need a central file to keep records of all our jobs, when to run them, and other instructions specific to the schedule. This file is called the crontab file. To see what is currently in this file, type crontab -l. This displays all tasks, if any, currently scheduled via cron. It looks like we have no tasks scheduled. To see what other option flags are available with crontab, type man crontab.
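As a minimal sketch of the listing step: on many systems crontab -l prints an error and exits with a non-zero status when the user has no crontab yet, so a small wrapper can turn that case into a friendlier message. The function name show_jobs and the fallback text are illustrative and not part of cron.

```shell
# show_jobs is a hypothetical helper name, not a cron command.
# It prints the current user's scheduled jobs, or a short notice
# when crontab -l fails (for example, when no crontab exists yet).
show_jobs() {
    crontab -l 2>/dev/null || echo "no jobs scheduled"
}

show_jobs
```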

5. Add a job to crontab

We can add a job to crontab the same way we added a print function to a Python file: either by opening crontab in a text editor like nano, Vim, or Emacs, or by echoing the scheduler command directly into crontab. Once this is done, we can verify that the command was saved correctly by running crontab -l again. This time, we see that the Python job is properly scheduled.
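The echo approach might look like the sketch below. One caveat worth knowing: piping a single line into crontab replaces the whole table, so this version first prints the existing entries and then the new one. The helper name merged_crontab is hypothetical, the job line assumes the lesson's script is named create_model.py, and the install step is left commented out so the block only previews the result.

```shell
# Assumed job line: run the lesson's script every minute.
job='* * * * * python create_model.py'

# merged_crontab is a hypothetical helper: it prints the existing
# entries (if any), followed by the new job line passed as $1.
merged_crontab() {
    crontab -l 2>/dev/null || true
    printf '%s\n' "$1"
}

# Preview what the new crontab would contain.
merged_crontab "$job"

# To install it, pipe the preview into crontab, which reads the
# new table from standard input when given "-":
# merged_crontab "$job" | crontab -
```

After installing, running crontab -l again should show the new line alongside any earlier jobs.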

6. Learning to time a cron job

Let's dig deeper into fine-tuning when to schedule a job. First, keep in mind that cron has a 60-second granularity limit, which means that the most frequent job we can schedule is one run every minute. Second, there are five time component positions, each shown here as an asterisk. From left to right, they are minute, hour, day of the month, month, and day of the week.
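To make the left-to-right order concrete, here is a small sketch that splits a schedule string into its five positions and labels each one. The function describe_schedule is made up for illustration; cron itself does not provide it.

```shell
# describe_schedule is a hypothetical helper that labels the five
# time fields of a cron schedule, in left-to-right order.
describe_schedule() {
    # The here-document passes the schedule to read without letting
    # the shell expand * into file names.
    read -r minute hour day_of_month month day_of_week <<EOF
$1
EOF
    printf 'minute=%s hour=%s day_of_month=%s month=%s day_of_week=%s\n' \
        "$minute" "$hour" "$day_of_month" "$month" "$day_of_week"
}

describe_schedule '30 2 * * 1'
# prints: minute=30 hour=2 day_of_month=* month=* day_of_week=1
```

Read as a cron schedule, 30 2 * * 1 means 02:30 on every Monday.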

7. Learning to time a cron job

Let's take a look at an example. The asterisk is a wildcard that means "every" in the cron scheduler. Placing five asterisks before the Python command for create_model.py means we are scheduling Python to run this script every minute of every hour of every day of every month and every day of the week. In short, we run this model every minute, indefinitely. For scheduling jobs at other frequencies, such as every day, every month, or every 3 hours, check out the website crontab.guru to help you write out the cron jobs.
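For the other frequencies mentioned, the entries might look like the following sketch. It writes sample lines to a scratch file for review instead of installing them. The script name create_model.py is assumed from the lesson's example, and the */3 step syntax is supported by the common cron implementations on Linux and macOS, though it is worth confirming on your own system.

```shell
# Write sample entries to a temporary file; nothing is installed.
scratch=$(mktemp)

cat > "$scratch" <<'EOF'
# every minute
* * * * * python create_model.py
# every day at midnight
0 0 * * * python create_model.py
# every 3 hours, on the hour
0 */3 * * * python create_model.py
# first day of every month at midnight
0 0 1 * * python create_model.py
EOF

cat "$scratch"
```

Each schedule can be pasted into crontab.guru to confirm that it reads the way you intend before adding it to crontab.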

8. Let's practice!

We have learned how to automate our data processing jobs. Awesome job! Let's get some hands-on practice to build an end-to-end Python data pipeline!
