Get startedGet started for free

Stacking data and chaining commands with csvkit

1. Stacking data and chaining commands with csvkit

In this lesson, let's explore some more advanced techniques in csvkit, whether it is to chain multiple commands together, or to process more than one file at a time.

2. csvstack: stacking multiple CSV files

The csvstack command stacks the rows from two or more CSV files together. This is often used when you have files with the same schema but have been downloaded in chunks due to downloading restrictions such as API request restrictions. The full documentation can be found using csvstack dash-h. But first, let's go over some finer points.

3. csvstack: stacking multiple CSV files

Suppose we have two files, Spotify rank six dot csv and Spotify rank seven dot csv and we want to stack them together to create one file. csvstack can do this, but first, we need to make sure that the input files have the same number of columns, in the same column order, with the same data types. Let's preview Spotify-rank-six.

4. csvstack: stacking multiple CSV files

Let's preview Spotify-rank-seven. It looks like both files share the same schema. We are ready for csvstack.

5. csvstack: stacking multiple CSV files

The syntax for stacking is: csvstack Spotify underscore six dot csv Spotify underscore seven dot csv redirect Spotify underscore all ranks dot csv. Because both files contain two rows each, this results in a four row file for Spotify-all-ranks.

6. csvstack: stacking multiple CSV files

However, it's not always clear how to trace back which row in the final stacked file came from which source file. To keep a record of the source of the data row, use csvstack's option dash-g followed by a user entered value to create a source column called group. We do this by inserting the following syntax after csvstack: dash-g quotation marks rank six end quotation marks, comma, quotation marks rank seven end quotation marks Please note that the backslash is for line overflow. Now when we preview the new Spotify-AllRanks file, we see that there's a new column, by default, named group, that has the value Rank6 for every row that came from Spotify-rank-six-dot-csv and the value Rank7 for every row that came from Spotify-rank-seven-dot-csv.

7. csvstack: stacking multiple CSV files

Finally, if we want to rename the column from group to something else, we can also do so by adding dash-n followed by the new column name. Here, by adding dash n quotation marks source end quotation marks, we have renamed the column from group to source.

8. chaining command-line commands

The more we use command-line tools, the more we start structuring complex commands. Just like in English where we start combining independent sentences with conjunctions to form complex sentences, in command-line, we use operators to chain together commands in the same line. The semi-colon operator links and runs multiple commands in a single line in the command-line, sequentially. Here, we will get both csvlook and csvstat outputs for the same file. The double AND operator also links commands together, but the second command will only execute if the first command succeeds. Here, csvstat will only run after csvlook succeeds.

9. chaining command-line commands

We are already familiar with the re-direct operator. This re-directs the output from the first command and saves it to the location in the second command.

10. chaining command-line commands

The PIPE operator uses the output of the first command as input to the second command. Here, csvcut filters the data and prints columns track-id and danceability, but notice that the output is not well formatted.

11. chaining command-line commands

By passing the output of csvcut as input to csvlook using the pipe operator, the results are printed much more neatly.

12. Let's put everything together!

We have learned a lot about data processing with various csvkit commands. Let's practice putting everything together!