Get startedGet started for free

Getting started with csvkit

1. Getting started with csvkit

Welcome back! In this lesson, we will explore the basics of csvkit for data processing on the command line.

2. What is csvkit?

Data processing on the command line is computationally efficient and also quite simple once you are familiar with the syntax. Yet, generations of data professionals gravitate toward Python since it comes pre-built with libraries specific for data handling that bash commands lack. csvkit helps to bridge this gap by bringing a suite of data commands to the command line. Developed by Wireservice using Python, csvkit offers a variety of data conversion, processing and cleaning capabilities that rivals Python, R, and even SQL.

3. csvkit installation

Because csvkit is written in Python, it can be installed with the Python package manager, pip. The syntax is: pip install csvkit If csvkit is already installed. To upgrade it, run: pip install dash-dash-upgrade csvkit For full instructions, please refer to the tutorial.

4. Browsing the csvkit manual

Unlike most command-line tools, csvkit, as a whole, does not respond to the man command. Documentation is web-based.

5. in2csv: converting files to CSV

For each command in the csvkit suite, however, this is different. For example, in2csv is a useful command in csvkit suite that converts tabular data files, like text or Excel, into CSV. There is both a web-based documentation as well as command-line manual, use in2csv-dash-dash-help, or in2csv-dash-h, which prints the log below.

6. in2csv: converting files to CSV

The syntax involves calling in2csv, followed by the name of the file you wish to convert, in this case, SpotifyData-dot-xlsx. The redirect operator is followed by the name of the newly created CSV file SpotifyData-dot-csv. Please note that in2csv-SpotifyData-dot-xlsx alone just prints to console the data on the first Excel sheet and does not generate a new file. The redirect operator is crucial for redirecting and saving the output in the new file SpotifyData-dot-csv.

7. in2csv: converting files to CSV

What if the data we want is not in the first sheet? csvkit does let us specify which sheet to convert in an Excel file. First, use the dash-dash-names or dash-n option to print all sheet names in Spotify-data-dot-xlsx. Then, we use dash-dash-sheet to specify that we want to convert worksheet-one-underscore-popularity. Note the quotation marks around the sheet name. We re-direct the output of the conversion to Spotify underscore popularity dot csv.

8. in2csv: converting files to CSV

Please remember that in2csv does not print any logs to console. For sanity check, we run ls to confirm that the new CSV has been created.

9. csvlook: data preview on the command line

There are various ways to preview data on the command line, such as cat, less, or more. csvlook, also in the csvkit suite, prints CSV files to the command line in a Markdown-compatible, fixed-width format that's easier on the eye. For documentation, use csvlook dash-h.

10. csvlook: data preview on the command line

Let's test this out on our newly created CSV file. csvlook Spotify underscore popularity dot csv prints the following output. Pretty great, right?

11. csvstat: descriptive stats on CSV data files

Our last command for this lesson is csvstat. csvstat is similar to the describe() method in Python's Pandas library. It intelligently deciphers the data type in each column of the CSV file and prints descriptive summary statistics for each column according to its data type, such as mean, median, and unique value counts. For documentation, use csvstat dash-h.

12. csvstat: descriptive stats on CSV data files

Using the popularity CSV again, this is a portion of the summary statistics csvstat prints out for the first column, track id.

13. Let's try some csvkit!

In this lesson, we learned three commands in the csvkit suite, in2csv, csvlook, and csvstat. Let's put our new knowledge to practice!