1. Filtering data using csvkit
In the previous lesson, we set ourselves up for success by preparing our data using csvkit. Now we will dig deeper and learn how to use csvkit's data filtering commands.
2. What does it mean to filter data?
Because our data files are tabular, we can filter by creating a subset of the original data by column or row.
We can use the csvcut command to filter data by column.
We can use csvgrep to filter by row.
3. csvcut: filtering data by column
csvcut can filter and truncate CSV files by either column name or column position.
For full documentation, use csvcut-dash-h.
4. csvcut: filtering data by column
If you don't have column names or positions memorized, that's perfectly okay. csvcut has a dash-dash-names or dash-n option, which prints a list of the column names and positions. Using csvcut dash-n Spotify-underscore- music-attributes-dot-csv, we see that Spotify-music-attributes has 3 columns, track-id, danceability, and duration-underscore-ms.
5. csvcut: filtering data by column
By referring to the output of csvcut-dash-n, we can now filter the data more easily. Suppose we want to return only the first column track_id, we can do so by referring to that column by position using:
csvcut dash-c one Spotify-underscore- music-attributes-dot-csv
6. csvcut: filtering data by column
We can generate the same output by referring to the column name by replacing the 1 with the column name, in double parentheses.
Now our code reads:
csvcut dash-c quotation-marks-track- underscore-id-end quotation-marks Spotify-underscore- music-attributes-dot-csv
The output is the same as before.
7. csvcut: filtering data by column
You can specify more than one column for filtering. To return the second and third columns, separate the column numbers by commas. Note that there is no space between the numbers two and three. Inserting a space will generate an error.
The code reads:
csvcut dash-c two-comma-three Spotify-underscore- music-attributes-dot-csv
8. csvcut: filtering data by column
Same as before, we can generate the same output by replacing the column numbers with names.
Note again, that column names are wrapped in double quotations, and the columns are separated by comma, not space.
The output is the same as before.
9. csvgrep: filtering data by row value
csvgrep is our go-to in csvkit for filtering data by row value.
Despite its name, csvgrep can filter by both exact match or regex fuzzy match.
It's important to remember that csvgrep must be paired with one of these three options dash-m, dash-r, or dash-f. We will focus on dash-m in this lesson.
As always, please use the dash-h option for more documentation.
10. csvgrep: filtering data by row value
Let's say, we want to filter by a certain track-id. We can do so by first isolating the column with csvgrep dash-c quotation marks track id end quotation marks.
Then we pass in the exact value that we want to filter by using dash-m followed by the track-id value, with no quotation marks.
Finally we pass in the dataset Spotify underscore popularity dot csv. This will return the entire row or rows that contains this track-id.
Similar to column filtering with csvcut, we can pass the column location instead. Keeping everything else the same, we replace track-id with the column position number 1 instead.
This will give us the same output.
11. Let's do data filtering with csvkit!
We just keep adding more csvkit commands to our data processing toolkit! Let's put these skills to good use with some exercises!