1. Loading and writing CSV files
Getting the data into Julia is the first step in the data manipulation journey. In the Introduction to Julia course, we learned how to load CSV files using the File function from the CSV-dot-jl package. In this video, we'll explore more advanced loading techniques for getting data into a DataFrame.
2. Delimiters
One of the common issues we encounter is different delimiters. A delimiter is a character or string used to separate values. While CSV stands for Comma-Separated-Values, in practice, files sometimes use different ways of separating the columns. These can include a space, a tab, or something else.
If we know the type of delimiter used, we can provide it to the CSV-dot-File function with the delim keyword. Here, we are loading the penguins dataset, setting the delimiter to a space. Delim takes either a character or a string as its argument.
Julia and the CSV package are clever; if we don't provide a delimiter, they'll try to detect the most consistent delimiter themselves.
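As a sketch of the delim keyword in action, here is a minimal, self-contained example. The file path and contents are hypothetical stand-ins for the penguins dataset, written to a temporary folder so the snippet runs on its own:

```julia
using CSV, DataFrames

# Hypothetical stand-in for the penguins dataset, using spaces between columns
path = joinpath(mktempdir(), "penguins_space.csv")
write(path, "species island bill_length_mm\nAdelie Torgersen 39.1\nGentoo Biscoe 46.1\n")

# delim accepts either a Char (' ') or a String (" ")
df = CSV.File(path; delim=' ') |> DataFrame
```

Omitting delim here would also work, since CSV.jl tries to detect the most consistent delimiter on its own.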
3. Decimal mark
Another important keyword is decimal. It takes a character that indicates how decimals are separated in floating point numbers. Some countries use the dot, while the comma is popular in others. We must be careful when the dataset uses a comma for the decimal mark, as it can interfere with the delimiter; in that case, the file typically uses a different delimiter, such as a semicolon, which we should pass explicitly.
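To illustrate, here is a minimal sketch using a hypothetical European-style file, with a semicolon delimiter and a comma decimal mark, written to a temporary folder:

```julia
using CSV, DataFrames

# Hypothetical file: ';' separates columns, ',' is the decimal mark
path = joinpath(mktempdir(), "penguins_eu.csv")
write(path, "species;bill_length_mm\nAdelie;39,1\nGentoo;46,1\n")

# Setting both keywords lets CSV.jl parse bill_length_mm as Float64
df = CSV.File(path; delim=';', decimal=',') |> DataFrame
```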
4. Loading parts of datasets
Next, it's important to know how to load only part of a dataset. This is especially useful when working with huge datasets that would not fit in memory.
We need to know two keywords to load only part of the data: skipto and limit. Skipto takes an integer specifying the file line on which the data we want starts, and it counts the header lines as well. In the penguins dataset, the header occupies only the first line, so if we set skipto to ten, we'll load records from the ninth data row onward.
The limit keyword takes an integer that indicates the number of rows we want to load.
Here, we are loading three rows, starting with the ninth record.
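The steps above can be sketched as follows. The file is a hypothetical twelve-record table built in a temporary folder, so the line numbers are easy to check:

```julia
using CSV, DataFrames

# Hypothetical file: header on line 1, twelve data records on lines 2-13
path = joinpath(mktempdir(), "rows.csv")
write(path, "id,value\n" * join(["$i,$(i * 10)" for i in 1:12], "\n"))

# skipto counts file lines, header included: line 10 holds the ninth record;
# limit caps how many rows are read from there
df = CSV.File(path; skipto=10, limit=3) |> DataFrame
```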
5. Header
Sometimes, the header is not on the first row; it could be split over multiple rows or be missing altogether. That's where the header keyword comes in.
If the header is on a different row than the first, we pass its row number.
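A minimal sketch of this case, using a hypothetical file where a comment line precedes the header:

```julia
using CSV, DataFrames

# Hypothetical file: line 1 is a note, the real header sits on line 2
path = joinpath(mktempdir(), "offset_header.csv")
write(path, "# exported data\nspecies,island\nAdelie,Torgersen\n")

# Pass the header's row number so line 1 is ignored
df = CSV.File(path; header=2) |> DataFrame
```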
6. Header over multiple lines
If it is split over multiple lines, we pass a vector containing the row numbers like so. In our case, passing a vector containing one and two combines the header and the first row values, such as species-underscore-Adelie.
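Here is a small sketch of a split header, using a hypothetical file where each column name spans lines one and two; CSV.jl joins the pieces with an underscore:

```julia
using CSV, DataFrames

# Hypothetical file: column names are split across the first two lines
path = joinpath(mktempdir(), "split_header.csv")
write(path, "bill,flipper\nlength_mm,length_mm\n39.1,181\n")

# header=[1, 2] combines both rows into names like bill_length_mm
df = CSV.File(path; header=[1, 2]) |> DataFrame
```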
7. Replacing the header
In case the header is missing, or we want to replace it, we can pass a vector of strings or symbols to be used as the new column names.
If the vector is too short or too long, we get a warning, but the file still loads.
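A minimal sketch of supplying our own column names, here for a hypothetical file that has no header row at all:

```julia
using CSV, DataFrames

# Hypothetical headerless file: data starts on the very first line
path = joinpath(mktempdir(), "no_header.csv")
write(path, "Adelie,Torgersen\nGentoo,Biscoe\n")

# Passing a vector of strings (or symbols) sets the column names directly
df = CSV.File(path; header=["species", "island"]) |> DataFrame
```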
8. Writing CSV files
Now, what about writing a transformed DataFrame to a CSV file? We can use the CSV-dot-write function. It takes the file name or a file path and the DataFrame name. We can also use the delim and decimal keywords to specify how the data is saved. Here, we save the penguins DataFrame into the temp folder as transformed-penguins-dot-CSV. We also set the delimiter to a space and the decimal separator to a comma because we want to share the dataset with our European colleagues.
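This step can be sketched as follows; the DataFrame contents and output path are hypothetical, with a temporary folder standing in for the temp directory mentioned above:

```julia
using CSV, DataFrames

# A small hypothetical stand-in for the transformed penguins DataFrame
df = DataFrame(species=["Adelie", "Gentoo"], bill_length_mm=[39.1, 46.1])

# Write with a space delimiter and a comma decimal mark
path = joinpath(mktempdir(), "transformed_penguins.csv")
CSV.write(path, df; delim=' ', decimal=',')
```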
9. Cheat sheet
Here is a cheat sheet to help you! Both CSV-dot-File and CSV-dot-write support other keywords that are outside the scope of this course. You can check them out in their documentation.
10. Let's practice!
Are you ready to load some files? Let's practice!