Creating new columns

1. Creating new columns

Sometimes we want to calculate new values depending on the data in each row, not just a single value for the whole column.

2. Columns vs. rows

Luckily, we don't have to loop over the DataFrame, calculating values for each row individually. The DataFrames package includes the ByRow function, which does exactly what we need. We can use ByRow with any of the functions we know from the previous video. So how do we decide whether to apply a function to the whole column or to individual rows? It depends on the situation. If we want to characterize the column as a whole, for example by calculating its mean, we use the approach from the previous video. However, if we want the new values to depend on the values in each individual row, we need to use the ByRow function. Let's look at two examples using the penguins dataset.
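The difference between the two approaches can be sketched as follows. This is only an illustration: the tiny DataFrame, the column name body_mass_g, and the use of combine with mean for the whole-column case are assumptions, not taken from the video.

```julia
using DataFrames
using Statistics

# Hypothetical stand-in for the penguins data; the column name and values are made up.
penguins = DataFrame(body_mass_g = [3750.0, 3800.0, 3250.0])

# Whole column: mean receives the entire vector and returns a single value.
whole = combine(penguins, :body_mass_g => mean => :mean_mass)

# Row by row: ByRow wraps the function so it is applied to each value separately.
per_row = transform(penguins, :body_mass_g => ByRow(x -> x / 1000) => :body_mass_kg)
```

Here whole has a single row, while per_row keeps one row per penguin and gains a new column.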

3. Flipper length to inches

First, we would like to convert the flipper length column from millimeters to inches. To do that, we call transform, passing our DataFrame and then the name of the column followed by equals-greater-than and ByRow. Inside the call to ByRow, we write the function we want to use. In this case, we use an anonymous function that takes x and returns x divided by twenty-five point four. If we want to name the new column, we continue with another equals-greater-than and the name of the new column.
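Written out, the call described here might look like the sketch below. The column names flipper_length_mm and flipper_length_in and the sample values are assumptions made for illustration.

```julia
using DataFrames

# Hypothetical sample; the real penguins dataset has more rows and columns.
penguins = DataFrame(flipper_length_mm = [181.0, 186.0, 195.0])

# source column => ByRow(function) => new column name
penguins_in = transform(
    penguins,
    :flipper_length_mm => ByRow(x -> x / 25.4) => :flipper_length_in,
)
```

transform returns a new DataFrame that keeps the original column and adds the converted one.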

4. Culmen depth and length ratio

In our second example, we would like to calculate the ratio between culmen depth and culmen length. To do that, we include both columns in square brackets. Rather than dot-equals-greater-than, we need to use plain equals-greater-than, as we want to pass both columns to a single function. We follow this by calling ByRow with an anonymous function that takes two arguments, x and y, written in parentheses, and returns x divided by y using the dash-greater-than arrow. Again, we can name the new column using equals-greater-than.
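A minimal sketch of this two-column call, assuming the columns are named culmen_depth_mm and culmen_length_mm and using made-up values; the output name culmen_ratio is also hypothetical.

```julia
using DataFrames

# Hypothetical sample values.
penguins = DataFrame(
    culmen_depth_mm = [18.7, 17.4],
    culmen_length_mm = [39.1, 39.5],
)

# Plain => (not .=>) passes both columns to one function; x is depth, y is length.
with_ratio = transform(
    penguins,
    [:culmen_depth_mm, :culmen_length_mm] => ByRow((x, y) -> x / y) => :culmen_ratio,
)
```

With the broadcast form, dot-equals-greater-than, the function would instead be paired with each column separately, which is not what a ratio needs.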

5. New column from a vector

Now imagine that we want to add a new column to our DataFrame. But rather than calculating it from the data, we are given a vector of values to add. In our penguin example, it might be a vector of identification numbers, id-vec. To create a new column in the DataFrame, we have several options. To tell them apart, we name each new column after the method used to create it. We can index penguins using square brackets, colon, and the new column name, id-colon, and set it to id-vec. We can also index penguins using square brackets, exclamation mark, and the column name id-exclamation, and set it to id-vec. Or we can use penguins-dot followed by the column name id-dot, and set it to id-vec.
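The three options can be written as in this sketch. The identifiers id_vec, id_colon, id_exclamation and id_dot are one plausible spelling of the spoken names, and the species column and id values are made up.

```julia
using DataFrames

# Hypothetical sample and id vector.
penguins = DataFrame(species = ["Adelie", "Gentoo", "Chinstrap"])
id_vec = [25, 28, 29]

penguins[:, :id_colon] = id_vec        # square brackets with a colon
penguins[!, :id_exclamation] = id_vec  # square brackets with an exclamation mark
penguins.id_dot = id_vec               # dot syntax
```

Right after these assignments, all three new columns hold the same values.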

6. Copy or not

So what is the difference? Imagine that the first value in id-vec is wrong and we change it from twenty-five to the correct value of twenty-seven in the vector. When we print the DataFrame, we can see that the value in the id-colon column didn't change, while the values in the other two columns did. That's because the different assignments treat the vector differently.

7. Copy or not

When we use square brackets and a colon, we tell Julia to copy the values from the vector into the DataFrame. Therefore, any subsequent changes to the original vector do not affect the DataFrame. On the other hand, the other two methods reference the original vector, id-vec, without copying its values into the DataFrame. Therefore, any changes to the original vector lead to changes in the corresponding values in the DataFrame. So which approach is better? Unfortunately, there is no single answer to this question. It's up to us to decide on a case-by-case basis which one to use.
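The copy-versus-reference behavior can be checked with a small sketch. The names id_vec, id_colon, id_exclamation and id_dot and the sample values are assumptions chosen to mirror the narration.

```julia
using DataFrames

# Hypothetical sample; the first id is deliberately "wrong" (25 instead of 27).
penguins = DataFrame(species = ["Adelie", "Gentoo", "Chinstrap"])
id_vec = [25, 28, 29]

penguins[:, :id_colon] = id_vec        # copies the values
penguins[!, :id_exclamation] = id_vec  # stores the vector itself
penguins.id_dot = id_vec               # stores the vector itself

# Correct the first value in the original vector only.
id_vec[1] = 27

copied_first = penguins.id_colon[1]        # still the old value
shared_first = penguins.id_exclamation[1]  # follows the change
same_object = penguins.id_dot === id_vec   # true when no copy was made
```

If the DataFrame should stay independent of the vector, the colon form, or passing copy(id_vec) explicitly, is the safer choice; the other two forms avoid allocating a second vector.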

8. Let's practice!

Are you ready to try it on your own? Let's head to the exercises!
