Wide and long data formats

1. Wide and long data formats

Welcome to this course! My name is Eugenia and I will be your guide through learning how to reshape data using pandas.

2. You will learn

In this course, we will start by learning the concept of wide and long data formats and how to convert between them. Then we'll deepen our knowledge by stacking and unstacking columns, and finally we'll reshape and handle complex data, such as string columns or JSON data.

3. Why it is important

As a data scientist, you will spend a lot of time tidying your dataset and formatting your data to make it appropriate for your analysis. For example, you may encounter a dataset that is easy for a human to read but is not suitable for statistical analysis. You will also work with complex structures such as nested data or DataFrames with a multi-level index. Understanding the tools to reshape data will help you succeed in these situations.

4. Shape of data

First, we need to understand the concept of shape. Shape refers to the way in which a dataset is organized in rows and columns. Let's imagine that we have the following dataset with features of players from the video game FIFA. When we check its shape, we see that it contains 3 rows and 4 columns.
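To make the idea of shape concrete, here is a minimal sketch in pandas. The column names (name, age, nationality, club) are the ones mentioned in this lesson, but the player names and values are hypothetical placeholders, not the data shown on the slide.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the FIFA players dataset: 3 players, 4 features.
# The first player's age is missing on purpose, as in the lesson.
fifa = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "age": [np.nan, 27, 33],
    "nationality": ["Country X", "Country Y", "Country Z"],
    "club": ["Club 1", "Club 2", "Club 3"],
})

# .shape is an attribute that returns (number_of_rows, number_of_columns).
print(fifa.shape)  # (3, 4)
```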

5. Wide format

Let's take a closer look.

6. Wide format

We can see that each feature of a player is in a separate column.

7. Wide format

Also, each row represents many features of the same player. This is a distinctive feature of a wide format.

8. Wide format

The wide format has no repeated records, but this could lead to missing values. This format is preferred for simple statistics, such as calculating the mean, and for imputing missing values.
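As a rough illustration of why the wide format suits simple statistics, the sketch below computes a column mean and uses it to impute a missing value. The data is hypothetical, and mean imputation is just one possible choice, not necessarily the one used in this course.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per player, one column per feature.
fifa = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "age": [np.nan, 27, 33],
})

# In wide format each feature is a single column, so a summary statistic
# is one method call. mean() skips missing values by default.
mean_age = fifa["age"].mean()

# Impute the missing age with the column mean (an illustrative choice).
fifa["age_imputed"] = fifa["age"].fillna(mean_age)
print(mean_age)  # 30.0
```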

9. Long format

Now, we'll look at the same dataset but in another format.

10. Long format

We can see that each row shows only one feature for a player.

11. Long format

There are multiple rows for each player, one for each feature. Notice that there is no row for the age feature of the first player. This happens because we had a missing value there.

12. Long format

We have a column, the name column, that identifies the same player across the records.

13. Long format

These are typical characteristics of the long format, which is usually seen as the standard for a tidy dataset. It is commonly the preferred format because it can better summarize data, it has a key-value pair structure, and many advanced graphing and analysis techniques require data in this format.
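The long format described above can be sketched by hand as follows. The column names variable and value, and all the data, are hypothetical; the point is only to show one row per player and feature, with the identifier column repeated.

```python
import pandas as pd

# Hypothetical long-format table: each row holds one feature of one player.
# There is no "age" row for Player A because that value was missing.
long_df = pd.DataFrame({
    "name": ["Player A", "Player A",
             "Player B", "Player B", "Player B",
             "Player C", "Player C", "Player C"],
    "variable": ["nationality", "club",
                 "age", "nationality", "club",
                 "age", "nationality", "club"],
    "value": ["Country X", "Club 1",
              27, "Country Y", "Club 2",
              33, "Country Z", "Club 3"],
})

# The name column identifies the player, so we can count rows per player.
rows_per_player = long_df.groupby("name").size()
print(rows_per_player)
```

Here the variable and value columns form the key-value pairs mentioned above.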

14. Reshaping data

So in a broad sense, reshaping data is transforming a data structure to adjust it for our analysis. This could involve something as simple as transposing the data so columns become rows and rows become columns. Let's see an example. First, we'll set the club column as the index.

15. Reshaping data

Then, we select the name and nationality columns.

16. Reshaping data

And now, we'll use the transpose method to flip the dataset. We can see that the rows are now columns, and the columns have become rows. But this is not very helpful.
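The three steps just described (set the index, select columns, transpose) could look roughly like this in pandas. The data is hypothetical and stands in for the dataset on the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the FIFA players dataset.
fifa = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "age": [np.nan, 27, 33],
    "nationality": ["Country X", "Country Y", "Country Z"],
    "club": ["Club 1", "Club 2", "Club 3"],
})

# Step 1: set the club column as the index.
by_club = fifa.set_index("club")

# Step 2: select the name and nationality columns.
subset = by_club[["name", "nationality"]]

# Step 3: transpose, so rows become columns and columns become rows.
flipped = subset.transpose()
print(flipped.shape)  # (2, 3)
```

After transposing, the clubs are the column labels and name and nationality are the row labels.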

17. Reshaping data

In this course, we will define reshaping data as converting data from wide to long format and vice versa. To decide between the long and wide formats, think about what your unit of analysis is. For the long format, you are interested in each characteristic of a player. For the wide format, your interest is each player.

18. Wide to long transformation

Throughout the course, we will perform wide to long transformations using pandas functions such as melt and wide_to_long, among others.
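As a small preview, here is a sketch of a wide to long transformation with melt. The data and the names feature and value are hypothetical. Dropping the missing rows with dropna is an assumption made here so that the result matches the long table seen earlier, where the first player has no age row.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per player.
fifa = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "age": [np.nan, 27, 33],
    "nationality": ["Country X", "Country Y", "Country Z"],
    "club": ["Club 1", "Club 2", "Club 3"],
})

# melt keeps "name" as the identifier and turns every other column
# into (feature, value) pairs, one pair per row.
melted = fifa.melt(id_vars="name", var_name="feature", value_name="value")

# melt keeps missing values, so we drop them explicitly (an assumption).
long_df = melted.dropna(subset=["value"])
print(long_df.shape)  # (8, 3)
```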

19. Long to wide format

Likewise, we will use pandas functions such as pivot and pivot_table to convert DataFrames from long to wide format.
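Going the other way, a minimal sketch with pivot might look like the following. The long table and its column names (feature, value) are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: one row per player and feature.
long_df = pd.DataFrame({
    "name": ["Player A", "Player A",
             "Player B", "Player B", "Player B",
             "Player C", "Player C", "Player C"],
    "feature": ["nationality", "club",
                "age", "nationality", "club",
                "age", "nationality", "club"],
    "value": ["Country X", "Club 1",
              27, "Country Y", "Club 2",
              33, "Country Z", "Club 3"],
})

# pivot spreads the feature values into columns, one row per player.
# The missing (Player A, age) combination shows up as NaN.
wide_df = long_df.pivot(index="name", columns="feature", values="value")
print(wide_df.shape)  # (3, 3)
```

pivot raises an error when an index and column pair appears more than once. pivot_table handles that case by aggregating the duplicates.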

20. Let's practice!

Now it's your turn to practice.