Combining multiple datasets

1. Combining multiple datasets

So far, you've built pipelines to individual data sources. However, projects commonly draw data from multiple places. In this lesson, we'll step back from JSON to look at pandas methods for combining datasets.

2. Concatenating

You encountered the method for adding rows from one dataframe to another when compiling multiple spreadsheets -- it's concat. Concat is short for concatenate. It's a function that puts together multiple pandas objects, like dataframes or series, along an axis -- that is, along rows or columns. Dataframes do not have to have the same columns to concatenate them. Concat takes a list of objects to combine. By default, it concatenates rows. If the dataframes use pandas' default index of row numbers, you should also set concat's ignore index parameter to True so that the resulting dataframe has relabeled rows, rather than two row 0s, two row 1s, and so on.

3. Concatenating

APIs commonly limit the number of records returned in a single call to manage resource usage. The Yelp API returns 20 records at a time by default,

4. Concatenating

but we can set the offset parameter to get the next 20.

5. Concatenating

From there, we concat the two dataframes, setting ignore index to true to relabel rows. We can print the name column to confirm the resulting dataframe has 40 stores.

6. Merging

Another way to combine data is by adding columns along a shared key column, like in the SQL joins covered last chapter. While SQL joins are only an option for tables in the same database server, pandas has its own way to join data -- merge.

7. Merging

Merge is both a pandas function, like read csv, and a dataframe method. We'll use the dataframe method, which takes the name of the other dataframe to merge. It also needs strings of the columns to merge on. If key column names match in both dataframes, you can specify the name with the on keyword argument. If they're different, use left on and right on to specify keys in the first and second dataframes, respectively. Note that the key columns must be the same data type for the merge to work properly.

8. Merging

Let's illustrate with simplified versions of the call counts and weather datasets. They each have a date column, which we can use as the merge key.

9. Merging

We pass weather to call counts' merge method, specifying to merge on created date in call counts and date in weather, and check the head to confirm the merge worked.

10. Merging

Let's take a closer look at the result. By default, merge returns only rows with key values that exist in both datasets. There aren't call counts for December, so none of the December weather data appears here. The call and weather data have what is called a one-to-one relationship -- each call count record can be linked to only one weather record by date, and vice versa -- so the result has one record for each date in both datasets. If you merge data where a record in one dataframe can be linked to multiple records in the other, the result will have one row for each possible link.

11. Let's practice!

Concat and merge are just the tip of the iceberg when it comes to pandas' data wrangling capabilities, but they're useful functions to start with, especially when combined with all the other tools you have for building data pipelines. But enough from me -- let's practice.