1. Power and flexibility
In this video you'll learn how the flexibility and power of pandas can make you more effective and more productive.
2. Scalability
Pandas is excellent for working with large datasets. Even with a typical laptop you can manipulate millions of rows of data and go well beyond the size limit of spreadsheets.
There's no hard limit on data frame size. If the data is too large for your machine to handle, you can scale up to a machine with more memory and more processing power.
If that's not an option, pandas has built-in functions to work with data in chunks. For example, the pandas read_csv() function includes a chunksize parameter that sets how many rows are read at a time. You can use this parameter in combination with other functions and Python code to conserve memory.
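As a rough sketch of what chunked reading can look like, the example below totals a column one chunk at a time instead of loading everything at once. The CSV text, the column names (region, sales), and the chunk size of 4 are made up for illustration; a real file path would normally take the place of the in-memory text.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file too large to load at once.
csv_text = "region,sales\n" + "\n".join(
    f"{'north' if i % 2 == 0 else 'south'},{i}" for i in range(10)
)

total_by_region = {}

# With chunksize set, read_csv yields DataFrames of at most 4 rows each
# rather than returning a single DataFrame.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    sums = chunk.groupby("region")["sales"].sum()
    # Fold this chunk's partial sums into the running totals.
    for region, value in sums.items():
        total_by_region[region] = total_by_region.get(region, 0) + int(value)

print(total_by_region)
```

Only one chunk is held in memory at a time, so the running totals are the only thing that grows with the number of groups, not with the number of rows.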
Even when you reach a limit with pandas alone, you can combine it with other packages to take advantage of distributed computing and parallel processing. Handling datasets with hundreds of millions of rows is quite possible!
3. Efficiency
Pandas can also save you time when joining data.
You can join datasets by any number of columns if the data logically matches. For instance, if you want to join data by month and day, and both datasets contain those columns, you can join directly on the columns. There's no need to create a new column for the date or combine text columns as you might for a spreadsheet.
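Here is a minimal sketch of joining directly on month and day. The two data frames and their values (sales, weather, revenue, temp_c) are invented for this example and are not from the course data.

```python
import pandas as pd

# Two hypothetical datasets that both carry "month" and "day" columns.
sales = pd.DataFrame(
    {"month": [1, 1, 2], "day": [15, 16, 1], "revenue": [100, 150, 90]}
)
weather = pd.DataFrame(
    {"month": [1, 2, 2], "day": [15, 1, 2], "temp_c": [3, 5, 6]}
)

# Join on both columns at once; no combined date column is needed.
# The default join type is "inner", so only matching month/day pairs remain.
joined = sales.merge(weather, on=["month", "day"])

print(joined)
```

Rows are kept only where both month and day match in the two data frames, which here means January 15 and February 1.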
The code behind pandas is written to make things as easy as possible.
When two data frames have the same names for overlapping columns, the statement can be super simple.
This merge statement joins two data frames on the columns they have in common. We refer to the first data frame mentioned as the left data frame, and the data frame inside the parentheses as the right data frame. This basic example is just a start. We'll discuss these statements and their structure in detail throughout the rest of the course.
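The statement shown on the slide isn't reproduced in this transcript, so the sketch below is only an assumed example of the simplest form. The data frame names (left, right) and their contents are hypothetical. When merge() is called with no other arguments, pandas joins on the column names the two data frames share.

```python
import pandas as pd

# Hypothetical data frames that share one column name: "id".
left = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bo", "Cy"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [88, 92, 75]})

# "left" is the left data frame; "right", inside the parentheses,
# is the right data frame. With no "on" argument, the shared column
# "id" is used as the join key.
merged = left.merge(right)

print(merged)
```

Because the default join is an inner join, only ids 2 and 3, which appear in both data frames, end up in the result.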
4. Integration
Finally, pandas is well integrated into the Python ecosystem.
You heard earlier about packages designed to improve the speed and scale of manipulating data frames. Pandas also works well as part of an end-to-end pipeline for analytics. Other packages, especially those focused on visualization or machine learning, are written to accept pandas data frames as inputs.
5. A word on advanced spreadsheet usage
Of course, modern spreadsheet software has advanced capabilities that make it powerful, too.
Data models and query tools allow users to join data in different ways.
Integration with programming languages can populate cells with the touch of a button.
And formulas using XLOOKUP or index-match have great flexibility for joining data.
For this course we'll still use VLOOKUP as the baseline for joining data in spreadsheets. It's a nearly universal formula familiar to spreadsheet users at all levels. It also provides a simple, useful concept for building joins in pandas.
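To make the VLOOKUP comparison concrete, here is one way the idea might carry over, assuming a left join is the closest pandas analogue. The tables (orders, prices) and their values are hypothetical.

```python
import pandas as pd

# Hypothetical "main sheet" and "lookup table".
orders = pd.DataFrame({"product_id": [101, 102, 999], "qty": [2, 1, 5]})
prices = pd.DataFrame(
    {"product_id": [101, 102, 103], "price": [9.5, 20.0, 4.25]}
)

# how="left" keeps every row of orders, much as VLOOKUP fills a column
# next to each row of the main sheet. Keys with no match get NaN,
# roughly where a spreadsheet would show #N/A.
looked_up = orders.merge(prices, on="product_id", how="left")

print(looked_up)
```

One difference worth keeping in mind: if the lookup table contains duplicate keys, a merge returns one row per match, while VLOOKUP returns only the first match.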
6. Let's practice!
OK, it's time to get back to pandas and work with some practical examples.