1. Sorting and summarizing a DataFrame
Great work so far!
Let's continue by learning how to quickly understand our dataset by sorting and summarizing.
2. Sorting
We start with sorting a DataFrame.
The sort method re-orders rows from smallest to largest based on their values in one column. We pass it the name of the column that we want to sort by.
Let's say we want to find the most budget-friendly options for our customers.
Sorting by price arranges the rentals from cheapest to most expensive, making it easy to spot the bargains at a glance.
3. Sorting in descending order
What about customers looking for luxury options?
Setting the descending argument to True reverses the order, showing premium properties first.
Looking at the output, we can immediately spot the high-end options like Tregenna House and Palma Villa, perfect for vacation splurges.
4. Sorting by multiple columns
Sometimes we need to organize data based on multiple criteria.
Imagine a family first wants to know how many bedrooms a property has, and then compare prices within each size category. By passing multiple columns to the .sort() method, we can accomplish exactly this.
Notice something interesting here: the first row has a null, or missing value, in the bedrooms column. When sorting by a column that has missing values, the nulls appear first, followed by the actual values in ascending order. We'll explore handling missing values later in the course. After that first row, we see the three-bedroom properties arranged from least to most expensive.
5. Finding extreme values
For some data analysis tasks, we're only interested in the extremes.
Rather than sorting the entire DataFrame when we only need a few values, the top_k method offers a faster alternative.
Suppose a travel magazine asks for the three most premium properties for a feature article. Using top_k with the price column instantly gives us what we need.
6. Finding extreme values
Similarly, for budget travel guides, we can use bottom_k to identify the most affordable options.
7. Summarizing a DataFrame
When preparing a report on our vacation rental portfolio, we need a comprehensive statistical overview.
The describe method delivers this with a single command.
The first rows tell us about data completeness - "count" shows non-missing values in each column, while "null_count" reveals how many entries are missing. This helps us quickly spot data quality issues, like the missing entries in the type and bedrooms columns.
The remaining rows provide rich statistical insights for each column. For numerical columns like price, we see everything from means and standard deviations to percentiles that sketch the distribution. String columns like name only show min/max values (based on alphabetical order), while for boolean columns like beach, the mean represents the proportion of True values - showing that 63% of our properties are beachfront.
8. Let's practice!
Time to practice sorting and summarizing!