Data manipulation and analysis
1. Data manipulation and analysis
Now, we'll explore how to manipulate and analyze data using Tablesaw's powerful filtering, sorting, and aggregation capabilities.
2. Column selection
First, let's look at the .selectColumns() method. This method creates a new table containing only the columns we specify, which is useful for narrowing our view to just the data that matters for our task. We can select by name, like "Name" and "Salary", or apply more advanced logic with lambda expressions, such as selecting both int and double columns. All rows are retained; only the column set is reduced.
3. Filtering data
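To make the column selection described above concrete, here is a minimal sketch of selecting by name. The sample employees table and the class and method names are hypothetical, since the transcript does not include the slide code, and the sketch assumes a Tablesaw release that provides .selectColumns().

```java
import tech.tablesaw.api.DoubleColumn;
import tech.tablesaw.api.IntColumn;
import tech.tablesaw.api.StringColumn;
import tech.tablesaw.api.Table;

public class SelectColumnsSketch {

    // Hypothetical sample data standing in for the course's employees table.
    static Table employees() {
        return Table.create("employees",
                StringColumn.create("Name", "Ana", "Ben", "Cho"),
                IntColumn.create("Age", 34, 67, 45),
                DoubleColumn.create("Salary", 82000.0, 54000.0, 91000.0));
    }

    // Keep only the two named columns; every row is retained.
    static Table nameAndSalary(Table table) {
        return table.selectColumns("Name", "Salary");
    }

    public static void main(String[] args) {
        Table employees = employees();
        Table slim = nameAndSalary(employees);
        System.out.println(slim.print());
        // The original table still has all of its columns.
        System.out.println(employees.columnNames());
    }
}
```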
In the previous video, we filtered on a Selection, but we can also filter on tables using the .where() method. This takes a condition and returns a new table with only matching rows. It gives us precise control over which data points we want to analyze, all while preserving our original data. In the first example, we filter for rows where Age is greater than or equal to 65. In the second, we combine conditions with .and() to find employees between 30 and 50 who earn over 75,000 dollars.
4. Sorting data
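The two filters described above might look roughly like the sketch below. The sample data and the class and method names are hypothetical, and treating "between 30 and 50" as inclusive on both ends is an assumption, because the transcript does not show the slide code.

```java
import tech.tablesaw.api.DoubleColumn;
import tech.tablesaw.api.IntColumn;
import tech.tablesaw.api.StringColumn;
import tech.tablesaw.api.Table;
import tech.tablesaw.selection.Selection;

public class WhereSketch {

    // Hypothetical sample data standing in for the course's employees table.
    static Table employees() {
        return Table.create("employees",
                StringColumn.create("Name", "Ana", "Ben", "Cho", "Dev", "Eli"),
                IntColumn.create("Age", 34, 67, 45, 29, 50),
                DoubleColumn.create("Salary", 82000.0, 54000.0, 91000.0, 99000.0, 60000.0));
    }

    // Rows where Age >= 65.
    static Table seniors(Table table) {
        return table.where(table.intColumn("Age").isGreaterThanOrEqualTo(65));
    }

    // Rows where 30 <= Age <= 50 (inclusive bounds assumed) and Salary > 75,000.
    static Table midCareerHighEarners(Table table) {
        Selection ageRange = table.intColumn("Age").isGreaterThanOrEqualTo(30)
                .and(table.intColumn("Age").isLessThanOrEqualTo(50));
        Selection highSalary = table.doubleColumn("Salary").isGreaterThan(75000);
        return table.where(ageRange.and(highSalary));
    }

    public static void main(String[] args) {
        Table employees = employees();
        System.out.println(seniors(employees).print());
        System.out.println(midCareerHighEarners(employees).print());
    }
}
```

Each call to .where() returns a new table, so the employees table itself keeps all five rows.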
Our next method is .sortOn(), which helps us organize our data for analysis and presentation. This method can sort on a single column or on multiple columns, and it sorts in ascending order by default (a column name prefixed with a minus sign is sorted in descending order instead). Like most operations in Tablesaw, it returns a new, sorted table, leaving the original unchanged. For descending order, we use .sortDescendingOn(). We can also mix ascending and descending order for more complex sorting logic: here, we sort by department (ascending) and then by salary (descending).
5. Aggregation with summarize
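As a sketch of the sorting calls described above: the sample data and the class and method names are hypothetical, and the mixed sort uses the minus-sign prefix accepted by .sortOn(), which is one way to express "department ascending, then salary descending" and may differ from the code shown on the slide.

```java
import tech.tablesaw.api.DoubleColumn;
import tech.tablesaw.api.IntColumn;
import tech.tablesaw.api.StringColumn;
import tech.tablesaw.api.Table;

public class SortSketch {

    // Hypothetical sample data standing in for the course's employees table.
    static Table employees() {
        return Table.create("employees",
                StringColumn.create("Name", "Ana", "Ben", "Cho", "Dev"),
                StringColumn.create("Department", "Eng", "Ops", "Eng", "Ops"),
                IntColumn.create("Age", 34, 67, 45, 29),
                DoubleColumn.create("Salary", 82000.0, 54000.0, 91000.0, 60000.0));
    }

    // Ascending sort on a single column.
    static Table byAge(Table table) {
        return table.sortOn("Age");
    }

    // Descending sort on a single column.
    static Table bySalaryDescending(Table table) {
        return table.sortDescendingOn("Salary");
    }

    // Department ascending, then Salary descending within each department.
    static Table byDepartmentThenSalary(Table table) {
        return table.sortOn("Department", "-Salary");
    }

    public static void main(String[] args) {
        Table employees = employees();
        System.out.println(byAge(employees).print());
        System.out.println(bySalaryDescending(employees).print());
        System.out.println(byDepartmentThenSalary(employees).print());
    }
}
```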
The .summarize() method is a powerful tool for aggregating data. It allows us to compute summary statistics, such as mean, count, or max, for one or more columns. The result is a new table containing the calculated values. To use it, we first need to import the AggregateFunctions class (tech.tablesaw.aggregate.AggregateFunctions). In our first example, we call .summarize() on the employees table to calculate the mean, count, and max of the Salary column. This gives us a concise overview of that column's values, with each statistic shown in its own column in the summary output. We also need to call .apply() on the result of .summarize() to actually compute the summary. Note that if we do not specify a column, it will apply the summary functions to all numeric columns in the table.
6. Aggregation with summarize
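The single-column summary described above can be sketched as follows. The sample data and the class and method names are hypothetical; the static imports pull in the mean, count, and max functions from AggregateFunctions.

```java
import static tech.tablesaw.aggregate.AggregateFunctions.count;
import static tech.tablesaw.aggregate.AggregateFunctions.max;
import static tech.tablesaw.aggregate.AggregateFunctions.mean;

import tech.tablesaw.api.DoubleColumn;
import tech.tablesaw.api.IntColumn;
import tech.tablesaw.api.StringColumn;
import tech.tablesaw.api.Table;

public class SummarizeSketch {

    // Hypothetical sample data standing in for the course's employees table.
    static Table employees() {
        return Table.create("employees",
                StringColumn.create("Name", "Ana", "Ben", "Cho"),
                IntColumn.create("Age", 34, 67, 45),
                DoubleColumn.create("Salary", 82000.0, 54000.0, 91000.0));
    }

    // summarize(...) only describes the aggregation; apply() runs it
    // and returns the summary as a new table.
    static Table salarySummary(Table table) {
        return table.summarize("Salary", mean, count, max).apply();
    }

    public static void main(String[] args) {
        Table summary = salarySummary(employees());
        // One row, with one column per statistic.
        System.out.println(summary.print());
    }
}
```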
We can also create more complex summaries. In this second example, we specify Salary and Age as the target columns and calculate the mean, median, min, and max for each of them. The result is a summary table that again has one column per statistic: eight columns in total, since we have two columns, Salary and Age, and four metrics for each.
7. Core manipulation methods
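A sketch of the two-column summary described above, again with hypothetical sample data and hypothetical class and method names:

```java
import static tech.tablesaw.aggregate.AggregateFunctions.max;
import static tech.tablesaw.aggregate.AggregateFunctions.mean;
import static tech.tablesaw.aggregate.AggregateFunctions.median;
import static tech.tablesaw.aggregate.AggregateFunctions.min;

import tech.tablesaw.api.DoubleColumn;
import tech.tablesaw.api.IntColumn;
import tech.tablesaw.api.StringColumn;
import tech.tablesaw.api.Table;

public class MultiColumnSummarySketch {

    // Hypothetical sample data standing in for the course's employees table.
    static Table employees() {
        return Table.create("employees",
                StringColumn.create("Name", "Ana", "Ben", "Cho"),
                IntColumn.create("Age", 34, 67, 45),
                DoubleColumn.create("Salary", 82000.0, 54000.0, 91000.0));
    }

    // Four statistics for each of two columns: 2 x 4 = 8 result columns.
    static Table salaryAndAgeSummary(Table table) {
        return table.summarize("Salary", "Age", mean, median, min, max).apply();
    }

    public static void main(String[] args) {
        Table summary = salaryAndAgeSummary(employees());
        System.out.println(summary.columnNames());
        System.out.println(summary.print());
    }
}
```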
As a recap, Tablesaw provides a set of powerful manipulation methods. With .selectColumns(), we can choose specific columns. The .where() method lets us filter rows using conditions. The .sortOn() method helps us order our data, and .summarize() computes statistics on our dataset, taking both the columns to summarize and the statistics we want to calculate on them. Remember, all of these operations return new table objects.
8. Let's practice!
Well done getting through this video! Let's now practice these skills with a few exercises.