1. Introduction to spreadsheets
Welcome! In this chapter, you will learn how to build pipelines to data stored in spreadsheets, plus additional data wrangling techniques.
2. Spreadsheets
Spreadsheets, also called Excel files, are ubiquitous today. If you ever worked with them, the dataframe structure probably looked familiar.
Like dataframes, spreadsheets organize information in tables, with cells of data in rows and columns.
Unlike dataframes and flat files, spreadsheets can have formulas with automatically updating results. Spreadsheets also support formatting that flat files do not. While dataframes can be formatted for presentation purposes, like in Jupyter notebooks, pandas does not import spreadsheet formatting, so check that spreadsheets you want to load do not use formatting to convey important information. Or better yet, export them without formatting.
Finally, an Excel file or workbook can have multiple spreadsheets.
3. Loading Spreadsheets
At its simplest, reading data from a spreadsheet resembles reading a flat file, just with its own function: read Excel.
Let's demonstrate with an Excel file of responses from FreeCodeCamp's New Developer Survey, which asks new developers about their employment, education, job goals, and learning resources used. This file has been modified to make it easier to practice with, but you can grab the full dataset from Kaggle or FreeCodeCamp's GitHub. Previewing the file in a spreadsheet program, we see that it's pretty straightforward: tabular data with no formatting.
4. Loading Spreadsheets
As always, we first import pandas as pd, then pass the spreadsheet path as a string to pd dot read Excel and assign the result to a variable.
When we check the head, the dataframe looks as expected.
5. Loading Select Columns and Rows
Of course, not all spreadsheets are so simple, and formatting features that help human readers can make it trickier to read by machine. Spreadsheets commonly contain non-tabular information, such as metadata headers
or have smaller tables of information, like this invoice. Fortunately, read Excel, like read CSV, has keyword arguments to select columns and skip or limit rows to get only the cells you want.
6. Loading Select Columns and Rows
Read Excel shares many parameters with read CSV, but some work slightly differently, so check pandas documentation if you're unsure.
Nrows is the same, taking an integer of rows to load.
read Excel's skiprows argument is also similar: it takes either a number of rows to skip at the beginning, or a list of row numbers to skip.
Finally, read Excel's usecols argument accepts a list of either column numbers or column names, or a function to decide whether to load a column, just like read CSV's usecols. However, it also accepts an integer of columns to load, or a string of Excel column letter or ranges, like "A:P".
7. Loading Select Columns and Rows
Let's use these arguments to get a subset of survey response data, focusing on employment. The columns are organized alphabetically, with job-related variables in columns W through AB and income in AR. Also note that this version of the data has a two-row header providing links to the source that we'll want to exclude.
8. Loading Select Columns and Rows
We load the data with read Excel as before, but set skiprows to 2 to exclude the metadata. Since we know which column letters we want, we can pass the range and standalone income column as a single string to usecols.
Checking the dataframe, we see there are only a few columns loaded, and no metadata.
9. Let's practice!
In this lesson, you learned about creating dataframes from Excel data. Now, it's your turn to practice. Good luck!