Get startedGet started for free

Modifying flat file imports

1. Modifying flat file imports

You now know how to load a flat file into a dataframe. Sometimes though, you only want part of the file. In this lesson, we'll look at ways to limit the amount of data imported, and how to make that data easier to work with by naming columns.

2. U.S. Tax Data

Let's revisit the U.S. tax data. Checking the data's shape attribute, we see that it has almost 180,000 rows and 147 columns. Unless your computer has very little memory or is running several other programs, pandas can easily handle data this size, but it is more than we need. We have some options to whittle it down.

3. Limiting Columns

The first way is to choose columns to load with the usecols keyword argument. Usecols can take a list of either all column names or all column numbers to import. It can even accept a function to select columns, so that you can, say, import all columns starting with the letter N, but this is a more advanced technique. To focus on the number of tax returns by income band and ZIP code, we can set up a list of either the relevant column names or their numbers. We supply the file path as a string to read CSV as usual. This time, we also pass usecols the list of column names or numbers to load. We can check to make sure the two ways produce the same result, and they do.

4. Limiting Rows

Another option is to reduce the number of rows imported with the nrows argument. When reading the file, we pass an integer of the maximum number of rows we want, 1000 here. Checking the dataframe's shape, we see we have exactly the number of rows we asked for.

5. Limiting Rows

Nrows is especially handy when combined with the skiprows argument to process a file in chunks. Skiprows accepts a list of row numbers to skip, a function to determine whether to skip a row, or the number of rows to skip. Note that pandas automatically makes the first row imported the header, so if you skip the row with column names, you should also specify that header equals none. Let's get rows 1000 to 1500. Like before, we use nrows to specify how many rows we want, but add that skiprows equals 1000. We skipped the header row, so we also specify there isn't one here.

6. Limiting Rows

Checking the head of the dataframe, we see there are no column names. Let's fix this.

7. Assigning Column Names

To assign column names when there aren't any, we use another read CSV argument: names. As you probably guessed, names takes a list of column names to use. The list must include a name for every column in the data -- if you only want to rename some columns, it should be done after import.

8. Assigning Column Names

For datasets with fewer columns, you might manually build the names list with the help of a data dictionary. We want to copy a long list of column names from an existing dataframe, so we'll pass the first tax subset to the list function to get the column labels. Then we read the CSV, using nrows and skiprows to get the next 500 records, setting header to none, and passing the list of column names to names. Now our dataframe has column names!

9. Let's practice!

Whew! You just learned workflows to manage the amount of data loaded from a flat file. Now you'll get a chance to practice those techniques. Good luck!