1. Introduction to DataFrames
In this lesson, you'll learn how to work with tabular data using "DataFrames," a data structure from the "pandas" package.
2. DataFrames
pandas is a Python package that lets you store your data in DataFrames -- tabular structures where each row is a record or observation, and each column can be a different type. If you've used MATLAB's "tables," the concept is the same.
DataFrames are a convenient way to store mixed data, where each observation has several values of different data types: floats, integers, booleans, and strings.
3. DataFrames
For example, in the data shown here, each row is one state from a dataset of pet ownership across the United States. Each record has a unique index, the name of the state, the state's rank in overall pet ownership, the total number of households, and the fraction of households that own dogs and cats.
This kind of data is perfect for storing in DataFrames. Each row is a single record or observation, so the values in it belong together. Within each column, every value shares the same data type; in this case, the columns hold strings and floats.
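To make the row and column idea concrete, here is a minimal sketch of building a small DataFrame by hand. The state names and all of the numbers are made up for illustration and are not the real pet ownership data; the column names are assumptions as well. This made-up sample also includes an integer column, so you can see several types side by side.

```python
import pandas as pd

# Made-up values for illustration only; not the real pet ownership dataset.
# Column names are hypothetical.
pets = pd.DataFrame({
    "state": ["State A", "State B", "State C"],   # text column
    "households": [250, 1200, 640],               # integer column
    "dog_fraction": [0.45, 0.38, 0.41],           # float column
    "cat_fraction": [0.49, 0.33, 0.36],           # float column
})

# Each column has a single data type; .dtypes lists one type per column.
print(pets.dtypes)

# Each row is one record: three rows, four columns.
print(pets.shape)
```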
4. .head() method
I've loaded the data into a DataFrame named "pets." We can start to explore this DataFrame by calling the method ".head()", which returns the first five rows so we can print them to the console.
This is a convenient way to explore new DataFrames and start to understand what data they contain before analyzing.
When printing the DataFrame, pandas prints the name of each column, as well as the index label for each row.
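As a rough sketch of what calling ".head()" looks like, the example below builds a seven-row DataFrame from made-up values (the state names, column name, and numbers are all placeholders, not the real dataset) and then looks at the top of it.

```python
import pandas as pd

# Seven made-up rows so that .head() has something to trim.
pets = pd.DataFrame({
    "state": [f"State {letter}" for letter in "ABCDEFG"],
    "dog_fraction": [0.45, 0.38, 0.41, 0.29, 0.36, 0.44, 0.31],
})

# With no argument, .head() returns the first five rows as a new DataFrame.
first_rows = pets.head()
print(first_rows)

# Passing a number asks for that many rows instead.
first_two = pets.head(2)
print(first_two)
```

Note that ".head()" does not change "pets" itself; it only hands back a shorter view of it.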
5. .columns attribute
We can get the column names alone with the ".columns" attribute of the DataFrame. This returns a pandas Index object that holds the column names in order.
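Here is a small sketch of reading the column names, again using a made-up DataFrame with hypothetical column names rather than the real data.

```python
import pandas as pd

# Made-up values and hypothetical column names, for illustration only.
pets = pd.DataFrame({
    "state": ["State A", "State B"],
    "dog_fraction": [0.45, 0.38],
    "cat_fraction": [0.49, 0.33],
})

# .columns is an attribute, not a method, so there are no parentheses.
column_names = pets.columns
print(column_names)

# Convert to a plain Python list if that is easier to work with.
print(list(column_names))
```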
6. .index attribute
Similarly, we can get the labels for each row with the ".index" attribute. pandas row labels are different from the 0-based positions of Python lists and NumPy arrays. They are not guaranteed to start at zero, to increase monotonically, or even to be integers. pandas does not strictly require index labels to be unique either, although unique labels, like the ones in this dataset, keep lookups unambiguous.
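The sketch below shows two made-up DataFrames whose row labels do not look like list positions: one is labeled with strings, and the other with integers that are out of order. All names and values here are placeholders chosen for illustration.

```python
import pandas as pd

# Row labels can be strings, such as state names (made-up here).
pets = pd.DataFrame(
    {"dog_fraction": [0.45, 0.38, 0.41]},
    index=["State A", "State B", "State C"],
)
print(pets.index)

# Row labels can also be integers that neither start at zero nor increase.
survey = pd.DataFrame(
    {"dog_fraction": [0.45, 0.38, 0.41]},
    index=[10, 3, 7],
)
print(survey.index)

# .loc looks rows up by label: label 3 is the second row, not the fourth.
print(survey.loc[3, "dog_fraction"])
```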
7. Getting one column out
Individual columns can be accessed by passing the column name inside square brackets, just like looking up a key in a dictionary.
This returns a pandas "Series" object, which is like a one-dimensional DataFrame. Note that the indices here match the indices of the DataFrame. In this case, the indices are the names of the states.
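As a minimal sketch of pulling out one column, the example below uses a made-up DataFrame indexed by placeholder state names; the column names and values are assumptions for illustration.

```python
import pandas as pd

# Made-up values, indexed by placeholder state names.
pets = pd.DataFrame(
    {
        "dog_fraction": [0.45, 0.38, 0.41],
        "cat_fraction": [0.49, 0.33, 0.36],
    },
    index=["State A", "State B", "State C"],
)

# Square brackets with a column name return that column as a Series.
dogs = pets["dog_fraction"]
print(type(dogs))
print(dogs)

# The Series keeps the DataFrame's row labels, so we can look up by state.
print(dogs["State B"])
```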
8. NumPy & Matplotlib compatible
Because pandas is built on top of NumPy, you can apply NumPy functions like mean() and max() to DataFrame columns or pass columns directly into Matplotlib for plotting, just like lists and NumPy arrays.
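Here is a short sketch of the NumPy half of that statement, using a made-up column of values (the names and numbers are placeholders, not the real dataset). Passing the same column to a Matplotlib plotting function follows the same pattern, but it is left out here to keep the example small.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
pets = pd.DataFrame(
    {"dog_fraction": [0.45, 0.38, 0.41]},
    index=["State A", "State B", "State C"],
)

# NumPy functions accept a DataFrame column directly.
average_dogs = np.mean(pets["dog_fraction"])
highest_dogs = np.max(pets["dog_fraction"])

print(average_dogs)
print(highest_dogs)
```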
9. Let's practice!
Now that you've learned a bit about pandas DataFrames, let's get some practice.