1. Vectorization with NumPy arrays using .values()
To close this chapter, we will discuss another vectorization method to efficiently iterate over a pandas DataFrame and apply a specific function, using NumPy arrays.
2. NumPy in pandas
The NumPy library, which defines itself as a "fundamental package for scientific computing in Python", performs operations under the hood in optimized, pre-compiled C code.
Similar to pandas working with arrays, NumPy operates on arrays called ndarrays. A major difference between Series and ndarrays is that ndarrays leave out many operations such as indexing, data type checking, etc. As a result, operations on NumPy arrays can be significantly faster than operations on pandas Series.
NumPy arrays can be used in place of pandas Series when the additional functionality offered by pandas Series isn’t critical.
3. How to perform NumPy vectorization
For the problems we explore in this course, we could use NumPy ndarrays instead of pandas series. The question at stake is whether this would be more efficient.
We will show how to perform NumPy vectorization using the same poker dataset as in the previous lessons. Again, we want to calculate the sum of the ranks of all the cards in each hand.
We convert our rank arrays from pandas Series to NumPy arrays simply by using the .values method of pandas Series, which returns a pandas Series as a NumPy ndarray. As with vectorization on the series, passing the NumPy array directly into the function will lead pandas to apply the function to the entire vector.
At this point, we could choose to call it a day; vectorizing over pandas series achieves the overwhelming majority of optimization needs for everyday calculations. However, if speed is of highest priority, we can call in reinforcements in the form of the NumPy Python library.
Comparing to the previous state of the art method, the pandas optimization, we still get an improvement on the operating time.
4. Let's do it!
Now you know all the possible ways to iterate through a pandas DataFrame and implement a simple function for every row. Let's vectorize!