numpy.where: A Case Study in Vectorization

Loops were probably the first thing I learned as a new coder that made me feel like I was really starting to get (and like) this coding thing. They were complex enough that I felt like I was learning something new, understandable enough that I felt like I got it, and useful enough that I knew I would use them throughout my career. Loops were (and still are) one of my best coding friends, so imagine my dismay when I learned about the several other techniques that outperform them. But being a data scientist is all about learning new things and figuring out the best strategy to solve any given problem, and it turns out these other techniques are there for a reason. One of my newer favorites is numpy.where, which is a form of vectorization and blows loops out of the water (sorry, old friend).

But first, let’s talk about vectorization in general. Put simply, vectorization allows numpy to apply an operation to an entire array in one call rather than to each element individually in a Python loop. For example, if I have a dataframe with 5,000 rows and I want to perform an action on a particular column, I can perform 5,000 individual actions, or just one on the entire column. This may sound similar to broadcasting, if you are familiar with that concept, but the two are distinct: broadcasting is the set of rules numpy uses to combine arrays of different shapes, while vectorization is the more general practice of replacing explicit Python loops with whole-array operations that run in compiled code.
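To make the idea concrete, here is a minimal sketch comparing an element-by-element loop with a single vectorized operation. The array name `values` and the doubling operation are illustrative assumptions, not part of the original example.

```python
import numpy as np

# Hypothetical data: 5,000 integers standing in for a dataframe column.
values = np.arange(5000)

# Element by element: Python executes 5,000 separate multiplications.
doubled_loop = np.empty_like(values)
for i in range(len(values)):
    doubled_loop[i] = values[i] * 2

# Vectorized: one expression; numpy runs the loop internally in compiled code.
doubled_vec = values * 2

# Both approaches produce the same array.
print(np.array_equal(doubled_loop, doubled_vec))  # True
```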

Now let’s think about this more with the example of numpy.where, which I’ll be referring to moving forward as np.where. np.where uses the syntax: np.where(condition, x, y). “Condition” is a boolean array (or an expression that evaluates to one). For each element, np.where returns the corresponding value from x where the condition is True and from y where it is False.
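As a quick illustration of that syntax, the sketch below labels each number in a small array. The array `numbers`, the threshold of 4, and the labels are made up for demonstration.

```python
import numpy as np

# Hypothetical input array.
numbers = np.array([3, 8, 5, 10])

# Where the condition is True take "big"; everywhere else take "small".
labels = np.where(numbers > 4, "big", "small")

print(labels)  # ['small' 'big' 'big' 'big']
```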

Let’s look at this in practice, also comparing to a for loop. I’ll perform the same action twice, once with a for loop and once with np.where, to show both how much simpler the code is with the latter and how much quicker it runs. To measure runtime I’ve used the time module. Calling time.time() returns the current time in seconds. Calling it immediately before and after a piece of code and finding the difference tells us how long that code took to run. For example:
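A minimal sketch of that timing pattern might look like the following. The variable names `start`, `result`, and `elapsed` are my own choices, and the exact time printed will differ from machine to machine.

```python
import time

start = time.time()              # timestamp just before the code under test
result = 1 + 1                   # the code being timed
elapsed = time.time() - start    # seconds between the two timestamps

print(f"1 + 1 = {result}, took {elapsed:.6f} seconds")
```

For very short snippets like this one, time.perf_counter() is generally a better clock than time.time(), but the before-and-after pattern is the same.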

Adding 1+1 took very little time! Now let’s see how long the others take…

For our example, let’s say we have a dataframe with a column of integers ranging from 0–9 and a blank column labeled “Odd”.
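One way such a dataframe could be built is sketched below. The column name "Number" is hypothetical (the first column is not named in the text), and I am assuming the values are simply 0 through 9 in order.

```python
import numpy as np
import pandas as pd

# "Number" is an assumed column name; values are assumed to be 0..9 in order.
df = pd.DataFrame({"Number": np.arange(10)})

# Blank placeholder column that will be filled in later.
df["Odd"] = np.nan

print(df)
```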

We now want to replace values in the “Odd” column with a binary value to indicate whether the value in the first column is odd or not. If it is, we want the “Odd” column to contain a 1, otherwise a 0.

We can accomplish this by iterating over each row using a for loop in 0.0015 seconds:
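The loop itself is not reproduced here, so the following is only a sketch of what a five-line, row-by-row version might look like. It assumes a hypothetical "Number" column holding 0 through 9, and your timing will vary from the figure quoted above.

```python
import time
import numpy as np
import pandas as pd

# Assumed setup: "Number" is a hypothetical column name.
df = pd.DataFrame({"Number": np.arange(10)})
df["Odd"] = np.nan

start = time.time()
# Visit each row, test the value, and write 1 (odd) or 0 (even).
for i in range(len(df)):
    if df.loc[i, "Number"] % 2 == 1:
        df.loc[i, "Odd"] = 1
    else:
        df.loc[i, "Odd"] = 0
elapsed = time.time() - start

print(f"for loop: {elapsed:.4f} seconds")
```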

I know what you’re thinking: “Great! That took no time at all!”

Now let’s try it using np.where:
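A sketch of the equivalent np.where call, again assuming a hypothetical "Number" column holding 0 through 9, could be:

```python
import time
import numpy as np
import pandas as pd

# Assumed setup: "Number" is a hypothetical column name.
df = pd.DataFrame({"Number": np.arange(10)})
df["Odd"] = np.nan

start = time.time()
# One line: 1 where the value is odd, 0 everywhere else.
df["Odd"] = np.where(df["Number"] % 2 == 1, 1, 0)
elapsed = time.time() - start

print(f"np.where: {elapsed:.4f} seconds")
```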

Ok, yes, this actually took longer: about 6.2 times as long, coming in at a whopping 0.0093 seconds. We’ll come back to that. Remember, whereas the for loop went through each of the 10 rows in this dataset one at a time to identify whether or not the value was odd, np.where evaluated the entire array at once. It also took one line of code instead of the for loop’s 5.

Now, you’re right, np.where ended up taking longer than the for loop. But what happens if, instead of looking at 10 rows of data one-by-one, our computer has to look at 1,000?

Using a for loop, this took 0.09 seconds, or about 60 times as long as the 10-row for loop version. But with np.where…

0.0006 seconds! Not only is this 150 times faster than the for loop on the same 1,000-row dataframe, it’s also faster than the np.where call on the 10-row dataframe earlier. That earlier timing was most likely dominated by fixed overhead rather than by the data itself. np.where still does more work on a longer array, but because it processes the whole array in compiled code rather than one Python iteration per row, each additional row costs far less than it does in a for loop.

Just for fun, let’s try once more using an array of one million observations. First with a for loop:

Yikes, that took over a minute! I bet you can guess whether np.where will take more or less time…

Correct you are, my friend. Our for loop took over 2,071 times as long as np.where, which came in at a whopping 0.03 seconds to operate over one million observations.
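If you would like to reproduce this comparison yourself, the sketch below wraps both approaches in one function. The function name `compare`, the "Number" column, and the use of random digits are all my own assumptions, and the timings you see will depend on your machine and library versions.

```python
import time
import numpy as np
import pandas as pd

def compare(n_rows):
    """Time a row-by-row loop and np.where on n_rows random digits (0-9)."""
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"Number": rng.integers(0, 10, size=n_rows)})
    df["Odd"] = np.nan

    # Row-by-row for loop.
    start = time.time()
    for i in range(len(df)):
        df.loc[i, "Odd"] = 1 if df.loc[i, "Number"] % 2 == 1 else 0
    loop_seconds = time.time() - start
    loop_result = df["Odd"].to_numpy().copy()

    # Single vectorized call.
    start = time.time()
    df["Odd"] = np.where(df["Number"] % 2 == 1, 1, 0)
    where_seconds = time.time() - start

    # Sanity check: both approaches must produce the same flags.
    assert np.array_equal(loop_result, df["Odd"].to_numpy())
    return loop_seconds, where_seconds

for n in (10, 1_000):
    loop_s, where_s = compare(n)
    print(f"{n:>5} rows: loop {loop_s:.4f}s, np.where {where_s:.4f}s")
```

Calling compare(1_000_000) should also work, but expect the loop half to take a long time.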

This is a very simple use case, but np.where comes up all the time in data cleaning and feature engineering. You can use it if you’re working with medical data and want to re-code diagnosis codes into the actual name of the diagnosis. You can use it if you want to quickly identify the season of an event given the month. Being able to do so quickly, elegantly, and efficiently with np.where is far preferable to a clunky, slow for loop (no disrespect meant to the for loop, of course). Learning more about np.where and other examples of vectorization is key to becoming an effective data scientist.
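As one illustration of the season idea, np.where calls can be nested to handle more than two outcomes. The sketch below assumes Northern Hemisphere meteorological seasons and a made-up `months` array; adjust the groupings to suit your own data.

```python
import numpy as np

# Hypothetical month numbers (1 = January ... 12 = December).
months = np.array([1, 4, 7, 10, 12])

# Assumed season boundaries: Dec-Feb winter, Mar-May spring, Jun-Aug summer.
# Each nested np.where handles the rows the outer condition rejected.
seasons = np.where(
    np.isin(months, [12, 1, 2]), "Winter",
    np.where(
        np.isin(months, [3, 4, 5]), "Spring",
        np.where(np.isin(months, [6, 7, 8]), "Summer", "Fall"),
    ),
)

print(seasons)  # ['Winter' 'Spring' 'Summer' 'Fall' 'Winter']
```

Once the number of branches grows, as it would with a long list of diagnosis codes, np.select or a dictionary lookup may be easier to read than deeply nested np.where calls.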

Former museum professional, current data scientist