Why should we use NumPy arrays instead of Python lists?

Rakib Hasan Bappy
3 min read · Apr 10, 2023

When we start exploring Machine Learning or Data Science, we quickly notice that NumPy arrays are used far more often than Python lists. As a beginner, it is really tough to understand why, because the two look pretty similar. So, today I will try to show, with a concrete example, why we should use a NumPy array instead of a Python list.

Let’s say we have to multiply two lists of numbers element by element and output the sum of the resulting list in Python. What will we do? A very naive approach is given below:

list1 = [1, 3, 4]
list2 = [4, 5, 6]

result = list1[0]*list2[0] + list1[1]*list2[1] + list1[2]*list2[2]

print(result)

But what happens if there are 1000 numbers in each list? Writing out every term by hand is no longer practical, so we can use a loop instead:

list1 = [1, 3, 4, 7, 9, 11]
list2 = [4, 5, 6, 8, 12, 14]

result = 0
for i in range(len(list1)):
    result += list1[i] * list2[i]

print(result)
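
As a side note, the same loop can be written a little more compactly with the built-in zip and sum functions. This is only a sketch of an alternative spelling. It is still a plain Python loop under the hood, so it does not change the efficiency story.

```python
list1 = [1, 3, 4, 7, 9, 11]
list2 = [4, 5, 6, 8, 12, 14]

# pair up the elements, multiply each pair, and add the products
result = sum(x * y for x, y in zip(list1, list2))

print(result)
```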

While a loop is better than writing out every term by hand, it still isn’t that efficient, because Python executes each iteration one at a time in the interpreter. To make it more efficient, I will now introduce you to the NumPy array. First of all, I will convert the two given lists into NumPy arrays.

import numpy as np  # by convention, numpy is imported as np

# these lines convert the lists into NumPy arrays
arr1 = np.array(list1)
arr2 = np.array(list2)

Now, we will use NumPy’s dot function to do our task.

result = np.dot(arr1, arr2)
print(result)

This method is called vectorization. NumPy’s dot function is a vectorized implementation of the dot product between two vectors, and especially when n (the length of the list) is large, it will run much faster than the two previous code examples. I want to emphasize that vectorization has two distinct benefits. First, it makes the code shorter: the whole computation is now just one line. Isn’t that cool? Second, it makes the code run much faster than either of the two previous implementations, which did not use vectorization.

The reason for the speed-up is behind the scenes. Instead of stepping through the elements one by one in the Python interpreter, np.dot hands the whole computation to optimized, compiled code that can take advantage of the parallel hardware in your CPU, such as SIMD instructions and, depending on how NumPy was built, multiple cores. Note that NumPy itself runs on the CPU. The same vectorized style carries over to GPU-based libraries, which are often used to accelerate machine learning jobs. This makes np.dot much more efficient than the for loop or the term-by-term calculation that we saw previously, and much more practical when n is large.
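
To make the idea a bit more concrete, here is a small sketch of what the dot product computes, using the same six numbers as before. The names products, total, and list_multiply_works are just illustrative. Multiplying two arrays with * works element by element, which a Python list cannot do, and summing the products gives the same value as np.dot.

```python
import numpy as np

arr1 = np.array([1, 3, 4, 7, 9, 11])
arr2 = np.array([4, 5, 6, 8, 12, 14])

products = arr1 * arr2   # element-wise multiplication, no explicit loop
total = products.sum()   # add up all the products

print(products)
print(total == np.dot(arr1, arr2))  # the two approaches agree

# a plain list does not support this: list * list raises a TypeError
try:
    [1, 3, 4] * [4, 5, 6]
    list_multiply_works = True
except TypeError:
    list_multiply_works = False

print(list_multiply_works)
```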

Now, let’s check whether vectorization is actually faster. To see this, run the code given below.

import time
import numpy as np

def my_dot(a, b):
    # loop version of the dot product
    # (assumed definition; same logic as the for loop above)
    x = 0
    for i in range(a.shape[0]):
        x = x + a[i] * b[i]
    return x

a = np.random.rand(10000000)  # very large arrays
b = np.random.rand(10000000)

tic = time.time()  # capture start time
c = np.dot(a, b)
toc = time.time()  # capture end time

print(f"np.dot(a, b) = {c:.4f}")
print(f"Vectorized version duration: {1000*(toc-tic):.4f} ms ")

tic = time.time()  # capture start time
c = my_dot(a, b)
toc = time.time()  # capture end time

print(f"my_dot(a, b) = {c:.4f}")
print(f"loop version duration: {1000*(toc-tic):.4f} ms ")

del a, b  # remove these big arrays from memory

# this code is taken from the 'Machine Learning Specialization' course on Coursera.

The output of the code looks like this (the exact values and timings will differ between runs and machines):

np.dot(a, b) =  2501072.5817
Vectorized version duration: 191.1116 ms
my_dot(a, b) = 2501072.5817
loop version duration: 10142.9269 ms

So, vectorization provides a large speed-up in this example: about 191 ms versus about 10,143 ms, which is roughly 50 times faster. This is because NumPy makes better use of the data parallelism available in the underlying hardware. This is critical in Machine Learning, where the data sets are often very large.
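
If you want to repeat the measurement yourself, a slightly more robust way to time code is the standard-library timeit module, which runs a statement several times. The sketch below is a variation on the code above, not the course code: loop_dot is a hypothetical name for the loop version, and the array size is reduced to 100,000 so that it finishes quickly. The exact timings depend on your machine, so no expected output is shown.

```python
import timeit
import numpy as np

def loop_dot(a, b):
    # hypothetical helper: dot product with an explicit Python loop
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total

rng = np.random.default_rng(0)  # fixed seed so the arrays are repeatable
a = rng.random(100_000)
b = rng.random(100_000)

# run each version 5 times and measure the total time in seconds
loop_time = timeit.timeit(lambda: loop_dot(a, b), number=5)
vec_time = timeit.timeit(lambda: np.dot(a, b), number=5)

print(f"loop version:       {1000 * loop_time:.2f} ms")
print(f"vectorized version: {1000 * vec_time:.2f} ms")
print(f"same result: {np.isclose(loop_dot(a, b), np.dot(a, b))}")
```

Using timeit with several repetitions smooths out one-off delays that a single time.time() measurement can pick up.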

To recap, vectorization makes your code shorter, so hopefully easier to write and easier for you or others to read, and it also makes it run much faster.

There are more reasons to use NumPy arrays. I demonstrated just one of them here because it is the easiest to understand.
