Preparation#

Reading material#

In this course, we cover only the very basics of NumPy and Matplotlib; each of these libraries could fill a course of its own.

A good place to start is the NumPy Absolute Beginners Tutorial. Another good introduction to NumPy is provided by W3Schools Introduction to NumPy.

At W3Schools, you can also find a good Matplotlib Tutorial. To get started with Matplotlib, you can also check the Matplotlib Quick Start. And to see what Matplotlib can do, check the Matplotlib Plot Types.

Copy-and-Run#

Prep 12.1: NumPy Array#

NumPy is a library for data analysis, and you will need to import it using import numpy as np. This week, we focus on the numpy.ndarray class, which is designed for storing and manipulating numerical data in vectors (1D), matrices (2D) and higher-dimensional (ND) arrays. We call such arrays NumPy arrays.

Try running the code below.

import numpy as np

a_list = [13.5, 0.6, 40.1, 20.2, 15.8]
a_array = np.array(a_list)

print(a_list, type(a_list))
print(a_array, type(a_array))

The function np.array() creates a NumPy array from a list. Try to pass a tuple to np.array() and see whether it will create a NumPy array.

Now try creating this NumPy array.

a_list = [1, [2, 3]]
a_array = np.array(a_list)

As you can see, not all lists can be converted to NumPy arrays. NumPy arrays have to be homogeneous. This means that all elements in the array have to be of the same type.

Try now to create this NumPy array. Do you think it will work?

a_list = ['age', 15.5, True]
a_array = np.array(a_list)
print(a_array)

Notice that NumPy converted all elements to strings, just because one of the elements was a string.

Try now predicting the output of the following code.

a_list = [1, 1.2, True]
a_array = np.array(a_list)
print(a_array)

You can check the type of data stored in the NumPy array by accessing its dtype (data type) attribute. Try also printing type(a_array) to see the difference between the type of the object and the data type of the array's elements.

NumPy arrays are designed to store numerical data. The most common data types are int64 for integers, float64 for floating-point numbers, and bool for boolean values.

print(np.array([0, 3, 6]).dtype)
print(np.array([0.0, 3.0, 6.1]).dtype)
print(np.array([False, True, True]).dtype)

It is sometimes necessary to change the data type of the array. Try running this code.

array0 = np.array([4.87, 3.25, 6.15, 0.0])
array1 = array0.astype(np.int64)
array2 = array0.astype(bool)  # the built-in bool works in all NumPy versions; np.bool is missing in some

print(array0, array0.dtype)
print(array1, array1.dtype)
print(array2, array2.dtype)

As you can see, the method astype() returns a new array with the specified data type. The original array remains unchanged.
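One detail worth noticing is that converting floats to integers with astype() drops the fractional part instead of rounding. The sketch below illustrates this with made-up values (the variable names are only illustrative) and shows one possible way to round first using np.round().

```python
import numpy as np

values = np.array([4.87, 3.25, -6.75, 0.0])

# astype(np.int64) truncates toward zero: the fractional part is dropped
truncated = values.astype(np.int64)

# To round to the nearest integer instead, round first and then convert
rounded = np.round(values).astype(np.int64)

print(truncated)
print(rounded)
```

Here 4.87 becomes 4 after truncation but 5 after rounding, and -6.75 becomes -6 after truncation but -7 after rounding.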

Another important property of a NumPy array is its size. You can check the size of the array by accessing its size attribute. Try running the following code.

a = np.array([8, 6, 12])
print(a.size)

For 1D arrays, the size is the same as the length. This will be different when working with 2D and ND arrays.
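As a small preview of that difference, the sketch below compares len() and size for a 1D array and for a small 2D array. The 2D array m is made up for illustration.

```python
import numpy as np

a = np.array([8, 6, 12])
print(len(a), a.size)      # equal for a 1D array

m = np.array([[1, 2, 3],
              [4, 5, 6]])
print(len(m), m.size)      # len counts rows, size counts all elements
```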

Prep 12.2: Element-wise Operations#

Try now this code to compare the behavior of NumPy arrays and lists.

a_list = [0, 15, 20, 25, 20]
a_array = np.array(a_list)

b_list = [1, 2, 3, 4, 5]
b_array = np.array(b_list)

c_list = a_list + b_list
c_array = a_array + b_array

print(c_list)
print(c_array)

As you can see, NumPy arrays overload the arithmetic operator + to perform element-wise addition. If you wanted to do the same with lists, you would have to use a for-loop.
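For comparison, a loop-based version for lists could look like the sketch below. It assumes that both lists have the same length, and it checks the result against the NumPy version.

```python
import numpy as np

a_list = [0, 15, 20, 25, 20]
b_list = [1, 2, 3, 4, 5]

# Element-wise addition of two lists needs an explicit loop
c_list = []
for i in range(len(a_list)):
    c_list.append(a_list[i] + b_list[i])

# The same computation with NumPy arrays is a single expression
c_array = np.array(a_list) + np.array(b_list)

print(c_list)
print(c_array)
```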

Write the code where you try element-wise subtraction -, multiplication *, and division / on arrays a_array and b_array. Try also the integer division // and the power **.

Run now this code.

v = np.array([0.1, 0.5, -0.1, 0.2])
k = 100
a = v + k
print(a)

As you can see, NumPy allows you to add a scalar to an array. Check whether you can also compute k + v, v - k, k * v, v / k, k // v, and k ** v for a scalar k and an array v.

There are other functions that create NumPy arrays. Try to run the following code.

array1 = np.arange(1, 10, 1.5)
print(array1)

array2 = np.linspace(1, 10, 20)
print(array2)

array3 = np.zeros(5)
print(array3)

array4 = np.ones(6)
print(array4)

The arguments of np.arange are start, stop, and step. The arguments of np.linspace are start, stop, and num. Notice that the endpoint is not included in the np.arange function, whereas it is included in np.linspace.
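The endpoint behaviour is easiest to see on a small interval. The sketch below uses values chosen for illustration, so that the step is exactly representable as a float. It also shows the endpoint=False option of np.linspace, which excludes the stop value.

```python
import numpy as np

# np.arange(start, stop, step): stop itself is excluded
with_arange = np.arange(0, 1, 0.25)

# np.linspace(start, stop, num): stop is the last element
with_linspace = np.linspace(0, 1, 5)

# endpoint=False makes np.linspace exclude stop as well
without_endpoint = np.linspace(0, 1, 4, endpoint=False)

print(with_arange)
print(with_linspace)
print(without_endpoint)
```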

Prep 12.3: Mathematical Operations#

Try running the following code.

x = np.array([0.5, 1, 1.5, 2])
log_x = np.log(x)
print(log_x)

sqrt_x = np.sqrt(x)
print(sqrt_x)

exp_x = np.exp(x)
print(exp_x)

sin_x = np.sin(x)
print(sin_x)

abs_x = np.abs(x)
print(abs_x)

As you can see, the NumPy library provides functions for many standard mathematical operations. Note that np.log is the natural logarithm. The logarithm with base 10 is np.log10.
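To make the difference between the logarithm functions concrete, the sketch below applies np.log, np.log10 and np.log2 to the same made-up array.

```python
import numpy as np

x = np.array([1, 10, 100])

natural = np.log(x)      # natural logarithm (base e)
base_10 = np.log10(x)    # base-10 logarithm
base_2 = np.log2(x)      # base-2 logarithm

print(natural)
print(base_10)
print(base_2)
```

The base-10 result is approximately 0, 1 and 2, while the natural logarithm of the same numbers is larger by a factor of np.log(10).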

Run now this code.

grades = np.array([12, 10, 7, 10, 4, 7])
grades_sum = np.sum(grades)
grades_mean = np.mean(grades)
grades_std = np.std(grades)
print(grades_sum, grades_mean, grades_std)

Note that, given an array \(x_1, x_2, ... , x_N\), NumPy std() returns \(\mathrm{std}(x) = \sqrt{\frac{1}{N} \sum_i \left(x_i - \mathrm{mean}(x)\right)^2}\), which is the population standard deviation. In some cases you might want the sample standard deviation instead, where the denominator is \(N-1\) instead of \(N\).

Many NumPy functions are also available as methods of the array object. Compare the code block above with the code block below.

grades = np.array([12, 10, 7, 10, 4, 7])
grades_sum = grades.sum()
grades_mean = grades.mean()
grades_std = grades.std()
print(grades_sum, grades_mean, grades_std)

You can choose whether to use the function or the method, and you will see both in the code written by others.
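Returning to the standard deviation: if you need the sample standard deviation (denominator \(N-1\)), one option is the ddof parameter, which both np.std() and the std() method accept. A minimal sketch:

```python
import numpy as np

grades = np.array([12, 10, 7, 10, 4, 7])

population_std = np.std(grades)        # denominator N (the default, ddof=0)
sample_std = np.std(grades, ddof=1)    # denominator N - 1
sample_std_method = grades.std(ddof=1) # same computation as a method call

print(population_std, sample_std, sample_std_method)
```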

Look now at the following code, and try to predict the output.

sales_2022 = np.array([3000, 0, 4000, 2000, 5000])
sales_2023 = np.array([3200, 1400, 4200, 1200, 4400])

print(sales_2022 == 2000)
print(sales_2023 > sales_2022)
print(sales_2023 > 2000)

You can see that comparison operators are overloaded in NumPy arrays. The result of the comparison is an array of boolean values.

Try now the following code.

sales_2022 = np.array([3000, 0, 4000, 2000, 5000])
sales_2023 = np.array([3200, 1400, 4200, 1200, 4400])

good_sale = (sales_2023 > sales_2022) & (sales_2023 > 2000)
print(good_sale)

The code above shows that operator & performs element-wise and. Similarly, | performs element-wise or, and ~ performs element-wise not. Modify the code above to use | and ~ operators.

If you perform mathematical operations on logical arrays, the value False is interpreted as 0 and True is interpreted as 1. Perhaps the most important use of this is when np.sum() counts the number of true values. For example, try printing np.sum(good_sale).
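The same idea gives the fraction of true values if you use np.mean() instead of np.sum(). The sketch below uses a different condition than good_sale; the names improved, num_improved and share_improved are only illustrative.

```python
import numpy as np

sales_2022 = np.array([3000, 0, 4000, 2000, 5000])
sales_2023 = np.array([3200, 1400, 4200, 1200, 4400])

improved = sales_2023 > sales_2022

num_improved = np.sum(improved)       # number of True values
share_improved = np.mean(improved)    # fraction of True values

print(num_improved, share_improved)
```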

Prep 12.4: Indexing#

Run the code below to see how you can access some elements of the NumPy array.

student_heights = np.array([155, 160, 165, 170, 175, 180, 185, 190, 195])

print(student_heights[0])
print(student_heights[3])
print(student_heights[-1])
print(student_heights[-3])
print(student_heights[2:5])
print(student_heights[1:-2:2])
print(student_heights[:-2])
print(student_heights[::2])

As you can see, NumPy arrays can be indexed and sliced just like lists and tuples.

But NumPy arrays can be indexed in more ways than lists. Look at the code below and try to predict the output.

student_heights = np.array([171, 162, 187, 195, 157, 169, 175, 164, 168])
indices = [0, 4, -1, 6]
print(student_heights[indices])

As you can see, you can use a list of indices to access elements of the array. Test whether the code above would work if:

  • student_heights was a list, for example student_heights = [171, 162, 187, 195, 157, 169, 175, 164, 168]?

  • indices was a NumPy array of integers, for example indices = np.array([0, 4, -1, 6])?

  • an element of indices was a float, for example indices = [1, 2.0]?

  • an element of indices was larger than the largest index of the array, for example indices = [1, 2, 9]?

  • elements of indices contained repetitions, for example indices = [1, 2, 1, 2, 1, 2]?

There is yet another way to index NumPy arrays. Look at the code below and try to predict the output.

student_heights = np.array([171, 162, 187, 195, 157, 169, 175, 164, 168])
is_female = [True, True, False, False, True, False, True, True, False]
female_heights = student_heights[is_female]
print(female_heights)

This is called logical indexing. Test whether logical indexing would work if:

  • student_heights was a list, for example student_heights = [171, 162, 187, 195, 157, 169, 175, 164, 168]?

  • is_female was a NumPy array of booleans, for example is_female = np.array([True, True, False, False, True, False, True, True, False])?

  • elements of is_female were integers 0 and 1, for example is_female = [1, 1, 0, 0, 1, 0, 1, 1, 0]?

  • the number of elements in is_female was different from the number of elements in student_heights, for example is_female = [True, False] or is_female = 5 * [True, False]?

Logical indexing can be used to access the elements of an array that satisfy some condition. For example, say you want to compute the average of a list of numbers, but you want to exclude the numbers smaller than 10. First consider how you would accomplish this without NumPy. Then check the code below to see the NumPy solution.

a = np.array([18.9, 12.8, 9.7, 6.8, 15.2, 17.5, 13.7, 11.1])
result = (a[a >= 10]).mean()
print(result)

As you can see, NumPy allows you to write very compact code, but you still need to understand how the code is constructed. For the code above, answer the following questions. Add print statements to the code to check your answers.

  • What is the type of a? What is the dtype of a? How many elements does a have?

  • What is the type of a >= 10? What is the dtype of a >= 10? How many elements does a >= 10 have?

  • What is the type of a[a >= 10]? What is the dtype of a[a >= 10]? How many elements does a[a >= 10] have?

  • What is the type of a[a >= 10].mean()? What is the dtype of a[a >= 10].mean()? How many elements does a[a >= 10].mean() have?
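For comparison with the compact expression a[a >= 10].mean(), a version without NumPy might look like the sketch below. This is only one possible way to write it; the names total and count are illustrative.

```python
a = [18.9, 12.8, 9.7, 6.8, 15.2, 17.5, 13.7, 11.1]

total = 0
count = 0
for value in a:
    if value >= 10:        # keep only the numbers that are at least 10
        total += value
        count += 1

result = total / count
print(result)
```

The loop does in several lines what the logical indexing and the mean() method do in one.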

Prep 12.5: Mutability and Preallocation#

Let’s check whether NumPy arrays are mutable. Try to run the code below.

numbers = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
numbers[5] = 1000
print(numbers)
test = numbers
test[0:3] = [190, 188, 159]
print(numbers)

As you can see, you can change the NumPy array after it has been created. This means that NumPy arrays are mutable.

Let’s check whether you can also change the data type of a NumPy array. Try to change an element of numbers to a float, for example numbers[7] = 3.14. What happens?

As you can see, you can only change the values, but not the data type of the array.
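The sketch below illustrates this with a short integer array and shows one possible workaround: converting to a float array with astype() before assigning. The name numbers_float is only illustrative.

```python
import numpy as np

numbers = np.array([1, 2, 3, 4, 5])
numbers[0] = 3.14            # integer array, so the fractional part is lost
print(numbers, numbers.dtype)

# To keep the decimals, convert to a float array first
numbers_float = numbers.astype(np.float64)
numbers_float[1] = 3.14
print(numbers_float, numbers_float.dtype)
```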

Look at the code below to see what you can do, if you need to keep both the original and the modified array.

heights = np.array([160, 175, 184, 159])
old_heights = heights.copy() 

heights[2] = 190
heights[3] = 187

print('The updated heights are', heights)
print('The old heights were', old_heights)

Another important thing to remember is that you cannot change the size of the NumPy array, after it has been created.

When working with lists we have often used the append method to add elements to the list, for example in a loop. Appending to a list does not create a new list, but appending to a NumPy array would always create a new array.

Therefore, if you know that you need to populate the NumPy array in a loop, it is a good idea to create an array of the correct size before the loop, and then change its values in the loop. This is called preallocation.

Try to run the code below to see how this works.

value_0 = 15
num_steps = 30
values = np.empty(num_steps)
for i in range(num_steps):
    values[i] = value_0
    value_0 *= 0.9

For the code above, answer the questions. Add the print statements to the code, such that you can check your answers.

  • What is the type of values before the loop? What is the dtype of values before the loop? How many elements does values have before the loop? What are the values of elements in values before the loop?

  • What is the type of values after the loop? What is the dtype of values after the loop? How many elements does values have after the loop? What are the values of elements in values after the loop?
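To see the difference between appending and preallocation side by side, the sketch below fills an array with square numbers in both ways. It uses the function np.append and, as an arbitrary choice for this illustration, np.zeros instead of np.empty so that the initial values are well defined.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.append(a, 4)      # returns a new array, a is left unchanged
print(a, b)

num_steps = 5

# Growing an array with np.append: every call builds a new array
grown = np.array([])
for i in range(num_steps):
    grown = np.append(grown, i ** 2)

# Preallocation: one array of the final size, filled in place
filled = np.zeros(num_steps)
for i in range(num_steps):
    filled[i] = i ** 2

print(grown)
print(filled)
```

Both loops give the same values, but the first one creates a new array in every iteration.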

Prep 12.6: 2D Arrays#

NumPy arrays can be multidimensional. Below, we define a 2D array (matrix) from a nested list. The elements in the precipitation array are monthly precipitation in mm for Copenhagen, Bergen, Glasgow, and Cologne.

precipitation = np.array([[49, 39, 32, 38, 40, 47, 71, 66, 62, 59, 48, 49],
                [179, 139, 109, 140, 83, 126, 141, 167, 228, 236, 207, 203],
                [111, 85, 69, 67, 63, 70, 97, 93, 102, 119, 106, 127],
                [56, 51, 40, 52, 55, 79, 66, 83, 58, 54, 59, 55]])

Indexing a 2D array is similar to indexing a 1D array, but you need to provide two indices. Try to run the code to see how this works, assuming you have the precipitation array from the previous code block.

print(precipitation[0, :])  # First row
print(precipitation[2, :])  # Third row
print(precipitation[:, 3])  # Fourth column
print(precipitation[3, :6])  # First six elements of the fourth row
print(precipitation[2, 4:])  # Elements from the fifth to the last of the third row
print(precipitation[2, 5])  # Element in the third row and sixth column

Run this code.

print(precipitation.size)
print(precipitation.shape)
print(precipitation.ndim)

The attribute ndim returns the number of dimensions of the array. The attribute shape returns the dimensions of the array. The first element is the number of rows, and the second element is the number of columns. The attribute size returns the total number of elements in the array.

Look now at the following code.

cities = ['Copenhagen', 'Bergen', 'Glasgow', 'Cologne']
yearly_precipitation = precipitation.sum(axis=1)

print('YEARLY PRECIPITATION')
for i in range(len(cities)):
    print(f'{cities[i]:10} {yearly_precipitation[i]:4} mm')   

Print yearly_precipitation. What is its type, data type and size?

Try computing precipitation.sum(). What is the result? What is the type, data type and size of the result?

Try computing precipitation.sum(axis=0). What is the result? What is the type, data type and size of the result?

When working with 2D arrays, many of the NumPy functions can be applied to the entire array, or to a specific axis. For example, you can compute the sum of all elements in the array, or the sum of elements in each row or column. Here, axis=0 means that the function is applied to each column, and axis=1 means that the function is applied to each row.
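The meaning of the axis argument is easier to check on a small array where you can do the sums by hand. The array m below is made up for illustration.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

total = m.sum()               # sum of all elements
column_sums = m.sum(axis=0)   # one sum per column
row_sums = m.sum(axis=1)      # one sum per row

print(total)
print(column_sums)
print(row_sums)
```

The array has 2 rows and 3 columns, so axis=0 gives 3 values and axis=1 gives 2 values.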

Here is another similar example.

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 
        'Sep', 'Oct', 'Nov', 'Dec']
months_array = np.array(months)
min_val = precipitation.min(axis=1)
max_val = precipitation.max(axis=1)
min_mon = months_array[precipitation.argmin(axis=1)]
max_mon = months_array[precipitation.argmax(axis=1)]

print('PRECIPITATION RANGE')
for i in range(len(cities)):
    print(f'{cities[i]}: {min_val[i]} ({min_mon[i]}) - {max_val[i]} ({max_mon[i]}) mm')   

It turns out that Copenhagen isn’t that bad when it comes to the amount of rain!

Add the print statements to the code above and print min_val, precipitation.argmin(axis=1), and min_mon.
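The pattern of using argmin() or argmax() to look up a label can also be tried on a 1D array. The sketch below uses made-up data (days and rain are not part of the precipitation example).

```python
import numpy as np

days = np.array(['Mon', 'Tue', 'Wed', 'Thu'])
rain = np.array([4.0, 0.5, 7.2, 1.1])

i_min = rain.argmin()     # position of the smallest value
i_max = rain.argmax()     # position of the largest value

print(days[i_min], rain[i_min])
print(days[i_max], rain[i_max])
```

Note that argmin() and argmax() return positions, not values, which is why they can be used as indices into another array of the same length.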

Prep 12.7: Matplotlib#

Matplotlib is a library for visualization of data. We will only show some basic plots: scatter plots and line plots. When you run Python code as a script, the plots will be displayed in a separate window. For all exercises you need to import the functions from Matplotlib using import matplotlib.pyplot as plt.

Let’s look back at Problem 4.11: Fish population, where you were given the model for the fish population growth

\[ N_{\text{new}} = N_{\text{old}} + 0.25 N_{\text{old}} - 1000, \]

and you were asked to print the fish population over 10 years, starting with 15000 fish. Below is a solution to the problem (we now use f-strings, which you did not know about when you solved the problem in week 4).

population = 15000
for y in range(1, 10):
    population = population + 0.25 * population - 1000
    print(f'Year {y}: population {population:.2f}')

We’ll now change the code to plot the fish population instead of printing it. Try to run the code below.

import matplotlib.pyplot as plt
population = 15000
for y in range(1, 10):
    population = population + 0.25 * population - 1000
    plt.plot(y, population, 'bo')   
plt.show()

With very small changes, we have created a visualization of the fish population. The three lines of code you changed do the following:

  • Import the Matplotlib library.

  • Plot one point in each iteration of the loop. The argument bo means that the points are blue circles. The point will be added to the current plot, and if there is no plot, a new one will be created.

  • Show the plot when we have finished adding to it.

In Matplotlib you can change the size of the plot window, the font and text size in the title and axis labels, the size, color and shape of the markers, which numbers are displayed on the axes, and many other things. There is no reason to try to remember all the options. You can always look them up when you need them.
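As an illustration of a few such options, the sketch below stores the fish population in a preallocated NumPy array and plots all points with a single call to plt.plot. This is one possible variant, not a replacement for the loop version; the names num_years, years and populations are introduced here for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

num_years = 9
years = np.arange(1, num_years + 1)
populations = np.empty(num_years)      # preallocated array

population = 15000
for i in range(num_years):
    population = population + 0.25 * population - 1000
    populations[i] = population

plt.plot(years, populations, 'bo-')    # blue circles joined by a line
plt.xlabel('Year')
plt.ylabel('Population')
plt.title('Fish population')
plt.show()
```

The format string 'bo-' combines the color (b), the marker (o) and the line style (-).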

For the next plot, you will need the precipitation array from earlier. Let us make a line plot with a few more options specified, comparing precipitation in the 4 cities.

cities = ['Copenhagen', 'Bergen', 'Glasgow', 'Cologne']
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 
        'Sep', 'Oct', 'Nov', 'Dec']

plt.plot(months, precipitation[0], color='red')
plt.plot(months, precipitation[1], color='blue')
plt.plot(months, precipitation[2], color='green')
plt.plot(months, precipitation[3], color='orange')
plt.ylabel('Monthly precipitation in mm')
plt.xlabel('Month')
plt.legend(cities, loc='upper left')
plt.title('Precipitation comparison')
plt.ylim([0, 250])
plt.grid()
plt.show()

Finally, let’s use NumPy and Matplotlib to plot the natural logarithm, square root and sine function in the interval from 0.1 to 10.

x = np.linspace(0.1, 10, 100)
plt.plot(x, np.log(x))
plt.plot(x, np.sqrt(x))
plt.plot(x, np.sin(x))
plt.legend(['log(x)', 'sqrt(x)', 'sin(x)'])
plt.xlabel('x')
plt.grid()
plt.show()

Self quiz#

Assume that you have imported NumPy and Matplotlib as follows:

import numpy as np
import matplotlib.pyplot as plt

Question 12.1#

Which of the following statements is correct?

Question 12.2#

What is the type of heights in the following code?

heights = np.array([156, 167, 178])

Question 12.3#

What is the dtype attribute of heights in the following code?

heights = np.array([156, 167, 178])

Question 12.4#

What is printed by the following code?

a = np.array([12, 14.5, 10])
print(a)

Question 12.5#

What is printed by the following code?

heights = np.array([156, 167, 178])
print(heights + 100)

Question 12.6#

What is printed by the following code?

x = np.array([-2, -1, 0, 1, 2])
y = x**2 + 1
print(y)

Question 12.7#

What is printed by the following code?

values = np.array([10, 20, 30, 40, 50, 60])
indices = [1, 3, 5]
result = values[indices]
print(result)

Question 12.8#

What is printed by the following code?

values = [10, 20, 30, 40, 50, 60]
indices = np.array([1, 3, 5])
result = values[indices]
print(result)

Question 12.9#

What is stored in a after running the following code?

a_bool = np.array([True, False])
a_int = np.array([0, 1])
a = a_int[a_bool]

Question 12.10#

What is the value of N after running the following code?

a = np.array([-2, -2, 4, 3])
N = np.sum(a > 0)

Question 12.11#

Which of these does not return a NumPy array?

Question 12.12#

An array is given by a = np.array([172, -99, 163, -99, 179]). We want to create a new array b containing only positive values from a. How can we do this?

Question 12.13#

Which line will define y with the same value as the following code?

y = []
for i in range(10):
    y.append(i**2)
y = np.array(y)

Question 12.14#

A variable is defined by a = np.ones(4). What is an equivalent way of defining a?

Question 12.15#

What is printed after running the following code?

a = np.array([1, 1, 1])
b = np.mean(a)
print(b)

Question 12.16#

What is stored in valid after running the following code?

values = np.array([-4, 6, -2, 12, 16, 8])
valid = (values > 0) & (values < 10)

Question 12.17#

What is stored in b after running the following code?

a = np.array([1, 2, 3, 4])
b = a
a[1] = 100

Question 12.18#

What is printed when running the following code?

m = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print(m[:, 0])

Question 12.19#

What most accurately describes the plot produced by the following code?

for i in range(5):
    plt.plot(i, i, 'o')   
plt.show()

Question 12.20#

What most accurately describes the plot produced by the following code?

x = np.linspace(-2, 2)
y = x**2
plt.plot(x, y)
plt.show()