# NumPy - A to Z

By: David Hoese (@djhoese)

**https://gitlab.ssec.wisc.edu/ssec-dev/numpy-atoz**

# Why NumPy?

Lists are cool, but not for large scientific data.

Python's built-in container types (lists, tuples, dicts, and sets) are very flexible, but they don't store large numbers of elements efficiently. Everything in Python, even an integer, is a complete Python object, so it takes up much more memory and is harder to process efficiently than its C-level counterpart. On top of that, it isn't easy to perform basic arithmetic on the numbers in these containers.

In [None]:
my_arr = [1, 2, 3]
my_arr

In [None]:
result = []
for val in my_arr:
    result.append(val + 5)
result

# NumPy Arrays

NumPy arrays are memory efficient, easy to use, and fast at calculations.

In [3]:
import numpy as np
my_arr = np.array([1, 2, 3])
my_arr + 5

array([6, 7, 8])
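As a rough illustration of the memory claim, the sketch below compares a Python list of integers with a NumPy array holding the same values. The variable names are made up for this example, and the exact size of Python objects depends on the interpreter and platform, so only the array's byte count is stated.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# The list stores pointers to separate Python int objects,
# so count the list itself plus every element it references.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)

# The array stores raw 8-byte integers in one block of memory.
array_bytes = np_arr.nbytes  # 1000 elements * 8 bytes = 8000

print(list_bytes, array_bytes)
```

On CPython the list total comes out several times larger than the array's 8000 bytes.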

# Useful Array Attributes

In [4]:
my_arr = np.zeros((10, 5))
my_arr.shape # 10 rows, 5 columns

(10, 5)

In [5]:
my_arr.ndim

2

In [6]:
my_arr.dtype

dtype('float64')

In [7]:
my_arr.size # number of elements

50

In [8]:
my_arr.nbytes # total number of bytes for all data

400

In [9]:
my_arr.T # transpose

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [10]:
my_arr.T.shape

(5, 10)

# Creating Arrays

All of the array creation functions can be found in NumPy's documentation [here](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.array-creation.html).

In [11]:
np.array([1, 2, 3])

array([1, 2, 3])

In [12]:
np.array([1.2, 2.2, 3.2])

array([1.2, 2.2, 3.2])

In [13]:
np.array([1, 2, 3], dtype=np.float32)

array([1., 2., 3.], dtype=float32)

In [14]:
np.array([-1, 0, 1, 2, 3, 4], dtype=bool)  # the builtin bool works across NumPy versions, unlike the np.bool alias

array([ True, False, True, True, True, True])

In [15]:
np.zeros((2, 10))

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [16]:
np.ones((2, 10))

array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
 [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

In [17]:
np.empty((2, 10))  # uninitialized: contents are whatever was already in memory

array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
 [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

In [18]:
np.full((2, 10), 2.35)

array([[2.35, 2.35, 2.35, 2.35, 2.35, 2.35, 2.35, 2.35, 2.35, 2.35],
 [2.35, 2.35, 2.35, 2.35, 2.35, 2.35, 2.35, 2.35, 2.35, 2.35]])

In [19]:
np.full((2, 10), np.nan)

array([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
 [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]])

In [20]:
my_arr.shape

(10, 5)

In [21]:
np.ones_like(my_arr)

array([[1., 1., 1., 1., 1.],
 [1., 1., 1., 1., 1.],
 [1., 1., 1., 1., 1.],
 [1., 1., 1., 1., 1.],
 [1., 1., 1., 1., 1.],
 [1., 1., 1., 1., 1.],
 [1., 1., 1., 1., 1.],
 [1., 1., 1., 1., 1.],
 [1., 1., 1., 1., 1.],
 [1., 1., 1., 1., 1.]])

In [22]:
np.eye(5)

array([[1., 0., 0., 0., 0.],
 [0., 1., 0., 0., 0.],
 [0., 0., 1., 0., 0.],
 [0., 0., 0., 1., 0.],
 [0., 0., 0., 0., 1.]])

In [23]:
np.eye(5, 10, k=1)

array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.]])

In [24]:
np.arange(5)

array([0, 1, 2, 3, 4])

In [25]:
np.arange(5.0)

array([0., 1., 2., 3., 4.])

In [26]:
np.arange(5.0, 8.0)

array([5., 6., 7.])

In [27]:
np.arange(5.0, 25.0, 5.2)

array([ 5. , 10.2, 15.4, 20.6])

In [28]:
np.arange(8.0, 5.0)

array([], dtype=float64)

In [29]:
np.arange(8.0, 5.0, -1.2)

array([8. , 6.8, 5.6])

In [30]:
np.linspace(5, 16, 25)

array([ 5. , 5.45833333, 5.91666667, 6.375 , 6.83333333,
 7.29166667, 7.75 , 8.20833333, 8.66666667, 9.125 ,
 9.58333333, 10.04166667, 10.5 , 10.95833333, 11.41666667,
 11.875 , 12.33333333, 12.79166667, 13.25 , 13.70833333,
 14.16666667, 14.625 , 15.08333333, 15.54166667, 16. ])

In [31]:
np.linspace(5, 16, 25, endpoint=False)

array([ 5. , 5.44, 5.88, 6.32, 6.76, 7.2 , 7.64, 8.08, 8.52,
 8.96, 9.4 , 9.84, 10.28, 10.72, 11.16, 11.6 , 12.04, 12.48,
 12.92, 13.36, 13.8 , 14.24, 14.68, 15.12, 15.56])

In [32]:
np.linspace(5, 16, 25, retstep=True)

(array([ 5. , 5.45833333, 5.91666667, 6.375 , 6.83333333,
 7.29166667, 7.75 , 8.20833333, 8.66666667, 9.125 ,
 9.58333333, 10.04166667, 10.5 , 10.95833333, 11.41666667,
 11.875 , 12.33333333, 12.79166667, 13.25 , 13.70833333,
 14.16666667, 14.625 , 15.08333333, 15.54166667, 16. ]),
 0.4583333333333333)

In [33]:
np.logspace(2, 5, num=12)

array([ 100. , 187.38174229, 351.11917342, 657.93322466,
 1232.84673944, 2310.12970008, 4328.76128108, 8111.3083079 ,
 15199.11082953, 28480.35868436, 53366.99231206, 100000. ])

In [34]:
np.logspace(2, 5, num=25, base=2)

array([ 4. , 4.36203093, 4.75682846, 5.18735822, 5.65685425,
 6.1688433 , 6.72717132, 7.33603235, 8. , 8.72406186,
 9.51365692, 10.37471644, 11.3137085 , 12.3376866 , 13.45434264,
 14.67206469, 16. , 17.44812372, 19.02731384, 20.74943287,
 22.627417 , 24.67537321, 26.90868529, 29.34412938, 32. ])

In [35]:
mg = np.meshgrid(np.arange(5, 8), np.arange(12, 16))
mg[0]

array([[5, 6, 7],
 [5, 6, 7],
 [5, 6, 7],
 [5, 6, 7]])

In [36]:
mg[1]

array([[12, 12, 12],
 [13, 13, 13],
 [14, 14, 14],
 [15, 15, 15]])

In [37]:
# without creating temporary arrays
np.mgrid[12:16, 5:8]

array([[[12, 12, 12],
 [13, 13, 13],
 [14, 14, 14],
 [15, 15, 15]],

 [[ 5, 6, 7],
 [ 5, 6, 7],
 [ 5, 6, 7],
 [ 5, 6, 7]]])

In [38]:
np.repeat(my_arr, 2, axis=1).shape

(10, 10)

In [39]:
np.arange(10).reshape((2, 5))

array([[0, 1, 2, 3, 4],
 [5, 6, 7, 8, 9]])

In [40]:
a = np.arange(4).reshape((2, 2))
np.tile(a, (2, 3))

array([[0, 1, 0, 1, 0, 1],
 [2, 3, 2, 3, 2, 3],
 [0, 1, 0, 1, 0, 1],
 [2, 3, 2, 3, 2, 3]])

In [41]:
np.random.random((2, 2)) # alias for np.random.random_sample
# Similar: np.random.rand(2, 2)

array([[0.79931986, 0.74795771],
 [0.12213411, 0.79055517]])

In [42]:
np.random.randint(5, 10)

5

In [43]:
np.random.randint(5, 15, size=(2, 3))

array([[ 8, 7, 6],
 [12, 14, 6]])

## Flatten a multidimensional array

`.flatten` always returns a copy.

In [44]:
a = np.zeros((2, 2))
a.flatten()

array([0., 0., 0., 0.])

Get a flattened version of the array that is contiguous in memory. If the array is already contiguous, the result is a "view" and no data is copied.

In [45]:
a.ravel()

array([0., 0., 0., 0.])

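Get a flattened array that is a view whenever possible, even if the result is not contiguous in memory. A copy is made only when no view can represent the new shape.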

In [46]:
a.reshape((-1,))

array([0., 0., 0., 0.])
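The difference between the three approaches is easier to see by checking whether the result shares memory with the original. This is a small sketch using `np.shares_memory`; the arrays are made up for illustration.

```python
import numpy as np

a = np.arange(6).reshape((2, 3))

# flatten: always a copy
print(np.shares_memory(a, a.flatten()))  # False

# ravel: a view when the array is already C-contiguous ...
print(np.shares_memory(a, a.ravel()))    # True
# ... but a copy when it is not (the transpose is not C-contiguous)
print(np.shares_memory(a, a.T.ravel()))  # False

# reshape(-1): a strided 1-D slice stays a view and stays non-contiguous
s = np.arange(10)[::2]
r = s.reshape(-1)
print(np.shares_memory(s, r), r.flags['C_CONTIGUOUS'])  # True False
```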

Flatten one or more dimensions:

In [47]:
a = np.zeros((3, 4, 5, 6))
a.reshape((3, -1, 6)).shape

(3, 20, 6)

In [48]:
a.flags

 C_CONTIGUOUS : True
 F_CONTIGUOUS : False
 OWNDATA : True
 WRITEABLE : True
 ALIGNED : True
 WRITEBACKIFCOPY : False
 UPDATEIFCOPY : False

In [49]:
a[::2].flags

 C_CONTIGUOUS : False
 F_CONTIGUOUS : False
 OWNDATA : False
 WRITEABLE : True
 ALIGNED : True
 WRITEBACKIFCOPY : False
 UPDATEIFCOPY : False

In [50]:
np.asfortranarray(a).flags

 C_CONTIGUOUS : False
 F_CONTIGUOUS : True
 OWNDATA : True
 WRITEABLE : True
 ALIGNED : True
 WRITEBACKIFCOPY : False
 UPDATEIFCOPY : False

# Universal functions and Broadcasting

Universal functions (ufuncs) operate on arrays element by element. If the arrays are **not** the same shape, NumPy broadcasts them to a common shape, provided the shapes are compatible, and then applies the function. NumPy's available universal functions are documented [here](https://docs.scipy.org/doc/numpy/reference/ufuncs.html#available-ufuncs).

In [51]:
a = np.full((2, 10), 2)
b = np.arange(10)
print(a)
print(b)

[[2 2 2 2 2 2 2 2 2 2]
 [2 2 2 2 2 2 2 2 2 2]]
[0 1 2 3 4 5 6 7 8 9]


In [52]:
a + b

array([[ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
 [ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])

In [53]:
np.power(a, b)

array([[ 1, 2, 4, 8, 16, 32, 64, 128, 256, 512],
 [ 1, 2, 4, 8, 16, 32, 64, 128, 256, 512]])
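Broadcasting also works when both arrays need to be stretched. The sketch below, with made-up array names, adds a column of shape `(3, 1)` to a row of shape `(4,)` and shows what happens when shapes are incompatible.

```python
import numpy as np

col = np.arange(3)[:, np.newaxis]  # shape (3, 1)
row = np.arange(4)                 # shape (4,)

# Shapes are aligned from the right: (3, 1) and (4,) -> (3, 4).
# Axes of length 1 (or missing axes) are stretched to match.
total = col + row
print(total.shape)  # (3, 4)
print(total)

# Mismatched trailing axes (3 vs. 4) cannot be broadcast.
try:
    np.arange(3) + np.arange(4)
except ValueError as err:
    print("broadcast failed:", err)
```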

NumPy also automatically casts data types when needed.

In [54]:
a = np.array([1, 2, 3], dtype=np.uint8)
b = np.array([2.2, 3.5, 4.6], dtype=np.float32)
a + b

array([3.2, 5.5, 7.6], dtype=float32)
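To see which dtype NumPy would pick without doing the arithmetic, `np.result_type` can be used. The sketch below also shows one case where no upcasting happens: two arrays of the same integer dtype keep that dtype, so values can wrap around.

```python
import numpy as np

print(np.result_type(np.uint8, np.float32))  # float32
print(np.result_type(np.int32, np.float32))  # float64
print(np.result_type(np.uint8, np.int8))     # int16

# Same dtype on both sides: the result stays uint8 and overflows silently.
x = np.array([250], dtype=np.uint8)
y = np.array([10], dtype=np.uint8)
wrapped = x + y
print(wrapped)  # [4], because 260 does not fit in 8 bits
```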

# Logical and Bitwise Functions

In [55]:
a = np.arange(5)
a 

array([0, 1, 2, 3, 4])

In [56]:
a > 2

array([False, False, False, True, True])

In [57]:
(a > 0) & (a < 3)

array([False, True, True, False, False])

In [58]:
np.logical_and(a > 0, a < 3)

array([False, True, True, False, False])

In [59]:
(a > 0) and (a < 3)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
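When a single True/False answer is what you actually want, for example in an `if` statement, reduce the boolean array with `.any()` or `.all()` as the error message suggests. A minimal sketch:

```python
import numpy as np

a = np.arange(5)
mask = (a > 0) & (a < 3)

print(mask.any())  # True: at least one element satisfies both conditions
print(mask.all())  # False: not every element does

if mask.any():
    print("found", mask.sum(), "matching elements")
```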

In [60]:
b = np.arange(3, 8)
a & b

array([0, 0, 0, 2, 4])

In [61]:
a | b

array([3, 5, 7, 7, 7])

In [62]:
a ^ b

array([3, 5, 7, 5, 3])

In [63]:
a == b

array([False, False, False, False, False])

In [64]:
a != b

array([ True, True, True, True, True])

In [65]:
c = a > 2
c

array([False, False, False, True, True])

In [66]:
~c

array([ True, True, True, False, False])

# Indexing and Slicing

In [67]:
a = np.zeros((5, 10))

In [68]:
a[0]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [69]:
a[1, 2] # index-1 row, index-2 column

0.0

In [70]:
a[:, 0]

array([0., 0., 0., 0., 0.])

In [71]:
a[0:2]

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [72]:
a[1:5:2] # every other

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [73]:
a[:, :]

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
 [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [74]:
a[:, np.newaxis, :, np.newaxis].shape

(5, 1, 10, 1)

In [75]:
np.newaxis is None

True

Useful if you need an array to have a certain number of dimensions.

In [76]:
np.repeat(np.arange(10), 5)

array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4,
 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8,
 8, 9, 9, 9, 9, 9])

In [77]:
np.repeat(np.arange(10)[:, np.newaxis], 5, axis=1)

array([[0, 0, 0, 0, 0],
 [1, 1, 1, 1, 1],
 [2, 2, 2, 2, 2],
 [3, 3, 3, 3, 3],
 [4, 4, 4, 4, 4],
 [5, 5, 5, 5, 5],
 [6, 6, 6, 6, 6],
 [7, 7, 7, 7, 7],
 [8, 8, 8, 8, 8],
 [9, 9, 9, 9, 9]])

What if I don't know how many dimensions there are?

In [78]:
a = np.zeros((3, 4, 5, 6)) # 4 dimensions
a[:, :, :, 0].shape

(3, 4, 5)

In [79]:
a[..., 0].shape

(3, 4, 5)

In [80]:
# '...' maps to the Ellipsis object
Ellipsis # this is a builtin python object

Ellipsis
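This makes it possible to write functions that work for any number of dimensions. The helper below is hypothetical and only meant to show the idea.

```python
import numpy as np

def last_axis_first_element(arr):
    """Hypothetical helper: take index 0 along the last axis of any array."""
    return arr[..., 0]

print(last_axis_first_element(np.zeros((6,))).shape)          # ()
print(last_axis_first_element(np.zeros((5, 6))).shape)        # (5,)
print(last_axis_first_element(np.zeros((3, 4, 5, 6))).shape)  # (3, 4, 5)
```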

# Boolean Masks and Index Arrays

In [81]:
a = np.arange(40).reshape((5, 8))
b = np.random.randint(5, 17, size=(5, 8))

In [82]:
b > 12

array([[False, True, False, False, False, True, True, False],
 [False, False, True, False, True, True, False, False],
 [False, False, False, False, False, False, False, False],
 [ True, True, False, True, False, False, False, False],
 [ True, False, False, True, False, True, False, False]])

In [83]:
a[b > 12]

array([ 1, 5, 6, 10, 12, 13, 24, 25, 27, 32, 35, 37])

In [84]:
a[b > 12] * 2

array([ 2, 10, 12, 20, 24, 26, 48, 50, 54, 64, 70, 74])

In [85]:
c = a.copy()
c[b > 12] = 0
c

array([[ 0, 0, 2, 3, 4, 0, 0, 7],
 [ 8, 9, 0, 11, 0, 0, 14, 15],
 [16, 17, 18, 19, 20, 21, 22, 23],
 [ 0, 0, 26, 0, 28, 29, 30, 31],
 [ 0, 33, 34, 0, 36, 0, 38, 39]])

In [86]:
# <condition>, <x>, <y>
# result = <x> if <condition> else <y>
np.where(b > 12, a, a + 100)

array([[100, 1, 102, 103, 104, 5, 6, 107],
 [108, 109, 10, 111, 12, 13, 114, 115],
 [116, 117, 118, 119, 120, 121, 122, 123],
 [ 24, 25, 126, 27, 128, 129, 130, 131],
 [ 32, 133, 134, 35, 136, 37, 138, 139]])

What if we wanted the **locations** of where things are True?

In [87]:
b

array([[ 9, 16, 7, 12, 10, 16, 13, 9],
 [ 8, 11, 15, 7, 14, 15, 9, 12],
 [10, 12, 9, 5, 5, 5, 8, 9],
 [14, 14, 10, 15, 11, 8, 6, 8],
 [16, 6, 11, 14, 10, 13, 11, 6]])

In [88]:
idx = np.nonzero(b > 12)
idx

(array([0, 0, 0, 1, 1, 1, 3, 3, 3, 4, 4, 4]),
 array([1, 5, 6, 2, 4, 5, 0, 1, 3, 0, 3, 5]))

In [89]:
a[idx]

array([ 1, 5, 6, 10, 12, 13, 24, 25, 27, 32, 35, 37])
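`np.nonzero` returns one index array per dimension. If (row, column) pairs are easier to work with, the index arrays can be zipped together, or `np.argwhere` can be used instead. A small sketch with a fixed array so the output is reproducible:

```python
import numpy as np

b = np.array([[9, 16, 7],
              [14, 8, 13]])

rows, cols = np.nonzero(b > 12)
# Pair up the per-axis index arrays into (row, column) coordinates.
coords = list(zip(rows.tolist(), cols.tolist()))
print(coords)  # [(0, 1), (1, 0), (1, 2)]

# np.argwhere returns the same coordinates as one (N, 2) array.
pairs = np.argwhere(b > 12)
print(pairs)
```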

# Tons of Functions

* np.sum
* np.clip
* np.mean
* np.std
* np.min
* np.max
* np.argmin
* np.argmax

In [90]:
a.sum()

780

In [91]:
np.sum(a)

780

In [92]:
np.clip(a, 10, 30)

array([[10, 10, 10, 10, 10, 10, 10, 10],
 [10, 10, 10, 11, 12, 13, 14, 15],
 [16, 17, 18, 19, 20, 21, 22, 23],
 [24, 25, 26, 27, 28, 29, 30, 30],
 [30, 30, 30, 30, 30, 30, 30, 30]])

In [93]:
np.clip(a, 10, None)

array([[10, 10, 10, 10, 10, 10, 10, 10],
 [10, 10, 10, 11, 12, 13, 14, 15],
 [16, 17, 18, 19, 20, 21, 22, 23],
 [24, 25, 26, 27, 28, 29, 30, 31],
 [32, 33, 34, 35, 36, 37, 38, 39]])

In [94]:
a.min()

0

In [95]:
a.min(axis=1)

array([ 0, 8, 16, 24, 32])

In [96]:
np.argmax(a) # index into the flattened array

39

In [97]:
np.unravel_index(np.argmax(a), a.shape)

(4, 7)
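Most of the functions listed above accept an `axis` argument in the same way `min` does. The sketch below rebuilds the same 5x8 array and applies a few of the functions that were not shown.

```python
import numpy as np

a = np.arange(40).reshape((5, 8))

print(a.mean())              # 19.5
print(a.mean(axis=0))        # column means: [16. 17. 18. 19. 20. 21. 22. 23.]
print(a.std(axis=1))         # one standard deviation per row
print(np.argmin(a, axis=1))  # [0 0 0 0 0]: each row's minimum is in column 0
```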

# Masked Arrays

Use Masked Arrays to "hide" bad values in your data.

My Opinion: Don't use masked arrays unless you really have to. Use higher-level libraries like pandas and xarray where NaN (`np.nan`) is used as a "sentinel" value. Masked arrays use more memory, and calculations on them can be slower.

In [98]:
a = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
a

masked_array(data=[1, --, 3],
 mask=[False, True, False],
 fill_value=999999)

In [99]:
a + 1

masked_array(data=[2, --, 4],
 mask=[False, True, False],
 fill_value=999999)

In [100]:
a.data

array([1, 2, 3])

In [101]:
a.mask

array([False, True, False])

In [102]:
a.filled(0)

array([1, 0, 3])

In [103]:
a = np.arange(5)
np.ma.masked_where(a < 2, a)

masked_array(data=[--, --, 2, 3, 4],
 mask=[ True, True, False, False, False],
 fill_value=999999)
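For comparison, here is a sketch of the NaN-as-sentinel approach mentioned above, using plain NumPy. It assumes floating point data, since integer arrays cannot hold NaN, and it relies on the NaN-aware functions such as `np.nanmean`.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0])
bad = np.array([False, True, False])

# Masked array: the bad value is skipped automatically.
masked = np.ma.masked_array(data, mask=bad)
print(masked.mean())  # 2.0

# NaN sentinel: replace bad values, then use the nan-aware functions.
with_nan = np.where(bad, np.nan, data)
print(np.nanmean(with_nan))  # 2.0
print(with_nan.mean())       # nan, the regular mean does not skip NaN
```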

# Saving flat binary files

If you want to save NumPy arrays for later use in the simplest format with no additional dependencies, write them as either a flat binary file or a .npy/.npz file.

To create a flat binary file from an array:

In [104]:
a = np.arange(5, dtype=np.uint8)
a.tofile('test_binary.dat')

To load data from a flat binary file:

In [105]:
a = np.fromfile('test_binary.dat', dtype=np.uint8)

Be careful. No dtype or shape information is recorded in the binary file. `fromfile` will read the file as a flat array.
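The sketch below shows what that means in practice for a 2D array: the reader has to supply both the dtype and the shape. The file name is hypothetical and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile
import numpy as np

original = np.arange(6, dtype=np.float32).reshape((2, 3))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.dat')  # hypothetical file name
    original.tofile(path)

    # Wrong dtype: the same 24 bytes are reinterpreted as 24 uint8 values.
    wrong = np.fromfile(path, dtype=np.uint8)
    # Right dtype, but the result is flat until it is reshaped.
    flat = np.fromfile(path, dtype=np.float32)

restored = flat.reshape((2, 3))
print(wrong.shape, flat.shape, restored.shape)  # (24,) (6,) (2, 3)
```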

An even better option than `fromfile` is to use `memmap`. Instead of loading all of the data into memory, a memory map keeps the data on disk and only reads what it needs (and other fancy OS-level stuff).

In [106]:
a = np.memmap('test_binary.dat', dtype=np.uint8, mode='r')

A better option is the `.npy` format, which records the dtype and shape information.

In [107]:
np.save('test.npy', a)

In [108]:
new_a = np.load('test.npy', mmap_mode='r')
new_a

memmap([0, 1, 2, 3, 4], dtype=uint8)

Note that the `mmap_mode` argument is optional. Without it, the data is loaded into memory instead of being accessed through a memory map.

## Save multiple arrays to disk

Although other file formats provide more features (HDF5, NetCDF4, zarr, etc.), NumPy also comes with a `.npz` format for storing multiple arrays at once. There is a `savez` function and a compressing version called `savez_compressed`.

In [109]:
a = np.arange(5, dtype=np.uint8)
b = np.arange(10, dtype=np.float32)
np.savez('test_multiple.npz', my_var1=a, my_var2=b)

In [110]:
from_disk = np.load('test_multiple.npz') # same load as before
from_disk

<numpy.lib.npyio.NpzFile at 0x7febaa57b908>

In [111]:
list(from_disk.keys())

['my_var1', 'my_var2']

In [112]:
from_disk['my_var1']

array([0, 1, 2, 3, 4], dtype=uint8)
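`savez_compressed` is called the same way as `savez`. The sketch below writes the same array with both functions and compares file sizes; the file names are hypothetical, and an all-zeros array is used because it compresses very well, so real data will usually shrink less.

```python
import os
import tempfile
import numpy as np

a = np.zeros(100_000, dtype=np.float32)  # highly compressible data

with tempfile.TemporaryDirectory() as tmp_dir:
    plain = os.path.join(tmp_dir, 'plain.npz')
    packed = os.path.join(tmp_dir, 'packed.npz')
    np.savez(plain, my_var1=a)
    np.savez_compressed(packed, my_var1=a)

    plain_size = os.path.getsize(plain)
    packed_size = os.path.getsize(packed)

    # Loading is identical for both variants.
    with np.load(packed) as from_disk:
        restored = from_disk['my_var1']

print(plain_size, packed_size)
```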

# Not Covered

* Record arrays (use pandas or xarray)
* Matrix
* dask
* pandas
* xarray
* zarr