DataFrames, which we'll explore in Chapter 3.
import numpy as np
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
x = np.zeros(4, dtype=int)
# Use a compound data type for structured arrays
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
                          'formats':('U10', 'i4', 'f8')})
print(data.dtype)
'U10' translates to "Unicode string of maximum length 10," 'i4' translates to "4-byte (i.e., 32-bit) integer," and 'f8' translates to "8-byte (i.e., 64-bit) float."
We'll discuss other options for these type codes in the following section.
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
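To see what the type codes above mean in terms of storage, here is a minimal sketch that inspects the same compound dtype. The variable names (`dt`, `record_size`, `names`, `truncated`) are just illustrative, and the name `'Bartholomew Smith'` is made up to show what happens when a string exceeds the declared length.

```python
import numpy as np

# Same compound dtype as in the text
dt = np.dtype({'names': ('name', 'age', 'weight'),
               'formats': ('U10', 'i4', 'f8')})

# Each field reports its own type and its size in bytes
for field in dt.names:
    print(field, dt[field], dt[field].itemsize)

# Bytes per record: 40 (ten 4-byte Unicode characters) + 4 + 8
record_size = dt.itemsize
print(record_size)

# 'U10' is a hard limit: longer strings are cut off on assignment
names = np.zeros(1, dtype='U10')
names[0] = 'Bartholomew Smith'   # hypothetical 17-character name
truncated = names[0]
print(truncated)
```

The truncation happens without any warning, so it is worth choosing the string length with the longest expected value in mind.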
# Get all names
data['name']
# Get first row of data
data[0]
# Get the name from the last row
data[-1]['name']
# Get names where age is under 30
data[data['age'] < 30]['name']
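The last expression above chains three steps, and it can help to see them one at a time. The following sketch rebuilds the same array so it runs on its own; the intermediate names (`mask`, `young`, `young_names`, `light_and_young`) are illustrative, and the combined weight condition is an extra example that does not appear in the text.

```python
import numpy as np

data = np.zeros(4, dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])
data['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
data['age'] = [25, 45, 37, 19]
data['weight'] = [55.0, 85.5, 68.0, 61.5]

mask = data['age'] < 30        # one boolean per record
young = data[mask]             # still a structured array, with all fields
young_names = young['name']    # plain array holding just the names
print(young_names)

# Conditions combine with & and |; each comparison needs its own parentheses
light_and_young = data[(data['age'] < 30) & (data['weight'] < 60)]['name']
print(light_and_young)
```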
DataFrame object, which is a structure built on NumPy arrays that offers a variety of useful data manipulation functionality similar to what we've shown here, as well as much, much more.
np.dtype({'names':('name', 'age', 'weight'),
          'formats':('U10', 'i4', 'f8')})
dtypes instead:
np.dtype({'names':('name', 'age', 'weight'),
          'formats':((np.str_, 10), int, np.float32)})
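# A compound type can also be specified as a list of tuples
np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])
# If the field names don't matter, give the types alone as a comma-separated string
np.dtype('S10,i4,f8')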
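As a quick check that these spellings describe the same thing, the sketch below builds one layout three ways and compares the results. It uses 'U10' throughout so the comparison is like for like, and the variable names are illustrative.

```python
import numpy as np

as_dict = np.dtype({'names': ('name', 'age', 'weight'),
                    'formats': ('U10', 'i4', 'f8')})
as_list = np.dtype([('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])
as_string = np.dtype('U10,i4,f8')

# The dictionary and list-of-tuples forms produce equal dtypes
same_layout = (as_dict == as_list)
print(same_layout)

# The comma-separated form has no names to work with, so NumPy supplies defaults
default_names = as_string.names
print(default_names)

# The storage layout is nonetheless identical in size
print(as_string.itemsize == as_dict.itemsize)
```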
< or >, which means "little endian" or "big endian," respectively, and specifies the byte-ordering convention for multi-byte values.
The next character specifies the type of data: characters, bytes, ints, floating points, and so on (see the table below).
The last character or characters represent the size of the object in bytes.

Character | Description | Example |
---|---|---|
'b' | Byte | np.dtype('b') |
'i' | Signed integer | np.dtype('i4') == np.int32 |
'u' | Unsigned integer | np.dtype('u1') == np.uint8 |
'f' | Floating point | np.dtype('f8') == np.float64 |
'c' | Complex floating point | np.dtype('c16') == np.complex128 |
'S', 'a' | Byte string | np.dtype('S5') |
'U' | Unicode string | np.dtype('U') == np.str_ |
'V' | Raw data (void) | np.dtype('V') == np.void |
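The type codes are easy to verify interactively. Here is a small sketch that compares a few codes against the corresponding NumPy scalar types and shows the effect of the optional byte-order prefix; the names `checks`, `matches`, `little`, and `big` are illustrative.

```python
import numpy as np

# A type character followed by a size in bytes
checks = {
    'i4': np.int32,
    'u1': np.uint8,
    'f8': np.float64,
    'c16': np.complex128,
}
matches = {code: np.dtype(code) == scalar for code, scalar in checks.items()}
print(matches)

# A leading < or > fixes the byte order in which each value is stored
little = np.array([1], dtype='<i4').tobytes()
big = np.array([1], dtype='>i4').tobytes()
print(little)
print(big)
```

The two byte strings hold the same integer; only the order of the four bytes differs.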
mat component consisting of a floating-point matrix:
tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])
X = np.zeros(1, dtype=tp)
print(X[0])
print(X['mat'][0])
X array consists of an id and a matrix.
Why would you use this rather than a simple multidimensional array, or perhaps a Python dictionary?
The reason is that this NumPy dtype directly maps onto a C structure definition, so the buffer containing the array content can be accessed directly within an appropriately written C program.
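To make the C mapping a little more concrete, the following sketch inspects the byte layout of the dtype defined above. The C struct shown in the comment is a hypothetical counterpart written for illustration, not something taken from a real library, and the names `offsets`, `record_size`, and `raw` are illustrative.

```python
import numpy as np

tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])

# Byte offset of each field within a single record
offsets = {name: tp.fields[name][1] for name in tp.names}
print(offsets)

# 8 bytes for id plus 9 * 8 bytes for the 3x3 matrix
record_size = tp.itemsize
print(record_size)

# A C struct with a matching layout might look like this (hypothetical):
#     struct record { int64_t id; double mat[3][3]; };
X = np.zeros(2, dtype=tp)
raw = X.tobytes()   # the records laid out back to back
print(len(raw))
```

With mixed field sizes a C compiler may insert padding between members, so it is worth comparing these offsets against the actual struct before sharing a buffer.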
If you find yourself writing a Python interface to a legacy C or Fortran library that manipulates structured data, you'll probably find structured arrays quite useful!
np.recarray class, which is almost identical to the structured arrays just described, but with one additional feature: fields can be accessed as attributes rather than as dictionary keys.
Recall that we previously accessed the ages by writing:
data['age']
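# View the same data as a record array; fields are now also attributes
data_rec = data.view(np.recarray)
data_rec.age
# Compare field-access times for the three spellings (IPython %timeit magic)
%timeit data['age']
%timeit data_rec['age']
%timeit data_rec.age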
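The `%timeit` lines above only work inside IPython or Jupyter. For a plain Python script, a rough equivalent can be sketched with the standard library's `timeit` module. The repetition count `n` is an arbitrary choice, and the absolute numbers will vary from machine to machine.

```python
import timeit
import numpy as np

data = np.zeros(4, dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])
data['age'] = [25, 45, 37, 19]
data_rec = data.view(np.recarray)

# Both access styles return the same values
same_values = np.array_equal(data['age'], data_rec.age)
print(same_values)

n = 10_000  # arbitrary number of repetitions
timings = {
    "data['age']": timeit.timeit(lambda: data['age'], number=n),
    "data_rec['age']": timeit.timeit(lambda: data_rec['age'], number=n),
    "data_rec.age": timeit.timeit(lambda: data_rec.age, number=n),
}
for label, seconds in timings.items():
    print(f"{label}: {seconds / n * 1e6:.3f} microseconds per access")
```

Record-array access typically comes out slower because of the extra attribute lookup, so whether the more convenient notation is worth it depends on how often the fields are read.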