Pandas includes the eval() and query() functions, which rely on the Numexpr package.
In this notebook we will walk through their use and give some rules of thumb about when you might think about using them.

query() and eval(): Compound Expressions

import numpy as np
rng = np.random.RandomState(42)
x = rng.rand(1000000)
y = rng.rand(1000000)
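# Vectorized addition of the two arrays
%timeit x + y
# The same addition done element by element in Python, which is much slower
%timeit np.fromiter((xi + yi for xi, yi in zip(x, y)), dtype=x.dtype, count=len(x))

# A compound expression
mask = (x > 0.5) & (y < 0.5)

# NumPy evaluates each subexpression separately, so the line above is roughly equivalent to:
tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask = tmp1 & tmp2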

Every intermediate step (here tmp1 and tmp2) is explicitly allocated in memory.
If the x and y arrays are very large, this can lead to significant memory and computational overhead.
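
To make the overhead concrete, here is a minimal sketch that measures how much memory the temporaries take for the mask computed above. The name temp_bytes is introduced only for this illustration; the figures apply to Boolean temporaries, which use one byte per element, and arithmetic temporaries of float64 values would be eight times larger.

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.rand(1_000_000)
y = rng.rand(1_000_000)

# Each subexpression produces a full-sized Boolean array
tmp1 = x > 0.5
tmp2 = y < 0.5
mask = tmp1 & tmp2

# Memory held by the two temporaries, in bytes (illustrative name)
temp_bytes = tmp1.nbytes + tmp2.nbytes
print(temp_bytes, mask.nbytes)
```

The temporaries together occupy twice the memory of the final result, and the cost grows with every additional subexpression.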
The Numexpr library gives you the ability to compute this type of compound expression element by element, without the need to allocate full intermediate arrays.
The Numexpr documentation has more details, but for the time being it is sufficient to say that the library accepts a string giving the NumPy-style expression you'd like to compute:

import numexpr
mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)')
np.allclose(mask, mask_numexpr)

The Pandas eval() and query() tools that we will discuss here are conceptually similar, and depend on the Numexpr package.

pandas.eval() for Efficient Operations

The eval() function in Pandas uses string expressions to efficiently compute operations using DataFrames.
For example, consider the following DataFrames:

import pandas as pd
nrows, ncols = 100000, 100
rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
                      for i in range(4))

To compute the sum of all four DataFrames using the typical Pandas approach, we can just write the sum:

%timeit df1 + df2 + df3 + df4

The same result can be computed via pd.eval by constructing the expression as a string:

%timeit pd.eval('df1 + df2 + df3 + df4')

The eval() version of this expression is about 50% faster (and uses much less memory), while giving the same result:

np.allclose(df1 + df2 + df3 + df4,
            pd.eval('df1 + df2 + df3 + df4'))
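
Because the speedup comes from Numexpr, it can help to know that pd.eval() also accepts an engine argument. The sketch below, which uses smaller DataFrames than the ones above to keep it quick, forces the plain Python engine. This gives the same values but should not be expected to give the same performance benefit.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df1, df2 = (pd.DataFrame(rng.rand(1000, 10)) for _ in range(2))

expected = df1 + df2

# engine='python' evaluates the expression without Numexpr
via_python = pd.eval('df1 + df2', engine='python')

print(np.allclose(expected, via_python))
```

When engine is left unspecified, Pandas picks Numexpr if it is installed and otherwise falls back to the Python engine.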

Operations Supported by pd.eval()

pd.eval() supports a wide range of operations.
To demonstrate these, we'll use the following integer DataFrames:

df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3)))
                           for i in range(5))

pd.eval() supports all arithmetic operators. For example:

result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')
np.allclose(result1, result2)

pd.eval() supports all comparison operators, including chained expressions:

result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
result2 = pd.eval('df1 < df2 <= df3 != df4')
np.allclose(result1, result2)
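
A chained comparison is easier to follow on a handful of values. The following sketch uses three tiny single-column DataFrames (a, b, and c are names made up for this illustration) and compares the chained string form with the explicit pairwise form.

```python
import pandas as pd

a = pd.DataFrame({'v': [1, 5, 3]})
b = pd.DataFrame({'v': [2, 4, 3]})
c = pd.DataFrame({'v': [2, 6, 1]})

# Chained form: each element must satisfy a < b and b <= c
chained = pd.eval('a < b <= c')

# Explicit form of the same test
explicit = (a < b) & (b <= c)

print(chained['v'].tolist())
print(explicit['v'].tolist())
```

Only the first row satisfies both conditions, so both forms mark it True and the remaining rows False.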

pd.eval() supports the & and | bitwise operators:

result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')
np.allclose(result1, result2)

In addition, it supports the use of the literal and and or in Boolean expressions:

result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')
np.allclose(result1, result3)

pd.eval() supports access to object attributes via the obj.attr syntax, and indexes via the obj[index] syntax:

result1 = df2.T[0] + df3.iloc[1]
result2 = pd.eval('df2.T[0] + df3.iloc[1]')
np.allclose(result1, result2)

More involved constructs, such as conditional statements and loops, are not implemented in pd.eval().
If you'd like to execute these more complicated types of expressions, you can use the Numexpr library itself.

DataFrame.eval() for Column-Wise Operations

Just as Pandas has a top-level pd.eval() function, DataFrames have an eval() method that works in similar ways.
The benefit of the eval() method is that columns can be referred to by name.
We'll use this labeled array as an example:

df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])
df.head()

Using pd.eval() as above, we can compute expressions with the three columns like this:

result1 = (df['A'] + df['B']) / (df['C'] - 1)
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
np.allclose(result1, result2)

The DataFrame.eval() method allows much more succinct evaluation of expressions with the columns:

result3 = df.eval('(A + B) / (C - 1)')
np.allclose(result1, result3)

DataFrame.eval() also allows assignment to any column.
Let's use the DataFrame from before, which has columns 'A', 'B', and 'C':

df.head()

We can use df.eval() to create a new column 'D' and assign to it a value computed from the other columns:

df.eval('D = (A + B) / C', inplace=True)
df.head()

# An existing column can be overwritten in the same way
df.eval('D = (A - B) / C', inplace=True)
df.head()
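
The examples above pass inplace=True. As a small sketch of what happens without it, and assuming a reasonably recent Pandas release in which inplace defaults to False, an assignment expression returns a new DataFrame and leaves the original untouched. The names df_small and with_d are used only for this illustration.

```python
import pandas as pd

df_small = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})

# Without inplace=True, eval() returns a copy that carries the new column
with_d = df_small.eval('D = A + B')

print(list(df_small.columns))   # the original columns are unchanged
print(with_d['D'].tolist())     # the new column lives on the returned copy
```

This form is convenient when chaining operations, at the cost of allocating a second DataFrame.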

The DataFrame.eval() method supports an additional syntax that lets it work with local Python variables.
Consider the following:

column_mean = df.mean(axis=1)
result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean')
np.allclose(result1, result2)

The @ character here marks a variable name rather than a column name, and lets you efficiently evaluate expressions involving the two "namespaces": the namespace of columns, and the namespace of Python objects.
Notice that this @ character is only supported by the DataFrame.eval() method, not by the pandas.eval() function, because the pandas.eval() function only has access to the one (Python) namespace.

The DataFrame has another method based on evaluated strings, called the query() method.
Consider the following:

result1 = df[(df.A < 0.5) & (df.B < 0.5)]
result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.allclose(result1, result2)

As with the examples for DataFrame.eval(), this is an expression involving columns of the DataFrame.
It cannot be expressed using the DataFrame.eval() syntax, however!
Instead, for this type of filtering operation, you can use the query() method:

result2 = df.query('A < 0.5 and B < 0.5')
np.allclose(result1, result2)
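
To see exactly which rows query() keeps, here is a minimal sketch on a three-row DataFrame. The name toy and its values are made up for this illustration.

```python
import pandas as pd

toy = pd.DataFrame({'A': [0.1, 0.9, 0.3],
                    'B': [0.2, 0.1, 0.8]})

# Standard Boolean masking
by_mask = toy[(toy.A < 0.5) & (toy.B < 0.5)]

# The same filter written as a query string
by_query = toy.query('A < 0.5 and B < 0.5')

print(by_query.index.tolist())
```

Only the first row has both A and B below 0.5, so both approaches return that single row with its original index label.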

The query() method also accepts the @ flag to mark local variables:

Cmean = df['C'].mean()
result1 = df[(df.A < Cmean) & (df.B < Cmean)]
result2 = df.query('A < @Cmean and B < @Cmean')
np.allclose(result1, result2)
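
The separation between the column namespace and the Python namespace is easiest to see when a local variable shares its name with a column. In this sketch (frame, shifted, and selected are illustrative names), the bare A refers to the column and @A refers to the local variable, in both eval() and query().

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3]})
A = 2  # a local Python variable that happens to share the column's name

# Column A plus the local variable A
shifted = frame.eval('A + @A')

# Rows where column A exceeds the local variable A
selected = frame.query('A > @A')

print(shifted.tolist())
print(selected.index.tolist())
```

Without the @ prefix, the name A would resolve to the column in both places.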

Every compound expression involving NumPy arrays or Pandas DataFrames will result in implicit creation of temporary arrays.
For example, this:

x = df[(df.A < 0.5) & (df.B < 0.5)]

is roughly equivalent to this:

tmp1 = df.A < 0.5
tmp2 = df.B < 0.5
tmp3 = tmp1 & tmp2
x = df[tmp3]

If the size of the temporary DataFrames is significant compared to your available system memory (typically several gigabytes), then it's a good idea to use an eval() or query() expression.
You can check the approximate size of your array in bytes using this:

df.values.nbytes
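
As a rough sketch of this size check, the following builds a DataFrame of the same shape as the one used above and compares the size of the underlying array with the per-column figures from memory_usage(). The name frame is illustrative, and the numbers assume float64 columns and Boolean temporaries.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
frame = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])

# Size of the underlying array: rows * columns * 8 bytes for float64
data_bytes = frame.values.nbytes

# Per-column memory use, excluding the index
per_column = frame.memory_usage(index=False)

# One Boolean temporary such as (frame.A < 0.5) costs one byte per row
mask_bytes = (frame.A < 0.5).values.nbytes

print(data_bytes, int(per_column.sum()), mask_bytes)
```

For a DataFrame this small the temporaries are negligible; the same arithmetic applied to hundreds of millions of rows is what makes eval() and query() worth considering.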