import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x * 2
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]
AttributeError Traceback (most recent call last)
<ipython-input-3-fc1d891ab539> in <module>()
1 data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
----> 2 [s.capitalize() for s in data]
<ipython-input-3-fc1d891ab539> in <listcomp>(.0)
1 data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
----> 2 [s.capitalize() for s in data]
AttributeError: 'NoneType' object has no attribute 'capitalize'
Pandas handles both vectorized string operations and missing data through the str attribute of Pandas Series and Index objects containing strings.
So, for example, suppose we create a Pandas Series with this data:
import pandas as pd
names = pd.Series(data)
names
names.str.capitalize()
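To make the contrast with the failing list comprehension concrete, here is a minimal sketch that compares a hand-written None guard with the vectorized call. The variable names `manual` and `vectorized` are illustrative only and do not come from the text.

```python
import pandas as pd

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# Plain-Python equivalent: the None entry has to be guarded explicitly.
manual = [s.capitalize() if s is not None else None for s in data]

# Vectorized version: the missing entry is passed through as missing
# instead of raising AttributeError.
vectorized = pd.Series(data).str.capitalize()

print(manual)
print(vectorized)
```

The vectorized form needs no conditional because the str accessor skips missing values on its own.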
Using tab completion on this str attribute will list all the vectorized string methods available to Pandas.
The examples that follow use this Series of names:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
Here is a list of Pandas str methods that mirror Python string methods:
len() | lower() | translate() | islower() |
ljust() | upper() | startswith() | isupper() |
rjust() | find() | endswith() | isnumeric() |
center() | rfind() | isalnum() | isdecimal() |
zfill() | index() | isalpha() | split() |
strip() | rindex() | isdigit() | rsplit() |
rstrip() | capitalize() | isspace() | partition() |
lstrip() | swapcase() | istitle() | rpartition() |
These methods have various return values. Some, like lower(), return a series of strings:
monte.str.lower()
monte.str.len()
monte.str.startswith('T')
monte.str.split()
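The four calls above return different kinds of values. As a small sketch, the results can be captured and inspected side by side; the variable names here are chosen for illustration.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

lowered = monte.str.lower()               # strings
lengths = monte.str.len()                 # integers
starts_with_t = monte.str.startswith('T') # Booleans
words = monte.str.split()                 # a list of words per element

print(lengths.tolist())
print(starts_with_t.tolist())
print(words[0])
```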
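In addition, several methods accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python's built-in re module:
Method | Description |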
---|---|
match() | Call re.match() on each element, returning a boolean |
extract() | Call re.search() on each element, returning matched groups as strings |
findall() | Call re.findall() on each element |
replace() | Replace occurrences of pattern with some other string |
contains() | Call re.search() on each element, returning a boolean |
count() | Count occurrences of pattern |
split() | Equivalent to str.split(), but accepts regexps |
rsplit() | Equivalent to str.rsplit(), but accepts regexps |
monte.str.extract('([A-Za-z]+)', expand=False)
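When a pattern has more than one group, extract() returns one column per group. The following is a sketch using named groups; the group names `first` and `last` are assumptions made for illustration and are not part of the original example.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Each named group becomes a column of the resulting DataFrame.
parts = monte.str.extract(r'(?P<first>[A-Za-z]+) (?P<last>[A-Za-z]+)')

print(parts)
```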
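We can also do something more complicated, like finding all names that start and end with a consonant, using the start-of-string (^) and end-of-string ($) regular expression characters:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')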
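The same pattern can also serve as a filter. As a sketch, passing it to contains() yields a Boolean mask that selects the matching names; the names `mask` and `consonant_names` are illustrative.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# True where the name starts with a non-vowel and ends with a non-vowel.
mask = monte.str.contains(r'^[^AEIOU].*[^aeiou]$')

# Use the mask to keep only the matching entries.
consonant_names = monte[mask]

print(consonant_names)
```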
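The ability to concisely apply regular expressions across Series or DataFrame entries opens up many possibilities for analysis and cleaning of data.
Finally, there are some miscellaneous methods that enable other convenient operations:
Method | Description |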
---|---|
get() | Index each element |
slice() | Slice each element |
slice_replace() | Replace slice in each element with passed value |
cat() | Concatenate strings |
repeat() | Repeat values |
normalize() | Return Unicode form of string |
pad() | Add whitespace to left, right, or both sides of strings |
wrap() | Split long strings into lines with length less than a given width |
join() | Join strings in each element of the Series with passed separator |
get_dummies() | Extract dummy variables as a DataFrame |
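A brief sketch of three of these methods on a made-up Series; the values and variable names are illustrative and do not come from the text.

```python
import pandas as pd

s = pd.Series(['a', 'bb', 'ccc'])

# Pad on the left with underscores up to a width of 3.
padded = s.str.pad(3, side='left', fillchar='_')

# Repeat each string twice.
repeated = s.str.repeat(2)

# Concatenate all elements into a single string with a separator.
joined = s.str.cat(sep='-')

print(padded.tolist(), repeated.tolist(), joined)
```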
The get() and slice() operations, in particular, enable vectorized element access from each array.
For example, we can get a slice of the first three characters of each string using str.slice(0, 3).
Note that this behavior is also available through Python's normal indexing syntax; for example, df.str.slice(0, 3) is equivalent to df.str[0:3]:
monte.str[0:3]
Indexing via df.str.get(i) and df.str[i] is likewise similar.
These get() and slice() methods also let you access elements of arrays returned by split().
For example, to extract the last name of each entry, we can combine split() and get():
monte.str.split().str.get(-1)
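As a sketch of the equivalence mentioned above, the indexing form str[-1] gives the same last names as get(-1), and split(expand=True) is one further option that spreads the pieces into columns. The variable names are illustrative.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Two spellings of the same element access on the split lists.
last_via_get = monte.str.split().str.get(-1)
last_via_index = monte.str.split().str[-1]

# expand=True returns a DataFrame with one column per piece.
pieces = monte.str.split(expand=True)

print(last_via_index.tolist())
print(pieces)
```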
Another method that requires a bit of extra explanation is the get_dummies() method.
This is useful when your data has a column containing some sort of coded indicator.
For example, we might have a dataset that contains information in the form of codes, such as A="born in America," B="born in the United Kingdom," C="likes cheese," D="likes spam":
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte
The get_dummies() routine lets you quickly split out these indicator variables into a DataFrame:
full_monte['info'].str.get_dummies('|')
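Once the codes are split into indicator columns, they can be combined like any other columns. A minimal sketch, assuming we want the names of everyone who likes both cheese (C) and spam (D); the variable names `dummies` and `cheese_and_spam` are illustrative.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})

# One indicator column per code: A, B, C, D.
dummies = full_monte['info'].str.get_dummies('|')

# Rows where both the C and D indicators are set.
both = (dummies['C'] == 1) & (dummies['D'] == 1)
cheese_and_spam = full_monte.loc[both, 'name']

print(cheese_and_spam.tolist())
```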
# !curl -O http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz
# !gunzip recipeitems-latest.json.gz
The database is in JSON format, so we will try pd.read_json to read it:
try:
    recipes = pd.read_json('recipeitems-latest.json')
except ValueError as e:
    print("ValueError:", e)
We get a ValueError mentioning that there is "trailing data."
A search for this error message suggests that it arises when each line of the file is itself valid JSON, but the file as a whole is not.
Let's check if this interpretation is true:
from io import StringIO

with open('recipeitems-latest.json') as f:
    line = f.readline()
# wrap the string in StringIO: recent pandas versions deprecate passing literal JSON strings
pd.read_json(StringIO(line)).shape
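Each line is indeed valid JSON, so we need to string the lines together.
One way to do this is to construct a string representation containing all these JSON entries, and then load the whole thing with pd.read_json:
from io import StringIO

# read the entire file into a single JSON array string
with open('recipeitems-latest.json', 'r') as f:
    # extract each line
    data = (line.strip() for line in f)
    # reformat so each line is an element of a list
    data_json = "[{0}]".format(','.join(data))
# read the result as JSON (wrapped in StringIO, as recent pandas versions expect)
recipes = pd.read_json(StringIO(data_json))
recipes.shape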
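As an alternative to joining the lines by hand, pd.read_json accepts a lines=True argument for files with one JSON object per line. The sketch below uses a small made-up file written to a temporary directory; the records and the file name sample.json are hypothetical stand-ins for the recipe data.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical records, written one JSON object per line.
records = [
    {"name": "Toast", "ingredients": "bread\nbutter"},
    {"name": "Tea", "ingredients": "water\ntea leaves"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # lines=True tells pandas to parse each line as a separate record.
    sample = pd.read_json(path, lines=True)

print(sample.shape)
```

Whether this works unchanged on the full recipe file depends on the file's contents, so treat it as something to try rather than a guaranteed replacement.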
recipes.iloc[0]
recipes.ingredients.str.len().describe()
recipes.name[np.argmax(recipes.ingredients.str.len())]
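The np.argmax lookup above relies on the DataFrame having a default integer index, since np.argmax returns a position rather than a label. A label-based sketch using idxmax() is shown below on a small made-up table; the recipe names, ingredient strings, and index labels are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the recipes table, with a non-default index.
recipes = pd.DataFrame(
    {'name': ['Salted water', 'Bread', 'Tea'],
     'ingredients': ['salt', 'flour\nwater\nyeast\nsalt', 'tea']},
    index=[10, 20, 30],
)

# idxmax() returns the index label of the longest ingredient string.
longest_label = recipes['ingredients'].str.len().idxmax()
longest_name = recipes.loc[longest_label, 'name']

print(longest_label, longest_name)
```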
recipes.description.str.contains('[Bb]reakfast').sum()
recipes.ingredients.str.contains('[Cc]innamon').sum()
recipes.ingredients.str.contains('[Cc]inamon').sum()
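The two counts above search for the correct and the misspelled form separately. As a sketch, a single pattern with an optional second "n" catches both, here on a made-up Series of ingredient strings (the data is hypothetical).

```python
import re

import pandas as pd

# Hypothetical ingredient strings, one of them misspelled.
ingredients = pd.Series(['1 tsp cinnamon', '2 tsp Cinamon', '1 pinch salt'])

# "n?" makes the second n optional; the flag ignores letter case.
matches = ingredients.str.contains(r'cinn?amon', flags=re.IGNORECASE)

print(matches.sum())
```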
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
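We can then build a Boolean DataFrame consisting of True and False values, indicating whether each spice appears in the ingredient list:
import re
# pass the flag by keyword: the second positional argument of contains() is case, not flags
spice_df = pd.DataFrame({spice: recipes.ingredients.str.contains(spice, flags=re.IGNORECASE)
                         for spice in spice_list})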
spice_df.head()
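Now, as an example, suppose we want to find a recipe that uses parsley, paprika, and tarragon.
We can compute this quickly using the query() method of DataFrames, discussed in High-Performance Pandas: eval() and query():
selection = spice_df.query('parsley & paprika & tarragon')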
len(selection)
recipes.name[selection.index]
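The whole selection pipeline can be tried end to end on a few made-up rows. This is only a sketch: the recipe names and ingredient strings below are hypothetical and stand in for the real recipe database.

```python
import re

import pandas as pd

# Hypothetical mini recipe table.
recipes = pd.DataFrame({
    'name': ['Herb chicken', 'Plain fries', 'Green dip'],
    'ingredients': ['Parsley, paprika, tarragon, salt',
                    'salt and pepper',
                    'parsley\nPaprika'],
})

spice_list = ['parsley', 'paprika', 'tarragon']

# One Boolean column per spice, matched case-insensitively.
spice_df = pd.DataFrame({
    spice: recipes['ingredients'].str.contains(spice, flags=re.IGNORECASE)
    for spice in spice_list
})

# Keep only the rows where all three columns are True.
selection = spice_df.query('parsley & paprika & tarragon')
matching_names = recipes['name'][selection.index]

print(matching_names.tolist())
```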