
import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x * 2
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]
AttributeError Traceback (most recent call last)
<ipython-input-3-fc1d891ab539> in <module>()
1 data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
----> 2 [s.capitalize() for s in data]
<ipython-input-3-fc1d891ab539> in <listcomp>(.0)
1 data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
----> 2 [s.capitalize() for s in data]
AttributeError: 'NoneType' object has no attribute 'capitalize'
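Pandas addresses both of these needs, vectorized string operations and correct handling of missing data, through the str attribute of Pandas Series and Index objects containing strings.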
So, for example, suppose we create a Pandas Series with this data:
import pandas as pd
names = pd.Series(data)
names
names.str.capitalize()
Tab completion on this str attribute will list all the vectorized string methods available to Pandas.
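To make the missing-value behavior concrete, here is a minimal, self-contained sketch. It rebuilds the same list used above (the variable name `capitalized` is only illustrative) and shows that the vectorized method capitalizes each string while passing the missing entry through instead of raising an error.

```python
import pandas as pd

# Same sample data as above, including one missing value
names = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])

# The vectorized method works element by element and skips the missing entry
capitalized = names.str.capitalize()
print(capitalized)

# The missing value is still flagged as missing in the result
print(capitalized.isna().tolist())
```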
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
Here is a list of Pandas str methods that mirror Python string methods:

| | | | |
|---|---|---|---|
| len() | lower() | translate() | islower() |
| ljust() | upper() | startswith() | isupper() |
| rjust() | find() | endswith() | isnumeric() |
| center() | rfind() | isalnum() | isdecimal() |
| zfill() | index() | isalpha() | split() |
| strip() | rindex() | isdigit() | rsplit() |
| rstrip() | capitalize() | isspace() | partition() |
| lstrip() | swapcase() | istitle() | rpartition() |

These methods have various return values. Some, like lower(), return a series of strings:
monte.str.lower()
monte.str.len()
monte.str.startswith('T')
monte.str.split()
In addition, several methods accept regular expressions to examine the content of each string element, following some of the API conventions of Python's built-in re module:

| Method | Description |
|---|---|
| match() | Call re.match() on each element, returning a boolean |
| extract() | Extract the capture groups of the first match in each element, returning them as strings |
| findall() | Call re.findall() on each element |
| replace() | Replace occurrences of pattern with some other string |
| contains() | Call re.search() on each element, returning a boolean |
| count() | Count occurrences of pattern |
| split() | Equivalent to str.split(), but accepts regexps |
| rsplit() | Equivalent to str.rsplit(), but accepts regexps |
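
As a rough illustration of a few methods from this table, the sketch below applies contains(), count(), and replace() to the same `monte` Series. The patterns and the variable names (`has_double_l`, `n_lower_e`, `initials`) are chosen only for this example; `regex=True` is passed explicitly to replace() because its default has changed between pandas versions.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# contains(): does the pattern occur anywhere in each element?
has_double_l = monte.str.contains(r'll')

# count(): how many times does the pattern occur in each element?
n_lower_e = monte.str.count(r'e')

# replace(): regex substitution with backreferences, turning
# "First Last" into the initials "F.L."
initials = monte.str.replace(r'^(\w)\w*\s+(\w)\w*$', r'\1.\2.', regex=True)

print(has_double_l.tolist())
print(n_lower_e.tolist())
print(initials.tolist())
```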

monte.str.extract('([A-Za-z]+)', expand=False)
We can also do something more complicated, such as finding all names that start and end with a consonant, using the start-of-string (^) and end-of-string ($) regular expression characters:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
The ability to concisely apply regular expressions across Series or DataFrame entries opens up many possibilities for analysis and cleaning of data.

| Method | Description |
|---|---|
| get() | Index each element |
| slice() | Slice each element |
| slice_replace() | Replace slice in each element with passed value |
| cat() | Concatenate strings |
| repeat() | Repeat values |
| normalize() | Return Unicode form of string |
| pad() | Add whitespace to left, right, or both sides of strings |
| wrap() | Split long strings into lines with length less than a given width |
| join() | Join strings in each element of the Series with passed separator |
| get_dummies() | Extract dummy variables as a DataFrame |
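
A short sketch of three of these miscellaneous methods, again on the `monte` Series, may help. The width, fill character, and separators are arbitrary choices for illustration, as are the variable names.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# pad(): left-pad every string to at least 15 characters with dots
padded = monte.str.pad(15, side='left', fillchar='.')

# join(): split() yields a list per element; join() glues each list back together
joined = monte.str.split().str.join('_')

# cat(): with only a separator, concatenate the whole Series into one string
combined = monte.str.cat(sep=', ')

print(padded.tolist())
print(joined.tolist())
print(combined)
```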
The get() and slice() operations, in particular, enable vectorized element access from each array.
For example, we can get a slice of the first three characters of each array using str.slice(0, 3).
Note that this behavior is also available through Python's normal indexing syntax; for example, df.str.slice(0, 3) is equivalent to df.str[0:3]:
monte.str[0:3]
Indexing via df.str.get(i) and df.str[i] is likewise similar.
These get() and slice() methods also let you access elements of arrays returned by split().
For example, to extract the last name of each entry, we can combine split() and get():
monte.str.split().str.get(-1)
Another method worth a closer look is the get_dummies() method.
This is useful when your data has a column containing some sort of coded indicator.
For example, we might have a dataset that contains information in the form of codes, such as A="born in America," B="born in the United Kingdom," C="likes cheese," D="likes spam":
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte
The get_dummies() routine lets you quickly split out these indicator variables into a DataFrame:
full_monte['info'].str.get_dummies('|')
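To show what the result looks like, the following minimal sketch runs get_dummies() on a shortened, stand-alone version of the info column (only the first three codes from above; the names `info` and `dummies` are illustrative). Each distinct code becomes its own column, with 1 marking the rows where that code is present.

```python
import pandas as pd

# First three entries of the 'info' column above
info = pd.Series(['B|C|D', 'B|D', 'A|C'])

# Split on '|' and build one indicator column per distinct code
dummies = info.str.get_dummies('|')

print(dummies)
print(list(dummies.columns))
```

The columns come back in sorted order, so a code that appears only in a later row (here A) still ends up first.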
# !curl -O http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz
# !gunzip recipeitems-latest.json.gz
The file is in JSON format, so we first try pd.read_json to read it:
try:
    recipes = pd.read_json('recipeitems-latest.json')
except ValueError as e:
    print("ValueError:", e)
We get a ValueError mentioning that there is "trailing data."
Searching for the text of this error on the Internet, it seems that it's due to using a file in which each line is itself a valid JSON, but the full file is not.
Let's check if this interpretation is true:
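from io import StringIO

with open('recipeitems-latest.json') as f:
    line = f.readline()
# Wrap the string in StringIO: newer pandas versions deprecate passing literal JSON strings
pd.read_json(StringIO(line)).shape
If this succeeds, each line is valid JSON on its own, so we need to string the lines together. One way to do this is to build a single string containing all of the JSON entries and then load the whole thing with pd.read_json:
# read the entire file into a Python array
with open('recipeitems-latest.json', 'r') as f:
    # Extract each line
    data = (line.strip() for line in f)
    # Reformat so each line is the element of a list
    data_json = "[{0}]".format(','.join(data))
# read the result as a JSON
recipes = pd.read_json(StringIO(data_json))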
recipes.shape
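As an alternative sketch, pd.read_json also accepts `lines=True`, which reads a file containing one JSON record per line. The example below uses a small made-up two-record string (the recipe names and the `recipes_demo` variable are hypothetical) rather than the real file, so whether this handles every quirk of the recipe database is an assumption left untested here.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the file: each line is valid JSON, the whole is not
raw = (
    '{"name": "Pancakes", "ingredients": "flour, eggs, milk"}\n'
    '{"name": "Toast", "ingredients": "bread, butter"}\n'
)

# lines=True tells pandas to parse one JSON object per line
recipes_demo = pd.read_json(StringIO(raw), lines=True)

print(recipes_demo.shape)
print(recipes_demo)
```

With a real file on disk, the same keyword can be passed along with the file path instead of a StringIO object.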
recipes.iloc[0]
recipes.ingredients.str.len().describe()
recipes.name[np.argmax(recipes.ingredients.str.len())]
recipes.description.str.contains('[Bb]reakfast').sum()
recipes.ingredients.str.contains('[Cc]innamon').sum()
recipes.ingredients.str.contains('[Cc]inamon').sum()
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
              'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
We can then build a Boolean DataFrame consisting of True and False values, indicating whether each ingredient appears in the list:
import re
# Pass the flag by keyword: the second positional argument of str.contains is `case`, not `flags`
spice_df = pd.DataFrame({spice: recipes.ingredients.str.contains(spice, flags=re.IGNORECASE)
                         for spice in spice_list})
spice_df.head()
To find recipes that use parsley, paprika, and tarragon, we can use the query() method of DataFrames, discussed in High-Performance Pandas: eval() and query():
selection = spice_df.query('parsley & paprika & tarragon')
len(selection)
recipes.name[selection.index]
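The same recommendation logic can be tried without downloading the recipe database. The sketch below uses a tiny made-up DataFrame (the recipe names, ingredient strings, and the `recipes_demo` and `spice_demo` variables are all hypothetical) to show the case-insensitive match and the query() step end to end. Note that `flags` is passed by keyword, since the second positional parameter of str.contains is `case`.

```python
import re

import pandas as pd

# Hypothetical miniature recipe table standing in for the real database
recipes_demo = pd.DataFrame({
    'name': ['Herb chicken', 'Plain rice', 'Spiced stew'],
    'ingredients': ['Parsley\nPaprika\ntarragon',
                    'rice\nsalt',
                    'PAPRIKA\nparsley\ncumin'],
})

spice_list = ['parsley', 'paprika', 'tarragon']

# One Boolean column per spice, matched case-insensitively
spice_demo = pd.DataFrame({
    spice: recipes_demo.ingredients.str.contains(spice, flags=re.IGNORECASE)
    for spice in spice_list
})

# Keep only the rows where all three spices are present
selection = spice_demo.query('parsley & paprika & tarragon')
matches = recipes_demo.name[selection.index].tolist()

print(matches)
```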