Data preparation#
Data Science & AI Workbench supports data preparation using numeric libraries such as NumPy, SciPy, and Pandas.
These examples use this small data file, vendors.csv:
Vendor Number,Vendor Name,Month,Day,Year,Active,Open Orders,2015,2016,Percent Growth
"104.0",ACME Inc,2,15,2014,"Y",200,"$45,000.00",$54000.00,20.00%
205,Apogee LTD,8,12,2015,"Y",150,"$29,000.00","$30,450.00",5.00%
143,Zenith Co,4,5,2014,"Y",290,"$18,000.00",$23400.00,30.00%
166,Hollerith Propulsion,9,25,2015,"Y",180,"$48,000.00",$48960.00,2.00%
180,Airtek Industrial,8,2,2014,"N",Closed,"$23,000.00",$17250.00,-25.00%
The columns are the vendor ID number, vendor name, month, day, and year of first purchase from the vendor, whether the account is currently active, the number of open orders, purchases in 2015 and 2016, and percent growth in orders from 2015 to 2016.
Converting data types#
Computers handle many types of data, including integer numbers such as 365, floating point numbers such as 365.2425, strings such as “ACME Inc”, and more.
An operation such as division may work for integers and floating point numbers, but produce an error if used on strings.
Often data libraries such as pandas will automatically use the correct types, but they do provide ways to correct and change the types when needed. For example, you may wish to convert between an integer such as 25, the floating point number 25.0, and strings such as “25”, “25.0”, or “$25.00”.
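In plain Python these conversions are done with the built-in int(), float(), and str() constructors; a minimal sketch with illustrative values:

```python
# Converting between plain Python types with the built-in constructors.
n = int("25")        # string to integer: 25
x = float(n)         # integer to float: 25.0
s = str(x)           # float back to string: "25.0"

# Formatted strings such as "$25.00" need cleanup before converting:
price = float("$25.00".replace("$", ""))
print(n, x, s, price)
```

The same idea, applied column by column, underlies the pandas conversions shown below.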
Pandas data types, or dtypes, correspond to similar Python types. Strings are called str in Python and object in pandas. Integers are called int in Python and int64 in pandas, indicating that pandas stores integers as 64-bit numbers. Floating point numbers are called float in Python and float64 in pandas, also indicating that they are stored with 64 bits. A boolean value, named for logician George Boole, can be either True or False; these are called bool in both Python and pandas. Pandas also includes some data types with no corresponding native Python type: datetime64 for date and time values, timedelta[ns] for storing the difference between two times as a number of nanoseconds, and category, where each item is one of a list of strings.
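A short sketch, with made-up values, showing these three pandas-only dtypes:

```python
import pandas as pd

# datetime64[ns]: parsed date and time values
dates = pd.Series(pd.to_datetime(["2014-02-15", "2015-08-12"]))

# timedelta64[ns]: the difference between two datetimes
deltas = dates.diff()

# category: each item is one of a fixed list of strings
sizes = pd.Series(["small", "large", "small"], dtype="category")

print(dates.dtype, deltas.dtype, sizes.dtype)
```

Note that pandas reports the time-difference dtype as timedelta64[ns].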
Here we import the vendor data file and show the dtypes:
import pandas as pd
import numpy as np
df = pd.read_csv('vendors.csv')
df.dtypes
Vendor Number float64
Vendor Name object
Month int64
Day int64
Year int64
Active object
Open Orders object
2015 object
2016 object
Percent Growth object
dtype: object
Try adding the 2015 and 2016 sales:
df['2015']+df['2016']
0 $45,000.00$54000.00
1 $29,000.00$30,450.00
2 $18,000.00$23400.00
3 $48,000.00$48960.00
4 $23,000.00$17250.00
dtype: object
These columns were stored as the type “object”, and concatenated as strings, not added as numbers.
Examine more information about the DataFrame:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 10 columns):
Vendor Number 5 non-null float64
Vendor Name 5 non-null object
Month 5 non-null int64
Day 5 non-null int64
Year 5 non-null int64
Active 5 non-null object
Open Orders 5 non-null object
2015 5 non-null object
2016 5 non-null object
Percent Growth 5 non-null object
dtypes: float64(1), int64(3), object(6)
memory usage: 480.0+ bytes
Vendor Number is a float and not an int. The 2015 and 2016 sales, percent growth, and open orders are stored as objects, not numbers. The month, day, and year values should be converted to datetime64, and the Active column should be converted to a boolean. The data can be converted with the astype() function, custom functions, or pandas functions such as to_numeric() or to_datetime().
astype()#
The astype() function can convert the Vendor Number column to int:
df['Vendor Number'].astype('int')
0 104
1 205
2 143
3 166
4 180
Name: Vendor Number, dtype: int64
astype() returns a copy, so to convert the original data you must assign the result back to the column. Check the result by showing the dtypes:
df['Vendor Number'] = df['Vendor Number'].astype('int')
df.dtypes
Vendor Number int64
Vendor Name object
Month int64
Day int64
Year int64
Active object
Open Orders object
2015 object
2016 object
Percent Growth object
dtype: object
However, trying to convert the 2015 column to a float or the Open Orders column to an int returns an error:
df['2015'].astype('float')
ValueError: could not convert string to float: '$23,000.00'
df['Open Orders'].astype('int')
ValueError: invalid literal for int() with base 10: 'Closed'
Even worse, trying to convert the Active column to a bool completes with no errors, but converts both Y and N values to True, because Python treats any non-empty string as True:
df['Active'].astype('bool')
0 True
1 True
2 True
3 True
4 True
Name: Active, dtype: bool
astype() works if the data is clean and can be interpreted simply as a number, or if you want to convert a number to a string. Other conversions require custom functions or pandas functions such as to_numeric() or to_datetime().
Custom conversion functions#
This small custom function converts a currency string like the ones in the 2015 column to a float by first removing the comma (,) and dollar sign ($) characters:
def currency_to_float(a):
    return float(a.replace(',', '').replace('$', ''))
Test the function on the 2015 column with the apply() function:
df['2015'].apply(currency_to_float)
0 45000.0
1 29000.0
2 18000.0
3 48000.0
4 23000.0
Name: 2015, dtype: float64
Convert the 2015 and 2016 columns and show the dtypes:
df['2015'] = df['2015'].apply(currency_to_float)
df['2016'] = df['2016'].apply(currency_to_float)
df.dtypes
Vendor Number int64
Vendor Name object
Month int64
Day int64
Year int64
Active object
Open Orders object
2015 float64
2016 float64
Percent Growth object
dtype: object
Convert the Percent Growth column:
def percent_to_float(a):
    return float(a.replace('%', '')) / 100
df['Percent Growth'].apply(percent_to_float)
0 0.20
1 0.05
2 0.30
3 0.02
4 -0.25
Name: Percent Growth, dtype: float64
df['Percent Growth'] = df['Percent Growth'].apply(percent_to_float)
df.dtypes
Vendor Number int64
Vendor Name object
Month int64
Day int64
Year int64
Active object
Open Orders object
2015 float64
2016 float64
Percent Growth float64
dtype: object
NumPy’s np.where() function is a good way to convert the Active column to bool. This code converts “Y” values to True and all other values to False, then shows the dtypes:
np.where(df["Active"] == "Y", True, False)
array([ True, True, True, True, False])
df["Active"] = np.where(df["Active"] == "Y", True, False)
df.dtypes
Vendor Number int64
Vendor Name object
Month int64
Day int64
Year int64
Active bool
Open Orders object
2015 float64
2016 float64
Percent Growth float64
dtype: object
Pandas helper functions#
The Open Orders column has several integers, but one string. Using astype() on this column would produce an error, but the pd.to_numeric() function built into pandas will convert the numeric values to numbers and any other values to the “not a number” (NaN) value built into the floating point number standard:
pd.to_numeric(df['Open Orders'], errors='coerce')
0 200.0
1 150.0
2 290.0
3 180.0
4 NaN
Name: Open Orders, dtype: float64
In this case, a non-numeric value in this field indicates that there are zero open orders, so we can convert NaN values to zero with the fillna() function:
pd.to_numeric(df['Open Orders'], errors='coerce').fillna(0)
0 200.0
1 150.0
2 290.0
3 180.0
4 0.0
Name: Open Orders, dtype: float64
Similarly, the pd.to_datetime() function built into pandas can convert the Month, Day, and Year columns to datetime64[ns]:
pd.to_datetime(df[['Month', 'Day', 'Year']])
0 2014-02-15
1 2015-08-12
2 2014-04-05
3 2015-09-25
4 2014-08-02
dtype: datetime64[ns]
Use these functions to change the DataFrame, then show the dtypes:
df['Open Orders'] = pd.to_numeric(df['Open Orders'], errors='coerce').fillna(0)
df['First Purchase Date'] = pd.to_datetime(df[['Month', 'Day', 'Year']])
df.dtypes
Vendor Number int64
Vendor Name object
Month int64
Day int64
Year int64
Active bool
Open Orders float64
2015 float64
2016 float64
Percent Growth float64
First Purchase Date datetime64[ns]
dtype: object
Converting data as it is read#
You can apply dtype and converters in the pd.read_csv() function. Defining dtype is like performing astype() on the data. A dtype or a converter can only be applied once to a specified column; if you try to apply both to the same column, the dtype is skipped. After converting as much of the data as possible in pd.read_csv(), use code similar to the previous examples to convert the rest.
df2 = pd.read_csv('vendors.csv',
                  dtype={'Vendor Number': 'int'},
                  converters={'2015': currency_to_float,
                              '2016': currency_to_float,
                              'Percent Growth': percent_to_float})
df2["Active"] = np.where(df2["Active"] == "Y", True, False)
df2['Open Orders'] = pd.to_numeric(df2['Open Orders'], errors='coerce').fillna(0)
df2['First Purchase Date'] = pd.to_datetime(df2[['Month', 'Day', 'Year']])
df2
Vendor Number Vendor Name Month Day Year Active Open Orders 2015 2016 Percent Growth First Purchase Date
0 104 ACME Inc 2 15 2014 True 200.0 45000.0 54000.0 0.20 2014-02-15
1 205 Apogee LTD 8 12 2015 True 150.0 29000.0 30450.0 0.05 2015-08-12
2 143 Zenith Co 4 5 2014 True 290.0 18000.0 23400.0 0.30 2014-04-05
3 166 Hollerith Propulsion 9 25 2015 True 180.0 48000.0 48960.0 0.02 2015-09-25
4 180 Airtek Industrial 8 2 2014 False 0.0 23000.0 17250.0 -0.25 2014-08-02
df2.dtypes
Vendor Number int64
Vendor Name object
Month int64
Day int64
Year int64
Active bool
Open Orders float64
2015 float64
2016 float64
Percent Growth float64
First Purchase Date datetime64[ns]
dtype: object
We thank http://pbpython.com/pandas_dtypes.html for providing the data preparation examples that inspired this section.
Merging and joining data sets#
You can use pandas to merge DataFrames:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on='key')
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3
The available merge methods are left, to use keys from the left frame only; right, to use keys from the right frame only; outer, to use the union of keys from both frames; and the default inner, to use the intersection of keys from both frames.
This merge using the default inner join omits key combinations found in only one of the source DataFrames:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on=['key1', 'key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
This example omits the rows with key1 and key2 set to K0, K1; K2, K1; or K2, K0.
Joins also copy information when necessary. The left DataFrame had one row with
the keys set to K1, K0
and the right DataFrame had two. The output DataFrame
has two, with the information from the left DataFrame copied into both rows.
The next example shows the results of a left, right, and outer merge on the same inputs. Empty cells are filled in with NaN values.
pd.merge(left, right, how='left', on=['key1', 'key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
pd.merge(left, right, how='right', on=['key1', 'key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3
pd.merge(left, right, how='outer', on=['key1', 'key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN C3 D3
If a key combination appears more than once in both tables, the output will contain the Cartesian product of the associated data.
In this small example a key that appears twice in the left frame and three times in the right frame produces six rows in the output frame.
left = pd.DataFrame({'A' : [1,2], 'B' : [2, 2]})
right = pd.DataFrame({'A' : [4,5,6], 'B': [2,2,2]})
pd.merge(left, right, on='B', how='outer')
A_x B A_y
0 1 2 4
1 1 2 5
2 1 2 6
3 2 2 4
4 2 2 5
5 2 2 6
To prevent very large outputs and memory overflow, manage duplicate values in keys before joining large DataFrames.
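For example, using the same small frames as above (redefined here so the sketch is self-contained), drop_duplicates() can deduplicate a key column before joining, and merge()'s validate parameter can raise an error instead of silently multiplying rows:

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [2, 2]})
right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 2, 2]})

# Deduplicating the key on one side avoids the Cartesian product.
merged = pd.merge(left, right.drop_duplicates(subset='B'), on='B', how='outer')
print(len(merged))  # 2 rows instead of 6

# Alternatively, ask merge() to check the relationship up front:
try:
    pd.merge(left, right, on='B', validate='one_to_one')
except pd.errors.MergeError as e:
    print('validation failed:', e)
```

Which guard fits depends on the data: drop duplicates when repeated keys are noise, or validate when they indicate a data error worth surfacing.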
While merging uses one or more columns as keys, joining uses the indexes, also known as row labels.
Join can also perform left, right, inner, and outer merges, and defaults to left.
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])
left.join(right)
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
left.join(right, how='outer')
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 NaN NaN C3 D3
left.join(right, how='inner')
A B C D
K0 A0 B0 C0 D0
K2 A2 B2 C2 D2
This is equivalent to using merge with arguments instructing it to use the indexes:
pd.merge(left, right, left_index=True, right_index=True, how='inner')
A B C D
K0 A0 B0 C0 D0
K2 A2 B2 C2 D2
You can join a frame indexed by a join key to a frame where the key is a column:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'key': ['K0', 'K1', 'K0', 'K1']})
right = pd.DataFrame({'C': ['C0', 'C1'],
                      'D': ['D0', 'D1']},
                     index=['K0', 'K1'])
left.join(right, on='key')
A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K0 C0 D0
3 A3 B3 K1 C1 D1
You can join on multiple keys if the passed DataFrame has a MultiIndex:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1']})
index = pd.MultiIndex.from_tuples([('K0', 'K0'), ('K1', 'K0'),
                                   ('K2', 'K0'), ('K2', 'K1')])
right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']},
                     index=index)
right
C D
K0 K0 C0 D0
K1 K0 C1 D1
K2 K0 C2 D2
K1 C3 D3
left.join(right, on=['key1', 'key2'])
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A3 B3 K2 K1 C3 D3
Note that this defaulted to a left join, but other types are also available:
left.join(right, on=['key1', 'key2'], how='inner')
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
2 A2 B2 K1 K0 C1 D1
3 A3 B3 K2 K1 C3 D3
For more information, including examples of using merge to join a single index to a multi-index or to join two multi-indexes, see the pandas documentation on merging.
When column names in the input frames overlap, pandas appends suffixes to disambiguate them. These default to _x and _y, but you can customize them:
left = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'v': [1, 2, 3]})
right = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'v': [4, 5, 6]})
pd.merge(left, right, on='k')
k v_x v_y
0 K0 1 4
1 K0 1 5
pd.merge(left, right, on='k', suffixes=['_l', '_r'])
k v_l v_r
0 K0 1 4
1 K0 1 5
Join has the similar arguments lsuffix and rsuffix:
left = left.set_index('k')
right = right.set_index('k')
left.join(right, lsuffix='_l', rsuffix='_r')
v_l v_r
k
K0 1 4.0
K0 1 5.0
K1 2 NaN
K2 3 NaN
You can join a list or tuple of DataFrames on their indexes:
right2 = pd.DataFrame({'v': [7, 8, 9]}, index=['K1', 'K1', 'K2'])
left.join([right, right2])
v_x v_y v
K0 1 4.0 NaN
K0 1 5.0 NaN
K1 2 NaN 7.0
K1 2 NaN 8.0
K2 3 NaN 9.0
If you have two frames with similar indices and want to fill in missing values in the left frame with values from the right frame, use the combine_first() method:
df1 = pd.DataFrame([[np.nan, 3., 5.],
                    [-4.6, np.nan, np.nan],
                    [np.nan, 7., np.nan]])
df2 = pd.DataFrame([[-42.6, np.nan, -8.2],
                    [-5., 1.6, 4]],
                   index=[1, 2])
df1.combine_first(df2)
0 1 2
0 NaN 3.0 5.0
1 -4.6 NaN -8.2
2 -5.0 7.0 4.0
The update() method overwrites values in a frame with values from another frame:
df1.update(df2)
df1
0 1 2
0 NaN 3.0 5.0
1 -42.6 NaN -8.2
2 -5.0 1.6 4.0
The pandas documentation on merging has more information, including examples of combining time series and other ordered data, with options to fill and interpolate missing data.
We thank the pandas documentation for many of these examples.
Filtering data#
This example uses a vendors DataFrame similar to the one we used above:
import pandas as pd
import numpy as np
df = pd.DataFrame({'VendorNumber': [104, 205, 143, 166, 180],
                   'VendorName': ['ACME Inc', 'Apogee LTD', 'Zenith Co', 'Hollerith Propulsion', 'Airtek Industrial'],
                   'Active': [True, True, True, True, False],
                   'OpenOrders': [200, 150, 290, 180, 0],
                   'Purchases2015': [45000.0, 29000.0, 18000.0, 48000.0, 23000.0],
                   'Purchases2016': [54000.0, 30450.0, 23400.0, 48960.0, 17250.0],
                   'PercentGrowth': [0.20, 0.05, 0.30, 0.02, -0.25],
                   'FirstPurchaseDate': ['2014-02-15', '2015-08-12', '2014-04-05', '2015-09-25', '2014-08-02']})
df['FirstPurchaseDate'] = df['FirstPurchaseDate'].astype('datetime64[ns]')
df
VendorNumber VendorName Active OpenOrders Purchases2015 Purchases2016 PercentGrowth FirstPurchaseDate
0 104 ACME Inc True 200 45000.0 54000.0 0.20 2014-02-15
1 205 Apogee LTD True 150 29000.0 30450.0 0.05 2015-08-12
2 143 Zenith Co True 290 18000.0 23400.0 0.30 2014-04-05
3 166 Hollerith Propulsion True 180 48000.0 48960.0 0.02 2015-09-25
4 180 Airtek Industrial False 0 23000.0 17250.0 -0.25 2014-08-02
To filter only certain rows from a DataFrame, call the query() method with a boolean expression based on the column names:
df.query('OpenOrders>160')
VendorNumber VendorName Active OpenOrders Purchases2015 Purchases2016 PercentGrowth FirstPurchaseDate
0 104 ACME Inc True 200 45000.0 54000.0 0.20 2014-02-15
2 143 Zenith Co True 290 18000.0 23400.0 0.30 2014-04-05
3 166 Hollerith Propulsion True 180 48000.0 48960.0 0.02 2015-09-25
Filtering can also be done with boolean indexing instead of query():
df[(df.OpenOrders < 190) & (df.Active == True)]
VendorNumber VendorName Active OpenOrders Purchases2015 Purchases2016 PercentGrowth FirstPurchaseDate
1 205 Apogee LTD True 150 29000.0 30450.0 0.05 2015-08-12
3 166 Hollerith Propulsion True 180 48000.0 48960.0 0.02 2015-09-25
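Boolean masks combine with other column predicates as well. For example (rebuilding a minimal version of the same DataFrame so the sketch is self-contained), isin() selects rows by membership in a list, and the .str accessor filters on substrings:

```python
import pandas as pd

df = pd.DataFrame({'VendorNumber': [104, 205, 143, 166, 180],
                   'VendorName': ['ACME Inc', 'Apogee LTD', 'Zenith Co',
                                  'Hollerith Propulsion', 'Airtek Industrial'],
                   'OpenOrders': [200, 150, 290, 180, 0]})

# Rows whose vendor number appears in a given list:
by_number = df[df.VendorNumber.isin([104, 166])]

# Rows whose name contains a substring:
by_name = df[df.VendorName.str.contains('Ind')]
print(by_number.VendorName.tolist(), by_name.VendorName.tolist())
```

Both expressions produce boolean Series the same shape as the column, so they can be combined with & and | just like the comparisons above.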