Data preparation#

Data Science & AI Workbench supports data preparation using numeric libraries such as NumPy, SciPy, and Pandas.

These examples use this small data file vendors.csv:

Vendor Number,Vendor Name,Month,Day,Year,Active,Open Orders,2015,2016,Percent Growth
"104.0",ACME Inc,2,15,2014,"Y",200,"$45,000.00",$54000.00,20.00%
205,Apogee LTD,8,12,2015,"Y",150,"$29,000.00","$30,450.00",5.00%
143,Zenith Co,4,5,2014,"Y",290,"$18,000.00",$23400.00,30.00%
166,Hollerith Propulsion,9,25,2015,"Y",180,"$48,000.00",$48960.00,2.00%
180,Airtek Industrial,8,2,2014,"N",Closed,"$23,000.00",$17250.00,-25.00%

The columns are the vendor ID number, vendor name, month day and year of first purchase from the vendor, whether the account is currently active, the number of open orders, purchases in 2015 and 2016, and percent growth in orders from 2015 to 2016.

Converting data types#

Computers handle many types of data, including integer numbers such as 365, floating point numbers such as 365.2425, strings such as “ACME Inc”, and more.

An operation such as division may work for integers and floating point numbers, but produce an error if used on strings.

Often data libraries such as pandas will automatically use the correct types, but they do provide ways to correct and change the types when needed. For example, you may wish to convert between an integer such as 25, the floating point number 25.0, and strings such as “25”, “25.0”, or “$25.00”.

Pandas data types or dtypes correspond to similar Python types.

Strings are called str in Python and object in pandas.

Integers are called int in Python and int64 in pandas, indicating that pandas stores integers as 64-bit numbers.

Floating point numbers are called float in Python and float64 in pandas, also indicating that they are stored with 64 bits.

A boolean value, named for logician George Boole, can be either True or False. These are called bool in Python and bool in pandas.

Pandas includes some data types with no corresponding native Python type: datetime64 for date and time values, timedelta[ns] for storing the difference between two times as a number of nanoseconds, and category where each item is one of a list of strings.

Here we import the vendor data file and show the dtypes:

import pandas as pd
import numpy as np
df = pd.read_csv('vendors.csv')
df.dtypes
Vendor Number     float64
Vendor Name        object
Month               int64
Day                 int64
Year                int64
Active             object
Open Orders        object
2015               object
2016               object
Percent Growth     object
dtype: object

Try adding the 2015 and 2016 sales:

df['2015']+df['2016']
0     $45,000.00$54000.00
1    $29,000.00$30,450.00
2     $18,000.00$23400.00
3     $48,000.00$48960.00
4     $23,000.00$17250.00
dtype: object

These columns were stored as the type “object”, and concatenated as strings, not added as numbers.

Examine more information about the DataFrame:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 10 columns):
Vendor Number     5 non-null float64
Vendor Name       5 non-null object
Month             5 non-null int64
Day               5 non-null int64
Year              5 non-null int64
Active            5 non-null object
Open Orders       5 non-null object
2015              5 non-null object
2016              5 non-null object
Percent Growth    5 non-null object
dtypes: float64(1), int64(3), object(6)
memory usage: 480.0+ bytes

Vendor Number is a float and not an int. 2015 and 2016 sales, percent growth, and open orders are stored as objects and not numbers. The month, day, and year values should be converted to datetime64, and the active column should be converted to a boolean.

The data can be converted with the astype() function, custom functions, or pandas functions such as to_numeric() or to_datetime().

astype()#

The astype() function can convert the Vendor Number column to int:

df['Vendor Number'].astype('int')
0    104
1    205
2    143
3    166
4    180
Name: Vendor Number, dtype: int64

astype() returns a copy, so an assignment statement will convert the original data. This can be checked by showing the dtypes.

df['Vendor Number'] = df['Vendor Number'].astype('int')
df.dtypes
Vendor Number      int64
Vendor Name       object
Month              int64
Day                int64
Year               int64
Active            object
Open Orders       object
2015              object
2016              object
Percent Growth    object
dtype: object

However, trying to convert the 2015 column to a float or the Open Orders column to an int returns an error.

df['2015'].astype('float')
ValueError: could not convert string to float: '$23,000.00'
df['Open Orders'].astype('int')
ValueError: invalid literal for int() with base 10: 'Closed'

Even worse, trying to convert the Active column to a bool completes with no errors, but converts both Y and N values to True.

df['Active'].astype('bool')
0    True
1    True
2    True
3    True
4    True
Name: Active, dtype: bool

astype() works if the data is clean and can be interpreted simply as a number, or if you want to convert a number to a string. Other conversions require custom functions or pandas functions such as to_numeric() or to_datetime().

Custom conversion functions#

This small custom function converts a currency string like the ones in the 2015 column to a float by first removing the comma (,) and dollar sign ($) characters.

def currency_to_float(a):
    return float(a.replace(',','').replace('$',''))

Test the function on the 2015 column with the apply() function:

df['2015'].apply(currency_to_float)
0    45000.0
1    29000.0
2    18000.0
3    48000.0
4    23000.0
Name: 2015, dtype: float64

Convert the 2015 and 2016 columns and show the dtypes:

df['2015'] = df['2015'].apply(currency_to_float)
df['2016'] = df['2016'].apply(currency_to_float)
df.dtypes
Vendor Number       int64
Vendor Name        object
Month               int64
Day                 int64
Year                int64
Active             object
Open Orders        object
2015              float64
2016              float64
Percent Growth     object
dtype: object

Convert the Percent Growth column:

def percent_to_float(a):
    return float(a.replace('%',''))/100
df['Percent Growth'].apply(percent_to_float)
0    0.20
1    0.05
2    0.30
3    0.02
4   -0.25
Name: Percent Growth, dtype: float64
df['Percent Growth'] = df['Percent Growth'].apply(percent_to_float)
df.dtypes
Vendor Number       int64
Vendor Name        object
Month               int64
Day                 int64
Year                int64
Active             object
Open Orders        object
2015              float64
2016              float64
Percent Growth    float64
dtype: object

NumPy’s np.where() function is a good way to convert the Active column to bool. This code converts “Y” values to True and all other values to False, then shows the dtypes:

np.where(df["Active"] == "Y", True, False)
array([ True,  True,  True,  True, False])
df["Active"] = np.where(df["Active"] == "Y", True, False)
df.dtypes
Vendor Number       int64
Vendor Name        object
Month               int64
Day                 int64
Year                int64
Active               bool
Open Orders        object
2015              float64
2016              float64
Percent Growth    float64
dtype: object

Pandas helper functions#

The Open Orders column has several integers, but one string. Using astype() on this column would produce an error, but the pd.to_numeric() function built in to pandas will convert the numeric values to numbers and any other values to the “not a number” or “NaN” value built in to the floating point number standard:

pd.to_numeric(df['Open Orders'], errors='coerce')
0    200.0
1    150.0
2    290.0
3    180.0
4      NaN
Name: Open Orders, dtype: float64

In this case, a non-numeric value in this field indicates that there are zero open orders, so we can convert NaN values to zero with the function fillna():

pd.to_numeric(df['Open Orders'], errors='coerce').fillna(0)
0    200.0
1    150.0
2    290.0
3    180.0
4      0.0
Name: Open Orders, dtype: float64

Similarly, the pd.to_datetime() function built in to pandas can convert the Month Day and Year columns to datetime64[ns]:

pd.to_datetime(df[['Month', 'Day', 'Year']])
0   2014-02-15
1   2015-08-12
2   2014-04-05
3   2015-09-25
4   2014-08-02
dtype: datetime64[ns]

Use these functions to change the DataFrame, then show the dtypes:

df['Open Orders'] = pd.to_numeric(df['Open Orders'], errors='coerce').fillna(0)
df['First Purchase Date'] = pd.to_datetime(df[['Month', 'Day', 'Year']])
df.dtypes
Vendor Number                   int64
Vendor Name                    object
Month                           int64
Day                             int64
Year                            int64
Active                           bool
Open Orders                   float64
2015                          float64
2016                          float64
Percent Growth                float64
First Purchase Date    datetime64[ns]
dtype: object

Converting data as it is read#

You can apply dtype and converters in the pd.read_csv() function. Defining dtype is like performing astype() on the data.

A dtype or a converter can only be applied once to a specified column. If you try to apply both to the same column, the dtype is skipped.

After converting as much of the data as possible in pd.read_csv(), use code similar to the previous examples to convert the rest.

df2 = pd.read_csv('vendors.csv',
                  dtype={'Vendor Number': 'int'},
                  converters={'2015': currency_to_float,
                              '2016': currency_to_float,
                              'Percent Growth': percent_to_float})
df2["Active"] = np.where(df2["Active"] == "Y", True, False)
df2['Open Orders'] = pd.to_numeric(df2['Open Orders'], errors='coerce').fillna(0)
df2['First Purchase Date'] = pd.to_datetime(df2[['Month', 'Day', 'Year']])
df2
   Vendor Number           Vendor Name  Month  Day  Year  Active  Open Orders     2015     2016  Percent Growth First Purchase Date
0            104              ACME Inc      2   15  2014    True        200.0  45000.0  54000.0            0.20          2014-02-15
1            205            Apogee LTD      8   12  2015    True        150.0  29000.0  30450.0            0.05          2015-08-12
2            143             Zenith Co      4    5  2014    True        290.0  18000.0  23400.0            0.30          2014-04-05
3            166  Hollerith Propulsion      9   25  2015    True        180.0  48000.0  48960.0            0.02          2015-09-25
4            180     Airtek Industrial      8    2  2014   False          0.0  23000.0  17250.0           -0.25          2014-08-02
df2.dtypes
Vendor Number                   int64
Vendor Name                    object
Month                           int64
Day                             int64
Year                            int64
Active                           bool
Open Orders                   float64
2015                          float64
2016                          float64
Percent Growth                float64
First Purchase Date    datetime64[ns]
dtype: object

We thank http://pbpython.com/pandas_dtypes.html for providing data preparation examples that inspired these examples.

Merging and joining data sets#

You can use pandas to merge DataFrames:

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on='key')
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3

The available merge methods are left to use keys from the left frame only, right to use keys from the right frame only, outer to use the union of keys from both frames, and the default inner to use the intersection of keys from both frames.

This merge using the default inner join omits key combinations found in only one of the source DataFrames:

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on=['key1', 'key2'])
  key1 key2   A   B   C   D
0   K0   K0  A0  B0  C0  D0
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2

This example omits the rows with key1 and key2 set to K0, K1, K2, K1, or K2, K0.

Joins also copy information when necessary. The left DataFrame had one row with the keys set to K1, K0 and the right DataFrame had two. The output DataFrame has two, with the information from the left DataFrame copied into both rows.

The next example shows the results of a left, right, and outer merge on the same inputs. Empty cells are filled in with NaN values.

pd.merge(left, right, how='left', on=['key1', 'key2'])
  key1 key2   A   B    C    D
0   K0   K0  A0  B0   C0   D0
1   K0   K1  A1  B1  NaN  NaN
2   K1   K0  A2  B2   C1   D1
3   K1   K0  A2  B2   C2   D2
4   K2   K1  A3  B3  NaN  NaN
pd.merge(left, right, how='right', on=['key1', 'key2'])
  key1 key2    A    B   C   D
0   K0   K0   A0   B0  C0  D0
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3
pd.merge(left, right, how='outer', on=['key1', 'key2'])
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K1   A3   B3  NaN  NaN
5   K2   K0  NaN  NaN   C3   D3

If a key combination appears more than once in both tables, the output will contain the Cartesian product of the associated data.

In this small example a key that appears twice in the left frame and three times in the right frame produces six rows in the output frame.

left = pd.DataFrame({'A' : [1,2], 'B' : [2, 2]})
right = pd.DataFrame({'A' : [4,5,6], 'B': [2,2,2]})
pd.merge(left, right, on='B', how='outer')
   A_x  B  A_y
0    1  2    4
1    1  2    5
2    1  2    6
3    2  2    4
4    2  2    5
5    2  2    6

To prevent very large outputs and memory overflow, manage duplicate values in keys before joining large DataFrames.

While merging uses one or more columns as keys, joining uses the indexes, also known as row labels.

Join can also perform left, right, inner, and outer merges, and defaults to left.

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])
left.join(right)
     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2
left.join(right, how='outer')
      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3  NaN  NaN   C3   D3
left.join(right, how='inner')
     A   B   C   D
K0  A0  B0  C0  D0
K2  A2  B2  C2  D2

This is equivalent to using merge with arguments instructing it to use the indexes:

pd.merge(left, right, left_index=True, right_index=True, how='inner')
     A   B   C   D
K0  A0  B0  C0  D0
K2  A2  B2  C2  D2

You can join a frame indexed by a join key to a frame where the key is a column:

left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'key': ['K0', 'K1', 'K0', 'K1']})
right = pd.DataFrame({'C': ['C0', 'C1'],
                      'D': ['D0', 'D1']},
                     index=['K0', 'K1'])
left.join(right, on='key')
    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K0  C0  D0
3  A3  B3  K1  C1  D1

You can join on multiple keys if the passed DataFrame has a MultiIndex:

left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1']})
index = pd.MultiIndex.from_tuples([('K0', 'K0'), ('K1', 'K0'),
                                   ('K2', 'K0'), ('K2', 'K1')])
right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']},
                     index=index)
right
        C   D
K0 K0  C0  D0
K1 K0  C1  D1
K2 K0  C2  D2
   K1  C3  D3
left.join(right, on=['key1', 'key2'])
    A   B key1 key2    C    D
0  A0  B0   K0   K0   C0   D0
1  A1  B1   K0   K1  NaN  NaN
2  A2  B2   K1   K0   C1   D1
3  A3  B3   K2   K1   C3   D3

Note that this defaulted to a left join, but other types are also available:

left.join(right, on=['key1', 'key2'], how='inner')
    A   B key1 key2   C   D
0  A0  B0   K0   K0  C0  D0
2  A2  B2   K1   K0  C1  D1
3  A3  B3   K2   K1  C3  D3

For more information, including examples of using merge to join a single index to a multi-index or to join two multi-indexes, see the pandas documentation on merging.

When column names in the input frames overlap, pandas appends suffixes to disambiguate them. These default to _x and _y but you can customize them:

left = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'v': [1, 2, 3]})
right = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'v': [4, 5, 6]})
pd.merge(left, right, on='k')
    k  v_x  v_y
0  K0    1    4
1  K0    1    5
pd.merge(left, right, on='k', suffixes=['_l', '_r'])
    k  v_l  v_r
0  K0    1    4
1  K0    1    5

Join has similar arguments lsuffix and rsuffix.

left = left.set_index('k')
right = right.set_index('k')
left.join(right, lsuffix='_l', rsuffix='_r')
    v_l  v_r
k
K0    1  4.0
K0    1  5.0
K1    2  NaN
K2    3  NaN

You can join a list or tuple of DataFrames on their indexes:

right2 = pd.DataFrame({'v': [7, 8, 9]}, index=['K1', 'K1', 'K2'])
left.join([right, right2])
    v_x  v_y    v
K0    1  4.0  NaN
K0    1  5.0  NaN
K1    2  NaN  7.0
K1    2  NaN  8.0
K2    3  NaN  9.0

If you have two frames with similar indices and want to fill in missing values in the left frame with values from the right frame, use the combine_first() method:

df1 = pd.DataFrame([[np.nan, 3., 5.],
                    [-4.6, np.nan, np.nan],
                    [np.nan, 7., np.nan]])
df2 = pd.DataFrame([[-42.6, np.nan, -8.2],
                    [-5., 1.6, 4]],
                   index=[1, 2])
df1.combine_first(df2)
     0    1    2
0  NaN  3.0  5.0
1 -4.6  NaN -8.2
2 -5.0  7.0  4.0

The method update() overwrites values in a frame with values from another frame:

df1.update(df2)
df1
      0    1    2
0   NaN  3.0  5.0
1 -42.6  NaN -8.2
2  -5.0  1.6  4.0

The pandas documentation on merging has more information, including examples of combining time series and other ordered data, with options to fill and interpolate missing data.

We thank the pandas documentation for many of these examples.

Filtering data#

This example uses a vendors DataFrame similar to the one we used above:

import pandas as pd
import numpy as np
df = pd.DataFrame({'VendorNumber': [104, 205, 143, 166, 180],
               'VendorName': ['ACME Inc', 'Apogee LTD', 'Zenith Co', 'Hollerith Propulsion', 'Airtek Industrial'],
               'Active': [True, True, True, True, False],
               'OpenOrders': [200, 150, 290, 180, 0],
               'Purchases2015': [45000.0, 29000.0, 18000.0, 48000.0, 23000.0],
               'Purchases2016': [54000.0, 30450.0, 23400.0, 48960.0, 17250.0],
               'PercentGrowth': [0.20, 0.05, 0.30, 0.02, -0.25],
               'FirstPurchaseDate': ['2014-02-15', '2015-08-12', '2014-04-05', '2015-09-25', '2014-08-02']})
df['FirstPurchaseDate'] = df['FirstPurchaseDate'].astype('datetime64[ns]')
df
   VendorNumber            VendorName  Active  OpenOrders  Purchases2015  Purchases2016  PercentGrowth FirstPurchaseDate
0           104              ACME Inc    True         200        45000.0        54000.0           0.20        2014-02-15
1           205            Apogee LTD    True         150        29000.0        30450.0           0.05        2015-08-12
2           143             Zenith Co    True         290        18000.0        23400.0           0.30        2014-04-05
3           166  Hollerith Propulsion    True         180        48000.0        48960.0           0.02        2015-09-25
4           180     Airtek Industrial   False           0        23000.0        17250.0          -0.25        2014-08-02

To filter only certain rows from a DataFrame, call the query method with a boolean expression based on the column names.

df.query('OpenOrders>160')
   VendorNumber            VendorName  Active  OpenOrders  Purchases2015  Purchases2016  PercentGrowth FirstPurchaseDate
0           104              ACME Inc    True         200        45000.0        54000.0           0.20        2014-02-15
2           143             Zenith Co    True         290        18000.0        23400.0           0.30        2014-04-05
3           166  Hollerith Propulsion    True         180        48000.0        48960.0           0.02        2015-09-25

Filtering can be done with indices instead of queries:

df[(df.OpenOrders < 190) & (df.Active == True)]
   VendorNumber            VendorName  Active  OpenOrders  Purchases2015  Purchases2016  PercentGrowth FirstPurchaseDate
1           205            Apogee LTD    True         150        29000.0        30450.0           0.05        2015-08-12
3           166  Hollerith Propulsion    True         180        48000.0        48960.0           0.02        2015-09-25