Data preparation¶
Anaconda Enterprise supports data preparation using numeric libraries such as NumPy, SciPy, and Pandas.
These examples use this small data file vendors.csv
:
The columns are the vendor ID number, vendor name, month day and year of first purchase from the vendor, whether the account is currently active, the number of open orders, purchases in 2015 and 2016, and percent growth in orders from 2015 to 2016.
Converting data types¶
Computers handle many types of data, including integer numbers such as 365, floating point numbers such as 365.2425, strings such as “ACME Inc”, and more.
An operation such as division may work for integers and floating point numbers, but produce an error if used on strings.
Often data libraries such as pandas will automatically use the correct types, but they do provide ways to correct and change the types when needed. For example, you may wish to convert between an integer such as 25, the floating point number 25.0, and strings such as “25”, “25.0”, or “$25.00”.
Pandas data types or dtypes
correspond to similar Python types.
Strings are called str
in Python and object
in pandas.
Integers are called int
in Python and int64
in pandas, indicating that
pandas stores integers as 64-bit numbers.
Floating point numbers are called float
in Python and float64
in pandas,
also indicating that they are stored with 64 bits.
A boolean value, named for logician George Boole, can be either True or False.
These are called bool
in Python and bool
in pandas.
Pandas includes some data types with no corresponding native Python type:
datetime64
for date and time values, timedelta[ns]
for storing the
difference between two times as a number of nanoseconds, and category
where
each item is one of a list of strings.
Here we import the vendor data file and show the dtypes:
Try adding the 2015 and 2016 sales:
These columns were stored as the type “object”, and concatenated as strings, not added as numbers.
Examine more information about the DataFrame:
Vendor Number is a float
and not an int
. 2015 and 2016 sales, percent
growth, and open orders are stored as objects and not numbers. The month, day,
and year values should be converted to datetime64
, and the active column
should be converted to a boolean.
The data can be converted with the astype()
function, custom functions, or
pandas functions such as to_numeric()
or to_datetime()
.
astype()¶
The astype()
function can convert the Vendor Number column to int
:
astype()
returns a copy, so an assignment statement will convert the
original data. This can be checked by showing the dtypes
.
However, trying to convert the 2015 column to a float
or the Open Orders
column to an int
returns an error.
Even worse, trying to convert the Active column to a bool
completes with no
errors, but converts both Y and N values to True
.
astype()
works if the data is clean and can be interpreted simply as a
number, or if you want to convert a number to a string. Other conversions
require custom functions or pandas functions such as to_numeric()
or
to_datetime()
.
Custom conversion functions¶
This small custom function converts a currency string like the ones in the 2015
column to a float
by first removing the comma (,
) and dollar sign ($
)
characters.
Test the function on the 2015 column with the apply()
function:
Convert the 2015 and 2016 columns and show the dtypes
:
Convert the Percent Growth column:
NumPy’s np.where()
function is a good way to convert the Active column to
bool
. This code converts “Y” values to True
and all other values to
False
, then shows the dtypes
:
Pandas helper functions¶
The Open Orders column has several integers, but one string. Using astype()
on this column would produce an error, but the pd.to_numeric()
function
built in to pandas will convert the numeric values to numbers and any other
values to the “not a number” or “NaN” value built in to the floating point
number standard:
In this case, a non-numeric value in this field indicates that there are zero
open orders, so we can convert NaN values to zero with the function
fillna()
:
Similarly, the pd.to_datetime()
function built in to pandas can convert the
Month Day and Year columns to datetime64[ns]
:
Use these functions to change the DataFrame, then show the dtypes
:
Converting data as it is read¶
You can apply dtype
and converters
in the pd.read_csv()
function.
Defining dtype
is like performing astype()
on the data.
A dtype
or a converter
can only be applied once to a specified column.
If you try to apply both to the same column, the dtype
is skipped.
After converting as much of the data as possible in pd.read_csv()
, use code
similar to the previous examples to convert the rest.
We thank http://pbpython.com/pandas_dtypes.html for providing data preparation examples that inspired these examples.
Merging and joining data sets¶
You can use pandas to merge DataFrames:
The available merge methods are left
to use keys from the left frame only,
right
to use keys from the right frame only, outer
to use the union of
keys from both frames, and the default inner
to use the intersection of keys
from both frames.
This merge using the default inner join omits key combinations found in only one of the source DataFrames:
This example omits the rows with key1
and key2
set to K0, K1
, K2,
K1
, or K2, K0
.
Joins also copy information when necessary. The left DataFrame had one row with
the keys set to K1, K0
and the right DataFrame had two. The output DataFrame
has two, with the information from the left DataFrame copied into both rows.
The next example shows the results of a left, right, and outer merge on the same inputs. Empty cells are filled in with NaN values.
If a key combination appears more than once in both tables, the output will contain the Cartesian product of the associated data.
In this small example a key that appears twice in the left frame and three times in the right frame produces six rows in the output frame.
To prevent very large outputs and memory overflow, manage duplicate values in keys before joining large DataFrames.
While merging uses one or more columns as keys, joining uses the indexes, also known as row labels.
Join can also perform left, right, inner, and outer merges, and defaults to left.
This is equivalent to using merge with arguments instructing it to use the indexes:
You can join a frame indexed by a join key to a frame where the key is a column:
You can join on multiple keys if the passed DataFrame has a MultiIndex
:
Note that this defaulted to a left join, but other types are also available:
For more information, including examples of using merge to join a single index to a multi-index or to join two multi-indexes, see the pandas documentation on merging.
When column names in the input frames overlap, pandas appends suffixes to
disambiguate them. These default to _x
and _y
but you can customize
them:
Join has similar arguments lsuffix
and rsuffix
.
You can join a list or tuple of DataFrames on their indexes:
If you have two frames with similar indices and want to fill in missing values
in the left frame with values from the right frame, use the combine_first()
method:
The method update()
overwrites values in a frame with values from another
frame:
The pandas documentation on merging has more information, including examples of combining time series and other ordered data, with options to fill and interpolate missing data.
We thank the pandas documentation for many of these examples.
Filtering data¶
This example uses a vendors DataFrame similar to the one we used above:
To filter only certain rows from a DataFrame, call the query
method with a
boolean expression based on the column names.
Filtering can be done with indices instead of queries: