Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

CS602 –Data-Driven Development with –Spring’21

Handout 11

Pandas data manipulation

COMBINING DATA: MERGE AND JOIN

Already saw some examples of adding columns, rows: functions concat and append

Merge function provides SQL-style ‘join’ capabilities, based on equality of column or index values. https://pandas.pydata.org/pandas-docs/stable/merging.html

DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

Similar to SQL join; implemented via merge:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True,

suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

Returns: a data frame with data from left and right merged according to the params below.

If on parameter, specifying which columns to join in, not passed

and left_index and right_index are False, the intersection of the columns in the DataFrames will be inferred to be the join keys.

left, right: DataFrames to be merged. By default they will be merged by the equality of values in the commonly named column (intersection of columns).

how: One of 'left', 'right', 'outer', 'inner'.

on: Column or index names to join on. Must be found in both the left and right DataFrame objects.

left_on/right_on: Columns or index levels from the left/right DataFrame to use as keys. Can either be column names, index level names.

left_index/

right_index: If True, use the index (row labels) from the left DataFrame as its join key(s).

sort: Sort the result DataFrame by the join keys in lexicographical order. Defaults to True, setting to False will improve performance substantially in many cases.

suffixes: A tuple of string suffixes to apply to overlapping columns. Defaults to ('_x','_y').

Notice:

By default items are in sorted order by the join column. The result uses a new index.

Example 1

>>> df1

emp group

0 Jane Acc

1 Anne Eng

2 Bob Eng

3 Sam HR

>>>df2

emp start 0 Lisa 2004

1 Bob 2008

2 Jake 2012

3 Sam 2014

how – control inclusion of non-matching rows

>>> pd.merge (df1,df2) # how=’inner’

emp group start

0 Bob Eng 2008 1 Sam HR 2014

>>> pd.merge(df1,df2, how='outer') emp group start

0 Jane Acc NaN

1 Anne Eng NaN 2 Bob Eng 2008.0 3 Sam HR 2014.0

4 Lisa NaN 2004.0

5 Jake NaN 2012.0

>>> pd.merge(df1,df2, how='left')

emp group start

0 Jane Acc NaN

1 Anne Eng NaN

>>> pd.merge(df1,df2, how='right')

emp group start

0 Bob Eng 2008 1 Sam HR 2014

2 Lisa NaN 2004

3 Jake NaN 2012

Bob Eng 2008.0

Sam HR 2014.0

Example 2

>>> df3 #employee group

emp group start

0 Jane Acc 2011

1 Anne Eng 2009

2 Bob Eng 2017

3 Sam HR 2017

>>> df4 #employee start

employee start

0 Bob 2008

1 Jake 2012

2 Sam 2014

>>> df5 #group manager

Out[25]:

group emp

0 Acc Lisa

1 Eng Tom

2 HR David

left_on/right_on, suffixes – if join column titles do not match, and if frames contain the same columns that need to be renamed in the result.

>>> pd.merge(df3, df4, left_on='emp', right_on = 'employee')

emp group start_x employee start_y

0 Bob Eng 2017 Bob 2008

1 Sam HR 2017 Sam 2014

>>> pd.merge(df3, df4, left_on='emp',right_on=’employee',suffixes = ('_group','_comp'))

emp group start_group employee start_comp

0 Bob Eng 2017 Bob 2008

1 Sam HR 2017 Sam 2014

left_index/right_index Merging by the index values (can also merge by index with column)

>>> pd.merge(df3, df4, left_index=True, right_index = True)

emp group start_x employee start_y

0 Jane Acc 2011 Bob 2008

1 Anne Eng 2009 Jake 2012

2 Bob Eng 2017 Sam 2014

on – which columns to use for joining

>>> pd.merge(df3, df5)

Empty DataFrame

Columns: [emp, group, start] Index: []

>>> pd.merge(df3, df5, on = 'group')

emp_x group start emp_y

0 Jane Acc 2011 Lisa

1 Anne Eng 2009 Tom

2 Bob Eng 2017 Tom

3 Sam HR 2017 David

Practice problems:

1. Given the above definitions, what is the result of pd.merge(df3, df4) ?

2. Both df3 and df5 contain data on employees and their group names. Create a data frame that combines all employees and their group info from df3, df5. The result should contain the following: