Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

CS602 –Data-Driven Development with –Spring’21

Handout 11

Pandas data manipulation

COMBINING DATA: MERGE AND JOIN

Already saw some examples of adding columns, rows: functions concat and append

Merge function provides SQL-style ‘join’ capabilities, based on equality of column or index values. https://pandas.pydata.org/pandas-docs/stable/merging.html

DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

Similar to SQL join; implemented via merge:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True,

suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

Returns: a data frame with data from left and right merged according to the params below.

If on parameter, specifying which columns to join in, not passed

and left_index and right_index are False, the intersection of the columns in the DataFrames will be inferred to be the join keys.

left, right: DataFrames to be merged. By default they will be merged by the equality of values in the commonly named column (intersection of columns).

how: One of 'left', 'right', 'outer', 'inner'.

on: Column or index names to join on. Must be found in both the left and right DataFrame objects.

left_on/right_on: Columns or index levels from the left/right DataFrame to use as keys. Can either be column names, index level names.

left_index/

right_index: If True, use the index (row labels) from the left DataFrame as its join key(s).

sort: Sort the result DataFrame by the join keys in lexicographical order. Defaults to True, setting to False will improve performance substantially in many cases.

suffixes: A tuple of string suffixes to apply to overlapping columns. Defaults to ('_x','_y').

Notice:

By default items are in sorted order by the join column. The result uses a new index.

Example 1

>>> df1

emp group

0 Jane Acc

1 Anne Eng

2 Bob Eng

3 Sam HR

>>>df2

emp start 0 Lisa 2004

1 Bob 2008

2 Jake 2012

3 Sam 2014

how – control inclusion of non-matching rows

>>> pd.merge (df1,df2) # how=’inner’

emp group start

0 Bob Eng 2008 1 Sam HR 2014

>>> pd.merge(df1,df2, how='outer') emp group start

0 Jane Acc NaN

1 Anne   Eng    NaN 2  Bob   Eng  2008.0 3 Sam HR 2014.0

4 Lisa   NaN 2004.0

5 Jake   NaN 2012.0

>>> pd.merge(df1,df2, how='left')

emp group start

0 Jane Acc NaN

1 Anne Eng NaN

>>> pd.merge(df1,df2, how='right')

emp group start

0 Bob Eng 2008 1 Sam HR 2014

2 Lisa NaN 2004

3 Jake NaN 2012

2

3

Bob Eng 2008.0

Sam HR 2014.0

Example 2

>>> df3 #employee group

emp group start

0 Jane Acc 2011

1 Anne Eng 2009

2 Bob Eng 2017

3 Sam HR 2017

>>> df4 #employee start

employee start

0 Bob 2008

1 Jake 2012

2 Sam 2014

>>> df5 #group manager

Out[25]:

group emp

0 Acc Lisa

1 Eng Tom

2 HR David

left_on/right_on, suffixes – if join column titles do not match, and if frames contain the same columns that need to be renamed in the result.

>>> pd.merge(df3, df4, left_on='emp', right_on = 'employee')

emp group start_x employee start_y

0 Bob Eng 2017 Bob 2008

1 Sam HR 2017 Sam 2014

>>> pd.merge(df3, df4, left_on='emp',right_on=’employee',suffixes = ('_group','_comp'))

emp group start_group employee start_comp

0 Bob Eng 2017 Bob 2008

1 Sam HR 2017 Sam 2014

left_index/right_index Merging by the index values (can also merge by index with column)

>>> pd.merge(df3, df4, left_index=True, right_index = True)

emp group start_x employee start_y

0 Jane Acc 2011 Bob 2008

1 Anne Eng 2009 Jake 2012

2 Bob Eng 2017 Sam 2014

on – which columns to use for joining

>>> pd.merge(df3, df5)

Empty DataFrame

Columns: [emp, group, start] Index: []

>>> pd.merge(df3, df5, on = 'group')

emp_x group start emp_y

0 Jane Acc 2011 Lisa

1 Anne Eng 2009 Tom

2 Bob Eng 2017 Tom

3 Sam HR 2017 David

Practice problems:

1. Given the above definitions, what is the result of pd.merge(df3, df4) ?

2. Both df3 and df5 contain data on employees and their group names. Create a data frame that combines all employees and their group info from df3, df5. The result should contain the following:

0

emp

Jane

group

Acc

1

Anne

Eng

2

Bob

Eng

3

Sam

HR

4

Lisa

Acc

5

Tom

Eng

6

David

HR

GROUPBY

Pandas groupby() function mimics the SQL group by clause, in creating groupings of data.

Following the grouping (a.k.a split), items in groups can be used to apply a function on them, and then

combine the results in a new data frame.

https://pandas.pydata.org/pandas-docs/stable/groupby.html https://pandas.pydata.org/pandas-docs/version/0.23/api.html#groupby

DataFrame.groupby(by=None, axis=0)

by: label, or list of labels, function or mapping

Used to determine the groups for the groupby.

A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted a (single) key

If by is a function, it’s called on each value of the object’s index.

If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups.

Example :

>>> staff = pd.DataFrame({'emp': ['Vic', 'Anne', 'Bob', 'Sam', 'Tom', 'Bob'],

'dept': ['Acc', 'Eng', 'Eng', 'HR', 'Acc', 'HR'], 'start':[2016, 2017, 2001, 2017, 2011, 2005 ],

'vac': [11, 12, 34, 12, 16, 5 ]}).sort_values('dept')

>>> staff

0

emp

Vic

dept

Acc

start

2016

vac

11

4

Tom

Acc

2011

16

1

Anne

Eng

2017

12

2

Bob

Eng

2001

34

3

Sam

HR

2017

12

5

Bob

HR

2005

5

>>> for g in staff.groupby(by = 'dept'): # Iterate over groups:

>>> print(g, type(g))

('Acc', emp dept start vac 0 Vic Acc 2016 11

4 Tom Acc 2011 16) ('Eng', emp dept start vac

1 Anne Eng 2017 12

2 Bob Eng 2001 34) ('HR', emp dept start vac

3 Sam HR 2017 12

5 Bob HR 2005 5)

>>> dg = staff.groupby(by = 'dept')

>>> dg.min() # Apply a group-function per each group

emp start vac

dept

Acc

Tom

2011

11

Eng

Anne

2001