CS602 –Data-Driven Development with –Spring’21 Handout 11
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
CS602 –Data-Driven Development with –Spring’21
Handout 11
Pandas data manipulation
COMBINING DATA: MERGE AND JOIN
Already saw some examples of adding columns, rows: functions concat and append
Merge function provides SQL-style ‘join’ capabilities, based on equality of column or index values. https://pandas.pydata.org/pandas-docs/stable/merging.html
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
Similar to SQL join; implemented via merge:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
Returns: a data frame with data from left and right merged according to the params below.
If on parameter, specifying which columns to join in, not passed
and left_index and right_index are False, the intersection of the columns in the DataFrames will be inferred to be the join keys.
left, right: DataFrames to be merged. By default they will be merged by the equality of values in the commonly named column (intersection of columns).
how: One of 'left', 'right', 'outer', 'inner'.
on: Column or index names to join on. Must be found in both the left and right DataFrame objects.
left_on/right_on: Columns or index levels from the left/right DataFrame to use as keys. Can either be column names, index level names.
left_index/
right_index: If True, use the index (row labels) from the left DataFrame as its join key(s).
sort: Sort the result DataFrame by the join keys in lexicographical order. Defaults to True, setting to False will improve performance substantially in many cases.
suffixes: A tuple of string suffixes to apply to overlapping columns. Defaults to ('_x','_y').
Notice:
By default items are in sorted order by the join column. The result uses a new index.
Example 1
>>> df1 emp group 0 Jane Acc 1 Anne Eng 2 Bob Eng 3 Sam HR |
>>>df2 emp start 0 Lisa 2004 1 Bob 2008 2 Jake 2012 3 Sam 2014 |
how – control inclusion of non-matching rows
>>> pd.merge (df1,df2) # how=’inner’ emp group start 0 Bob Eng 2008 1 Sam HR 2014 |
>>> pd.merge(df1,df2, how='outer') emp group start 0 Jane Acc NaN 1 Anne Eng NaN 2 Bob Eng 2008.0 3 Sam HR 2014.0 4 Lisa NaN 2004.0 5 Jake NaN 2012.0 |
||
>>> pd.merge(df1,df2, how='left')
emp group start 0 Jane Acc NaN 1 Anne Eng NaN |
>>> pd.merge(df1,df2, how='right')
emp group start 0 Bob Eng 2008 1 Sam HR 2014 2 Lisa NaN 2004 3 Jake NaN 2012 |
||
2 3 |
Bob Eng 2008.0 Sam HR 2014.0 |
|
Example 2
>>> df3 #employee group
emp group start 0 Jane Acc 2011 1 Anne Eng 2009 2 Bob Eng 2017 3 Sam HR 2017 |
>>> df4 #employee start
employee start 0 Bob 2008 1 Jake 2012 2 Sam 2014 |
>>> df5 #group manager Out[25]: group emp 0 Acc Lisa 1 Eng Tom 2 HR David |
left_on/right_on, suffixes – if join column titles do not match, and if frames contain the same columns that need to be renamed in the result.
>>> pd.merge(df3, df4, left_on='emp', right_on = 'employee') emp group start_x employee start_y 0 Bob Eng 2017 Bob 2008 1 Sam HR 2017 Sam 2014 |
>>> pd.merge(df3, df4, left_on='emp',right_on=’employee',suffixes = ('_group','_comp')) emp group start_group employee start_comp 0 Bob Eng 2017 Bob 2008 1 Sam HR 2017 Sam 2014 |
left_index/right_index Merging by the index values (can also merge by index with column)
>>> pd.merge(df3, df4, left_index=True, right_index = True)
emp group start_x employee start_y
0 Jane Acc 2011 Bob 2008
1 Anne Eng 2009 Jake 2012
2 Bob Eng 2017 Sam 2014
on – which columns to use for joining
>>> pd.merge(df3, df5)
Empty DataFrame Columns: [emp, group, start] Index: [] |
>>> pd.merge(df3, df5, on = 'group')
emp_x group start emp_y 0 Jane Acc 2011 Lisa 1 Anne Eng 2009 Tom 2 Bob Eng 2017 Tom 3 Sam HR 2017 David |
Practice problems:
1. Given the above definitions, what is the result of pd.merge(df3, df4) ?
2. Both df3 and df5 contain data on employees and their group names. Create a data frame that combines all employees and their group info from df3, df5. The result should contain the following:
0 |
emp Jane |
group Acc |
1 |
Anne |
Eng |
2 |
Bob |
Eng |
3 |
Sam |
HR |
4 |
Lisa |
Acc |
5 |
Tom |
Eng |
6 |
David |
HR |
GROUPBY
Pandas groupby() function mimics the SQL group by clause, in creating groupings of data.
Following the grouping (a.k.a split), items in groups can be used to apply a function on them, and then
combine the results in a new data frame.
https://pandas.pydata.org/pandas-docs/stable/groupby.html https://pandas.pydata.org/pandas-docs/version/0.23/api.html#groupby
DataFrame.groupby(by=None, axis=0)
by: label, or list of labels, function or mapping
Used to determine the groups for the groupby.
A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted a (single) key
If by is a function, it’s called on each value of the object’s index.
If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups.
Example :
>>> staff = pd.DataFrame({'emp': ['Vic', 'Anne', 'Bob', 'Sam', 'Tom', 'Bob'],
'dept': ['Acc', 'Eng', 'Eng', 'HR', 'Acc', 'HR'], 'start':[2016, 2017, 2001, 2017, 2011, 2005 ],
'vac': [11, 12, 34, 12, 16, 5 ]}).sort_values('dept')
>>> staff
0 |
emp Vic |
dept Acc |
start 2016 |
vac 11 |
4 |
Tom |
Acc |
2011 |
16 |
1 |
Anne |
Eng |
2017 |
12 |
2 |
Bob |
Eng |
2001 |
34 |
3 |
Sam |
HR |
2017 |
12 |
5 |
Bob |
HR |
2005 |
5 |
>>> for g in staff.groupby(by = 'dept'): # Iterate over groups:
>>> print(g, type(g))
('Acc', emp dept start vac 0 Vic Acc 2016 11
4 Tom Acc 2011 16)
1 Anne Eng 2017 12
2 Bob Eng 2001 34)
3 Sam HR 2017 12
5 Bob HR 2005 5)
>>> dg = staff.groupby(by = 'dept')
>>> dg.min() # Apply a group-function per each group
emp start vac
dept
Acc |
Tom |
2011 |
11 |
Eng |
Anne |
2001 |
2022-11-24