Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

Lab 2 Notes

Overview of Quantitative Methods

Data manipulation & Tidyverse

We can make our R experience much more pleasant with user-made libraries:

#install.packages("readxl")

#install.packages("tidyverse")

library(readxl)

#(this is considered bad taste btw, and it's better to load only the tidyverse packages you need)

library(tidyverse)

## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --

## v dplyr 1.1.4 v readr 2.1.4

## v forcats 1.0.0 v stringr 1.5.1

## v ggplot2 3.4.4 v tibble 3.2.1

## v lubridate 1.9.3 v tidyr 1.3.0

## v purrr 1.0.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --

## x dplyr::filter() masks stats::filter()

## x dplyr::lag() masks stats::lag()

## i Use the conflicted package () to force all conflicts to become errors

We will work with the data from the American coffee preference test this time.

gact <- read_xlsx("GACTT_RESULTS_ANONYMIZED_LAB2.xlsx")

gact_codebook <- read_xlsx("GACTT_RESULTS_ANONYMIZED_LAB2.xlsx",

sheet = "codebook") #to read particular xlsx sheet

head(gact_codebook)

## # A tibble: 6 x 2

## variable_name survey_question

##

## 1 submission_id Submission ID

## 2 zip What is your ZIP code?

## 3 age What is your age?

## 4 gender Gender

## 5 cups How many cups of coffee do you typically drink per day?

## 6 where Where do you typically drink coffee?

glimpse(gact)

## Rows: 3,280

## Columns: 6

## $ submission_id "gMR29l", "BkPN0e", "W5G8jj", "4xWgGr", "QD27Q8", "V0LPe~

## $ zip NA, NA, NA, NA, NA, NA, "10023", NA, NA, NA, NA, "12345"~

## $ age "18-24 years old", "25-34 years old", "25-34 years old",~

## $ gender NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Oth~

## $ cups NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "2", "1", "3~

## $ where NA, NA, NA, NA, NA, NA, "At a cafe, At the office, At ho~

Tidyverse is the set of libraries which make R lot cooler. For example, it has the pipe %>% which applies function to object:

gact %>% head(3)

## # A tibble: 3 x 6

## submission_id zip age gender cups where

##

## 1 gMR29l 18-24 years old

## 2 BkPN0e 25-34 years old

## 3 W5G8jj 25-34 years old

head(gact, 3) #equivalent

## # A tibble: 3 x 6

## submission_id zip age gender cups where

##

## 1 gMR29l 18-24 years old

## 2 BkPN0e 25-34 years old

## 3 W5G8jj 25-34 years old

Tidyverse also introduces a ton of functions for data manipulation (with pipe we don’t change the original object):

select(gact, submission_id)

## # A tibble: 3,280 x 1

## submission_id

##

## 1 gMR29l

## 2 BkPN0e

## 3 W5G8jj

## 4 4xWgGr

## 5 QD27Q8

## 6 V0LPeM

## 7 V0Gaxg

## 8 AdzRL0

## 9 EXQLWN

## 10 xpa2K9

## # i 3,270 more rows

gact %>% select(submission_id)

## # A tibble: 3,280 x 1

## submission_id

##

## 1 gMR29l

## 2 BkPN0e

## 3 W5G8jj

## 4 4xWgGr

## 5 QD27Q8

## 6 V0LPeM

## 7 V0Gaxg

## 8 AdzRL0

## 9 EXQLWN

## 10 xpa2K9

## # i 3,270 more rows

gact %>% rename(submission_ID = submission_id)

## # A tibble: 3,280 x 6

## submission_ID zip age gender cups where

##

## 1 gMR29l 18-24 years old

## 2 BkPN0e 25-34 years old

## 3 W5G8jj 25-34 years old

## 4 4xWgGr 35-44 years old

## 5 QD27Q8 25-34 years old

## 6 V0LPeM 55-64 years old

## 7 V0Gaxg 10023 18-24 years old At a cafe, At the office, A~

## 8 AdzRL0

## 9 EXQLWN

## 10 xpa2K9

## # i 3,270 more rows

gact %>% filter(age == "18-24 years old")

## # A tibble: 399 x 6

## submission_id zip age gender cups where

##

## 1 gMR29l 18-24 years old

## 2 V0Gaxg 10023 18-24 years old At a cafe, At the office, ~

## 3 jyDqva 2312 18-24 years old 2 At the office

## 4 eB7QoJ 12345 18-24 years old 3 At home

## 5 24xzlg 11221 18-24 years old At home

## 6 AlbYao 85225 18-24 years old 2 At the office, At home

## 7 42NYWO 18-24 years old

## 8 J2kj9J 110034 18-24 years old 2

## 9 8dpDNA 61761 18-24 years old 3

## 10 vNx6A4 61600 18-24 years old

## # i 389 more rows

gact %>% filter(age == "18-24 years old" & cups >= 3.0) #what's the problem here?

## # A tibble: 99 x 6

## submission_id zip age gender cups where

##

## 1 eB7QoJ 12345 18-24 years old 3 At home

## 2 8dpDNA 61761 18-24 years old 3

## 3 qbEQPO 61600 18-24 years old 3 At the office, At~

## 4 BGboZR 92008 18-24 years old 3 At home, At a caf~

## 5 aBrp0W 14456 18-24 years old 4 At home

## 6 EPbVxA 90039 18-24 years old Female Less than 1 At a cafe, At home

## 7 WApa9N 90036 18-24 years old Male Less than 1 At the office

## 8 zKQA18 95834 18-24 years old Non-binary Less than 1 On the go, At home

## 9 eBxO5q 90036 18-24 years old Non-binary Less than 1 At home

## 10 MAdj6k 83709 18-24 years old Male 3 At home, At a cafe

## # i 89 more rows

table(gact$cups) #`cups` has weird categories while substantially it is a number

##

## 1 2 3 4 Less than 1 More than 4

## 1005 1378 404 101 246 60

#how to recode strings:

gact$cups[15:30] %>% recode("Less than 1" = '0', "More than 4" = '5')

## [1] "3" "1" "2" "2" "3" NA NA "5" "1" "2" "1" NA "3" NA "1" "2"

gact$cups_number <- gact$cups %>% recode("Less than 1" = '0', "More than 4" = '5')

#time to change type:

gact %>% mutate(cups_num = as.numeric(cups_number),

cups_norm = cups_num - mean(cups_num, na.rm = T)) #remove NA = TRUE

## # A tibble: 3,280 x 9

## submission_id zip age gender cups where cups_number cups_num cups_norm

##

## 1 gMR29l 18-24 ~ NA NA

## 2 BkPN0e 25-34 ~ NA NA

## 3 W5G8jj 25-34 ~ NA NA

## 4 4xWgGr 35-44 ~ NA NA

## 5 QD27Q8 25-34 ~ NA NA

## 6 V0LPeM 55-64 ~ NA NA

## 7 V0Gaxg 10023 18-24 ~ At a~ NA NA

## 8 AdzRL0 NA NA

## 9 EXQLWN NA NA

## 10 xpa2K9 NA NA

## # i 3,270 more rows

gact$cups_num <- as.numeric(gact$cups_number)

gact %>% arrange(cups_num, desc = T)

## # A tibble: 3,280 x 8

## submission_id zip age gender cups where cups_number cups_num

##

## 1 bl41Zg 34322 45-54 years old Male Less t~ At a~ 0 0

## 2 KMBroK 11768 25-34 years old Female Less t~ At h~ 0 0

## 3 6dQ0eO 11201 25-34 years old Male Less t~ At t~ 0 0

## 4 eBxgQO 94043 25-34 years old Male Less t~ None~ 0 0

## 5 EPbVxA 90039 18-24 years old Female Less t~ At a~ 0 0

## 6 dY2GGD 80234 25-34 years old Female Less t~ On t~ 0 0

## 7 WApa9N 90036 18-24 years old Male Less t~ At t~ 0 0

## 8 BGbJE5 80234 25-34 years old Male Less t~ None~ 0 0

## 9 VZ1WVN 15206 25-34 years old Male Less t~ At h~ 0 0

## 10 RzBq4P 89148 35-44 years old Male Less t~ At h~ 0 0

## # i 3,270 more rows

With pipe we can also beautifully chain functions (you can think of the syntax as and SQL query)

#let's also create the numeric column for age

table(gact$age)

##

## <18 years="" old="">65 years old 18-24 years old 25-34 years old 35-44 years old

## 13 63 399 1652 760

## 45-54 years old 55-64 years old

## 232 130

gact %>%

select(submission_id, cups_num, age) %>%

filter(cups_num >= 3) %>%

mutate(age_num = str_remove(age, " years old"))

## # A tibble: 565 x 4

## submission_id cups_num age age_num

##

## 1 Zd694B 3 <18 years old <18

## 2 QAeYZY 3 >65 years old >65

## 3 eB7QoJ 3 18-24 years old 18-24

## 4 1r2YJQ 5 25-34 years old 25-34

## 5 aBr18q 3 25-34 years old 25-34

## 6 8dpDNA 3 18-24 years old 18-24

## 7 qbEQPO 3 18-24 years old 18-24

## 8 BGboZR 3 18-24 years old 18-24

## 9 RzBeZj 5 45-54 years old 45-54

## 10 5dqajd 5 25-34 years old 25-34

## # i 555 more rows

#equivalent to, but more readable!

mutate(filter(select(gact, submission_id, cups_num, age), cups_num >= 3),

age_num = str_remove(age, " years old"))

## # A tibble: 565 x 4

## submission_id cups_num age age_num

##

## 1 Zd694B 3 <18 years old <18

## 2 QAeYZY 3 >65 years old >65

## 3 eB7QoJ 3 18-24 years old 18-24

## 4 1r2YJQ 5 25-34 years old 25-34

## 5 aBr18q 3 25-34 years old 25-34

## 6 8dpDNA 3 18-24 years old 18-24

## 7 qbEQPO 3 18-24 years old 18-24

## 8 BGboZR 3 18-24 years old 18-24

## 9 RzBeZj 5 45-54 years old 45-54

## 10 5dqajd 5 25-34 years old 25-34

## # i 555 more rows

gact$age_num <- gact$age %>%

recode("<18 years old" = 18,

">65 years old" = 65,

"18-24 years old" = 21,

"25-34 years old" = 30,

"35-44 years old" = 40,

"45-54 years old" = 50,

"55-64 years old" = 60)

Also, we can make summaries:

gact %>%

summarize(mean_cups = mean(cups_num),

mean_cups_narm = mean(cups_num, na.rm = T),

median_cups = median(cups_num, na.rm = T))

## # A tibble: 1 x 3

## mean_cups mean_cups_narm median_cups

##

## 1 NA 1.78 2

gact %>%

group_by(gender) %>% #same, but for each gender

summarize(mean_cups = mean(cups_num, na.rm = T),

median_cups = median(cups_num, na.rm = T),

n_obs = n())

## # A tibble: 6 x 4

## gender mean_cups median_cups n_obs

##

## 1 Female 1.44 1 579

## 2 Male 1.87 2 2134

## 3 Non-binary 1.48 1 96

## 4 Other (please specify) 1.78 2 9

## 5 Prefer not to say 1.68 2 28

## 6 1.86 2 434

gact %>%

select(gender, age, cups_num) %>%

filter(age %in% c("<18 years old", "18-24 years old", "25-34 years old")) %>%

group_by(gender, age) %>% #we can group by many columns

summarize(mean_cups = mean(cups_num, na.rm = T),

median_cups = median(cups_num, na.rm = T),

n_obs = n())

## ‘summarise()‘ has grouped output by ’gender’. You can override using the

## ‘.groups‘ argument.

## # A tibble: 16 x 5

## # Groups: gender [6]

## gender age mean_cups median_cups n_obs

##

## 1 Female 18-24 years old 1.14 1 59

## 2 Female 25-34 years old 1.36 1 296

## 3 Female <18 years old 2 2 1

## 4 Male 18-24 years old 1.59 2 254

## 5 Male 25-34 years old 1.78 2 1105

## 6 Male <18 years old 1.6 2 5

## 7 Non-binary 18-24 years old 1 1 12

## 8 Non-binary 25-34 years old 1.46 1 65

## 9 Other (please specify) 18-24 years old 1.25 1.5 4

## 10 Other (please specify) 25-34 years old 2 2 3

## 11 Other (please specify) <18 years old 3 3 1

## 12 Prefer not to say 18-24 years old 0.5 0.5 4

## 13 Prefer not to say 25-34 years old 1.67 2 12

## 14 18-24 years old 1.79 2 66

## 15 25-34 years old 1.66 2 171

## 16 <18 years old 1.4 1 6

Now to some serious things. Question “where do you drink coffee” (variable where in our data) is multiple-choice and the result is total mess, let’s make nice dummies from it.

gact %>% select(where) %>% table() %>% head()

## where

## At a cafe

## 49

## At a cafe, At home

## 70

## At a cafe, At home, At the office

## 14

## At a cafe, At home, At the office, On the go

## 5

## At a cafe, At home, On the go

## 11

## At a cafe, At home, On the go, At the office

## 6

gact %>%

drop_na(where) %>% #we don't need missing observations for this

select(submission_id, where) %>%

separate_rows(where, sep = ",\\s*") %>% #that's regex and it's black magic

# we flattened our data so each submission now has as many rows as there were choices

mutate(dummy = 1)

## # A tibble: 5,748 x 3

## submission_id where dummy

##

## 1 V0Gaxg At a cafe 1

## 2 V0Gaxg At the office 1

## 3 V0Gaxg At home 1

## 4 V0Gaxg On the go 1

## 5 7WWB4A On the go 1

## 6 7WWB4A At a cafe 1

## 7 7WWB4A At home 1

## 8 7WWB4A At the office 1

## 9 pKL8aB At the office 1

## 10 Zd694B At home 1

## # i 5,738 more rows

gact %>%

#drop_na(where) %>% #this time let's not drop the missing data

select(submission_id, where) %>%

separate_rows(where, sep = ",\\s*") %>%

mutate(dummy = 1) %>%

#pivot_* is a VERY useful function for transforming your data from long to wide format or vice versa

pivot_wider(names_from = where, values_from = dummy, values_fill = 0)

## # A tibble: 3,280 x 7

## submission_id ‘NA‘ ‘At a cafe‘ ‘At the office‘ ‘At home‘ ‘On the go‘

##

## 1 gMR29l 1 0 0 0 0

## 2 BkPN0e 1 0 0 0 0

## 3 W5G8jj 1 0 0 0 0

## 4 4xWgGr 1 0 0 0 0

## 5 QD27Q8 1 0 0 0 0

## 6 V0LPeM 1 0 0 0 0

## 7 V0Gaxg 0 1 1 1 1

## 8 AdzRL0 1 0 0 0 0

## 9 EXQLWN 1 0 0 0 0

## 10 xpa2K9 1 0 0 0 0

## # i 3,270 more rows

## # i 1 more variable: ‘None of these‘

We can of course just mutate the original dataset, but let’s also practice merging data:

gact_where <- gact %>%

select(submission_id, where) %>%

separate_rows(where, sep = ",\\s*") %>%

mutate(dummy = 1) %>%

pivot_wider(names_from = where, values_from = dummy, values_fill = 0) %>%

select(-`NA`) %>% #drop `NA` column

rename("where_cafe" = "At a cafe", #rename to get rid of whitespaces

"where_office" = "At the office",

"where_home" = "At home",

"where_go" = "On the go",

"where_other" = "None of these")

gact <- left_join(gact, #identical to SQL left join

gact_where, #to each row in gact add data from gact_where

by = "submission_id") #matching by `submission_id` value