Lab 2 Notes

Overview of Quantitative Methods

Data manipulation & Tidyverse

We can make working in R much more pleasant with user-contributed packages:

#install.packages("readxl")

#install.packages("tidyverse")

library(readxl)

# loading the whole tidyverse is considered poor style; it is better to load only the tidyverse packages you need

library(tidyverse)

## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --

## v dplyr 1.1.4 v readr 2.1.4

## v forcats 1.0.0 v stringr 1.5.1

## v ggplot2 3.4.4 v tibble 3.2.1

## v lubridate 1.9.3 v tidyr 1.3.0

## v purrr 1.0.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --

## x dplyr::filter() masks stats::filter()

## x dplyr::lag() masks stats::lag()

## i Use the conflicted package to force all conflicts to become errors

This time we will work with data from the Great American Coffee Taste Test (GACTT).

gact <- read_xlsx("GACTT_RESULTS_ANONYMIZED_LAB2.xlsx")

gact_codebook <- read_xlsx("GACTT_RESULTS_ANONYMIZED_LAB2.xlsx",
                           sheet = "codebook") # read a particular sheet of the xlsx file

head(gact_codebook)

## # A tibble: 6 x 2

## variable_name survey_question

## 1 submission_id Submission ID

## 2 zip What is your ZIP code?

## 3 age What is your age?

## 4 gender Gender

## 5 cups How many cups of coffee do you typically drink per day?

## 6 where Where do you typically drink coffee?

glimpse(gact)

## Rows: 3,280

## Columns: 6

## $ submission_id "gMR29l", "BkPN0e", "W5G8jj", "4xWgGr", "QD27Q8", "V0LPe~

## $ zip NA, NA, NA, NA, NA, NA, "10023", NA, NA, NA, NA, "12345"~

## $ age "18-24 years old", "25-34 years old", "25-34 years old",~

## $ gender NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Oth~

## $ cups NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "2", "1", "3~

## $ where NA, NA, NA, NA, NA, NA, "At a cafe, At the office, At ho~

The tidyverse is a collection of packages that make R a lot nicer to work with. For example, it provides the pipe %>%, which passes the object on its left as the first argument of the function on its right:

gact %>% head(3)

## # A tibble: 3 x 6

## submission_id zip age gender cups where

## 1 gMR29l <NA> 18-24 years old <NA> <NA> <NA>

## 2 BkPN0e <NA> 25-34 years old <NA> <NA> <NA>

## 3 W5G8jj <NA> 25-34 years old <NA> <NA> <NA>

head(gact, 3) #equivalent

## # A tibble: 3 x 6

## submission_id zip age gender cups where

## 1 gMR29l <NA> 18-24 years old <NA> <NA> <NA>

## 2 BkPN0e <NA> 25-34 years old <NA> <NA> <NA>

## 3 W5G8jj <NA> 25-34 years old <NA> <NA> <NA>
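To make the pipe rule concrete, here is a minimal sketch on a made-up vector (x is not part of the lab data). It only illustrates that x %>% f(y) is rewritten as f(x, y), so a chain of pipes equals the corresponding nested call.

```r
library(dplyr)  # dplyr re-exports the %>% pipe from magrittr

# Toy vector, invented for illustration only
x <- c(4, 9, 16)

# Piped version: take x, then sqrt(), then sum()
piped <- x %>% sqrt() %>% sum()

# Nested version: the same computation read inside-out
nested <- sum(sqrt(x))

identical(piped, nested)  # both give 2 + 3 + 4 = 9
```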

The tidyverse also provides a ton of functions for data manipulation. Note that piping an object into a function does not change the original object; the result is only kept if you assign it:

select(gact, submission_id)

## # A tibble: 3,280 x 1

## submission_id

## 1 gMR29l

## 2 BkPN0e

## 3 W5G8jj

## 4 4xWgGr

## 5 QD27Q8

## 6 V0LPeM

## 7 V0Gaxg

## 8 AdzRL0

## 9 EXQLWN

## 10 xpa2K9

## # i 3,270 more rows

gact %>% select(submission_id)

## # A tibble: 3,280 x 1

## submission_id

## 1 gMR29l

## 2 BkPN0e

## 3 W5G8jj

## 4 4xWgGr

## 5 QD27Q8

## 6 V0LPeM

## 7 V0Gaxg

## 8 AdzRL0

## 9 EXQLWN

## 10 xpa2K9

## # i 3,270 more rows

gact %>% rename(submission_ID = submission_id)

## # A tibble: 3,280 x 6

## submission_ID zip age gender cups where

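## 1 gMR29l <NA> 18-24 years old <NA> <NA> <NA>

## 2 BkPN0e <NA> 25-34 years old <NA> <NA> <NA>

## 3 W5G8jj <NA> 25-34 years old <NA> <NA> <NA>

## 4 4xWgGr <NA> 35-44 years old <NA> <NA> <NA>

## 5 QD27Q8 <NA> 25-34 years old <NA> <NA> <NA>

## 6 V0LPeM <NA> 55-64 years old <NA> <NA> <NA>

## 7 V0Gaxg 10023 18-24 years old <NA> <NA> At a cafe, At the office, A~

## 8 AdzRL0 <NA> <NA> <NA> <NA> <NA>

## 9 EXQLWN <NA> <NA> <NA> <NA> <NA>

## 10 xpa2K9 <NA> <NA> <NA> <NA> <NA>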

## # i 3,270 more rows
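A small sketch of the point that these verbs return a modified copy rather than editing the data in place. The tibble toy is a hypothetical stand-in for gact.

```r
library(dplyr)

# Hypothetical two-row stand-in for gact
toy <- tibble(id = c("a", "b"), cups = c("1", "2"))

# This prints a renamed copy; toy itself is untouched
toy %>% rename(ID = id)
names(toy)

# To keep the change, assign the result
toy_renamed <- toy %>% rename(ID = id)
names(toy_renamed)
```

This is why the notes later write gact$cups_num <- ... or gact <- left_join(...) whenever a result should persist.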

gact %>% filter(age == "18-24 years old")

## # A tibble: 399 x 6

## submission_id zip age gender cups where

## 1 gMR29l <NA> 18-24 years old <NA> <NA> <NA>

## 2 V0Gaxg 10023 18-24 years old <NA> <NA> At a cafe, At the office, ~

## 3 jyDqva 2312 18-24 years old <NA> 2 At the office

## 4 eB7QoJ 12345 18-24 years old <NA> 3 At home

## 5 24xzlg 11221 18-24 years old <NA> <NA> At home

## 6 AlbYao 85225 18-24 years old <NA> 2 At the office, At home

## 7 42NYWO <NA> 18-24 years old <NA> <NA> <NA>

## 8 J2kj9J 110034 18-24 years old <NA> 2 <NA>

## 9 8dpDNA 61761 18-24 years old <NA> 3 <NA>

## 10 vNx6A4 61600 18-24 years old <NA> <NA> <NA>

## # i 389 more rows

gact %>% filter(age == "18-24 years old" & cups >= 3.0) #what's the problem here?

## # A tibble: 99 x 6

## submission_id zip age gender cups where

## 1 eB7QoJ 12345 18-24 years old <NA> 3 At home

## 2 8dpDNA 61761 18-24 years old <NA> 3 <NA>

## 3 qbEQPO 61600 18-24 years old <NA> 3 At the office, At~

## 4 BGboZR 92008 18-24 years old <NA> 3 At home, At a caf~

## 5 aBrp0W 14456 18-24 years old <NA> 4 At home

## 6 EPbVxA 90039 18-24 years old Female Less than 1 At a cafe, At home

## 7 WApa9N 90036 18-24 years old Male Less than 1 At the office

## 8 zKQA18 95834 18-24 years old Non-binary Less than 1 On the go, At home

## 9 eBxO5q 90036 18-24 years old Non-binary Less than 1 At home

## 10 MAdj6k 83709 18-24 years old Male 3 At home, At a cafe

## # i 89 more rows

table(gact$cups) # `cups` is stored as text with odd categories, although in substance it is a number

##           1           2           3           4 Less than 1 More than 4

##        1005        1378         404         101         246          60
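Here is a sketch of why the filter above lets "Less than 1" through. When a character vector is compared with a number, R converts the number to text and compares the strings in alphabetical (collation) order. The vector below is made up; the value "10" is not a category in the survey and is included only to show how string ordering differs from numeric ordering.

```r
# Made-up character vector mimicking the cups column ("10" is hypothetical)
cups <- c("1", "2", "3", "4", "Less than 1", "More than 4", "10")

# 3.0 is coerced to the string "3", so this is a text comparison.
# In typical locales letters sort after digits, so the text categories pass.
cups >= 3.0

# "10" starts with "1", which sorts before "3", so as text it fails the test
"10" >= 3

# After conversion to a number the comparison behaves as expected
as.numeric("10") >= 3
```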

#how to recode strings:

gact$cups[15:30] %>% recode("Less than 1" = '0', "More than 4" = '5')

## [1] "3" "1" "2" "2" "3" NA NA "5" "1" "2" "1" NA "3" NA "1" "2"

gact$cups_number <- gact$cups %>% recode("Less than 1" = '0', "More than 4" = '5')

#time to change type:

gact %>% mutate(cups_num = as.numeric(cups_number),
                cups_norm = cups_num - mean(cups_num, na.rm = TRUE)) # na.rm = TRUE ignores missing values when averaging

## # A tibble: 3,280 x 9

## submission_id zip age gender cups where cups_number cups_num cups_norm

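## 1 gMR29l <NA> 18-24 ~ <NA> <NA> <NA> <NA> NA NA

## 2 BkPN0e <NA> 25-34 ~ <NA> <NA> <NA> <NA> NA NA

## 3 W5G8jj <NA> 25-34 ~ <NA> <NA> <NA> <NA> NA NA

## 4 4xWgGr <NA> 35-44 ~ <NA> <NA> <NA> <NA> NA NA

## 5 QD27Q8 <NA> 25-34 ~ <NA> <NA> <NA> <NA> NA NA

## 6 V0LPeM <NA> 55-64 ~ <NA> <NA> <NA> <NA> NA NA

## 7 V0Gaxg 10023 18-24 ~ <NA> <NA> At a~ <NA> NA NA

## 8 AdzRL0 <NA> <NA> <NA> <NA> <NA> <NA> NA NA

## 9 EXQLWN <NA> <NA> <NA> <NA> <NA> <NA> NA NA

## 10 xpa2K9 <NA> <NA> <NA> <NA> <NA> <NA> NA NA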

## # i 3,270 more rows
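The recode-then-convert steps can be checked on a tiny made-up vector (cups_raw is invented for illustration). Values that are not named in recode() are left as they are, and NA stays NA through both steps.

```r
library(dplyr)

# Invented sample of raw answers, including a missing one
cups_raw <- c("2", "Less than 1", NA, "More than 4")

# Step 1: replace the text categories with digit strings
cups_chr <- recode(cups_raw, "Less than 1" = "0", "More than 4" = "5")

# Step 2: convert the character vector to numeric
cups_num <- as.numeric(cups_chr)
cups_num  # 2 0 NA 5
```

Note that coding "Less than 1" as 0 and "More than 4" as 5 is a simplification made in these notes, not something the data dictate.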

gact$cups_num <- as.numeric(gact$cups_number)

gact %>% arrange(cups_num) # ascending by default; use arrange(desc(cups_num)) to sort in descending order

## # A tibble: 3,280 x 8

## submission_id zip age gender cups where cups_number cups_num

## 1 bl41Zg 34322 45-54 years old Male Less t~ At a~ 0 0

## 2 KMBroK 11768 25-34 years old Female Less t~ At h~ 0 0

## 3 6dQ0eO 11201 25-34 years old Male Less t~ At t~ 0 0

## 4 eBxgQO 94043 25-34 years old Male Less t~ None~ 0 0

## 5 EPbVxA 90039 18-24 years old Female Less t~ At a~ 0 0

## 6 dY2GGD 80234 25-34 years old Female Less t~ On t~ 0 0

## 7 WApa9N 90036 18-24 years old Male Less t~ At t~ 0 0

## 8 BGbJE5 80234 25-34 years old Male Less t~ None~ 0 0

## 9 VZ1WVN 15206 25-34 years old Male Less t~ At h~ 0 0

## 10 RzBq4P 89148 35-44 years old Male Less t~ At h~ 0 0

## # i 3,270 more rows
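The output above is sorted from the smallest value up, which is what arrange() does by default. For a descending sort, dplyr expects the column to be wrapped in desc() rather than passed a separate argument. A minimal sketch on a hypothetical tibble:

```r
library(dplyr)

# Hypothetical three-row stand-in for gact
toy <- tibble(id = c("a", "b", "c"), cups_num = c(2, 0, 5))

# Ascending (default): b, a, c
asc <- toy %>% arrange(cups_num)

# Descending: wrap the column in desc(): c, a, b
dsc <- toy %>% arrange(desc(cups_num))

asc$id
dsc$id
```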

With the pipe we can also chain functions neatly (you can think of the syntax as being like an SQL query):

#let's also create the numeric column for age

table(gact$age)

##   <18 years old   >65 years old 18-24 years old 25-34 years old 35-44 years old

##              13              63             399            1652             760

## 45-54 years old 55-64 years old

##             232             130

gact %>%
  select(submission_id, cups_num, age) %>%
  filter(cups_num >= 3) %>%
  mutate(age_num = str_remove(age, " years old"))

## # A tibble: 565 x 4

## submission_id cups_num age age_num

## 1 Zd694B 3 <18 years old <18

## 2 QAeYZY 3 >65 years old >65

## 3 eB7QoJ 3 18-24 years old 18-24

## 4 1r2YJQ 5 25-34 years old 25-34

## 5 aBr18q 3 25-34 years old 25-34

## 6 8dpDNA 3 18-24 years old 18-24

## 7 qbEQPO 3 18-24 years old 18-24

## 8 BGboZR 3 18-24 years old 18-24

## 9 RzBeZj 5 45-54 years old 45-54

## 10 5dqajd 5 25-34 years old 25-34

## # i 555 more rows

# the piped version above is equivalent to this nested call, but more readable!

mutate(filter(select(gact, submission_id, cups_num, age), cups_num >= 3),
       age_num = str_remove(age, " years old"))

## # A tibble: 565 x 4

## submission_id cups_num age age_num

## 1 Zd694B 3 <18 years old <18

## 2 QAeYZY 3 >65 years old >65

## 3 eB7QoJ 3 18-24 years old 18-24

## 4 1r2YJQ 5 25-34 years old 25-34

## 5 aBr18q 3 25-34 years old 25-34

## 6 8dpDNA 3 18-24 years old 18-24

## 7 qbEQPO 3 18-24 years old 18-24

## 8 BGboZR 3 18-24 years old 18-24

## 9 RzBeZj 5 45-54 years old 45-54

## 10 5dqajd 5 25-34 years old 25-34

## # i 555 more rows
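As a quick check of what str_remove() does here, a sketch on three made-up strings: it deletes the first match of the pattern and leaves the rest of each string alone, so the result is still text, not a number.

```r
library(stringr)

# Made-up sample of age labels
age <- c("<18 years old", "18-24 years old", ">65 years old")

# Drop the " years old" suffix; the result is still a character vector
age_short <- str_remove(age, " years old")
age_short  # "<18" "18-24" ">65"
```

Because labels such as "18-24" are still not numbers, the notes go on to map each bracket to a single numeric value with recode().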

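# map each age bracket to one number: roughly the midpoint of the bracket,
# and the boundary value for the open-ended brackets
gact$age_num <- gact$age %>%
  recode("<18 years old" = 18,
         ">65 years old" = 65,
         "18-24 years old" = 21,
         "25-34 years old" = 30,
         "35-44 years old" = 40,
         "45-54 years old" = 50,
         "55-64 years old" = 60)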

We can also compute summary statistics:

gact %>%
  summarize(mean_cups = mean(cups_num),
            mean_cups_narm = mean(cups_num, na.rm = TRUE),
            median_cups = median(cups_num, na.rm = TRUE))

## # A tibble: 1 x 3

## mean_cups mean_cups_narm median_cups

## 1 NA 1.78 2
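The NA in the first column comes from how R treats missing values: any mean that includes an NA is itself NA unless the missing values are dropped first. A minimal sketch on an invented vector:

```r
# Invented vector with one missing value
x <- c(1, 2, NA, 3)

mean(x)                    # NA: the missing value propagates
mean(x, na.rm = TRUE)      # 2: computed on 1, 2, 3 only
median(x, na.rm = TRUE)    # 2
```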

gact %>%
  group_by(gender) %>% # same, but for each gender
  summarize(mean_cups = mean(cups_num, na.rm = TRUE),
            median_cups = median(cups_num, na.rm = TRUE),
            n_obs = n())

## # A tibble: 6 x 4

## gender mean_cups median_cups n_obs

## 1 Female 1.44 1 579

## 2 Male 1.87 2 2134

## 3 Non-binary 1.48 1 96

## 4 Other (please specify) 1.78 2 9

## 5 Prefer not to say 1.68 2 28

## 6 <NA> 1.86 2 434

gact %>%
  select(gender, age, cups_num) %>%
  filter(age %in% c("<18 years old", "18-24 years old", "25-34 years old")) %>%
  group_by(gender, age) %>% # we can group by many columns
  summarize(mean_cups = mean(cups_num, na.rm = TRUE),
            median_cups = median(cups_num, na.rm = TRUE),
            n_obs = n())

## `summarise()` has grouped output by 'gender'. You can override using the

## `.groups` argument.

## # A tibble: 16 x 5

## # Groups: gender [6]

## gender age mean_cups median_cups n_obs

## 1 Female 18-24 years old 1.14 1 59

## 2 Female 25-34 years old 1.36 1 296

## 3 Female <18 years old 2 2 1

## 4 Male 18-24 years old 1.59 2 254

## 5 Male 25-34 years old 1.78 2 1105

## 6 Male <18 years old 1.6 2 5

## 7 Non-binary 18-24 years old 1 1 12

## 8 Non-binary 25-34 years old 1.46 1 65

## 9 Other (please specify) 18-24 years old 1.25 1.5 4

## 10 Other (please specify) 25-34 years old 2 2 3

## 11 Other (please specify) <18 years old 3 3 1

## 12 Prefer not to say 18-24 years old 0.5 0.5 4

## 13 Prefer not to say 25-34 years old 1.67 2 12

## 14 <NA> 18-24 years old 1.79 2 66

## 15 <NA> 25-34 years old 1.66 2 171

## 16 <NA> <18 years old 1.4 1 6
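The message about grouped output appears because, after summarizing over two grouping columns, dplyr keeps the result grouped by the first one. One way to get a plain ungrouped tibble (and silence the message) is the .groups argument. A sketch on a hypothetical tibble:

```r
library(dplyr)

# Hypothetical stand-in for gact with two grouping columns
toy <- tibble(
  gender   = c("F", "F", "M", "M"),
  age      = c("18-24", "25-34", "18-24", "18-24"),
  cups_num = c(1, 2, 3, NA)
)

res <- toy %>%
  group_by(gender, age) %>%
  summarize(mean_cups = mean(cups_num, na.rm = TRUE),
            n_obs = n(),          # n() counts rows, including those with NA
            .groups = "drop")     # return an ungrouped tibble

res
```

Note that n_obs counts every row in the group, so for the ("M", "18-24") group it is 2 even though the mean is computed from one non-missing value.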

Now for something more serious. The question “Where do you typically drink coffee?” (the variable where in our data) is multiple-choice, so the raw answers are a total mess. Let’s make nice dummy variables from it.

gact %>% select(where) %>% table() %>% head()

## where

## At a cafe

## 49

## At a cafe, At home

## 70

## At a cafe, At home, At the office

## 14

## At a cafe, At home, At the office, On the go

## 5

## At a cafe, At home, On the go

## 11

## At a cafe, At home, On the go, At the office

## 6

gact %>%
  drop_na(where) %>% # we don't need missing observations for this
  select(submission_id, where) %>%
  separate_rows(where, sep = ",\\s*") %>% # regex: split on a comma followed by optional whitespace
  # the data are now in long format: each submission has as many rows as it had choices
  mutate(dummy = 1)

## # A tibble: 5,748 x 3

## submission_id where dummy

## 1 V0Gaxg At a cafe 1

## 2 V0Gaxg At the office 1

## 3 V0Gaxg At home 1

## 4 V0Gaxg On the go 1

## 5 7WWB4A On the go 1

## 6 7WWB4A At a cafe 1

## 7 7WWB4A At home 1

## 8 7WWB4A At the office 1

## 9 pKL8aB At the office 1

## 10 Zd694B At home 1

## # i 5,738 more rows
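To see what separate_rows() does in isolation, here is a sketch on a two-row made-up tibble. The separator ",\\s*" is the same regular expression as above: a comma followed by zero or more whitespace characters.

```r
library(dplyr)
library(tidyr)

# Made-up multiple-choice answers
toy <- tibble(
  id    = c("a", "b"),
  where = c("At home, At a cafe", "On the go")
)

# One row per (id, choice) pair
long <- toy %>% separate_rows(where, sep = ",\\s*")
long
```

Submission "a" picked two options, so it now occupies two rows; submission "b" keeps its single row.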

gact %>%
  # drop_na(where) %>% # this time let's not drop the missing data
  select(submission_id, where) %>%
  separate_rows(where, sep = ",\\s*") %>%
  mutate(dummy = 1) %>%
  # the pivot_*() functions are VERY useful for reshaping data from long to wide format or vice versa
  pivot_wider(names_from = where, values_from = dummy, values_fill = 0)

## # A tibble: 3,280 x 7

## submission_id `NA` `At a cafe` `At the office` `At home` `On the go`

## 1 gMR29l 1 0 0 0 0

## 2 BkPN0e 1 0 0 0 0

## 3 W5G8jj 1 0 0 0 0

## 4 4xWgGr 1 0 0 0 0

## 5 QD27Q8 1 0 0 0 0

## 6 V0LPeM 1 0 0 0 0

## 7 V0Gaxg 0 1 1 1 1

## 8 AdzRL0 1 0 0 0 0

## 9 EXQLWN 1 0 0 0 0

## 10 xpa2K9 1 0 0 0 0

## # i 3,270 more rows

## # i 1 more variable: `None of these`
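The long-to-wide step can be sketched on a small made-up table. Each distinct value of where becomes a column, and values_fill = 0 puts a 0 wherever a submission did not pick that option.

```r
library(dplyr)
library(tidyr)

# Made-up long-format data: one row per (id, choice) pair
toy_long <- tibble(
  id    = c("a", "a", "b"),
  where = c("At home", "At a cafe", "On the go"),
  dummy = 1
)

wide <- toy_long %>%
  pivot_wider(names_from = where, values_from = dummy, values_fill = 0)

wide
```

The new column names contain spaces, which is why they have to be quoted with backticks and why the notes rename them afterwards.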

We can of course just mutate the original dataset, but let’s also practice merging data:

gact_where <- gact %>%
  select(submission_id, where) %>%
  separate_rows(where, sep = ",\\s*") %>%
  mutate(dummy = 1) %>%
  pivot_wider(names_from = where, values_from = dummy, values_fill = 0) %>%
  select(-`NA`) %>% # drop the `NA` column
  rename("where_cafe" = "At a cafe", # rename to get rid of whitespace
         "where_office" = "At the office",
         "where_home" = "At home",
         "where_go" = "On the go",
         "where_other" = "None of these")

gact <- left_join(gact, # identical to an SQL left join
                  gact_where, # to each row in gact add data from gact_where
                  by = "submission_id") # matching by `submission_id` value
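A minimal sketch of what the left join does, using two made-up tables (left_tbl and right_tbl are hypothetical names). Every row of the left table is kept; rows without a match in the right table get NA in the new columns.

```r
library(dplyr)

# Made-up tables sharing the key column id
left_tbl  <- tibble(id = c("a", "b", "c"), cups = c(1, 2, 3))
right_tbl <- tibble(id = c("a", "c"), where_home = c(1, 0))

joined <- left_join(left_tbl, right_tbl, by = "id")
joined
```

Here "b" has no row in right_tbl, so its where_home is NA. In the lab data every submission_id appears in gact_where, so no such gaps should arise from the join itself.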

2024-02-03
