STATS5020 - Introduction to R programming

Assignment 1

Task 1 [13 marks in total]

Q1. [starwars - Read data; a1q1] [2 marks] The dataset starwars.csv contains information from a long time ago in a galaxy far, far away... (if you are not familiar with that line, you should watch Star Wars the minute after you submit this assignment. I mean it). The dataset contains the following columns:

Variable	Class	Description
name	character	Name of the character
height	integer	Height (cm)
mass	double	Weight (kg)
eye_color	factor	Eye colour
gender	factor	The gender role or gender identity of the character as determined by their personality or the way they were programmed (as in the case for Droids).
homeworld	factor	Name of homeworld
species	factor	Name of species

Table 1: Variables for the starwars.csv data frame.

Use R to read in the file starwars.csv correctly and save it as a data frame called starwars.

starwars = read.csv("starwars.csv", na.strings = "NA")
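The classes in Table 1 include factors, and since R 4.0.0 read.csv() no longer converts strings to factors by default. The sketch below shows one way to obtain those classes; it reads a made-up three-row sample through the text argument instead of the real file, and the object names csv_text and demo, as well as every value in the sample, are hypothetical.

```r
# Hypothetical three-row sample standing in for starwars.csv
csv_text <- "name,height,mass,eye_color,gender,homeworld,species
Character A,172,77.5,blue,male,Tatooine,Human
Character B,96,32.5,red,NA,Naboo,Droid
Character C,150,NA,brown,female,Alderaan,Human"

# stringsAsFactors = TRUE turns every string column into a factor
demo <- read.csv(text = csv_text, na.strings = "NA", stringsAsFactors = TRUE)

# name is a label rather than a category, so convert it back to character
demo$name <- as.character(demo$name)

# Inspect the resulting class of each column
sapply(demo, class)
```

Calling sapply(starwars, class) after reading the real file is a quick way to compare the result against Table 1.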

Q2v1. [starwars - missing sum; a1q2] [2 marks] Define a vector of length 7 called missing where each element corresponds to a column in starwars containing the number of missing values in that column. The elements of the vector should be named to show the corresponding names of the columns.

missing = colSums(is.na(starwars))

Q2v2. [starwars - missing average; a1q2] [2 marks] Define a vector of length 7 called missing where each element corresponds to a column in starwars containing the average number of missing values in that column (that is, the proportion of entries that are missing). The elements of the vector should be named to show the corresponding names of the columns.

missing = colMeans(is.na(starwars))
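Both versions of Q2 rely on is.na() returning a logical matrix whose TRUE values count as 1 when summed or averaged. A minimal sketch on a made-up data frame (toy, na_flags, n_missing and p_missing are hypothetical names, and the values are invented):

```r
# Hypothetical data frame with a few missing entries
toy <- data.frame(
  height  = c(172L, NA, 150L, 96L),
  mass    = c(77, NA, NA, 32),
  species = c("Human", "Human", NA, "Droid")
)

na_flags  <- is.na(toy)         # logical matrix with the same shape as toy
n_missing <- colSums(na_flags)  # count of NAs per column, named by column
p_missing <- colMeans(na_flags) # proportion of NAs per column

n_missing
p_missing
```

The column names are carried over automatically, which is why no separate names() call is needed.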

Q3. [starwars - missing; a1q3] [2 marks] Update the data frame starwars by removing the rows from starwars where the height or mass of the character is missing. The updated data frame should be called starwars.

starwars = starwars[!is.na(starwars$height) & !is.na(starwars$mass), ]
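An equivalent way to express the same filter is complete.cases() restricted to the two columns of interest. The sketch below uses a made-up data frame (toy, keep and toy_clean are hypothetical names) and checks that both approaches keep the same rows:

```r
# Hypothetical data: rows 2 and 3 miss height or mass, row 4 misses only homeworld
toy <- data.frame(
  name      = c("a", "b", "c", "d"),
  height    = c(172, NA, 150, 96),
  mass      = c(77, 80, NA, 32),
  homeworld = c("Tatooine", "Naboo", "Alderaan", NA)
)

# TRUE for rows where neither height nor mass is missing
keep <- complete.cases(toy[, c("height", "mass")])
toy_clean <- toy[keep, ]

# Same result as the explicit is.na() test
same <- identical(toy_clean,
                  toy[!is.na(toy$height) & !is.na(toy$mass), ])
same
```

Restricting complete.cases() to the two columns matters: applied to the whole data frame it would also drop row 4, whose only missing value is the homeworld.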

Q4v1. [starwars - height by gender; a1q4] [2 marks] Create a vector of length 2 that contains the average height of females and males.

c(mean(starwars$height[starwars$gender == "female"], na.rm = TRUE),
  mean(starwars$height[starwars$gender == "male"], na.rm = TRUE))

## [1] 170.2000 177.9545

Q4v2. [starwars - weight by gender; a1q4] [2 marks] Create a vector of length 2 that contains the average weight of females and males.

c(mean(starwars$mass[starwars$gender == "female"], na.rm = TRUE),
  mean(starwars$mass[starwars$gender == "male"], na.rm = TRUE))

## [1] 54.02000 81.00455
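The group averages in Q4 can also be computed in one call with tapply(), which splits a vector by a grouping variable and applies a function to each piece. This is an alternative sketch on invented data, not the marked solution; toy and avg_height are hypothetical names.

```r
# Hypothetical data with a missing gender and a missing height
toy <- data.frame(
  gender = c("female", "male", "male", NA, "female"),
  height = c(150, 172, 188, 96, NA)
)

# Mean height per gender; rows with NA gender are dropped from the grouping,
# and na.rm = TRUE is passed on to mean() for missing heights
avg_height <- tapply(toy$height, toy$gender, mean, na.rm = TRUE)
avg_height
```

The result is named by group, in alphabetical order of the group labels, so female comes before male as the question requires.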

Q5v1. [starwars - humans tatooine; a1q5] [3 marks] In Episode IV ("A New Hope"), Luke Skywalker is a farmer on Tatooine living with his uncle and aunt. Tatooine is a harsh desert world, where humans and droids coexist in relative harmony. Based on the starwars data frame, how many humans live on Tatooine?

sum(starwars$homeworld == "Tatooine" & starwars$species == "Human", na.rm = TRUE)

## [1] 6

Q5v2. [starwars - droids tatooine; a1q5] [3 marks] In Episode IV ("A New Hope"), Luke Skywalker is a farmer on Tatooine living with his uncle and aunt. Tatooine is a harsh desert world, where humans and droids coexist in relative harmony. Based on the starwars data frame, how many droids live on Tatooine?

sum(starwars$homeworld == "Tatooine" & starwars$species == "Droid", na.rm = TRUE)

## [1] 2

Q6v1. [starwars - ewoks; a1q6] [2 marks] In Episode VI ("Return of the Jedi") the Alliance engages the Empire in a battle against the second Death Star above Endor. Endor is a small forested moon, home of the Ewoks, who join the Alliance in their fight. Ewoks stand about one meter tall and are very concerned about their body mass index (BMI). Create a new data frame called ewoks that contains only characters from the Ewok species. Add a new column to the ewoks data frame called BMI, which contains the BMI defined as m/h^2, where m is mass in kg and h is height in meters.

ewok.species = starwars$species == "Ewok"

ewok.species[is.na(ewok.species)] = FALSE

ewoks = starwars[ewok.species, ]

ewoks$BMI = ewoks$mass/((ewoks$height/100)^2)

Q6v2. [starwars - chewie; a1q6] [2 marks] In Episode IV ("A New Hope") Chewbacca (Chewie) is a Wookiee that serves as the co-pilot of the Millennium Falcon starship. Wookiees are very tall individuals and are very concerned about their body mass index (BMI). Create a new data frame called chewie containing only characters from the same homeworld as Chewbacca. Add a new column to the chewie data frame called BMI, which contains the BMI defined as m/h^2, where m is mass in kg and h is height in meters.

chewie.home = starwars$homeworld[starwars$name == "Chewbacca"]

chewie.home = starwars$homeworld == chewie.home

chewie.home[is.na(chewie.home)] = FALSE

chewie = starwars[chewie.home, ]

chewie$BMI = chewie$mass/((chewie$height/100)^2)
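Both Q6 solutions set NA entries of the logical index to FALSE before subsetting. The reason is that an NA in a logical row index produces a row filled with NAs. The sketch below illustrates this on invented data and shows which() as an alternative that drops NAs; toy, is_ewok, bad and ewoks_toy are hypothetical names.

```r
# Hypothetical data: the species of character B is unknown
toy <- data.frame(
  name    = c("A", "B", "C"),
  species = c("Ewok", NA, "Human"),
  height  = c(88, 100, 172),  # cm
  mass    = c(20, 25, 77)     # kg
)

is_ewok <- toy$species == "Ewok"  # TRUE, NA, FALSE

# Indexing with the raw logical vector keeps a spurious all-NA row
bad <- toy[is_ewok, ]

# which() returns only the positions that are TRUE, so the NA is ignored
ewoks_toy <- toy[which(is_ewok), ]

# BMI = mass (kg) / height (m)^2; height is stored in cm, hence the /100
ewoks_toy$BMI <- ewoks_toy$mass / (ewoks_toy$height / 100)^2
ewoks_toy
```

Either approach (replacing NA with FALSE, or wrapping the condition in which()) yields the same subset.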

Task 2 [27 marks in total]

Q7. [spotify - read data; a1q7] [2 marks] The dataset spotify_songs.txt contains information about 500 different songs available from Spotify. The dataset contains the following columns:

Variable	Class	Description
track_id	character	Song unique ID
track_artist	character	Song artist
track_popularity	double	Song popularity (0-100), where higher is better
playlist_genre	character	Playlist genre
danceability	double	Describes how suitable a track is for dancing based on a combination of musical elements. A value of 0.0 is least danceable and 1.0 is most danceable.
energy	double	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.
duration_ms	double	Duration of song in milliseconds

Table 2: Variables for the spotify_songs.txt data frame.

Use R to read in the file spotify_songs.txt correctly and save it as a data frame called spotify_songs.

spotify_songs = read.table("spotify_songs.txt", header = TRUE)

Q8 [spotify - rm missing artist; a1q8] [1 mark] Update the data frame spotify_songs by removing the rows from spotify_songs where the track artist is missing. The updated data frame should be called spotify_songs.

spotify_songs = spotify_songs[!is.na(spotify_songs$track_artist), ]

Q9v1 [spotify - most danceable; a1q9] [2 marks] What artist has the most danceable song?

spotify_songs$track_artist[which.max(spotify_songs$danceability)]

## [1] Paul Kalkbrenner

## 458 Levels: 2Pac 3 Tenores 3LAU 500 Year Flood 6ix9ine ... Zuna

Q9v2 [spotify - most energetic; a1q9] [2 marks] What artist has the most energetic song?

spotify_songs$track_artist[which.max(spotify_songs$energy)]

## [1] Stoltenhoff

## 458 Levels: 2Pac 3 Tenores 3LAU 500 Year Flood 6ix9ine ... Zuna

Q10v1 [spotify - rock dance pop; a1q10] [3 marks] How many rock songs have a danceability level above 0.6 and popularity of 70 or more?

rock = spotify_songs[spotify_songs$playlist_genre == "rock", ]

sum(rock$danceability > 0.6 & rock$track_popularity >= 70)

## [1] 2

Q10v2 [spotify - latin dance pop; a1q10] [3 marks] How many Latin songs have a danceability level above 0.6 and popularity of 70 or more?

latino = spotify_songs[spotify_songs$playlist_genre == "latin", ]

sum(latino$danceability > 0.6 & latino$track_popularity >= 70)

## [1] 15
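The two variants of Q10 differ only in the genre, so the same condition can be counted for every genre at once. A minimal sketch on invented data (toy, hit and n_by_genre are hypothetical names); note that the danceability threshold is strict (above 0.6) while the popularity threshold is inclusive (70 or more):

```r
# Hypothetical songs; values chosen to sit on and around the thresholds
toy <- data.frame(
  playlist_genre   = c("rock", "rock", "latin", "latin", "latin", "pop"),
  danceability     = c(0.70, 0.50, 0.65, 0.80, 0.60, 0.90),
  track_popularity = c(75, 90, 70, 69, 95, 80)
)

# TRUE where both conditions hold
hit <- toy$danceability > 0.6 & toy$track_popularity >= 70

# Summing a logical vector counts its TRUE values, here separately per genre
n_by_genre <- tapply(hit, toy$playlist_genre, sum)
n_by_genre
```

A song with danceability exactly 0.6 is excluded, whereas a song with popularity exactly 70 is included.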

Q11v1 [spotify - popularityGroup; a1q11] [3 marks] Create a new column in the spotify_songs data frame called popularityGroup defined as follows

\[
\text{popularityGroup} =
\begin{cases}
\text{low}, & \text{if } \text{track\_popularity} \le 20, \\
\text{medium-low}, & \text{if } 20 < \text{track\_popularity} \le 40, \\
\text{medium-high}, & \text{if } 40 < \text{track\_popularity} \le 60, \\
\text{high}, & \text{if } 60 < \text{track\_popularity} \le 80, \\
\text{supreme}, & \text{if } \text{track\_popularity} > 80.
\end{cases}
\]

Next, define a vector called n_artists which contains the number of songs falling into each of these categories. The elements of the vector should be named to show the corresponding categories.

spotify_songs = transform(spotify_songs,
  popularityGroup = cut(track_popularity,
    breaks = c(-Inf, 20, 40, 60, 80, Inf),
    labels = c("low", "medium-low", "medium-high", "high", "supreme")))

n_artists = c(sum(spotify_songs$popularityGroup == "low"),
  sum(spotify_songs$popularityGroup == "medium-low"),
  sum(spotify_songs$popularityGroup == "medium-high"),
  sum(spotify_songs$popularityGroup == "high"),
  sum(spotify_songs$popularityGroup == "supreme"))

names(n_artists) = c("low", "medium-low", "medium-high", "high", "supreme")

n_artists

##         low  medium-low medium-high        high     supreme
##         118         101         132         126          18

# Note that the table() function makes this easier!

n_artists = table(spotify_songs$popularityGroup); n_artists

##         low  medium-low medium-high        high     supreme
##         118         101         132         126          18
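The cut() call works because its intervals are closed on the right by default (right = TRUE), which matches the "less than or equal to" upper bounds in the definition. The sketch below checks the boundary values on an invented vector; pop_scores, grp and counts are hypothetical names.

```r
# Hypothetical popularity scores placed on and next to each break point
pop_scores <- c(0, 20, 21, 40, 60, 61, 80, 81, 100)

# Default right = TRUE gives intervals (-Inf,20], (20,40], (40,60], (60,80], (80,Inf)
grp <- cut(pop_scores,
           breaks = c(-Inf, 20, 40, 60, 80, Inf),
           labels = c("low", "medium-low", "medium-high", "high", "supreme"))

# Show each score next to its assigned group
data.frame(pop_scores, grp)

# table() counts every level, including levels with zero observations
counts <- table(grp)
counts
```

A score of exactly 20 lands in low and a score of exactly 80 lands in high, as the definition requires.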

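Q11v2 [spotify - energyGroup; a1q11] [3 marks] Create a new column in the spotify_songs data frame called energyGroup defined as follows

\[
\text{energyGroup} =
\begin{cases}
\text{low}, & \text{if } \text{energy} \le 0.4, \\
\text{medium-low}, & \text{if } 0.4 < \text{energy} \le 0.6, \\
\text{medium-high}, & \text{if } 0.6 < \text{energy} \le 0.8, \\
\text{high}, & \text{if } \text{energy} > 0.8.
\end{cases}
\]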

Next, define a vector called n_artists which contains the number of songs falling into each of these categories. The elements of the vector should be named to show the corresponding categories.

spotify_songs = transform(spotify_songs,
  energyGroup = cut(energy,
    breaks = c(-Inf, 0.4, 0.6, 0.8, Inf),
    labels = c("low", "medium-low", "medium-high", "high")))

n_artists = c(sum(spotify_songs$energyGroup == "low"),
  sum(spotify_songs$energyGroup == "medium-low"),
  sum(spotify_songs$energyGroup == "medium-high"),
  sum(spotify_songs$energyGroup == "high"))

names(n_artists) = c("low", "medium-low", "medium-high", "high")

n_artists

##         low  medium-low medium-high        high
##          26         104         192         173

# Note that the table() function makes this easier!

n_artists = table(spotify_songs$energyGroup); n_artists

##         low  medium-low medium-high        high
##          26         104         192         173

Q12 [spotify - minutes; a1q12] [2 marks] Create a new column called duration_min that contains the track duration in minutes.

spotify_songs$duration_min = spotify_songs$duration_ms/60000

Q13 [spotify - sort energy; a1q13] [2 marks] Sort the spotify_songs data frame in ascending order according to the energy level.

spotify_songs = spotify_songs[order(spotify_songs$energy, decreasing = FALSE), ]
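order() returns the row positions that would sort the vector, and those positions are then used as a row index. A small sketch on invented data (toy, asc and desc are hypothetical names) shows the ascending default, the descending variant, and where a missing value ends up:

```r
# Hypothetical tracks; t3 has no energy value
toy <- data.frame(
  track_id = c("t1", "t2", "t3", "t4"),
  energy   = c(0.9, 0.2, NA, 0.5)
)

# Ascending is the default; NA is placed last (na.last = TRUE)
asc <- toy[order(toy$energy), ]

# Descending order; NA still goes last
desc <- toy[order(toy$energy, decreasing = TRUE), ]

asc$track_id
desc$track_id
```

Since ascending is the default, decreasing = FALSE in the solution is optional and only makes the intent explicit.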

Q14v1 [spotify - xy polyReg; a1q14] [3 marks] We will model the relationship between energy and danceability using the following polynomial regression model of degree p:

\[
E(y_i) = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_p x_i^p, \quad i = 1, \dots, n,
\]

where n is the number of rows in spotify_songs, y_i = danceability[i] and x_i = energy[i], i = 1, ..., n. For a covariate vector x = (x_1, ..., x_n) the design matrix for the polynomial regression of degree p takes the form

\[
X =
\begin{pmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^p \\
1 & x_2 & x_2^2 & \cdots & x_2^p \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^p
\end{pmatrix}.
\]

The default value of p is 4. Define x and y to be the vectors energy and danceability from the spotify_songs data frame, respectively. Use x to define the design matrix X.

x = spotify_songs$energy

y = spotify_songs$danceability

n = nrow(spotify_songs)

X = cbind(rep(1, n), x, x^2, x^3, x^4)
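The cbind() call hard-codes p = 4. For a general degree, one possible sketch builds the same matrix with outer(), raising each x_i to the powers 0, 1, ..., p. The vector x_demo, the names X_outer and X_manual, and the column labels are all hypothetical and only serve the illustration.

```r
# Hypothetical covariate values
x_demo <- c(0.2, 0.5, 0.9)
p <- 4

# Entry (i, j) is x_demo[i]^(j - 1), so column 1 is the intercept column of ones
X_outer <- outer(x_demo, 0:p, `^`)
colnames(X_outer) <- paste0("x^", 0:p)

# The hard-coded construction for p = 4, for comparison
X_manual <- cbind(rep(1, length(x_demo)), x_demo, x_demo^2, x_demo^3, x_demo^4)

# Compare the entries only, ignoring column names
same_entries <- isTRUE(all.equal(unname(X_outer), unname(X_manual)))
same_entries
```

Changing p is then enough to obtain the design matrix for any other degree.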

Q14v2 [spotify - xy reg; a1q14] [3 marks] We will model the relationship between energy and danceability using the following regression model:

\[
E(y_i) = \beta_0 + \beta_1 x_i + \beta_2 \log(x_i), \quad i = 1, \dots, n,
\]

where n is the number of rows in spotify_songs, y_i = danceability[i] and x_i = energy[i], i = 1, ..., n. For a covariate vector x = (x_1, ..., x_n) the design matrix for the above regression takes the form

\[
X =
\begin{pmatrix}
1 & x_1 & \log(x_1) \\
1 & x_2 & \log(x_2) \\
\vdots & \vdots & \vdots \\
1 & x_n & \log(x_n)
\end{pmatrix}.
\]

Define x and y to be the vectors energy and danceability from the spotify_songs data frame, respectively. Use x to define the design matrix X.

x = spotify_songs$energy

y = spotify_songs$danceability

n = nrow(spotify_songs)

X = cbind(1, x, log(x))

Q15 [spotify - fitted values; a1q15] [2 marks] Define a vector y.hat which contains the fitted values for the regression computed using the design matrix X and the vector y previously defined. The fitted values can be computed using

\[
\hat{y} = X (X^T X)^{-1} X^T y.
\]

y.hat = X %*% solve(t(X) %*% X) %*% t(X) %*% y
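As a sanity check, the matrix formula can be compared with two other routes to the same fitted values: solving the normal equations without forming an explicit inverse, and calling lm(). The sketch below does this on simulated data; the simulated coefficients and the names x_sim, y_sim, X_sim, y_hat_formula, y_hat_solve and y_hat_lm are all hypothetical.

```r
set.seed(1)

# Simulated stand-ins for energy and danceability (not the Spotify data)
x_sim <- runif(50, 0.1, 1)
y_sim <- 0.5 + 0.3 * x_sim - 0.2 * log(x_sim) + rnorm(50, sd = 0.05)
X_sim <- cbind(1, x_sim, log(x_sim))

# Route 1: the formula as written, with an explicit inverse
y_hat_formula <- X_sim %*% solve(t(X_sim) %*% X_sim) %*% t(X_sim) %*% y_sim

# Route 2: solve (X'X) beta = X'y directly, then multiply by X
beta_hat    <- solve(crossprod(X_sim), crossprod(X_sim, y_sim))
y_hat_solve <- X_sim %*% beta_hat

# Route 3: let lm() fit the same model
y_hat_lm <- fitted(lm(y_sim ~ x_sim + log(x_sim)))

all.equal(as.vector(y_hat_formula), as.vector(y_hat_solve))
all.equal(as.vector(y_hat_formula), unname(y_hat_lm))
```

The explicit inverse is fine for a small, well-conditioned problem like this one; solving the linear system or using lm() is generally the numerically safer habit.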

Q16 [spotify - create list; a1q16] [3 marks] Create a list called reg of length 3. The first entry of reg, named data, is another list of length 2. The first entry of data should contain the vector x and the second entry should contain the vector y that you defined in Question 14. The second entry of reg, named designM, should contain the design matrix X. Finally, the third entry of reg, named yhat, should contain the vector of fitted values y.hat.

Note: if you were unable to define the objects x, y, X and/or y.hat in the previous questions, then

2022-10-15
