STATS5020 - Introduction to R programming Assignment 1
Task 1 [13 marks in total]
Q1. [starwars - Read data; a1q1] [2 marks] The dataset starwars.csv contains information from a long time ago in a galaxy far, far away... (if you are not familiar with that line, you should watch Star Wars the minute after you submit this assignment. I mean it). The dataset contains the following columns:
Variable | Class | Description
name | character | Name of the character
height | integer | Height (cm)
mass | double | Weight (kg)
eye_color | factor | Eye colour
gender | factor | The gender role or gender identity of the character as determined by their personality or the way they were programmed (as in the case for Droids).
homeworld | factor | Name of homeworld
species | factor | Name of species
Table 1: Variables for the starwars.csv data frame.
Use R to read in the file starwars.csv correctly and save it as a data frame called starwars.
starwars = read.csv("starwars.csv", na.strings = "NA")
Q2v1. [starwars - missing sum; a1q2] [2 marks] Define a vector of length 7 called missing where each element corresponds to a column in starwars containing the number of missing values in that column. The elements of the vector should be named to show the corresponding names of the columns.
missing = colSums(is.na(starwars))
Q2v2. [starwars - missing average; a1q2] [2 marks] Define a vector of length 7 called missing where each element corresponds to a column in starwars containing the average number (that is, the proportion) of missing values in that column. The elements of the vector should be named to show the corresponding names of the columns.
missing = colMeans(is.na(starwars))
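To make the difference between the two variants concrete, here is a minimal sketch on a small made-up data frame. The object `toy` and its values are illustrative assumptions, not the assignment data. `is.na()` returns a logical matrix, so `colSums()` counts the missing entries per column and `colMeans()` gives the proportion missing.

```r
# Toy data frame (hypothetical values, not taken from starwars.csv)
toy = data.frame(height = c(172, NA, 96, NA),
                 mass   = c(77, 75, NA, 32),
                 name   = c("A", "B", "C", "D"))

# Logical matrix: TRUE wherever a value is missing
miss = is.na(toy)

n_missing    = colSums(miss)   # number of NAs per column
prop_missing = colMeans(miss)  # proportion of NAs per column

print(n_missing)
print(prop_missing)
```

Both results keep the column names automatically, which is why no separate call to names() is needed in the answers above.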
Q3. [starwars - missing; a1q3] [2 marks] Update the data frame starwars by removing the rows from starwars where the height or mass of the character is missing. The updated data frame should be called starwars.
starwars = starwars[!is.na(starwars$height) & !is.na(starwars$mass), ]
Q4v1. [starwars - height by gender; a1q4] [2 marks] Create a vector of length 2 that contains the average height of females and males.
c(mean(starwars$height[starwars$gender == "female"], na.rm = TRUE),
  mean(starwars$height[starwars$gender == "male"], na.rm = TRUE)
)
## [1] 170.2000 177.9545
Q4v2. [starwars - weight by gender; a1q4] [2 marks] Create a vector of length 2 that contains the average weight of females and males.
c(mean(starwars$mass[starwars$gender == "female"], na.rm = TRUE),
  mean(starwars$mass[starwars$gender == "male"], na.rm = TRUE)
)
## [1] 54.02000 81.00455
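The group means above are computed one group at a time by logical subsetting. As a hedged alternative, the sketch below uses `tapply()` on a small made-up data frame (`toy` and its values are assumptions for illustration) and checks that both routes agree.

```r
# Toy data (hypothetical values); one gender is missing on purpose
toy = data.frame(height = c(150, 170, 180, 200, 165),
                 gender = c("female", "female", "male", "male", NA))

# Approach used above: logical subsetting, one group at a time.
# The NA comparison yields an NA element, which na.rm = TRUE discards.
by_subset = c(mean(toy$height[toy$gender == "female"], na.rm = TRUE),
              mean(toy$height[toy$gender == "male"],   na.rm = TRUE))

# Alternative: tapply() splits height by gender and applies mean();
# rows whose gender is NA do not belong to any group
by_group = tapply(toy$height, toy$gender, mean, na.rm = TRUE)

print(by_subset)
print(by_group)
```

`tapply()` returns a named result (one entry per gender level), which can be convenient when there are more than two groups.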
Q5v1. [starwars - humans tatooine; a1q5] [3 marks] In Episode IV (“A New Hope”), Luke Skywalker is a farmer on Tatooine living with his uncle and aunt. Tatooine is a harsh desert world, where humans and droids coexist in relative harmony. Based on the starwars data frame, how many humans live on Tatooine?
sum(starwars$homeworld == "Tatooine" & starwars$species == "Human", na.rm = TRUE)
## [1] 6
Q5v2. [starwars - droids tatooine; a1q5] [3 marks] In Episode IV (“A New Hope”), Luke Skywalker is a farmer on Tatooine living with his uncle and aunt. Tatooine is a harsh desert world, where humans and droids coexist in relative harmony. Based on the starwars data frame, how many droids live on Tatooine?
sum(starwars$homeworld == "Tatooine" & starwars$species == "Droid", na.rm = TRUE)
## [1] 2
Q6v1. [starwars - ewoks; a1q6] [2 marks] In Episode VI (“Return of the Jedi”) the Alliance engages the Empire in a battle against the second Death Star above Endor. Endor is a small forested moon, home of the Ewoks, who join the Alliance in their fight. Ewoks stand about one meter tall and are very concerned about their body mass index (BMI). Create a new data frame called ewoks that contains only characters from the Ewok species. Add a new column to the ewoks data frame called BMI, which contains the BMI defined as m/h^2, where m is mass in kg and h is height in meters.
ewok.species = starwars$species == "Ewok"
ewok.species[is.na(ewok.species)] = FALSE
ewoks = starwars[ewok.species, ]
ewoks$BMI = ewoks$mass/((ewoks$height/100)^2)
Q6v2. [starwars - chewie; a1q6] [2 marks] In Episode IV (“A New Hope”) Chewbacca (Chewie) is a Wookiee that serves as the co-pilot of the Millennium Falcon starship. Wookiees are very tall individuals and are very concerned about their body mass index (BMI). Create a new data frame called chewie containing only characters from the same homeworld as Chewbacca. Add a new column to the chewie data frame called BMI, which contains the BMI defined as m/h^2, where m is mass in kg and h is height in meters.
chewie.home = starwars$homeworld[starwars$name == "Chewbacca"]
chewie.home = starwars$homeworld == chewie.home
chewie.home[is.na(chewie.home)] = FALSE
chewie = starwars[chewie.home, ]
chewie$BMI = chewie$mass/((chewie$height/100)^2)
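Both answers above set NA comparisons to FALSE before subsetting. The sketch below, on a made-up data frame (`toy`, its names and its values are assumptions), shows why that step matters and one possible alternative using `which()`, followed by the BMI arithmetic with height converted from cm to meters.

```r
# Toy data (hypothetical values); the third species is missing
toy = data.frame(name    = c("W", "L", "R"),
                 height  = c(88, 172, 96),   # cm
                 mass    = c(20, 77, 32),    # kg
                 species = c("Ewok", "Human", NA))

# A logical index containing NA produces an extra all-NA row
n_rows_naive = nrow(toy[toy$species == "Ewok", ])

# which() keeps only the positions where the comparison is TRUE,
# so NA comparisons are dropped without an explicit replacement step
ewoks_toy = toy[which(toy$species == "Ewok"), ]

# BMI = mass (kg) / height (m)^2; height is stored in cm, so divide by 100
ewoks_toy$BMI = ewoks_toy$mass / (ewoks_toy$height / 100)^2

print(n_rows_naive)
print(ewoks_toy)
```

Here the naive subset has two rows (the Ewok plus an all-NA row), while the `which()` version keeps only the single matching character.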
Task 2 [27 marks in total]
Q7. [spotify - read data; a1q7] [2 marks] The dataset spotify_songs.txt contains information about 500 different songs available from Spotify. The dataset contains the following columns:
Variable | Class | Description
track_id | character | Song unique ID
track_artist | character | Song artist
track_popularity | double | Song popularity (0-100), where higher is better
playlist_genre | character | Playlist genre
danceability | double | Describes how suitable a track is for dancing based on a combination of musical elements. A value of 0.0 is least danceable and 1.0 is most danceable.
energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.
duration_ms | double | Duration of song in milliseconds
Table 2: Variables for the spotify_songs.txt data frame.
Use R to read in the file spotify_songs.txt correctly and save it as a data frame called spotify_songs.
spotify_songs = read.table("spotify_songs.txt", header = TRUE)
Q8 [spotify - rm missing artist; a1q8] [1 mark] Update the data frame spotify_songs by removing the rows from spotify_songs where the track artist is missing. The updated data frame should be called spotify_songs.
spotify_songs = spotify_songs[!is.na(spotify_songs$track_artist), ]
Q9v1 [spotify - most danceable; a1q9] [2 marks] What artist has the most danceable song?
spotify_songs$track_artist[which.max(spotify_songs$danceability)]
## [1] Paul Kalkbrenner
## 458 Levels: 2Pac 3 Tenores 3LAU 500 Year Flood 6ix9ine ... Zuna
Q9v2 [spotify - most energetic; a1q9] [2 marks] What artist has the most energetic song?
spotify_songs$track_artist[which.max(spotify_songs$energy)]
## [1] Stoltenhoff
## 458 Levels: 2Pac 3 Tenores 3LAU 500 Year Flood 6ix9ine ... Zuna
Q10v1 [spotify - rock dance pop; a1q10] [3 marks] How many rock songs have a danceability level above 0.6 and popularity of 70 or more?
rock = spotify_songs[spotify_songs$playlist_genre == "rock", ]
sum(rock$danceability > 0.6 & rock$track_popularity >= 70)
## [1] 2
Q10v2 [spotify - latin dance pop; a1q10] [3 marks] How many Latin songs have a danceability level above 0.6 and popularity of 70 or more?
latino = spotify_songs[spotify_songs$playlist_genre == "latin", ]
sum(latino$danceability > 0.6 & latino$track_popularity >= 70)
## [1] 15
Q11v1 [spotify - popularityGroup; a1q11] [3 marks] Create a new column in the spotify_songs data frame called popularityGroup defined as follows:
$$
\text{popularityGroup} =
\begin{cases}
\text{low}, & \text{if track\_popularity} \le 20, \\
\text{medium-low}, & \text{if } 20 < \text{track\_popularity} \le 40, \\
\text{medium-high}, & \text{if } 40 < \text{track\_popularity} \le 60, \\
\text{high}, & \text{if } 60 < \text{track\_popularity} \le 80, \\
\text{supreme}, & \text{if track\_popularity} > 80.
\end{cases}
$$
Next, define a vector called n_artists which contains the number of songs falling into each of these categories.
The elements of the vector should be named to show the corresponding categories.
spotify_songs = transform(spotify_songs,
                          popularityGroup = cut(track_popularity,
                                                breaks = c(-Inf, 20, 40, 60, 80, Inf),
                                                labels = c("low", "medium-low",
                                                           "medium-high", "high", "supreme")))
n_artists = c(sum(spotify_songs$popularityGroup == "low"),
              sum(spotify_songs$popularityGroup == "medium-low"),
              sum(spotify_songs$popularityGroup == "medium-high"),
              sum(spotify_songs$popularityGroup == "high"),
              sum(spotify_songs$popularityGroup == "supreme")
)
names(n_artists) = c("low", "medium-low", "medium-high", "high", "supreme")
n_artists
##         low  medium-low medium-high        high     supreme
##         118         101         132         126          18
# Note that the table() function makes this easier!
n_artists = table(spotify_songs$popularityGroup); n_artists
##
##         low  medium-low medium-high        high     supreme
##         118         101         132         126          18
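The boundary behaviour of `cut()` is easy to get wrong, so here is a minimal sketch on made-up scores (the vector `pop` is an assumption for illustration). By default `cut()` uses right-closed intervals, so a score of exactly 20 lands in "low" and 21 in "medium-low", matching the definition above.

```r
# Hypothetical popularity scores chosen to sit on and next to the breaks
pop = c(0, 20, 21, 40, 60, 61, 80, 81, 100)

# Default right = TRUE gives intervals (-Inf, 20], (20, 40], ..., (80, Inf)
grp = cut(pop,
          breaks = c(-Inf, 20, 40, 60, 80, Inf),
          labels = c("low", "medium-low", "medium-high", "high", "supreme"))

# table() counts each level, including levels with zero observations
counts = table(grp)

# table() returns an object of class "table"; if a plain named vector
# is wanted, the counts and level names can be copied across
n_named = setNames(as.vector(counts), names(counts))

print(data.frame(pop, grp))
print(n_named)
```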
Q11v2 [spotify - energyGroup; a1q11] [3 marks] Create a new column in the spotify_songs data frame called energyGroup defined as follows:
$$
\text{energyGroup} =
\begin{cases}
\text{low}, & \text{if energy} \le 0.4, \\
\text{medium-low}, & \text{if } 0.4 < \text{energy} \le 0.6, \\
\text{medium-high}, & \text{if } 0.6 < \text{energy} \le 0.8, \\
\text{high}, & \text{if energy} > 0.8.
\end{cases}
$$
Next, define a vector called n_artists which contains the number of songs falling into each of these categories.
The elements of the vector should be named to show the corresponding categories.
spotify_songs = transform(spotify_songs,
                          energyGroup = cut(energy,
                                            breaks = c(-Inf, 0.4, 0.6, 0.8, Inf),
                                            labels = c("low", "medium-low",
                                                       "medium-high", "high")))
n_artists = c(sum(spotify_songs$energyGroup == "low"),
              sum(spotify_songs$energyGroup == "medium-low"),
              sum(spotify_songs$energyGroup == "medium-high"),
              sum(spotify_songs$energyGroup == "high")
)
names(n_artists) = c("low", "medium-low", "medium-high", "high")
n_artists
##         low  medium-low medium-high        high
##          26         104         192         173
# Note that the table() function makes this easier!
n_artists = table(spotify_songs$energyGroup); n_artists
##
##         low  medium-low medium-high        high
##          26         104         192         173
Q12 [spotify - minutes; a1q12] [2 marks] Create a new column called duration_min that contains the track duration in minutes.
spotify_songs$duration_min = spotify_songs$duration_ms/60000
Q13 [spotify - sort energy; a1q13] [2 marks] Sort the spotify_songs data frame in ascending order according to the energy level.
spotify_songs = spotify_songs[order(spotify_songs$energy, decreasing = FALSE), ]
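For readers new to `order()`, the following sketch on a made-up data frame (`toy` and its values are assumptions) shows that it returns the row positions in sorted order rather than the sorted values, which is why it is used as a row index.

```r
# Toy data (hypothetical values)
toy = data.frame(track  = c("a", "b", "c"),
                 energy = c(0.9, 0.2, 0.5))

# order() gives the positions of the smallest, next smallest, ... value
idx = order(toy$energy)

# Using those positions as a row index reorders the whole data frame
sorted = toy[idx, ]

print(idx)
print(sorted)
```

Ascending order is the default, so `decreasing = FALSE` in the answer above is explicit rather than required.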
Q14v1 [spotify - xy polyReg; a1q14] [3 marks] We will model the relationship between energy and danceability using the following polynomial regression model of degree p:
$$E(y_i) = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_p x_i^p, \quad i = 1, \dots, n,$$
where n is the number of rows in spotify_songs, $y_i$ = danceability[i] and $x_i$ = energy[i], i = 1, ..., n.
For a covariate vector $x = (x_1, \dots, x_n)$ the design matrix for the polynomial regression of degree p takes the form
$$
X =
\begin{bmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^p \\
1 & x_2 & x_2^2 & \cdots & x_2^p \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^p
\end{bmatrix}.
$$
The default value of p is 4. Define x and y to be the vectors energy and danceability from the spotify_songs data frame, respectively. Use x to define the design matrix X.
x = spotify_songs$energy
y = spotify_songs$danceability
n = nrow(spotify_songs)
X = cbind(rep(1, n), x, x^2, x^3, x^4)
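The answer above writes out each power by hand for p = 4. As a hedged sketch of how the same matrix could be built for any degree, the code below uses `outer()` on a made-up covariate vector (the values of `x` here are assumptions) and checks it against the hand-written version.

```r
# Hypothetical covariate values standing in for spotify_songs$energy
x = c(0.2, 0.5, 0.9)
p = 4

# Hand-written construction, as in the answer above
X_manual = cbind(rep(1, length(x)), x, x^2, x^3, x^4)

# General construction: entry (i, j) is x[i] raised to the power j - 1,
# for exponents 0, 1, ..., p (x^0 gives the column of ones)
X_outer = outer(x, 0:p, "^")

print(X_outer)
print(max(abs(X_manual - X_outer)))
```

The two matrices differ only in their column names, so either can be used as X in the fitted-value formula.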
Q14v2 [spotify - xy reg; a1q14] [3 marks] We will model the relationship between energy and danceability using the following regression model:
$$E(y_i) = \beta_0 + \beta_1 x_i + \beta_2 \log(x_i), \quad i = 1, \dots, n,$$
where n is the number of rows in spotify_songs, $y_i$ = danceability[i] and $x_i$ = energy[i], i = 1, ..., n. For a covariate vector $x = (x_1, \dots, x_n)$ the design matrix for the above regression takes the form
$$
X =
\begin{bmatrix}
1 & x_1 & \log(x_1) \\
1 & x_2 & \log(x_2) \\
\vdots & \vdots & \vdots \\
1 & x_n & \log(x_n)
\end{bmatrix}.
$$
Define x and y to be the vectors energy and danceability from the spotify_songs data frame, respectively. Use x to define the design matrix X.
x = spotify_songs$energy
y = spotify_songs$danceability
n = nrow(spotify_songs)
X = cbind(1, x, log(x))
Q15 [spotify - fitted values; a1q15] [2 marks] Define a vector y.hat which contains the fitted values for the regression computed using the design matrix X and the vector y previously defined. The fitted values can be computed using
$$\hat{y} = X (X^T X)^{-1} X^T y.$$
y.hat = X %*% solve(t(X) %*% X) %*% t(X) %*% y
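One way to sanity-check the matrix formula is to compare it with R's built-in least-squares fit. The sketch below does this on simulated data (the values of `x` and `y`, the seed, and the coefficients are assumptions for illustration only) using the log design matrix from Q14v2. It also shows an equivalent form that solves the normal equations directly instead of forming the inverse.

```r
# Simulated data (hypothetical; not the Spotify data)
set.seed(1)
x = runif(20, 0.1, 1)
y = 0.5 + 0.3 * x + rnorm(20, sd = 0.05)
X = cbind(1, x, log(x))

# Formula from the text: y.hat = X (X'X)^{-1} X'y
y.hat = X %*% solve(t(X) %*% X) %*% t(X) %*% y

# Equivalent: solve (X'X) beta = X'y for beta, then multiply by X
beta   = solve(crossprod(X), crossprod(X, y))
y.hat2 = X %*% beta

# Cross-check against lm(), which fits the same model
fit = lm(y ~ x + log(x))

print(max(abs(y.hat - y.hat2)))
print(max(abs(as.vector(y.hat) - fitted(fit))))
```

Both differences should be at the level of floating-point rounding. Note that y.hat is an n x 1 matrix rather than a plain vector; `as.vector()` converts it if needed.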
Q16 [spotify - create list; a1q16] [3 marks] Create a list called reg of length 3. The first entry of reg, named data, is another list of length 2. The first entry of data should contain the vector x and the second entry should contain the vector y that you defined in Question 14. The second entry of reg, named designM, should contain the design matrix X. Finally, the third entry of reg, named yhat, should contain the vector of fitted values y.hat.
Note: if you were unable to define the objects x, y, X and/or y.hat in the previous questions, then
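No answer for Q16 appears in this section, so the following is only a sketch of one way the nested list described above could be assembled. The values of `x` and `y` are made up, and the inner names `x` and `y` inside `data` are an assumption, since the question does not say how those two entries should be named.

```r
# Hypothetical stand-ins for the objects defined in Q14 and Q15
x = c(0.2, 0.5, 0.9, 0.7)
y = c(0.60, 0.70, 0.50, 0.65)
X = cbind(1, x, log(x))
y.hat = X %*% solve(t(X) %*% X) %*% t(X) %*% y

# Outer list of length 3; the first entry is itself a list of length 2
# (inner names x and y are an assumption, not given in the question)
reg = list(data    = list(x = x, y = y),
           designM = X,
           yhat    = y.hat)

str(reg)
```

Entries can then be reached by name, for example `reg$data$x` or `reg$designM`.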
2022-10-15