COMP7024 Programming for Data Science Assignment 2023 Q1
Programming for Data Science (COMP7024)
Assignment
2023 Q1
Due 1st of March 2023
Introduction
This assignment consists of four questions, each of equal value, giving a total contribution of 40% to this subject. The beginning of each question provides a breakdown of marks for each part in that question. For example, a breakdown of (1 + 3 + 6 = 10) implies a question consisting of three parts, where the first, second and third parts are worth 1, 3 and 6 marks respectively.
IMPORTANT
Only R, and only the packages (that is, R libraries) described in the lectures and tutorials for this subject, may be used to generate answers for this assignment! In addition, RMarkdown must not be used. Penalties may apply for non-conformance.
Answer structure
In doing this assignment, you should not seek to use the maximum word limit declared in the Learning Guide. Marks are not awarded for using many words, but for using an economy of words and stating only what is relevant to the question being answered. Consider the old adage: “less can be more”. Therefore, show you know what is relevant and have mastered the ability to get to the point using few, simple words and clear sentences. Seek to also apply this philosophy to the code you write.
Your answers to this assignment are to be provided in a single R script file. All material in your script file should be logically organised, so that related material can be easily and quickly located. Clearly identify yourself in this file by giving, as a minimum, your full name and student ID as comments at the beginning of the file.
R script file
Textual answers should be included as comments in your script file; refer to listing 1 for examples of including comments. The comments in your script file should be:
● Brief and to the point
● Stating a high level perspective
● Stating what is not immediately obvious, but worth mentioning
# In an R script file, comments are prefixed with the hash symbol
# A line with just a comment on it
# Generate a distribution of mean values from a sequence of digits
d <- replicate(1000,
{
  s <- sample(0:9, replace = TRUE) # Generate a sequence of 10 numeric digits
  mean(s) # A comment to end a line with R code on it
})
hist(d, main = 'Distribution of means') # Show distribution of means
# Of course you would use smarter comments than those used here
# Only state what is not immediately obvious
Listing 1: Some R code with comments (shown in green)
Be brief and to-the-point with respect to comments. The approach described in this section should be the same as used for the exam. Make judicious use of comments a priority in doing this assignment. After all, comments are meant to communicate important and useful details. Make sure you also communicate well through wise choices in variable and function names. Also make wise decisions regarding the layout of everything inside your R script file. Note if things go wrong, good organisation and comments can help you, since they can show if appropriate logic was intended.
Plagiarism
This is an individual effort assignment, therefore the answers you provide must be your own. You may learn from others, but the understanding claimed by your assignment must be yours. If you include any material in this assignment that is not your own, you must acknowledge that fact and declare the source of that material. Be warned, your answers will be checked for plagiarism and if caught, significant penalties may apply.
Submission
Once you have completed the assignment, you must upload your R script file via Turnitin; if you wish, you can also e-mail your R script file directly to me. This may be wise if you are having trouble with Turnitin or vUWS and are at risk of submitting late. Once you have e-mailed, seek to successfully submit via Turnitin. Be aware that you may need to rename your R script file by adding the extension “.txt”, otherwise you may not be successful in submitting via Turnitin.
On a Windows machine you can easily add a “.txt” extension via File Explorer. Select the “View” tab and tick “File name extensions”; refer to figure 1. Then select the file to be renamed, press F2 to enter edit mode and add “.txt” to the very end of the file name; do not remove the “.R” portion of the file name.
Hopefully a similar process is available on other platforms. Determine the method you will use and test it prior to submission.
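On any platform where R itself is available, the renaming step can also be done from the R console. The following is only a sketch: the helper name addTxtExtension and the file name in the usage comment are illustrative, not part of the assignment. It copies the script rather than renaming it, so the original “.R” file is left untouched.

```r
# Copy a script to a new file with ".txt" appended, keeping the ".R" portion.
# addTxtExtension is a hypothetical helper name used for illustration only.
addTxtExtension <- function(scriptFile)
{
  target <- paste0(scriptFile, ".txt")          # e.g. "myScript.R.txt"
  file.copy(scriptFile, target, overwrite = TRUE)
  target                                        # path of the copy
}

# Usage (placeholder file name):
# addTxtExtension("myScript.R")
```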
You must submit your assignment no later than the due date declared on the first page of this assignment, otherwise late submission penalties will apply, as described in the section titled “Late submission penalties”. Prior to the due date, you may replace a previously submitted version, but only the last submitted version will be marked!
Figure 1: How to add “.txt” to the file extension on Windows
Late submission penalties exist. The contribution value of the assignment will reduce by 10% per day, for each day after the submission date; therefore four marks per day. For example, if your assignment is four days late, the maximum possible mark you can score for the assignment is 24 out of 40.
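The penalty arithmetic can be checked with a short calculation. This is a minimal sketch: the function name maxMark is illustrative, and capping the result at zero is an assumption, since the text only gives the four-day example.

```r
# Maximum achievable mark for a late submission.
# Assumes 10% of the 40 available marks (4 marks) is lost per day late and
# that the mark cannot fall below zero (the zero floor is an assumption).
maxMark <- function(daysLate, total = 40, ratePerDay = 0.10)
{
  pmax(total - total * ratePerDay * daysLate, 0)
}

maxMark(4)   # 24, matching the four-day example
```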
Question 1 (3 + 3 + 3 + 1 = 10)
You are to develop a simple gambling game and test what the average outcome is if you always bet $50.
(i)
Write the code necessary to perform a single turn of the game. The algorithm for the game is as follows
● Randomly choose a bet that is one of the following values: 10, 15, 20, 25, ..., 90, 95, 100
● Simulate the roll of a pair of fair dice
● Determine the outcome of the roll as follows
– Any of the following results in losing your bet: 11, 33, 55
– You receive twice your bet for any of the following: 22, 44
– You receive five times your bet for rolling a 66
– Any other roll outcome results in losing half your bet
● Tell the user what the bet and the return values are
● If the return value is twice the bet, also print the following message on a new line: You won money!
● If the return value is five times the bet, then also print the following message on a new line: Jackpot win!!!
Make sure your code is well organised and has sensible documentation in the form of comments; you will be expanding the capability of your code in the rest of this question. Also seek to make wise choices regarding variable names and code layout.
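The steps of the algorithm can be sketched as below. This is one possible reading rather than a model answer: variable names are illustrative, the roll is encoded as a two-digit number (a 1 and a 4 gives 14), and “losing half your bet” is assumed to mean that half the bet is returned.

```r
# One turn of the game (a sketch under the assumptions stated above)
bet  <- sample(seq(10, 100, by = 5), 1)    # random bet: 10, 15, ..., 100
dice <- sample(1:6, 2, replace = TRUE)     # roll a pair of fair dice
roll <- dice[1] * 10 + dice[2]             # two-digit outcome, e.g. 14

# Amount returned to the player for this roll
betReturn <- if (roll %in% c(11, 33, 55)) {
  0                                        # bet is lost
} else if (roll %in% c(22, 44)) {
  2 * bet
} else if (roll == 66) {
  5 * bet
} else {
  bet / 2                                  # assumed: half the bet comes back
}

cat(sprintf("Bet = %s, Return = %s\n", bet, betReturn))
if (betReturn == 2 * bet) cat("You won money!\n")
if (betReturn == 5 * bet) cat("Jackpot win!!!\n")
```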
(ii)
Wrap the dice simulation and return calculator you developed in (i) within a function that looks like
betResult <- function(bet = 50)
{
  # bet = bet to be made
  #
  # Simulate the roll of a pair of fair dice
  .
  .
  .
  # Determine the return from the bet
  .
  .
  .
  return(betReturn)
}
Insert the code you wrote in (i) to randomly determine a bet within another function, making use of the following function template
betGenerator <- function()
{
  # Randomly choose a bet within the following sequence
  # {10, 15, 20, 25 ... 90, 95, 100}
  .
  .
  .
  return(betAmount)
}
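As an illustration of how the two templates might be filled in, the sketch below uses switch() with fall-through labels to map a roll to a multiplier. The choice of switch() and the multiplier variable are assumptions made for this example, as is the reading that “losing half your bet” returns half the bet.

```r
betResult <- function(bet = 50)
{
  # bet = bet to be made
  dice <- sample(1:6, 2, replace = TRUE)         # roll a pair of fair dice
  roll <- as.character(dice[1] * 10 + dice[2])   # e.g. "14"

  # Multiplier applied to the bet; empty labels fall through to the next value,
  # and the final unnamed value is the default for every other roll
  multiplier <- switch(roll,
                       "11" = , "33" = , "55" = 0,
                       "22" = , "44" = 2,
                       "66" = 5,
                       0.5)
  betReturn <- bet * multiplier
  return(betReturn)
}

betGenerator <- function()
{
  # Randomly choose a bet within {10, 15, 20, 25 ... 90, 95, 100}
  betAmount <- sample(seq(10, 100, by = 5), 1)
  return(betAmount)
}
```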
(iii)
Making use of the functions created in (ii), create the following function
playGame <- function(turns = 10)
{
  # turns = number of bets to be made
  .
  .
  .
  return(betReturn)
}
in order to complete the entire functionality of your game as devised in (i), except using a specified number of turns. In this case, however, the function playGame() provides the following user output
● A single line of output for each iteration of the game, which looks as follows: Bet = 25, Dice outcome = 14, Winnings = 50
● A single line stating the final position for the player, for example, “lost $200”
(iv)
What is the overall position for the player after one hundred turns of the game, where every turn consists of a $50 bet? The position should consist of
● Total outlay
● Total winnings
● Overall profit
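A simulated answer will vary from run to run, so it can help to know the long-run expectation to compare against. The sketch below enumerates all 36 equally likely rolls for a $50 bet; it assumes, as one reading of the rules, that “losing half your bet” means half the bet is returned. Variable names are illustrative.

```r
# Expected position after 100 turns of a $50 bet, by enumerating every roll
bet   <- 50
rolls <- as.vector(outer(1:6, 1:6, function(a, b) a * 10 + b))  # 11, ..., 66

# Return for each of the 36 rolls
returns <- ifelse(rolls %in% c(11, 33, 55), 0,
           ifelse(rolls %in% c(22, 44), 2 * bet,
           ifelse(rolls == 66, 5 * bet, bet / 2)))

turns       <- 100
outlay      <- turns * bet             # 5000
expWinnings <- turns * mean(returns)   # about 3333.33 (two thirds of outlay)
expProfit   <- expWinnings - outlay    # about -1666.67
```

Under these assumptions the expected return is two thirds of each bet, so a simulated profit far from about -$1667 over 100 turns would be worth a second look.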
Question 2 (2 + 2 + 2 + 4 = 10)
For this exercise, you are to make use of the built-in dataset called iris. The first six lines of the dataset can be viewed as follows
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
More information on this dataset can be obtained within R by executing ?iris in the console or in Wikipedia - Iris flower data set. In answering this question, only use functions provided within the base R packages; do not install any other package.
(i)
Using just functional programming, determine the mean Sepal.Length for each species of iris flower. Hint, you only need a single and simple line of code. Using only one or two sentences, explain how your code works.
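One way to read “functional programming” here is the apply family of functions; that reading is an assumption, and the lectures may intend a specific function. Under it, a single line suffices: split() groups the sepal lengths by species and sapply() applies mean() to each group.

```r
# Mean Sepal.Length per species: split into groups, then apply mean to each
speciesMeans <- sapply(split(iris$Sepal.Length, iris$Species), mean)
speciesMeans
#     setosa versicolor  virginica
#      5.006      5.936      6.588
```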
(ii)
Using two different methods, repeat the exercise in (i), but without using functional programming. Using only one or two sentences, explain how your code works.
(iii)
Only using functional programming, determine the mean for each numeric column of the iris dataset, but according to each species. Therefore produce the following
             setosa versicolor virginica
Sepal.Length  5.006      5.936     6.588
Sepal.Width   3.428      2.770     2.974
Petal.Length  1.462      4.260     5.552
Petal.Width   0.246      1.326     2.026
Explain your code using no more than three simple sentences.
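A table of this shape can be produced, for example, by splitting the four numeric columns by species and applying colMeans() to each piece. This is a sketch of one approach, not necessarily the one intended by the lectures; the name speciesMeans is illustrative.

```r
# Split the numeric columns by species, then take column means of each piece;
# sapply() binds the three named vectors into a 4 x 3 matrix
speciesMeans <- sapply(split(iris[, 1:4], iris$Species), colMeans)
speciesMeans
```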
(iv)
Using the output of (iii), write code to build a tree structure, which contains the above output data. The tree structure is described as follows
● There are three branches off the root and each represents a particular species
● Each species branch breaks into the following two branches: Sepal and Petal
● The Sepal branch breaks into two branches consisting of: Length and Width
● Similarly, the Petal branch breaks into two branches consisting of: Length and Width
● The root of the tree consists of just the node, while the other end consists of 12 branches
You do not need to visualise the tree structure, just write code to create it.
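The branch layout described above can be sketched with nested lists, assuming a nested list is an acceptable representation of the tree. The names speciesMeans and tree are illustrative, and the matrix of means is recomputed here only so that the sketch runs on its own.

```r
# Means per variable (rows) and species (columns), as in part (iii)
speciesMeans <- sapply(split(iris[, 1:4], iris$Species), colMeans)

# Root -> species -> Sepal/Petal -> Length/Width
tree <- lapply(colnames(speciesMeans), function(sp) {
  m <- speciesMeans[, sp]                  # named vector for one species
  list(Sepal = list(Length = m[["Sepal.Length"]], Width = m[["Sepal.Width"]]),
       Petal = list(Length = m[["Petal.Length"]], Width = m[["Petal.Width"]]))
})
names(tree) <- colnames(speciesMeans)

tree$setosa$Petal$Length   # 1.462
```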
Question 3 (2 + 5 + 3 = 10)
Here we will perform some simple analysis of data regarding the quality of different red wines. The data is located on vUWS in the file called “wineQuality-red.csv”. Further details for this dataset can be found at UCI - Wine Quality Data Set. The goal is not to become a wine expert, rather to do some simple intuitive investigation.
Load the dataset and do some basic exploration to familiarise yourself with it.
(i)
Write code to produce a single box plot that shows alcohol versus each wine quality. Give the plot a reasonable appearance, hence having a title, axis labels and using colours. Repeat for residual sugar versus quality and density versus quality. Using two simple sentences, state which plot shows the greatest connection and which shows the worst connection with quality.
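A grouped box plot of this kind can be drawn with the formula interface of boxplot(). The sketch below uses randomly generated placeholder data so that it runs on its own; the values are not the wine data, and the column names alcohol and quality are assumed to match the real file.

```r
# Placeholder data only: replace with the real file, for example
#   df <- read.csv("wineQuality-red.csv")   # separator may need adjusting
set.seed(1)
df <- data.frame(quality = sample(3:8, 200, replace = TRUE),
                 alcohol = runif(200, 8, 15))

# One box per quality level, with a title, axis labels and colours
bp <- boxplot(alcohol ~ quality, data = df,
              main = "Alcohol by wine quality",
              xlab = "Quality", ylab = "Alcohol",
              col  = rainbow(length(unique(df$quality))))
```

The same call with a different left-hand side of the formula gives the other two plots.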
(ii)
Using the coding method described in lecture 6, write code to reproduce the visualisation shown in figure 2.
Figure 2: Various mean wine variables versus quality
Note that your visualisation does not have to match exactly; in essence, just show the same information.
(iii)
There is a built-in function in R called cor(), which determines the correlation between two variables. More information can be found at Wikipedia - Correlation. For example, the correlation between fixed.acidity and volatile.acidity can be calculated as follows
> cor(df$fixed.acidity, df$volatile.acidity)
[1] -0.2561309
where df contains the complete dataset for this question. Correlation basically means
● 1.0 = Perfect correlation or relationship between the variables; e.g. y = x
● 0.5 = Not perfect correlation
● 0.1 = Weak correlation
● 0 = No correlation; e.g. y unrelated to x
● -0.1 = Weak inverse correlation
● -0.5 = Not perfect inverse correlation
● -1.0 = Perfect inverse correlation; e.g. y = -x
Ignoring the sign of the correlation, your task is to write R code to find the complete set of correlations for every pair of variables in the wine dataset. Place this data within a matrix data type. Then find which variable pair in the matrix has the best and worst correlation.
Yes, you could just do cor(df), but that will get you zero marks! Furthermore, you are restricted to methods presented in the lectures and tutorials. Therefore only use basic coding methods, rather than finding some package that already does this. Place your code in a sensibly named function.
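The pairwise approach can be sketched with two nested loops that call cor() on one pair of columns at a time. This is an illustration only: the function name corMatrix is hypothetical, every column is assumed to be numeric, and the numeric columns of iris stand in for the wine data frame so that the code runs on its own.

```r
corMatrix <- function(data)
{
  # data = data frame whose columns are all numeric
  vars <- names(data)
  n    <- length(vars)
  res  <- matrix(NA_real_, n, n, dimnames = list(vars, vars))
  for (i in seq_len(n))
    for (j in seq_len(n))
      res[i, j] <- abs(cor(data[[i]], data[[j]]))   # sign ignored
  res
}

m <- corMatrix(iris[, 1:4])   # stand-in data, not the wine dataset
diag(m) <- NA                 # ignore each variable paired with itself

# Each pair appears twice because the matrix is symmetric
which(m == max(m, na.rm = TRUE), arr.ind = TRUE)   # strongest pair
which(m == min(m, na.rm = TRUE), arr.ind = TRUE)   # weakest pair
```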
Question 4 (4 + 3 + 3 = 10)
The goal of this question is to do basic programming and to gain insight into how the functions we have used work; this includes a deeper look into visualisations (e.g. scatter plots). Much of the coding will be minimal, but what is needed is thought into what is happening. Therefore, show your thought and understanding by the comments you write; not a lot of marks will be awarded for code-only answers. Keep your comments short, focused and relevant.
(i)
The quantile() function is rather sophisticated, but the basics are quite simple. The goal here is to reproduce the basics of the quantile() function. Create your version of the quantile() function as follows
myQuantile <- function(x, probs = seq(0, 1, 0.25), na.rm = FALSE)
{
  # sensible comments
  #
  .
  .
  .
  return(res)
}
Note that there is a type parameter in the built-in version of the quantile() function, which determines the fine details of its internal operation. We will just keep it simple and ignore the existence of that parameter. However, I recommend you use the quantile() function in order to sanity check your work.
Using the iris dataset I obtained the following
> quantile(iris$Sepal.Length)
  0%  25%  50%  75% 100%
 4.3  5.1  5.8  6.4  7.9
> myQuantile(iris$Sepal.Length)
  0%  25%  50%  75% 100%
 4.3  5.1  5.8  6.4  7.9
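One way the basics might be reproduced is linear interpolation between neighbouring sorted values, which is what the default type of quantile() does. The sketch below is an illustration under that assumption, not the expected answer; the internal names h, lo and hi are arbitrary.

```r
myQuantile <- function(x, probs = seq(0, 1, 0.25), na.rm = FALSE)
{
  # Mimic quantile(): refuse missing values unless asked to drop them
  if (na.rm) x <- x[!is.na(x)]
  if (anyNA(x)) stop("missing values not allowed when na.rm is FALSE")

  x  <- sort(x)
  h  <- (length(x) - 1) * probs + 1   # fractional position of each quantile
  lo <- floor(h)                      # index of the value just below
  hi <- ceiling(h)                    # index of the value just above

  # Interpolate linearly between the two neighbouring sorted values
  res <- x[lo] + (h - lo) * (x[hi] - x[lo])
  names(res) <- paste0(probs * 100, "%")
  return(res)
}

myQuantile(iris$Sepal.Length)
```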
(ii)
Now create a crude version of the boxplot function; an example output is shown in figure 3.
Figure 3: The output of our crude boxplot function is shown in the top image, while the result of boxplot() is shown in the bottom. The dataset used was iris$Sepal.Length
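Since figure 3 is not reproduced here, the following is only a guess at what “crude” might involve: a box from the first to the third quartile, a median line, and whiskers drawn straight to the minimum and maximum. Unlike boxplot(), it makes no attempt to limit the whiskers or mark outliers. The function name myBoxplot and all layout values are illustrative.

```r
myBoxplot <- function(x)
{
  q <- quantile(x)                          # min, Q1, median, Q3, max

  # Empty plotting region sized to the data; no x axis is needed
  plot(NA, xlim = c(0, 2), ylim = range(x),
       xaxt = "n", xlab = "", ylab = "")

  rect(0.6, q[2], 1.4, q[4])                # box spans Q1 to Q3
  segments(0.6, q[3], 1.4, q[3], lwd = 3)   # median line
  segments(1, q[1], 1, q[2], lty = 2)       # lower whisker to the minimum
  segments(1, q[4], 1, q[5], lty = 2)       # upper whisker to the maximum
  invisible(q)                              # return the five values quietly
}

myBoxplot(iris$Sepal.Length)
```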
2023-03-22