Programming for Data Science (COMP7024)

Assignment

2023 Q1

Due 1st of March 2023

Introduction

This assignment consists of four questions, each of equal value, giving a total contribution of 40% to this subject. The beginning of each question provides a breakdown of marks for each part in that question. For example, a breakdown of (1 + 3 + 6 = 10) implies a question consisting of three parts, where the first, second and third parts are worth 1, 3 and 6 marks respectively.

IMPORTANT

Only R, and only those packages (that is, R libraries) described in the lectures and tutorials for this subject, may be used to generate answers for this assignment! In addition, RMarkdown must not be used. Penalties may apply for non-conformance.

Answer structure

In doing this assignment, you should not seek to use the maximum word limit declared in the Learning Guide. Marks are not awarded for using many words, but rather for using an economy of words and stating only what is relevant to the question being answered. Consider the old adage: “less can be more”. Therefore, show you know what is relevant and have mastered the ability to get to the point using few, simple words and clear sentences. Seek to also apply this philosophy to the code you write.

Your answers to this assignment are to be provided in a single R script file. All material in your script file should be logically organised, so that related material can be easily and quickly located. Clearly identify yourself in this file: as a minimum, give your full name and student ID as comments at the beginning of the file.

R script file

Textual answers should be included as comments in your script file; refer to listing 1 for examples of including comments. The comments in your script file should be:

● Brief and to the point

● Stating a high level perspective

● Stating what is not immediately obvious, but worth mentioning

# In an R script file, comments are prefixed with the hash symbol
# A line with just a comment on it

# Generate a distribution of mean values from a sequence of digits
d <- replicate(1000,
{
  s <- sample(0:9, replace = TRUE)  # Generate a sequence of 10 numeric digits
  mean(s)                           # A comment to end a line with R code on it
})
hist(d, main = 'Distribution of means')  # Show distribution of means

# Of course you would use smarter comments than those used here
# Only state what is not immediately obvious

Listing 1: Some R code with comments (shown in green)

Be brief and to the point with respect to comments. The approach described in this section should be the same as that used for the exam. Make judicious use of comments a priority in doing this assignment. After all, comments are meant to communicate important and useful details. Make sure you also communicate well through wise choices in variable and function names. Also make wise decisions regarding the layout of everything inside your R script file. Note that if things go wrong, good organisation and comments can help you, since they can show whether appropriate logic was intended.

Plagiarism

This is an individual effort assignment, therefore the answers you provide must be your own. You may learn from others, but the understanding claimed by your assignment must be yours. If you include any material in this assignment that is not your own, you must acknowledge that fact and declare the source of that material. Be warned, your answers will be checked for plagiarism and, if plagiarism is found, significant penalties may apply.

Submission

Once you have completed the assignment, you must upload your R script file via Turnitin; if you wish, you can also e-mail your R script file directly to me. This may be wise if you are having trouble with Turnitin or vUWS and are at risk of submitting late. Once you have e-mailed, still seek to submit successfully via Turnitin. Be aware that you may need to rename your R script file by adding the extension “.txt”, otherwise you may not be successful in submitting via Turnitin.

On a Windows machine you can easily add a “.txt” extension via File Explorer. Select the “View” tab and tick “File name extensions” (refer to figure 1). Then select the file to be renamed, press F2 to enter edit mode and add “.txt” to the very end of the file name; do not remove the “.R” portion of the file name.

Hopefully a similar process is available on other platforms. Determine the method you will use and test it prior to submission.

You must submit your assignment no later than the due date declared on the first page of this assignment, otherwise late submission penalties will apply, as described in the section titled “Late submission penalties”. Prior to the due date, you may replace a previously submitted version, but only the last submitted version will be marked!

Figure 1: How to add “.txt” to the file extension on Windows

Late submission penalties

Late submission penalties exist. The contribution value of the assignment will reduce by 10% per day, for each day after the submission date; therefore four marks per day. For example, if your assignment is four days late, the maximum possible mark you can score for the assignment is 24 out of 40.
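
The penalty arithmetic above can be checked with a short R sketch. The function name maxMark and its arguments are hypothetical, introduced here only for illustration, and the floor at zero is an assumption, since the text does not say what happens after ten days.

```r
# Hypothetical helper: maximum mark still available after a late submission
# daysLate = number of days after the submission date
# total    = marks the assignment is worth (40 in this subject)
# penalty  = fraction of the total lost per day late (10%)
maxMark <- function(daysLate, total = 40, penalty = 0.10)
{
  # Assumption: the available mark cannot drop below zero
  max(total - daysLate * penalty * total, 0)
}

maxMark(4)   # four days late, the case used in the example above
```

With the default arguments each day late costs four marks, which matches the figure quoted in the text.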

Question 1 (3 + 3 + 3 + 1 = 10)

You are to develop a simple gambling game and test what the average outcome is if you always bet $50.

(i)

Write the code necessary to perform a single turn of the game. The algorithm for the game is as follows:

● Randomly choose a bet that is one of the following values: 10, 15, 20, 25, ..., 90, 95, 100

● Simulate the roll of a pair of fair dice

● Determine the outcome of the roll as follows:

– Any of the following results in losing your bet: 11, 33, 55

– You receive twice your bet for any of the following: 22, 44

– You receive five times your bet for rolling a 66

– Any other roll outcome results in losing half your bet

● Tell the user what the bet and the return values are

● If the return value is twice the bet, also print the following message on a new line: You won money!

● If the return value is five times the bet, then also print the following message on a new line: Jackpot win!!!

Make sure your code is well organised and has sensible documentation in the form of comments; you will be expanding the capability of your code in the rest of this question. Also seek to make wise choices regarding variable names and code layout.
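
The roll and payout rules listed in part (i) can be sketched as follows. This is a minimal illustration under stated assumptions, not a model answer: the names rollDice and payout are hypothetical, the roll is assumed to be written as a two-digit code formed from the two dice (as in "Dice outcome = 14"), and "losing half your bet" is assumed to mean that half the bet is returned. The required messages are left out.

```r
# Hypothetical helper: roll two fair dice and join them into a two-digit code
rollDice <- function()
{
  dice <- sample(1:6, size = 2, replace = TRUE)
  10 * dice[1] + dice[2]          # e.g. a 1 and a 4 gives 14
}

# Hypothetical helper: amount returned to the player for a given bet and roll
payout <- function(bet, roll)
{
  if (roll %in% c(11, 33, 55)) {
    0                             # the bet is lost
  } else if (roll %in% c(22, 44)) {
    2 * bet                       # twice the bet
  } else if (roll == 66) {
    5 * bet                       # five times the bet
  } else {
    bet / 2                       # assumption: half the bet is returned
  }
}

bets <- seq(10, 100, by = 5)      # the allowed bets: 10, 15, ..., 100
bet  <- sample(bets, size = 1)
roll <- rollDice()
cat("Bet =", bet, " Roll =", roll, " Return =", payout(bet, roll), "\n")
```

Keeping the payout rule in a function with no randomness inside it makes it easy to check each case by hand before combining it with the random roll.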

(ii)

Wrap the dice simulation and return calculator you developed in (i) within a function that looks like

betResult <- function(bet = 50)
{
  # bet = bet to be made

  # Simulate the roll of a pair of fair dice

  # Determine the return from the bet

  return(betReturn)
}

Insert the code you wrote in (i) to randomly determine a bet within another function, making use of the following function template

betGenerator <- function()
{
  # Randomly choose a bet within the following sequence
  # {10, 15, 20, 25, ..., 90, 95, 100}

  return(betAmount)
}

(iii)

Making use of the functions created in (ii), create the following function

playGame <- function(turns = 10)
{
  # turns = number of bets to be made

  return(betReturn)
}

in order to complete the entire functionality of your game as devised in (i), except using a specified number of turns. However, in this case the function playGame() provides the following user output

● A single line of output for each iteration of the game, which looks as follows: Bet = 25, Dice outcome = 14, Winnings = 50

● A single line stating the final position for the player, for example, “lost $200”

(iv)

What is the overall position for the player after one hundred turns of the game, where every turn consists of a $50 bet? The position should consist of

● Total outlay

● Total winnings

● Overall profit

Question 2 (2 + 2 + 2 + 4 = 10)

For this exercise, you are to make use of the built-in dataset called iris. The first six lines of the dataset can be viewed as follows

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

More information on this dataset can be obtained within R by executing ?iris in the console, or from Wikipedia - Iris flower data set. In answering this question, only use functions provided within the R base package; do not install any other package.

(i)

Using just functional programming, determine the mean Sepal.Length for each species of iris flower. Hint: you only need a single, simple line of code. Using only one or two sentences, explain how your code works.
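
As an illustration of the functional style on a different built-in dataset, the sketch below computes a mean per group without an explicit loop. The dataset mtcars and the variable name mpgByCyl are chosen here only as an example and are not part of the assignment.

```r
# Mean miles-per-gallon for each cylinder count in mtcars (illustrative dataset)
# tapply() splits the first vector by the grouping vector and applies mean()
mpgByCyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
mpgByCyl

# The same means built in two visible steps: split() into groups, then sapply()
sapply(split(mtcars$mpg, mtcars$cyl), mean)
```

Both forms pass the function mean as an argument to another function, which is the essence of the functional approach in base R.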

(ii)

Using two different methods, repeat the exercise in (i), but without using functional programming. Using only one or two sentences, explain how your code works.

(iii)

Only using functional programming, determine the mean for each numeric column of the iris dataset, but according to each species. Therefore produce the following

             setosa versicolor virginica
Sepal.Length  5.006      5.936     6.588
Sepal.Width   3.428      2.770     2.974
Petal.Length  1.462      4.260     5.552
Petal.Width   0.246      1.326     2.026

Explain your code using no more than three simple sentences.

(iv)

Using the output of (iii), write code to build a tree structure, which contains the above output data. The tree structure is described as follows

● There are three branches off the root and each represents a particular species

● Each species branch breaks into the following two branches: Sepal and Petal

● The Sepal branch breaks into two branches consisting of: Length and Width

● Similarly, the Petal branch breaks into two branches consisting of: Length and Width

● The root of the tree consists of just the node, while the other end consists of 12 branches

You do not need to visualise the tree structure, just write code to create it.

Question 3 (2 + 5 + 3 = 10)

Here we will perform some simple analysis of data regarding the quality of different red wines. The data is located on vUWS in the file called “wineQuality-red.csv”. Further details for this dataset can be found at UCI - Wine Quality Data Set. The goal is not to become a wine expert, rather to do some simple intuitive investigation.

Load the dataset and do some basic exploration and familiarization of it.

(i)

Write code to produce a single box plot that shows alcohol versus each wine quality. Give the plot a reasonable appearance, hence having a title, axis labels and using colours. Repeat for residual sugar versus quality and density versus quality. Using two simple sentences, which plot shows the greatest connection and worst connection with quality?

(ii)

Using the coding method described in lecture 6, write code to reproduce the visualisation shown in figure 2.

Figure 2: Various mean wine variables versus quality

Note that your visualisation does not have to match exactly; in essence, just show the same information.

(iii)

There is a built-in function in R called cor(), which determines the correlation between two variables. More information can be found at Wikipedia - Correlation. For example, the correlation between fixed.acidity and volatile.acidity can be calculated as follows

> cor(df$fixed.acidity, df$volatile.acidity)
[1] -0.2561309

where df contains the complete dataset for this question. Correlation basically means

● 1.0 = Perfect correlation or relationship between the variables; e.g. y = x

● 0.5 = Not perfect correlation

● 0.1 = Weak correlation

● 0 = No correlation; e.g. y x

● -0.1 = Weak inverse correlation

● -0.5 = Not perfect inverse correlation

● -1.0 = Perfect inverse correlation; e.g. y = -x
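
The two ends of this scale can be reproduced with a few lines of R. This is only a sketch on made-up data: the vectors x and noise are arbitrary values chosen for illustration and have nothing to do with the wine dataset.

```r
x     <- 1:10
noise <- c(3, -1, 4, -1, 5, -9, 2, -6, 5, -3)   # arbitrary made-up values

cor(x, x)            # perfect correlation, y = x
cor(x, -x)           # perfect inverse correlation, y = -x
cor(x, x + noise)    # an imperfect relationship, between -1 and 1
abs(cor(x, -x))      # the same strength once the sign is ignored
```

Taking abs() of the result is one simple way to compare the strength of two relationships regardless of their direction.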

Ignoring the sign of the correlation, your task is to write R code to find the complete set of correlations for every pair of variables in the wine dataset. Place this data within a matrix data type. Then find which variable pair in the matrix has the best and worst correlation.

Yes you could just do cor(df), but that will get you zero marks! Furthermore, you are restricted to methods presented in the lectures and tutorials. Therefore only use basic coding methods, rather than finding some package that already does this. Place your code in a sensibly named function.

Question 4 (4 + 3 + 3 = 10)

The goal of this question is to do basic programming and to gain insight into how the functions we have used work; this includes a deeper look into visualisations (e.g. scatter plots). Much of the coding will be minimal, but what is needed is thought into what is happening. Therefore, show your thought and understanding by the comments you write; not a lot of marks will be awarded for code only answers. But make your comments: short, focused and relevant.

(i)

The quantile() function is rather sophisticated, but the basics are quite simple. The goal here is to reproduce the basics of the quantile() function. Create your version of the quantile() function as follows

myQuantile <- function(x, probs = seq(0, 1, 0.25), na.rm = FALSE)
{
  # sensible comments
  #
  .
  .
  .

  return(res)
}

Note that there is a type parameter in the built-in version of the quantile() function, which determines the fine details of its internal operation. We will just keep it simple and ignore the existence of that parameter. However, I recommend you use the quantile() function in order to sanity check your work.

Using the iris dataset I obtained the following

> quantile(iris$Sepal.Length)
  0%  25%  50%  75% 100%
 4.3  5.1  5.8  6.4  7.9

> myQuantile(iris$Sepal.Length)
  0%  25%  50%  75% 100%
 4.3  5.1  5.8  6.4  7.9
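
When sanity checking, it helps to know that quantile() returns a named numeric vector, so both the names and the values need to agree. The sketch below inspects the built-in result only; the variable names x and q are hypothetical, and the commented-out comparison assumes a myQuantile() of your own already exists.

```r
x <- iris$Sepal.Length
q <- quantile(x, probs = seq(0, 1, 0.25))

names(q)     # the labels: "0%" "25%" "50%" "75%" "100%"
unname(q)    # the values without their labels

# Assumed usage once your own function has been written:
# all.equal(myQuantile(x), quantile(x))
```

all.equal() compares both names and values with a small numeric tolerance, which is usually more forgiving than identical() for results that involve floating point arithmetic.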

(ii)

Now create a crude version of the boxplot function; an example output is shown in figure 3.

Figure 3: The output of our crude boxplot function is shown in the top image, while the result of boxplot() is shown in the bottom. The dataset used was iris$Sepal.Length
