DATA423-24S1 Assignment 2

20% of your final grade

Summary

This assignment has two parts, coding a shiny app and producing a report.

Disclaimer: The data used in this assignment is not genuine; it has been artificially constructed to have interesting characteristics and challenges embedded in it.

Part 1 - Coding

Create a Shiny app using RStudio. Load the supplied comma-separated values (CSV) file and use Shiny to:

1. Summarise and visualise the data and perform Exploratory Data Analysis. Make good use of controls. This is a minor part of the assignment so borrow from assignment 1 if you can.

2. From this evidence and the supplied background (see below) develop a strategy to deal with missing values. The strategy can change for different variables.

3. From this evidence and the supplied background develop a strategy to deal with outliers. The strategy can change for different variables.

4. Develop a pre-processing strategy for things like centring and scaling.

5. Implement these strategies using a “recipes” based data processing pipeline.

6. Develop a tuned glmnet model and visualise its test performance. Document the model’s optimal hyper-parameters. Note that you DO NOT need to explore other methods for this assignment - just glmnet.

7. Identify any residual outliers. Think about how to show the train and test residuals.

The submission should be a set of files (ui.R, server.R and global.R) that we should be able to run and grade without needing to make any changes. Submit these files as a compressed ZIP file.

Part 2 - Report

Write a report on your modelling. Include appropriate images from your shiny app.

1. Discuss the data and any curious features that you noticed. Record the issues you would have followed up with a domain expert, were one available.

2. Document and justify your various strategies using words (rather than code).

3. Research the glmnet method and briefly explain this method in your report.

4. Document your glmnet model’s theoretical performance on unseen data.

Submit your report as a PDF, separately from the ZIP file.

The Background

Covid-19 data, all measurements are as at 2019. The supplied CSV contains the following variables:

CODE	Anonymised state or country
GOVERN_TYPE	Type of government: "STABLE DEM", "UNSTABLE DEM", "DICTATORSHIP", "OTHER"
POPULATION	Total population
AGE25_PROPTN	The proportion of the population that is at or below 25
AGE_MEDIAN	The median age of the population
AGE50_PROPTN	The proportion of the population that is at or above 50
POP_DENSITY	The population density
GDP	The Gross National Product
INFANT_MORT	The infant mortality rate
DOCS	The number of doctors per 10,000
VAX_RATE	The mean vaccination rate for Covid-19
HEALTHCARE_BASIS	Type of healthcare system "INSURANCE", "PRIVATE", "FREE"
HEALTHCARE_COST	Healthcare costs per person where applicable
DEATH_RATE	The projected death rate (across ten years)
OBS_TYPE	The allocation to test or train

The outcome variable is the DEATH_RATE.

The Details

Steps

1. Create a shell of a Shiny app. Plan whether you want a sidebar/main panel layout, a fluid page layout, or something more ambitious. Design how the user would progressively get more information by interacting with the page. Try to avoid long pages; instead use a tabset to control your navigation through the charts. You can continue developing your app from assignment 1 if you feel this is a good starting point.

Add your name to the title part of the UI so it is clear to see whose app is running.

2. Identify all possible missing value placeholders, e.g. '', 'na', 'N/A', -1, -99, etc.
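
One way to approach this step is to tabulate how often each suspected placeholder occurs in each column before deciding what to treat as missing. The sketch below uses base R only. The helper name count_placeholders, the candidate list and the toy data frame are assumptions made for illustration; they are not part of the supplied data.

```r
# Hypothetical helper: count occurrences of candidate placeholders per column.
count_placeholders <- function(df, candidates = c("", "na", "N/A", "-1", "-99")) {
  counts <- sapply(df, function(col) {
    # Compare as trimmed text so that blank strings such as " " match ""
    values <- trimws(as.character(col))
    sapply(candidates, function(p) sum(values == p, na.rm = TRUE))
  })
  # Quote the row labels so the empty-string placeholder stays visible
  rownames(counts) <- paste0("'", candidates, "'")
  counts
}

# Toy data (illustrative only), read as text so nothing is converted yet
toy <- data.frame(
  GDP  = c("1200", "-99", "N/A", "3400"),
  DOCS = c("12", "na", "-1", " "),
  stringsAsFactors = FALSE
)

placeholder_counts <- count_placeholders(toy)
print(placeholder_counts)
```

Rows of the result are the candidate placeholders and columns are the variables, so a non-zero cell suggests a value worth adding to na.strings or replacing with NA.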

3. Place the CSV file in the same location as the ui.R, server.R files. Load the CSV file using something like:

dat <- read.csv("data.csv", header = TRUE, na.strings = c("NA", "N/A"), stringsAsFactors = TRUE)

4. Additionally replace numeric missing values with NA using something like:

dat[dat == -1] <- NA

5. Identify any categorical missing values that are “Not Applicable” and create new levels for these values using something like:

# convert away from factor
dat$cat <- as.character(dat$cat)
# give the "Not Applicable" values their own level
dat$cat[is.na(dat$cat)] <- "none"
# convert back to factor
dat$cat <- as.factor(dat$cat)

6. Identify any numeric missing values that are “Not Applicable”, then create a shadow variable that flags them and assign them a value using something like:

# create a shadow variable
dat$num_shadow <- as.numeric(is.na(dat$num))
# assign missing to zero
dat$num[is.na(dat$num)] <- 0

7. Provide a set of EDA visualisations of the data set. Make a note of any curious things and add these to your report. Use lots of controls to vary the behaviour and scope of the visualisations. Remember to align the visualisation style to the variable type.

8. Identify any excessively missing variables for some threshold. Remove these.

9. Identify any excessively missing observations for some threshold. Remove these.
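
Steps 8 and 9 can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper pMiss is written here as a percentage of missing values, the two thresholds are arbitrary, and the toy data frame stands in for the supplied CSV.

```r
# Hypothetical helper: percentage of missing values in a vector.
pMiss <- function(x) {
  sum(is.na(x)) / length(x) * 100
}

# Toy data (illustrative only): column B is 75% missing.
toy <- data.frame(
  A = c(1, 2, 3, NA),
  B = c(NA, NA, 5, NA),
  C = c("x", "y", "z", "w")
)

var_thresh <- 50  # assumed threshold: keep variables with < 50% missing
obs_thresh <- 50  # assumed threshold: keep observations with < 50% missing

# Step 8: per-variable missingness, then drop the excessive columns
v_ratio <- sapply(toy, pMiss)
kept <- toy[, v_ratio < var_thresh, drop = FALSE]

# Step 9: per-observation missingness, computed on the remaining columns
o_ratio <- rowMeans(is.na(kept)) * 100
kept <- kept[o_ratio < obs_thresh, , drop = FALSE]

print(kept)
```

Removing variables first changes the denominator for each observation, so the order of these two operations affects which rows survive. Here row 4 is dropped only because it is 50% missing across the two remaining columns.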

10. Determine if the missingness has a pattern. Try something like:

library(caret)
library(rpart)
library(rpart.plot)
# number of missing variables in each observation
dat$MISSINGNESS <- apply(X = is.na(dat), MARGIN = 1, FUN = sum)
# CODE and OBS_TYPE are excluded as predictors
tree <- train(MISSINGNESS ~ . - CODE - OBS_TYPE,
  data = dat,
  method = "rpart",
  na.action = na.rpart)
rpart.plot(tree$finalModel,
  main = "TUNED: Predicting the number of missing variables in an observation",
  roundint = TRUE,
  clip.facs = TRUE)

11. Create a test-train split. Something like:

train <- dat[dat$OBS_TYPE == "Train", ]
test <- dat[dat$OBS_TYPE == "Test", ]

12. Develop a recipe-based processing pipeline. Something like:

library(recipes)
# ID is not a predictor
# OBS_TYPE is not a predictor
# step_impute_knn() is the current name of step_knnimpute()
rec <- recipes::recipe(Target ~ ., data = dat) %>%
  update_role("ID", new_role = "id") %>%
  update_role("OBS_TYPE", new_role = "split") %>%
  step_impute_knn(all_predictors(), neighbors = 5) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

13. Feed your recipe into a training operation that will optimise the hyperparameters of glmnet by resampling the train data. Do some research on the caret package. Choose a suitable metric to use in evaluating the model. Something like:

library(caret)
library(glmnet)
model <- caret::train(rec, data = train, method = "glmnet")
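
The call above relies on caret's defaults. A slightly fuller sketch of tuning by resampling might look like the following. It assumes the caret, recipes and glmnet packages are installed; the synthetic data, the names train_toy, rec_toy and model_toy, the choice of 5-fold cross-validation and the grid values are all illustrative assumptions rather than recommended settings.

```r
library(caret)
library(recipes)

set.seed(1)
# Synthetic stand-in for the training data (illustrative only)
n <- 120
train_toy <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
train_toy$Target <- 2 * train_toy$X1 - train_toy$X2 + rnorm(n, sd = 0.5)

# Minimal recipe: centre and scale the numeric predictors
rec_toy <- recipe(Target ~ ., data = train_toy) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())

# Resample the training data only: 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)

# Candidate hyperparameters: alpha mixes ridge (0) and lasso (1),
# lambda is the strength of the penalty
grid <- expand.grid(
  alpha  = c(0, 0.5, 1),
  lambda = 10^seq(-3, 0, length.out = 10)
)

model_toy <- train(
  rec_toy,
  data      = train_toy,
  method    = "glmnet",
  metric    = "RMSE",
  trControl = ctrl,
  tuneGrid  = grid
)

# Best alpha and lambda under the chosen metric
print(model_toy$bestTune)
```

The bestTune element holds the selected hyperparameters, and the results element holds the resampled metric for every row of the grid, which is a convenient source for a tuning plot.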

14. Generate the predictions for the test cases. Generate an appropriate visualisation for these predictions.

15. Display the test-RMSE statistic.

16. Generate a residual box-plot for the test data, the train data and both the test & train data and label the outliers based upon a slider for the IQR-multiplier.
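
Steps 15 and 16 can be sketched in base R as below. The function names rmse and flag_outliers and the toy vectors are assumptions for illustration. In the app, the coef argument would typically be driven by the IQR-multiplier slider, and the residuals would come from the model's predictions on the train and test data.

```r
# Root mean squared error between observed and predicted values
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Flag residuals that fall outside the boxplot whiskers, i.e. more than
# `coef` times the box height beyond the hinges (the rule boxplot() uses)
flag_outliers <- function(residuals, coef = 1.5) {
  limits <- boxplot.stats(residuals, coef = coef)$stats[c(1, 5)]
  residuals < limits[1] | residuals > limits[2]
}

# Toy values (illustrative only)
observed  <- c(10, 12, 11, 13, 30)
predicted <- c(11, 11, 12, 12, 13)
res <- observed - predicted

test_rmse <- rmse(observed, predicted)
is_outlier <- flag_outliers(res, coef = 1.5)

print(test_rmse)
print(which(is_outlier))  # positions of the residuals to label on the plot
```

Raising the multiplier widens the whiskers, so fewer residuals are labelled; with a large enough value none are.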

Considerations

Here are some things to consider:

Interactive

The app should allow you to choose your strategy. By trying different strategies it should be possible to quantify whether they improve the model or not. It is not sufficient to just hard-code your optimal decisions.

How?

How can you make an app that allows your variable-missingness-threshold (for example) to affect your choice of whether to centre and scale? The key to this is to use reactive expressions.

Reactive expressions

The code below is a reactive expression that puts the data through a variable-missingness cleaning process:

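getClean <- reactive({
  # this is another reactive expression
  d <- getData()
  # process columns
  # (pMiss is assumed to be a user-defined helper that measures the missingness of a vector)
  vRatio <- apply(d, 2, pMiss)
  d <- d[, vRatio < input$VarThresh, drop = FALSE]
  # process rows
  oRatio <- apply(d, 1, pMiss)
  d <- d[oRatio < input$ObsThresh, , drop = FALSE]
  # return the cleaned data
  d
})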

Whenever the function getClean() is called in Shiny code, the latest values of the VarThresh and ObsThresh sliders are used to generate the data.

Strategies

It may be necessary to have a cascade of reactive expressions that perform, in an appropriate sequence, the various strategies needed to optimally clean the data for a glmnet model.
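
A minimal sketch of such a cascade is given below, assuming the shiny package is installed. The reactiveValues object stands in for the app's input so the chain can be evaluated outside a running app. The toy data, the Centre control and the getProcessed name are hypothetical and only illustrate how each stage depends on the one before it.

```r
library(shiny)

# Stand-in for the app's `input` object (illustrative only);
# in a real app these values come from UI controls
input <- reactiveValues(VarThresh = 50, Centre = TRUE)

# Stage 1: raw data (a toy data frame instead of the supplied CSV)
getData <- reactive({
  data.frame(A = c(1, 2, 3, NA), B = c(NA, NA, 5, NA))
})

# Stage 2: remove excessively missing variables;
# depends on stage 1 and on the threshold control
getClean <- reactive({
  d <- getData()
  vRatio <- colMeans(is.na(d)) * 100
  d[, vRatio < input$VarThresh, drop = FALSE]
})

# Stage 3: optional centring; depends on stage 2 and on the checkbox
getProcessed <- reactive({
  d <- getClean()
  if (input$Centre) {
    d[] <- lapply(d, function(x) x - mean(x, na.rm = TRUE))
  }
  d
})

# isolate() allows the chain to be evaluated outside a running app
processed <- isolate(getProcessed())
print(processed)
```

Because each stage calls the previous one, changing a control invalidates only the stages downstream of it, and anything that uses getProcessed() picks up the new result.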

Marking

This assignment is worth 20% of your final grade.

We mark the assignment out of 100 with 65% of the mark for the Shiny app and 35% for the report.

The Shiny App should run without errors. Test that it does.

Your code, and the use of your app, should allow exploration of different strategies.

The PDF document should show consistent and correct thinking about outliers, missing data, pre-processing steps and the assessment of a model.

The order of steps in the processing pipeline is important.

2024-04-09
