QBUS2810 Statistical Modelling for Business Semester 2, 2021
Group Assignment
This group assignment is worth 20% of your final result in the unit, of which 5% is your team’s assessment of your contribution.
The deadline is Monday Wednesday of Week 13, October 10 by 11:59pm. Submission is via Canvas and Turnitin.
This assignment must be completed in your Canvas group.
Maximum Length: There is no maximum page length for this assignment. If you have something interesting and worthwhile to include, then please do so without worrying about a page limit. However, irrelevant or overly long-winded material will reduce your overall mark. As a guideline, I expect the typical report to have between 20 and 25 pages, excluding Python code.
Notes on Marking:
· The assignment will initially be marked out of 55.
· Up to an additional five (5) marks will be awarded based on the overall presentation quality of your report. Thus, you will receive a total mark for this assignment out of 60. You will lose some of these 5 presentation marks for poor, inefficient, unclear and/or unprofessional presentation. You will be rewarded for professional, efficient and clear presentation methods. I expect your final report to be done in a professional editing package and to be submitted in PDF only. HTML files of Jupyter notebooks are not suitable.
· You must use Python for this assignment. You are being assessed on how well you can use Python to complete the assignment tasks. NB: You can use Excel for simple data manipulations and clean-up; but Python is better at these tasks too! All plots and statistical output in the assignment must have been produced in Python, though you can of course make nicer tables in a text editor to include in your assignment. Please include an appendix in your assignment that contains the Python code your group used to produce ALL outputs in your assignment. A heavy penalty will apply if the Python code is not supplied (or the code supplied does not run or work when the marker tries to run it).
Pre-analysis instructions for data:
Please include the Python code from the Jupyter notebook file “grp assnt gendata.ipynb” in your Jupyter notebook file to input and clean the data. Collect the student ID numbers for the members of your group and then add these numbers together. Input the result into the Python code where instructed. Run the subsequent code to generate two datasets: “train” and “test”. Most analysis you do will only use the “train” dataset. Any forecasting your group does will only use the “test” dataset. The purpose of these commands is to ensure that each group receives different randomly selected datasets for “train”ing and “test”ing purposes. Two other Python scripts are included in case you need them: forward selection.py and backword selection.py.
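The notebook's own code is not reproduced in this brief. Purely as an illustration of why a group-specific number gives each group a different but reproducible split, a seeded shuffle could look like the sketch below. The student IDs, the row count, the 70/30 split and the helper name split_rows are all hypothetical; for the assignment itself, always use the supplied notebook.

```python
import numpy as np


def split_rows(n_rows, seed, train_frac=0.7):
    """Return (train_idx, test_idx) from a seeded shuffle of row positions.

    Hypothetical helper: the supplied notebook may split the data differently.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_train = int(round(train_frac * n_rows))
    return np.sort(shuffled[:n_train]), np.sort(shuffled[n_train:])


# Made-up student IDs, summed to form the group's seed.
student_ids = [490000001, 490000002, 490000003]
group_seed = sum(student_ids)

# Made-up row count; the same seed always reproduces the same split.
train_idx, test_idx = split_rows(n_rows=100, seed=group_seed)
print(len(train_idx), len(test_idx))
```

Row positions like these could then be passed to a pandas DataFrame through .iloc to obtain the two subsets.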
Business problem:
The US Department of Education and US Department of Labor are interested in the effect of schooling on the subsequent earnings of individuals. They also wish to build a model that can accurately predict the income of individuals so that they can better understand the determinants of job market outcomes and estimate the return to education. Your group has been commissioned to research and analyse the data provided and then report back to the two Departments.
Data and Description:
Please see the file WAGE2 Data Set.pdf for information on the data collected, including a description of the variables in the study. The data used here are from a survey of income earners from across the US. The dataset is in the file “WAGE2.csv”. The measure of job market outcome to be used is the monthly wage (in $) or the log monthly wage of individuals.
Goals and primary questions:
There are four primary goals that the two Departments would like your group to focus on:
(a) Develop an optimal model for predicting the wages of individuals;
(b) Understand the relationship between years of education and monthly wages, and estimate the return to an additional year of education in the presence of an available career path;
(c) Understand the relationship between work experience and monthly wages in the presence of available further education opportunities;
(d) If the Departments are awarded some extra funding, should they spend the money on the educational component (e.g., by promoting additional education) or on the labor component (e.g., by facilitating an early start to work experience, internships or apprenticeships) in order to best improve the earnings prospects of individuals?
Tasks:
1. Conduct a suitable exploratory analysis on this dataset that is relevant to the goals of this study (5 marks).
2. Analyse the relationship between educ and wage and test the significance of this relationship using an SLR. Include a discussion of whether the assumptions of your analysis and test could hold for this data and whether and how strongly the data actually fits the model. (5 marks)
3. Discuss which variables in the dataset could be causing omitted variable bias in your analysis in task 2, and justify clearly why you think that. Include these omitted variables, together with educ, in an MLR model, without any transformations or interactions or nonlinear effects; then fit the model. Again, test for a relationship between wage and education. Also test for a relationship between experience and wage. Also include a discussion of whether the assumptions of your test could hold for this data and whether and how well the data actually fits the model. Also discuss the level and sources of multicollinearity present and whether you think this is problematic, or not, and why; and if so, problematic for what? (10 marks)
4. Conduct a variable and model selection exercise, including at least two potential interaction effects as potential predictors and also at least two transformations/nonlinear effects on regressors and/or response variable. You must properly motivate and discuss your choices here. Then, report a summary of the comparison of fit over at least 8 different models/transformations/variable sets that you tried, all while forcing educ (and, if needed, exper) to stay in the model in some form. The goal is to find an optimal model that is highly accurate, but also parsimonious, to predict and explain wage. Finally, fully report and give diagnostics on the final optimal model, as well as briefly discussing any collinearity issues it may have. Also, if there are any nonlinear effects in this model, clearly discuss and illustrate their effects on wage. (15 marks)
5. Discuss your results and conclusions regarding the overall goals of this study, in light of the results from your overall analysis of the “train” dataset. Be technical but clear here. Also, include a prediction of what would result if education is increased by one year and what would result if experience is increased by one year, using at least the optimal model so far. (5 marks)
6. Using (at least) the 3 best model specifications considered so far (and any others you think relevant), generate forecast predictions in the “test” dataset for the wages. Present a summary table, and suitable plot(s), of the forecasts and their accuracy for these models, using the forecast measures RMSE, MAD and forecast R². Re-discuss your results and conclusions regarding the overall goals of this study, in light of these results and your overall analysis. Be technical but clear here. (10 marks)
7. Write a final report, in as close to plain English as is practical and possible, that discusses and summarises your analysis above and gives conclusions on the overall goals of this study. Address the report to, and write it at a level appropriate for, the Department of Education and Department of Labor. Include in your report a prediction of what would likely occur if the Departments spent money on increasing average years of education vs improving early job experience, and whether you recommend they take one or the other action; plus any suggestions for how the two Departments could further assist the population in improving job prospects and also any suggestions, if you have any, for any future studies they should do to facilitate that. (5 marks)
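As a possible starting point for Tasks 2 and 3, the sketch below fits an SLR and an MLR with statsmodels and computes variance inflation factors. It runs on synthetic data so that it is self-contained: the coefficients used to simulate the data are arbitrary, and the column names wage, educ and exper are assumed to match WAGE2.csv (check WAGE2 Data Set.pdf). Which controls belong in the MLR is for your group to justify; exper is used here only as an example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic stand-in for the "train" dataset (arbitrary, made-up coefficients).
rng = np.random.default_rng(0)
n = 500
educ = rng.integers(9, 19, size=n)  # years of schooling, 9 to 18
exper = np.clip(30 - educ + rng.normal(0, 4, n), 0, None)  # correlated with educ
wage = 200 + 60 * educ + 15 * exper + rng.normal(0, 150, n)
train = pd.DataFrame({"wage": wage, "educ": educ, "exper": exper})

# Task 2: simple linear regression of wage on educ.
slr = smf.ols("wage ~ educ", data=train).fit()
print(slr.params["educ"], slr.pvalues["educ"], slr.rsquared)

# Task 3: add a control that plausibly causes omitted variable bias.
mlr = smf.ols("wage ~ educ + exper", data=train).fit()
print(mlr.summary())

# Variance inflation factor for each regressor (the intercept column is skipped).
X = mlr.model.exog
names = mlr.model.exog_names
vif = {names[i]: variance_inflation_factor(X, i) for i in range(1, X.shape[1])}
print(vif)
```

In this simulated example exper is negatively correlated with educ and positively related to wage, so the educ coefficient from the SLR is smaller than the one from the MLR. Real data need not behave this way.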
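For Task 6, forecast accuracy measures can be computed directly from the test-set wages and each model's predictions. The sketch below uses one common set of definitions, all of which are assumptions: RMSE as the root mean squared forecast error, MAD as the mean absolute forecast error, and an out-of-sample (forecast) R-squared equal to one minus the ratio of the squared forecast errors to the squared deviations from the test-set mean. Use the definitions given in the unit materials if they differ. The helper name forecast_metrics and all numbers are made up.

```python
import numpy as np


def forecast_metrics(y_true, y_pred):
    """RMSE, MAD and forecast R-squared for one set of forecasts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    rmse = np.sqrt(np.mean(errors ** 2))
    mad = np.mean(np.abs(errors))
    # Benchmark: the test-set mean (some texts use the training mean instead).
    sst = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - np.sum(errors ** 2) / sst
    return {"RMSE": rmse, "MAD": mad, "R2": r2}


# Made-up test wages and forecasts from a hypothetical model.
y_test = [800.0, 1000.0, 1200.0, 1400.0]
y_hat = [900.0, 1000.0, 1100.0, 1400.0]
metrics = forecast_metrics(y_test, y_hat)
print(metrics)
```

Calling the function once per candidate model and collecting the results in a table gives the side-by-side comparison the task asks for.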