
CISC 271, Winter 2024

Assignment #2: Regression and Cross-Validation

Due by 4:00PM on Wednesday, February 7, 2024

The subject matter for this assignment is the relationship between social-sciences estimation of the fragility of a country and its male population. The included article, from quartz.com, describes a postulated relationship between age cohorts of the male population of a country and how “fragile” the country may be.

Coding for this requires multiple uses of linear regression, which is the solution of an overdetermined linear equation. You may use the MATLAB builtin function linsolve, or the “backslash” operator \, or any builtin function. You may not use functions for cross-validation, or any other functions from a MATLAB toolbox, because you are expected to code these by yourself.
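As a minimal sketch of what solving an overdetermined system looks like in MATLAB, the example below fits a line to five made-up points. The numbers and the variable names A, c, w, w2, and rmsErr are illustrative assumptions, not part of the starter code.

```matlab
% Synthetic overdetermined system: 5 equations, 2 unknowns (illustrative only)
A = [1 1; 1 2; 1 3; 1 4; 1 5];
c = [2.1; 3.9; 6.2; 7.8; 10.1];

% Backslash returns the least-squares solution when A is tall and thin
w = A \ c;

% linsolve also returns a least-squares solution for a rectangular A
w2 = linsolve(A, c);

% RMS error of the fit: root of the mean squared residual
rmsErr = sqrt(mean((A*w - c).^2));
```

Both calls should agree to within round-off, so either is a reasonable choice for this kind of problem.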

The technical problem in linear regression, for this assignment, is to estimate a weight vector $\vec{w}$ for a design matrix $A$ that is “tall thin”, and a data vector $\vec{c}$. The vector $\vec{w}$ is an approximate solution to the regression problem

$$A\vec{w} \approx \vec{c}$$

Please read the details and instructions carefully before you begin to work on the problem. The second question in this assignment is modestly difficult because it is intended to be a practical introduction to a method of evaluating algorithms for linear regression. There must be a single results section and a single discussion section in your report. The results section of the report must contain two tables and up to one figure; more or fewer, of either tables or figures, may produce deductions from your grade on this assignment.

Statement of Academic Integrity

This assignment is copyrighted by the instructor, so unauthorized dissemination of this assignment may be a violation of copyright law and may constitute a departure from academic integrity.

Sharing of all or part of a solution to this assignment, whether as code or as a report, will be interpreted as a departure from academic integrity. This includes sharing of the assignment after the due date and after completion of this course.

Learning Outcomes

On completion, a successful student will be able to:

. Standardize data for use in linear regression

. Compare different regressions in a consistent manner

. Implement a 5-fold cross-validation of linear regression

. Evaluate the results of a cross-validation

Preliminary: The Data

This assignment uses data that the instructor gathered from the Fund For Peace and the United Nations world population estimates. For consistency with the accompanying article, data for the year 2013 were extracted and pre-processed by the instructor. The “starter” code has an internal function that reads the data and separates the relevant parts.

The data are in the file fragility2013male.csv that is in the ZIP file for this assignment. The data can be loaded using the fragilitydata function that is provided in the starter code. For your report, you will need to understand the names of the variables, which are age groups.

Preliminary: The Code

The starter code is structured as three functions. The base function, which is the function that will be called when the graders invoke your file name, is a2.

You can, if you wish, change the name of this function to match the name of your submission file. MATLAB will execute the code correctly even if the function name and the file name differ. The base function will then invoke two functions in succession.

The base function will invoke the code for the first question. This first code will return two variables: the RMS errors of linear regression and the index of the lowest RMS error. The base function will then invoke the code for the second question. This invocation uses the index from the first question and computes the 5-fold cross-validation for the corresponding variable.

The base function will return four values. The TAs will examine these values and use them as part of the grade for your assignment.

DO NOT MODIFY THE BASE FUNCTION. MODIFY ONLY THESE FUNCTIONS:

a2q1   a2q2   mykfold (optional)

The internal function fragilitydata will return four output arguments. The first output argument is the dependent vector, in which each entry is the fragility index of a country. The second output argument is the independent data matrix; each row is the observation for a country, and each column is the proportion of the male population that is in an age group. The third output argument is a cell array of names of the countries. The fourth output argument is a matrix of the age groups; for each column, the first entry is the lower range of the age group and the second entry is the upper range of the age group.

You will perform computations that use the dependent vector and the independent matrix. The other arguments may help you to understand the data. For this assignment, the fragility values are the dependent data that we will call the yc vector.

Question 1: Variable of Best Regression                             8% of Final Grade

For this question you will need to modify the function a2q1 in the starter code.


The technical problem for this question is to find the variable in the data that best explains the dependent variable. You are expected to do this by selecting the variable that, when chosen as the A matrix, has the lowest RMS error of fit to the dependent data yc. Each variable is an age group.

You should describe what, if any, data standardization you used. You should also describe and justify whether or not you used an intercept term in your linear regression. Recall that a positive correlation is one in which the slope of the line of best fit is a positive value and that a negative correlation is one in which the slope of the line of best fit is a negative value.

The starter code for this assignment will return a row of small numbers; your completed code should return the RMS error for the regression of each variable in terms of all of the other variables.

The starter code will also return the indexes of the variables with the smallest RMS error for positive correlations and for negative correlations; by default it selects the first and last variables, which you will find computationally.

You may report RMS errors either in the units of age or in standardized units, depending on your choice of whether or not to use data standardization.

In your report, the values of the RMS errors must be presented in Table 1. The caption of the table must state the index of the age group that best explains the fragility indexes and the age range of the age group. You can, optionally, include a single figure that has plots of the dependent variable and the linear regression for the best positive correlation and the best negative correlation. An example of one part of the optional figure for this assignment is Figure 1.

Figure 1: Example of an optional plot of given data and a linear regression to the dependent data.


1.1: Linear Regression

The fragility indexes will need to be considered as a dependent variable, in this document referred to as cvec. The proportions of age groups of the overall male populations will need to be considered as the independent variables, in this document referred to as Xmat. You may wish to standardize the data. If you choose to use an intercept term, you will need to augment Xmat with a column that is the ones vector $\vec{1}$.
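The two choices described here, standardization and an intercept term, can be sketched on made-up numbers as follows. The data and the names x, c, xs, cs, wNoInt, and wInt are hypothetical. The z-score form shown (subtract the mean, divide by the standard deviation) is one common standardization and is an assumption, not necessarily the one the course notes prescribe.

```matlab
% Hypothetical data: one independent variable x and a dependent vector c
x = [0.10; 0.12; 0.08; 0.15; 0.11];
c = [70; 82; 61; 95; 76];

% Standardize each to zero mean and unit standard deviation (z-scores)
xs = (x - mean(x)) / std(x);
cs = (c - mean(c)) / std(c);

% Without an intercept: the design matrix is the single column xs
wNoInt = xs \ cs;

% With an intercept: augment the design matrix with a column of ones
A = [xs, ones(size(xs))];
wInt = A \ cs;   % wInt(1) is the slope, wInt(2) is the intercept
```

When both vectors are standardized this way, the fitted intercept is zero up to round-off and the two slopes coincide, which is one argument you might weigh when justifying your choice.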

The RMS error of regression of cvec must be computed and returned as the corresponding entry of the variable rmsvars of the function a2q1.  The indexes of the smallest RMS error of the positive correlations and of the negative correlations must be computed and returned as the variables lowIndexPositive and lowIndexNegative of the function a2q1.
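One possible way to organize this computation is sketched below on a small synthetic matrix, not the assignment data. It regresses the dependent vector on one column at a time with an intercept, records the slope and RMS error, and then picks the lowest RMS error among positive and among negative slopes. The temporary names slopes, posRms, and negRms are assumptions; rmsvars, lowIndexPositive, and lowIndexNegative follow the names in this handout.

```matlab
% Hypothetical data: 6 observations, 3 variables (not the assignment data)
Xmat = [1 6 2; 2 5 1; 3 4 4; 4 3 3; 5 2 6; 6 1 5];
cvec = [1.1; 1.9; 3.2; 3.8; 5.1; 6.0];

n = size(Xmat, 2);
rmsvars = zeros(1, n);
slopes = zeros(1, n);
for j = 1:n
    A = [Xmat(:, j), ones(size(cvec))];        % variable j plus intercept
    w = A \ cvec;                              % least-squares weights
    slopes(j) = w(1);
    rmsvars(j) = sqrt(mean((A*w - cvec).^2));  % RMS error of the fit
end

% Mask out slopes of the other sign with Inf, then take the minimum
posRms = rmsvars;  posRms(slopes <= 0) = Inf;
negRms = rmsvars;  negRms(slopes >= 0) = Inf;
[~, lowIndexPositive] = min(posRms);
[~, lowIndexNegative] = min(negRms);
```

Masking with Inf keeps the returned indexes aligned with the original column numbering, which matters when the index is later used to look up the age range.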

1.2: Methods

Your methods must include a narrative description of your computation. For example, in finding the best data variable, do not simply say something like “I used a for loop”. You should give a reader the logic that underlies your code, not a line-by-line description of its implementation.

1.3: Discussing The Results

Your discussion should compare and contrast your numerical results and any observation you have on, for example, how the proportion of the male population is related to the estimated fragility of the country.  You should describe any other effects related to the choices you made in your implementation.

For this assignment, you are encouraged to provide a modest amount of creativity; it is a good idea to try to engage the readers of your report.

Question 2: Cross Validation of Regression                        7% of Final Grade

The problem for this question is to determine the reliability of choosing one age group to act as a proxy for the fragility indexes. This reliability will be found by performing a 5-fold cross-validation of the given data.

Important: RMS errors must be reported in the units of the fragility index. If you standardized the data for linear regression, the standardization may need to be inverted, or recomputed with an intercept term, to find the correct units.
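As a hedged illustration of inverting a standardization, the sketch below assumes z-score standardization of both vectors and made-up data. The names muC, sigC, rmsStd, cPred, and rmsOrig are hypothetical.

```matlab
% Hypothetical data: one independent variable x and a dependent vector c
x = [0.10; 0.12; 0.08; 0.15; 0.11];
c = [70; 82; 61; 95; 76];

% Keep the mean and standard deviation of c so the mapping can be undone
muC = mean(c);
sigC = std(c);
xs = (x - mean(x)) / std(x);
cs = (c - muC) / sigC;

% Regress in standardized units
w = xs \ cs;
rmsStd = sqrt(mean((xs*w - cs).^2));

% Map the predictions back to the original units of c, then compute RMS
cPred = (xs*w) * sigC + muC;
rmsOrig = sqrt(mean((cPred - c).^2));
```

Under this particular standardization the two errors differ only by the factor sigC, so rmsOrig equals sigC * rmsStd up to round-off.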

You can select the five folds of data by a method of your choice. In the introduction of your report, you must state your choice and verbally defend the choice. The instructor’s notes recommend a random selection, which requires a form of indirect indexing that is more complex than the indexing that was used in the previous assignment.
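A random selection with indirect indexing might look like the following sketch. It is not the instructor's mykfold code; the observation count m, the seed, and the names perm, foldOf, testIdx, and trainIdx are assumptions made for illustration.

```matlab
m = 23;          % hypothetical number of observations
k = 5;           % number of folds

rng(271);        % fixed seed so that repeated runs give the same folds
perm = randperm(m);            % random ordering of the row indexes 1..m

% Assign a fold label 1..k to each position of the permutation
foldOf = mod(0:m-1, k) + 1;

% Indirect indexing: the rows held out for fold 2, and the rest
testIdx = perm(foldOf == 2);
trainIdx = perm(foldOf ~= 2);
```

Fixing the seed is a design choice worth considering here, because values printed by the code are compared against values in the report.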

For each fold, you must first use 4/5 of the data to “train” your regression. This training is the computation of the weight vector of a linear regression.

For each fold, you must then use 1/5 of the data to “test” your regression.  This testing is the application of the weight vector from the training phase; you must use the same choices of standardization and an intercept term that you used in the training phase.

For each fold, the RMS error of training must be computed and returned as the corresponding entry of the variable rmstrain of the function a2q2.  For each fold, the RMS error of testing must be computed and returned as the corresponding entry of the variable rmstest of the function a2q2.
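The train-then-test steps above can be sketched on synthetic data as follows. This is one possible arrangement that uses an intercept and no standardization. The data and the names x, c, perm, foldOf, Atrain, and Atest are hypothetical; rmstrain and rmstest follow the names in this handout.

```matlab
% Synthetic data (illustrative only): a noisy line with 20 observations
rng(271);
m = 20;
k = 5;
x = linspace(0, 1, m)';
c = 3*x + 2 + 0.1*randn(m, 1);

perm = randperm(m);             % random ordering of the rows
foldOf = mod(0:m-1, k) + 1;     % fold label for each position of perm
rmstrain = zeros(1, k);
rmstest = zeros(1, k);

for f = 1:k
    testIdx = perm(foldOf == f);
    trainIdx = perm(foldOf ~= f);

    % Train: compute the weight vector from 4/5 of the data
    Atrain = [x(trainIdx), ones(numel(trainIdx), 1)];
    w = Atrain \ c(trainIdx);
    rmstrain(f) = sqrt(mean((Atrain*w - c(trainIdx)).^2));

    % Test: apply the same weight vector to the held-out 1/5
    Atest = [x(testIdx), ones(numel(testIdx), 1)];
    rmstest(f) = sqrt(mean((Atest*w - c(testIdx)).^2));
end
```

If standardization were used, a consistent approach would be to compute the mean and standard deviation from the training rows and apply those same values to the test rows.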

In your report, the values of the RMS errors of the folds must be presented in Table 2. The caption of the table must state the index of the age group that best explains the fragility indexes and the range of the age group. You can, optionally, summarize the RMS errors of training and the RMS errors of testing as two overall values.

2.1: Loading The Data

The data must be loaded exactly as they were loaded for Question 1 of this assignment.

2.2: Fold Selection

There are many ways to select folds, including algorithmic selection and random selection. You must describe and verbally defend your choice. Some choices require more elaborate coding than others and some choices introduce biases into the results, so this choice has consequences.

The starter code includes the MATLAB function mykfold, which is the instructor’s starter code for randomly selecting the folds.  Because you do not need to use this code, this code is not called from the other starter code.

3: Grading Guide

We will test your code by invoking the function that you uploaded. Your grade will be reduced if: you plot more or fewer than the specified number of figures; your code outputs anything other than the specified values; or you otherwise deviate in your implementation from these specifications.

The TAs have been instructed to use this guide when they mark your assignment. Your grade will be based on the numerical results and on the report. The distribution of points for the assignment grade is:

9/60 points: all and only the numerical values that are produced by the code and that are presented in the results

15/60 points: quality of the code in the modified “starter” functions, and any other changes in the submission file that was used to generate values and plots for the report

36/60 points: quality of the report, especially including the figures and descriptions; clarity may be assessed, in part, by the written introduction, verbal defense of choices, and the discussion of results

What to turn in:

. You will submit your answers electronically as two files. The code will be tested by one or more graders. The PDF report will be read by one or more graders and will be checked, using electronic methods, to ensure that it meets professional standards for originality.

. The code must be in one MATLAB file (a2-xxxxxxxx.m). This file will contain all of the code needed to verify that the values and tables in the report can be reproduced. The functions must produce the values for your tables and the figure.

. Your functions must take no arguments, return the specified values, and require no user input or action such as using the “enter” key. Running this function should produce, on the console, every value that is in the report; the function should also produce any plot that is in your report. The function should produce no other values or figures. The graders will compare your computed values to the values in the report and may deduct marks from the report for differences between any reported value/plot and the corresponding computed value/plot.

. The report must be in a single PDF file (a2-xxxxxxxx.pdf). The PDF file must include a description of how you tested your code. You can also include notes, comments on the problems, and assumptions you have made, as appropriate.

. The assignment must be submitted using the Queen’s “onQ” software.

Grading Considerations:

. The quality of your report will be considered. You need, at minimum, to conform to the “student version” of the report style in the onQ website; you may wish to consider the “grader version” that we will use for assessing your report.

. The quality of your MATLAB code will be considered. Your code should be appropriately indented, sufficiently commented, and otherwise be appropriate software.

. The output of your code will be considered.

. Your code can use functions provided by MATLAB, but the code that you submit must be your original work. You may not use any builtin functions that perform k-fold cross-validation.

. Code that causes MATLAB to produce an error or warning will result in a failing grade.

. You may assume that the file goods.csv is in the current directory when a grader tests your code.

Policies:

. You must complete these questions individually.

. Although you are allowed to discuss the questions with other students, you must write your own answers and MATLAB code.

. The syllabus standards apply to this assignment.

. Lateness policy applies starting the minute after the submission deadline, at a rate of 20% off the assignment value per calendar day. Please note: the time in the onQ system is beyond your control, so submitting within an hour of the deadline is inherently a risky process for which you assume full responsibility.