Econ 424. ML for Economists

Prediction Competition 5: Feature Construction and Text Analysis

November 14, 2022

Timeline:

• Training data sets (DataSet1, DataSet2, DataSet3) are already posted on Learn. These data sets have 10,000, 130,000, and 40,000 observations, respectively. The test data set (DataSet4) has 40,000 observations. The test data set without the response variable is also already posted on Learn.

• Submission 1: Due Monday November 28, 5pm. This submission must have 3 elements:

–  (Top of PDF) Your pseudonym to be used on leaderboard.

–  (Rest of PDF) Code for your algorithm that produces the 40,000 test set predictions from the training data sets (features and response variable) and the test data set features.

– Data file (csv) with 40,000 predictions for the response variable in the test set. THIS FILE MUST CONTAIN ONLY 40,000 zeros/ones: just one number per line. Nothing else.

These 40,000 numbers must be in the same order as the observations in the distributed "test data without stars" data set.

• There is no submission 2. In contrast with previous prediction competitions, you now submit predictions for each individual observation in the test set, and the instructor/TA calculates the accuracy of the 40,000 predictions.
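The file format and the accuracy calculation described above can be sketched as follows. This is a minimal illustration, not the instructor's grading script: the helper names `write_predictions`, `check_submission`, and `accuracy` are hypothetical, and the file name in the usage note is a placeholder.

```python
from pathlib import Path


def write_predictions(preds, path):
    """Write one 0/1 prediction per line, with no header and no index column."""
    if any(p not in (0, 1) for p in preds):
        raise ValueError("every prediction must be 0 or 1")
    with open(path, "w", newline="") as f:
        for p in preds:
            f.write(f"{int(p)}\n")


def check_submission(path, expected_rows=40_000):
    """Return True if the file has exactly expected_rows lines, each '0' or '1'."""
    lines = Path(path).read_text().splitlines()
    return len(lines) == expected_rows and all(line in ("0", "1") for line in lines)


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels, compared in order."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction lists must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

For example, `write_predictions(preds, "YourPseudonym.csv")` followed by `check_submission("YourPseudonym.csv")` confirms the row count and contents before uploading. Because predictions are matched to observations by position, they must stay in the original test set order.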

Selected submissions from Submission 1 will be posted on Learn. Every submission shared with the class receives 10 bonus percentage points.

To receive credit for any submission, the code must be reported clearly enough to enable replication.

Do not include any identifying information on your submissions (e.g. name or student ID). Name the file the same as your pseudonym. Choose the pseudonym to be an unusual but pronounceable (English) letter combination such as "BellKor". These made-up names and the performance of each algorithm will be revealed to the class.

Collaboration is encouraged but everyone must run their own code and write up their own answers. Please submit your answers in one PDF file.

The prediction task: Construct the response variable as an indicator variable that is 1 for 5-star reviews and 0 otherwise. Build an ML model to predict this 5-star status based on votes and review text. Use the training data sets to estimate this model, and then calculate your predictions for all 40,000 observations in the test data set. You are free to use any algorithm (boosting, bagging, and random forests are all allowed). Performance of the algorithm is measured by the number of correct predictions.
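One possible starting point for this task is sketched below, assuming pandas and scikit-learn are available. The column names `stars`, `votes`, and `text` are assumptions (the actual data sets may name them differently), the toy rows stand in for the real data, and TF-IDF with logistic regression is just one baseline choice among the allowed algorithms.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for the training data; column names are assumed, not from the handout.
train = pd.DataFrame({
    "stars": [5, 1, 5, 2, 5, 3],
    "votes": [3, 0, 1, 4, 2, 0],
    "text": [
        "Amazing food and great service",
        "Terrible wait and rude staff",
        "Loved it, best meal in town",
        "Cold food, would not return",
        "Fantastic experience, highly recommend",
        "It was okay, nothing special",
    ],
})
# Toy stand-in for the test data, which has no stars column.
test = pd.DataFrame({
    "votes": [1, 2],
    "text": ["Great service and amazing meal", "Rude staff and cold food"],
})

# Response variable: 1 for 5-star reviews, 0 otherwise.
y_train = (train["stars"] == 5).astype(int)

# Text becomes TF-IDF weights over unigrams and bigrams; votes pass through unchanged.
features = ColumnTransformer([
    ("text", TfidfVectorizer(ngram_range=(1, 2)), "text"),
    ("votes", "passthrough", ["votes"]),
])
model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(train[["text", "votes"]], y_train)

# One 0/1 prediction per test row, in the original row order.
preds = model.predict(test[["text", "votes"]]).tolist()
```

With the real data, the three training sets could be stacked with `pd.concat` before fitting, and part of the training data held out to estimate accuracy before predicting on the test set. Swapping `LogisticRegression` for a boosting or random forest classifier only changes the last pipeline step.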