QMSS 301 - Quantitative Social Science Analysis and Big Data

Project 3 - Sentiment and Predictive Analyses

Assigned: Apr. 9, 2024; Due: Apr. 22, 2024 at 11:59 PM

The aim of this project is twofold. First, it aims to get students comfortable with scraping social media data from the web (Reddit, to be precise), running sentiment analysis on the data, and visualizing the results. Second, it aims to give students practice with predictive analytics using the sentiment results from the scraped text data.

Therefore, your tasks are in two parts. First, identify a social issue (e.g., student loan forgiveness) and find a Reddit post about it. The post should have at least 1000 comments (let me know if there is a post you are really interested in that has fewer than 1000 comments). You are expected to scrape the comments (including the date created and the number of ups and downs each comment has attracted), pre-process the data, run a sentiment analysis (using both TextBlob and VADER), and conduct exploratory data analysis. In addition, create two columns: the first containing the number of characters in each comment, and the second containing the number of words in each comment. Second, you will build a machine learning model to predict whether a comment has a positive or negative sentiment. To do this, keep only the comments with positive or negative sentiment. Use the sentiment variable (with two categories: positive sentiment and negative sentiment) as your target variable, and use at least three of the following as your features: number of ups, number of downs, number of characters, and number of words. You are expected to assess your model's performance by presenting the necessary evaluation metrics (e.g., accuracy, precision, etc.).
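The two extra columns and the two-class target described above can be sketched in plain Python as follows. This is only an illustration, not the required implementation: the sample comments and compound scores are made up, the helper names (add_length_columns, label_from_compound) are hypothetical, and the 0.05 cut-off is the threshold conventionally suggested for VADER's compound score, which you may choose to change. In the actual project these would more likely be columns of a pandas DataFrame.

```python
# Hypothetical sample rows; real rows would come from the scraped Reddit comments.
comments = [
    {"body": "Loan forgiveness would change my life", "compound": 0.62},
    {"body": "This policy is unfair to taxpayers", "compound": -0.48},
    {"body": "The bill was introduced in 2022", "compound": 0.0},
]


def add_length_columns(row):
    """Return a copy of the row with character and word counts added."""
    text = row["body"]
    return {**row, "n_chars": len(text), "n_words": len(text.split())}


def label_from_compound(score, threshold=0.05):
    """Map a VADER-style compound score in [-1, 1] to a sentiment label.

    The 0.05 threshold is a common convention, treated here as an assumption.
    """
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


rows = [add_length_columns(r) for r in comments]
for r in rows:
    r["sentiment"] = label_from_compound(r["compound"])

# Keep only positive and negative comments for the predictive model.
model_rows = [r for r in rows if r["sentiment"] != "neutral"]
```

Here model_rows holds the two non-neutral comments, each with n_chars, n_words, and a sentiment label that could serve as the target variable.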

At the end, you are expected to submit two things: a Python script, used for scraping the Reddit data and running the analysis (both sentiment analysis and predictive analysis), and a report of about 4-5 pages. Ensure that your script is well documented.
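For the scraping part of the script, converting the creation date to regular time and removing duplicates might look like the sketch below. It assumes each scraped comment carries a Unix timestamp in seconds (Reddit exposes this as created_utc) and a unique comment id; the sample records and the helper name clean_comments are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical scraped records; the second one is a duplicate of the first.
raw = [
    {"id": "abc1", "created_utc": 1712620800, "body": "First comment", "ups": 12},
    {"id": "abc1", "created_utc": 1712620800, "body": "First comment", "ups": 12},
    {"id": "abc2", "created_utc": 1712707200, "body": "Second comment", "ups": 3},
]


def clean_comments(records):
    """Drop repeated comment ids and add a human-readable UTC timestamp."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec["id"] in seen:
            continue  # skip duplicates, keeping the first occurrence
        seen.add(rec["id"])
        # Assumption: created_utc is seconds since the Unix epoch, in UTC.
        created = datetime.fromtimestamp(rec["created_utc"], tz=timezone.utc)
        cleaned.append({**rec, "created": created.strftime("%Y-%m-%d %H:%M:%S")})
    return cleaned


comments = clean_comments(raw)
```

If the comments are held in a pandas DataFrame instead, pd.to_datetime with unit="s" and DataFrame.drop_duplicates do the same job.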

GUIDELINES

Report

In addition to running the code, I am very much interested in your ability to draw meaningful insight from the analyses. This is where you will explain things like your word cloud, polarity and subjectivity results, etc. This is also where you will draw meaningful insight from the predictive results. In the report, I expect you to speak to the data and the social issue you are researching when interpreting the results of your analysis. This information will be written up in a brief report of 4-5 pages (excluding images, tables, and charts), with interpretations and recommendations based on your findings. You can include images, charts, etc. within the report or put them in the appendix. A suggested outline for this report would include:

. Introduction/Problem Statement: Provide the background to the social problem you will be analyzing with your data.

. Data Collection and Intention: Tell us about the post in relation to the problem. How many comments are in the Reddit post you scraped? What methods do you intend to use for your analysis?

. Sentiment Analysis: What are the findings of your sentiment analysis? What did you find when you compared the results of TextBlob and VADER? What are the findings from your exploratory analysis? etc.

.  Predictive Analysis: What are the findings from your machine learning model?

.  Recommendation/Conclusions: This is your chance to draw conclusions on your results and add your own voice to it. Are your findings surprising? etc.
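When reporting the predictive analysis, it helps to be clear about what each evaluation metric measures. The sketch below computes accuracy, precision, recall, and F1 by hand for a made-up set of true and predicted labels (1 = positive sentiment, 0 = negative sentiment); the label lists and the function name binary_metrics are illustrative assumptions. In your own script, scikit-learn's sklearn.metrics module provides equivalent functions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels for eight test comments (assumption, not real results).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

Precision answers "of the comments predicted positive, how many really were?", while recall answers "of the truly positive comments, how many did the model find?". Reporting both alongside accuracy gives a fuller picture when the two sentiment classes are unbalanced.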

Grading Rubrics

This assignment is 20% of your final grade.  It is an independent project, but you can work with your classmates.

Item | Measurements | Weights

Web Scraping

. Successfully scraped the data (7 points).

. Convert the date created to regular time, and delete duplicates (3 points).

. Code well documented (5 points).

15%

Sentiment Analysis

. Successfully carry out the necessary pre-processing (cleaning) (4 points).

. Word cloud successfully generated (4 points).

. Generates polarity and subjectivity scores using TextBlob (6 points).

. Generates the positive, negative, neutral, and compound scores of the VADER sentiment (5 points).

. Descriptive statistics, frequency table, and crosstabs of polarity and subjectivity results (6 points).

. At least 3 exploratory charts of the data (5 points).

30%

Predictive Analysis

. Good visualizations (at least two) (5 points).

. Successfully build your predictive model using sklearn (8 points).

. Analyze the success of your predictive model using different evaluation metrics (7 points).

20%

Report

. Clarity and coherence of the content.  (10 Points)

. Paragraph flow and wording allow for easy readability and understanding (10 points).

. Detailed and correct explanation of the sentiment and predictive analyses (10 points)

. Minimal grammatical errors. (5 Points)

35%

Extra points

. Ability to include other concepts that were not covered in class (3 points for each, to a maximum of 15 points). For example, creating a word cloud for either positive or negative sentiment comments.

Note: It is still at the discretion of the GSI to decide whether you get full points for each concept you introduce.

15% Max