DAT 560M – Big Data and Cloud Computing 2023 – Lab #4
DAT 560M: Big Data and Cloud Computing
Fall 2023, Mini B
Lab #4
INSTRUCTIONS
1. This is a group assignment, to be worked on during the lab.
2. Use ONLY the code we have practiced.
3. Please submit the answers on Canvas.
4. Only one submission per group is sufficient.
ASSIGNMENT
In this assignment, we are going to work on a dataset called auction.csv, which is located in the
dataset folder on the server. The dataset contains eBay auction information on Cartier
wristwatches, Palm Pilot M515 PDAs, Xbox game consoles, and Swarovski beads. It has the following columns:
auctionid: unique identifier of an auction
bid: the proxy bid placed by a bidder
bidtime: the time in days that the bid was placed, from the start of the auction
bidder: eBay username of the bidder
bidderrate: eBay feedback rating of the bidder
openbid: the opening bid set by the seller
price: the closing price that the item sold for
item: auction item
auction_type: the type of auction (3-day auction, 7-day auction, …)
Part 1- Initialization (10pts)
1- Start the PySpark engine and load the file into it. (5 pts)
2- Get to know the dataset and do a preliminary examination (for example type of columns, summary, …) (5 pts)
Part 2- Feature Engineering (20pts)
3- Create a new feature that shows how many bids have been placed for each item. (20 pts)
Part 3- Linear Regression (70pts)
4- Prepare the data for a linear regression. We are interested in predicting the final price of an item based on “openbid”, “auction_type”, and the new feature added in Part 2. (40 pts)
5- Run the linear regression on the data, splitting it into 70% training and 30% testing. (20 pts)
6- Report the MSE and R2 of the model. (10 pts)
2023-12-10