CMPUT291 - Fall 2023

Mini Project II

(group project)

Due: Nov 28th at 5pm

Clarifications:

You are responsible for monitoring the course discussion forum in eclass and this section of the project specification for more details or clarifications. No clarification will be posted after 5pm on Nov 26th.

● N/A

Introduction

The goal of this project is to teach the concept of working with data stored in files and NoSQL databases. This is done by building and operating on a document store, using MongoDB. Your job in this project is to write programs that store data in MongoDB and provide basic functions for searches and updates.

Group work policy

You will be doing this project with two or three other partners from the 291 class (i.e., groups of size 3-4). It is assumed that all group members contribute somewhat equally to the project, hence they would receive the same mark. In case of difficulties within a group, or when a partner is not pulling his/her weight, make sure to document all your contributions. If there is a break-up, each group member will get credit only for his/her portion of the work completed (losing the mark for any work either not completed or completed by a partner).

Task

You are given a json file, which you will be loading into MongoDB. Samples of the file are available on Google Drive (use your ualberta account to access the files). The data is obtained from Kaggle and includes a set of tweets. Each tweet is given on its own line and includes the date and the content of the tweet, the user who posted it, and a number of additional fields. Check out the shared json files on Google Drive for more information about the file format and the fields of a tweet and a user. Your job is to create a MongoDB collection, following Phase 1, and support searches and updates in Phase 2.

Phase 1: Building a document store

For this part, you will write a program, named load-json with a proper extension (e.g. load-json.py if using Python), which will take a json file in the current directory and construct a MongoDB collection. Your program will take as command-line input a json file name and the port number under which the MongoDB server is running, will connect to the server, and will create a database named 291db (if it does not exist). Your program will then create a collection named tweets. If the collection exists, your program should drop it and create a new collection. Your program for this phase ends after building the collection.

Data should be inserted in small batches (say 1k-10k tweets per batch) using the insertMany command in MongoDB. The input file is expected to be too large to fit in memory, so it should not be read in full before inserting. You may also use mongoimport (if available on lab machines).
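The batching requirement can be sketched as follows. This is a minimal illustration, not a reference solution: it assumes one JSON object per line (as described in the Task section), a server on localhost, and PyMongo as the driver. The helper names read_batches, load_tweets and main are hypothetical.

```python
import json
from itertools import islice
from typing import Iterable, Iterator, List


def read_batches(lines: Iterable[str], batch_size: int = 5000) -> Iterator[List[dict]]:
    """Yield lists of parsed tweets, at most batch_size per list."""
    # Blank lines are skipped; parsing is lazy, so the file is never fully in memory.
    parsed = (json.loads(line) for line in lines if line.strip())
    while True:
        batch = list(islice(parsed, batch_size))
        if not batch:
            return
        yield batch


def load_tweets(lines: Iterable[str], collection, batch_size: int = 5000) -> int:
    """Insert every tweet with collection.insert_many; return the number inserted."""
    total = 0
    for batch in read_batches(lines, batch_size):
        collection.insert_many(batch)
        total += len(batch)
    return total


def main(argv: List[str]) -> None:
    # Imported here so the helpers above can be exercised without PyMongo installed.
    from pymongo import MongoClient

    json_path, port = argv[1], int(argv[2])
    client = MongoClient("localhost", port)  # assumption: server runs on localhost
    db = client["291db"]
    db.drop_collection("tweets")  # no error is raised if the collection is absent
    with open(json_path, encoding="utf-8") as handle:
        count = load_tweets(handle, db["tweets"])
    print(f"Inserted {count} tweets")
```

A script built this way could call main(sys.argv) and be run as, for example, python3 load-json.py tweets.json 27017, where the file name and port are placeholders.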

Phase 2: Operating on the document store

Write a program that supports the following operations on the MongoDB database created in Phase 1. Your program will take as input a port number under which the MongoDB server is running, and will connect to a database named 291db on the server.

Next, users should be able to perform the following tasks.

1. Search for tweets. The user should be able to provide one or more keywords, and the system should retrieve all tweets that match all those keywords (AND semantics). A keyword matches if it appears in the content field. For each matching tweet, display the id, date, content, and username of the person who posted it. The user should be able to select a tweet and see all fields.

2. Search for users. The user should be able to provide a keyword and see all users whose displayname or location contain the keyword. For each user, list the username, displayname, and location with no duplicates. The user should be able to select a user and see full information about the user.

3. List top tweets. The user should be able to list the top n tweets based on any of the fields retweetCount, likeCount, quoteCount, to be selected by the user. The value of n will also be entered by the user. The result will be ordered in descending order of the selected field. For each matching tweet, display the id, date, content, and username of the person who posted it. The user should be able to select a tweet and see all fields.

4. List top users. The user should be able to list the top n users based on followersCount, with n entered by the user. For each user, list the username, displayname, and followersCount with no duplicates. The user should be able to select a user and see the full information about the user.

5. Compose a tweet. The user should be able to compose a tweet by entering the tweet content. Your system should insert the tweet into the database, set the date field to the system date, and set the username to "291user". All other fields will be null.
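As a rough sketch of tasks 3 to 5, the helpers below build the arguments one might pass to PyMongo rather than touching a database. Several things here are assumptions and not part of the specification: user fields are taken to be nested under a user sub-document (e.g. user.username), the date is stored as an ISO 8601 string in UTC, and NULL_FIELDS is a hypothetical, incomplete list of fields to set to null. Check the sample files for the actual layout.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional

RANKABLE_FIELDS = ("retweetCount", "likeCount", "quoteCount")
# Hypothetical and incomplete: fields of a composed tweet that are left null.
NULL_FIELDS = ("id", "retweetCount", "likeCount", "quoteCount")


def top_tweets_args(field: str, n: int) -> Dict[str, Any]:
    """Return sort/limit keyword arguments for listing the top n tweets."""
    if field not in RANKABLE_FIELDS:
        raise ValueError(f"field must be one of {RANKABLE_FIELDS}")
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return {"sort": [(field, -1)], "limit": n}  # -1 means descending order


def top_users_pipeline(n: int) -> List[Dict[str, Any]]:
    """Aggregation pipeline: one row per username, ranked by followersCount."""
    return [
        # Grouping by username removes duplicates across a user's tweets.
        {"$group": {
            "_id": "$user.username",
            "displayname": {"$first": "$user.displayname"},
            "followersCount": {"$max": "$user.followersCount"},
        }},
        {"$sort": {"followersCount": -1}},
        {"$limit": n},
    ]


def new_tweet(content: str, now: Optional[datetime] = None) -> Dict[str, Any]:
    """Build the document for a composed tweet; unspecified fields are None."""
    now = now or datetime.now(timezone.utc)  # assumption: system date in UTC
    doc: Dict[str, Any] = {name: None for name in NULL_FIELDS}
    doc["content"] = content
    doc["date"] = now.isoformat()
    doc["user"] = {"username": "291user"}  # assumption: username nested under user
    return doc
```

With a PyMongo collection named tweets, these could be used as tweets.find({}, **top_tweets_args("likeCount", 10)), tweets.aggregate(top_users_pipeline(10)), and tweets.insert_one(new_tweet("hello")).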

After each action, the user should be able to return to the main menu for further operations. There should also be an option to end the program.
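One possible shape for the main menu is a loop that dispatches to one function per task and returns to the menu afterwards. The sketch below is only an illustration: run_menu and its parameters are hypothetical names, and the read and write hooks exist so the loop can be tested without a terminal.

```python
from typing import Callable, Dict, Tuple

Action = Callable[[], None]


def run_menu(actions: Dict[str, Tuple[str, Action]],
             read: Callable[[str], str] = input,
             write: Callable[[str], None] = print) -> None:
    """Show the menu, run the chosen action, and repeat until the user quits."""
    while True:
        for key, (label, _) in actions.items():
            write(f"{key}. {label}")
        write("q. Quit")
        choice = read("Select an option: ").strip().lower()
        if choice == "q":
            return
        if choice in actions:
            actions[choice][1]()  # control returns to the menu afterwards
        else:
            write("Unknown option, please try again.")
```

A program might call run_menu({"1": ("Search for tweets", search_tweets), ...}), where search_tweets and the other callables are the functions implementing each task.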

Keyword matching. A keyword is an alphanumeric sequence of characters. You can assume that keywords in a tweet are separated by spaces or punctuation. Keyword matches in (1) and (2) are case-insensitive. Case-insensitive indexes in MongoDB can be created by setting the collation option.
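One way to express these matching rules is with case-insensitive regular expressions anchored at word boundaries. The sketch below builds MongoDB filter documents only. It is one possible approach rather than the required one: regex scans may not benefit from indexes the way a collation-based or text index would, and \b treats an underscore as part of a word. The user.displayname and user.location paths assume user fields are nested under a user sub-document, and the function names are hypothetical.

```python
import re
from typing import Any, Dict, List


def keyword_pattern(keyword: str) -> str:
    """Regex source matching keyword as a whole word."""
    return r"\b" + re.escape(keyword) + r"\b"


def tweet_filter(keywords: List[str]) -> Dict[str, Any]:
    """Filter for tweets whose content contains every keyword (AND semantics)."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    return {"$and": [
        {"content": {"$regex": keyword_pattern(k), "$options": "i"}}
        for k in keywords
    ]}


def user_filter(keyword: str) -> Dict[str, Any]:
    """Filter for tweets whose author's displayname or location has the keyword."""
    clause = {"$regex": keyword_pattern(keyword), "$options": "i"}
    return {"$or": [{"user.displayname": clause}, {"user.location": clause}]}
```

The resulting dictionaries can be passed directly as the filter argument of a PyMongo find call.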

Testing

At development time, you will be testing your programs with your own data sets, conforming to the project specification.

At demo time, we will be testing your programs with our test data files with names given in Phase 1. Using your submitted code, we will (1) build a MongoDB database in Phase 1, and (2) perform search and update operations in Phase 2. We typically follow a 5-minute rule for Phase 1, meaning your database should be built in less than 5 minutes. If not, we may have to use our own database, in which case you would lose the whole mark for Phase 1.

Every group will book a time slot convenient to all group members to demo their project. At demo time, all group members must be present. Our TAs will ask you to perform various tasks in order to test how your application handles each one. A mark will be assigned to your demo on the spot after the testing.

Here are some important details about our testing process and your choices (same as in Project 1):

1. The demo will be run using the source code submitted and nothing else. It is essential to include every file that is needed to compile and run your code.

2. We must be able to compile and run your code under our account on undergrad machines and using our own database. You are not allowed to make any changes to the code without a hefty penalty.

3. Our test data and our test cases will be published after the project due date but before our demo times. This means you have a chance to test your application and learn about possible issues (if any) before your demo time.

4. Your code cannot be demoed on a laptop (yours or ours) or any machine other than the lab machine, with only one exception. The exception is if you are developing your application using a less traditional programming language or tool that is not available on lab machines, you MAY be allowed to demo your application on a laptop. Those cases should be discussed with the instructor well before the project due date and an approval must be obtained. Otherwise, you cannot demo your project on any machine other than the lab machines.

Instructions for Submissions

Create your Github repo:

1. Read Instructions for creating GitHub repositories

2. Go to https://classroom.github.com/a/WaxloQed to create your repository. Choose a unique name for your team.

3. Update the README.md file on your Github repository with your name and ccid and those of your group members.

4. Share your repository with your group members. This can be accessed from your repo page, under "Settings" then "Collaborators and teams". Each group is expected to have one repository shared among the members.

5. SUBMIT the repository URL through eClass immediately.

6. The deadline of your Github repo page is set to our late submission deadline. Unless you are submitting late, you cannot make changes to your repo after the official project deadline listed at the top of this page.

Github submission of code

Your code will be uploaded to GitHub. Your README.md file lists the names and ccids of all group members as well as the names of anyone you have collaborated with (as much as it is allowed within the course policy) or a line saying that you did not collaborate with anyone else. This is also the place to acknowledge the use of any source of information besides the course textbook and/or class notes. If you have used any AI tool, more detail must be provided in a separate file called LLM.md (the file is already on GitHub), including the name, URL, all the input given and all the output received.

Your repository must also include your project report, which must be typed and saved as a PDF. Your report cannot exceed 3 pages.

The report should include (a) a general overview of your system with a small user guide, (b) details of your algorithms, (c) your testing strategy, and (d) your group work break-down strategy. The general overview of the system gives a high level introduction and may include a diagram showing the flow of data between different components; this can be useful for both users and developers of your application. The details of your algorithms should describe your search and insert algorithms, your approach to scale them to large data collections, the indexes you are creating to speed up your searches, your search algorithms for single and multiple keywords, etc. The testing strategy discusses your general strategy for testing, with the scenarios being tested and the coverage of your test cases. The group work strategy must list the break-down of the work items among partners, both the time spent (an estimate) and the progress made by each partner, and your method of coordination to keep the project on track. The report should also include any assumption you have made or any possible limitations your code may have.

Marking

84% of the project mark would be assigned to your implementation, which would be assessed in a demo session, and is further broken down to two phases with 10% of the mark allocated for Phase 1 and 74% for Phase 2. Another 12% of the mark will be assigned for the documentation and quality of your source code. 4% of the mark is assigned for your project task break-down and your group coordination.