Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

Fall 2022 CS157C:NoSQL Database Systems

Programming Assignment 4

In this assignment, we will exercise creating a table, populating the table through data wrangling, and manipulate data using CQL commands.

1. Create the following Cassandra table with a primary key consisting of state, city and zip. The state serves as a partition key and the city and zip together serve as a clustering key.

create table citylist

(

city varchar,

loc List <float>,

pop int,

state varchar,

zip varchar,

primary key (state,city,zip)

) ;

2. Get a JSON data from http://media.mongodb.org/zips.json. You may use the wget command.

3. Populate the table with this data. Notice that the format of the given JSON data does not follow the given schema. Therefore, you need to wrangle the given data to populate the given table. This process may involves converting JSON data to CSV and cleaning and reorganizing them using regular expressions. It is a good practice to work out your logic with a small datasets before moving to large data set. Briefly describe how you wrangle the data.

4. Get the screenshot of the first 15 of the table content.

5. Write a CQL command that gets all rows where city is 'NEW YORK'. The CQL command should not come with the state = 'NY' predicate so that the query itself doesn't work without using one of the following methods.

  Method 1: Writing a query with a predicate city = 'NEW YORK' only with ALLOW FILTERING

  Method 2: Writing a query with a prediate city = 'NEW YORK' only with a secondary index built on the city column

  Method 3: Using a materialized view of which partition key is city (You need to define a materialized view and select * from it.)

1. For each method, show the first 15 results and show the execution time of the query. (It is your job to find the way to measure the execution time of a CQL   query.)

2. Analyze the execution time and describe your rationale behind of your analysis.

6. Write a CQL command that gets all the zip codes of the NEW YORK city with a population greater than 20,000 and less than 30,000. Answer which method(s) of the previous question work(s) for this query? Also, show the first 15 results of the query.

7. Define another schema for citylist2 table in a way that data can be queried in the

ascending order of city and descending order of zip code

8. Populate citylist2 with the given JSON data and get the table content.

9. Get all the zip codes of the state California. Show the first 15 results.

10. Get all the zip codes of the city San Jose of which population is greater than 5,000 and less than 10,000. Show the first 15 results.

Your report should be organized as shown below.

3. Describe your data wrangling method.

4. Screenshot of results 5.

Method 1 : Allow Filtering

A CQL query to get the result

Result

Execution Time

Method 2 : Secondary Index

CQL command to create an index and a CQL query to get the result.

Result

Execution Time

Method 3 : Materialized View

CQL command to create a materialized view and a CQL query to get the result

Result

Execution Time

Your analysis goes here.

6. Which method(s) work for this question? CQL query to get the result

Result

7. Schema declaration

8. Show the table content.

9. CQL query to get the result Result

10. CQL query to get the result Result

Write your report in yourlast4_digits_SID.pdf and zip it into hw4.zip. Submit hw4.zip through the course web site.