Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

COMP5338:  Advanced Data Models

Sem.  2/2022

MongoDB Project

1   Introduction

In this assignment, you will demonstrate that

• You are able to implement query workload using MongoDB’s aggregate framework.

• You are able to use the performance monitoring tools provided by MongoDB to in-

spect query execution plans and statistics.

• You understand how index can be used in finding and sorting query results

• You are able to write an alternative implementation to improve the query perfor-

mance

The data set you will work with consists of climate data objects, city objects, and user objects.  In the first part of the assignment, you are asked to implement a set of target workloads to explore different features of the data. In the second part of the assignment, you are asked to investigate and compare the performance of a few find queries. You also need to provide an alternative implementation of an aggregation query.

2   Data set

The data set consists of three JSON files. The climate.json file is the same JSON file that you have encountered in Assignment 1. The coords.json file contains a few JSON objects where each object stores latitude, longitude, and elevation data for a particular city. This information is extracted from relevant Wikipedia articles for these cities. An example of such an object is as follows:


{

"_id" :  1 ,

"city_id" :  2 ,

"city" :  "Athens" ,

"latitude" :  37 .98 ,

"longitude" :  23 .73 ,

"elevation" :  153

} 


Each JSON object in coords.json contains the following fields:

. _id: A unique integer identifier;

.  city_id: The unique integer identifier for the object where climate data for the city can be found;

.  city: The name of the city;

. latitude: The city’s latitude in degrees;

. longitude: The citys longitude in degrees; and

. elevation: The city’s elevation in metres.

For the latitude fields, positive values indicate locations in the northern hemisphere while negative values indicate locations in the southern hemisphere.  A latitude of zero corresponds to the equator.

For the longitude fields, positive values indicate locations in the eastern hemisphere while negative values indicate locations in the western hemisphere. A longitude of zero corresponds to the Greenwich meridian (the prime meridian); a longitude of 180 or -180 corresponds to the antimeridian (we will assume that the Earth is a perfect sphere for simplicity).

For example, the above coords JSON object gives the following location information on Athens:

. In the climates collection, the JSON object whose _id is 3 contains climatic informa- tion on this city;

. Athens’s latitude is 37.98 °N;

. Athens’s longitude is 23.73 °E; and

. Athens’s elevation is 153 metres.

The users .json file contains a few JSON objects where each object stores the last name, given name, and the current city of residence for a particular user. This information is entirely fictional. An example of such an object is as follows:


{

"_id" :  257 ,

"lastName" :  "Smith" ,

"givenName" :  "Tanya" ,

"city_id" :  90

} 


Each JSON object in users.json contains the following fields:

• _id: A unique integer identifier;

• lastName: The user’s last name;

• givenName: The user’s given name; and

•  city_id: The city_id identifier for the object in the coords collection representing the city where the user currently lives.

For example, the above users JSON object gives the following information:

• The user’s last name is "Smith";

• The user’s given name is "Tanya"; and

• In the coords collection, the JSON object whose city_id is 90 contains latitude, longitude, and elevation information about the city where the user is currently living.

3   Aggregation workload Task

The first part of the assignment requires you to implement five aggregation workloads. This section includes the workload description, preparation and submission guide.

3.1   Workload Description

•  [W1] For each region that appears in the climates collection, find the number of cities in this region that are in this collection. For each region in the climates collection, your result must be a document in the format

{

region:  region_name,

numCities:  city_count

}

where region_name is the name of a region and city_count is the number of cities in this region that are in the climates collection. Your results must be sorted by region in ascending order.

•  [W2] Find the number of cities whose average monthly rainfall for January is greater than their average monthly rainfall for July.  Your result must consist of one docu- ment in the format:  {count:    number_of_cities} where number_of_cities is the number of cities satisfying the rainfall condition.

•  [W3] Find the number of people who live in cities in the northern hemisphere (it does not matter whether your aggregation includes the equator or not). Your result must consist of one document in the format: {count:    number_of_people}, where number_of_people is the number of people satisfying the location condition. In your aggregation workload, cities that do not have a coords entry are not considered to be in the northern hemisphere.

•  [W4] Tropical climates are defined by a monthly average temperature of 18 °C or higher in the coolest month.   The average monthly temperature for a particular month is defined as the arithmetic mean of the average minimum and average max- imum temperatures for that month. The coolest month is defined as the month with the lowest monthly average temperature.  Find out the cities in the data set that have a tropical climate. For each city with a tropical climate, your result must be a document in the format

{

city:  city_name,

coolestMthAvgTemp:  coolest_month_average_temperature }

where city_name is the name of the city and coolest_month_average_temperature is the average monthly temperature of the coolest month in that city.  Your results  must be sorted by the city name in ascending order.

•  [W5] For tropical climates, a dry season month is defined as a month when average precipitation is below 60 millimetres. Find out the number of dry months for each tropical climate city in the data set. For each tropical climate city, your result must be a document in the format

{

city:  city_name,

numDryMth:  number_of_dry_months

}

where city_name is the name of the city and number_of_dry_months is the number of dry months in that city. Your results must be sorted by the number of dry months and the city name in descending order.

3.2   Preparatory Work

Before you implement the workloads given in the previous section, you need to import the  data in climate .json, coords .json and users .json into collections called climates,  coords and users respectively. Note the collection name for climate data is called climates in this assignment to make it consistent with the name of the other collections. These col-  lections must be in a database called city_data.

3.3   Implementation Requirements

Implement each aggregation using a single MongoDB aggregate command. You must not use other commands such as find.  Each workload implementation must be placed in a separate JavaScript file using the following filenames:

• w1.js - W1 implementation;

• w2.js - W2 implementation;

• w3.js - W3 implementation;

• w4.js - W4 implementation; and

• w5.js - W5 implementation

No other files are permitted (other than a2_driver .js). You must not modify a2_driver .js.

Your JavaScript files must be able to execute correctly when they are called from the supplied a2_driver .js file. To test your solutions, you should run this driver file. When you run this file, it will print out the results of each query.

For each workload, you must store your result in a variable called res. You need to use var  res  =  db .collection .aggregate() to ensure the variable res is updated properly in the driver script.  Dummy w1 .js - w5 .js scripts are provided to show the required format. All scripts print out the number of documents in different collections.

Do not include configuration statements in your JavaScript files; include only the ag- gregation statements in your JavaScript scripts.  The driver file contains all the required configuration statements.

4   Performance Observation Task

The second part of this assignment involves performance observation and investigation. In particular, you are asked to investigate the usage of various indexes in find queries. You are also asked to provide an alternative implementation of a given aggregation and compare the performance of the different implementations. You need to document your observation and analysis in a report. The report should have the following sections:

• Introduction

• Single field index and compound index usage

• Multikey Index Usage

• Aggregation Performance

• Conclusion

The introduction and conclusion sections should be very brief.  They do not carry any marking point by themselves. They contribute to the professionalism of the entire report. The details of the three middle sections are given as follows.

4.1   Single field index and compound index usage

Run the following aggregation to create a new collection called climates2. This collection adds two new fields to store yearly average dry days and snow days.

 

db .climates .aggregate(

[

{  $addFields:{

yearlyAvgDryDays:    {$sum :  "$monthlyAvg .dryDays"},

yearlyAvgSnowDays:  {$sum :  "$monthlyAvg .snowDays"}

}

},

{$out :  "climates2"}

])

 

Create a single field index on yearlyAvgDryDays, a single field index on yearlyAvgSnowDays,

and a compound index on {yearlyAvgDryDays:1,  yearlyAvgSnowDays:1} on the col- lection climates2.

Compare the execution of query 1 and query 2.

 

db .climates2 .find(

{yearlyAvgDryDays:{$gt:150}

}) .sort({yearlyAvgDryDays : - 1})

 

Query 1: Find and Sort by One Field

 

db .climates2 .find(

{$or:[    {yearlyAvgSnowDays:{$gt:100}},

{yearlyAvgDryDays:{$gt:150}}

]

}) .sort({yearlyAvgDryDays : - 1,  yearlyAvgSnowDays : - 1})

 

Query 2: Find and Sort by Two Fields

For each query, describe all plans evaluated.  If multiple plans are evaluated, give the reason a particular plan is chosen as the winning plan.  If not all relevant indexes are evaluated, explain why some indexes are not evaluated as part of a candidate plan.  If an index is used in the chosen plan, describe also if the sort order can be obtained from the index.  You should include output from MongoDB’s performance monitoring tool to support your description. However, outputs should only be used as evidence; they are not a substitute for textual description.

Give a summary of the difference observed in the two executions. In particular, explain why certain indexes are evaluated in one query but not in the other.

4.2   Mulitkey index usage

Create a multikey index on field monthlyAvg .dryDays in the collection climates. Com- pare the execution of query 3 and query 4.

 

db .climates .find({

region : 'United  States ' ,

"monthlyAvg":  {$elemMatch:{dryDays:  {$gt:  20,  $lt:30}  }}})

 

Query 3: Query Individual Elements in an Array Field

 

db .climates .find({

region : 'United  States ' ,

"monthlyAvg .dryDays" :  {$gt :  20,  $lt:30}})

 

Query 4: Query an Array Field

For each query, describe all plans evaluated. You should read the MongoDB documentation on multikey index usage to get a general understanding on how multkey indexes can be used in different query conditions. Identify the winning plan and if there is any, the rejected plan. If multiple plans are evaluated, give the reason a particular plan is chosen.

4.3   Aggregation performance

MongoDB provides a rich set of aggregation stages and operators. A workload could be implemented by different aggregation pipelines with very different execution plans and performances. Query 5 is an initial attempt to find from the climates collection any city where the wettest month also has the lowest minimum temperature. Investigate the per- formance of this query, paying special attention to the number of stages, memory usage and documents processed at each stage. You are asked to provide an alternative implementa- tion that would produce similar results (the outputs should contain the same information, but could be structured differently). The alternative implementation should use a single aggregation command to find the results.  Ideally, your implementation should be more efficient than the given one.  If your implementation is able to utilize indexes, you can create them as well.

Include the actual aggregation command of your alternative implementation in the re- port. Describe and compare the execution statistics of the two implementations. Highlight the parts or main reason that one implementation is more efficient than the other.

5   Deliverables and Submission Rules

There are two deliverables for this assignment. They should be submitted using different submission inboxes.

The deliverable for the first part is a single zip file that contains the JavaScript files of your solutions as described in section 3. Do not include a2_driver .js or the code for importing JSON data source in your submission.  Submit your file to the appropriate Assignment submission inbox in Canvas.

You must follow the implementation requirement and the submission guide. Your sub- mission will be marked using a marking script.  Deviation from the implementation re- quirement or submission guide may result in zero points for certain queries. Testing your scripts with the provided a2_driver .js script before submission can minimize the chance of making accidental errors.

The deliverable for the second part is a single pdf file of your report. You should submit the pdf file to the appropriate submission inbox.

The submission deadline is week 7  15 September 2022, 23:59 Sydney time. Late penal- ties apply for late submissions.

 

db .climates .aggregate([

{  $unwind:    {  path :  "$monthlyAvg" ,  includeArrayIndex :     "month"  }  },

{  $group :

{  _id :  "$city" ,

monthlyAvg:  {

$push:  {

low : "$monthlyAvg .low" ,

rainfall :  "$monthlyAvg . rainfall" ,

month :  "$month"

}

}

}

},

{  $project:{

lowestMinTmpMonth:  {

$first:{

$sortArray:{

input :  "$monthlyAvg" ,

sortBy:  {low:1,  rainfall : - 1}

}

}

},

wettestMonth:  {

$first:{

$sortArray:{

input :  "$monthlyAvg" ,

sortBy:  {rainfall : - 1,  low:1}

}

}

}

}

},

{  $match:{

$expr:  {$eq:   ["$lowestMinTmpMonth .month" ,  "$wettestMonth .month"]} }

},

{  $sort:  {_id:  1}}

])

 

Query 5: Finding the wettest month that also has the lowest minimum temperature