
SIT743 Bayesian Learning and Graphical Models

Assignment-2

2022

Q1) [14 Marks]

Consider the hourly pedestrian count data collected at the Melbourne Central station in Melbourne over the one-month period of March 2022. This dataset is given as a CSV file named “MelbCentPedCntMarch2022.csv”.

1.1)     Plot the histogram for the count data. Comment on the shape. How many modes can be observed in the data?

1.2)     Fit a single Gaussian model $\mathcal{N}(\mu, \sigma)$ to the distribution of the data, where $\mu$ is the mean and $\sigma$ is the standard deviation of the Gaussian distribution.

Find the maximum likelihood estimate (MLE) of the parameters, i.e., the mean $\mu$ and the standard deviation $\sigma$.

Plot the obtained (single Gaussian) density distribution along with the histogram on the same graph.
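As a rough illustration of the steps in Q1.2, the sketch below computes the Gaussian MLE on synthetic counts and overlays the fitted density on the histogram. The vector `counts` and its values are made up for illustration; the real data would be read from the CSV file, whose column names are not given here. Note that the MLE of the standard deviation divides by n, whereas R's `sd()` divides by n - 1.

```r
# Synthetic stand-in for the hourly counts; NOT the assignment data.
set.seed(1)
counts <- c(rnorm(300, mean = 200, sd = 50), rnorm(200, mean = 1500, sd = 300))

# Maximum likelihood estimates for a single Gaussian.
mu_hat <- mean(counts)
sigma_hat <- sqrt(mean((counts - mu_hat)^2))  # divides by n, not n - 1

# Histogram on the density scale with the fitted Gaussian on top.
hist(counts, breaks = 30, freq = FALSE,
     main = "Histogram with fitted Gaussian", xlab = "count")
curve(dnorm(x, mean = mu_hat, sd = sigma_hat), add = TRUE, col = "red", lwd = 2)
```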

1.3)     Fit a mixture of Gaussians model to the distribution of the data, using four Gaussians. Use R programming to perform this.

Provide the mixing coefficients, mean and standard deviation for each of the Gaussians found.

Plot all these Gaussians on top of the histogram plot.

Include a plot of the combined density distribution as well (use different colors for the density plots in the same graph).

1.4)     Provide a plot of the log-likelihood values obtained over the iterations and comment on them.
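The mixture fit and log-likelihood trace asked for in Q1.3 and Q1.4 could be sketched as follows. This is a minimal base-R EM loop written for illustration: `fit_gmm_em` is a hypothetical helper, the data are synthetic, and the initialisation (means at spread-out quantiles) is one assumption among many possible choices. Packages such as mixtools provide a ready-made alternative (`normalmixEM`).

```r
# Minimal EM for a 1-D Gaussian mixture. fit_gmm_em is an illustrative helper,
# not a function from any package.
fit_gmm_em <- function(x, k, max_iter = 500, tol = 1e-6) {
  n <- length(x)
  mu <- as.numeric(quantile(x, probs = (seq_len(k) - 0.5) / k))  # spread-out start
  sigma <- rep(sd(x), k)
  lambda <- rep(1 / k, k)                                        # mixing coefficients
  loglik <- numeric(0)
  for (iter in seq_len(max_iter)) {
    # E-step: weighted component densities (n x k matrix).
    dens <- sapply(seq_len(k), function(j) lambda[j] * dnorm(x, mu[j], sigma[j]))
    loglik <- c(loglik, sum(log(rowSums(dens))))
    if (iter > 1 && abs(loglik[iter] - loglik[iter - 1]) < tol) break
    resp <- dens / rowSums(dens)                                 # responsibilities
    # M-step: re-estimate weights, means and standard deviations.
    nk <- colSums(resp)
    lambda <- nk / n
    mu <- colSums(resp * x) / nk
    sigma <- sqrt(colSums(resp * outer(x, mu, "-")^2) / nk)
  }
  list(lambda = lambda, mu = mu, sigma = sigma, loglik = loglik)
}

# Synthetic four-component data; NOT the assignment data.
set.seed(42)
x <- c(rnorm(200, 100, 30), rnorm(200, 600, 80),
       rnorm(200, 1500, 150), rnorm(200, 3000, 300))
fit <- fit_gmm_em(x, k = 4)

# Components and combined density over the histogram.
hist(x, breaks = 40, freq = FALSE, main = "Mixture of 4 Gaussians", xlab = "count")
grid <- seq(min(x), max(x), length.out = 500)
comp <- sapply(1:4, function(j) fit$lambda[j] * dnorm(grid, fit$mu[j], fit$sigma[j]))
cols <- c("red", "blue", "darkgreen", "purple")
for (j in 1:4) lines(grid, comp[, j], col = cols[j])
lines(grid, rowSums(comp), lwd = 2, lty = 2)

# Log-likelihood over the EM iterations.
plot(fit$loglik, type = "b", xlab = "EM iteration", ylab = "log-likelihood")
```

EM never decreases the log-likelihood from one iteration to the next, so the trace should rise and then flatten as the fit converges.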

1.5)     Comment on the distribution models obtained in Q1.2 and Q1.3. Which one is better?

[Marks 2+3+6+2+1 = 14]

Q2) [43 Marks]

A study is performed to monitor the water quality of a river site in the Goulburn River catchment area in Australia. A list of factors that influence the water quality, along with their possible values, and a Bayesian network that represents the relationship between these factors (variables) are given below.

W (Water Quality) ∈ {Low, Medium, High}

O (Dissolved Oxygen) ∈ {Extremely-Low, Low, Medium, High, Extremely-High}

P (PH) ∈ {Acidic, Neutral, Basic}

S (Season) ∈ {Winter, Spring, Summer, Autumn}

T (Turbidity) ∈ {Low, Medium, High}

A (Anthropogenic Inputs) ∈ {Low, High}

F (Water flow rate) ∈ {Slow, Medium, Fast}

Figure 1

2.1)     Write down the joint distribution $P(S, F, O, A, P, T, W)$ for the above network.

2.2)     Find the minimum number of parameters required to fully specify the distribution according to the above network.

2.3)

a)   Write down the joint probability distribution if no independence among the variables is assumed.

b)  How many parameters are required, at a minimum, if no independence among the variables is assumed?

c)   Compare with the result of the above question (Q2.2) and comment.
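The parameter counting behind Q2.2 and Q2.3 can be sketched as below. The numbers of states for S, F and O come from the variable list above, but the parent sets are hypothetical (a simple chain S -> F -> O), since Figure 1 is not reproduced here; `count_bn_params` is an illustrative helper, not a library function.

```r
# Number of states per variable (taken from the variable list for S, F and O).
card <- c(S = 4, F = 3, O = 5)

# HYPOTHETICAL parent sets for illustration only (a chain S -> F -> O);
# these are not the edges of Figure 1.
parents <- list(S = character(0), F = "S", O = "F")

# Each node needs (states - 1) free parameters per configuration of its parents.
count_bn_params <- function(card, parents) {
  sum(sapply(names(card), function(v) {
    (card[[v]] - 1) * prod(card[parents[[v]]])
  }))
}

bn_params <- count_bn_params(card, parents)  # 3 + 2*4 + 4*3 = 23
full_params <- prod(card) - 1                # 4*3*5 - 1 = 59
print(c(network = bn_params, full_joint = full_params))
```

The gap between the two counts widens quickly as variables are added, which is the comparison Q2.3 c) asks about.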

2.4)     The council responsible for the site in the Goulburn River catchment area found, from a previous study, that the PH is conditionally independent of dissolved oxygen, given the water flow rate. The council wants to modify the Bayesian network given in Figure 1 by incorporating this new information. Assuming now that the PH is conditionally independent of dissolved oxygen, given the water flow rate, perform the following.

a)   What change will happen to the Bayesian network (shown in Figure 1) when the above assumption is considered? Draw the new Bayesian network considering the above assumption (you may draw this by hand).

b)  Compute the change in the minimum number of parameters required for this new Bayesian network, compared to the minimum number of parameters required for the Bayesian network shown in Figure 1. Comment on the results.

2.5)     The d-separation method can be used to find two sets of independent or conditionally independent variables in a Bayesian network. Use the Bayesian network given in Figure 1 to answer the following:

For each of the statements/questions given below from (a) to (b), perform the following:

•   List all the possible paths from the first (set of) node/s to the second (set of) node/s considered for the independence check.

•   State if each of those paths is blocking or non-blocking with reasons.

•   Hence, answer the question about independence.

a)   Is water quality (W) conditionally independent of Season (S) given water flow rate (F) and dissolved oxygen (O)?

b)  Is | {, }  ?

2.6)     For the Bayesian network shown in Figure 1, find all the nodes that are conditionally independent of Anthropogenic Inputs (A) given PH (P), dissolved Oxygen (O) and water quality (W).

2.7)     Write an R program to produce the Bayesian network shown in Figure 1, and perform the d-separation tests for the cases given below. Show the plot of the network you obtained and the output (of the d-separation tests) from your program.

a)   S ⊥ {T, F} | {O, P}

b)  {S, A} ⊥  P | {O, F}
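A d-separation check of this kind might look like the following sketch using the bnlearn package. The network structure here is hypothetical and is not the structure of Figure 1, which is not reproduced in this text. Since `dsep()` takes one node on each side, `dsep_sets` is an illustrative helper that checks every pair when either side is a set.

```r
library(bnlearn)

# HYPOTHETICAL structure for illustration; these are not the edges of Figure 1.
dag <- model2network("[S][F|S][O|F][T|F][W|O:T]")

# dsep() tests a single pair of nodes given a conditioning set.
dsep(dag, "S", "W", "F")          # every path from S to W passes through F
dsep(dag, "O", "T", c("F", "W"))  # conditioning on the collider W opens O -> W <- T

# Two sets are d-separated when every pair (x, y) is d-separated given z.
dsep_sets <- function(bn, xs, ys, z) {
  all(sapply(xs, function(x) sapply(ys, function(y) dsep(bn, x, y, z))))
}
dsep_sets(dag, "S", c("O", "T"), "F")
```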

2.8)     For the Bayesian network shown in Figure 1,

a)   find the Markov blanket of PH (P).

b)  find all the nodes that are conditionally independent of PH (P) given its Markov blanket.

c)   use an R program to find the Markov blanket of dissolved Oxygen (O). Plot the Bayesian network and show the Markov blanket nodes in the network using a different colour.
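For the Markov blanket part, a minimal sketch with bnlearn is given below. The structure is again hypothetical (not Figure 1), and the highlight colours are arbitrary choices. Plotting with `graphviz.plot()` needs the Rgraphviz package, so that step is guarded.

```r
library(bnlearn)

# HYPOTHETICAL structure for illustration; these are not the edges of Figure 1.
dag <- model2network("[S][F|S][O|F][T|F][W|O:T]")

# Markov blanket = parents, children, and the children's other parents.
blanket <- mb(dag, "O")
print(blanket)

# Highlight the blanket nodes in a different colour (requires Rgraphviz).
if (requireNamespace("Rgraphviz", quietly = TRUE)) {
  graphviz.plot(dag, highlight = list(nodes = blanket, fill = "orange", col = "black"))
}
```

In this toy network the blanket of O contains its parent F, its child W, and W's other parent T.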

2.9)   For the Bayesian network shown in Figure 1,

a)   show the step-by-step process of performing variable elimination to compute ( | = !,  " = #$%&'(,  ) = *&+,). Use the following variable ordering for the elimination process: S, A, W.

b)  what is the treewidth of the network, given the above elimination ordering?
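The mechanics of variable elimination can be illustrated on a much smaller problem than Figure 1. The sketch below uses a hypothetical binary chain A -> B -> C with made-up probability tables, eliminates A and then B, and checks the result against brute-force summation of the full joint.

```r
# MADE-UP tables for a hypothetical chain A -> B -> C (not the network of Figure 1).
p_a <- c(a0 = 0.6, a1 = 0.4)
p_b_given_a <- matrix(c(0.7, 0.3,
                        0.2, 0.8), nrow = 2, byrow = TRUE,
                      dimnames = list(c("a0", "a1"), c("b0", "b1")))
p_c_given_b <- matrix(c(0.9, 0.1,
                        0.5, 0.5), nrow = 2, byrow = TRUE,
                      dimnames = list(c("b0", "b1"), c("c0", "c1")))

# Eliminate A: tau(b) = sum_a P(a) P(b | a)
tau_b <- as.vector(p_a %*% p_b_given_a)

# Eliminate B: P(c) = sum_b tau(b) P(c | b)
p_c <- as.vector(tau_b %*% p_c_given_b)

# Brute-force check: build the full joint and sum out A and B.
joint <- array(0, dim = c(2, 2, 2))
for (a in 1:2) for (b in 1:2) for (k in 1:2) {
  joint[a, b, k] <- p_a[a] * p_b_given_a[a, b] * p_c_given_b[b, k]
}
p_c_brute <- apply(joint, 3, sum)

print(rbind(elimination = p_c, brute_force = p_c_brute))
```

Each elimination step here only ever multiplies factors over two variables; tracking the size of the largest such intermediate factor is what part b) is asking about for the given ordering.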

[Marks 2+4+6+5+10+2+4+5+5 = 43]

Q3)  [27 Marks]

A belief network shown below (Figure 2) describes the relation between four variables A, B, C, and D, along with their conditional probability tables (CPT). Each variable takes different states as given below.

A ∈ {0, 1}

B ∈ {0, 1}

C ∈ {0, 1, 2}

D ∈ {0, 1, 2}

Figure 2

3.1)     Obtain an expression (in a simplified form) for (- . | / ., 0 1, 2 3). Show the steps clearly.

3.2) The table shown below provides 30 simulated data obtained for the above Bayesian network. Use this data to find the maximum likelihood estimates of  4, 5 , 6 and .
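For a discrete network, maximum likelihood estimates of the CPT entries are relative frequencies. The sketch below shows the idea on a made-up two-variable data set with a hypothetical edge A -> B; it does not use the 30 simulated records or the parameter names of Figure 2, which are not reproduced in this text.

```r
# MADE-UP data for a hypothetical network A -> B; NOT the assignment table.
toy <- data.frame(A = c(0, 0, 0, 1, 1, 1, 1, 0, 1, 0),
                  B = c(0, 1, 0, 1, 1, 0, 1, 0, 1, 1))

# MLE of P(A): relative frequency of each state.
p_a <- prop.table(table(A = toy$A))

# MLE of P(B | A): counts normalised within each row (each value of A).
counts <- table(A = toy$A, B = toy$B)
cpt_b_given_a <- prop.table(counts, margin = 1)

print(p_a)
print(cpt_b_given_a)
```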

3.3)     Find the value of (- . | / = ., 0 = 1, 2 = 3) using the appropriate values obtained from the above question Q3.2.

3.4)      Now consider the following values for 4, 5, 6 789 ,  to answer the following

question:  : = .. 1 ,   < = .. =, > = .. ?,  @ = .. A.

Create the above belief network in R, along with the probability values shown in the CPT of Figure 2.

You may use the following libraries for this:

#https://www.bioconductor.org/install/

#BiocManager::install(c("gRain",  "RBGL",  "gRbase"))

#BiocManager::install(c("Rgraphviz"))

library("Rgraphviz")

library(RBGL)

library(gRbase)

library(gRain)

# Define the appropriate network and use the compileCPT() function to
# compile the list of conditional probability tables, and create the network.

a)   Write R code and show the obtained belief network for this distribution.

b)  Show the probability tables obtained from the R output, (and verify with the above table).

c)   Use R program to compute the following probabilities:

i)      Find the joint distribution of A, C and D.

ii)      Find the marginal distribution of C.

iii)      Find P(B=1 | C=2, D=2).
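The gRain workflow described above (define CPTs, compile, query) might be sketched as follows on a hypothetical two-node network A -> B. The probabilities are made up and are not the values of Figure 2; the same calls would be repeated with the four variables and their tables from the figure.

```r
library(gRain)

lv <- c("0", "1")

# MADE-UP probabilities for a hypothetical network A -> B (not Figure 2).
cpt_a <- cptable(~A, values = c(0.4, 0.6), levels = lv)
# Values are listed with B varying fastest: P(B | A = 0), then P(B | A = 1).
cpt_b <- cptable(~B | A, values = c(0.9, 0.1, 0.2, 0.8), levels = lv)

# Compile the CPTs and build the network.
plist <- compileCPT(list(cpt_a, cpt_b))
net <- grain(plist)

# Marginal, joint, and conditional queries.
marg_b <- querygrain(net, nodes = "B")$B
joint_ab <- querygrain(net, nodes = c("A", "B"), type = "joint")
net_ev <- setEvidence(net, nodes = "B", states = "1")
post_a <- querygrain(net_ev, nodes = "A")$A

print(marg_b)
print(joint_ab)
print(post_a)
```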

[Marks  9 + 4 + 2 + (5 +4 + 3) = 27]

Q4)  Bayesian Structure Learning [27 Marks]

For this question, you will be using a dataset called insurance, available from the ‘bnlearn’ R package, which contains 27 variables.

Use the following R code to load the insurance dataset:

library(bnlearn)

# load the data.

data(insurance)

The true network structure of this dataset can be viewed (plot) using the following R code.

library(bnlearn)

# Create and plot the network structure.

modelstring = paste0(
  "[Age][Mileage][SocioEcon|Age][GoodStudent|Age:SocioEcon]",
  "[RiskAversion|Age:SocioEcon][OtherCar|SocioEcon][VehicleYear|SocioEcon:RiskAversion]",
  "[MakeModel|SocioEcon:RiskAversion][SeniorTrain|Age:RiskAversion]",
  "[HomeBase|SocioEcon:RiskAversion][AntiTheft|SocioEcon:RiskAversion]",
  "[RuggedAuto|VehicleYear:MakeModel][Antilock|VehicleYear:MakeModel]",
  "[DrivingSkill|Age:SeniorTrain][CarValue|VehicleYear:MakeModel:Mileage]",
  "[Airbag|VehicleYear:MakeModel][DrivQuality|RiskAversion:DrivingSkill]",
  "[Theft|CarValue:HomeBase:AntiTheft][Cushioning|RuggedAuto:Airbag]",
  "[DrivHist|RiskAversion:DrivingSkill][Accident|DrivQuality:Mileage:Antilock]",
  "[ThisCarDam|RuggedAuto:Accident][OtherCarCost|RuggedAuto:Accident]",
  "[MedCost|Age:Accident:Cushioning][ILiCost|Accident]",
  "[ThisCarCost|ThisCarDam:Theft:CarValue][PropCost|ThisCarCost:OtherCarCost]")

dag = model2network(modelstring)

graphviz.plot(dag)

Use R programming, as appropriate, to answer the following questions.

4.1)     Use the insurance dataset to learn Bayesian network structures using the hill-climbing (hc) algorithm, utilizing two different scoring methods, namely the Bayesian Information Criterion score (BIC score) and the Bayesian Dirichlet equivalent score (BDe score), for each of the following sample sizes of the data:

a) 200 (first 200 data)

b) 2000 (first 2000 data)

c) 10000 (first 10,000 data)

For each of the above cases,

•   provide the scores obtained for BIC and BDe,

•   Plot the network structure obtained for the BIC and BDe scores.
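One possible way to organise the runs in Q4.1 is sketched below with bnlearn's `hc()` and `score()`. The helper `learn_both` and the layout of the results table are illustrative choices; the BDe score is computed with bnlearn's default imaginary sample size, which is an assumption rather than something the question specifies.

```r
library(bnlearn)
data(insurance)

# Illustrative helper: learn with both scores on the first n rows and
# report the score and arc count of each learned network.
learn_both <- function(n) {
  d <- insurance[seq_len(n), ]
  net_bic <- hc(d, score = "bic")
  net_bde <- hc(d, score = "bde")
  data.frame(n = n,
             bic = score(net_bic, d, type = "bic"),
             bde = score(net_bde, d, type = "bde"),
             arcs_bic = narcs(net_bic),
             arcs_bde = narcs(net_bde))
}

results <- do.call(rbind, lapply(c(200, 2000, 10000), learn_both))
print(results)
```

Each learned network can then be drawn with `graphviz.plot()`, in the same way as the true structure above.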

4.2)     Based on the results obtained for the above question (Q4.1), discuss how the BIC score compares with the BDe score for different sample sizes in terms of the structure and score of the learned network.

4.3)

a) Find the Bayesian network structures utilising the full dataset, using both BIC and BDe scores. Show the scores and the obtained networks.

b) Compare the networks obtained above (in Q4.3.a) for each of the BIC and BDe scoring methods with the true network structure and comment. Use the “compare()” and “graphviz.compare()” functions available in the “bnlearn” R package to perform these comparisons.

c) Fit the data to the network obtained using the BIC score in the above question (Q4.3.a) in order to compute the conditional probability distribution table entries (CPD table values).

Show the obtained CPD table entries for the variable “SocioEcon”.

d) Use the learned network obtained above (in Q4.3.c) to find the probability:

P(Accident="Severe" | DrivQuality="Normal", HomeBase="Suburb").

[Marks (3*4) + 3 + (4+3+3+2) = 27]
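The comparison, fitting and query steps of Q4.3 could be chained roughly as in the sketch below, shown for the BIC network only. The variable names `true_dag`, `net_bic`, `fitted` and `p_severe` are illustrative. `cpquery()` estimates the probability by sampling, so its result is approximate and varies with the seed and the number of samples `n`.

```r
library(bnlearn)
data(insurance)

# True structure, as given in the model string above.
true_dag <- model2network(paste0(
  "[Age][Mileage][SocioEcon|Age][GoodStudent|Age:SocioEcon]",
  "[RiskAversion|Age:SocioEcon][OtherCar|SocioEcon][VehicleYear|SocioEcon:RiskAversion]",
  "[MakeModel|SocioEcon:RiskAversion][SeniorTrain|Age:RiskAversion]",
  "[HomeBase|SocioEcon:RiskAversion][AntiTheft|SocioEcon:RiskAversion]",
  "[RuggedAuto|VehicleYear:MakeModel][Antilock|VehicleYear:MakeModel]",
  "[DrivingSkill|Age:SeniorTrain][CarValue|VehicleYear:MakeModel:Mileage]",
  "[Airbag|VehicleYear:MakeModel][DrivQuality|RiskAversion:DrivingSkill]",
  "[Theft|CarValue:HomeBase:AntiTheft][Cushioning|RuggedAuto:Airbag]",
  "[DrivHist|RiskAversion:DrivingSkill][Accident|DrivQuality:Mileage:Antilock]",
  "[ThisCarDam|RuggedAuto:Accident][OtherCarCost|RuggedAuto:Accident]",
  "[MedCost|Age:Accident:Cushioning][ILiCost|Accident]",
  "[ThisCarCost|ThisCarDam:Theft:CarValue][PropCost|ThisCarCost:OtherCarCost]"))

# Learn on the full dataset and compare with the true structure.
net_bic <- hc(insurance, score = "bic")
cmp <- compare(target = true_dag, current = net_bic)
print(unlist(cmp))  # true positive, false positive and false negative arcs

# Side-by-side plot of the differences (requires Rgraphviz).
if (requireNamespace("Rgraphviz", quietly = TRUE)) {
  graphviz.compare(true_dag, net_bic)
}

# Fit the CPD tables and inspect one of them.
fitted <- bn.fit(net_bic, insurance)
print(fitted$SocioEcon)

# Approximate conditional probability query by sampling.
set.seed(1)
p_severe <- cpquery(fitted,
                    event = (Accident == "Severe"),
                    evidence = (DrivQuality == "Normal") & (HomeBase == "Suburb"),
                    n = 1e5)
print(p_severe)
```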

Q5) Real world application of Bayesian networks [9 Marks].

Download the following article from the link provided below. Read that article and answer the following questions. This article provides a real-life case study on creating and using a Bayesian network for road accident data analysis.

Ali Karimnezhad & Fahimeh Moradi (2017), Road accident data analysis using Bayesian networks, Transportation Letters, 9:1, 12-19,

DOI: 10.1080/19427867.2015.1131960

Web: https://www.tandfonline.com/doi/full/10.1080/19427867.2015.1131960

Note that you will be able to download this paper via Deakin library using your Deakin credentials (username and password).

(https://www.deakin.edu.au/library/help/add-browser-bookmarklet)

a)   What are the variables used in this analysis?

What is the name of the algorithm used for learning the Bayesian network structure?

b)  In the learnt Bayesian network provided in Figure 4 (in the paper), is Injury     type independent of Sex given the knowledge about Seat belt, licence type and vehicle type? Explain.

c)   In Figure 5 (in the paper), explain what the probabilities shown for Injury Type mean.

d)   Read the section titled “Parameter learning in the road accident network” in that paper and find the following probabilities:

I.   The probability of being injured while wearing a seat belt and driving a car, knowing that the driver has a diploma degree and a type 2 driving license.

II.   The probability of not-dead while wearing a seat belt and driving a car, knowing that the driver has a diploma degree and a type 2 driving license.

[Marks  2+2+1+(2+2) = 9]