ECS7005P Risk and Decision-Making for Data Science and AI 2021
Summer Examination Period 2021 — May — Semester B
ECS7005P Risk and Decision-Making for Data Science and AI
Question 1
A new virus is affecting the population. People who have the virus will normally have specific symptoms such as a cough and the loss of the sense of taste and/or smell.
It is estimated that 1 in 5 of people who suffer these symptoms have the virus and 1 in 2000 people without these symptoms have the virus.
A test for the virus has the following accuracy:
· For people with symptoms, the true positive rate is 90% and the false positive rate is 5%
· For people without symptoms, the true positive rate is 80% and the false positive rate is 1%
Answer the following questions:
a) If we know that 5% of the population have symptoms, what percentage of the population has the virus? [2 marks]
b) What is the probability that a person with symptoms will test positive? [2 marks]
c) What is the probability that a person without symptoms will test positive? [2 marks]
d) A person with symptoms tests positive. What is the probability they have the virus? [2 marks]
e) A person with symptoms tests negative. What is the probability they have the virus? [2 marks]
f) A person without symptoms tests positive. What is the probability they have the virus? [2 marks]
g) A person without symptoms tests positive and is subject to an additional test. Assuming that a second test is independent of the first, what is the probability they test positive in this second test? [4 marks]
h) A person without symptoms tests positive in both the first and second test. What is the probability they have the virus? [4 marks]
[Question 1 Total: 20 marks]
Question 2
Table 1 summarizes the results from an observational study into the effectiveness of two drugs A and B for treating migraine.
| | Aged < 50: Effective | Aged < 50: Non-effective | Aged 50+: Effective | Aged 50+: Non-effective |
|---|---|---|---|---|
| Drug A | 420 | 80 | 70 | 30 |
| Drug B | 85 | 15 | 150 | 50 |
The ‘success rate’ is the percentage of effective outcomes.
Answer the following questions:
a) What was the ‘success rate’ for Drug A for the study participants overall? [1 mark]
b) What was the ‘success rate’ for Drug B for the study participants overall? [1 mark]
c) What was the ‘success rate’ for Drug A for the study participants aged < 50? [1 mark]
d) What was the ‘success rate’ for Drug B for the study participants aged < 50? [1 mark]
e) What was the ‘success rate’ for Drug A for the study participants aged 50+? [1 mark]
f) What was the ‘success rate’ for Drug B for the study participants aged 50+? [1 mark]
g) What can you conclude from the above results? [2 marks]
h) Name the paradox evident in this study. [1 mark]
i) What is the main cause of the paradox in this example? [3 marks]
j) Draw the causal model that explains the data and write down the probability tables for each node in that model. [6 marks]
k) How would you amend the model to one that avoids the paradox? [2 marks]
l) By doing what you proposed in k) (or by other means) estimate the ‘true’ success rate for each drug for the whole population. [4 marks]
m) Suppose you know that a patient took Drug A and the outcome was not effective. We don’t know the patient’s age, but we want to answer the counterfactual question: “Would the outcome have been effective if this patient had taken Drug B instead of Drug A?” In your answer to this question, provide a sketch of a causal model that supports your reasoning. [6 marks]
[Question 2 Total: 30 marks]
Question 3
It is known that about 2.3% of people who have sleeping disorders have severe insomnia (defined as going more than 36 hours without being able to sleep at all).
A study of 1000 people who have sleeping disorders discovered that tea-drinkers (classified as those who drink more than 2 cups of tea a day) are more likely to suffer severe insomnia.
| | Tea-drinkers | Not tea-drinkers |
|---|---|---|
| Severe insomnia | 9 | 14 |
| Other sleeping disorders | 291 | 686 |
| Total | 300 | 700 |
a) Answer the following about people with sleeping disorders:
i) What is the relative increase in risk of having severe insomnia for tea drinkers compared to non-tea drinkers? [3 marks]
ii) What is the absolute increase in risk of having severe insomnia for tea drinkers compared to those who are not tea-drinkers? [3 marks]
b) Suppose we know that 10% of the population have sleeping disorders. Of those with sleeping disorders, 30% are tea-drinkers. Of those with no sleeping disorders, only 20% are tea-drinkers. Answer the following questions about the whole population:
i) What is the relative increase in risk of having severe insomnia for tea-drinkers compared to those who are not tea-drinkers? [5 marks]
ii) What is the absolute increase in risk of having severe insomnia for tea drinkers compared to those who are not tea-drinkers? [5 marks]
Hint: you should assume a population size of 100,000 and create two tables like above for people with and without sleep disorders.
c) What paradox could be triggered if you used the above 1000-person study to make inferences about the risk of severe insomnia caused by tea-drinking in the entire population? [2 marks]
d) Which of the following headlines is the most misleading? [2 marks]
i) “Study shows people with sleeping disorders should consider cutting down on the amount of tea they drink”.
ii) “Drinking more than 2 cups of tea a day more than doubles the risk of having the most severe form of sleep disorder”.
iii) “People with sleeping disorders who drink more than 2 cups of tea a day are at increased risk of the most severe sleep deprivation”.
iv) “Drinking more than 2 cups of tea a day may lead to severe sleep deprivation”.
[Question 3 Total: 20 marks]
Question 4
The following algorithm is ‘learnt’ from a subset of the dataset of passengers on the Titanic cruise liner which sank after hitting an iceberg on 15 April 1912:
If Sex = “Male” then Probability (survive) = 0.2
If Sex = “Female” and Class = 1 or 2 then Probability (survive) = 0.8
If Sex = “Female” and Class = 3 then Probability (survive) = 0.6
The relevant information in the different test dataset is summarized as:
| | Male | Female, Class 1 or 2 | Female, Class 3 |
|---|---|---|---|
| Survived | 75 | 75 | 60 |
| Did not survive | 225 | 15 | 50 |
Based on this test set data, the accuracy of the algorithm for cut-off value 0.1 can be represented in the following format, where “YES” means survive and “NO” means not survive.
| | Number predicted YES | Number predicted NO | Total |
|---|---|---|---|
| Number YES’s | 210 | 0 | 210 |
| Number NO’s | 290 | 0 | 290 |
This enables us to compute:
Sensitivity: 100%; Specificity: 0%; False positive rate: 100%; Accuracy: 42%
a) For each of the different cut-off values 0.5, 0.7, 0.9 complete the following table and fill in all the missing ?? values
| | Number predicted YES | Number predicted NO | Total |
|---|---|---|---|
| Number YES’s | ?? | ?? | 210 |
| Number NO’s | ?? | ?? | 290 |
Sensitivity: ??%; Specificity: ??%; False positive rate: ??%; Accuracy: ??%
You will need to complete three tables and, in each case, give the sensitivity, specificity, false positive rate and accuracy percentages (8 marks each). [24 marks]
b) Sketch the ROC curve for this algorithm. [6 marks]
[Question 4 Total: 30 marks]
Solutions
Question 1
a) If we know that 5% of the population have symptoms, what percentage of the population has the virus? (0.05 x 0.2)+(0.95 x 0.0005) = 0.010475 = 1.0475% [2 marks]
b) What is the probability a person with symptoms will test positive? 22% [2 marks]
c) What is the probability a person without symptoms will test positive? 1.04% [2 marks]
d) A person with symptoms tests positive. What is the probability they have the virus? 81.8% [2 marks]
e) A person with symptoms tests negative. What is the probability they have the virus? 2.6% [2 marks]
f) A person without symptoms tests positive. What is the probability they have the virus? 3.8% [2 marks]
g) A person without symptoms tests positive. Assuming that a second test is independent of the first, what is the probability they test positive in a second test? 4.04% [4 marks]
h) A person without symptoms tests positive in both the first and second test. What is the probability they have the virus? 76.2% [4 marks]
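The answers above can be checked with a few lines of Python. This is only a sketch of one way to organise the calculation: the helper names `p_positive` and `posterior` are illustrative, and parts g) and h) assume that "independent" means the two test results are conditionally independent given whether the person has the virus.

```python
# Figures taken from the question text.
p_symptoms = 0.05
p_virus = {"symptoms": 1 / 5, "no_symptoms": 1 / 2000}
tpr = {"symptoms": 0.90, "no_symptoms": 0.80}  # true positive rates
fpr = {"symptoms": 0.05, "no_symptoms": 0.01}  # false positive rates


def p_positive(group, prior=None):
    """Probability of a positive test, by the law of total probability."""
    prior = p_virus[group] if prior is None else prior
    return prior * tpr[group] + (1 - prior) * fpr[group]


def posterior(group, positive, prior=None):
    """P(virus | test result) by Bayes' theorem."""
    prior = p_virus[group] if prior is None else prior
    like_virus = tpr[group] if positive else 1 - tpr[group]
    like_no_virus = fpr[group] if positive else 1 - fpr[group]
    evidence = prior * like_virus + (1 - prior) * like_no_virus
    return prior * like_virus / evidence


answers = {
    "a": p_symptoms * p_virus["symptoms"]
    + (1 - p_symptoms) * p_virus["no_symptoms"],
    "b": p_positive("symptoms"),
    "c": p_positive("no_symptoms"),
    "d": posterior("symptoms", positive=True),
    "e": posterior("symptoms", positive=False),
    "f": posterior("no_symptoms", positive=True),
}
# Assumption: after the first positive test, the posterior from f) becomes
# the prior for the second (conditionally independent) test.
answers["g"] = p_positive("no_symptoms", prior=answers["f"])
answers["h"] = posterior("no_symptoms", positive=True, prior=answers["f"])

for part, value in answers.items():
    print(f"{part}) {100 * value:.4g}%")
```

Using the posterior from f) as the prior for the second test gives the same result for h) as applying Bayes' theorem once with both positive results.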
Question 2
a) Drug A overall? 81.7% [1 mark]
b) Drug B overall? 78.3% [1 mark]
c) Drug A for the study participants aged < 50? 84% [1 mark]
d) Drug B for the study participants aged < 50? 85% [1 mark]
e) Drug A for the study participants aged 50+? 70% [1 mark]
f) Drug B for the study participants aged 50+? 75% [1 mark]
g) In each age subcategory Drug B was more effective than Drug A, but overall Drug A was more effective. [2 marks]
h) Simpson’s paradox [1 mark]
i) Age is a confounder. There were fewer older people in the study and older people were more likely to take Drug B than Drug A [3 marks]
j) The model [6 marks]
k) Cut the link from “Age” into node “Drug” [2 marks]
l) A: 79.3% B: 81.7% [4 marks]
m) [6 marks]
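The success rates in a) to f) and the adjusted rates in l) can be reproduced from the counts in Table 1. The sketch below is one possible check, not the official method: the function names are illustrative, and the adjustment assumes that the age split of the study participants (600 aged < 50, 300 aged 50+) is representative of the whole population.

```python
# counts[drug][age_group] = (effective, non_effective), copied from Table 1.
counts = {
    "A": {"<50": (420, 80), "50+": (70, 30)},
    "B": {"<50": (85, 15), "50+": (150, 50)},
}


def rate(effective, non_effective):
    """Success rate: the share of effective outcomes."""
    return effective / (effective + non_effective)


def overall_rate(drug):
    """Unadjusted success rate, pooling both age groups."""
    effective = sum(e for e, _ in counts[drug].values())
    non_effective = sum(n for _, n in counts[drug].values())
    return rate(effective, non_effective)


def age_share(age):
    """Share of all study participants in the given age group.

    Assumption: this share is used as the population age distribution.
    """
    in_group = sum(sum(counts[drug][age]) for drug in counts)
    total = sum(sum(cell) for drug in counts for cell in counts[drug].values())
    return in_group / total


def adjusted_rate(drug):
    """Age-adjusted success rate: stratum rates weighted by age share."""
    return sum(
        age_share(age) * rate(*counts[drug][age]) for age in counts[drug]
    )


for drug in counts:
    print(
        f"Drug {drug}: overall {100 * overall_rate(drug):.1f}%, "
        f"adjusted {100 * adjusted_rate(drug):.1f}%"
    )
```

Weighting each age group by its share of the participants removes the influence of age on which drug was taken, which is why the ordering of the two drugs reverses relative to the pooled rates.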
Question 3 (TOTAL 20 marks)
a) People in the study
i) Tea drinkers 3% (9/300), non-tea drinkers 2% (14/700), so the relative risk increase is 50%. [3 marks]
ii) Absolute risk increase is 1 percentage point (3% − 2%) [3 marks]
b) Whole population
| | Sleep disorders (10,000): tea drinkers (3,000) | Sleep disorders (10,000): non-tea drinkers (7,000) | No sleep disorders (90,000): tea drinkers (18,000) | No sleep disorders (90,000): non-tea drinkers (72,000) |
|---|---|---|---|---|
| Most severe | 90 | 140 | 0 | 0 |
| Not most severe | 2,910 | 6,860 | 18,000 | 72,000 |
| Whole population (100,000) | Tea drinkers (21,000) | Non-tea drinkers (79,000) |
|---|---|---|
| Most severe | 90 | 140 |
| Not most severe | 20,910 | 78,860 |
i) 90 out of 21,000 tea drinkers (=0.4286%) have the most severe form of sleep deprivation; 140 out of 79,000 non-tea drinkers (=0.1772%) have the most severe form of sleep deprivation. So the relative risk increase is (0.4286-0.1772)/0.1772 = 142% [5 marks]
ii) But the absolute risk increase is just 0.25 percentage points (0.4286% − 0.1772%) [5 marks]
c) Berkson’s or Collider paradox [2 marks]
d) (ii) is the most misleading. [2 marks]
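The risk figures in a) and b) can be recomputed as follows. This is a hedged sketch: `risk_increase` is an illustrative helper, and part b) assumes, as the solution tables do, that severe insomnia occurs only among people with sleeping disorders and that the study rates (3% and 2%) apply to everyone with a sleeping disorder.

```python
def risk_increase(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Return (relative, absolute) risk increase for exposed vs unexposed."""
    risk_exposed = cases_exposed / n_exposed
    risk_unexposed = cases_unexposed / n_unexposed
    absolute = risk_exposed - risk_unexposed
    relative = absolute / risk_unexposed
    return relative, absolute


# a) People in the 1000-person study (all have sleeping disorders).
study = risk_increase(9, 300, 14, 700)

# b) Whole population, using the hinted population size of 100,000.
population = 100_000
with_disorder = population * 10 // 100             # 10% have sleep disorders
tea_with_disorder = with_disorder * 30 // 100      # 30% of those drink tea
non_tea_with_disorder = with_disorder - tea_with_disorder
without_disorder = population - with_disorder
tea_without_disorder = without_disorder * 20 // 100  # 20% drink tea
non_tea_without_disorder = without_disorder - tea_without_disorder

# Assumption: study rates carry over, and there are no severe cases among
# people without sleeping disorders.
severe_tea = tea_with_disorder * 9 // 300
severe_non_tea = non_tea_with_disorder * 14 // 700

whole = risk_increase(
    severe_tea,
    tea_with_disorder + tea_without_disorder,
    severe_non_tea,
    non_tea_with_disorder + non_tea_without_disorder,
)

for label, (relative, absolute) in (("study", study), ("population", whole)):
    print(
        f"{label}: relative increase {100 * relative:.0f}%, "
        f"absolute increase {100 * absolute:.2f} percentage points"
    )
```

The contrast between the two printed lines is the point of the question: the relative increase grows when moving to the whole population while the absolute increase shrinks.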
Question 4
a) The accuracy for cut-off value 0.5 is:
| | Number predicted YES | Number predicted NO | Total |
|---|---|---|---|
| Number YES’s | 135 | 75 | 210 |
| Number NO’s | 65 | 225 | 290 |
Sensitivity: 64%
Specificity: 78%
False positive rate: 22%
Accuracy: 72%
b) The accuracy for cut-off value 0.7 is:
| | Number predicted YES | Number predicted NO | Total |
|---|---|---|---|
| Number YES’s | 75 | 135 | 210 |
| Number NO’s | 15 | 275 | 290 |
Sensitivity: 36%
Specificity: 95%
False positive rate: 5%
Accuracy: 70%
c) The accuracy for cut-off value 0.9 is:
| | Number predicted YES | Number predicted NO | Total |
|---|---|---|---|
| Number YES’s | 0 | 210 | 210 |
| Number NO’s | 0 | 290 | 290 |
Sensitivity: 0%
Specificity: 100%
False positive rate: 0%
Accuracy: 58%
d) ROC curve for this algorithm
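The three tables, and the points through which the ROC curve passes, can be recomputed from the test-set counts. The sketch below assumes a passenger is predicted YES when the learnt probability is at least the cut-off (none of the cut-offs used here coincides with a learnt probability, so the choice between "at least" and "greater than" does not matter). The names `groups`, `confusion` and `metrics` are illustrative.

```python
# (learnt survival probability, number survived, number not survived)
# for each group in the test dataset.
groups = [
    (0.2, 75, 225),  # Male
    (0.8, 75, 15),   # Female, Class 1 or 2
    (0.6, 60, 50),   # Female, Class 3
]


def confusion(cutoff):
    """Return (TP, FN, FP, TN) when predicting YES if probability >= cutoff."""
    tp = sum(yes for p, yes, _ in groups if p >= cutoff)
    fn = sum(yes for p, yes, _ in groups if p < cutoff)
    fp = sum(no for p, _, no in groups if p >= cutoff)
    tn = sum(no for p, _, no in groups if p < cutoff)
    return tp, fn, fp, tn


def metrics(cutoff):
    """Sensitivity, specificity, false positive rate and accuracy."""
    tp, fn, fp, tn = confusion(cutoff)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


cutoffs = (0.1, 0.5, 0.7, 0.9)

# Each ROC point is (false positive rate, sensitivity) for one cut-off.
roc_points = [
    (metrics(c)["false_positive_rate"], metrics(c)["sensitivity"])
    for c in cutoffs
]

for cutoff, (fpr, tpr) in zip(cutoffs, roc_points):
    print(f"cut-off {cutoff}: FPR {100 * fpr:.0f}%, sensitivity {100 * tpr:.0f}%")
```

Plotting sensitivity against false positive rate and joining the four points, from (0%, 0%) at cut-off 0.9 up to (100%, 100%) at cut-off 0.1, gives the ROC curve for this algorithm.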
2022-07-21