Discussion: April 18th
Deadline: April 18th, 9am
If you haven’t done so, please see the general assignment notes posted with Assignment 0.
Consider a dangerous and/or common illness for which people are tested in order to detect it early (e.g. cancer) and/or prevent its spread (e.g. COVID). The test result is either positive or negative. We make the following assumptions:
You take part in a study where a random, representative sample of the population is tested for the illness. Your test result is positive. What is the probability that you have the illness?
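A minimal sketch of how this question can be answered with Bayes' theorem. The function name and the three numbers used below (prevalence, sensitivity, false positive rate) are placeholder assumptions for illustration only; substitute the values from the assumptions listed above.

```python
def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """Return P(sick | positive test) using Bayes' theorem."""
    # Total probability of a positive result:
    # P(pos) = P(pos | sick) P(sick) + P(pos | healthy) P(healthy)
    p_positive = (sensitivity * prevalence
                  + false_positive_rate * (1.0 - prevalence))
    return sensitivity * prevalence / p_positive


# Hypothetical values, NOT the ones from the assignment.
p = posterior_given_positive(prevalence=0.01,
                             sensitivity=0.9,
                             false_positive_rate=0.05)
print(f"P(sick | positive) = {p:.3f}")
```

With a rare illness, the posterior can stay well below the sensitivity of the test, because most positive results then come from the large healthy group.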
Next (mathematical solution is sufficient, no need for more simulation):
As we can see, the second test is much more prone to errors than the first.
However, assume that the results of the second test are independent of the
first. That is, whether the second test makes an error does not depend
on whether there is an error on the first test and vice versa.
Now, both of your tests come back positive. Given this information,
what is the probability that you are indeed sick?
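One way to sketch this calculation: because the two tests err independently of each other, the posterior after the first positive result can serve as the prior for the second. The helper name `update` and all numbers below are hypothetical placeholders, not the values of this assignment.

```python
def update(prior, p_pos_given_sick, p_pos_given_healthy):
    """One Bayesian update after observing a positive test result."""
    numerator = p_pos_given_sick * prior
    return numerator / (numerator + p_pos_given_healthy * (1.0 - prior))


# Hypothetical values, chosen only for illustration.
prior = 0.01
after_first = update(prior, p_pos_given_sick=0.9, p_pos_given_healthy=0.05)
after_second = update(after_first, p_pos_given_sick=0.8, p_pos_given_healthy=0.2)

# Equivalent one-shot computation: under independence, the likelihoods
# of the two positive results simply multiply.
joint_sick = 0.9 * 0.8 * prior
joint_healthy = 0.05 * 0.2 * (1.0 - prior)
one_shot = joint_sick / (joint_sick + joint_healthy)

print(f"after first test:  {after_first:.3f}")
print(f"after second test: {after_second:.3f}")
print(f"one-shot result:   {one_shot:.3f}")
```

Both routes give the same number, which is a useful check on a derivation done by hand.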
The purpose of this part is for you to walk through a basic probabilistic modeling task yourself.
Say you are at the doctor or maybe a government office and are waiting for your appointment. You would like to get an estimate of how long you will have to wait. The best way seems to be to base this on other people’s waiting times.
First, we need to decide how to model the data. This essentially boils down to finding a sensible probability distribution. Consider questions such as
An internet search should help you here; it’s a good idea to choose a well-known “standard” distribution, as these are often relatively easy to work with.
Derive the maximum likelihood solution for the parameter(s) of your model/distribution given the data. To do this, write down the log-likelihood of the dataset to obtain an expression that can be maximized. Next, you can use basic calculus to derive the optimal parameters analytically.
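To illustrate the pattern of such a derivation, suppose (purely as an assumption for this sketch, not as the required choice) that the waiting times $x_1, \dots, x_n$ are modeled as independent draws from an exponential distribution with rate $\lambda > 0$. The same steps apply to other distributions.

```latex
\begin{align*}
p(x \mid \lambda) &= \lambda e^{-\lambda x}, \qquad x \ge 0 \\
\log L(\lambda) &= \sum_{i=1}^{n} \log\left(\lambda e^{-\lambda x_i}\right)
                 = n \log \lambda - \lambda \sum_{i=1}^{n} x_i \\
\frac{\partial \log L}{\partial \lambda} &= \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0
\quad\Longrightarrow\quad
\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}
\end{align*}
```

The second derivative, $-n/\lambda^2$, is negative for all $\lambda > 0$, so the stationary point is indeed a maximum.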
To try out our model, we need some data. You have two options:
If you decide to generate data (option 2 above), choose some parameters for your
distribution and draw random samples from it. If your distribution is not too exotic,
there should be functions for this, e.g. in numpy or scipy.
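A short sketch of generating toy data with numpy. The exponential distribution and the scale of 12 minutes are assumptions made for illustration; the random generator offers similar methods for other standard distributions (e.g. `gamma`, `lognormal`).

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility

true_scale = 12.0  # hypothetical mean waiting time in minutes
waiting_times = rng.exponential(scale=true_scale, size=1000)

print(waiting_times[:5])
print("sample mean:", waiting_times.mean())
```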
Finally, compute the maximum likelihood solution of your model for the data. You can use your analytical solution and/or gradient ascent to arrive at a solution iteratively. Does the solution make sense to you? In particular, you might compare measures such as the expected value of the distribution with quantities such as the mean or median of the dataset. If you generated the data from a distribution, you can also check whether your derived result matches the actual distribution parameters.
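The following sketch compares an analytical estimate with gradient ascent, again assuming an exponential model with hypothetical parameters. The starting point, learning rate, and number of steps are arbitrary choices that happen to work for this toy setup and may need tuning for other data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
true_rate = 0.1  # hypothetical: mean wait of 10 minutes
data = rng.exponential(scale=1.0 / true_rate, size=2000)

# Analytical maximum likelihood estimate of the exponential rate.
rate_analytic = 1.0 / data.mean()

# Gradient ascent on the average log-likelihood
#   log(rate) - rate * mean(x),  whose gradient is  1 / rate - mean(x).
rate = 0.5  # arbitrary positive starting point
learning_rate = 0.001
for _ in range(5000):
    gradient = 1.0 / rate - data.mean()
    rate += learning_rate * gradient

print("true rate:          ", true_rate)
print("analytical estimate:", rate_analytic)
print("gradient ascent:    ", rate)

# Sanity check: compare the model median, ln(2) / rate, with the data median.
print("model median:", np.log(2) / rate_analytic)
print("data median: ", np.median(data))
```

If the two estimates agree and both lie close to the parameter used for sampling, that supports the derivation; a large gap usually points to a sign error or a step size that is too large.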
Submission: You should include a write-up of your modeling decisions as well
as the derivation of the maximum likelihood solution. Also include any experiments
you conduct with the toy data you created (e.g. finding the optimal parameter values).
You can do the mathematical derivations (also for part 1) on paper
and upload a scan/photograph, typeset them using Markdown or LaTeX, or just
sketch them in a Python notebook.