Assignment 1: Probability Review

Discussion: April 22nd
Deadline: April 21st, 23:59

NOTE the changed deadline! To give Jens a chance to look at the submissions before the exercise, the deadline is now the evening before the exercise.

Testing Illnesses

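Consider a dangerous and/or common illness that people are being tested for to recognize it early (e.g. cancer) and/or prevent its spread (e.g. COVID). The test is either positive or negative. We make the following assumptions (the values are the ones used in the simulation steps below):
    • 1% of the population has the illness.
    • If a sick person is tested, the test returns a positive result 99.9% of the time.
    • If a healthy person is tested, the test still returns a positive result 1% of the time.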

You take part in a study where a random, representative sample of the population is tested for the illness. Your test result is positive. What is the probability that you have the illness?

  1. (Submission) Solve this via simulation.
    1. Take a “population sample” of a specific size (experiment with different sizes!) where every “generated person” has 1% chance of turning out sick.
    2. Test your “people” – if they are sick, the test should have a 99.9% chance of returning a positive result; if they are healthy, it should be 1%.
    3. Out of all people who tested positive, compute the proportion who are actually sick.
  2. (Submission) Solve this via mathematics. This requires a basic grasp of marginal and conditional probabilities as well as Bayes’ theorem. These are fundamental concepts without which you will be lost in this class! The corresponding wiki article should be sufficient.
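
The three simulation steps above could be sketched roughly as follows. This is only one possible layout, not the reference solution: the function name `simulate_study`, its parameter names, and the choice of Python's built-in `random` module are assumptions made for illustration, and the default rates are the ones quoted in the steps.

```python
import random


def simulate_study(n_people, p_sick=0.01, p_pos_given_sick=0.999,
                   p_pos_given_healthy=0.01, seed=None):
    """Estimate P(sick | positive test) from one simulated study."""
    rng = random.Random(seed)
    positives = 0       # everyone whose test came back positive
    true_positives = 0  # positive test AND actually sick
    for _ in range(n_people):
        # Step 1: generate a person who is sick with probability p_sick.
        sick = rng.random() < p_sick
        # Step 2: test them; the positive rate depends on their true status.
        p_positive = p_pos_given_sick if sick else p_pos_given_healthy
        if rng.random() < p_positive:
            positives += 1
            true_positives += sick
    # Step 3: proportion of positive-tested people who are actually sick.
    if positives == 0:
        return float("nan")  # small samples may contain no positive test at all
    return true_positives / positives


if __name__ == "__main__":
    # Hypothetical sample sizes; small ones give noticeably noisier estimates.
    for n in (100, 10_000, 100_000):
        print(n, simulate_study(n, seed=0))
```

Running this for several sample sizes (and several seeds) shows how strongly the estimate fluctuates when only a handful of people test positive, which is why experimenting with the size matters.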

Next (mathematical solution is sufficient, no need for more simulation):

  1. (Submission) Conversely, assume the test result is negative. What is the probability that you have the illness anyway?
  2. (Submission) To bullet-proof the results of their study, the researchers decide to administer two tests to each participant. The second test has the following properties:
    • If a sick person is tested, the test returns a positive result 96% of the time.
    • If a healthy person is tested, the test still returns a positive result 2% of the time.

    As we can see, the second test is much more prone to errors than the first. However, assume that the results of the second test are independent of the first. That is, whether the second test makes an error does not depend on whether there is an error on the first test and vice versa.
    Now, both of your tests come back positive. Given this information, what is the probability that you are indeed sick?
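
As a hedged starting point for the two-test question: if "independent" is read as conditional independence given the true health status (an interpretation, not something stated explicitly above), the joint likelihood of both results factorizes and Bayes' theorem takes the form below. The symbols S (sick), H (healthy) and T_i^+ (test i is positive) are notation introduced here for illustration.

```latex
P(S \mid T_1^+, T_2^+)
  = \frac{P(T_1^+ \mid S)\, P(T_2^+ \mid S)\, P(S)}
         {P(T_1^+ \mid S)\, P(T_2^+ \mid S)\, P(S)
          + P(T_1^+ \mid H)\, P(T_2^+ \mid H)\, P(H)}
```

Equivalently, the posterior after the first test can be used as the prior for the second test.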

More on Conditional Distributions

So far we have mainly looked at conditional distributions of the kind p(x|y), i.e. “probability of data given label” or something similar. There is a different kind, however: Consider high-dimensional data x = (x1, x2, ..., xn). Often we need to ask (and answer): What is p(x1 | x2, ..., xn) (or any other combination)?
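
For reference, such a conditional follows directly from the definition of conditional probability. The form below assumes continuous variables (for discrete ones the integral becomes a sum); the primed variable x_1' is just a dummy integration variable.

```latex
p(x_1 \mid x_2, \dots, x_n)
  = \frac{p(x_1, x_2, \dots, x_n)}{p(x_2, \dots, x_n)},
\qquad
p(x_2, \dots, x_n) = \int p(x_1', x_2, \dots, x_n)\, \mathrm{d}x_1'
```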

Let us then consider a simpler special case: A mixture of Gaussians. You are familiar with the basic multidimensional Gaussian distribution; a mixture of Gaussians is simply a weighted sum of different Gaussians, each with its own mean and covariance matrix.
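
To make the "weighted sum" concrete, here is a minimal sketch of a two-dimensional mixture and the conditional p(x1 | x2) derived from it. Everything specific in it is an assumption for illustration: the weights, means and covariances in `COMPONENTS` are made up, and the densities are written out by hand using only the standard library. In practice the `pdf` function of `scipy.stats.multivariate_normal` could replace `gauss2d`.

```python
import math


def gauss1d(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def gauss2d(x, mean, cov):
    """Density of a 2-D Gaussian; cov is ((a, b), (b, c))."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    d1, d2 = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form d^T cov^{-1} d, written out for the 2x2 case.
    quad = (c * d1 * d1 - 2 * b * d1 * d2 + a * d2 * d2) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))


# Hypothetical example mixture: (weight, mean, covariance) per component.
# The weights must sum to 1 for the mixture to be a valid density.
COMPONENTS = [
    (0.3, (-2.0, 0.0), ((1.0, 0.5), (0.5, 1.0))),
    (0.7, (2.0, 1.0), ((1.5, -0.4), (-0.4, 0.8))),
]


def mixture_pdf(x, components=COMPONENTS):
    """p(x1, x2): weighted sum of the component densities."""
    return sum(w * gauss2d(x, m, c) for w, m, c in components)


def marginal_x2(x2, components=COMPONENTS):
    """p(x2): each component's marginal is 1-D Gaussian (mean m[1], variance c[1][1])."""
    return sum(w * gauss1d(x2, m[1], c[1][1]) for w, m, c in components)


def conditional_x1_given_x2(x1, x2, components=COMPONENTS):
    """p(x1 | x2) = p(x1, x2) / p(x2)."""
    return mixture_pdf((x1, x2), components) / marginal_x2(x2, components)
```

A quick sanity check is to evaluate `conditional_x1_given_x2` on a fine grid of x1 values for a fixed x2 and confirm that the values integrate to approximately 1.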

Note: To compute the actual probabilities of Gaussians, you can of course use software such as scipy. The pdf function of scipy.stats.multivariate_normal should be useful.

Submission note

For the mathematics parts of this assignment, you can hand them in as a separate document (i.e. not a Jupyter notebook). A photograph of an on-paper solution is okay, but it has to be high quality and clearly readable. Better alternatives are a solution hand-written on a tablet or, best of all, a document typeset with a program like LaTeX that can produce mathematical formulas.