6 – Probability, Distributions

Introduction

Probability is how likely an event is to occur, i.e., a prediction about the future. Relatedly, we can ask about likelihood, which is like working backwards: the event has already occurred, and we ask how well a model (its parameters) explains the observed data.
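
The distinction between probability and likelihood can be sketched with a coin-toss example. This is a minimal illustration, not taken from the chapter: the helper `binomial_pmf`, the 10 tosses, the 7 heads, and the candidate values of p are all assumptions chosen for the example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability: the model is fixed (fair coin, p = 0.5); ask about data not yet seen.
prob_7_heads = binomial_pmf(7, 10, 0.5)

# Likelihood: the data are fixed (7 heads in 10 tosses); compare candidate values of p.
candidates = [0.3, 0.5, 0.7, 0.9]
likelihoods = {p: binomial_pmf(7, 10, p) for p in candidates}
best_p = max(likelihoods, key=likelihoods.get)

print(f"P(7 heads | p = 0.5) = {prob_7_heads:.4f}")
for p, value in likelihoods.items():
    print(f"L(p = {p} | 7 heads) = {value:.4f}")
print("Best-supported candidate:", best_p)
```

The same formula is used in both directions; what changes is which quantity is held fixed, the parameter or the data.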

Working with probability allows us to contextualize the inherent uncertainty in any measurement, which in turn allows us to interpret results from hypothesis tests of observational or experimental data sets. An important concept to appreciate is that in many cases, like R.A. Fisher’s Lady tasting tea example, we can count in advance all possible outcomes of an experiment.
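
Counting outcomes in advance can be made concrete with the Lady tasting tea example. The sketch below assumes the classic design of 8 cups, 4 prepared milk-first, with the taster asked to pick the 4 milk-first cups; the cup labels are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

cups = range(8)                # 8 cups in total
milk_first = {0, 1, 2, 3}      # hypothetical labels for the 4 milk-first cups

# Enumerate every way the taster could pick 4 cups as "milk first".
selections = list(combinations(cups, 4))
n_outcomes = len(selections)

# Tally the selections by how many cups were identified correctly.
counts = {k: 0 for k in range(5)}
for picked in selections:
    counts[len(milk_first & set(picked))] += 1

# Under pure guessing every selection is equally likely.
p_all_correct = Fraction(counts[4], n_outcomes)

print("Possible selections:", n_outcomes)
print("Selections by number correct:", counts)
print("P(all 4 correct by guessing):", p_all_correct)
```

Because the sample space is small, every outcome can be listed, and the probability of a perfect score by chance follows directly from the count.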

For many more experiments, we cannot count all possible outcomes in the sample space, either because they are too numerous or because they are simply unknowable. In such cases, applying theoretical probability distributions allows us to circumvent the countability problem. Whereas empirical probability distributions are frequency counts of observations, theoretical probability distributions are based on mathematical formulas.
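
As a rough illustration of the contrast between empirical and theoretical distributions, the sketch below tabulates relative frequencies from a small data set and sets them beside the binomial formula. The `observed` values are hypothetical, made up for this example, and the choice of a binomial model with 4 tosses of a fair coin is an assumption.

```python
from collections import Counter
from math import comb

# Hypothetical data: number of heads seen in each of 20 rounds of 4 coin tosses.
observed = [2, 1, 3, 2, 2, 0, 1, 2, 3, 4, 2, 1, 3, 2, 1, 3, 2, 2, 1, 3]

# Empirical distribution: relative frequency of each observed outcome.
counts = Counter(observed)
empirical = {k: counts[k] / len(observed) for k in range(5)}

# Theoretical distribution: binomial formula with n = 4 tosses and p = 0.5.
n, p = 4, 0.5
theoretical = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(5)}

for k in range(5):
    print(f"{k} heads: empirical {empirical[k]:.3f}, theoretical {theoretical[k]:.4f}")
```

The empirical column depends on the particular sample; the theoretical column depends only on the formula and its parameters.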

At this stage of your training in statistics, you should spend time with the foundations of probability. The likelihood of a single event, regardless of other events, is called marginal probability, whereas the probability of a union refers to at least one of several events occurring. The likelihood of two or more events occurring at the same time is called joint probability. Worked examples are presented in the chapter, although the bulk of this chapter concerns conditional probability, the likelihood of one event occurring given that another event has already happened. Working with conditional probability is essential for risk analysis, which we introduce in Chapter 7 – Probability, Risk Analysis and expand in Chapter 7.3 – Conditional Probability and Evidence Based Medicine and Chapter 7.4 – Epidemiology: Relative risk and absolute risk, explained.
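
The four kinds of probability named above can be computed from a single two-way table. The sketch below is an illustration only: the counts, the category labels, and the helper `prob` are hypothetical and do not come from the chapter.

```python
from fractions import Fraction

# Hypothetical counts for 200 subjects (illustration only).
table = {
    ("exposed", "disease"): 30,
    ("exposed", "healthy"): 70,
    ("unexposed", "disease"): 10,
    ("unexposed", "healthy"): 90,
}
total = sum(table.values())

def prob(condition):
    """Probability that a randomly chosen subject satisfies condition(exposure, status)."""
    matching = sum(n for key, n in table.items() if condition(*key))
    return Fraction(matching, total)

p_disease = prob(lambda e, s: s == "disease")                    # marginal
p_exposed = prob(lambda e, s: e == "exposed")                    # marginal
p_joint = prob(lambda e, s: e == "exposed" and s == "disease")   # joint
p_union = prob(lambda e, s: e == "exposed" or s == "disease")    # union
p_disease_given_exposed = p_joint / p_exposed                    # conditional

print("P(disease) =", p_disease)
print("P(exposed) =", p_exposed)
print("P(exposed and disease) =", p_joint)
print("P(exposed or disease) =", p_union)
print("P(disease | exposed) =", p_disease_given_exposed)
```

Note that the union equals the sum of the two marginal probabilities minus the joint probability, and that the conditional probability rescales the joint probability by the probability of the conditioning event.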

Probability distributions are key to the null hypothesis significance testing framework for statistical inference. Given assumptions about the data, probability distributions are used to evaluate the likelihood of observing data under the null hypothesis, e.g., no difference between means of a control group compared to a treatment group. Much of classical inferential statistics, especially the kind one finds in introductory courses like ours, is built on probability distributions. ANOVA, t-tests, linear regression, etc., are parametric tests and assume errors are distributed according to a particular type of distribution, the normal or Gaussian distribution.
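
How a probability distribution turns a test statistic into a p-value can be sketched as follows. This is a simplified large-sample z approximation rather than any specific test from the chapter, and the group means and standard error are hypothetical numbers chosen for illustration.

```python
from statistics import NormalDist

# Hypothetical summary statistics (illustration only).
mean_control = 10.0
mean_treatment = 12.1
se_difference = 1.0    # assumed standard error of the difference in means

# Test statistic: the observed difference in units of its standard error.
z = (mean_treatment - mean_control) / se_difference

# Under the null hypothesis of no difference, z is treated as standard normal.
null_distribution = NormalDist(mu=0, sigma=1)

# Two-sided p-value: probability of a result at least this extreme under the null.
p_two_sided = 2 * (1 - null_distribution.cdf(abs(z)))

print(f"z = {z:.2f}, two-sided p = {p_two_sided:.4f}")
```

The p-value is only as trustworthy as the assumption that the null distribution describes the test statistic, which is why the distributional assumptions of parametric tests matter.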

For a discrete random variable, a probability distribution is a list of probabilities for each possible outcome in an entire population. For continuous random variables, probability density functions are used instead. Depending on the data type, there are many classes of probability distributions. This chapter begins with the basics of probability, then gently introduces discrete and continuous probability distributions. In the other sections of this chapter we describe several probability density functions. Emphasis is placed on the normal distribution, which underlies most parametric statistics.
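
The difference between a list of probabilities for a discrete variable and a density for a continuous variable can be sketched with Python's standard library. The binomial parameters and the two normal distributions below are assumptions chosen only for illustration.

```python
from math import comb
from statistics import NormalDist

# Discrete: a probability is listed for each possible outcome.
n, p = 4, 0.5
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
total = sum(pmf.values())              # probabilities over all outcomes sum to 1

# Continuous: probabilities come from areas under the density curve.
z = NormalDist(mu=0, sigma=1)
p_within_1sd = z.cdf(1) - z.cdf(-1)    # P(-1 < Z < 1), about 0.68

# A density value is a height, not a probability, so it can exceed 1.
narrow = NormalDist(mu=0, sigma=0.1)
height_at_mean = narrow.pdf(0)

print("Sum of discrete probabilities:", total)
print(f"P(-1 < Z < 1) = {p_within_1sd:.4f}")
print(f"Density at the mean when sigma = 0.1: {height_at_mean:.3f}")
```

For a continuous variable the probability of any single exact value is zero, so questions are posed about intervals, and answers come from the cumulative distribution function.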

Homework to go with this topic

Homework 3: Distributions & Probability in Mike’s Workbook for Biostatistics.

Quizzes in this chapter

A total of 88 questions among the several subchapters, in a mix of true-or-false and multiple-choice formats.


Chapter 6 contents