2.1 – Why (Bio)Statistics?

What are the basic elements of biostatistics?

What will you learn from an introductory course on (bio)statistics? Both concepts and skills. The concepts are the purpose and limits of descriptive statistics, inferential statistics, and statistical modeling for answering questions in biology; the skills are the elements of R programming, applied to biology-related data sets, that accomplish these tasks (see Table 1). Relatedly, you will learn how to collect and manage data from projects (skills), and why statistical analysis can help answer questions and even inform decisions in public health (concepts). Accordingly, assessment of student learning in BI-311 includes both formative and summative assignments and exams, and culminates in a student research project.

Note 1: From time to time in this book I write “(bio)statistics” instead of “biostatistics.” It’s my subtle way of highlighting that whatever argument or statement is attached applies more generally to the statistical discipline, not just biostatistics. However, I’m not particularly consistent with this typographic convention in the book.

What skills are you going to get from all of this? You will get different opinions on the elements of an essential first course in statistics or biostatistics. Certainly the basics are a foundation in probability and a breadth of classical elementary statistical procedures, which will include descriptive statistics, analysis of variance and linear regression, and an introduction to multivariate analysis. In preparation for your course in epidemiology you will also be introduced to risk analysis and survival analysis. However, the primary return for your time, I hope, will be a deeper appreciation for how to think about problems in biology from experimental design and data analysis perspectives. Practical skills you will learn include how to process and clean data for analysis, data visualization, and a foundation in parametric and nonparametric statistical methods.

The data sets described and provided in this course are small, with sample sizes from the tens to, rarely, the thousands. This framework is consistent with the development of the discipline of (bio)statistics, which originated to help interpret experiments conducted in agriculture, industry, and medical research (see Chapter 2.3). But, as you now know, because of genomics and other data-intensive disciplines, data sets can be very large, far beyond the kinds of data sets analyzed before computers were used in data analysis. However, as we will argue later on this page, learning foundational skills in (bio)statistics with small data sets facilitates working with and gaining insights from larger data sets. For example, we will emphasize during our work this semester how to develop a workflow with small data sets; the same workflow can then be scaled up to large data sets by adding pipelines, based on tools and techniques developed in the broader discipline of bioinformatics, including use of artificial intelligence and machine learning.

Note 2: In data science, a workflow refers to the entire, end-to-end process needed in a project, from data collection and data processing to descriptive statistics, modeling or inferential statistics, and report writing. A pipeline describes the elements of a workflow that can be automated.

Why do we require you to take (bio)statistics as part of your major?

At Chaminade University we require all biology students to take biostatistics, and we do so with an emphasis on developing data analysis skills. This requirement aligns our program with national expectations for undergraduate biology education (eg, AAAS, NAS, NIH, NSF). As stated in Bio2010: Transforming Undergraduate Education for Future Research Biologists,

“Biology majors should be adept at using computers to acquire and process data, carry out statistical characterization of the data and perform statistical tests, and graphically display data in a variety of representations (p. 15).”

Learning biostatistics from a course like BI311 — which relies heavily on use of the R programming language and data sets — helps the biology student develop these skills.

In the next pages I will outline a history of statistics (Chapter 2.3), but here I wish to make the point that biostatistics is now considered to be a core skill set for biologists. Biostatistics as a discipline came into its own in the 1930s, but extensive reliance on statistics in research really dates to more recent times because of the ubiquity of personal computers (Salsburg 2002). Modern biological and biomedical research requires computational and quantitative methods to collect, process, analyze, and interpret large data sets. And yet, even a casual survey of required courses for entry into graduate programs in biology will reveal that biostatistics is not required of candidates (eg, it is only recommended for admission to the Marine Biology and Cell and Molecular Biology programs at University of Hawaiʻi – Manoa); so, what gives?

The first point is that programs list only minimum requirements. The second point is that many programs (genomics, ecology, etc) will expect the graduate student to take a year or more of (bio)statistics. The need is so crucial that at Harvard Medical School, all biology graduate students are expected to take a crash course in computing and statistics to work with data (Stefan et al 2015).

Moreover, while graduate programs may not list statistics as a requirement, many undergraduate biology curricula now require a course in (bio)statistics to reflect how data-driven modern biology has become, and that is where the jobs are!

My BI-311 students, I’ll make you a bet, or at least I’ll make this part of your required homework (see BI311 Workbook)! Even a casual search of research journal articles in a biology discipline of your choosing will show that there is no doing biology research today without an understanding of statistics.

But, I’m pre-med and plan to apply to medical school …

Even a cursory look at the literature turns up many authors who strongly call for this kind of preparation for a successful career in medicine (eg, Brieger and Hardin 2012). It’s obvious, but needs stating: you’re applying to medical school to become a doctor, and you’ll spend the majority of your adult life as a doctor. Statistical thinking is crucial to answering the daily question: “My patient tested positive for biomarker X, what’s the chance that the patient has disease Y?” If your answer today is “the patient has the disease,” then you definitely need this course! Hint: a test has four possible outcomes, not two; see Chapter 7.3 – Conditional Probability and Evidence Based Medicine.
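
The four outcomes mentioned in the hint can be made concrete with a short calculation. The R sketch below is an illustration only: the prevalence, sensitivity, and specificity values are hypothetical, chosen as round numbers, and are not taken from any real biomarker or from Chapter 7.3.

```r
# Hypothetical values, for illustration only
prevalence  <- 0.01   # P(disease Y) in the tested population
sensitivity <- 0.90   # P(test positive | disease)
specificity <- 0.95   # P(test negative | no disease)

# The four possible outcomes of a test, as joint probabilities
true_pos  <- sensitivity * prevalence              # positive test, has disease
false_neg <- (1 - sensitivity) * prevalence        # negative test, has disease
false_pos <- (1 - specificity) * (1 - prevalence)  # positive test, no disease
true_neg  <- specificity * (1 - prevalence)        # negative test, no disease

# Positive predictive value: P(disease | test positive)
ppv <- true_pos / (true_pos + false_pos)
print(round(ppv, 3))
```

With these assumed numbers, only about 15% of patients who test positive actually have the disease, because false positives from the much larger healthy group outnumber the true positives. Change the prevalence and rerun the code to see how strongly the answer depends on it.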

Need more convincing? Take a look at the targets of questions intended to evaluate Skill 4 of the Scientific Inquiry and Reasoning Skills standard of the revised MCAT2015 Exam (p. 107, What’s on the MCAT2015 exam?).

  • Using, analyzing, and interpreting data in figures, graphs, and tables.
  • Evaluating whether representations make sense for particular scientific observations and data.
  • Using measures of central tendency (mean, median, and mode) and measures of dispersion (range, interquartile range, and standard deviation) to describe data.
  • Reasoning about random and systematic error.
  • Reasoning about statistical significance and uncertainty (eg, interpreting statistical significance levels, interpreting a confidence interval).
  • Using data to explain relationships between variables or make predictions.
  • Using data to answer research questions and draw conclusions.
  • Identifying conclusions that are supported by research results.
  • Determining the implications of results for real-world situations.
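
Several of these targets map directly onto a few lines of R. As a minimal sketch, assuming a small made-up sample (the values in x are hypothetical and do not come from a course data set), the measures of central tendency and dispersion named in the list above can be computed like this:

```r
# Hypothetical measurements (eg, body mass in grams); not from a course data set
x <- c(12, 15, 15, 18, 20, 22, 30)

# Measures of central tendency
mean_x   <- mean(x)     # arithmetic mean
median_x <- median(x)   # middle value of the sorted data
# Base R has no built-in statistical mode; tabulate and take the most frequent value
mode_x   <- as.numeric(names(which.max(table(x))))

# Measures of dispersion
range_x <- diff(range(x))  # range as a single number (max - min)
iqr_x   <- IQR(x)          # interquartile range (R's default quantile method)
sd_x    <- sd(x)           # sample standard deviation (n - 1 denominator)

print(c(mean = mean_x, median = median_x, mode = mode_x,
        range = range_x, IQR = iqr_x, sd = sd_x))
```

Note that the base R function mode() reports how an object is stored, not its most frequent value, which is why the mode is obtained here from a frequency table. If several values tie for most frequent, this sketch returns only the smallest of them.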

Before we move on, there’s another justification for learning biostatistics we need to discuss.

With machine learning and AI, why learn biostatistics?

Oh, my goodness. If I were starting over, I would definitely want to pursue expertise in machine learning, ML (Figure 2). Machine learning is about finding patterns in large data sets, and modern biological research may be characterized by the impact of large data sets. The machine learning field, and artificial intelligence, AI, in general, are rapidly evolving, but they are built on more than 20 years of discoveries in algorithms and computing. So why not skip biostatistics and go directly to data science?

[Image: xkcd comic “Machine Learning.” A character stands beside a large pile and explains that you pour the data into this big pile of linear algebra, then collect the answers on the other side. Asked what happens if the answers are wrong, the character replies, “Just stir the pile until they start looking right.”]

Figure 2. “Machine learning,” https://xkcd.com/1838/.

The parable presented in Figure 2 is about right: without understanding the statistical foundations behind machine learning algorithms, you’re using tools without really knowing how they work (Lin et al 2025; more generally, see Downey 2024). We have a lot of work ahead of us, but, from time to time, we will exploit the power of AI and ML. For a start, I asked ChatGPT (August 2025), “What statistics is most relevant for ML?” The topics it listed, mapped to chapters of “Mike’s Biostatistics Book,” are provided in Table 1.

Table 1. AI-recommended core statistical topics matched to Mike’s Biostatistics Book

ChatGPT recommended core statistics topics    Mike's Biostatistics Book
Descriptive statistics                        Chapter 3
Probability theory                            Chapter 6 - 7
Common probability distributions              Chapter 6
Statistical inference                         Chapter 8
Estimation and Likelihood                     Chapter 3, 6, 17 - 20
Regression analysis                           Chapter 17 - 18
Analysis of Variance (ANOVA)                  Chapter 12 - 14
Resampling methods                            Chapter 19
Dimensionality reduction                      Chapter 16, 20
Time series and autocorrelation               Chapter 17 - 18, 20
Optional: Bayesian statistics                 Chapter 8 - 9, 18, 20

Biostatistics is embedded in bioinformatics and in data science generally. Thus, as implied by Figure 2, a critical step in working with data sets, whether small or large, is to ensure the inputs are valid. This holds across the data science disciplines. Data processing, data cleaning and, more generally, data wrangling are included in our biostatistics course.

Note 3: For larger data sets in the course we will introduce you to rattle, an R package that helps with data processing and data mining, with access to simple hypothesis tests (eg, do groups differ?) and pattern detection (eg, correlations and dendrograms). For their required project, students in BI-311 may elect to work with the R package targets to build automated data pipelines (eg, calculating growth rates for tomato germination experiments or for the essential genes project in Genomics class, BI-308).

In what disciplines are biostatisticians employed?

One way to begin this discussion is to think about where statisticians work. The job market includes:

Health Science.

  • Drug design, causes of diseases (many “causes” of cancers).
  • Health Professional (nurses, physical therapists).
  • Type of care and recovery period (importance of a person’s mood on health).
  • Exercise regime and recovery from injury.
  • Nutrition: vitamins and health; diet and health.

Ecology & Evolution.

  • Causes of changes in population sizes (conservation biology).
  • Effects of pollution on organisms and ecosystems.
  • Evolution of traits in populations over time.
  • Global environmental changes and changes in population sizes or species diversity.

Genetics & Molecular Biology.

  • Identifying genes that influence traits (eg, breast cancer, cystic fibrosis).
  • Nature vs. nurture (heredity and environment effects on phenotypes).
  • Multiple sequence alignment in comparative genomics.

Agriculture.

  • Fertilizer effects on plant growth and productivity.
  • Compare farming and harvesting methods (eg, organic vs conventional farming).
  • Compare plant hybrids for differences in productivity.

Here’s a web site that keeps track of statistics jobs: Biostatistics. I would go on to add that experience and competence in statistics would also translate to employment in non-biology fields, eg, business analytics.

Conclusions.

Moving forward, we have much to do: you will be exposed to many specific examples of statistical tests, how to calculate estimators, and how to make inferences from experiments. An important goal of this course is for you to be introduced to, and develop your ability in, the design of experiments. Why should you, as biologists and future health care providers, learn biostatistics?

  1. Develop statistical reasoning skills.
    • Statements about research findings, new and better products, sociological and political issues often depend in large part on some form of statistical analysis.
    • By learning a little about experimental design, sampling, and statistical testing, you will be much closer to being able to participate fully in these debates.
  2. Most, if not all, graduate students will need to take several courses in statistics.
  3. Most, if not all, jobs in biology require some training in statistics.

So, there’s really no doing biology without at least some knowledge of statistics. You’re getting a head start!

Questions

  1. Write up three learning outcomes for this page. Hint: Point your favorite generative AI to this page and ask for help.
  2. Explore current “biostatistics,” “bioinformatics,” and “data science” job prospects at Indeed.com or other recruiting sites. Search your area/city and also try a wider search.
  3. Given the rise of AI, now is a great time to explore current thinking about the future of, and need for, biostatistics skills. Write a one-paragraph perspective, your “worldview,” about the pros and cons of developing statistics and data science skill sets.

 
