20.14 – Binary classification
draft
Introduction
For much of this textbook we emphasized the search for causal explanations, the domain of applied (bio)statistics. Machine learning, in contrast, refers to any system that churns data into predictions, without necessarily explaining how the collection of predictors affects the outcome (Li and Tong 2020).
Classification in science is the act of grouping observations based on common features. In data science, classification procedures are used to predict which category a data point belongs to by finding the lines or planes — the decision boundary — that best divide the different classes.
Prediction or decision – use data to build, or “train,” a model that is then used to make future predictions or decisions given new data. Technically, prediction and decision are different concepts: prediction refers to what might happen given new information, while decision is about what we might do given new information. We discussed prediction as a goal of regression models in some detail in Chapter 17 and Chapter 18.
Linear discriminant analysis (LDA) is a well-known classification algorithm, whose roots trace back to R.A. Fisher. Linear discriminant analysis finds a linear combination of features that characterizes or separates two or more classes of objects or events. It builds on the idea of multiple linear regression: in multiple linear regression we predict a continuous outcome variable from a set of predictor variables; in discriminant analysis, we predict discrete outcomes, two or more mutually exclusive groups, from a set of predictor variables.
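A minimal sketch of LDA in R, using `lda()` from the MASS package and a two-class subset of the built-in `iris` data (the two-class subset and variable choices here are illustrative, not from the text):

```r
library(MASS)  # provides lda()

# Two-class subset of iris: versicolor vs. virginica
iris2 <- droplevels(subset(iris, Species != "setosa"))

# Fit LDA: find the linear combination of the four
# measurements that best separates the two species
fit <- lda(Species ~ ., data = iris2)

# Predicted class for each observation
pred <- predict(fit)$class

# Confusion matrix: predicted vs. actual class
table(Predicted = pred, Actual = iris2$Species)
```

The fitted object also contains the discriminant function coefficients (`fit$scaling`), the linear combination of features described above.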
Logistic regression (LR) is another common classification algorithm. We previously introduced LR as a statistical method for modeling the dependence of a binomial outcome variable on one or more categorical or continuous predictor variables (see Chapter 18.3 – Logistic regression). LR shares many features with linear discriminant analysis, but while LDA models the distribution of the data and assumes normality for the predictor variables, LR directly models the probability of the outcome, making fewer assumptions about the distribution of the data. LDA returns a discriminant function, a linear combination of the features, while LR returns a probability between 0 and 1 via the logistic function, estimating the probability of the data point belonging to a specific class.
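For comparison, a sketch of logistic regression with base R's `glm()` on the same two-class problem; the predictor choice and 0.5 cutoff are illustrative assumptions:

```r
# Two-class subset of iris: versicolor vs. virginica
iris2 <- droplevels(subset(iris, Species != "setosa"))
iris2$y <- as.integer(iris2$Species == "virginica")

# Logistic regression: model the probability of the outcome directly
fit <- glm(y ~ Petal.Length + Petal.Width, data = iris2,
           family = binomial)

# Fitted probabilities (between 0 and 1 via the logistic function),
# converted to class labels at a 0.5 cutoff
p <- predict(fit, type = "response")
pred <- as.integer(p > 0.5)
table(Predicted = pred, Actual = iris2$y)
```

Unlike LDA, no distributional assumption is made about the predictors; the model returns a probability of class membership rather than a discriminant score.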
Both LDA and LR return a single decision model — not the same classification model, of course. The downside of using a single model to predict from new data is that the assumptions used to build the model may not hold for the new data. Enter ensemble methods like random forest. Random forest is a supervised, ensemble learning method — it combines the predictions of multiple models to get a better result than any single model could achieve alone. The idea is that a “forest” of decision trees is generated, each tree trained on a different subset of the data, and the trees then contribute towards a classification choice based on the consensus of those many decision trees. By “training the model” in machine learning we mean the process by which an algorithm learns to recognize patterns in data in order to make predictions on new, unseen data. Model parameters are learned from the data during training; hyperparameters are settings that control the model training process.
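A sketch of a random forest fit with the ranger package, again on the two-class iris subset (the number of trees and the seed are arbitrary choices for illustration):

```r
library(ranger)  # fast random forest implementation

# Two-class subset of iris: versicolor vs. virginica
iris2 <- droplevels(subset(iris, Species != "setosa"))

set.seed(20)
# Grow a forest of 500 trees, each trained on a bootstrap
# sample of the data; classification is by majority vote
fit <- ranger(Species ~ ., data = iris2, num.trees = 500)

fit$confusion.matrix  # out-of-bag confusion matrix
fit$prediction.error  # out-of-bag misclassification rate
```

Because each tree is trained on a bootstrap sample, the observations left out of a given tree (the “out-of-bag” data) provide a built-in estimate of prediction error on unseen data.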
Note 1: Supervised learning implies we use “labeled” data, where we already know to which group outcomes belong. Unsupervised learning uses unlabeled data to find hidden patterns or structures within the data.
R packages
caret — a “comprehensive package” for training and tuning predictive models. “Tuning” refers to optimizing the performance of a model by adjusting its hyperparameters.
ranger — for building random forest classification and regression models.
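The two packages work together: caret's `train()` can tune the hyperparameters of a ranger forest by cross-validation. A minimal sketch, assuming both packages are installed (the tuning grid values here are illustrative):

```r
library(caret)   # model training and tuning framework
library(ranger)  # random forest engine used by caret below

# Two-class subset of iris: versicolor vs. virginica
iris2 <- droplevels(subset(iris, Species != "setosa"))

set.seed(20)
# 5-fold cross-validation to compare hyperparameter settings
ctrl <- trainControl(method = "cv", number = 5)

# caret's "ranger" method tunes mtry (predictors tried per split),
# splitrule, and min.node.size
fit <- train(Species ~ ., data = iris2, method = "ranger",
             trControl = ctrl,
             tuneGrid = expand.grid(mtry          = 1:4,
                                    splitrule     = "gini",
                                    min.node.size = c(1, 5)))

fit$bestTune  # hyperparameter combination with best CV accuracy
```

Tuning in this sense is a search over hyperparameter settings, with model performance estimated by resampling rather than on the training data itself.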
Questions
pending
Quiz
pending
References and suggested readings
Blei, D. M., & Smyth, P. (2017). Science and data science. Proceedings of the National Academy of Sciences, 114(33), 8689–8692.
Boehmke, B., & Greenwell, B. N. (2019). Hands-On Machine Learning with R. Chapman and Hall/CRC. https://bradleyboehmke.github.io/HOML/
Bzdok, D., Altman, N., & Krzywinski, M. (2018). Statistics versus machine learning. Nature Methods, 15(4), 233–234.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.
Li, J. J., & Tong, X. (2020). Statistical hypothesis testing versus machine learning binary classification: Distinctions and guidelines. Patterns, 1(7).
Lin, X., Cai, T., Donoho, D., Fu, H., Ke, T., Jin, J., Meng, X.-L., Qu, A., Shi, C., Song, P., Sun, Q., Wang, W., Wu, H., Yu, B., Zhang, H., Zheng, T., Zhou, H., Zhou, J., Zhu, H., & Zhu, J. (2025). Statistics and AI: A Fireside Conversation. Harvard Data Science Review, 7(2).
Chapter 20 contents
- Additional topics
- Area under the curve
- Peak detection
- Baseline correction
- Surveys
- Time series
- Dimensional analysis
- Estimating population size
- Diversity indexes
- Survival analysis
- Growth equations and dose response calculations
- Plot a Newick tree
- Phylogenetically independent contrasts
- How to get the distances from a distance tree
- Binary classification
- Meta-analysis