3 – Exploring data
Describing data: An introduction to summary statistics and basic visualization.
Scientists answer questions about natural phenomena using the scientific method. These questions, framed as hypotheses, are tested against data: the raw information collected either by observing natural outcomes among groups (observational studies) or by manipulating variables and recording the outcomes (experimental studies).
Note 1: We discussed the scientific method in Chapter 2.5. The distinction between experimental and observational studies was discussed in Chapter 2.4.
Regardless of study type, once a study has been completed the collection of observations needs to be summarized or described, a process now referred to as data exploration. Interpretation of the data constitutes the evidence used to evaluate the hypothesis. How many mice grew tumors in the treated versus control group? Did all mice respond to treatment? What is the typical cost of a new home in Hilo, HI? How large a difference is there between the most expensive home and the middle? These questions require simple summary statistics, also called descriptive statistics, and informative data visualizations. Both are required of any data analysis report to communicate the findings of the experiment or observational study (students: see Making a report in Mike’s Workbook for Biostatistics).
Note 2: Throughout Chapter 3, we present the use of R Commander and R functions to accomplish our data exploration objectives. It should come as no surprise that folks have provided many solutions to improve the workflow of data exploration. rattle (R Analytic Tool To Learn Easily) is a great “data-mining” package. rattle (version 6.5.8 as of August 2025) provides a graphical user interface, which makes it straightforward to work with. It doesn’t work well with Rcmdr, but can be used along with RStudio. rattle is an appropriate option for accomplishing tasks needed for Chapter 3, Chapter 13, and Chapter 16.
BI-311: Installing rattle is optional. Because of rattle’s programming history, installation of rattle is a little different from other R packages we have described thus far. Follow instructions for your operating system at https://rattle.togaware.com/. On WinPC, download and install rattle-dev-windows-inno.exe. If you followed my R installation instructions you may need to edit the PATH environment variable on your WinPC so that R will be found by rattle. On macOS, download and extract the archived (zip) file, rattle-dev-macos.zip, and run the app (rattle.app). You will need to tell Gatekeeper to allow rattle.app to run: go to Settings > Security & Privacy, and find the request to permit running the app.
In Chapter 3 we introduce two aspects of summary statistics expected in any report about a data set (see Chapter 4 – How to report statistics). Summary statistics provide a brief overview of relevant characteristics of the observations, giving the reader a quick but meaningful look at the data set. Data set refers to a collection of data, usually a collection of related, ordered observations, measurements, and associated information.
Summary statistics vary according to the needs of the reporting vehicle, but generally address two characteristics of the data set:
- Central tendency or the description of the middle of the data. See Chapter 3.2 – Measures of Central Tendency.
- Data dispersion, the variability around the middle of the data set. See Chapter 3.3 – Measures of dispersion.
The remaining two sections of Chapter 3 turn from description to estimation. Chapter 3.4 – Estimating parameters introduces estimation, the role random sampling plays, and the kinds of error in statistics.
The final section, 3.5 – Statistics of error, shows how estimates of error are incorporated into calculations and used to report confidence, in the form of confidence intervals, in the estimates we make.
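Returning to the two characteristics named above, central tendency and dispersion, here is a minimal sketch in base R of how each might be summarized. The home prices are invented for illustration (they are not real Hilo data), and the object name price is hypothetical; only standard base R functions are used.

```r
# Hypothetical data: sale prices (thousands of US dollars) for seven homes.
# These values are invented for illustration only.
price <- c(310, 355, 372, 398, 420, 465, 1250)

# Central tendency: where is the middle of the data?
mean(price)     # arithmetic mean, pulled upward by the one expensive home
median(price)   # middle value once the prices are sorted

# Dispersion: how much do the observations vary around the middle?
range(price)    # smallest and largest values
sd(price)       # sample standard deviation
IQR(price)      # interquartile range

# Difference between the most expensive home and the middle (median) home
max(price) - median(price)
```

With one very expensive home in the sample, the mean and the median disagree noticeably, which is one reason a report usually gives more than a single summary number.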
Before we can talk about descriptive statistics we need to introduce the concept of data types, which we do in Chapter 3.1 — Data types. Understanding data types is crucial for accurately calculating and interpreting measures of central tendency and dispersion, which are essential for effective data visualization.
We introduce statistical graphics, a specialized kind of data visualization, in Chapter 4 – How to report statistics. Statistical graphs are used to describe data sets, but also to communicate statistical inference, which we address formally beginning in Chapter 7 – Probability, Risk Analysis.
Formulas — a reminder.
Most of you have been asked at some point to calculate the average or the sample standard deviation. We will provide these again, along with additional statistics. Textbooks may present calculator formulas; there is nothing wrong with them, although we have to watch significant figures. Computer statistical packages, however, generally do not calculate with these formulations; their algorithms are typically much more involved. So why present formulas throughout the book as I do? Formulas often are the best definition of the statistical concept. I haven’t investigated the following point in any systematic way, but because scientists have had access to increasingly powerful computers since the 1980s, I doubt anyone in the business of data analysis in the biological sciences has much use for the hand calculator or the formulas except as definitions.
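As a reminder of what a definitional formula looks like, here are the conventional forms of the sample mean and the sample standard deviation for $n$ observations $X_1, \dots, X_n$. The symbols are the commonly used ones and are an assumption here; the notation in later sections may differ.

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad s = \sqrt{\frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}{n-1}}$$

The so-called calculator formula for the standard deviation is algebraically the same quantity, rearranged so that only running totals of $X_i$ and $X_i^2$ are needed:

$$s = \sqrt{\frac{\sum_{i=1}^{n} X_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)^2}{n-1}}$$

The first form states the concept directly, as the typical squared distance from the mean. The second is convenient by hand, but subtracting two large, nearly equal totals can lose precision, which is one reason software tends to use other algorithms.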
Note 3: Chaminade University students: On homework, quizzes and exams, you may be asked to calculate these descriptive statistics. The data sets will always be simple ones, simple enough that calculators should not be needed.
Homework to go with this topic.
Mike’s Workbook for Biostatistics:
Homework 2A: Descriptive statistics
Quizzes in this chapter
A total of 56 questions across the subchapters, a mix of true/false and multiple-choice formats.
Chapter 3 contents