3 – Exploring data
Describing data: An introduction to summary statistics and basic visualization
Scientists answer questions about natural phenomena using the scientific method. The questions, framed as hypotheses, are tested with data: the raw information collected either by observing natural outcomes among groups (observational studies) or by manipulating variables and recording the outcomes (experimental studies). In either case, after a study has been completed, the collection of observations needs to be summarized or described, a process now referred to as data exploration. Interpretation of the data constitutes the evidence used to evaluate the hypothesis. How many mice grew tumors in the treated versus control group? Did all mice respond to treatment? What is the typical cost of a new home in Hilo, HI? How large a difference is there between the most expensive home and the middle? Answering these questions requires simple summary statistics, also called descriptive statistics, and informative data visualizations. Both are required of any data analysis report, to communicate the findings of the experiment or observational study (students: see Making a report in Mike’s Workbook for Biostatistics).
Note 1. We discussed the scientific method in Chapter 2.5. The distinction between experimental and observational studies was discussed in Chapter 2.4.
In Chapter 3 we introduce two aspects of summary statistics expected in any report of a data set. Summary statistics provide a brief overview of relevant characteristics of the observations in the data set, giving the reader a quick but meaningful look at the data. Data set refers to a collection of data, usually a collection of related, ordered observations, measurements, and related information.
Summary statistics vary according to the needs of the reporting vehicle, but generally address two characteristics of the data set:
- Central tendency, the description of the middle of the data. See Chapter 3.2 – Measures of Central Tendency.
- Data dispersion, the variability around the middle of the data set. See Chapter 3.3 – Measures of dispersion.
The other two sections in Chapter 3 turn to estimation. Chapter 3.4 – Estimating parameters introduces estimation, the role random sampling plays, and the kinds of error in statistics.
The final section, Chapter 3.5 – Statistics of error, introduces how estimates of error are incorporated into calculations and used to report confidence, in the form of confidence intervals, in the estimates we make.
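As a rough illustration of the two characteristics named above, the sketch below computes a few measures of central tendency and dispersion with Python's standard `statistics` module. The home prices are hypothetical values invented for this example (they are not real Hilo data), and the variable names are illustrative only.

```python
import statistics

# Hypothetical sale prices, in thousands of dollars.
# These values are invented for illustration; they are not real data.
prices = [310, 345, 360, 385, 420, 455, 980]

# Central tendency: two descriptions of the "middle" of the data.
mean_price = statistics.mean(prices)
median_price = statistics.median(prices)

# Dispersion: how spread out the observations are around the middle.
price_range = max(prices) - min(prices)
sd_price = statistics.stdev(prices)  # sample standard deviation (n - 1 denominator)

# Difference between the most expensive home and the middle (median) home.
gap_to_median = max(prices) - median_price

print("mean:", mean_price)
print("median:", median_price)
print("range:", price_range)
print("sample standard deviation:", round(sd_price, 1))
print("most expensive minus median:", gap_to_median)
```

With one unusually expensive home in the list, the mean sits well above the median, which is one reason a report usually gives more than a single summary number.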
Before we can talk about descriptive statistics we need to introduce the concept of data types, which we do in Chapter 3.1 — Data types. Understanding data types is crucial for accurately calculating and interpreting measures of central tendency and dispersion, which are essential for effective data visualization.
We introduce statistical graphics, a specialized kind of data visualization, in Chapter 4 – How to report statistics. Statistical graphs are used to describe data sets, but also to communicate statistical inference, which we address formally beginning in Chapter 7 – Probability, Risk Analysis.
Most of you have been asked at some point to calculate the average or the standard deviation. We present these again, along with additional statistics. Textbooks may present calculator formulas; there is nothing wrong with them, although we have to watch significant figures. Computer statistical packages, however, generally do not calculate with these formulations; their algorithms are typically much more involved. So why present formulas throughout the book as I do? The formulas are often the best definition of the statistical concept. I haven’t investigated this last point in any systematic way, but because scientists have had access to increasingly powerful computers since the 1980s, I doubt anyone in the business of data analysis in the biological sciences has much use for the hand calculator or the formulas, except as definitions.
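To make the contrast between a definitional formula and a calculator formula concrete, here is a minimal sketch in Python for the sample variance. The function names and the small data set are hypothetical, chosen for this example. The sketch does not show how any particular statistical package computes variance; it only illustrates one commonly cited reason the calculator shortcut is avoided in software, namely loss of precision in floating-point arithmetic.

```python
import statistics

def variance_definitional(xs):
    """Sample variance from its definition: squared deviations from the mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def variance_calculator(xs):
    """Sample variance from the calculator shortcut: sum of squares minus (sum)^2 / n."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

# Hypothetical measurements, invented for illustration.
data = [4.0, 7.0, 13.0, 16.0]

# On small, well-scaled numbers the two formulas agree.
print(variance_definitional(data))   # 30.0
print(variance_calculator(data))     # 30.0
print(statistics.variance(data))     # 30.0

# Adding a large constant to every value should not change the variance,
# but the shortcut subtracts two huge, nearly equal numbers and loses precision.
shifted = [x + 1e9 for x in data]
print(variance_definitional(shifted))
print(variance_calculator(shifted))
```

Both functions implement the same statistical definition, so on paper they are interchangeable; the difference only appears in how the arithmetic behaves on a computer.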
Note 2. Chaminade University students: On homework, quizzes and exams, you may be asked to calculate these descriptive statistics. The data sets will always be simple ones, simple enough that calculators should not be needed.
Chapter 3 contents