4.10 – Graph software

Introduction

You may already have experience with use of spreadsheet programs to create bar charts and scatter plots. Microsoft Office Excel, Google Sheets, Numbers for Mac, and LibreOffice Calc are good at these kinds of graphs — although arguably, even the finished graphics from these products are not suitable for most journal publications. 

For bar charts, pie charts, and scatter plots, if a spreadsheet app is your preference, go for it, at least for your statistics class. This choice will work for you, at least it will meet the minimum requirements asked of you.

However, you will find spreadsheet apps are typically inadequate for generating the kinds of graphics one would use in even routine statistical analyses (e.g., box plots, dot plots, histograms, scatter plots with trend lines and confidence intervals, etc.). And, without considerable effort, most of the interesting graphics (e.g., box plots, heat maps, mosaic plots, ternary plots, violin plots), are impossible to make with spreadsheet programs.

At this point, you can probably discern that, while I’m not a fan of spreadsheet graphics, I’m also not a purist — you’ll find spreadsheet graphics scattered throughout Mike’s Biostatistics Book. Beyond my personal bias, I can make the positive case for switching from spreadsheet app to R for graphics is that the learning curve for making good graphs with Excel and other spreadsheet apps is as steep as learning how to make graphs in R (see Why do we use R Software?). In fact, for the common graphs, R and graphics packages like lattice, ggplot2, make it easier to create publishing-quality graphics.

Alternatives to base R plot

This is a good point to discuss your choice of graphic software — I will show you how to generate simple graphs in R and R Commander which primarily rely on plotting functions available in the base R package. These will do for most of the homework. R provides many ways to produce elegant, publication-quality graphs. However, because of its power, R graphics requires lots of process iterations in order to get the graph just right. Thus, while R is our software of choice, other apps may be worth looking at for special graphics work. 

My list emphasizes open source and or free software available both on Windows and macOS personal computers. Data set used for comparison from Veusz (Table).

Bees Butterflies
15 13
18 4
16 5
17 7
14 2
14 16
13 18
15 14
14 7
14 19

1. GrapheR — R package that provides a basic GUI (Fig. 1) that relies on Tcl/Tk — like R Commander — that helps you generate good scatter plots, histograms, and bar charts.  Box plot with confidence intervals of medians (Fig. 2).

Screenshot GrapheR menu

Figure 1. Screenshot of GrapheR GUI menu, box plot options

GrapheR box plot

Figure 2. Box plot made with GrapheR.

2. RcmdrPlugin.KMggplot2 — a plugin for R Commander that provides extensive graph manipulation via the ggplot2 package, part of the Tidyverse environment (Fig. 3). Box plot with data point, jitter (Fig. 4)

Screenshot KMggplot2 menu

Figure 3. Screenshot of KMggplot2 GUI menu, box plot options

KMggplot2 boxplot

Figure 4. Box plot graph made with GrapheR with jitter applied to avoid overplotting of points.

Note: If data points have the same value, overplotting will result — the two points will be represented as a single point on the plot. The jitter function adds noise to points with the same value so that they will be individually displayed. (Fig. 4) The beeswarm function provides an alternative to jitter (Fig. 5).

box plot, beeswarm, KMggplot2

Figure 5. Box plot graph made with GrapheR with beeswarm applied to avoid overplotting of points.

3. A bit more work, but worth a look. Use plotly library to create interactive web application to display your data. 

install.packages("plotly")
library(plotly)
fig <- plot_ly(y = Bees, type = "box", name="Bees")
fig <- fig %>% add_trace(y = Butterflies, name="Butterflies")
fig

code modified from example at https://plotly.com/r/box-plots/

Screenshot box plot made with plotly package

Figure 6. Screenshot of plotly box plot. Live version, data points visible when mouse pointer hover.

4. Veusz, at https://veusz.github.io/. Includes a tutorial to get started. Mac users will need to download the dmg file with the curl command in the terminal app instead of via browser, as explained here.

box plot example, Veusz software

Figure 7. Screenshot of box plot example in Veusz GUI.

5. SciDAVis is a package capable of generating lots of kinds of graphs along with curve fitting routines and other mathematical processing, https://scidavis.sourceforge.net/. SciDAVis is very similar to QtiPlot and OriginLab.

 

Figure 8. Add screenshot

 

More sophisticated graphics can, and when you gain confidence in R, you’ll find that there are many more sophisticated packages that you could add to R to make really impressive graphs. However, the point is to get the best graph, and there are many tools out there that can serve this end.


Chapter 4 contents

4.3 – Box plot

Introduction

Box plots, also called whisker plots, should be your routine choice for exploring ratio scale data. Like bar charts, box plots are used to compare ratio scale data collected for two or more groups. Box plots serve the same purpose as bar charts with error bars, but box plots provide more information.

Purpose and design criteria

Box plots are useful tool for getting a sense of central tendency and spread of data. These types of plots are useful diagnostic plots. Use them during initial stages of data analyses. All summary features of box plots are based on ranks (not sums). So, they are less sensitive to extreme values (outliers). Box plots reveal asymmetry. Standard deviations are symmetric.

The median splits each batch of numbers in half (center line). The “hinge” (median value) splits the remaining halves in half again (the quartiles). The first, second (median), and third quartiles describes the interquartile range, or IQR, 75% of the data (Fig. 1). Outlier points can be identified, for example, with an asterisk or by id number (Fig. 1).

Elements of a box plot

Figure 1. A box plot. Elements of box plot labelled. 

We’ll use the data set described in the previous section, so if you have not already done so, get the data from Table 1, Chapter 4.2 into your R software.

See Chapter 4.10 — Graph software for additional box plot examples, but made with different R packages or software apps.

R Code

Command line

We’ll provide code for the base graph shown in Figure 2A. At the R prompt, type

boxplot(OliveMoment~Treatment)
boxplot, default graph in base package

Figure 2A. Box plot, default graph in base package

Boxplot is a common function offered in several packages. In the base installation of R, the function is boxplot(). The car package, which is installed as part of R Commander installation, includes Boxplot(), which is a “wrapper function” for boxplot(). Note the difference: base package is all lower case, car package the “B” is uppercase. One difference, base boxplot() permits horizontal orientation of the plot (Fig. 2B).

Wrapper functions are code that links to another function, perhaps simplifying working with that function.

boxplot(OliveMoment ~ Treatment,  horizontal=TRUE, col="steelblue")

Horizontal boxplot, base R boxplot()

Figure 2B. Same graph, but with color and made horizontal; boxplot(), default graph in base package

Base package boxplot() has additional features and options compared to Boxplot() in the car package. i.e., not all barcode() options are wrapped. For example, I had more success adding original points to boxplot() graph (Fig. 2C) following the function call with stripchart().

stripchart(OliveMoment ~ Treatment, method = "overplot", pch = 19, add = TRUE)

Added data to boxplot() via stripchart()

Figure 2C. Same graph, added original points; boxplot(), default graph in base package.

boxplot and stripchart functions part of ggplot2 package, part of tidyverse, easily used to generate graphs like Fig 2B and Fig 2C. The overplot option was used to jitter points to avoid overplotting. See below: Apply tidyverse-view to enhance look of boxplot graphic and Fig. 9.

Jittering adds random noise to points, which helps view the data better if many points are clustered together. Note however that jitter would add noise to the plot — if the objective is to show an association between two variables, jitter will reduce the apparent association, perhaps even compromising the intent of the graph. Beeswarm also can be used to better visualize clustered points, but uses a nonrandom algorithm to plot points. 

Rcmdr: Graph → Boxplot…

Select the response variable, then click on the Plot by: button

Figure 23. Popup menu in R Commander: Select the response variable and set the Plot by: option.

Figure 3. Popup menu in R Commander: Select the response variable and set the Plot by: option.

Next, select the Groups (Factor) variables (Fig. 4). Click OK to proceed

Figure 24. Rcmdr boxplot menu, Select the group variable

Figure 4. Select the group variable

Back to the Box Plot menu, click “Options” tab to add details to the plot, including a graph title and how outliers are noted (Fig 5), 

Figure 25. Rcmdr boxplot menu, Options tab, enter labels for axes and a title.

Figure 5. Options tab, enter labels for axes and a title.

And here is the resulting box plot (Fig 6)

Figure 26. Resulting box plot from car package implemented in R Commander. Outliers are identified by row id number.

Figure 6. Resulting box plot from car package implemented in R Commander. Outliers are identified by row id number.

The graph is functional, if not particularly compelling. The data set was “olive moments” from Comet Assays of an immortalized rat lung cell line exposed to dilute copper solution (Cu), Hazel tea (Hazel), or Hazel & Copper solution. 

Apply Tidyverse-view to enhance look of boxplot graphic

Load the ggplot2 package via the Rcmdr plugin to add options to your graph. As a reminder, to install Rcmdr plugins you must first download and install them from an R mirror like any other package, then load the plugin via Rcmdr Tools → Load Rcmdr plug-in(s)… (Fig 6, Fig 7).

Figure 27. Screen shot of Load Rcmdr plug-ins menu, Click OK to proceed (see Figure 28)

Figure 6. Screen shot of Load Rcmdr plug-ins menu, Click OK to proceed (see Fig 7)

Figure 28. To complete installation of the plug-in, restart R Commander.

Figure 7. To complete installation of the plug-in, restart R Commander.

Significant improvement, albeit with an “eye of the beholder” caveat, can be made over the base package. For example, ggplot2 provides additional themes to improve on the basic box plot. Figure 8 shows the options available in the Rcmdr plugin KMggplot2, and the default box plot is shown in Fig 9.

Figure 29. Menu of KMggplot2. A title was added, all else remained set to defaults.

Figure 8. Menu of KMggplot2. A title was added, all else remained set to defaults.

The next series of plots explore available formats for the charts.

Figure 30. Default box plot from KMggplot

Figure 9. Default box plot from KMggplot

Figure 31. "Economist" theme box plot from KMggplot2

Figure 10. “Economist” theme box plot from KMggplot2

And finally, since the box plot is often used to explore data sets, some recommend including the actual data points on a box plot to facilitate pattern recognition. This can be accomplished in the KMggplot2 plugin by checking “Jitter” under the Add data points option (see Fig 8). Jitter helps to visualize overlapping points at the expense of accurate representation. I also selected the Tufte theme, which results in the image displayed in Figure 11. 

igure 32. Tufte theme and data points added to the box plot.

Figure 11. Tufte theme and data points added to the box plot.

Note. The Tufte theme is so named for Edward Tufte (2001), Chapter 6 Data-Ink Maximization and Graphical Design.” In brief, the theme follows the “maximal data, minimal ink” principle.

Conclusions

As part of your move from the world of Microsoft Excel graphics to recommended graphs by statisticians, the box plot is used to replace the bar charts plus error bars that you may have learned in previous classes. The second conclusion? I presented a number of versions of the same graph, differing only by style. Pick a style of graphics and be consistent.

Questions

  1. Why is a box plot preferred over a bar chart for ratio scale data, even if an appropriate error bar is included?
  2. With your comet data (Table 1, Chapter 4.2), explore the different themes available in the box plot commands available to you in Rcmdr. Which theme do you prefer and why?

Chapter 4 contents