20.11 – Plot a Newick tree

Introduction

The phrase “paradigm shift”, attributed to Kuhn (1962, see Wikipedia), may be well-worn and even abused today (Naughton 2012), but the shift in thinking from essential types and group thinking (essentialism) to viewing species as varying individuals in populations (populating thinking) revolutionized biology (O’Hara 1997, Sandvik 2008). Tree thinking is the manifestation of Charles Darwin’s “descent with modification” metaphor (Gregory 2008). Thus, every biology student should have ability to work with, and interpret, phylogenetic trees (tree thinking). The subject of creating and working with phylogenetic graphs is complicated with an extensive library. A good review is available from Holder and Lewis (2003) and readers should know Felsenstein’s book (2004).

Here, I include a modest, incomplete primer on working with trees in R.

  • Loading the tree file
  • Change tip names
  • Write tip names to a text file
  • Plot the tree as phylogram or cladogram
  • Get node labels
  • Re-root the tree
  • Write a tree to a file

I assume that the student already has a set of species or other taxa, gathered sequences (DNA or protein), aligned the sequences, estimated gene or phylogeny tree, and wishes to view and manipulate the tree in R. While these kinds of analyses can be done with R and R packages (see Task view: Phylogenetics), other software may be better choice for the student just beginning with phylogenetic tree building (see Unipro UGENE and MEGA, for examples). If the goal is just to view a tree file, or add annotations, then I recommend the iTOL tools.

Data formats

Phylogeny and gene trees are special cases of network graphs. Newick format (Wikipedia) is a common but limited representation of the tree which uses parentheses (groupings) and commas (branching). Other formats permit additional information; examples are Nexus file (Wikipedia) and the extension of Nexus to XML, NeXML (Wikipedia), and phyloXML (Wikipedia) formats. Our example uses Newick format.

Data set

I’ll use a “time tree” for an example. Tree from timetree.org, list of species (copy/paste list to a text file, load the text file Load list of of Species, then save the tree as a Newick file).

Alligator mississippiensis
Felis catus
Bos taurus
Gallus gallus
Pan troglodytes
Canis lupus
Homo sapiens
Anolis carolinensis
Macaca mulatta
Mus musculus
Didelphis virginiana
Sus scrofa
Oryctolagus cuniculus
Rattus norvegicus

R code

Requires the ape package. Phylotools and Phytools packages provide additional handy functions. References for these packages are listed at the end of this page.

require(ape)
require(phytools)
require(phylotools)
#If tree file, then
read.tree(file="tree14.nwk")
or
tree14 <- read.tree(file.choose())
#If no tree file saved, copy the Newick data use text="", replace example tree with your Newick tree
tree14 <-read.tree(text="((Anolis_carolinensis:279.65697667,(Gallus_gallus:236.50266286,Alligator_mississippiensis:236.50266286)'14':43.15431381)'13':32.24694470,(Didelphis_virginiana:158.59758758,(((Felis_catus:54.32144118,Canis_lupus:54.32144118)'11':23.43351523,(Bos_taurus:61.96598852,Sus_scrofa:61.96598852)'10':15.78896789)'19':18.70743276,((Oryctolagus_cuniculus:82.14079889,(Rattus_norvegicus:20.88741740,Mus_musculus:20.88741740)'9':61.25338149)'22':7.68238853,(Macaca_mulatta:29.44154682,(Pan_troglodytes:6.65090500,Homo_sapiens:6.65090500)'8':22.79064182)'6':60.38164060)'30':6.63920175)'29':62.13519841)'27':153.30633379);")
#return information about the object
tree14

Output returned by R

Phylogenetic tree with 14 tips and 13 internal nodes.
Tip labels:
Anolis_carolinensis, Gallus_gallus, Alligator_mississippiensis, Didelphis_virginiana, Felis_catus, Canis_lupus, ...
Node labels:
, 13, 14, 27, 29, 19, ...
Rooted; includes branch lengths.

Change the tip names. Create a data frame with the tip labels and new tip names.

require(phylotools)
timeTreeTips <- tree14$tip.label
replaceTips <- c("Alligator", "Cat", "Chicken", "Chimpanzee", "Cow", "Dog", "Human", "Lizard", "Macaque", "Mouse", "Opossum", "Pig", "Rabbit", "Rat")
myDat <- data.frame(timeTreeTips,replaceTips)
ntree14<- sub.taxa.label(tree14,myDat)

Collect and write the tip names to a text file

#Extract tips from newick file, write to text file
require(ape)
my.tips <- sort(tree14$tip.label)
#option 1
cat(my.tips,file="outfile.txt",sep="\n")
#option 2
my_conn = file("outfile.txt")
writeLines(my.tips,my_conn)
close(my_conn)

Next, make the plot.

plot(ntree14)

Result, a simple phylogram, i.e., a tree diagram with branching patterns and branch lengths proportional to amount of character change.

phylogram

Figure 1. Phylogram plot of 14 taxa

Or, change from default “phylogram” to “cladogram” view.

plot(tree14, type="cladogram")

diagonal cladogram

Figure 2.  Cladogram view, same 14 taxa.

Note that while the tree is rooted, it’s a midpoint rooting, the default setting in Newick files. For true root based on outgroup(s), identify the nodes, then select root.

Add node labels; plot() must be run first.

nodelabels()

cladogram with nodes

Figure 3. Plot of tree with labeled nodes.

The outgroup(s) were the reptiles (Alligator, Chicken, Lizard), so reroot at node 16.

rrTree<- root(tree14, node=16)
plot(rrTree, type="cladogram")

rooted tree

Figure 4. Re-rooted tree.

Write tree to a file

require(ape)

To export tree to Newick format

write.tree(tree14, file = "filename.nwk")

for Nexus format

write.nexus(tree14, file = "filename.nex")

Star phylogeny

Collapse the tree to a star phylogeny, an unlikely evolutionary model in which the species resulted from “… a single explosive adaptive radiation” (Felsenstein 1985). Star phylogeny is an extreme tree shape, or multifurcation (polytomy), where all tips derive from the same node (Colijn and Plazzotta 2018). This type of phylogeny can be viewed as a null model for inference (but see Bayesian “star phylogeny paradox,” cf. Kolaczkowski and Thornton 2006).

require(phytools)
ctree14 <-collapse.to.star(tree14, 15)
plot(ctree14, type="cladogram")

star phylogeny

Figure 5. Star phylogeny

Under a star phylogeny model, all taxa are assumed independent of each other, in contrast to the nested hierarchical model of evolution (e.g., Fig. 4), which shows a lack of independence among the taxa. More succinctly, a fitted to uncorrected taxa comparisons may violated the assumption that errors are independent and identically distributed. Phylogenetically correct methods attempt to address the lack of independence among taxa for comparative analysis (Felsenstein 1985, Uyeda et al 2016). Biologists should know about Felsenstein’s 1985 paper. Felsenstein’s paper created a paradigm shift in how to analyze comparative datasets and has been cited more than ten thousand times (1 August 2023, Google Scholar). To put that number in context, the 1986 paper by Kary Mullis et al., which announced invention of PCR with thermally stable polymerase that has revolutionized molecular biology, has been cited 6721 times over that same period.

Additional packages of note

The R package tanggle works with the package ggtree and advantage of the ggplot2 environment. Contains many functions to work with phylogeny graphs including re-rooting and swapping nodes. The package is available from Bioconductor,

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("tanggle")

ggtree is also a Bioconductor package, not available at CRAN.

Online viewers

Many browser-based tree viewers online, including icytree.org and iTOL tools. Additional tree viewers listed at Wikipedia.

References and suggested readings

Colijn C, Plazzotta G. A Metric on Phylogenetic Tree Shapes. Syst Biol. 2018 Jan 1;67(1):113-126.

Felsenstein, J (1985) Phylogenies and the comparative method. American Naturalist 125(1):1-15.

Felsenstein, J. (2004). Inferring phylogenies. Sunderland, MA: Sinauer associates.

Gregory, T. R. (2008). Understanding evolutionary trees. Evolution: Education and Outreach1(2), 121-137.

Holder, M., & Lewis, P. O. (2003). Phylogeny estimation: traditional and Bayesian approaches. Nature reviews genetics4(4), 275-284.

Kolaczkowski, B., & Thornton, J. W. (2006). Is There a Star Tree Paradox? Molecular Biology and Evolution, 23(10), 1819–1823.

Kuhn, T. S. (1962). The structure of scientific revolutions. University of Chicago Press: Chicago.

Mullis, K., Faloona, F., Scharf, S., Saiki, R. K., Horn, G. T., & Erlich, H. (1986, January). Specific enzymatic amplification of DNA in vitro: the polymerase chain reaction. In Cold Spring Harbor symposia on quantitative biology (Vol. 51, pp. 263-273). Cold Spring Harbor Laboratory Press.

Naughton, J. (2012, August 18). Thomas Kuhn: The man who changed the way the world looked at science. The Guardian. https://www.theguardian.com/science/2012/aug/19/thomas-kuhn-structure-scientific-revolutions

O’Hara, R. J. (1997). Population thinking and tree thinking in systematics. Zoologica scripta26(4), 323-329.

Paradis, E. (2012) Analysis of Phylogenetics and Evolution with R (Second Edition). New York: Springer.

Paradis, E. and Schliep, K. (2019) ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics, 35, 526–528.

Revell, L. J. (2012) phytools: An R package for phylogenetic comparative biology (and other things). Methods in Ecology and Evolution, 3, 217-223.

Sandvik, H. (2008). Tree thinking cannot taken for granted: challenges for teaching phylogenetics. Theory in Biosciences127(1), 45-51.

Uyeda, J. C., Zenil-Ferguson, R., & Pennell, M. W. (2018). Rethinking phylogenetic comparative methods. Systematic Biology67(6), 1091-1109.

Zhang, J., Pei, N., & Mi, X. (2012). phylotools: Phylogenetic tools for Eco-phylogenetics. R package version 0.1, 2.


Chapter 20 contents

13.1 – ANOVA Assumptions

Introduction

Like all parametric tests, assumptions are made about the data in order to justify and trust estimates and inferences drawn from ANOVA. These are

  1. Data come from normal distributed population. View with a histogram or Q-Q plot. Test with Shaprio-Wilks or other appropriate goodness of fit test. Normality tests are the subject of Chapter 13.3.
  2. Sample size equal among groups.
    • This is an example of a potentially confounding factor — If sample sizes differ, then any difference in means could be simply because of differences in sample size! This gets us into weighed versus unweighted means.
    • You shouldn’t be surprised that modern implementations of ANOVA in software easily handle (adjust for) these known confounding factors. Depending on the program, you’ll see “Adjusted means,” “least squares means,” “Marginal means” etc. This just implies that the group means are compared after accounting for confounding factors.
    • Importantly, as long as sample sizes among the groups are roughly equivalent, normality assumption is not a big deal (low impact on risk of Type I error).
  3. Independence of errors. One consequence of this assumption is that you would not view 100 repeated observations of a trait on the same subject as 100 independent data points. We’ll return to this concept more in the next two lectures. Some examples:
    • Colorimetric assay where the signal changes over time, and you measure in order (e.g., samples from group 1 first, samples from group 2 second, etc.) — this confounds group with time.
      • The consequence is that you are far more likely to reject the null hypothesis, committing a Type I error.
    • Let’s say you are observing running speeds of ten mongoose. However, it turns out that five of your subjects are actually from the same family, identical quintuplets! Do you really have ten subjects?
    • Compare brain-body mass ratio among different species; this is a classic comparative method problem (Fig. 3). Since 1985 (Felsenstein 1985), it was recognized that the hierarchical evolutionary relationships among the species must be accounted for to control for lack of independence among the taxa tested. See Phylogenetically independent contrasts, Chapter 20.12.
  1. Equal variances among groups. See Chapter 13.4  for how to test this for multiple groups.

Impact of assumptions

Note that R (and pretty much all statistics packages) will calculate the ANOVA and the p-value, but it is up to you to recognize that the P-value is accurate only if the assumptions are met. Violation of the assumption of normality can lead to Type I errors occurring more often than the 5% level. What to do if the assumptions are violated?

If the violation is due to only a handful of the data, you might proceed anyway. But following a significant test for normality, we could avoid the ANOVA in favor of nonparametric alternatives (Chapter 15), or, we might try to transform the data.

Consider a histogram of brains weight measures in grams for a variety of mammals (Fig. 1).

Histogram of brain mass (g) for 24 mammals (data from Boddy et al 2012).

Figure 1. Histogram of body mass (g) for 24 mammals (data from Boddy et al 2012).

We will introduce a variety of statistical tests of the assumption of normality in Chapter 13.3, but looking at a histogram as part of our data exploration, we clearly see the data are right-skewed (Fig. 1). Is this an example of normal distributed sample of observations? Clearly not. If we proceed with statistical tests on the raw data set, then we are more likely to commit a Type I error (i.e., we will reject the null hypothesis more often than we should).

A note on normality and biology. It is VERY possible that data may not be normally distributed or have equal variances on the original scale, but a simple mathematical manipulation may take care of that. In fact, in many case in biology that involve growth, many types of variables are expected to not be normal on the original scale. For example, while the relationship between body mass, M, and metabolic rate, MR, in many groups of organisms is allometric and increases positively, the relationship

    \begin{align*} MR = a\cdot M^b \end{align*}

is not directly proportional (linear) on the original scale. By taking the logarithm of both body mass and metabolic rate, however, the relationship is linear

    \begin{align*} log\left ( MR \right ) = log\left ( a \right ) + b\cdot log\left ( M \right ) \end{align*}

In fact, taking the logarithm (base 10, base 2, or base e) is often a common solution to both non-normal data (Fig. 2) and unequal variances.

Histogram of log-transformed body mass observations from Figure 1

Figure 2. Histogram of log10-transformed body mass observations from Figure 1.

Other common transformations include taking the square root or the inverse of the square-root for skewed or kurtotic sample distributions, and the arcsine for frequencies (since frequencies can only be from 0 to 1 — need to “stretch the data” to make frequencies fit procedures like ANOVA). There are many issues about data transformation, but keep in mind three points. After completing the transformation, you should check the assumptions (normality, equal variances) again.

You may need to recode the data before applying a transform. For example, you cannot take the square root or logarithm of negative numbers. If you do not recode the data, then you will lose these observations in your subsequent analyses. In many cases, this problem is easily solved by adding 1 to all data points before making the transform. I prefer to make the minimum value 1 and go from there. The justification for data transformation is basically to realize that there is no necessity to use the common arithmetic or linear scale: many relationships are multiplicative or nonadditive (e.g., rates in biology and medicine).

Statistical outlier

Another topic we should at least mention here is the concept of outliers. While most observations tend to cluster around the mean or middle of the sample distribution, occasionally one or more observations may differ substantially from the others. Such values are called outliers, and we note that there are two possible explanations for an outlier

  1. the value could be an error
  2. it is a true value†

†and there may be an interesting biological explanation for its cause.

We encountered a clear outlier in the BMI homework. If the reason is (1), then we go back and either fix the error or delete it from the worksheet. If (2), however, then we have no objective reason to exclude the point from our analyses.

We worry if the outlier influences our conclusions — so it is a good idea to run your analyses with and without the outlier. If your conclusions remain the same, then no worries. If your conclusions change based on one observation, then this is problematic. For the most part you are then obligated to include the outlier and the more conservative interpretation of your statistical tests.

ANOVA is robust to modest deviations from assumptions

A comment about ANOVA assumptions ANOVA turns out to be robust to violations of item (1) or (2). That means unless the data are really skewed or the group sizes are very different, ANOVA will perform well (Type I error rate stays close to the specified 5% level). We worry about this however when p-value is very close to alpha!!

The third assumption is more important in ANOVA.

Like the t-test, ANOVA makes the assumption of equal variances among the groups, so it will be helpful to review why this assumption is important to both the t-test and ANOVA. In the two-sample independent t-test, the pooled sample variance, s_{p}^2, is taken as an estimate of the population variance, \sigma^{2}. If you recall,

where SS_{1} refers to the sum of squares for the first group and SS_{2} refers to the second group sum of squares (see our discussion on measures of dispersion) and v_{1} refers to the degrees of freedom for the first group and v_{2} refers to the second group degrees of freedom. We make a similar assumption in ANOVA. We assume that the variances for each sample are the same and therefore that they all estimate the population variance \sigma^{2}. To say it in another way, we are assuming that all of our samples have identical variability

Once we make this assumption, we may pool (or combine) all of the SSs and DFs for all groups as our best estimate of the population variance, \sigma^{2}. The trick to understanding ANOVA is to realize that there can be two types of variability: there is variability due to being part of a group (e.g., even though ten human subjects receive the same calorie-restricted diet, not all ten will loose the same amount of weight) and there is variability among or between groups (e.g., on average, all subjects who received the calorie-restricted diet lost more weight than did those subjects who were on the non-restricted diet).

Example

The encephalization index (or encephalization quotient) is defined as the ratio of size the brain compared to body size. While there is a well-recognized increase in brain size given increased body size, encephalization describes a shift of function to cortex (frontal, occipital, parietal, temporal) from noncortical parts of the brain (cerebellum, brainstem). Increased cortex is associated with increased complexity of brain function; for some researchers, the index is taken as a crude estimate of intelligence. Figure 3A shows plot of brain mass in grams versus body size (grams) for 24 mammal species (data sampled from Boddy et al 2012); figure 3B shows the same data, but following log10-transform of both variables.

encephalization index

Figure 3. Plot of brain and body weights (A) and log10-log10 transform (B) for a variety of species (data from Boddy et al 2012). The ratio is called encephalization index.

Looking at the two figures the linear relationship between the two variables is obvious in Figure 3B, less so for Figure 3A. Thus, one biological justification for transformation of the raw data is exemplified with the brain-body mass dataset: the association is allometric not additive. The other reason to apply a transform is statistical; the log10-transform improves the normality of the variables. Take a look at the Q-Q plot for the raw data (Figure 4) and for the log10-transformed data (Fig. 5).

Q-Q plot, raw body mass

Figure 4. Q-Q plot, raw data. Compare to Figure 1.

Note the data don’t fall on the straight line, a few fall outside of the confidence interval (the curved dashed lines), which suggests the data are not normally distributed (see histogram, Figure 1). And for the transformed data, the Q-Q plot is shown in Fig. 5.

Q-Q plot, log-transformed body mass

Figure 5. Q-Q plot same data, log10-transformed, compare to Figure 2.

Compared to the raw data, the transformed data now fall on the line and none are outside of the confidence interval. We would conclude that the transformed data are more normal, thus, better meeting the assumptions of our parametric tests.

Lack of independence among data

Species comparisons are common in evolutionary biology and related fields. As noted earlier, comparative data should not be treated as independent data points. For our 24 species, I plotted the estimate of the phylogeny (timetree.org), see 20.11 – Plot a Newick tree for how to plot phylogeny trees in R.

phylogeny

Figure 6. Phylogenetic tree of 24 species used in this report.

The conclusion? We don’t have 24 data points, more like 8 points. Because the species are more or less related, there are fewer than 24 independent data points.  Statistically, this would mean that the errors are correlated. Various approaches to account for this lack of independence have been developed; perhaps the most common approach is to apply phylogenetically independent contrasts, a topic discussed in Chapter 20.12. (Boddy et al 2012 used this approach.)

See Chapter 20.11 for help making a plot like the one shown in Figure 6.

Questions

  1. †Shapiro-Wilks is one test of normality. Can you recall the name of the other normality test we named?

Data set and R code used in this page

species, order, body, brain
'Herpestes ichneumon', Carnivora, 1764, 24.1
'Potos flavus', Carnivora, 2620, 31.05
'Vulpes vulpes', Carnivora, 3080, 49.5
'Madoqua kirkii', Cetartiodactyla, 4570, 37
'Sus scrofa', Cetartiodactyla, 1000, 47.7
'Tragulus napu', Cetartiodactyla, 2510, 18.5
'Casinycteris argynnis', Chiroptera, 40.5, 0.92
'Cynopterus brachyotis', Chiroptera, 29, 0.88
'Potorous platyops', Chiroptera, 718, 6.5
'Cercartetus nanus', Diprotodontia, 12, 0.44548
'Dactylopsila palpator', Diprotodontia, 474, 7.15876
'Vombatus ursinus', Diprotodontia, 1902, 11.396
'Crocidura fuscomurina', Eulipotyphla, 5.6, 0.13
'Scalopus aquaticus', Eulipotyphla, 39.6, 1.48
'Sorex hoyi', Eulipotyphla, 2.6, 0.107
'Elephantulus fuscipes', Macroscelidea, 57, 1.33
'Rhynchocyon cirnei', Macroscelidea, 490, 6.1
'Rhynchocyon petersi', Macroscelidea, 717.3, 4.46
'Aotus trivirgatus', Primates, 480, 15.5
'Cebus capucinus', Primates, 590, 53.28
'Leontopithecus rosalia', Primates, 502.5, 11.7
'Dipodomys agilis', Rodentia, 61.4, 1.34
'Peromyscus melanocarpus', Rodentia, 58.8, 1.03
'Sciurus carolinensis', Rodentia, 367, 6.49

The Newick code for the tree in Figure 6.

((Vombatus_ursinus:48.94499077,((Potorous_platyops:47.59556667,Cercartetus_nanus:47.59556667)'14':0.66887333,Dactylopsila_palpator:48.26444000)'13':0.68055077)'11':109.65259681,(((((Crocidura_fuscomurina:33.74066667,Sorex_hoyi:33.74066667)'10':33.03022424,Scalopus_aquaticus:66.77089091)'19':22.55292749,(((Herpestes_brachyurus:54.32144118,(Vulpes_vulpes:45.52834967,Potos_flavus:45.52834967)'9':8.79309151)'22':23.43351523,((Tragulus_napu:43.96862857,Madoqua_kirkii:43.96862857)'8':17.99735995,Sus_scrofa:61.96598852)'6':15.78896789)'30':0.77378567,(Casinycteris_argynnis:35.20000000,Cynopterus_brachyotis:35.20000000)'29':43.32874208)'27':10.79507632)'35':7.13857077,(((Peromyscus_melanocarpus:69.89837667,Dipodomys_agilis:69.89837667)'43':0.64655123,Sciurus_carolinensis:70.54492789)'42':19.27825952,((Leontopithecus_rosalia:18.38385647,Aotus_trivirgatus:18.38385647)'40':1.29720005,Cebus_capucinus:19.68105652)'48':70.14213090)'51':6.63920175)'47':8.99739162,(Elephantulus_fuscipes:39.23366667,(Rhynchocyon_cirnei:15.34500000,Rhynchocyon_petersi:15.34500000)'39':23.88866667)'56':66.22611412)'55':53.13780679);

Chapter 13 contents