In this vignette, we will show how to work with morphological data in phangorn (Schliep 2011). In most cases the different morphological characters or character states are encoded with the numbers 0:9 (or less, if there are less differences). Morphological data can come in different formats. The most common ones are .csv and .nexus.
We start by loading the phangorn package and setting a random seed:
library(phangorn)
set.seed(9)
The dataset we’re using contains morphological data for 12 mite
species, with 79 encoded characters (Schäffer et
al. 2010). When reading in the .csv file,
row.names = 1
uses the first column (species) as row names.
To get a phyDat
object, we have to convert the dataframe
into a matrix with as.matrix
.
<- system.file("extdata", package = "phangorn")
fdir <- read.csv(file.path(fdir, "mites.csv"), row.names = 1)
mm <- phyDat(as.matrix(mm), type = "USER", levels = 0:7) mm_pd
The data can then be written into a nexus file:
write.phyDat(mm_pd, file.path(fdir, "mites.nex"), format = "nexus")
Reading in a nexus file is even easier than reading in a csv file:
<- read.phyDat(file.path(fdir, "mites.nex"), format = "nexus", type = "STANDARD") mm_pd
After reading in the nexus file, we have the states 0:9, but the data only has the states 0:7. Here is one possibility to change the contrast matrix:
<- matrix(data = c(1,0,0,0,0,0,0,0,0,
contrast 0,1,0,0,0,0,0,0,0,
0,0,1,0,0,0,0,0,0,
0,0,0,1,0,0,0,0,0,
0,0,0,0,1,0,0,0,0,
0,0,0,0,0,1,0,0,0,
0,0,0,0,0,0,1,0,0,
0,0,0,0,0,0,0,1,0,
0,0,0,0,0,0,0,0,1,
1,1,1,1,1,1,1,1,1),
ncol = 9, byrow = TRUE)
dimnames(contrast) <- list(c(0:7,"-","?"),
c(0:7, "-"))
contrast
## 0 1 2 3 4 5 6 7 -
## 0 1 0 0 0 0 0 0 0 0
## 1 0 1 0 0 0 0 0 0 0
## 2 0 0 1 0 0 0 0 0 0
## 3 0 0 0 1 0 0 0 0 0
## 4 0 0 0 0 1 0 0 0 0
## 5 0 0 0 0 0 1 0 0 0
## 6 0 0 0 0 0 0 1 0 0
## 7 0 0 0 0 0 0 0 1 0
## - 0 0 0 0 0 0 0 0 1
## ? 1 1 1 1 1 1 1 1 1
<- phyDat(mm_pd, type="USER", contrast=contrast) mm_pd
Now that we have our data, we can start the analyses.
For morphological data, one of the most frequently used approaches to
conduct phylogenetic trees is maximum parsimony (MP).
pratchet
(as already described in Estimating
phylogenetic trees with phangorn) implements the parsimony ratchet
(Nixon 1999). To create a starting tree,
we can use the function random.addition
:
<- random.addition(mm_pd) mm_start
This tree can then be given to pratchet
:
<- pratchet(mm_pd, start = mm_start, minit = 1000, maxit = 10000,
mm_tree all = TRUE, trace = 0)
mm_tree
## 19 phylogenetic trees
With all=TRUE
we get all (in this case 19) trees with
lowest parsimony score in a multiPhylo
object. Since we we
did a minimum of 1000 iterations, we already have some edge support. Now
we can assign the edge lengths.
<- acctran(mm_tree, mm_pd) mm_tree
In the case of our mites-dataset with 12 sequences, it’s also
possible to use the branch and bound algorithm (Hendy and Penny 1982) to find all most
parsimonious trees. With bigger datasets it is definitely recommended to
use pratchet
.
<- bab(mm_pd, trace = 0)
mm_bab mm_bab
## 37 phylogenetic trees
If we want our unrooted trees to be rooted, we have the possibility
to use midpoint
to perform midpoint rooting. Rooting the
trees with a specific species (we chose C. cymba here) can be
done with the function root
from the ape package
(Paradis and Schliep 2019). To save the
correct node labels (edge support), it’s important to set
edgelabel=TRUE
.
<- root(mm_tree, outgroup = "C._cymba", resolve.root = TRUE,
mm_tree_rooted edgelabel = TRUE)
With plotBS
, we can either plot all of the trees with
their respective edge support, or we can subset to only get a certain
tree. It is also possible to save the plots as .pdf (or various
other formats, e.g. svg, png, tiff) file. digits
is an
argument to determine the number of digits shown for the bootstrap
values.
# plot all trees
plotBS(mm_tree_rooted, digits = 2)
# subsetting for tree nr. 9
plotBS(mm_tree_rooted[[9]], digits = 2)
# save plot as pdf
pdf(file = "mm_rooted.pdf")
plotBS(mm_tree_rooted, digits = 2)
dev.off()
To look at the consensus tree of our 19 trees from
pratchet
, or of our 37 most parsimonious trees from
bab
, we can use the consensus
function from
ape.
# unrooted pratchet tree
<- consensus(mm_tree)
mm_cons
# rooted pratchet tree
<- consensus(mm_tree_rooted, rooted = TRUE)
mm_cons_root
# branch and bound, we root the consensus tree in the same step
<- root(consensus(mm_bab), outgroup = "C._cymba",
mm_bab_cons resolve.root = TRUE, edgelabel = TRUE)
plot(mm_cons, main="Unrooted pratchet consensus tree")
plot(mm_cons_root, main="Rooted pratchet consensus tree")
plot(mm_bab_cons, main="Rooted bab consensus tree")
We can clearly see that, as expected, the two rooted trees have the same topology.
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=de_AT.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_AT.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=de_AT.UTF-8 LC_MESSAGES=de_AT.UTF-8
## [7] LC_PAPER=de_AT.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] phangorn_2.11.1 ape_5.6-2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 bslib_0.4.2 compiler_4.2.2 jquerylib_0.1.4
## [5] highr_0.10 tools_4.2.2 digest_0.6.31 jsonlite_1.8.4
## [9] evaluate_0.20 lifecycle_1.0.3 nlme_3.1-161 lattice_0.20-45
## [13] pkgconfig_2.0.3 rlang_1.0.6 Matrix_1.5-3 fastmatch_1.1-3
## [17] igraph_1.3.5 cli_3.6.0 rstudioapi_0.14 yaml_2.3.6
## [21] parallel_4.2.2 xfun_0.36 fastmap_1.1.0 stringr_1.5.0
## [25] knitr_1.41 generics_0.1.3 vctrs_0.5.1 sass_0.4.4
## [29] grid_4.2.2 glue_1.6.2 R6_2.5.1 rmarkdown_2.20
## [33] magrittr_2.0.3 codetools_0.2-18 htmltools_0.5.4 quadprog_1.5-8
## [37] stringi_1.7.12 cachem_1.0.6