With the function pml_bb
from phangorn (Schliep 2011) a lot of steps have become easier
and shorter. If you want to have more control over all of the used
parameters, it is also possible to use the older functions,
e.g. optim_pml
. The data is the same as in the vignette
Estimating phylogenetic trees with phangorn:
library(ape)
library(phangorn)
<- system.file("extdata/trees", package = "phangorn")
fdir <- read.phyDat(file.path(fdir, "primates.dna"),
primates format = "interleaved")
As a starting tree, we calculate a neighbor joining tree:
<- dist.ml(primates)
dm <- NJ(dm) treeNJ
<- pml(treeNJ, data=primates)
fit fit
## model: JC
## loglikelihood: -3075
## unconstrained loglikelihood: -1230
##
## Rate matrix:
## a c g t
## a 0 1 1 1
## c 1 0 1 1
## g 1 1 0 1
## t 1 1 1 0
##
## Base frequencies:
## a c g t
## 0.25 0.25 0.25 0.25
The function pml
returns an object of class
pml
. This object contains the data, the tree and many
different parameters of the model like the likelihood. There are many
generic functions for the class pml
available, which allow
the handling of these objects.
methods(class="pml")
## [1] AICc BIC anova logLik plot print simSeq update vcov
## see '?methods' for accessing help and source code
The object fit just estimated the likelihood for the tree it got
supplied, but the branch length are not optimized for the Jukes-Cantor
(Jukes and Cantor 1969) model yet, which
can be done with the function optim.pml
.
<- optim.pml(fit, rearrangement="NNI") fitJC
## optimize edge weights: -3075 --> -3068
## optimize edge weights: -3068 --> -3068
## optimize topology: -3068 --> -3068 NNI moves: 1
## optimize edge weights: -3068 --> -3068
## optimize topology: -3068 --> -3068 NNI moves: 0
logLik(fitJC)
## 'log Lik.' -3068 (df=25)
With the default values pml
will estimate a Jukes-Cantor
model. That means equal base frequencies and all transition rates are
equal. The generic function update
allows to change
parameters manually. This is not what we usually want to do. However we
might want to supply a different tree or change the number of rate
categories.
<- update(fitJC, k=4, inv=0.2, bf=baseFreq(primates))
fitF81 fitF81
## model: F81+G(4)+I
## loglikelihood: -3037
## unconstrained loglikelihood: -1230
## Proportion of invariant sites: 0.2
## Discrete gamma model
## Number of rate categories: 4
## Shape parameter: 1
##
## Rate matrix:
## a c g t
## a 0 1 1 1
## c 1 0 1 1
## g 1 1 0 1
## t 1 1 1 0
##
## Base frequencies:
## a c g t
## 0.37481 0.40160 0.03911 0.18448
In the line above we changed the model to a (discrete) rate across site model with 4 rate categories (using the default shape parameter of 1), to 0.2 invariant sites and supply empirical base frequencies.
<- optim.pml(fitF81, model="GTR", optInv=TRUE, optGamma=TRUE,
fitGTR rearrangement = "NNI", control = pml.control(trace = 0))
fitGTR
## model: GTR+G(4)+I
## loglikelihood: -2610
## unconstrained loglikelihood: -1230
## Proportion of invariant sites: 0.005846
## Discrete gamma model
## Number of rate categories: 4
## Shape parameter: 2.89
##
## Rate matrix:
## a c g t
## a 0.000 1.211438 74.564568 1.025
## c 1.211 0.000000 0.006716 31.278
## g 74.565 0.006716 0.000000 1.000
## t 1.025 31.278075 1.000000 0.000
##
## Base frequencies:
## a c g t
## 0.39401 0.37870 0.04036 0.18692
We will change the model to the GTR + \(\Gamma(4)\) + I model and then optimize all the parameters.
With the control parameters the thresholds for the fitting process
can be changed. Here we want just to suppress output during the fitting
process. For larger trees the NNI rearrangements often get stuck in a
local maximum. We added two stochastic algorithms to improve topology
search. The first (set rearrangement="stochastic"
) performs
stochastic rearrangements similar as in (Nguyen
et al. 2015), which makes random NNI permutation to the tree,
which than gets optimized to escape local optima. The second option
(rearrangement="ratchet"
) perform the likelihood ratchet
(Vos 2003).
While these algorithms may find better trees they will also take more time.
<- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE,
fitGTR rearrangement = "stochastic", control = pml.control(trace = 0))
fitGTR
## model: GTR+G(4)+I
## loglikelihood: -2607
## unconstrained loglikelihood: -1230
## Proportion of invariant sites: 0.005648
## Discrete gamma model
## Number of rate categories: 4
## Shape parameter: 2.698
##
## Rate matrix:
## a c g t
## a 0.0000 0.791737 73.967841 0.6363
## c 0.7917 0.000000 0.005045 28.4997
## g 73.9678 0.005045 0.000000 1.0000
## t 0.6363 28.499694 1.000000 0.0000
##
## Base frequencies:
## a c g t
## 0.39628 0.37883 0.04027 0.18461
We can compare nested models for the JC and GTR + \(\Gamma(4)\) + I model using likelihood ratio statistic
anova(fitJC, fitGTR)
## Likelihood Ratio Test Table
## Log lik. Df Df change Diff log lik. Pr(>|Chi|)
## 1 -3068 25
## 2 -2607 35 10 923 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
with the Shimodaira-Hasegawa test
SH.test(fitGTR, fitJC)
## Trees ln L Diff ln L p-value
## [1,] 1 -2607 0.0 0.5034
## [2,] 2 -3068 461.6 0.0000
or with the AIC
AIC(fitJC)
## [1] 6187
AIC(fitGTR)
## [1] 5283
AICc(fitGTR)
## [1] 5296
BIC(fitGTR)
## [1] 5404
At last we may want to apply standard bootstrap to test how well the edges of the tree are supported. This has already been shown in the vignette Estimating phylogenetic trees with phangorn.
<- bootstrap.pml(fitJC, bs=100, optNni=TRUE,
bs control = pml.control(trace = 0))
Now we can plot the tree with the bootstrap support values on the
edges and also look at consensusNet
to identify potential
conflict.
plotBS(midpoint(fitJC$tree), bs, p = 50, type="p")
<- consensusNet(bs, p=0.2)
cnet plot(cnet, show.edge.label=TRUE)
phangorn has several functions to generate tree topologies,
which may are interesting for simulation studies. allTrees
computes all possible bifurcating tree topologies either rooted or
unrooted for up to 10 taxa. One has to keep in mind that the number of
trees is growing exponentially, use howmanytrees
from
ape as a reminder.
<- allTrees(5)
trees par(mfrow=c(3,5), mar=rep(0,4))
for(i in 1:15)plot(trees[[i]], cex=1, type="u")
nni
returns a list of all trees which are one nearest
neighbor interchange away.
nni(trees[[1]])
## 4 phylogenetic trees
rNNI
and rSPR
generate trees which are a
defined number of NNI (nearest neighbor interchange) or SPR (subtree
pruning and regrafting) away.
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=de_AT.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_AT.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=de_AT.UTF-8 LC_MESSAGES=de_AT.UTF-8
## [7] LC_PAPER=de_AT.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] phangorn_2.11.1 ape_5.6-2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 bslib_0.4.2 compiler_4.2.2 jquerylib_0.1.4
## [5] highr_0.10 tools_4.2.2 digest_0.6.31 jsonlite_1.8.4
## [9] evaluate_0.20 lifecycle_1.0.3 nlme_3.1-161 lattice_0.20-45
## [13] pkgconfig_2.0.3 rlang_1.0.6 Matrix_1.5-3 fastmatch_1.1-3
## [17] igraph_1.3.5 cli_3.6.0 rstudioapi_0.14 yaml_2.3.6
## [21] parallel_4.2.2 xfun_0.36 fastmap_1.1.0 stringr_1.5.0
## [25] knitr_1.41 generics_0.1.3 vctrs_0.5.1 sass_0.4.4
## [29] grid_4.2.2 glue_1.6.2 R6_2.5.1 rmarkdown_2.20
## [33] magrittr_2.0.3 codetools_0.2-18 htmltools_0.5.4 quadprog_1.5-8
## [37] stringi_1.7.12 cachem_1.0.6