Metric learning for phylogenetic invariants: An algebraic approach to evolutionary
用于演化树代数不变量的度量结构学习——构造进化树的一种代数方法

Nicholas Eriksson; Yao Yuan 姚远

Abstract 摘要
Construction of phylogenetic trees from observations is a fundamental challenge in both evolutionary biology and evolutionary linguistics. Here we approach the problem from a new perspective by adopting algebraic invariants associated with the topological structures of phylogenetic trees. Our key development uses machine learning to optimize the power of phylogenetic invariants for constructing phylogenetic tree quartets, the building blocks of general evolutionary trees. Phylogenetic invariants are polynomials in the joint probabilities which vanish under a model of evolution on a phylogenetic tree. We give algorithms for selecting a good set of invariants and for learning a metric on this set of invariants which optimally distinguishes the different models. Our learning algorithms involve linear and semidefinite programming on data simulated over a wide range of parameters. We provide extensive tests of the learned metrics on simulated data from phylogenetic trees with four leaves under the Jukes-Cantor and Kimura 3-parameter models of DNA evolution. Our method greatly improves on other uses of invariants and is competitive with or better than the popular neighbor-joining method. In particular, we obtain metrics trained on trees with short internal branches which perform much better than neighbor joining in this region of parameter space. These results exhibit potential advantages of applying the new methodology to evolutionary linguistics.

从观测数据中构建演化树是生命演化和进化语言学的一个基础问题。本文试图从一个新角度来研究这个问题,即通过演化树的代数不变量来重建演化树的拓扑结构。我们关键的新发展是基于机器学习来优化选择演化树的代数不变量,针对四元演化树发展了一种新的构造方法。演化树代数不变量是指关于联合分布的一种特殊的代数多项式,其在树上的演化模型下恒等于零。本文主要贡献在于发展了一类算法,用于选择一组更好区分不同演化树模型拓扑结构的代数不变量以及相应的度量结构。我们的算法基于给定演化模型下的广泛参数变化而产生的仿真数据,采用线性规划和半正定规划来学习。文中对于DNA 演化的Jukes-Cantor 模型和Kimura 三参数模型进行了广泛的仿真试验测试。试验表明:本文方法整体上同目前广泛使用的Neighbor-Joining 算法相比,具有相似或者更好的性能;特别是对于四元树具有较短内部分支的Felsenstein 参数区,本文方法远远超过后者的性能。这些结果展示了将我们的新方法应用于进化语言学研究时可能具有的优势。

Keywords 关键词

Phylogenetic invariants 演化树代数不变量; Algebraic statistics 代数统计量; Semidefinite programming 半正定规划; Felsenstein zone Felsenstein参数区


Journal of Chinese Linguistics Monograph Series (ISSN 2409-2878), Number 27 (2017): 66-85
Copyright © 2017 Journal of Chinese Linguistics. All rights reserved.
