Abstract Detail

A synthesis of new paleontological and phylogenomic perspectives on gymnosperm evolution

Endara, Lorena [1], Cui, Hong [2], Burleigh, Gordon [1].

A phenomic matrix for Gymnosperms: Lessons and challenges of using a semi-automated Natural Language Processing approach.

Understanding the evolution of phenotypic characters is necessary to elucidate the genealogy of life. We used the gymnosperms as a large-scale case study to evaluate the performance of a Natural Language Processing (NLP) pipeline to extract phenotypic information from across the plant tree of life. We generated a phenotypic matrix for ~1100 extant species of gymnosperms using a semi-automated NLP pipeline specifically designed to extract phenotypic traits from the text in taxonomic descriptions, which are written in an abbreviated syntax and non-standardized language. First, we uploaded the text of the taxonomic descriptions to the ETC website (http://etc.cs.umb.edu/etcsite/); the software then used an unsupervised algorithm to semantically annotate the text. These annotations were used to generate a 'taxon x character' matrix that contained characters and character states extracted from the source text. We evaluated the usefulness and homology of the data from the preliminary matrix, and discretized and coded it. Compared to other approaches for assembling phenotypic data that target the extraction of predefined traits, this novel approach analyzes the complete information from the taxonomic descriptions, thus facilitating character discovery. We describe the different strategies used to optimize our ability to obtain phenotypic data for gymnosperms and the challenges we encountered as the NLP software dealt with increasing numbers of phylogenetically distant taxa and variation in the use of language across the taxonomic literature. This large-scale analysis also offers insights into how we can improve the way we use language to write descriptions.
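As a rough illustration of the 'taxon x character' step, the sketch below assembles a small matrix from annotated character statements. It is not the ETC software or its output format: the (taxon, character, state) triples, the `build_matrix` helper, the "?" missing-data symbol, and the "|" separator for multiple states are all assumptions made for the example, and the values are toy data.

```python
from collections import defaultdict

# Toy annotations as (taxon, character, state) triples.
# This layout is an assumption for illustration, not the ETC output format.
annotations = [
    ("Pinus strobus", "leaf shape", "needle-like"),
    ("Pinus strobus", "leaves per fascicle", "5"),
    ("Ginkgo biloba", "leaf shape", "fan-shaped"),
    ("Ginkgo biloba", "leaf shape", "bilobed"),
]


def build_matrix(annotations, missing="?"):
    """Collect states per (taxon, character) and lay them out as a matrix.

    Returns the sorted taxa, the sorted characters, and a nested dict
    matrix[taxon][character] holding the observed states joined by "|",
    or the `missing` symbol when a description never mentions the character.
    """
    states = defaultdict(set)
    for taxon, character, state in annotations:
        states[(taxon, character)].add(state)

    taxa = sorted({taxon for taxon, _, _ in annotations})
    characters = sorted({character for _, character, _ in annotations})

    matrix = {}
    for taxon in taxa:
        row = {}
        for character in characters:
            observed = states.get((taxon, character))
            row[character] = "|".join(sorted(observed)) if observed else missing
        matrix[taxon] = row
    return taxa, characters, matrix


taxa, characters, matrix = build_matrix(annotations)
for taxon in taxa:
    print(taxon, [matrix[taxon][c] for c in characters])
```

Because every character found in any description becomes a column, rather than only a predefined list, a layout like this reflects the character-discovery idea described above; cells left as "?" and cells with several states are the kind of entries that would still need manual evaluation, discretization, and coding.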

1 - University of Florida, Department of Biology, Gainesville, FL, 32611, USA
2 - University of Arizona, School of Information, PO Box 210074, Tucson, AZ, 85719, USA

Natural Language Processing
Phenomic matrix
Technical botanical vocabulary

