Research

I am generally interested in questions about the evolution of plants, genes, plant genomes and transcriptomes, and the predictive biology using machine learning.

What are the effects of genome assemblies, genotyping approaches, genetic variant types, population structures on genomic prediction? When predicting 20 complex traits in switchgrass (Panicum virgatum L.), a perennial biofuel feedstock, we found that short read-based genome assembly performs comparably to or even better than long read-based assembly; due to higher gene coverage, exome capture and multi-allelic variants outperform genotyping-by-sequencing and bi-allelic variants, respectively; tetraploid models show higher prediction accuracy than octoploid models for most traits, likely due to the greater genetic distances among tetraploids. These results have been published in Plant Physiol. pubmed

How do transcriptome and methylome contribute to trait prediction? Transcriptome- and methylome-based models had performances comparable to those of genome-based models in Arabidopsis thaliana, but identified different benchmark genes. Gene contributions to flowering time prediction are accession-dependent and distinct genes contribute to trait prediction in different genotypes. Models integrating multi-omics data perform best and reveal known and additional gene interactions, extending knowledge about existing regulatory networks underlying flowering time determination. These results have been published in Nat Commun. pubmed

Why are the structures of spiral flowers flexible? Using Nigella damascena (love-in-a-mist) as a model system, we revealed that the expression profiles of floral organ identify genes are highly dynamic, and the genes regulate the boundaries between different types of floral organs by forming a complex regulatory network. Notably, the AGL6-lineage member NdAGL6 is an A-function gene determining the identity of sepal and petal, rather than the AP1-lineage genes as in Arabidopsis thaliana. These results have been published in Nat Plants. pubmed

How can we measure Arabidopsis fitness efficiently? Using the Faster Region-based Convolutional Neural Network (R-CNN) algorithm, we established the image recognition models for Arabidopsis seeds and fruits, which yielded near perfect counts (r2 = 0.9996 and 0.980, respectively). These results have been published in New Phytol.pubmed

How well can the gene expression data be used to predict metabolic pathway memberships? To optimize the use of gene expression data to predict plant metabolic pathway memberships, we explored >600 tomato expression data combinations, three strategies for predicting memberships in 85 pathways. We found that optimal predictions for different pathways require distinct data combinations indicative of pathway functions. Unsupervised learning performed better than supervised approaches, while gene-to-pathway expression similarities led to prediction models that outperformed those based simply on expression levels. Predictions for experimental validated genes is significantly better than that for genes without experimental evidence, demonstrating the importance of data quality. Our study highlights the need to extensively explore expression-based features and prediction strategies to maximize the accuracy of metabolic pathway membership assignment. These results have been published in New Phytol. pubmed

How do gene family sizes vary among related plant species? Using the Solanaceae family as a model and Pfam domain families as a proxy for gene families, our findings reveal complex relationships between variation in gene family size, gene functions, duplication mechanism, and evolutionary rate: 1) genes in highly variable families have high turnover rates and tend to be involved in processes that have diverged between Solanaceae species, whereas genes in low-variability families tend to have housekeeping roles; 2) genes in high and low-variability gene families tend to be duplicated by tandem and whole genome duplication, respectively; 3) genes duplicated by different mechanisms experience different selection pressures. In addition, we found that a substantial number of pseudogenes are actually products of pseudogene duplication, contrary to the expectation that most plant pseudogenes are remnants of once-functional duplicates. These results can be available in Genome Biol Evol. pubmed

What are the factors influencing read coverage and misassembly in a genome assembled using short reads? In tomato, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of short-read based assembly had significantly higher and lower coverage compared to background, respectively. We first established machine learning models capable of predicting genomic regions with variable coverages and found that high coverage regions tend to have higher simple sequence repeat and tandem gene densities compared to background regions. We also found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled of duplicate sequences, compared to 1.4% in background regions. In addition, misassembled, high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements. Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads. These results have been published in BMC Genomics. pubmed

Towards the computational prediction of plant metabolic pathways Uncovering genes encoding enzymes responsible for the biosynthesis of diverse plant metabolites is essential for metabolic engineering and production of plant metabolite-derived medicine. With the availability of multi-omics data for an ever-increasing number of plant species and the development of computational approaches, the metabolic pathways of many important plant compounds can be predicted, complementing a more traditional genetic and/or biochemical approach. We summarized recent progress in predicting plant metabolic pathways using genome, transcriptome, proteome, interactome, and/or metabolome data, and the utility of integrating these data with machine learning to further improve metabolic pathway predictions. This review has been published in Current Opinion in Plant Biology. pubmed

Research progress on the molecular mechanisms underlying floral trait formation in Phalaenopsis orchids Phalaenopsis orchids are one of the most important ornamental crops, prized for their beautiful flowers and long flowering phase. Despite significant research efforts, progress has been hampered by several resource limitations, and the molecular mechanisms underlying the formation and diversity of most important horticultural traits in Phal. orchids remain largely unexplored. We summarized recent research advances, with a primary focus on the key floral traits in Phal. orchids, including inflorescence type, flowering time, floral organ organization, color patterning, size, longevity, scent, organ shape, cuticle production and wax biosynthesis. Furthermore, we offered perspectives on future research directions aimed at elucidating the genetic basis for the remarkable diversity of these traits and advancing molecular breeding in Phal. orchids. This review has been online in Horticulture Research. Horticulture Research

Peipei Wang

I am generally interested in questions about the evolution of plants, genes, plant genomes and transcriptomes, and the predictive biology using machine learning.