PlantAPAdb: A Comprehensive Database for Alternative Polyadenylation Sites in Plants1[CC-BY]

PlantAPAdb is a user-friendly database of alternative polyadenylation (APA) sites in plants, which will promote the elucidation of APA mechanisms, conservation, and gene expression regulation. Alternative cleavage and polyadenylation (APA) is increasingly recognized as an important regulatory mechanism in eukaryotic gene expression and is dynamically modulated in a developmental, tissue-specific, or environmentally responsive manner. Given the functional importance of APA and the rapid accumulation of APA sites in plants, a comprehensive and easily accessible APA site database is necessary for improved understanding of APA-mediated gene expression regulation. We present a database called PlantAPAdb that catalogs the most comprehensive APA site data derived from sequences from diverse 3′ sequencing protocols and biological samples in plants. Currently, PlantAPAdb contains APA sites in six species, Oryza sativa (japonica and indica), Arabidopsis (Arabidopsis thaliana), Medicago truncatula, Trifolium pratense, Phyllostachys edulis, and Chlamydomonas reinhardtii. APA sites in PlantAPAdb are available for bulk download and can be queried in a Google-like manner. PlantAPAdb provides rich information of the whole-genome APA sites, including genomic locations, heterogeneous cleavage sites, expression levels, and sample information. It also provides comprehensive poly(A) signals for APA sites in different genomic regions according to distinct profiles of cis-elements in plants. In addition, PlantAPAdb contains events of 3′ untranslated region shortening/lengthening resulting from APA, which helps to understand the mechanisms underlying systematic changes in 3′ untranslated region lengths. Additional information about conservation of APA sites in plants is also available, providing insights into the evolutionary polyadenylation configuration across species. As a user-friendly database, PlantAPAdb is a large and extendable resource for elucidating APA mechanisms, APA conservation, and gene expression regulation.

Alternative cleavage and polyadenylation (APA) is increasingly recognized as an important regulatory mechanism in eukaryotic gene expression and is dynamically modulated in a developmental, tissue-specific, or environmentally responsive manner. Given the functional importance of APA and the rapid accumulation of APA sites in plants, a comprehensive and easily accessible APA site database is necessary for improved understanding of APA-mediated gene expression regulation. We present a database called PlantAPAdb that catalogs the most comprehensive APA site data derived from sequences from diverse 39 sequencing protocols and biological samples in plants. Currently, PlantAPAdb contains APA sites in six species, Oryza sativa (japonica and indica), Arabidopsis (Arabidopsis thaliana), Medicago truncatula, Trifolium pratense, Phyllostachys edulis, and Chlamydomonas reinhardtii. APA sites in PlantAPAdb are available for bulk download and can be queried in a Google-like manner. PlantAPAdb provides rich information of the whole-genome APA sites, including genomic locations, heterogeneous cleavage sites, expression levels, and sample information. It also provides comprehensive poly(A) signals for APA sites in different genomic regions according to distinct profiles of cis-elements in plants. In addition, PlantAPAdb contains events of 39 untranslated region shortening/lengthening resulting from APA, which helps to understand the mechanisms underlying systematic changes in 39 untranslated region lengths. Additional information about conservation of APA sites in plants is also available, providing insights into the evolutionary polyadenylation configuration across species. As a user-friendly database, PlantAPAdb is a large and extendable resource for elucidating APA mechanisms, APA conservation, and gene expression regulation.
Cleavage and polyadenylation of precursor mRNA is a critical posttranscriptional process for mRNA maturation, which involves cleavage at the 39 end of a nascent transcript followed by the addition of a tract of adenosines [poly(A) tail]. Accumulating genomic studies, particularly those based on protocols capturing transcript 39 end (hereinafter referred to as 39 seq), have documented that most eukaryotic genes have multiple poly(A) sites (Ji et al., 2015;Tian and Manley, 2017;Gruber and Zavolan, 2019), offering the possibility of utilizing alternative cleavage and polyadenylation (APA) sites for gene expression regulation. APA has been indicated as a key layer of gene expression regulation by affecting mRNA stability, localization, exportation, and translational efficiency (Tian and Manley, 2017). APA also contributes to increased transcriptome complexity by generating diverse transcript isoforms with variable 39 untranslated regions (UTRs) and/or distinct coding potentials (Mayr, 2017;Tian and Manley, 2017). In animals, more than 70% of mammalian genes (Derti et al., 2012;Hoque et al., 2013) and ;50% of genes in fruit fly (Drosophila melanogaster) and zebrafish (Danio rerio; Li et al., 2012;Smibert et al., 2012;Ulitsky et al., 2012) possess more than one poly(A) site. APA is also prevalent in plants. For example, 55% of genes in moso bamboo (Phyllostachys edulis; Wang et al., 2017), more than 60% of genes in Medicago truncatula (Wu et al., 2014), and ;70% of genes in Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa) undergo APA (Shen et al., 2008a(Shen et al., , 2011Wu et al., 2011;Fu et al., 2016;Zhou et al., 2019). An increasing body of evidence has indicated that APA is dynamically modulated in a developmental, tissue-specific, or environmentally responsive manner. Shortening or lengthening of 39 UTRs mediated through APA, also called APA site switching, has been found in a number of biological processes and diseases, such as cell activation, growth, proliferation, cancer, and neurodegenerative disorders (Tian and Manley, 2013;Gruber et al., 2014;Mayr, 2017). Alternative 39 UTRs can also differentially modulate gene expression through the binding of RNA-binding proteins (Mayr, 2017). Shorter 39 UTRs that escape regulation by 39 UTR elements, such as microRNA-binding sites, AU-rich elements, and GU-rich elements, may be more stable than their corresponding longer 39 UTRs and result in the production of more protein (Mayr and Bartel, 2009;Mayr, 2017). In plants, APA has been demonstrated in the regulation of genes that are involved in processes such as flowering time control, amino acid biosynthesis, plant incompatibility, and oxidative stress responses (Thomas et al., 2012;Lin et al., 2017;Hong et al., 2018;Fu et al., 2019;Zhou et al., 2019). Given the importance of APA for gene expression and cellular function as well as the rapid accumulation of APA sites in diverse species, it is necessary that we have a comprehensive and easily accessible catalog of poly(A) sites with quantified expression levels on a genome-wide scale.
Currently, several poly(A) site databases are available, the majority of which are for animal species. Early databases, such as polyA_DB (Zhang et al., 2005), PACdb (Brockman et al., 2005), and polyA_DB2 (Lee et al., 2007), were built upon cDNAs and ESTs, which are limited in data scale. With the development of various 39 seq technologies (for review, see Ji et al., 2015), a few databases recording poly(A) sites from 39 seq are emerging, such as APADB , APASdb (You et al., 2015), PolyAsite (Gruber et al., 2016), and PolyA_DB 3 (Wang et al., 2018b). However, these databases mainly target nonplant species and typically provide little or limited integration with poly(A) sites in plants. Previously, we presented the PlantAPA database , which stores poly(A) sites derived from RNA sequencing (RNA-seq), ESTs, 454, and 39 seq for four plant species. Whereas PlantAPA helps improve the initial understanding of the prevalence of APA in plants and contributes to the analysis of APA dynamics, it has several limitations. It is restrictive in data scale, as it fails to cover the entire set of 39 seq data available to date. For example, poly(A) sites from rice and Chlamydomonas (Chlamydomonas reinhardtii) were collected from ESTs and RNA-seq, which are of limited quantities and cover only a few samples. Moreover, there is no unified data-processing and identification process for poly(A) sites from 39 seq, hindering the integration of emerging 39 seq data. Although users are allowed to analyze APA site switching and extract relevant sequences by uploading their own data, results of APA switching, poly(A) signals, and sequences for public data are not available for download in PlantAPA. Furthermore, the annotation of poly(A) sites is not comprehensive and lacks information of poly(A) signal, conservation, or sample-specific expression levels.
Here, we present a database named PlantAPAdb (accessible through http://bmi.xmu.edu.cn/plantAPAdb or http://www.bmibig.cn/plantAPAdb), which provides a comprehensive and manually curated catalog of APA sites in plants based on a large volume of data derived from diverse biological samples generated by 39 seq. Currently, PlantAPAdb contains APA sites in six organisms, rice (japonica and indica), Arabidopsis, M. truncatula, Trifolium pratense (forage legume red clover), moso bamboo, and Chlamydomonas. It details APA sites in different tissues from diverse conditions, with rich information including their genomic locations, evidence of supported 39 seq reads, poly(A) signals, and conservation of APA between species. Data in PlantAPAdb are available for bulk download and can be queried in a Google-like manner. Plan-tAPAdb also stores detailed 39 UTR lengthening or shortening events between conditions, providing a rich resource for the study of mechanisms affecting the 39 UTR length and APA-mediated gene regulation. Moreover, PlantAPAdb provides conservation information of APA sites across plants, which helps elucidate the evolutionary significance of APA. In addition, poly(A) signals and related sequences for APA sites located in different genomic regions are available in PlantAPAdb, which contributes to indepth analysis of cis-elements related to APA, especially those in non-39 UTRs such as introns and intergenic regions. Particularly, PlantAPAdb incorporates a rich repertory of back-end programs and standard pipelines, which makes it highly expandable to incorporate emerging 39 seq data sets in the future. As a user-friendly database, PlantAPAdb is a large and extendable resource for improving genome annotation and for elucidating APA mechanisms, APA conservation, and APA-mediated gene expression regulation.

Poly(A) Site Data Sets in PlantAPAdb
Poly(A) sites from PlantAPAdb were all collected from 39 seq, which are of much higher confidence and larger scale than those from ESTs or RNA-seq. The homepage of PlantAPAdb (http://bmi.xmu.edu.cn/ plantAPAdb/index.php) summarizes curated poly(A) site data involved in diverse cell and tissue types, biotic and/or abiotic stress responses, and physiological conditions in six organisms (Table 1). Full information of the samples used in PlantAPAdb is presented in Supplemental Data Set S1. As an example, for Arabidopsis, poly(A) sites were obtained from a total of 37 samples from 86 experiments (Fig. 1A). Mapped 39 end reads within 24 nucleotides of each other were grouped into poly(A) site clusters (PACs; see "Materials and Methods"). After stringent filtering, the pooled sample contains 91,946 high-confidence PACs supported by more than 300 million reads. Detailed statistical results, such as the distribution of PACs in different genomic regions, number of genes with different number of PACs, and single-nucleotide profiles of PACs in different regions, are also available by clicking the Stats button on the summary table of a given species from the homepage (Fig. 1B).
The APA catalog page (http://bmi.xmu.edu.cn/ plantAPAdb/APAcatalog.php) catalogs all PAC data sets, which are well organized by different categories such as sequencing protocols, tissues, environmental conditions, plant types, and relevant studies ( Fig. 2A). For instance, if users are interested in rice PACs related to the study of Fu et al. (2016), they can click the Fu H, 2016 button in the Studies category to view all relevant data sets. Data sets are displayed by data set cards organized by tissue types. By clicking the Stats button at the bottom of a data set card [e.g. 20 Days Leaf (SRP073467)], users can view detailed descriptive and statistical information of the data set in a pop-up page named Statistics (Fig. 2B). Detailed information is given, including the sequencing protocol, tissue type, relevant study, literature, number of PACs, mapping information of the raw data, and external links to the National Center for Biotechnology Information (NCBI). Alternatively, by clicking the Download button in the data set card or the Statistics web page, users can download all potential PACs and high-confidence PACs in bed format or with full genome annotation (Fig. 2C). The list of heterogeneous cleavage sites before grouping into PACs is also available for download, which is especially useful for analyzing the microheterogeneity phenomenon of polyadenylation or testing the impact of a clustering procedure. Moreover, users can browse online high-confidence PACs in the genome browser by clicking the JBrowse button in the Statistics web page without downloading the full data (Fig. 2C).

Search and Browse
PlantAPAdb provides three strategies to query the database: full-text, limited-keyword, and sample-centric.
Users can conduct a full-text search by choosing All databases from the drop-down list next to the search box without specifying any field. When queried by the full-text strategy, PlantAPAdb returns search results relevant to the provided keyword by automatically scanning commonly used information such as gene identifier, gene symbol/alias, gene description, gene function, and Gene Ontology (GO) identifier. Users can also conduct a limited-keyword search by limiting a field of gene, GO identifier, or species. All queries except for the sample-centric search will lead to a media page that shows the number of PACs in each species. Take the Yellow-Green Leaf8 (YGL8) gene, for example, which has been reported to have a strong impact on chlorophyll content and photosynthesis in rice . This gene has multiple poly(A) sites and exhibits 39 UTR APA site switching between two rice subspecies . Users can directly search for YGL8 and then select Oryza sativa Japonica Group from the search result page (Fig. 3A). Then a media page opens, which shows a detailed gene list and the number of PACs in each gene associated with the input keyword in the selected species. By clicking a gene of interest (here Os01g0279100), a new page opens to show detailed information of the gene and its PACs, including the description of the gene, relevant GO identifiers, external links to Ensembl Plants and AmiGO (http://amigo.geneontology.org), and the PAC list ( Fig. 3B). There are four PACs in YGL8; all are located in 39 UTRs or extended 39 UTRs. Among the four PACs, the proximal one is the most dominant PAC, which has a much higher expression level than other PACs. This result is consistent with that in the previous study . In addition, by clicking the JBrowse link of a PAC item in the PAC list ( Fig. 3B), users can browse the PAC and the related genome sequence and gene model in the JBrowse genome browser. Alternatively, users can also access the browser by clicking the APA browse tab in the main menu. Selected samples (tracks) from each species can be quickly loaded and graphically browsed. Users can zoom in on a particular genomic region by searching a gene or chromosome fragment via the search box at the top of the browser. Data tracks of interest can be downloaded onto a local computer.
In addition to the full-text and limited-keyword strategies, users can also conduct a sample-centric search using a keyword of tissue/sample (e.g. root), which will lead to a media page that lists numbers of matched samples in all species (Supplemental Fig.  S1). Users can further click a species name to view descriptive information of all matched samples, including sequencing protocol, tissue type, experiment design, and external links to NCBI. By clicking the Catalog button on the page, users can view detailed information of the selected sample and download PACs as in the APA catalog module. Collectively, the one-click search provided in PlantAPAdb is convenient and powerful for data query, which returns informative and well-organized search results that can be easily browsed, downloaded, and visualized.

Poly(A) Signals of Poly(A) Sites
Over the last decade, numerous genomic studies have revealed various cis-elements that control polyadenylation in a wide range of mechanisms, suggesting both conserved and divergent poly(A) signals across species (Xing and Li, 2011;Tian and Graber, 2012;Ji et al., 2015). However, the majority of studies focused on poly(A) signals of canonical poly(A) sites that are located in 39 UTRs, whereas little emphasis has been laid on poly(A) signals of noncanonical sites. In Plan-tAPAdb, poly(A) signals for both 39 UTR and non-39 UTR APA sites for different plant species were identified, jointly considering various cis-element regions surrounding poly(A) sites that have been discovered in plants. Full lists of poly(A) signals for poly(A) sites in different genomic regions and sequences surrounding poly(A) sites are available for download in PlantAPAdb.
The APA signal module (http://bmi.xmu.edu.cn/ plantAPAdb/APAsignal.php) provides comprehensive poly(A) signals for different species. According to previous studies on plant poly(A) signals (for review, see Xing and Li, 2011), two schemes of poly(A) signals, one for Chlamydomonas and the other for other plant species, were present in PlantAPAdb. For each species, poly(A) signal patterns for APA sites in different genomic regions, including 39 UTR, 59 UTR, coding sequence (CDS), and intron, were provided. For example, following the information displayed on the web site, users can easily view poly(A) signals in near-upstream element (NUE) of 39 UTR poly(A) sites in Arabidopsis by choosing 39 UTR and NUE on the web page (Fig. 4, top). Single-nucleotide compositions and the 20 most frequently occurring patterns in the selected NUE region are intuitively visualized (Fig. 4, middle). Based on the poly(A) signal model in Arabidopsis (Loke et al., 2005), hexamers (e.g. AAUAAA) were scanned in NUE. Accordingly, statistical information of each hexamer is given, such as the frequency of occurrence, occurrence probability, and number of overlapping occurrences (Fig. 4,bottom). Similarly, users can also choose Chlamydomonas from the drop-down list on the web page to view poly(A) signals in the NUE of Chlamydomonas (Supplemental Fig.  S2). Different from Arabidopsis, the most dominant poly(A) signal pattern in Chlamydomonas is UGUAA instead of AAUAAA (Supplemental Fig. S2, bottom). Moreover, users can easily view poly(A) signals in different cis-element regions of noncanonical APA sites that are located in non-39 UTRs, such as 59 UTR, CDS, and intron.
In addition to the poly(A) signals provided in Plan-tAPAdb, users can also retrieve respective sequences for custom poly(A) signal analysis. Through the APA sequence page (http://bmi.xmu.edu.cn/plantAPAdb/ APAsequence.php), users can download genomic sequences surrounding PACs onto their local computers. Sequences of PACs from different species in different genomic regions, including 39 UTR, 59 UTR, CDS, and intron, were exportable. Particularly, sequences of PACs located in extended 39 UTRs were also provided, which may be useful for the study of cis-regulatory elements involved in 39 UTR extension. Moreover, if users would like to identify poly(A) signal patterns from a custom region, they could download poly(A) sequences and then use scripts provided in PlantAPAdb and/or other existing tools for poly(A) signal identification.
Collectively, PlantAPAdb provides a flexible and intuitive way to show comprehensive poly(A) signals in plants. The categories of these diverse poly(A) signals in different plant species would provide new insights into the conservation of APA and the effect of poly(A) signals on polyadenylation machineries. Particularly, poly(A) signals for APA sites in non-39 UTRs, such as introns, CDS, and 59 UTRs, would provide valuable resources for indepth analysis of noncanonical APA sites and their functions in gene regulation.

Poly(A) Site Usage
APA has been shown to be modulated in a tissueand/or developmental stage-specific manner (Tian and Manley, 2013;Gruber et al., 2014;Mayr, 2017). The APA metric module (http://bmi.xmu.edu.cn/plantAPAdb/ APAmetric.php) provides four PAC-level metrics and two gene-level metrics to quantify the relative poly(A) site usage, tissue specificity of a PAC, or 39 UTR length change of a gene across samples (see "Materials and Methods"). The PAC-level metrics include number of samples expressed, percentage of samples expressed, ratio, and sample specificity. The two gene-level metrics are relative usage of distal PAC and weighted 39 UTR length. Take the PAC-level metric of tissue specificity, for example. Users can view the tissue specificity of each PAC across all samples in a species (e.g. rice japonica) by choosing PAC index from the Type drop-down list and clicking the Tissue specificity radio button (Fig. 5, top). Different filtering conditions, such as total read number in all samples and percentage of expressed samples, can be set to filter the results. Top PACs ranked by the metric score (here the Shannon entropy score) can be visualized by a heatmap that shows the variation of tissue specificity across samples (Fig. 5, bottom). The full list of PACs and the data corresponding to the heatmap can also be downloaded. By clicking the JBrowse link of a PAC in the list, users are linked to the genome browser to view distributions of PACs and their expression levels across samples on the respective gene. Users can also filter PACs by their genomic locations (Fig. 5, top), which can further view the variability of tissue specificity across samples for both canonical 39 UTR PACs and noncanonical PACs located in non-39 UTRs. Therefore, through the APA metric module, users can easily retrieve PACs or genes specific to given samples by different methods and filtration conditions. 39 UTR Shortening/Lengthening Events APA has been indicated to regulate the expression of genes containing APA sites through impacting mRNA metabolism and protein localization (Tian and Manley, 2017). Modifications in the length of 39 UTRs, mediated through APA, have been found to play a fundamental role in posttranscriptional regulation in diverse tissues and developmental stages. Analysis of 39 UTR switching has become a routine process in many APA studies (Fu et al., 2011(Fu et al., , 2016Gruber et al., 2014;Lin et al., 2017;Zhou et al., 2019); however, there seems to be no universal strategy to identify 39 UTR lengthening/shortening events. In PlantAPAdb, we implemented two methods that have been widely applied in previous studies to identify 39 UTR switching by leveraging the comprehensive APA sites from various biological samples available in our database.
The APA switching module (http://bmi.xmu.edu.cn/ plantAPAdb/APAswitch.php) provides comprehensive lists of genes and PACs involving 39 UTR shortening/ lengthening. Users are free to choose one species and a pair of samples to obtain the list of APA site switching genes. Different filtering conditions (e.g. adjusted P value and log fold change) can be set to filter the results. 39 UTR switching events between the selected two conditions are Figure 5. The APA metric module. A list of poly(A) sites or genes can be obtained by choosing a metric (top). Top poly(A) sites ranked by the metric score (here the Shannon entropy score) can be visualized by a heatmap that shows the variation of tissue specificity across samples (middle). By clicking the buttons outlined in red, the full data and the data corresponding to the heatmap can be downloaded. This example is available at http://bmi.xmu.edu.cn/plantAPAdb/APAmetric.php.
visualized by a scatterplot that shows both shortening and lengthening events. Moreover, the list of genes with significant 39 UTR switching is present in a table and can be downloaded, including information such as the respective gene identifier, 39 UTR lengths, expression levels of involved PACs, and relevant statistical information. Through the APA switching module, users can easily retrieve 39 UTR switching results by different methods and filtration conditions. Therefore, consensus or combined results from multiple methods can be obtained to mitigate potential bias caused by different methods.
Next, we utilized an example to show the retrieval of APA site switching events. A previous study (Fu et al., 2016) investigated APA profiles in 14 different rice tissues and developmental stages and found a distinct pattern of APA usage in mature pollen. Here, we attempted to identify APA site switching genes between mature pollen and anther using the linear trend method. By choosing the two sample groups from the drop-down lists and selecting the Linear Trend method on the web page (Fig. 6, top), 22 genes with significant 39 UTR shortening/lengthening were obtained. These genes are visualized by a scatterplot to show the extent of 39 UTR shortening/lengthening (Fig. 6, middle). The full list of genes is provided in a table, which clearly shows expression levels of each gene and related PACs in both samples (Fig. 6, bottom). By clicking the link of a gene in the list, users can view distributions of PACs and their expression levels between the two investigated samples of this gene in a genome browser.

APA Conservation
The conservation information of APA sites across species is crucial for understanding the evolutionary importance of APA and the divergence in gene expression. However, the majority of previous studies focused on conservation of gene expression profiles across species, and there is scarce work done at the level of APA, especially in plants. The collection of high-confidence APA sites in PlantAPAdb would provide advantages to study APA conservation in a wider range of plant species. The conservation information of APA would provide insights into the importance of APA in transcriptome diversification and help elucidate the extensive morphological and functional differences among plant species.
The APA conservation module (http://bmi.xmu. edu.cn/plantAPAdb/APAconservation.php) provides the conservation information of APA sites across species. Users can browse conserved PACs by species and genomic regions (Fig. 7A). By clicking the Stats link in a data set card, conserved PACs and statistical information can be shown in a pop-up window (Fig. 7B). For each PAC, the coordinates of both the original species and the reference species (here Arabidopsis) were recorded. Moreover, the conservation status (conserved or nonconserved) and the PAC counterparts in other species of the given PAC are also given. Users can also download the list of conserved PACs to their local computers.

Bulk Download
In addition to cataloging PACs from individual samples, PlantAPAdb also provides the PAC list of the pooled sample for bulk download (http://bmi.xmu. edu.cn/plantAPAdb/Bulkdownload.php; Supplemental Fig. S3). For the pooled data, different files are available for download to meet the user's needs. The file in bed format records simple information such as chromosome, strand, coordinate, and total number of reads for each PAC. The file in text format tabulates full information of each PAC, including the total read count, the raw and normalized read count in each individual experiment, the respective gene, the genomic location (CDS,intron,39 UTR,59 UTR,or intergenic), and the distance to neighboring genes (if the PAC is located in the intergenic region). Moreover, the file of heterogeneous cleavage sites in bed format is also provided, which allows users to inspect the polyadenylation in higher resolution. In addition to PACs, a conserved PAC list for each species is also available for download.

DISCUSSION
We present a comprehensive database of APA sites in plants called PlantAPAdb, which catalogs the most comprehensive plant poly(A) site data sequenced from diverse 39 seq protocols and biological samples. Poly(A) sites are sorted by categories such as experimental studies, sequencing protocols, and biological samples, facilitating the retrieval of data for specific usage.
PlantAPAdb also provides rich annotation information of genome-wide poly(A) sites, including genomic locations, heterogeneous cleavage sites, expression levels, related poly(A) signals, sample information, and conservation information. APA sites and cleavage sites from the pooled sample and individual samples are available for bulk download as flat files.
PlantAPAdb not only provides various kinds of data for download but also provides biologists with an easyto-use web service for data query, visualization, and browsing. A convenient and powerful one-click search based on the full-text search technique is integrated in PlantAPAdb for data query. Search results can then be visualized in their genomic context via the JBrowse genome browser. In addition, implemented with dynamic interaction technologies, each kind of data [e.g. poly(A) signals] are arranged in different categories and can be browsed in a card-like manner. In the card-like view, different categories are shown in individual cards with respective statistical information and download option. Moreover, PlantAPAdb provides an overview of the data by presenting rich statistics and intuitive charts. A wealth of help information and references are available throughout the web site for the user's reference. These user-friendly features greatly improve the ease of use and data retrieval of our database.
In addition to various function modules, PlantA-PAdb also incorporates a rich repertory of back-end programs that facilitate processing, annotation, and analyses of APA sites. A uniform and flexible processing pipeline was designed for accurately identifying and quantifying high-confidence poly(A) sites from 39 seq.
PlantAPAdb also integrates a standard procedure for parsing a genome annotation file obtained from the unified portal (Ensembl Plants), facilitating the automatic annotation of poly(A) sites in a universal way. Figure 6. The APA switching module. The top part shows parameters for retrieving an APA switching list between two sample groups. The middle part shows a scatterplot visualizing both 39 UTR shortening and lengthening events between the selected two conditions. The bottom part tabulates the list of genes with significant 39 UTR switching. By clicking the link of a gene in the list, users can view this gene in the genome browser. This example is available at http://bmi.xmu.edu.cn/plantAPAdb/APAswitch.php. These resources make PlantAPAdb highly expandable for the incorporation of emerging 39 seq data sets in the future. Relevant scripts are available for download in PlantAPAdb, which allows users to perform similar analyses conducted in PlantAPAdb for their own data. However, it should be noted that there is no single best pipeline for poly(A) site identification from 39 seq data. In different studies on 39 seq, researchers may use different alignment tools for read mapping, different strategies for removing internal priming artifacts, different algorithms for grouping cleavage sites, and different criteria for filtering high-confidence poly(A) sites (Ji et al., 2015). Currently, there is no criterion or benchmark study to evaluate existing pipelines for identifying poly(A) sites. Additional work will be needed in the future to design an objective comparative study for a better evaluation of diverse poly(A) site identification pipelines.
With accurate identification and comprehensive annotation of APA sites from a large volume of 39 seq data in plants, as well as conserved APA configuration across various species, PlantAPAdb will be valuable for annotating gene 39 ends and understanding APA-mediated gene regulation. Future work will involve adding APA sites from more species and more diverse cell/tissue types upon the availability of new 39 seq data. Recently, a wide range of computational tools for identifying APA sites from RNA-seq data have continued to emerge, and an increasing number of APA sites have been found beyond annotated sites from 39 seq data (Chen et al., 2019). APA sites were also found from single-molecule sequencing (Wang et al., 2017(Wang et al., , 2018a. At present, PlantAPAdb mainly focuses on 39 seq data; we expect to leverage the merit of unprecedented RNA-seq and single-molecule sequencing data to expand our compendium of APA sites from more diverse species and biological samples in the future.

A Uniform Pipeline to Identify Poly(A) Sites with 39 Seq Data
Publicly available 39 seq data were mainly downloaded from the NCBI Sequence Read Archive (www.ncbi.nlm.nih.gov/sra). The latest genome assemblies for all species except for moso bamboo (Phyllostachys edulis; Peng et al., 2013) were downloaded from Ensembl Plants (http://plants.ensembl.org). We designed a bioinformatics pipeline to facilitate the uniform processing of 39 seq data obtained from diverse experimental protocols (Supplemental Fig. S4A). First, a quality control check was performed to examine the data quality. Barcodes of each file were then examined according to the description in the relevant study and/or by an in-house Perl script, and the file was split into subfiles if barcodes were present. Next, A/T stretches in the raw file were trimmed off. For samples generated with protocols that assign a 59 adapter sequence in the reads, reads with a T stretch at the 59 end were filtered and the T stretch was trimmed off. For data with 39 adapter, reads with an A stretch at the 39 end were filtered and the A stretch was trimmed off. Then Trimmomatic (v0.38;Bolger et al., 2014) was adopted for quality trimming and reads shorter than 25 nucleotides were discarded. Remaining reads were then aligned to the corresponding genome assembly using STAR (v2.6.0a; Dobin et al., 2013). Only uniquely mapped reads were retained, and coordinates of candidate cleavage sites were obtained. Finally, those sites representing potential internal priming artifacts, with six consecutive adenines or more than six adenines within the 210 to 110 nucleotide window from the cleavage site, were filtered out. Next, qualified 39 end cleavage sites were grouped into PACs to reduce the impact of microheterogeneity. For each individual sample or the pool of samples, cleavage sites within 24 nucleotides of each other were merged into PACs. Finally, relevant information of each PAC was recorded, including the start and end coordinates of the cluster, the coordinate of the dominant cleavage site, and the number of supported reads in each sample. Detailed commands and relevant scripts were provided in our PlantAPAdb web site.
To quantify the usage of each PAC, we calculated the raw expression level (number of reads) and normalized expression level in tags per million for each individual sample. The mean tags per million value across all samples for each PAC was also calculated. We also calculated for each PAC the percentage of samples expressed, which is defined as the ratio of the number of samples with raw expression level $ 2 to the total number of samples. We defined highconfidence PACs as those PACs with normalized read count $ 1 in two or more experiments. Both raw PACs and high-confidence PACs are available for download in PlantAPAdb, whereas most analysis results, such as poly(A) signals and 39 UTR switching results, are based on high-confidence PACs.

Annotation and Quantification of Poly(A) Sites
Poly(A) sites were annotated with respective genes and genomic locations based on the latest genome annotations (Supplemental Fig. S4B). An R script based on the GenomicFeatures R package was implemented to uniformly process the genome annotation file in GFF3 format. Annotation for both protein-coding genes and noncoding genes was parsed. PACs were then annotated based on their genomic locations (i.e. 39 UTR, coding sequence, and intron for protein-coding genes, exon for noncoding genes, and intergenic region). To resolve annotation ambiguity owing to multiple transcripts from the same gene, we annotated a PAC based on the following priority: 39 UTR, CDS, intron. Particularly if a PAC is located in an intergenic region, we recorded the neighboring genes and its distance from the 39 end of the nearby 59 gene and the distance from the 59 end of the nearby 39 gene. This strategy to annotate intergenic PACs allows annotation of PACs with higher flexibility to recruit PACs falling within extended 39 UTRs.

Quantification of Poly(A) Site Usage
We adopted four PAC-level metrics and two gene-level metrics to quantify the dynamics of APA across samples. The PAC-level metrics include NSE (number of samples expressed), PSE (percentage of samples expressed), Ratio, and Sample Specificity. NSE of a PAC is calculated as the number of samples in which the PAC is expressed. PSE of a PAC is calculated as the ratio of NSE to the total number of samples. Ratio is the relative usage of a PAC in a gene, which is calculated as the ratio of the expression level of a PAC to the total expression level of the respective gene. Sample Specificity (Ni et al., 2013;Ji et al., 2018) of a PAC is a Shannon entropy score denoting the overall sample specificity: where n is the number of samples and p s is the ratio of the expression level of the PAC in sample s to the total expression level of this PAC in all samples. Then the specificity of a PAC for sample s can be calculated as: Q5H 2 log 2 ðP s Þ. A lower H or Q score means higher sample specificity. The gene-level metrics include RUD (relative usage of distal PAC) and WUL (weighted 39 UTR length). RUD (Ji et al., 2009) of a gene in a sample s is calculated as the ratio of the number of 39 reads of the distal PAC in sample s to the number of total reads of proximal and distal PACs in sample s. Here, only genes with at least two 39 UTR PACs were used. Proximal and distal PACs are defined as the two most abundant 39 UTR PACs or the two most distant 39 UTR PACs. The RUD score represents the relative 39 UTR length for a gene in a sample, with higher RUD indicating longer 39 UTR. WUL (Ulitsky et al., 2012;Fu et al., 2016) of a gene in a sample s is calculated as the average 39 UTR length of all 39 UTR PACs in this gene weighted by the number of supported 39 reads of each PAC.

Detection of 39 UTR Shortening/Lengthening Events
We adopted two strategies for detecting 39 UTR shortening/lengthening events (also called 39 UTR switching or APA site switching) from samples with or without replicates (Supplemental Fig. S4C). The first strategy is based on the x 2 test for linear trend in proportions, which is applicable for samples without replicates (replicates were averaged first). This method has the advantage to consider both abundance and 39 UTR length of all 39 UTR PACs in a gene, which was adopted in several previous APA studies (Fu et al., 2011(Fu et al., , 2016Ye et al., 2018;Zhou et al., 2019). Briefly, PACs in a gene are sorted by the respective 39 UTR length (denoted as score). A contingency table of read count is then created with rows representing the indexes of samples and columns denoting the scores. Next the x 2 test for trend in proportions is performed with R function prop.trend.test, and the Pearson correlation r is obtained using the read count in the table as the value and the score as the coordinate. The correlation r ranges from 21 to 1, with larger absolute value indicating a higher extent of 39 UTR shortening/lengthening. Finally, genes with adjusted P value smaller than a given cutoff (e.g. 0.05) are considered as genes with significant 39 UTR shortening/lengthening.
The second strategy is applicable for samples with replicates, which extends the differential expression (DE) results from DESeq2 to identify genes with 39 UTR shortening/lengthening events. To detect DE PACs, genes with only one PAC were discarded. Then DESeq2 (Anders and Huber, 2010) was used for DE PAC identification. For each PAC in APA genes, both P value and adjusted P value were obtained and PACs with adjusted P value below a given cutoff were considered as DE PACs. To detect 39 UTR shortening/lengthening events, 39 UTR APA genes with at least one DE PAC were filtered. Given a pair of 39 UTR PACs i and j between two samples a and b, the respective expression levels are denoted as E i;a ; E i;b ; E j;a ; and E j;b . The relative change for this pair of PACs is calculated by A Fisher's exact test was also performed to test the significance of differential usage of PACs i and j between samples a and b, and a P value was obtained. Genes with jRC i;j j larger than a given threshold (e.g. 1) and a P value below a given cutoff (e.g. 0.05) were considered as genes with 39 UTR shortening/ lengthening events.
To facilitate users to obtain DE results between samples stored in PlantA-PAdb, we have precalculated all genes with 39 UTR shortening/lengthening events between each two samples with default parameters (e.g. adjusted P , 0.05 and jRC i;j j . 1). Users are free to download the 39 UTR switching list with selected samples and custom parameters to meet their own needs.

Identification of Poly(A) Signals
Poly(A) signals for poly(A) sites in different genomic locations of each species were identified (Supplemental Fig. S4D). According to previous studies (Loke et al., 2005;Shen et al., 2008b), we divided the six species into two groups: one is Chlamydomonas (Chlamydomonas reinhardtii) and the other contains the remaining species. In Chlamydomonas, four signal elements were reported: FUE (far upstream element), NUE, CE (cleavage element), and downstream element (Shen et al., 2008b). In the other five species with similar single-nucleotide profiles surrounding poly(A) sites, we used the signal model of Arabidopsis (Arabidopsis thaliana; Loke et al., 2005) that contains FUE, NUE, and CE. The choice of signal region range and the respective length of signal patterns are based on the observations in previous studies (Loke et al., 2005;Shen et al., 2008b). To identify statistically significant signal patterns in a given poly(A) signal region, we applied an oligo analyzer called RSAT (Thomas-Chollier et al., 2008). We classified poly(A) sites of each species into four groups based on their genomic locations (39 UTR, CDS, intron, and 59 UTR) and then obtained 50 topranked signal patterns by RSAT for each signal element in each group of poly(A) sites.

Conservation of Poly(A) Sites
The strategy to identify conserved poly(A) sites is shown in Supplemental Figure S4E. We used the Arabidopsis genome as the reference and downloaded pairwise genome alignment chain files from Ensembl Plants. Then synthetic regions between other genomes and the reference genome were obtained. Next, coordinates of PACs of all other species were converted to coordinates of the reference genome. We adopted the reciprocal best match method (Wang et al., 2018b) to determine conserved PACs. Briefly, two PACs from two species were considered as orthologous if their distance was smaller than 24 nucleotides based on the whole-genome alignment. For each PAC in each species, the information of conservation is recorded, including the species with conserved sites and the corresponding PACs.

Database and Web Site Design
To ensure the smooth access of PlantAPAdb, we have built two mirror web sites, which can be accessed through http://bmi.xmu.edu.cn/plantAPAdb or http://www.bmibig.cn/plantAPAdb. Pipelines of poly(A) site identification and annotation were implemented by a series of Perl scripts and R scripts, which are available for download in PlantAPAdb. The JBrowse genome browser (Skinner et al., 2009) was embedded for browsing of poly(A) sites associated with genome annotations at the genome-wide level. We used Java-Script (https://www.javascript.com/) and jQuery (https://jquery.com/) to construct interactive web pages with simplified JavaScript programming. To ensure asynchronous data transmission between the web application and the server, Asynchronous Javascript and XML technology was extensively employed in PlantAPAdb. The one-click search was implemented based on Sphinx (http:// sphinxsearch.com/), an open-source full-text search engine. The back-end database was implemented by MySQL, which uses tables to store gene annotation files and poly(A) site data. We also utilized several integration plugins, such as Plotly (https://plot.ly/) and DataTable (https://datatables.net/), to enable the interactive display of data.

Supplemental Data
The following supplemental materials are available.
Supplemental Figure S1. The sample-centric search in PlantAPAdb.
Supplemental Figure S3. The bulk download module.
Supplemental Figure S4. Procedures for data processing in PlantAPAdb.
Supplemental Data Set S1. Full information of samples used in PlantAPAdb.