ANAP: An Integrated Knowledge Base for Arabidopsis Protein Interaction Network Analysis


Protein interaction networks can provide a global view of cellular processes, thus facilitating the study of complex, dynamic biological systems (Jansen et al., 2003). Interactions between proteins can be direct physical interactions or indirect interactions, which may involve intermediate molecules. For example, if proteins A and B interact directly, and B and C also interact directly, then A and C interact indirectly. These interactions are key to cellular events associated with protein localization, translation rates, gene regulation, and posttranslational modifications (Bork et al., 2004). The development of full-genome- and proteomics-based technologies, such as next-generation sequencing, transcriptomics, and high-throughput yeast two-hybrid screening, has generated huge amounts of biological data. To capitalize upon these data for functional biological studies, this information needs to be analyzed, effectively integrated, and stored to facilitate rapid searching and in-depth analysis.
Currently, it is not easy to directly access these data and to integrate information from different sources and methodologies to provide biological network information. Fortunately, an excellent recent resource, PSICQUIC (for Proteomics Standard Initiative Common QUery InterfaCe; Aranda et al., 2010), has provided an interface for protein interaction databases to allow easy access to these data. The main goal of the PSICQUIC project is to provide a common query interface and implement data quality assessment for these disparate databases; this is now being successfully used for many projects, including Cytoscape, IntAct, and Reactome (http://code.google.com/p/psicquic/wiki/WhoUsesPsicquic).
There are a number of bioinformatics tools, such as ATTED (Obayashi et al., 2007), that utilize coexpression data for network analysis; however, these have their limitations, since they are based upon transcript levels and do not utilize protein data. One of the initial network analysis tools for visualizing the Arabidopsis interactome was the Arabidopsis Interaction Viewer (Geisler-Lee et al., 2007). The Arabidopsis Interaction Viewer currently contains 99,466 Arabidopsis interacting proteins, which were collected from BIND, MINT, literature sources such as Arabidopsis Interactome Mapping (Arabidopsis Interactome Mapping Consortium, 2011), and some predictions generated by the authors. The Arabidopsis thaliana Protein Interaction Network also offers an online tool that integrates some of the available Arabidopsis protein interaction databases, including the Predicted Interactome for Arabidopsis (Geisler-Lee et al., 2007), Arabidopsis protein-protein interaction data curated by The Arabidopsis Information Resource (TAIR) curators (http://www.arabidopsis.org/index.jsp), BioGRID (Stark et al., 2006, 2011), and IntAct (Aranda et al., 2010).
There are many variables that have to be addressed to facilitate data integration between the large numbers of available protein interaction data sets. These include data standards, the use of single types of protein identifiers, and well-defined ontology terms. Many of these data are generated from different sources with no shared database design, often with no clearly defined standards and with different identifiers. Therefore, it is vital to develop a set of definitive standards for the collection, integration, and analysis of protein interaction data to enable the establishment of networks that utilize data from both small-scale experiments and high-throughput approaches. This is particularly important because an interaction that has been demonstrated by multiple approaches lends greater validity and robustness to the network.
To address these issues and to facilitate effective protein interaction network construction, we have developed an interactive bioinformatics Web tool entitled the Arabidopsis Network Analysis Pipeline (ANAP) for Arabidopsis network analysis. The main aims of ANAP are to integrate the currently available Arabidopsis protein interaction data sets and to provide biologists with a novel, easy-to-use, and intuitive interface that enables researchers with limited bioinformatics experience to carry out detailed, high-throughput network analysis. Protein interaction data sets were integrated and formatted from 11 public Arabidopsis protein interaction databases. At publication, ANAP contained 201,699 unique protein interaction pairs, comprising 15,208 identifiers (including 11,931 TAIR Arabidopsis Genome Initiative [AGI] codes) with 89 interaction detection methods, 73 proteins from different species that interact with Arabidopsis proteins, and 6,161 references (Table I). This provides an extensive and valuable knowledge base for generating protein interaction networks from the integrated data sets, thus producing a far more detailed and reliable network than could be produced from any single protein interaction database.
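The idea of collapsing records from several databases into unique interaction pairs can be illustrated with a minimal Python sketch. This is not the ANAP implementation: the record layout, the function name merge_interactions, and the database and method labels are all assumptions made for illustration. The sketch treats a pair as unordered and keeps track of which sources and methods support it.

```python
from collections import defaultdict


def merge_interactions(records):
    """Collapse (protein_a, protein_b, database, method) records into unique pairs.

    A pair is treated as unordered, so (A, B) and (B, A) count as one interaction.
    """
    merged = defaultdict(lambda: {"databases": set(), "methods": set()})
    for a, b, database, method in records:
        pair = tuple(sorted((a, b)))  # canonical order for the unordered pair
        merged[pair]["databases"].add(database)
        merged[pair]["methods"].add(method)
    return dict(merged)


# Hypothetical records; the database and method labels are illustrative only.
records = [
    ("AT5G42970", "AT1G22920", "IntAct", "two hybrid"),
    ("AT1G22920", "AT5G42970", "BioGRID", "affinity capture"),
    ("AT5G42970", "AT1G02090", "IntAct", "two hybrid"),
]
merged = merge_interactions(records)
```

In this sketch, a pair reported by more than one database or method accumulates several entries in its sets, which gives a simple way to see how much independent support an interaction has.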
ANAP allows either single or multiple protein searches to be conducted. The networks generated display the various interaction detection methods and data sources in unique colors to enable effective network viewing. Additional functions are available to conduct "in-depth" protein searches, which identify the indirect interactions of the original input protein. This is very important, as a network, or a protein interaction complex, may include indirect interactions with many other proteins. This type of approach has previously been shown to be a very useful way to recognize new interactions within a complex (Jensen et al., 2009). Each protein in the network is described using its TAIR AGI code, UniProt identifier (ID), and a short description; additionally, the full TAIR locus details can be viewed by double clicking on the protein. Direct links to five popular Arabidopsis resources (AtGenExpress Visualization Tool, Arabidopsis 1,001 Genomes GBrowse, Protein Knowledgebase, Kyoto Encyclopedia of Genes and Genomes, and Ensembl Genome Browser) are also provided. The network and the detailed evidence for each interaction can be saved in various file formats, including PNG, PDF, SVG, SIF, GRAPHML, and XGMML. The file formats SIF, GRAPHML, and XGMML are particularly useful for large networks where the user wishes to import the resulting ANAP network into Cytoscape, which is a well-established network analysis tool (Shannon et al., 2003; Kohl et al., 2011; Smoot et al., 2011). ANAP also supports the import of the resulting network into other network analysis tools, such as Network Workbench (GRAPHML; NWB Team, 2006). ANAP is a fully functional integration and analysis pipeline that will serve as an extremely valuable resource for biologists. It will enable them to capitalize upon the currently available Arabidopsis protein interaction data for effective network-based analysis, enabling better predictions of function and selection of targets for further biological analysis.
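As an illustration of the SIF export mentioned above, the following Python sketch writes a small network in Simple Interaction Format, in which each line holds a source node, an interaction type, and a target node. The function name to_sif and the interaction labels are hypothetical, and the exact labels ANAP writes to its SIF files are not specified here; tab delimiters are used on the assumption that the file will be read by Cytoscape.

```python
def to_sif(edges):
    """Render (source, interaction_type, target) triples as SIF text, one edge per line."""
    return "\n".join(f"{source}\t{rel}\t{target}" for source, rel, target in edges)


# Hypothetical edges; the interaction-type labels are illustrative only.
edges = [
    ("AT5G42970", "two_hybrid", "AT1G22920"),
    ("AT5G42970", "pull_down", "AT1G02090"),
]
sif_text = to_sif(edges)
print(sif_text)
```

The resulting text can be saved with a .sif extension and opened in Cytoscape through its network import function.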

ANAP Framework and Searches
The ANAP tool has been developed to integrate the available Arabidopsis protein interaction data that have been generated from different sources by a variety of approaches. These data are then used to generate accurate protein interaction networks, which will facilitate greater understanding of biological processes. ANAP can be used as a platform to construct protein interaction networks based on both direct and indirect interaction analysis.
ANAP has an intuitive graphical user interface that allows the user to easily construct molecular networks using single or multiple starting proteins as inputs; the results are displayed showing the proteins that interact with the initial query protein(s). Figure 1A shows the ANAP tool interface, which includes ID Mapping and a Help link. The user enters the Arabidopsis TAIR AGI code(s) or the protein UniProt ID(s) into the central search box, with the option of selecting two types of node relationship: "Source Database" and "Interaction Detection Method." The selection of node relationship does not affect the overall network that is generated but rather the presentation of the links between the nodes; Source Database lists the database information used to generate the links, while Interaction Detection Method presents the experimental technique that has been used to generate the relationship.
Figure 1B shows the whole framework of the ANAP output, which includes several useful functions to enable the user to easily extract extensive information from the resultant network. This framework includes the network map in the center of the main panel, a "Change the Color" button underneath, network information about numbers of nodes and interactions, and a panel for searching and mapping data onto the network. There is a panel for saving the resultant network and another panel for useful information, which includes links to the supporting evidence for the interactions and a Simple Interaction Format (SIF; Cytoscape format) file containing the Source Database and Interaction Detection Method. This panel also contains a "Depth Search" button, which supports the indirect interaction search option. Moreover, there are two further panels at the bottom of the framework. One is "network filtering," which is useful for simplifying the output of a complex network and allows users to toggle between different databases and different interaction detection methods to generate networks; the other is "upload network," which is useful for reanalyzing the generated network while keeping the input nodes in their original positions.
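The "network filtering" behavior described above can be pictured as a selection over edge attributes. The Python sketch below is an illustration only, not ANAP code: the edge dictionaries, the function name filter_edges, and the database and method labels are assumptions.

```python
def filter_edges(edges, databases=None, methods=None):
    """Keep edges whose source database and detection method are in the selected sets.

    Passing None for a criterion leaves that criterion unrestricted.
    """
    return [
        edge
        for edge in edges
        if (databases is None or edge["database"] in databases)
        and (methods is None or edge["method"] in methods)
    ]


# Hypothetical edges with illustrative attribute values.
edges = [
    {"a": "AT5G42970", "b": "AT1G22920", "database": "IntAct", "method": "two hybrid"},
    {"a": "AT5G42970", "b": "AT1G02090", "database": "BioGRID", "method": "pull down"},
    {"a": "AT5G42970", "b": "AT1G10840", "database": "IntAct", "method": "pull down"},
]
intact_only = filter_edges(edges, databases={"IntAct"})
intact_pull_down = filter_edges(edges, databases={"IntAct"}, methods={"pull down"})
```

Toggling a database or method on or off then corresponds to changing the sets passed to the function and redrawing the network from the edges that remain.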

Single Protein Searches
The locus AT5G42970 (which encodes subunit 4 of the COP9 signalosome [CSN] complex) was used as an example for analysis using the ANAP tool. Figure 2 shows the resulting network of 34 nodes and 130 edges, based on the direct protein interactions generated after selecting the option of Interaction Detection Method. The query protein AT5G42970 is marked in red in the center of the figure, and each associated protein is linked by a uniquely colored line, based on the interaction detection method and the rendering rules from the complete list of all interaction detection methods (Supplemental Data Set S1).
The CSN is a highly conserved protein complex that is associated with the ubiquitin-proteolytic breakdown pathway. In eukaryotes, it is formed of nine subunits, of which subunit 4 (AT5G42970) is one member (Schwechheimer and Isono, 2010). Searching ANAP with AT5G42970 identified 34 nodes; the gene identities and functions of these are shown in Figure 2B. The nine components of the CSN (COP9 subunit 4 and eight others) were all identified and are highlighted in orange (Fig. 2). To construct the network, ANAP has utilized data from multiple sources, comprising both predicted interactions and experimental evidence; the numbers of each for these proteins are shown in Figure 2B. This wide range of data provides valuable support for any interactions; for example, a large number of interactions were seen between COP9 subunit 4 and the other well-established components of the CSN. The other proteins that have been identified in the network range from those with established roles in ubiquitination pathways (Schwechheimer and Isono, 2010) to those involved in other developmental processes that are regulated by ubiquitin proteolysis. It is also very easy to go directly from the identified proteins to PubMed sources to aid in characterizing the network and to further interrogate the validity of the predicted parts of the network. The multiple sources of data accessed by ANAP offer excellent opportunities not only to confirm known networks but also to extend them to identify novel targets. The range and depth of the data utilized for network generation, therefore, provide a valuable mechanism to assess the validity of such predictions prior to follow-up experimental analysis.
Table II shows an example of five evidence records generated when searching using AT5G42970 based on direct protein interactions. In addition, Supplemental Data Set S2 lists the complete relevant evidence records for the AT5G42970 protein network. The user can dynamically interact with the network by using the mouse-over function on the nodes; this shows the protein's UniProt ID, TAIR AGI code, and a short description of the predicted protein function. Additionally, there are links to the relevant locus details, which are visible when the node is double clicked. Moreover, each node in the network has a direct link to the AtGenExpress Visualization Tool, the Arabidopsis 1,001 Genomes GBrowse, the Protein Knowledgebase, the Kyoto Encyclopedia of Genes and Genomes, and the Ensembl Genome Browser. A similar feature is available for the edges in the network, which display the interaction method when the mouse hovers over each edge. ANAP provides the opportunity for the user to select a node(s) of interest in the resultant network and to use this to construct a new network and extract the evidence data. Furthermore, users can also search for specific protein(s) in the resultant network; such proteins are marked in blue when found in the resultant network and in fuchsia when they are the same as the query protein(s). Using AT5G42970, a network was constructed based on the same configuration as the network in Figure 2 by selecting the option of Source Database (Supplemental Fig. S1). The edges of each source database are indicated by a unique color based on the rendering rules for the complete list of all source databases (Supplemental Data Set S1).
There is also an added feature that allows users to easily identify the indirect interactions of the original protein using the "Depth Search" button. Supplemental Figure S2 shows the network constructed based on the indirect protein interaction data generated for AT5G42970. This approach is useful for recognizing new potential interactions in the network (Jansen et al., 2003), for assigning putative functions to less well-characterized proteins, and for providing a more comprehensive understanding of the query protein at the system level with the help of each cluster in the constructed network.
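The notion of an indirect interaction used by the depth search can be sketched in Python as a two-step neighborhood lookup. This is a simplified illustration under the assumption that an indirect partner is any protein reachable through exactly one intermediate protein; it is not a description of how ANAP implements the "Depth Search" function, and the function names are hypothetical.

```python
from collections import defaultdict


def build_adjacency(pairs):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    adjacency = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def indirect_partners(adjacency, query):
    """Return proteins linked to the query only through one intermediate protein."""
    direct = adjacency.get(query, set())
    two_steps = set()
    for partner in direct:
        two_steps |= adjacency.get(partner, set())
    # Exclude the query itself and anything that already interacts directly.
    return two_steps - direct - {query}


# Toy network: A-B and B-C are direct, so A and C interact indirectly.
adjacency = build_adjacency([("A", "B"), ("B", "C"), ("A", "D")])
partners_of_a = indirect_partners(adjacency, "A")
```

In this toy network, protein C is returned for query A because both interact directly with B, which matches the A, B, C example given in the introduction.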

Multiple Protein Searches
Currently, more and more researchers are employing transcriptomics, next-generation sequencing, and other high-throughput technologies in the fields of molecular, cell, and developmental biology to decipher novel biological phenomena. By using bioinformatics-based approaches, lists of key genes can be further classified to select candidates for confirmation by biological experimentation. However, for such gene selection and functional analysis to be effective, particularly at a protein level, these data sets require supplementation and detailed analysis. Therefore, it is critical to produce protein interaction networks using multiple proteins as a way of visualizing and analyzing all the interactions simultaneously to aid in functional analysis. ANAP supports such multiple protein searches and protein interaction network construction, so that users can submit targets as TAIR AGI codes, UniProt IDs, or a combination of these identifiers into the ANAP tool. Such networks, therefore, provide valuable information establishing links between proteins, which are likely to represent functional and regulatory conservation.
Figure 3 shows the network generated by searching using five proteins (AT1G02090, AT1G10840, AT1G22920, AT1G29150, and AT1G30950) from the AT5G42970 (COP9 signalosome complex) ANAP interaction network. This was constructed based on direct protein interactions, with the option of selecting based on Interaction Detection Method. Each of the query proteins is marked as a red node, and each interaction detection method is allocated a unique color. Several clusters from each query protein can be easily recognized within the network graph (Fig. 3).
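A multiple protein search of this kind can be thought of as building one network from the direct partners of every query protein. The Python sketch below is a simplified illustration: the function name direct_network and the toy identifiers are hypothetical, and the choice to keep edges between two non-query partners is an assumption rather than a documented ANAP rule.

```python
def direct_network(pairs, queries):
    """Return the nodes and edges of the network around one or more query proteins.

    Nodes are the queries plus their direct partners; edges are all input pairs
    whose two ends are both in that node set (an assumption of this sketch).
    """
    queries = set(queries)
    nodes = set(queries)
    for a, b in pairs:
        if a in queries or b in queries:
            nodes.update((a, b))
    edges = [(a, b) for a, b in pairs if a in nodes and b in nodes]
    return nodes, edges


# Toy interaction pairs with placeholder identifiers.
pairs = [("Q1", "X"), ("Q2", "Y"), ("X", "Y"), ("Y", "Z"), ("Z", "W")]
nodes, edges = direct_network(pairs, ["Q1", "Q2"])
```

A single protein search is the special case in which the query list holds one identifier.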

DISCUSSION
The Current Challenges of Integrating Protein Interaction Networks
Protein interaction networks can give a system-level view that is vital for the detailed analysis of complex biological systems (Jansen et al., 2003). However, providing mechanisms to integrate protein interaction data that have been generated from various sources poses significant challenges. For instance, two proteins may only interact during a certain developmental stage and/or in a specific tissue; however, most of the currently available protein interaction data do not provide temporal or spatial specificity. Furthermore, these data sets have frequently been generated in ectopic expression systems and thus may not represent the genuine interactions occurring in vivo. These limitations reduce the accuracy of the established networks, although such problems can be lessened by the successful integration of the increasing amounts of protein interaction data that have been generated by different approaches. The importance of data integration is now being fully appreciated, and there is a general emphasis toward the development of standards for large data sets with defined specific formats, which include PSI-MI (Kaiser, 2002) for protein interactions and BioPAX (Demir et al., 2010) and SBML (Hucka et al., 2003) for pathway standards. Several other approaches utilize controlled vocabularies with a defined glossary of terms for types of interactions (Côté et al., 2006) and specific protein identifiers that are constant across all the available protein interaction databases to facilitate easier integration. There is also a need for these same standards to be established in published scientific journals to further enhance the effectiveness of text mining to supplement the ANAP integrated protein interaction data set.

Interaction with Other Resources
Defining protein function is an essential requirement for effective, functional network characterization. Moreover, recent studies have shown that protein interaction networks are able to give a good prediction of protein function (Jansen et al., 2003; Sharan et al., 2007). Therefore, bridging target genes from transcriptomic data, or next-generation sequencing data, with the help of Gene Ontology term enrichment to the protein interaction network can provide added substance for network characterization (Maere et al., 2005).
Currently, ANAP provides the function of mapping up- and down-regulated transcriptomic data, next-generation sequencing data, and other biology-based results onto the generated network. For transcriptomic mapping, the nodes are colored in red or green, which represent the up- or down-regulated genes in the network (Supplemental Fig. S3). A node can also be highlighted in blue if customized gene list data (any data of interest that users want to overlay onto the ANAP network) are mapped onto the network nodes. This makes the ANAP tool very flexible for the user to identify specific proteins and transcriptomic regulatory relationships within the network. The node is colored in fuchsia or turquoise if the mapped customized data also exist in the up- or down-regulated transcriptomics data. Moreover, ANAP provides another seven colors (olive, orange, purple, yellow, maroon, navy, and teal) in the mapping function for users to integrate data such as different subcellular localizations or other biology-based data, rather than only using this to indicate differing expression levels. Another strength of the ANAP tool is that the user can import the resultant networks into Cytoscape and other software for subsequent additional analysis (Shannon et al., 2003; Kohl et al., 2011; Smoot et al., 2011). The user can import the SIF, GRAPHML, or XGMML file generated by ANAP into Cytoscape. The Cytoscape mapping functions can then be used to integrate different resources and plugins for analyzing existing networks, inferring new networks, functional enrichment of networks, etc. This tool also supports import into other network analysis tools, such as Network Workbench (GRAPHML; NWB Team, 2006).
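The node coloring rules for mapped data can be summarized as a small decision function. The Python sketch below restates the rules given in the text and is not taken from the ANAP source code; the function name node_color and the toy identifiers are hypothetical, and it assumes that fuchsia corresponds to the overlap with up-regulated genes and turquoise to the overlap with down-regulated genes, in the order the colors are listed.

```python
def node_color(node, up=frozenset(), down=frozenset(), custom=frozenset()):
    """Return the display color for a node given the mapped gene lists.

    Returns None when the node appears in none of the lists.
    """
    if node in custom and node in up:
        return "fuchsia"    # customized data that are also up-regulated (assumed order)
    if node in custom and node in down:
        return "turquoise"  # customized data that are also down-regulated (assumed order)
    if node in up:
        return "red"
    if node in down:
        return "green"
    if node in custom:
        return "blue"
    return None


# Toy gene lists with placeholder identifiers.
up = {"G1", "G2"}
down = {"G3", "G4"}
custom = {"G2", "G4", "G5"}
colors = {gene: node_color(gene, up, down, custom) for gene in ["G1", "G2", "G3", "G4", "G5", "G6"]}
```

The overlap cases are checked first so that a gene present in both a customized list and an expression list receives the combined color rather than the single-list color.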

CONCLUSION
In this paper, the Web-based ANAP tool has been designed and implemented for Arabidopsis protein interaction network analysis. ANAP currently integrates 201,699 unique protein interaction pairs into a tool that has a well-designed, simple-to-use, intuitive interface for biologists and whose networks can be exported to Cytoscape. Thus, it can be widely used for Arabidopsis protein interaction network construction and analysis. This is particularly valuable where large numbers of genes of interest have been selected from microarray and next-generation sequencing experiments and where only limited information is known. Case studies using single protein searches and multiple protein searches from the COP9 signalosome complex (Figs. 2 and 3; Supplemental Figs. S1 and S2) and the cytokinin regulatory pathway (ANAP user guide; Supplemental Fig. S3) have demonstrated the consistently good performance of ANAP for Arabidopsis protein interaction network analysis. The current ANAP framework provides a novel, intuitive, and easy-to-interpret tool that will greatly aid biologists in understanding plant developmental networks and will allow them to decipher their specific biological network interactions far more quickly than by using biological techniques alone. Furthermore, ANAP has been designed so that features can easily be added to extend its functionality as the tool develops. Future work is planned to extend this tool to integrate the protein interaction data with metabolic pathway data, gene coexpression data, and other types of interactions to decipher biological problems more effectively.
In a recent ANAP update, we also integrated 5,664 confirmed binary interactions between 2,661 proteins from the Arabidopsis Interactome Mapping Consortium (2011), which is a recently published high-throughput Arabidopsis yeast two-hybrid data set.

Access Availability
ANAP is implemented in HTML, Shell, AWK, PHP, and JavaScript with the support of Cytoscape Web, which allows the developer to embed dynamic networks into HTML (Lopes et al., 2010). The tool is open access for any use and available at http://gmdd.shgmo.org/Computational-Biology/ANAP. The top right corner of the index page includes a Help link, which is very useful to new users. The Help page contains a "Video Tutorial," "Frequently Asked Questions," and a "User Guide." Users who have questions about using ANAP or difficulty understanding the terms or concepts should refer to the Help page.
Generally, ANAP is updated with new interaction data every 3 months, and we have developed a semiautomatic formatting and updating program for this purpose. This has been rigorously tested with random access checks and manual checks to ensure stable and accurate integration of new data. In addition, we have established a log analysis tool to analyze access to ANAP.

Flow Chart of the ANAP Tool
The main modules in ANAP connect together data collection, data integration, and network viewing. The architecture of the ANAP pipeline is shown in the flow chart in Figure 4. We first searched the Arabidopsis protein interaction data based on the mnemonic (ARATH), taxon identifier (3702), scientific name (Arabidopsis thaliana), common name (Mouse-ear cress) and other names [Arabidopsis thaliana (L.) Heynh., Arabidopsis thaliana (thale cress), Arabidopsis thaliana, thale cress, and thale-cress] used in current protein interaction databases. The collected protein interaction data were then formatted to establish the ANAP database source (Supplemental Data Set S3); the graphical user interface was designed to support querying the protein(s) using a TAIR AGI code or UniProt ID. At the same time, the network rendering rules (Supplemental Data Set S1), based on the statistical analysis of the source data, were generated. The option is provided to select Source Database and Interaction Detection Method for the user to choose the desired node relationship. ANAP then produces the resultant network and extracts the interaction evidence. In addition, ANAP generates query keywords that extract the connecting proteins for in-depth searching. Finally, users can interact with the network and save it in different formats, including network maps or as network data.
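The data collection step, in which Arabidopsis records are recognized by any of the organism labels listed above, can be illustrated with a short Python sketch. The label set is taken from the text; the function name is_arabidopsis, the case-insensitive exact matching, and the example record layout are assumptions of this sketch and do not describe the actual ANAP collection scripts.

```python
# Organism labels listed in the text, stored in lower case for matching.
ARABIDOPSIS_LABELS = {
    "arath",
    "3702",
    "arabidopsis thaliana",
    "mouse-ear cress",
    "arabidopsis thaliana (l.) heynh.",
    "arabidopsis thaliana (thale cress)",
    "thale cress",
    "thale-cress",
}


def is_arabidopsis(organism_field):
    """Return True if an organism field matches one of the known Arabidopsis labels."""
    return organism_field.strip().lower() in ARABIDOPSIS_LABELS


# Hypothetical records of the form (protein identifier, organism field).
records = [
    ("AT5G42970", "Arabidopsis thaliana"),
    ("P12345", "Homo sapiens"),
    ("AT1G22920", "ARATH"),
    ("AT1G02090", " 3702 "),
]
arabidopsis_records = [record for record in records if is_arabidopsis(record[1])]
```

Real database exports may wrap these labels in additional syntax, so a production filter would need to be adapted to each source format.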

Protein Interaction Data Format
A protein interaction data format was designed to convert and then integrate the 11 Arabidopsis data sets. Initially, there were numerous issues associated with independently integrating the protein interaction data from the 11 different databases. Different programs were written to collect and format each database; however, these posed problems for subsequent automated continuous updates. In addition, each database had a different data access method, which meant that integrating all the Arabidopsis data was difficult. Furthermore, each database used a very different format for the interaction evidence. We found an excellent recent resource named PSICQUIC (Aranda et al., 2010), which had integrated raw data from about 22 protein interaction databases. However, after searching and checking extensive random data from each of the 11 available Arabidopsis databases, we found that the data were not well formatted (Supplemental Data Set S3), which made them unsuitable for protein interaction network-based analysis. Taking Interaction Detection Method, for example, the raw PSICQUIC data have 149 unique methods while

Figure 1. ANAP search page and the framework of the network result. A, Portal of the ANAP tool, which can search for single or multiple proteins using TAIR AGI code format and/or UniProt ID, based upon the node relationship of the Source Database and Interaction Detection Method. B, The framework of the ANAP result contains a map of the resultant network, a panel for searching and mapping data onto the network, options for saving and exporting the network, a panel of useful information including evidence and depth search (for indirect interaction searches), and two panels for "network filtering" and "upload network." [See online article for color version of this figure.]

Figure 2. Network generated using the COP9 signalosome protein (AT5G42970). A, Network map based on direct protein interactions and the node relationship of the Interaction Detection Method. The query protein is marked in red, and each interaction detection method is indicated by a unique linking colored line. B, Table generated from the "evidence" from the COP9 interaction network. The table shows the numbers of interactions detected by ANAP, which of these are from experimental data and from inference-based approaches, and the total number of databases supporting the interactions. Members of the COP9 signalosome are shown in orange in both A and B.

Figure 3. Network result generated by using a multiple protein search (five proteins searched: AT1G02090, AT1G10840, AT1G22920, AT1G29150, and AT1G30950) based on direct protein interactions and the node relationship Interaction Detection Method. Each query protein is marked in red, and each interaction detection method is indicated by a uniquely colored line. Several node clusters can be identified, with each query protein evident in the network graph. [See online article for color version of this figure.]

Figure 4. Flow chart of the ANAP tool. The main modules in ANAP include data collection, data integration, and network visualization. Details of each module are described in the text. [See online article for color version of this figure.]

Table I. Statistics of the integrated ANAP protein interaction source data

ANAP: An Arabidopsis Protein Interaction Network Tool. Plant Physiol. Vol. 158, 2012

Table II. Five evidence records generated when searching ANAP using protein AT5G42970 based on direct protein interactions. Ath, Arabidopsis.