Plant Physiology 138:116-126 (2005) © 2005 American Society of Plant Biologists
FPC Web Tools for Rice, Maize, and Distribution¹
Arizona Genomic Computational Laboratory, BIO5 Institute, University of Arizona, Tucson, Arizona 85721 (V.P., F.E., J.H., G.G., C.S.); and Clemson University, Clemson, South Carolina 29634 (S.B.)
Many clone-based physical maps have been built with the FingerPrinted Contig (FPC) software, which is written in C and runs locally for fast and flexible analysis. If the maps were viewable only from FPC, they would be less useful to the community, since FPC must be installed on the user's machine and the database downloaded. Hence, we have created a set of Web tools so users can easily view the FPC data and perform salient queries with standard browsers. This set includes the following four programs: WebFPC, a view of the contigs; WebChrom, the location of the contigs and genetic markers along the chromosome; WebBSS, locating user-supplied sequence on the map; and WebFCmp, comparing fingerprints. For additional FPC support, we have developed an FPC module for BioPerl and an FPC browser using the Generic Model Organism Project (GMOD) genome browser (GBrowse), where the FPC BioPerl module generates the data files for input into GBrowse. This provides an alternative to the WebChrom/WebFPC view. These tools are available to download along with documentation. The tools have been implemented for both the rice (Oryza sativa) and maize (Zea mays) FPC maps, both of which contain the locations of clones, markers, genetic markers, and sequenced clones (along with links to sites that contain additional information).
FingerPrinted Contigs (FPC) is a program that orders clones into contigs based on restriction fragment fingerprints and marker data and orders contigs based on genetic markers. FPC provides the ability to assemble and manually edit contigs (Soderlund et al., 1997
The WebAGCoL package is a set of four tools: WebFPC displays contigs in a view very similar to the FPC display. WebChrom shows contigs and genetic markers aligned to the chromosome. It also allows the user to view the distribution of markers based on name or remark. WebFCmp allows fingerprint comparisons of a user-selected clone set against the entire FPC database. WebBSS locates a user-supplied sequence on an FPC map based on its similarity to sequences associated with other clones in the map. All of these tools work with a standard browser. The WebAGCoL package has been made available for distribution. Such a package can be difficult to install since it has Java, CGI, and HTML files that all belong in different directories. To simplify the setup, we have written a script that automatically installs the different files based on a configuration file. The set of tools and setup scripts was released in August 2004.
This manuscript also discusses two other FPC support efforts: an FPC module for BioPerl (www.bioperl.org) that provides a simple interface to query an FPC database and an FPC browser using the Generic Genome Browser software (Stein et al., 2002
A previous version of WebFPC and WebChrom was released in August 2003 and is used at multiple sites. Some sites have developed their own FPC Web browsers, for example, iCE (Internet Contig Explorer; Fjell et al., 2003
It should be noted that these tools were not designed to replace FPC. If a user will be executing more than a few occasional queries, then downloading the FPC database and FPC executable will save time. There are FPC executables for Sun, Linux, and Mac OSX, but not for Windows. For Windows users, we recommend using a VNC (http://www.realvnc.com) session on a Unix machine, which allows the user to run FPC on Unix from a Windows machine. A tutorial on building maps is given in Engler and Soderlund (2002)
Rice Map
The rice FPC map has 72,703 clones, 8,870 markers, 180 contigs, and 2,918 anchors. FPC calls any marker that has a location on a chromosome or linkage group an anchor. There are two types of anchors: (1) frameworks are well ordered and (2) placements are binned between frameworks. For the rice map, the 1,378 frameworks are the Japanese Genetic markers (Harushima et al., 1998 The initial WebFPC display for the rice physical map is shown in Figure 1. It provides options to search by clone, marker, or contig. If a marker is contained in more than one contig, all contigs containing that marker will be listed. Substrings can be used for marker and clone names. For example, to view all the sequenced clones, one would enter the string "sd1" (the "sd" stands for simulated digest, as explained below).
WebFPC is implemented in Java, which allows fast navigation around an entire contig, avoiding the slow redisplay common in Web displays using the paging method. Each marker is centered over the largest stack of clones to which it is attached. WebFPC features a filtering window that gives user options to show or hide information. For example, there may be many markers in a small region, causing the marker track to become very deep. Marker filtering allows the user to limit the depth by showing only the markers of interest, as shown in Figure 2a. In Figure 2b, markers prefixed by SOG or OJ have been removed from the display. Anchors are shown at the bottom of the display (Fig. 2c). By default, only the frameworks are shown, but the placements can be made visible via the filtering window. While only a region of the contig may be visible, all frameworks are shown, providing an overview of the whole contig. Selecting an anchor centers the contig display on the corresponding region. A pull-down at the top of the display lets the user filter the clones by "No Buried," "All," or "Seq Only." A buried clone is one whose fingerprint pattern matches that of another clone either exactly or approximately. Selecting No Buried hides the buried clones to limit redundant information. The Seq Only option shows only the simulated digest (SD) clones, which are generated from sequenced clones.
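The filtering behavior described above can be sketched in a few lines. This is only an illustration in Python, not the WebFPC Java code: the `Clone` class, the function names, and the rule that SD clone names end in `sd` followed by a number are assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Assumption: SD clone names end in "sd" plus a number (e.g. "...sd1").
SD_SUFFIX = re.compile(r"sd\d+$")


@dataclass(frozen=True)
class Clone:
    name: str
    buried: bool = False  # fingerprint matches another clone exactly or approximately


def filter_clones(clones, mode):
    """Apply one of the three pull-down choices to a list of clones."""
    if mode == "All":
        return list(clones)
    if mode == "No Buried":
        return [c for c in clones if not c.buried]
    if mode == "Seq Only":
        return [c for c in clones if SD_SUFFIX.search(c.name)]
    raise ValueError(f"unknown filter mode: {mode!r}")


def hide_markers(markers, hidden_prefixes):
    """Drop markers whose names start with any of the given prefixes."""
    prefixes = tuple(hidden_prefixes)
    return [m for m in markers if not m.startswith(prefixes)]
```

For instance, `hide_markers(names, ["SOG", "OJ"])` would correspond to removing the SOG and OJ markers from the marker track, leaving the remaining markers in their original order.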
The SD clones are generated by a nightly cronjob (a script that is scheduled to run automatically at a given time). The cronjob executes the following steps: (1) download updated rice sequence from GenBank; (2) run a simulated digest on each sequence (Engler et al., 2003
In the top right hand corner of the contig display, there is a pull-down that lists other on-line databases to which the user can link. For the rice map, there are links to GenBank (Benson et al., 2004
Figure 3 shows the rice WebChrom display, which provides a view of the ordering of the contigs and anchors along the chromosome. Selecting a contig will bring up the WebFPC display for the contig. The markers link to WebFPC, Gramene, and RGP INE. It is not unusual for the genetic location of anchors to disagree with their location in a contig. For example, contig 14 in Figure 3 shows all of its markers in close succession (bounded box with tick marks representing genetic markers), but the long unbounded yellow box indicates that one marker is on the lower part of chromosome 3. The contig's chromosome and the position on the chromosome are calculated by FPC (Engler et al., 2003
If the user wants to search for a particular marker on the chromosome map or see the distribution of a set of markers based on a substring of the marker name or marker remark, he or she can use the WebChrom Search tool. Many markers are from expressed sequence tags, making it advantageous to remark them with their annotation. As a test case, we blasted (Altschul et al., 1997
Though WebFPC shows overlapping clones, the amount of overlap is not exact due to the error in the data (Soderlund et al., 2000
Suppose one wants to determine whether a sequence from a related organism is found in rice and, if so, what other markers are surrounding the given marker. Such information can be gained by querying all sequences associated with FPC clones, namely the sequences for all SD clones and/or the bacterial artificial chromosome end sequence (BES) for all fingerprinted clones. FPC has a feature called Blast Some Sequence (BSS) that blasts a file of sequences against a directory of BESs or genomic sequences and creates a report of all hits and their location on the FPC contigs. For the WebBSS, the user inputs a sequence, the FPC BSS routine is called, and the output is parsed and displayed as shown in Figure 6. Selecting a contig will bring it up in WebFPC, where the user can view the surrounding markers and clones. As of February 7, 2005, there are 4,058 genomic sequences and 98,286 BESs.
For an alternative view to WebFPC and WebChrom, we have developed an FPC configuration file and GFF file description that are used to create a Generic Model Organism Project (GMOD) genome browser (Stein et al., 2002
A full description of the origin of the clones and markers in the rice FPC is presented at the Web site (www.genome.arizona.edu/fpc/rice). The FPC file is available from our FTP site (ftp.genome.arizona.edu/pub/fpc/rice/).
The maize genome (approximately 2,500 Mb) will be the next cereal genome to be sequenced and will thus require a robust physical map and software tools to support the sequencing project as well as a variety of positional cloning projects. The maize FPC map currently contains 292,168 clones, 17,523 markers, and 2,998 anchors (Coe et al., 2002
The maize FPC Web site has the same set of tools as the rice Web site. WebFPC links to maizeGDB (Lawrence et al., 2004 For the WebBSS, there are currently (as of February 7, 2005) 490 genomic sequences and 682,116 BESs to search against. As with rice, new and updated sequences are downloaded nightly so all available sequences are in the database. The maize FPC map currently contains 503 SD clones, of which 13 have the suffix of sd2 or greater, indicating they come from GenBank sequences that result in more than 55 bands. The number of SD clones will continue to grow since clones are currently being sequenced. The maize sequencing status page shows what clones have been sequenced and where they are located, along with their similarity to the original clone (see www.genome.arizona.edu/shotgun/maize/status).
A full description of the origin of the clones and markers in the maize FPC is presented at the Web site www.genome.arizona.edu/fpc/maize. Additionally, a High Information Contig Fingerprinting (HICF; Ding et al., 1999
Rice has a finished map and an almost finished sequence. Maize has almost 300,000 clones, and the map and sequencing are still in progress. These two datasets provide good test cases for maps with a large number of sequences and clones, respectively. To test the tools on a dataset with a large number of markers, we downloaded the human map from www.bcgsc.bc.ca/perl/humanbac (The International Human Genome Mapping Consortium, 2001
The FPC tools with the prefix of Web are distributed as a package. A difficulty in distributing code for the Web is that it is a mix of Java, HTML, and CGI, where the three types of files go in different directories. An additional complexity is that the WebAGCoL package is composed of four tools with different requirements. A manual could be written to explain how to set up the files, but that would be tedious and error prone. Hence, we have written a setup script that reads a configuration file and automatically runs the correct scripts and puts files in their correct directories. It also creates a script that can be run to update the Web sites based on an updated FPC database.
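As a rough illustration of what such a setup script does, the sketch below reads a configuration file and works out where each file should go. It is written in Python for brevity and is not the distributed setup script; the section name, the keys (`cgi_dir`, `html_dir`, `jar_dir`), the example paths, and the mapping from file suffix to directory are all hypothetical.

```python
import configparser
from pathlib import PurePosixPath

# Hypothetical configuration; the real WebAGCoL configuration format is not shown here.
EXAMPLE_CONFIG = """
[paths]
cgi_dir = /var/www/cgi-bin/fpc
html_dir = /var/www/html/fpc
jar_dir = /var/www/html/fpc/java
"""

# Assumed mapping from file suffix to the configuration key for its directory.
KEY_BY_SUFFIX = {
    ".cgi": "cgi_dir",
    ".html": "html_dir",
    ".jar": "jar_dir",
}


def plan_install(config_text, filenames):
    """Return {filename: destination path} without touching the filesystem."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    paths = parser["paths"]
    plan = {}
    for name in filenames:
        key = KEY_BY_SUFFIX.get(PurePosixPath(name).suffix)
        if key is None:
            raise ValueError(f"no target directory configured for {name!r}")
        plan[name] = str(PurePosixPath(paths[key]) / name)
    return plan
```

Computing the plan separately from copying the files keeps the sketch easy to check; an actual installer would then copy each file to its destination and run the preprocessors.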
The philosophy of making data available to the public as soon as it is generated helps investigators stay up-to-date on the progress of their genomes of interest. To address this need, we developed a set of tools for viewing and analyzing FPC maps via the Internet. A significant advantage of Web based displays is that the host institution can automatically update the display when there is an updated FPC database instead of the user having to manually download new FPC releases. As an example, for the Maize Mapping Project (www.maizemap.org), the WebAGCoL tools were regularly updated as new clones were fingerprinted. Therefore, the community could easily follow the progress being made by simply visiting the maize FPC Web site. When the contig numbers have changed, the user can simply search in WebFPC by clone or marker for the new contig number.
Regular updates on the host site can be greatly simplified through the use of a cronjob. At AGCoL, a cronjob automatically updates the WebAGCoL tools when a given FPC database has changed. New and updated sequences are downloaded nightly from GenBank, a simulated digest is performed on them using Fingerprinted Simulated Digest (FSD; Engler et al., 2003
The FPC Generic Genome Browser (GBrowse) displays the same data as WebFPC and WebChrom, but the layout and functionality are quite different. WebFPC closely models the way FPC displays data, but GBrowse resembles other well-known sequence-based Web browsers. In GBrowse, each scroll, zoom, or position shift requires the entire page to be redrawn, which can make extended browsing a bit tedious. Since WebFPC is designed as a Java applet, scrolling through a contig is immediate once it is loaded. GBrowse is flexible in what tracks are shown, whereas WebFPC is flexible in that entities can be colored or made invisible. GBrowse will show adjacent contigs while WebFPC does not. A final difference is that when a marker is clicked in WebFPC, the clones it is associated with are highlighted; similarly, when a clone is clicked, its markers and remarks are highlighted. This feature is not in GBrowse.
While viewing FPC data via Web pages is quite useful, the ability to perform computations on the data on-line is desired as well since labs often do not have the high-performance computing resources nor the technical knowledge required for setting up the process locally. WebBSS addresses this demand by providing searches against a database of sequences associated with clones in FPC. To do this locally, the researcher would need to install BLAST (Altschul et al., 1997
Whereas WebBSS shows the location(s) of a sequence on the map, the WebChrom Search tool shows the location of markers based on name or remark. We have currently demonstrated the ability to search on annotations based on SwissProt hits, which will be extended to use Gene Ontology annotations (The Gene Ontology Consortium, 2000
We want to stress that WebFPC and its associated tools are not a replacement for FPC. If the user will be doing any serious querying of the data, FPC has much better search tools, is much faster, and FPC V7 (Engler et al., 2003
WebAGCoL Package
The WebAGCoL package contains WebFPC, WebChrom, WebBSS, and WebFCmp, along with a setup script and on-line help. Each tool has a preprocessor script that creates the files necessary for fast display. They all read a shared configuration file. All preprocessor scripts expect to read an FPC file written with FPC V7 or a later version. All preprocessors are written in Perl, all graphics except WebFPC are generated using the GD library (www.boutell.com/gd/), and Perl CGI is used for run time execution. At AGCoL, all processing and updates are performed on a Sun 280R with 2GB RAM.
WebFPC is written in the Java programming language and is implemented as an applet. The preprocessor Perl script splits the FPC database at a contig level, writing information for each contig as XML. This allows a Java SAX (www.saxproject.org) parser to retrieve and parse information simultaneously, reducing the time spent waiting for a contig to be displayed. The preprocessor reads a site specific file to determine how to color clones and markers. For example, for the rice (Oryza sativa) and maize (Zea mays) FPCs, the sequenced clones and electronic markers are highlighted yellow. It reads the directory of reference files to set up links to external sites. The preprocessor also creates the initial HTML that accesses the WebFPC Java Jar file. A CGI script is used to execute an external lookup in order to display a contig without going through the initial page.
The WebChrom preprocessor splits the FPC file into one HTML file per chromosome and writes Perl GD code to display the graphics. The WebChrom Search tool stores the markers in a "Storable," which is a hash table and is used by the CGI script for fast searching.
For WebBSS, the preprocessor creates a modified FPC file that contains only the necessary information; this speeds up the reading of the file during the Web-based execution. The configuration file contains the paths to the BES directory and genomic sequence directory. FPC is run in batch mode to execute the BSS. The BSS saves the output in a file, which is read by a CGI script and displayed.
For WebFCmp, the preprocessor creates a modified FPC file that only contains the index into the file of bands in order to increase speed. FPC is run in batch mode to compare the fingerprints. A CGI script is run to read these files, compute the overlap score, and display the results.
The setup script reads a configuration file to determine where to put the CGI, HTML, and Jar file. It also reads the location of the reference file, FPC file, and target directories and runs the preprocessors. It creates the initial HTML file and writes a file called update.sh that can be used to update all the Web tools when the organism's FPC file is updated.
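The per-contig XML approach used for WebFPC can be illustrated with a small round trip: write one contig as XML, then read it back with an event-driven SAX handler. This is a sketch in Python rather than the Perl preprocessor and Java applet, and the element and attribute names (`contig`, `clone`, `name`, `left`, `right`) are invented for the example; the actual WebFPC XML layout is not described here.

```python
import xml.etree.ElementTree as ET
import xml.sax


def contig_to_xml(contig_id, clones):
    """Serialize one contig; clones is a list of (name, left, right) tuples."""
    root = ET.Element("contig", id=str(contig_id))
    for name, left, right in clones:
        ET.SubElement(root, "clone", name=name, left=str(left), right=str(right))
    return ET.tostring(root, encoding="unicode")


class CloneHandler(xml.sax.ContentHandler):
    """Collect clones as the parser encounters them, without building a tree."""

    def __init__(self):
        super().__init__()
        self.clones = []

    def startElement(self, name, attrs):
        if name == "clone":
            self.clones.append(
                (attrs["name"], int(attrs["left"]), int(attrs["right"]))
            )


def read_clones(xml_text):
    """Parse the XML for one contig and return its clones in document order."""
    handler = CloneHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.clones
```

Because a SAX handler receives each element as it is read, a viewer built this way can begin processing clones before the whole file has arrived, which is the benefit the text attributes to the SAX parser.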
For the rice and maize Web sites, a cronjob is run nightly to download new and updated sequences from GenBank (as mentioned previously). Each GenBank file is parsed into a FASTA file of sequences and saved into a directory whose location is known by WebBSS; hence, this feature is always run on the latest sequences. A program called FSD2 (FPC Simulated Digest, Version 2) reads the GenBank file, cuts the sequence into overlapping clones, and creates the file of restriction fragment sizes. It also extracts the first author, clone name, and chromosome and writes the information into a file to be loaded as a remark for the clone. The size2band program uses the file of marker lanes (used by Image) to convert the sizes to bands. FPC is then run in batch mode to enter the new fingerprints and remarks into the database and position each clone in the same location as its best match. Once a clone has been positioned in FPC, it is not automatically repositioned when an updated record is entered. Therefore, we periodically remove all the SD clones from their contigs by making a keyset of them and executing "Move to Ctg0" from the pull down menu. They are then repositioned by executing "Keyset->FPC" on the keyset of SD clones. FSD2 is available from the FPC Web site.
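The core of a simulated digest, cutting a sequence at each occurrence of a restriction site and recording the fragment sizes, can be sketched as follows. This is a minimal Python illustration, not the FSD2 program: the recognition site (`AAGCTT`) and cut offset are assumptions because the enzyme is not named in this section, the sequence is treated as linear, and the division of long sequences into overlapping SD clones and the size-to-band conversion are not modeled.

```python
def simulated_digest(sequence, site="AAGCTT", cut_offset=1):
    """Return fragment sizes (in bases) for a linear sequence.

    site and cut_offset are illustrative assumptions: the cut is placed
    cut_offset bases after the start of each occurrence of site.
    """
    seq = sequence.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)  # advance by one so overlapping sites are found
    bounds = [0] + cuts + [len(seq)]
    # Skip zero-length pieces, which occur only for an empty input sequence.
    return [end - start for start, end in zip(bounds, bounds[1:]) if end > start]
```

The fragment sizes always sum to the sequence length, which is a convenient sanity check before the sizes are converted to bands.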
BioPerl (www.bioperl.org) is an initiative that seeks to simplify bioinformatics development by providing perl objects that perform mundane tasks such as parsing a file and retrieving information from it. To further this undertaking, we have developed a BioPerl module that reads an FPC file and allows the user to extract information from it. For example, one may retrieve all markers in a particular contig or find all clones attached to a marker. This module also converts FPC data into GFF format suitable for input into a Generic Genome Browser database, discussed in the next section.
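The kinds of queries and the GFF export mentioned above can be sketched with a small in-memory model. This is a Python analogue written for illustration, not the BioPerl module or its API: the class and method names are hypothetical, the input is supplied directly rather than parsed from an FPC file, and the GFF feature type and `Name=` attribute are assumptions about what a browser configuration might expect.

```python
from collections import defaultdict


class FpcMap:
    """Toy container for clones, their contigs, and attached markers."""

    def __init__(self):
        self.clones = {}                       # clone name -> (contig, left, right)
        self.marker_clones = defaultdict(set)  # marker name -> clone names

    def add_clone(self, name, contig, left, right, markers=()):
        self.clones[name] = (contig, left, right)
        for marker in markers:
            self.marker_clones[marker].add(name)

    def clones_for_marker(self, marker):
        """All clones attached to a marker, sorted by name."""
        return sorted(self.marker_clones.get(marker, ()))

    def markers_in_contig(self, contig):
        """All markers attached to at least one clone in the contig."""
        return sorted(
            marker
            for marker, names in self.marker_clones.items()
            if any(self.clones[n][0] == contig for n in names)
        )

    def to_gff(self, source="FPC"):
        """One nine-column, tab-separated GFF-style line per clone."""
        lines = []
        for name, (contig, left, right) in sorted(self.clones.items()):
            cols = [f"ctg{contig}", source, "clone", str(left), str(right),
                    ".", ".", ".", f"Name={name}"]
            lines.append("\t".join(cols))
        return lines
```

Here each contig acts as the reference sequence and clone coordinates are passed through unchanged; a real conversion would need to agree with the coordinate units and feature types defined in the browser's configuration file.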
The GMOD is "a joint effort ...to develop reusable components suitable for creating new community databases of biology" (Stein et al., 2002
Scott Pearson and Luke Delorna wrote the WebChrom. Jayesh Sharma wrote the original WebBSS. Kiran Rao developed the rice and maize sequencing status pages and wrote the FSD2 script from the original FSD/ESD scripts. We thank Rod Wing and William Nelson for their valuable feedback on this manuscript. The Maize Mapping Project is a collaboration with the University of Missouri (PI Ed Coe, Karen Cone, Georgia Davis, Jack Gardiner, Michael McMullen, Mary Polacco, and Hector Sanchez Villeda), the University of Georgia (Andrew Paterson), and the University of Arizona (Rod Wing and Cari Soderlund). Received November 18, 2004; returned for revision February 15, 2005; accepted February 20, 2005.
1 This work was supported in part by the U.S. Department of Agriculture Initiative for Future Agriculture and Food Systems (grant no. 11180) and by the National Science Foundation (grant no. 0213764).
www.plantphysiol.org/cgi/doi/10.1104/pp.104.056291.
* Corresponding author; e-mail cari@agcol.arizona.edu; fax 520-626-2632.
Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2004) GenBank update. Nucleic Acids Res 32: D23-D26
Boeckmann B, Bairoch A, Apweiler R, Blatter M, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, et al (2003) The Swiss-Prot protein knowledgebase and its supplement TrEMBL. Nucleic Acids Res 31: 365-370
Chen M, Presting G, Barbazuk W, Goicoechea J, Blackmon B, Fang G, Kim H, Frisch D, Yu Y, Higingbottom S, et al (2001) An integrated physical and genetic map of the rice genome. Plant Cell 14: 537-545
Coe E, Cone K, McMullen M, Chen S-S, Davis G, Gardiner J, Liscum E, Polacco M, Paterson A, Sanchez-Villeda H, et al (2002) Access to the maize genome: an integrated physical and genetic map. Plant Physiol 128: 9-12
Cone K, McMullen M, Bi IV, Davis G, Yim Y, Gardiner J, Polacco M, Sanchez-Villeda H, Fang Z, Schroeder S, et al (2002) Genetic, physical, and informatics resources for maize. On the road to an integrated map. Plant Physiol 130: 1598-1605
Ding Y, Johnson MD, Colayco R, Chen YJ, Melnyk J, Schmitt H, Shizuya H (1999) Contig assembly of bacterial artificial chromosome clones through multiplexed fluorescence-labeled fingerprinting. Genomics 56: 237-246
Engler F, Hatfield J, Nelson W, Soderlund C (2003) Locating sequence on FPC maps and selecting a minimal tiling path. Genome Res 13: 2152-2163
Engler F, Soderlund C (2002) Software for physical maps. In Ian Dunham, ed, Genomic Mapping and Sequencing, Genome Technology Series. Horizon Press, Norfolk, UK, pp 200-236
Fang Z, Cone K, Sanchez-Villeda H, Polacco M, McMullen M, Schroeder S, Gardiner J, Davis G, Havermann S, Yim Y, et al (2003) iMap: a database-driven utility to integrate and access the genetic and physical maps of maize. Bioinformatics 19: 2105-2111
Fjell C, Bosdet I, Schein J, Jones S, Marra M (2003) Internet Contig Explorer (iCE): a tool for visualizing clone fingerprint maps. Genome Res 13: 1244-1249
Gardiner J, Schroeder S, Polacco ML, Sanchez-Villeda H, Fang Z, Morgante M, Landewe T, Fengler K, Useche F, Hanafey M, et al (2004) Anchoring 9,371 maize expressed sequence tagged unigenes to the bacterial artificial chromosome contig map by two-dimensional overgo hybridization. Plant Physiol 134: 1317-1326
Harushima Y, Yano M, Shomura A, Sato M, Shimano T, Kuboki Y, Yamamoto T, Lin SY, Antonio BA, Parco A, et al (1998) A high-density rice genetic linkage map with 2275 markers using a single F2 population. Genetics 148: 479-494
Lawrence C, Dong Q, Polacco M, Seigfried T, Brendel V (2004) MaizeGDB, the community database for maize genetics and genomics. Nucleic Acids Res 32: D393-D397
Meyers BC, Scalabrin S, Morgante M (2004) Mapping and sequencing complex genomes: let's get physical! Nat Rev Genet 5: 578-588
Nelson W, Soderlund C (2005) Software for restriction fragment physical maps. In K Meksem, G Kahl, eds, The Handbook of Genome Mapping: Genetic and Physical Mapping. Wiley-VCH Verlag GmbH, Weinheim, Germany, pp 285-305
Paterson AH, Bowers JE, Chapman BA (2004) Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc Natl Acad Sci USA 101: 9903-9908
Soderlund C, Humphrey S, Dunhum A, French L (2000) Contigs built with fingerprints, markers and FPC V4.7. Genome Res 10: 1772-1787
Soderlund C, Longden I, Mott R (1997) FPC: a system for building contigs from restriction fingerprinted clones. CABIOS 13: 523-535
Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, et al (2002) The generic genome browser: a building block for a model organism system database. Genome Res 12: 1599-1610
Sulston J, Mallet F, Staden R, Durbin R, Horsnell T, Coulson A (1988) Software for genome mapping by fingerprinting techniques. CABIOS 4: 125-132
The Gene Ontology Consortium (2000) Gene ontology: tool for the unification of biology. Nat Genet 25: 25-29
The International Human Genome Mapping Consortium (2001) A physical map of the human genome. Nature 409: 934-941
Ware D, Jaiswal P, Ni J, Pan X, Chang K, Clark K, Teytelman L, Schmidt S, Zhao W, Cartinhour S, et al (2002) Gramene: a resource for comparative grass genomics. Nucleic Acids Res 30: 103-105
Wilkinson MD, Links M (2002) BioMOBY: an open-source biological web services proposal. Brief Bioinform 3: 331-341