Plant Physiol, May 2003, Vol. 132, pp. 84-91
Mining for Single Nucleotide Polymorphisms and
Insertions/Deletions in Maize Expressed Sequence Tag
Data1
Jacqueline Batley,2 Gary Barker, Helen O'Sullivan, Keith J. Edwards, and David Edwards*
Agriculture Victoria, Plant Biotechnology Centre, La Trobe
University, Bundoora, Victoria 3086, Australia (J.B., D.E.); and the
School of Biological Sciences, University of Bristol, Bristol BS8 1UG,
United Kingdom (G.B., H.O., K.J.E.)
 |
ABSTRACT |
We have developed a computer-based method to identify
candidate single nucleotide polymorphisms (SNPs) and small
insertions/deletions from expressed sequence tag data. Using a
redundancy-based approach, valid SNPs are distinguished from erroneous
sequence by their representation multiple times in an alignment of
sequence reads. A second measure of validity was also calculated based
on the cosegregation of the SNP pattern between multiple SNP loci in an
alignment. The utility of this method was demonstrated by applying it
to 102,551 maize (Zea mays) expressed sequence
tag sequences. A total of 14,832 candidate polymorphisms were
identified with an SNP redundancy score of two or greater. Segregation
of these SNPs with haplotype indicates that candidate SNPs with high
redundancy and cosegregation confidence scores are likely to represent
true SNPs. This was confirmed by validation of 264 candidate SNPs from 27 loci, with a range of redundancy and cosegregation scores, in four
inbred maize lines. The SNP transition/transversion ratio and
insertion/deletion size frequencies correspond to those observed by
direct sequencing methods of SNP discovery and suggest that the
majority of predicted SNPs and insertion/deletions identified using
this approach represent true genetic variation in maize.
 |
INTRODUCTION |
The development of high-throughput
methods for the detection of single nucleotide polymorphisms (SNPs) and
small indels (insertion/deletions) has led to a revolution in their use
as molecular markers. SNPs are increasingly becoming the marker of
choice in genetic analysis and are used routinely as markers in
agricultural breeding programs (Gupta et al., 2001 ).
They also have many uses in human genetics, such as for the detection
of alleles associated with genetic diseases and the identification of
individuals (Nikiforov et al., 1994 ). SNPs are
invaluable as a tool for genome mapping, offering the potential for
generating very high-density genetic maps, which can be used to develop
haplotyping systems for genes or regions of interest (Rafalski,
2002 ). The low mutation rate of SNPs also makes them excellent
markers for studying complex genetic traits and as a tool for the
understanding of genome evolution (Syvanen, 2001 ).
Unlike random amplified polymorphic DNAs and RFLPs, SNPs are direct
markers because sequence information provides the exact nature of the
allelic variants. They are far more prevalent than microsatellites and,
therefore, may provide a high density of markers near a locus of
interest. Recent evidence has shown that when comparing human DNA from
two individuals, SNPs are found on average every 1 to 2 kb
(Clifford et al., 2000 ; Deutsch et al.,
2001 ). Limited work has been carried out to examine the
occurrence of SNPs in plants, although these preliminary studies have
indicated that SNPs appear to be even more abundant in plant systems
than in the human genome. Germano and Klein (1999)
identified five SNPs in 1 kb of nDNA of Picea rubens and
Picea mariana, and also discovered SNPs in the chloroplasts
of these species. Recently, Coryell et al. (1999)
identified two SNPs in approximately 400 bp of sequence in soybean
(Glycine max). In maize (Zea mays), they have
been found even more frequently, with one SNP approximately every 48 bp
and every 130 bp in 3'-untranslated regions and coding regions,
respectively (Tenaillon et al., 2001 ; Rafalski,
2002 ). Mogg et al. (2002) amplified and
sequenced the flanking region of 97 previously characterized
microsatellite primer sets in 11 maize inbred lines. The sequencing
results indicated that the flanking regions of maize microsatellites
show increased levels of polymorphism when compared with other regions
of the genome, with SNPs in these regions found on average every 40 bp.
As with the majority of molecular markers, one of the limitations of
SNPs is the initial cost associated with their development. A variety
of approaches have been adopted for the discovery of novel SNP markers.
The conversion of microsatellite markers, which identifies the relatively
abundant nucleotide polymorphisms surrounding simple sequence repeats,
has the advantage that many of these markers have already been
characterized (Mogg et al., 2002). A further approach
identifies SNPs in overlapping genomic sequence, a product of the large
genome sequencing programs (Taillon-Miller et al., 1998 ;
Dawson et al., 2001). This latter method requires
experimental confirmation to preclude errors associated with cloning
and sequencing procedures but provides the greatest potential for
cost-effective SNP discovery because it uses preexisting sequence data.
With the development of high-throughput sequencing technology, large amounts of data have been submitted to the various DNA databases that
may be suitable for data mining and SNP discovery. In particular, expressed sequence tag (EST) sequencing programs have provided a wealth
of information, identifying novel genes from a broad range of organisms
and providing an indication of gene expression level in particular
tissues (Adams et al., 1995 ). EST sequence data may
provide the richest source of biologically useful SNPs due to the
relatively high redundancy of gene sequence, the diversity of genotypes
represented within databases, and the fact that each SNP would be
associated with an expressed gene (Picoult-Newberg et al.,
1999). The major drawback of this approach is that the relatively high sequence error rate naturally associated with EST programs may lead to the identification of false positives. We have attempted to
overcome these difficulties by developing software for the automated
detection of SNPs within EST data. A conservative approach was followed
to limit the potential errors associated with cloning and sequencing so
that only polymorphisms represented by two or more sequences were
considered. For each polymorphism, two associated measurements of
confidence in the validity of SNPs were also calculated. The frequency
of occurrence of a polymorphism at a particular locus
provides a primary measure of confidence in the SNP representing a
true polymorphism and is referred to as the SNP redundancy score. The
cosegregation of multiple SNPs within an alignment to define a
haplotype provides a second measure of confidence in SNP validity and
is referred to as the cosegregation score.
To assess this software, we have applied it to maize EST data collated
at ZmDB as part of the Maize Gene Discovery Project (Gai et al.,
2000 ). Analysis of these data and experimental verification of
264 of the identified SNPs indicate that this method is not only
efficient at detecting SNPs between alleles from different maize
genotypes but also identifies expressed paralogous genes present in the
maize genome. Allelic SNP data are of use for current mapping and
genotyping programs, whereas the ability to differentiate between
duplicate paralogous genes is important for the true alignment of
genomic sequence data for the forthcoming maize genome sequencing program (Bennetzen et al., 2002 ).
 |
RESULTS |
cDNA Assembly
Expressed maize sequences (102,551) consisting of a total
of over 46 million nucleotides were retrieved from ZmDB (Gai et al., 2000 ). These sequences are derived from a variety of cDNA libraries produced as part of the maize gene discovery program or
collated from GenBank. The maize inbred lines OH43, W23, B73, W64a, and
BMS represent 32%, 23%, 20%, 6%, and 5% of the sequences, respectively, with the remaining 14% having no defined varietal identification. All sequences were retrieved in FASTA format and assembled into contigs using d2cluster and cap3, with default similarity values
of 80% for d2cluster and 95% for cap3. These sequences
aligned into 13,247 contigs and 19,112 singletons. A total of 6,107 of
the contigs (46%) contained four or more sequence reads, the minimum
required for redundancy-based SNP detection (Table
I).
Table I.
Cluster profile of maize EST data
A total of 102,551 expressed maize sequences were aligned using
d2cluster and cap3. Minimum similarity thresholds of 80% and 95% were
used for d2cluster and cap3, respectively, and a minimum overlap of
100 bases was specified for cap3.
Identifying Candidate SNPs
Of the 6,107 contigs containing four or more sequence
reads, 3,479 (57%) alignments contained candidate SNPs with a
redundancy score of two or greater. Although only 22% of contigs of
four reads contained candidate SNPs, this proportion increased rapidly with 75% of 10 read contigs and 89% of 15 read contigs containing candidate SNPs (Fig. 1). Alignments
representing over 50 sequence reads contained a disproportionate number
of SNPs due to random errors accumulating redundancy scores greater
than one. Therefore, these larger contigs were ignored in calculating
SNP abundance.

Figure 1.
Abundance of maize EST contigs containing
candidate SNPs in relation to contig size. The frequency of contigs
that contained SNPs was calculated for all alignments of increasing
numbers of sequence reads.
A total of 13,122 candidate SNPs were detected with SNP abundance
increasing with increasing contig size (Fig.
2). This relates to approximate SNP
frequencies of one per 600 bp of aligned sequence for five read contigs
to one per 100 bp of aligned sequence for 20 read contigs. SNP scores
ranged from two to 20, with mean SNP score increasing in direct
proportion to the number of sequences in an alignment (Fig.
3).

Figure 2.
Abundance of candidate SNPs identified within
contigs in relation to contig size for maize EST data. The mean SNP
abundance was calculated for all alignments of increasing numbers of
sequence reads.

Figure 3.
A measurement of SNP redundancy score in relation
to contig size for maize EST data. The mean SNP redundancy score was
calculated for all candidate SNPs identified in alignments of
increasing numbers of sequence reads.
Along with the SNP redundancy score, a further measure of confidence of
SNP validity was calculated based on the cosegregation of SNP pattern
between multiple SNP loci within an alignment. SNPs between sequences
representing divergence between two genes (orthologs or paralogs) would
be expected to cosegregate defining a haplotype at multiple loci within
an alignment, whereas sequencing errors would occur randomly between
haplotypes. Where several candidate SNPs are detected within an
alignment, the majority of SNPs cosegregate with a haplotype.
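The cosegregation idea described above can be illustrated with a minimal Python sketch. It is not the Auto_snip implementation: the function names, the relabeling of alleles by order of first appearance, and the choice to count an SNP as sharing its own pattern are assumptions made for illustration. The sketch also assumes every read covers every SNP column, so the weighting for missing data is not reproduced.

```python
from collections import Counter

def pattern(column):
    """Relabel alleles by order of first appearance, so that two columns
    that split the reads into the same groups yield the same pattern."""
    labels = {}
    return tuple(labels.setdefault(base, len(labels)) for base in column)

def cosegregation(alignment, snp_columns):
    """Return {column: (n_snps_sharing_pattern, n_snps)} for the SNP columns.

    alignment: list of equal-length aligned reads (hypothetical input format).
    """
    patterns = {c: pattern([read[c] for read in alignment]) for c in snp_columns}
    tally = Counter(patterns.values())
    return {c: (tally[p], len(snp_columns)) for c, p in patterns.items()}

reads = ["AGCT", "AGCT", "TACA", "TACA", "AGCA"]
# Columns 0 and 1 split the reads identically; column 3 splits them differently.
print(cosegregation(reads, [0, 1, 3]))
```

In this toy alignment the first two SNPs define the same two haplotypes, whereas the third does not follow that split, which is the behavior expected of a random sequencing error or of an additional haplotype.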
Validation of Candidate SNPs
A total of 264 candidate SNPs from 27 loci were validated using
direct sequencing of PCR products. The SNPs were chosen based on a
range of redundancy and cosegregation scores and predicted expression
of multiple genes. Of the 264 candidate SNPs, 241 (91%) were shown to
be true polymorphisms (Table II). Of the
23 candidate SNPs that were shown to be false, 22 (96%) had an SNP
score of 2. Overall, 130 of the candidate SNPs had an SNP score of 2, of which 108 (83.1%) were shown to be true polymorphisms. The average weighted cosegregation value of the validated SNPs was 57.5%, with
scores ranging from 6% to 100%. In all cases, the validated SNPs with
low SNP, cosegregation, and weighted cosegregation scores were in
contigs where many different haplotypes were present (Fig. 4). Frequently, one of these haplotypes
was represented by a single sequence only and, therefore, could only be
confirmed by sequence validation. The average weighted cosegregation
score in the false SNPs was 9.5%, ranging from 4% to 24%, and the
highest cosegregation score was 4/17.
Table II.
Details of the 27 loci genotyped and SNPs validated
Candidate SNPs from 27 loci (264), with a range of redundancy and
cosegregation scores, were validated in the four inbred maize lines
B73, W23, OH43, and W64a. The candidate SNPs that could not be verified
had low redundancy and cosegregation scores. Where multiple genes were
predicted in a sequence alignment and candidate SNPs were predicted
between these genes, these SNPs were verified.

Figure 4.
AutoSNP summary report 246. This report depicts
nine candidate SNPs, identifying their base position in the sequence
alignment along with measures of confidence of SNP validity. The key
relates the aligned sequences to original GenBank sequence
identification and also identifies the maize line (where available)
derived from the GenBank annotation. The SNP redundancy score measures
the minimum number of sequences that represent a polymorphism. The
cosegregation score is a measure of the number of SNPs in the alignment
that share the same pattern of polymorphism between aligned sequences.
The weighted cosegregation score corrects for missing data in the EST
alignments that may otherwise bias the cosegregation score. In this
example, all SNPs were verified as true polymorphisms.
Analysis of Base Changes
Candidate SNPs were categorized according to nucleotide
substitution as either transitions (C/T or G/A) or transversions (C/G, A/T, C/A, or T/G; Table III). There was a
relative increase in the proportion of transitions over transversions.
We also observed a relative increase in the frequency of C/A transversions
and their reverse complement, T/G, compared with C/G and A/T transversions
(Table III).
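The categorization of substitutions can be sketched in a few lines of Python. This is an illustrative helper, not part of the published software; the function names are hypothetical. It relies only on the standard definition that a transition exchanges two purines (A/G) or two pyrimidines (C/T), and a transversion exchanges a purine for a pyrimidine.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify(a, b):
    """Classify a two-allele substitution as a transition or a transversion."""
    a, b = a.upper(), b.upper()
    if a == b or {a, b} - (PURINES | PYRIMIDINES):
        raise ValueError(f"not a nucleotide substitution: {a}/{b}")
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def ts_tv_ratio(pairs):
    """Transition/transversion ratio for an iterable of (allele, allele) pairs."""
    kinds = [classify(a, b) for a, b in pairs]
    ts = kinds.count("transition")
    tv = kinds.count("transversion")
    return ts / tv if tv else float("inf")

for pair in [("C", "T"), ("G", "A"), ("C", "A"), ("T", "G"), ("C", "G"), ("A", "T")]:
    print("/".join(pair), classify(*pair))
```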
Table III.
Nucleotide substitution frequencies for candidate
SNPs identified in maize EST data
Candidate polymorphisms (14,832) were identified with an SNP redundancy
score of 2 or greater. Of these, 13,122 were nucleotide substitutions.
The absolute frequencies of the different types of nucleotide
substitutions were calculated.
Analysis of Insertions/Deletions
As well as nucleotide transitions and transversions, 1,710 insertions/deletions (indels) were identified as having a redundancy score of two or greater. These occur as single nucleotides or strings
of up to 26 bases (Table IV). The
frequency of indels does not decrease exponentially with their increase
in length but displays a relative increase in frequency of six-,
eight-, 10-, 12-, and 15-base indels. Analysis of indel sequences
indicated a bias toward A and T nucleotides for both single-base and
longer indels. Significantly, there was also an underrepresentation of CG dinucleotide indels compared with other dinucleotide indels (Table
V).
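As a rough illustration of how an indel size profile might be tabulated from gapped reads, the Python sketch below measures runs of the spacing character in each aligned read and reports the share of runs whose length is a multiple of three. It is a simplification under stated assumptions: the function names are hypothetical, leading and trailing gaps are treated as padding rather than indels, and each read is counted separately, so the redundancy filter used in this study is not applied.

```python
import re
from collections import Counter

def gap_lengths(aligned_read):
    """Lengths of internal runs of '-' in one gapped read."""
    core = aligned_read.strip("-")  # assumption: end gaps are padding, not indels
    return [len(m.group()) for m in re.finditer(r"-+", core)]

def length_profile(aligned_reads):
    """Tally gap-run lengths and the fraction that are multiples of three."""
    tally = Counter(n for read in aligned_reads for n in gap_lengths(read))
    total = sum(tally.values())
    in_frame = sum(count for n, count in tally.items() if n % 3 == 0)
    return tally, (in_frame / total if total else 0.0)

tally, fraction = length_profile(["AC--GT", "A---CG", "ACG-TA---"])
print(dict(tally), round(fraction, 2))
```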
Table IV.
Prevalence of candidate indels identified in maize
EST data
Candidate polymorphisms (14,832) were identified with an SNP redundancy
score of 2 or greater. Of these, 1,710 were small indels. The absolute
frequencies of indels of different sizes were calculated, along with each
frequency as a percentage of the total number of indels (1,710).
Table V.
The frequency of single and dinucleotide indel
sequences predicted from the alignment of 102,551 maize EST sequences
Of the 1,710 candidate indels identified with an SNP redundancy score
of 2 or greater, 1,014 were 1-bp indels and 230 were 2-bp indels. The
absolute frequencies of the different 1- and 2-bp indel compositions
were calculated, demonstrating the relatively low frequency of C or G
indels.
Differentiation between Paralogs and Orthologs
The identification of multiple cosegregating SNPs within an
alignment of EST sequences allows the accurate prediction of sequence haplotypes. Comparison of predicted haplotypes with known maize lines
from which the sequences were derived allows the identification of
predicted orthologous genes. By examining multiple SNPs from 100 random
alignments representing over 1,450 sequence reads, a total of more than
250 haplotypes were observed. In 66 of these alignments, each of the
haplotypes cosegregated with the maize lines from which the ESTs were
derived, whereas for 34 alignments, the segregation of haplotypes
indicated the expression of multiple genes within a single maize line.
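The comparison of haplotypes with source lines can be sketched as follows. This Python example is an illustration only: the input format (pairs of maize line and aligned sequence) and the function names are assumptions, and it presumes that the SNP columns supplied are true polymorphisms and that each inbred line carries a single allele per gene.

```python
from collections import defaultdict

def haplotypes_by_line(reads, snp_columns):
    """Map each maize line to the set of haplotypes seen among its reads.

    reads: list of (maize_line, aligned_sequence) pairs from one contig;
    maize_line is None when no varietal annotation is available.
    """
    by_line = defaultdict(set)
    for line, seq in reads:
        if line is None:
            continue
        by_line[line].add("".join(seq[c] for c in snp_columns))
    return dict(by_line)

def lines_expressing_multiple_genes(reads, snp_columns):
    """Lines with more than one haplotype, suggesting expressed paralogs."""
    found = haplotypes_by_line(reads, snp_columns)
    return sorted(line for line, haps in found.items() if len(haps) > 1)

contig = [
    ("B73", "ACGTAC"),
    ("B73", "ACGTAC"),
    ("W23", "ATGTGC"),
    ("W23", "ACGTAC"),
    (None, "ATGTGC"),
]
print(lines_expressing_multiple_genes(contig, [1, 4]))
```

Here two haplotypes occur within W23, so the alignment would be counted among those indicating expression of multiple genes within a single line.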
 |
DISCUSSION |
SNPs are becoming the marker of choice for molecular genetic
analysis. As with other molecular markers, their discovery and characterization are both expensive and laborious. Of the methods applied for the discovery of SNPs, the mining of sequence data sets
should provide the cheapest source of abundant SNPs (Gu et al.,
1998 ; Taillon-Miller et al., 1998 ; Buetow
et al., 1999 ; Picoult-Newberg et al., 1999 ).
Although every effort is made to produce and submit sequence of only
the highest quality, the high-throughput nature of the sequencing
programs inevitably leads to the submission of inaccuracies. The
electronic filtering of these data to identify potentially biologically
relevant polymorphisms is thereby hampered by the false calling of
these bases. Previous methods used to identify SNPs in aligned sequence
data have relied on the comparison of sequence trace files to filter out
polymorphisms, where the base calling within one or more of the traces
is of dubious quality and, therefore, likely to be due to sequence
error rather than representative of a true polymorphism (Kwok et
al., 1994 ; Garg et al., 1999 ; Marth et
al., 1999 ). This method, although suitable for comparing
genomic sequence, is limited by the requirement of sequence trace file
data and does not distinguish errors incorporated during the reverse
transcription of mRNA. For highly redundant data sets compiled from a
variety of sources with a limited availability of sequence trace files,
this means of filtering sequence errors from true polymorphisms is not
feasible. However, the redundant nature of these EST data sets does
permit the selection of polymorphisms that occur multiple times within
a set of aligned sequences. The frequency of occurrence of a
polymorphism at a particular locus provides a measure of confidence in
the SNP representing a true polymorphism and is referred to as the SNP
redundancy score. By examining SNPs that have a redundancy score of two
or greater, i.e. two or more of the aligned sequences represent the
polymorphism, the vast majority of sequencing errors are removed.
Although some true genetic variation is also ignored due to its
presence only once within an alignment, the high degree of redundancy
within the data permits the rapid identification of large numbers of SNPs from data collated from a variety of sources.
We have applied this SNP detection method to maize sequence data
compiled at ZmDB that consist both of sequences derived from the maize
gene discovery program and those submitted to the DNA sequence
repositories by other researchers. The 102,551 EST sequences were
processed using stringent parameters to limit the alignment of multiple
genes from gene families and identify polymorphisms between homologs
from different maize lines. By removing from the analysis the 54% of
contigs that contained fewer than four aligned sequence reads, we
greatly restrict the number of potentially polymorphic
loci that may be detected. However, this sacrifice is necessary if we
are to use redundancy to measure confidence in the validity of SNPs
from the remaining loci. Analysis of larger data sets would increase
the proportion of contigs containing four or more sequence reads and,
therefore, would identify SNPs in a greater number of genes. The
observation that the proportion of contigs that contain SNPs increases
with contig size (Fig. 1) suggests that the number of SNP loci
identified would increase with larger data sets. The mean number of
SNPs identified per locus also increases with increasing number of
sequences aligned (Fig. 2). This suggests that larger data sets with
increased contig sizes would provide a greater number of SNPs per
locus, whereas the increase in mean SNP score with contig size (Fig. 3)
indicates that larger data sets would also provide a greater confidence in the validity of the predicted polymorphisms. Together, these results
indicate that, although we have identified a large number of candidate
SNPs in maize, these only represent a small proportion of the total
genetic variation between maize expressed sequences.
Although using a redundancy-based approach to distinguish between
sequence errors and true SNPs is highly efficient, the nonrandom nature
of sequence error may lead to certain sequence errors within complex
DNA structures being repeated between runs. Therefore, errors at these
loci would have a relatively high SNP redundancy score and appear as
confident SNPs. To identify these sequence errors and distinguish them
from true SNPs, a further measure of SNP confidence was also
calculated. Sequencing errors at complex loci occur randomly with respect to haplotype,
whereas SNPs that represent divergence between homologous genes would
cosegregate with haplotype. A cosegregation score based on the
frequency of an SNP pattern occurring at multiple loci in an alignment
allows ready identification of non-cosegregating SNPs. Weighting this
score to account for the number of SNP loci and missing sequence data
within an alignment further permits comparison of cosegregation scores
across alignments. The SNP score and cosegregation score together
provide a means for estimating confidence in the validity of SNPs
within aligned sequences.
SNPs may differentiate between duplicate genes within a genome
(paralogs) or orthologous genes between maize lines. Where validated
SNPs are present in sequences derived from the same line, they must be
due to gene duplication and the expression of the resulting paralogous
genes. SNPs that define haplotypes that differentiate between maize
lines may represent orthologous genes, although it is possible that
segregation of these SNPs may be coincidental or reflect the
differential expression of paralogous genes found in cDNA libraries
produced from different lines. Analysis of SNPs from 100 random contigs
suggests that 34 contain multiple genes from the same line. This is in
line with other reports of duplicate gene copy number in maize and reflects its ancient allotetraploid origin (Gaut and Doebley, 1997 ; Gaut, 2001 ). Estimation of the copy number
of closely related genes is essential for the avoidance of errors in
whole genome sequence assembly (Eichler, 1998 ).
Therefore, the availability of large numbers of SNPs that differentiate
between maize paralogs should assist in the assembly of the sequence
data from the proposed maize genome sequencing program
(Bennetzen et al., 2002 ).
The high frequency of transitions detected has been observed in
previous SNP discovery programs (Garg et al., 1999 ;
Picoult-Newberg et al., 1999 ; Deutsch et al.,
2001 ) and reflects the high frequency of the C to T mutation
after methylation (Coulondre et al., 1978 ). The relative
abundance of the C/A and its reverse complement, T/G transversions
compared with C/G and A/T transversions, was unexpected and remains to
be explained.
Along with SNPs, 1,710 predicted indel polymorphisms in maize EST data
were identified. The majority of indels (>80%) were three bases or
less in length with a disproportionate increase in the frequency of
six-, eight-, 10-, 12-, and 15-base indels (Table IV). This
distribution is similar to that observed by Bhattramakki et al.
(2002), with the exception that we failed to identify large indels, presumably because larger indels limit contig assembly. The increased frequency of indels whose sizes increase in steps of 3 bp suggests selection for the conservation of reading frames within coding sequence. There was no
differentiation between coding or non-coding regions during contig
assembly or in the screening for indels. Although indels occur most
frequently in non-coding regions, our data suggest that coding regions
may also be a rich source of indel polymorphisms. Indels may be
produced by errors in DNA synthesis, repair, or recombination or be due
to the insertion and excision of transposable elements that often leave
a characteristic DNA footprint of several nucleotide bases. The
relative abundance of eight base indels was also observed in maize by
Bhattramakki et al. (2002) and may be due to sequence
duplication during insertion and excision of Ac/Ds
transposable elements (Sutton et al., 1984 ). The high
frequency of 10 base indels does not correspond to any characterized
maize transposon footprint, and the source of these indels remains
unexplained. Comparison of indel dinucleotide sequence frequencies
reveals a relative bias against CG dinucleotides (Table V). This may reflect methylation and conversion of these sequences (Coulondre et al., 1978 ).
A selection of SNPs, representing a range of SNP type, redundancy
scores, cosegregation scores, and predicted expression of multiple
genes, were validated using direct sequencing of PCR products. Of these
SNPs, 91% were verified to be true polymorphisms, demonstrating the
ability of the program to predict true polymorphisms. The redundancy
and cosegregation scores were also demonstrated to be accurate
indicators of valid polymorphisms because all false SNPs had low
scores. However, a small proportion of the validated SNPs also had low
cosegregation scores; this was due to the presence of many haplotypes
in the contig. This suggests that the greater the number of sequences
aligned, the more haplotypes are accurately predicted.
 |
CONCLUSION |
In total, we have identified 14,832 candidate sequence
polymorphisms in maize EST sequence data, along with two measures of
confidence for each predicted polymorphism. Segregation of these SNPs
with haplotype along with validation demonstrates that candidate SNPs
with high redundancy and cosegregation confidence scores are likely to
represent true SNPs. The transition to transversion ratio and indel
size frequencies correspond to those observed by direct sequencing
methods of SNP discovery and suggest that the majority of predicted
SNPs and indels identified using this approach represent true genetic
variation in maize.
 |
MATERIALS AND METHODS |
Auto SNP Version 1.0
Candidate SNPs were detected using the PERL script Auto_snip
version 1.0 available from the authors. Auto_snip clusters and contigs
FASTA format sequences by acting as a wrapper for the clustering
package d2cluster (Burke et al., 1999 ) and the contig building package cap3 (Huang and Madan, 1999 ). Using
d2cluster to break the sequences into subgroups for cap3 assembly
allows the analysis of more than 100,000 sequence reads on a desktop personal computer (1.3 GHz PIII, 576 MB RAM) running Red Hat Linux. Auto_snip parses the d2cluster output table and generates a set of
cluster output files in multiple FASTA format. These files are then
passed as input to cap3 for contig building, after which the AceDB
output file is parsed to produce a series of gapped multiple FASTA
format files. Contigs containing at least four reads were selected for
SNP detection by SNP score. Spacing characters (-) added during
sequence alignment were considered as a fifth element in addition to A,
C, G, and T. This permits the identification of insertion/deletion
polymorphisms between sequences that may be used to differentiate
between genes using an SNP-based assay. Where a nucleotide polymorphism
was shared between two or more sequences, a candidate SNP was recorded
and an SNP score was allocated that was equal to the minimum number of
reads that share a common polymorphism. Where several SNPs are present
in an alignment, a redundant cosegregation score was calculated for
each SNP. This was measured as the frequency of that SNP pattern
occurring among each of the SNPs identified in the alignment. This
figure was then normalized to the number of sequences and number of
SNPs detected in the alignment to produce a standard cosegregation score. An HTML format output file is generated to allow the user to
browse through the SNP results. Relevant statistics are written to a
summary HTML page and process log file during the clustering, contig
building, and SNP detection phases of the analysis. Minimum similarity
thresholds of 80% and 95% were used for d2cluster and cap3,
respectively, and a minimum overlap of 100 bases was specified for cap3.
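The redundancy scoring step can be illustrated with a short Python sketch. It is not the Auto_snip PERL code. The function name, the use of '.' to mark positions a read does not cover, and the reading of "minimum number of reads that share a common polymorphism" as the smallest count among alleles seen at least twice are all assumptions made for this example.

```python
from collections import Counter

ALPHABET = set("ACGT-")  # '-' (alignment gap) is treated as a fifth element

def snp_scores(alignment, min_reads=4, min_score=2):
    """Return {column_index: score} for candidate SNPs in a gapped alignment.

    alignment: list of equal-length strings; '.' marks positions that a
    read does not cover (an assumption of this sketch).
    """
    if len(alignment) < min_reads:
        return {}  # too few reads for redundancy-based detection
    scores = {}
    for col in range(len(alignment[0])):
        counts = Counter(read[col] for read in alignment if read[col] in ALPHABET)
        # Keep only alleles represented by at least min_score reads.
        supported = [n for n in counts.values() if n >= min_score]
        if len(supported) >= 2:
            scores[col] = min(supported)
    return scores

reads = [
    "ACGTACGT",
    "ACGTACGT",
    "ACATAC-T",
    "ACATAC-T",
    "ACGTTCGT",
]
print(snp_scores(reads))
```

In this example the single T at column 4 is ignored as a possible sequencing error, whereas the A/G difference at column 2 and the one-base indel at column 6 are each supported by two reads and are reported with a score of 2.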
Experimental Validation
Twenty-seven SNP reports were selected for validation of 264 SNPs, based on a range of redundancy scores, cosegregation scores, and
predicted multiple copy genes. Genomic DNA was isolated from the four
inbred maize (Zea mays) lines OH43, B73, W23, and W64a, using the procedure of Edwards et al. (1991) .
Amplification of the 27 loci was performed using primers designed to
the conserved sequence surrounding the SNPs, using the primer design
program PRIMER version 0.5 (Whitehead Institute, Cambridge, MA).
Amplifications were carried out in a 25-µL reaction volume containing
25 ng of DNA, 2.5 µL of 10× PCR reaction buffer (Qiagen, Valencia,
CA), 15 pmol forward and reverse primers, 200 µM of each
dNTP, and 2 units of HotStar Taq polymerase (Qiagen).
After an initial hot start at 95°C for 15 min, the following cycling
parameters were employed: denaturation at 94°C for 1 min, annealing
at 55°C for 1 min, and extension at 72°C for 1 min. After 35 rounds
of amplification, a final extension step was performed at 72°C for 10 min. All PCR reactions were performed in a PE9700 DNA thermal cycler
(PE Biosystems, Foster City, CA). After amplification, PCR products
were purified by electrophoresis and subsequent elution from 1.2%
(w/v) agarose gels using a QiaEX II gel extraction kit (Qiagen).
Gel-purified PCR products were sequenced according to the protocol
outlined in the DYEnamic ET Dye Terminator kit (Amersham Biosciences,
Little Chalfont, Buckinghamshire, UK), using both forward and reverse
PCR primers, and analyzed using a MegaBACE 1000 DNA analysis system
(Molecular Dynamics, Sunnyvale, CA). To obtain an accurate consensus
sequence, individual PCR products were sequenced at least twice using
both the forward and reverse primers. Allele sequences from each locus
and inbred line were aligned and compared using Sequencher (GeneCode,
Ann Arbor, MI), and each of the 264 SNPs was assessed.
 |
FOOTNOTES |
Received December 18, 2002; returned for revision February 17, 2003; accepted February 26, 2003.
1
This work was supported by the Biotechnology and
Biological Sciences Research Council and the Victorian Bioinformatics
Consortium (UK; grant-aided support to IACR-Long Ashton, Investigating
Gene Function initiative grant no. IGF12403 to D.E. and G.B.,
and grant no. D14009 to H.O.). Detailed results from this study are
available on-line at www.cerealsdb.uk.net. D.E. is supported by the
Victorian Bioinformatics Consortium.
2
Present address: School of Biological Sciences,
University of Bristol, Woodland Road, Bristol BS8 1UG, UK.
*
Corresponding author; e-mail Dave.Edwards@nre.vic.gov.au;
fax 61-3-94793618.
Article, publication date, and citation information can be found at
www.plantphysiol.org/cgi/doi/10.1104/pp.102.019422.
 |
LITERATURE CITED |
-
Adams MD, Kerlavage AR, Fleischmann RD, Fuldner RA, Bult CJ, Lee NH, Kirkness EF, Weinstock KG, Gocayne JD, White O, et al
(1995)
Initial assessment of human gene diversity and expression patterns based upon 83-million nucleotides of cDNA sequence.
Nature
377: 3[Medline]
-
Bennetzen JL, Chandler VL, Schnable P
(2002)
National Science Foundation-Sponsored Workshop Report: Maize Genome Sequencing Project.
Plant Physiol
127: 1572-1578
-
Bhattramakki D, Dolan M, Hanafey M, Wineland R, Vaske D, Register JC III, Tingey SV, Rafalski A
(2002)
Insertion-deletion polymorphisms in 3' regions of maize genes occur frequently and can be used as highly informative genetic markers.
Plant Mol Biol
48: 539-547[CrossRef][Web of Science][Medline]
-
Buetow KH, Edmonson MN, Cassidy AB
(1999)
Reliable identification of large numbers of candidate SNPs from public EST data.
Nat Genet
21: 323-325[CrossRef][Web of Science][Medline]
-
Burke J, Davison D, Hide W
(1999)
d2_cluster: a validated method for clustering EST and full-length cDNA sequences.
Genome Res
9: 1135-1142[Abstract/Free Full Text]
-
Clifford R, Edmonson M, Hu Y, Nguyen C, Scherpbier T, Buetow KH
(2000)
Expression-based genetic/physical maps of single nucleotide polymorphisms identified by the cancer genome anatomy project.
Genome Res
10: 1259-1265[Abstract/Free Full Text]
-
Coryell VH, Jessen H, Schupp JM, Webb D, Keim P
(1999)
Allele-specific hybridisation markers for soybean.
Theor Appl Genet
101: 1291-1298[CrossRef]
-
Coulondre C, Miller JH, Farabaugh PJ, Gilbert W
(1978)
Molecular basis of base substitution hot spots in Escherichia coli.
Nature
274: 775-780[CrossRef][Medline]
-
Dawson E, Chen Y, Hunt S, Smink LJ, Hunt A, Rice K, Livingston S, Bumpstead S, Bruskiewich R, Sham P, et al
(2001)
A SNP resource for human chromosome 22: extracting dense clusters of SNPs from the genomic sequence.
Genome Res
11: 170-178[Abstract/Free Full Text]
-
Deutsch S, Iseli C, Bucher P, Antonarakis SE, Scott HS
(2001)
A cSNP map and database for human chromosome 21.
Genome Res
11: 300-307[Abstract/Free Full Text]
-
Edwards K, Johnstone C, Thompson C
(1991)
A simple and rapid method for the preparation of plant genomic DNA for PCR analysis.
Nucleic Acids Res
19: 1349[Free Full Text]
-
Eichler EE
(1998)
Masquerading repeats: paralogous pitfalls of the human genome.
Genome Res
8: 758-762[Free Full Text]
-
Gai X, Lal S, Xing L, Brendel V, Walbot V
(2000)
Gene discovery using the maize genome database ZmDB.
Nucleic Acids Res
28: 94-96[Abstract/Free Full Text]
-
Garg K, Green P, Nickerson DA
(1999)
Identification of candidate coding region single nucleotide polymorphisms in 165 human genes using assembled expressed sequence tags.
Genome Res
9: 1087-1092[Abstract/Free Full Text]
-
Gaut BS
(2001)
Patterns of chromosomal duplication in maize and their implications for comparative maps of the grasses.
Genome Res
11: 55-66[Abstract/Free Full Text]
-
Gaut BS, Doebley JF
(1997)
DNA sequence evidence for the segmental allotetraploid origin of maize.
Proc Natl Acad Sci USA
94: 6809-6814[Abstract/Free Full Text]
-
Germano J, Klein AS
(1999)
Species specific nuclear and chloroplast single nucleotide polymorphisms to distinguish Picea glauca, P. mariana and P. rubens.
Theor Appl Genet
99: 37-49[CrossRef]
-
Gu Z, Hillier L, Kwok P-Y
(1998)
Single nucleotide polymorphism hunting in cyberspace.
Hum Mutat
12: 221-225[CrossRef][Web of Science][Medline]
-
Gupta PK, Roy JK, Prasad M
(2001)
Single nucleotide polymorphisms: a new paradigm for molecular marker technology and DNA polymorphism detection with emphasis on their use in plants.
Curr Sci
80: 524-535
-
Huang X, Madan A
(1999)
CAP3: a DNA sequence assembly program.
Genome Res
9: 868-877[Abstract/Free Full Text]
-
Kwok PY, Carlson C, Yager TD, Ankener W, Nickerson DA
(1994)
Comparative analysis of human DNA variations by fluorescence-based sequencing of PCR products.
Genomics
23: 138-144[CrossRef][Web of Science][Medline]
-
Marth GT, Korf I, Yandell MD, Yeh RT, Gu ZJ, Zakeri H, Stitziel NO, Hillier L, Kwok PY, Gish WR
(1999)
A general approach to single-nucleotide polymorphism discovery.
Nat Genet
23: 452-456[CrossRef][Web of Science][Medline]
-
Mogg R, Batley J, Hanley S, Edwards D, O'Sullivan H, Edwards KJ
(2002)
Characterising the flanking regions of Zea mays microsatellites reveals a large number of useful sequence polymorphisms.
Theor Appl Genet
105: 532-543[CrossRef][Web of Science][Medline]
-
Nikiforov TT, Rendle RB, Goelat P, Rogers Y-H, Kotewicz ML, Anderson S, Trainor GL, Knapp MR
(1994)
Genetic bit analysis: a solid phase method for typing single nucleotide polymorphisms.
Nucleic Acids Res
22: 4167-4175[Abstract/Free Full Text]
-
Picoult-Newberg L, Ideker TE, Pohl MG, Taylor SL, Donaldson MA, Nickerson DA, Boyce-Jacino M
(1999)
Mining SNPs from EST databases.
Genome Res
9: 167-174[Abstract/Free Full Text]
-
Rafalski A
(2002)
Applications of single nucleotide polymorphisms in crop genetics.
Curr Opin Plant Biol
5: 94-100[CrossRef][Web of Science][Medline]
-
Sutton WD, Gerlach WL, Schwartz D, Peacock WJ
(1984)
Molecular analysis of Ds controlling element mutations at the Adh1 locus of maize.
Science
223: 1265-1268[Abstract/Free Full Text]
-
Syvanen AC
(2001)
Genotyping single nucleotide polymorphisms.
Nat Rev Genet
2: 930-942[CrossRef][Web of Science][Medline]
-
Taillon-Miller P, Gu ZJ, Li Q, Hillier L, Kwok PY
(1998)
Overlapping genomic sequences: a treasure trove of single-nucleotide polymorphisms
Genome Res
8: 748-754[Abstract/Free Full Text]
-
Tenaillon MI, Sawkins MC, Long AD, Gaut RL, Doebley JF, Gaut BS
(2001)
Patterns of DNA sequence polymorphism along chromosome 1 of maize (Zea mays ssp mays L.).
Proc Natl Acad Sci USA
98: 9161-9166[Abstract/Free Full Text]
© 2003 American Society of Plant Biologists