PlantGDB-assembled Unique Transcripts (PUTs) - Motivation and Procedure
(Last updated: Jan 23, 2006)
Sections:
- Data Source
- Contaminations and Repetitive Elements
- PolyA tail
- Duplicates Removal
- Clustering and Assembly
- Refinement
- Final Results
With the ultimate goal of characterizing the plant gene space, PlantGDB regularly assembles unique transcripts from plant mRNA sequences. Our procedure uses the Vmatch, PaCE, and CAP3 programs. This page describes the assembly procedure in detail.
Data sources
Plant mRNA sequences are extracted from NCBI. More specifically, every GenBank record is downloaded from the NCBI FTP site, and the plant mRNA sequences are extracted from the EST, HTC, and PLN divisions and sorted by species. Unless specifically requested by researchers, PlantGDB assembles species-specific PUTs only when the species' mRNA sequence count has reached 10,000. New PUT releases are made available every three months, shortly after the most recent GenBank release. The release version is indicated in the PUT identifier using this nomenclature.
Contamination and repetitive elements
Some sequences deposited in GenBank are
contaminated by non-native sequences derived from
cloning vectors, bacterial hosts, etc. In
addition, abundant repetitive elements (e.g.,
transposons) in a sequence collection also
prevent accurate PUT assembly (for a recent review of
this topic, see
Comparative EST analyses in plant
systems).
In our assembly pipeline, we use the Vmatch program to
identify contamination and repetitive elements by
comparing the mRNA sequences to vector,
bacterial, and repeat databases. Specifically, the
NCBI
UniVec database and the E.
coli genome sequence are used for masking
vector and bacterial contamination, respectively
(Vmatch options: -qmaskmatch X -d -p -l 50
-exdrop 1 -identity 90). After trimming off the
masked contamination nucleotides, the surviving
sequence length must be at least 100 bp in order to
proceed to the next step. Similarly, the TIGR
plant repeat database is used for masking known
repetitive elements (Vmatch options: -qmaskmatch
X -d -p -l 100 -exdrop 2 -identity 80). In this
case, if more than 50% of the nucleotides in a
sequence are masked, the sequence is excluded from
the subsequent assembly. Otherwise, we use the
original unmasked sequence in the next assembly
step (i.e., if we cannot confidently mask the
majority of a sequence's nucleotides as
repetitive elements, we treat it as containing
no repetitive elements at all, to avoid
false masking).
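The two filtering rules above can be sketched in Python as follows. This is a minimal illustration, not the PlantGDB code: it assumes the Vmatch output is available as a string in which masked positions are replaced by X, and it assumes that "trimming" means stripping masked characters from both ends of the sequence. The function names are hypothetical.

```python
MASK_CHAR = "X"  # character written by -qmaskmatch X


def trim_contamination(masked_seq, min_len=100):
    """Strip masked vector/bacterial positions and apply the 100 bp rule.

    Assumption: trimming removes masked characters from both ends only.
    Returns the trimmed sequence, or None if it is too short to proceed.
    """
    trimmed = masked_seq.strip(MASK_CHAR)
    return trimmed if len(trimmed) >= min_len else None


def apply_repeat_filter(original_seq, masked_seq, max_masked_fraction=0.5):
    """Apply the 50% repeat rule.

    If more than half of the positions are masked, the sequence is
    excluded (None). Otherwise the ORIGINAL, unmasked sequence is kept.
    """
    if not masked_seq:
        return None
    masked_fraction = masked_seq.count(MASK_CHAR) / len(masked_seq)
    if masked_fraction > max_masked_fraction:
        return None
    return original_seq


if __name__ == "__main__":
    print(trim_contamination("X" * 30 + "ACGT" * 25) is not None)
    print(apply_repeat_filter("ACGT" * 25, "X" * 51 + "A" * 49))
```

Note that a sequence with exactly 50% masked positions is kept, since the rule excludes only sequences with more than 50% masked.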
PolyA tail
The presence of polyA tails in the transcripts also prevents accurate PUT assembly (i.e., unrelated sequences can be linked together solely by their polyA sequences). Therefore, we first mask polyA sequences (Vmatch options: -d -p -v -l 15 -exdrop 1 -identity 90 -selfun end2end-match.so -qmaskmatch X -seedlength 10, where end2end-match.so is our own customized Vmatch selection function that ensures the masking is performed on end-to-end regions). After the masked polyA region is trimmed off, the surviving sequence must be at least 50 bp long to proceed to the next step.
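As a rough illustration of the trimming rule, the sketch below replaces the Vmatch-based masking with a much simpler stand-in: a regular expression that removes an exact run of at least 15 A's at the 3' end, or of at least 15 T's at the 5' end (the reverse-complement case). Unlike the Vmatch settings above, it tolerates no mismatches, so it is an approximation only. The function name is hypothetical.

```python
import re

# Simplified stand-in for the Vmatch polyA masking (exact runs only).
POLYA_TAIL = re.compile(r"A{15,}$")
POLYT_HEAD = re.compile(r"^T{15,}")


def trim_poly_a(seq, min_len=50):
    """Remove a terminal polyA (or leading polyT) run of 15+ bases.

    Returns the trimmed sequence, or None if fewer than min_len
    bases survive.
    """
    seq = seq.upper()
    seq = POLYA_TAIL.sub("", seq)
    seq = POLYT_HEAD.sub("", seq)
    return seq if len(seq) >= min_len else None


if __name__ == "__main__":
    est = "ACGT" * 15 + "A" * 20
    print(trim_poly_a(est))
```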
Removal of Duplicates
This step is currently perhaps unique to our
assembly procedure (compared with other
resources, e.g., TIGR Gene Indices). GenBank
(especially the EST division) is known to contain
large numbers of duplicates (identical or near-identical
sequences). These duplicates waste substantial
computational resources during assembly (i.e.,
by repeatedly aligning identical and
near-identical sequences). One goal at PlantGDB
is to provide an accurate estimate of the plant gene
space by assembling mRNA fragments in
a timely fashion. Unfortunately, the performance
of current assembly programs (e.g., CAP3) degrades
dramatically with large numbers of input
sequences. In our experience, removing
duplicates greatly speeds up our assembly process,
and the resulting consensus sequences are still
consistent with those assembled from the entire data
set.
To identify duplicates among the input
sequences, we designed a filter that removes
globally similar sequences. Specifically, the
Vmatch program (with options: -d -p -l 50 -exdrop
1 -identity 99) is first applied to compare
every input sequence against every other. Then, if a
sequence A is contained in another sequence B
("contained" is defined as A being
shorter than B and at least 99% of the nucleotides
in A being matched to B), sequence A is
excluded from the subsequent step while
sequence B proceeds. Note that although A
is excluded from later assembly (i.e., it does not
participate in building the consensus sequence),
its sequence information is already represented by
B. In addition, the "contained"
relationship between A and B is stored in
a PlantGDB database table and
displayed on the web so that we can trace the
contribution of A. As a result, we no longer
use the designations "contig" or "singlet"
to classify our FINAL assembly results, because
any final "Unique Transcript" may
also represent its contained sequences.
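The containment rule can be sketched as follows. This is an illustration only: it assumes the Vmatch matches have already been parsed into (query_id, subject_id, matched_bases_in_query) tuples, which is a hypothetical record layout, and that sequence lengths are available in a dictionary.

```python
def find_contained(lengths, matches, min_percent=99):
    """Return a dict mapping each contained sequence to its container.

    lengths: dict of sequence id -> length in bp.
    matches: iterable of (a_id, b_id, matched_bases_in_a) tuples
             (hypothetical layout for parsed Vmatch output).
    A is contained in B if A is shorter than B and at least
    min_percent of A's nucleotides are matched to B. Equal-length
    pairs are not treated as contained under this strict definition.
    """
    contained_in = {}
    for a, b, matched in matches:
        is_shorter = lengths[a] < lengths[b]
        # Integer arithmetic avoids floating-point rounding at the cutoff.
        is_covered = matched * 100 >= min_percent * lengths[a]
        if is_shorter and is_covered:
            current = contained_in.get(a)
            # Record the longest container so A's contribution can be traced.
            if current is None or lengths[b] > lengths[current]:
                contained_in[a] = b
    return contained_in


if __name__ == "__main__":
    lengths = {"A": 200, "B": 500, "C": 200}
    matches = [("A", "B", 199), ("C", "B", 150), ("B", "A", 199)]
    contained = find_contained(lengths, matches)
    survivors = [s for s in lengths if s not in contained]
    print(contained, survivors)
```

Only the survivors proceed to clustering, while the returned mapping preserves the relationship needed to trace each excluded sequence back to its representative.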
Clustering and Assembly
The sequences are first clustered by the PaCE
program, which groups overlapping sequences based
on single-linkage clustering using parallel
computers (PaCE options: match 2, mismatch -4,
gap -1, hgap -6, AlignmentWithN -1, LOADPERPROC 80,
window 11, MinLen 100, ScratchMemory 250,
TranscriptsTogether 0, EndToEndScoreRatioThreshold
10, EndtoEndAlignLenThreshold 80,
MaxScoreRatioThreshold 5,
TranscriptCoverageThreshold 40, ClonePairsFile
None, Keep_Mbuf_Full 0, MPI_Block_Sends 1,
ReportSplicedCandidates 0, ReportMaximalPairs 0,
ReportMaximalSubstrings 0, ReportAcceptedPairs 0,
ReportGeneratedPairs 0, ReportMaximalRepeatCount 1,
DumpClustersMidway 1).
Then, for each resulting PaCE cluster, the CAP3 program
is used to perform the assembly (CAP3 options: -p
95 -o 49 -t 10000). The output is a set of CAP3
contigs/singlets, where the contigs are the
consensus sequences derived from multiple member
mRNA sequences.
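To illustrate what single-linkage clustering means here, the sketch below groups sequences with a serial union-find structure: any two sequences joined by a chain of accepted overlaps end up in the same cluster. This is not how PaCE is implemented (PaCE detects the overlaps itself and runs in parallel); the list of overlapping pairs is assumed as input. The second function builds a CAP3 command line with the options quoted above, assuming a cap3 executable on the PATH and a hypothetical per-cluster FASTA file name.

```python
def single_linkage_clusters(ids, overlapping_pairs):
    """Group ids so that any chain of pairwise overlaps joins a cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in overlapping_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


def cap3_command(fasta_path):
    """Build the CAP3 call for one cluster (executable name is assumed)."""
    return ["cap3", fasta_path, "-p", "95", "-o", "49", "-t", "10000"]


if __name__ == "__main__":
    ids = ["e1", "e2", "e3", "e4", "e5", "e6"]
    pairs = [("e1", "e2"), ("e2", "e3"), ("e4", "e5")]
    print(single_linkage_clusters(ids, pairs))
    print(" ".join(cap3_command("cluster_0001.fasta")))
```

In this example e1 and e3 share a cluster even though they do not overlap directly, because both overlap e2. That transitivity is the defining property of single linkage.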
Refinement
This step is currently perhaps unique to our
assembly procedure. The above PaCE parameters
were chosen to balance sensitivity
(clustering all the overlapping sequences
together), specificity (clustering based on
meaningful end-to-end or global overlap), and
performance (clustering speed). There is no
guarantee that the PaCE results are free of
false negatives (sequences that should be
clustered together but are spread across different
clusters). Any such false negatives are
propagated into the CAP3 assembly step, since CAP3
operates only on individual PaCE clusters.
Therefore, to minimize such potential
false negatives, the resulting CAP3
contigs/singlets are self-clustered using the
Vmatch program (Vmatch options: -d -p -seedlength
15 -l 50 -exdrop 1 -identity 95 -selfun
end2end-match.so -dbcluster 0 0, where
end2end-match.so is our own customized Vmatch
selection function that ensures the clustering is
performed on end-to-end overlaps). If any CAP3
contigs/singlets are clustered together (e.g., the previous
PaCE/CAP3 false negatives), their member mRNA
sequences are pooled for re-assembly
by CAP3. In other words, we give all
potentially overlapping sequences a further opportunity
to be grouped together and to have their
consensus sequences rebuilt.
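The pooling step can be sketched as follows. The sketch assumes two inputs whose layout is hypothetical: a dictionary mapping each first-round CAP3 contig/singlet to its member mRNA ids, and the list of Vmatch self-clusters of those contigs/singlets. Each cluster with two or more members yields one pool of mRNA ids to be re-assembled by CAP3.

```python
def pool_members_for_reassembly(members_by_contig, contig_clusters):
    """Pool member mRNA ids of co-clustered first-round contigs/singlets.

    members_by_contig: dict of contig/singlet id -> list of member mRNA ids.
    contig_clusters: list of lists of contig/singlet ids grouped by the
                     end-to-end self-clustering.
    Returns one list of mRNA ids per cluster that needs re-assembly.
    """
    pools = []
    for cluster in contig_clusters:
        if len(cluster) < 2:
            # Nothing was merged with this contig/singlet; keep it as is.
            continue
        pooled, seen = [], set()
        for contig_id in cluster:
            for member in members_by_contig[contig_id]:
                if member not in seen:
                    seen.add(member)
                    pooled.append(member)
        pools.append(pooled)
    return pools


if __name__ == "__main__":
    members = {
        "contig1": ["est1", "est2"],
        "contig2": ["est3", "est4"],
        "singlet1": ["est5"],
    }
    clusters = [["contig1", "contig2"], ["singlet1"]]
    print(pool_members_for_reassembly(members, clusters))
```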
Final results
After the above refinement, a set of final CAP3
contigs/singlets is obtained, with the assurance
that they overlap minimally with each
other. These final CAP3 contigs/singlets are
designated as PlantGDB-assembled Unique Transcripts
(PUTs). These unique transcripts are then subjected to
our automated functional annotation using BLASTX
(BLAST option: -e 1e-20) against the UniProt protein
database to identify significantly similar
proteins. In addition to the individual record displays on the
web, the final unique transcripts and their
functional annotations can be accessed at: