PlantGDB

PlantGDB-assembled Unique Transcripts (PUTs) - Motivation and Procedure

(Last updated: Jan 23, 2006)

PUT Procedure Diagram

With the ultimate goal of characterizing the plant gene space, PlantGDB regularly assembles unique transcripts from plant mRNA sequences. Our procedure uses the Vmatch, PaCE, and CAP3 software programs. This page describes the assembly procedure in detail.

Data sources

Plant mRNA sequences are extracted from NCBI. More specifically, every GenBank record is downloaded from the NCBI FTP site, and the plant mRNA sequences are extracted from the EST, HTC, and PLN divisions and sorted by species. Unless specifically requested by researchers, PlantGDB assembles species-specific PUTs only once the species' mRNA sequence count has reached 10,000. New releases of PUTs become available every three months, shortly after the most recent GenBank release. The release version is indicated in the PUT identifier using this nomenclature.
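The species selection rule can be illustrated with a minimal Python sketch. The function name, the `requested` argument, and the input format (one species name per extracted mRNA record) are assumptions made for illustration and are not part of the PlantGDB pipeline.

```python
from collections import Counter

MIN_SEQUENCES = 10_000  # threshold stated in the text


def species_to_assemble(species_per_record, requested=frozenset()):
    """Return the species that qualify for PUT assembly.

    species_per_record: iterable with one species name per mRNA record
    requested: species assembled on researcher request regardless of count
    (both are hypothetical inputs for this sketch)
    """
    counts = Counter(species_per_record)
    return sorted(
        species
        for species, n in counts.items()
        if n >= MIN_SEQUENCES or species in requested
    )
```

For example, a species with 9,999 records is skipped unless it appears in `requested`.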

Contamination and repetitive elements

Some sequences deposited in GenBank are contaminated by non-native sequences derived from cloning vectors, bacterial hosts, and other sources. In addition, abundant repetitive elements (e.g., transposons) in a sequence collection also prevent accurate PUT assembly (for a recent review on this topic, see Comparative EST analyses in plant systems).
In our assembly pipeline, we use the Vmatch program to identify contamination and repetitive elements by comparing the mRNA sequences to vector, bacterial, and repeat databases. Specifically, the NCBI UniVec database and the E. coli genome sequence are used for masking vector and bacterial contamination, respectively (Vmatch options: -qmaskmatch X -d -p -l 50 -exdrop 1 -identity 90). After the masked contamination nucleotides are trimmed off, the surviving sequence must be at least 100 bp long to proceed to the next step. Similarly, the TIGR plant repeat database is used for masking known repetitive elements (Vmatch options: -qmaskmatch X -d -p -l 100 -exdrop 2 -identity 80). In this case, if more than 50% of the nucleotides in a sequence are masked, the sequence is excluded from the subsequent assembly. Otherwise, the original unmasked sequence is used in the next assembly step (i.e., if we cannot confidently mask the majority of a sequence's nucleotides as repetitive elements, we treat the sequence as containing no repetitive elements at all, to avoid false masking).
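The two filtering rules above can be sketched in Python as follows. This operates on sequences that have already been masked with X (as Vmatch does with -qmaskmatch X). The function names are hypothetical, and the sketch assumes that "trimming" keeps the longest unmasked stretch; the text does not say how internal masked regions are handled.

```python
MASK_CHAR = "X"           # masking character, as in -qmaskmatch X
MIN_CLEAN_LENGTH = 100    # bp required after contamination trimming
MAX_REPEAT_FRACTION = 0.5 # more than this fraction masked => excluded


def longest_unmasked_segment(masked_seq):
    """Assumed trimming rule: keep the longest run without mask characters."""
    return max(masked_seq.split(MASK_CHAR), key=len)


def passes_contamination_filter(masked_seq):
    """True if enough sequence survives vector/bacterial trimming."""
    return len(longest_unmasked_segment(masked_seq)) >= MIN_CLEAN_LENGTH


def repeat_decision(original_seq, repeat_masked_seq):
    """Return None if the sequence is excluded as repetitive,
    otherwise the ORIGINAL unmasked sequence (masking is discarded)."""
    if not original_seq:
        return None
    masked_fraction = repeat_masked_seq.count(MASK_CHAR) / len(original_seq)
    if masked_fraction > MAX_REPEAT_FRACTION:
        return None
    return original_seq
```

Note the asymmetry: contamination masking shortens the sequence, while repeat masking only decides whether the whole sequence is kept or dropped.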

PolyA tail

The presence of polyA tails in the transcripts also prevents accurate PUT assembly (i.e., unrelated sequences can be linked together solely by their polyA sequences). Therefore, we first mask polyA sequences (Vmatch options: -d -p -v -l 15 -exdrop 1 -identity 90 -selfun end2end-match.so -qmaskmatch X -seedlength 10, where end2end-match.so is our own customized Vmatch selection function that ensures the masking is performed on the end-to-end region). After the masked polyA region is trimmed off, the surviving sequence must be at least 50 bp long to proceed to the next step.
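As a rough illustration of the trimming and length check, the sketch below uses a regular expression as a simplified stand-in for the Vmatch masking. It only recognizes exact runs of at least 15 A's at the 3' end, so it does not reproduce the 90% identity tolerance or the custom selection function. The function names are hypothetical.

```python
import re

MIN_TAIL_LENGTH = 15     # mirrors the -l 15 minimum match length
MIN_TRIMMED_LENGTH = 50  # bp required after polyA trimming

# Exact run of A's anchored at the end of the sequence (simplification).
_POLY_A_TAIL = re.compile(r"A{%d,}$" % MIN_TAIL_LENGTH)


def trim_poly_a(seq):
    """Remove a terminal polyA run of at least MIN_TAIL_LENGTH bases."""
    return _POLY_A_TAIL.sub("", seq.upper())


def keep_after_poly_a_trim(seq):
    """Return the trimmed sequence, or None if it is too short to proceed."""
    trimmed = trim_poly_a(seq)
    return trimmed if len(trimmed) >= MIN_TRIMMED_LENGTH else None
```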

Removal of Duplicates

This step is currently perhaps unique to our assembly procedure (compared with other resources, e.g., TIGR Gene Indices). GenBank (especially the EST division) is known to contain many duplicates (identical or near-identical sequences). These duplicates waste considerable computational resources during assembly (i.e., the identical and near-identical sequences are aligned repeatedly). One goal at PlantGDB is to provide an accurate estimate of the plant gene space by assembling mRNA fragments in a timely fashion. Unfortunately, current assembly programs (e.g., CAP3) slow down dramatically with large numbers of input sequences. In our experience, removing the duplicates greatly speeds up the assembly process, and the resulting consensus sequences are still compatible with those assembled from the entire data set.
To identify duplicates among the input sequences, we designed a filter that removes globally similar sequences. Specifically, the Vmatch program (with options: -d -p -l 50 -exdrop 1 -identity 99) is first applied to compare every input sequence against every other. Then, if a sequence A is contained in another sequence B ("contained" means that A is shorter than B and at least 99% of the nucleotides in A are matched to B), sequence A is excluded from the subsequent step while sequence B proceeds. Note that although A is excluded from later assembly (i.e., it does not participate in building the consensus sequence), its sequence information is already represented by B. In addition, the "contained" relationship between A and B is stored in a PlantGDB database table and displayed on the web, so that the contribution of A can be traced. As a result, we no longer use "contig" or "singlet" to classify our FINAL assembly results, because any final "Unique Transcript" may also represent its contained sequences.
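The containment rule can be sketched in Python as below. The `Match` record and the function names are hypothetical; in practice the matched-base counts would have to be derived from the Vmatch output, which this sketch does not parse. Integer arithmetic is used for the 99% test to avoid floating-point edge cases.

```python
from dataclasses import dataclass

MIN_COVERAGE_PERCENT = 99  # at least 99% of A's nucleotides matched to B


@dataclass(frozen=True)
class Match:
    query: str          # candidate contained sequence (A)
    subject: str        # candidate containing sequence (B)
    matched_bases: int  # nucleotides of A matched to B


def find_contained(lengths, matches):
    """Map each contained sequence ID to one sequence that contains it."""
    contained = {}
    for m in matches:
        len_a, len_b = lengths[m.query], lengths[m.subject]
        is_shorter = len_a < len_b
        is_covered = m.matched_bases * 100 >= MIN_COVERAGE_PERCENT * len_a
        if is_shorter and is_covered:
            contained.setdefault(m.query, m.subject)
    return contained


def resolve_representative(seq_id, contained):
    """Follow A -> B -> C chains to the sequence that actually proceeds.
    Chains cannot loop because each container is strictly longer."""
    while seq_id in contained:
        seq_id = contained[seq_id]
    return seq_id
```

Sequences absent from the returned mapping proceed to clustering, and the mapping itself is the kind of record that lets the contribution of each excluded sequence be traced.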

Clustering and Assembly

The sequences are first clustered by the PaCE program, which groups overlapping sequences by single-linkage clustering on parallel computers (PaCE options: match 2, mismatch -4, gap -1, hgap -6, AlignmentWithN -1, LOADPERPROC 80, window 11, MinLen 100, ScratchMemory 250, TranscriptsTogether 0, EndToEndScoreRatioThreshold 10, EndtoEndAlignLenThreshold 80, MaxScoreRatioThreshold 5, TranscriptCoverageThreshold 40, ClonePairsFile None, Keep_Mbuf_Full 0, MPI_Block_Sends 1, ReportSplicedCandidates 0, ReportMaximalPairs 0, ReportMaximalSubstrings 0, ReportAcceptedPairs 0, ReportGeneratedPairs 0, ReportMaximalRepeatCount 1, DumpClustersMidway 1).
Then, for each resulting PaCE cluster, the CAP3 program is used to perform the assembly (CAP3 options: -p 95 -o 49 -t 10000). The output is a set of CAP3 contigs/singlets, where the contigs are the consensus sequences derived from multiple member mRNA sequences.
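Single-linkage clustering means that two sequences end up in the same cluster whenever a chain of accepted pairwise overlaps connects them. The Python sketch below illustrates that idea with a union-find structure. It is not PaCE's parallel algorithm, and the input (a list of already accepted overlapping pairs) is an assumption for illustration.

```python
from collections import defaultdict


def single_linkage_clusters(seq_ids, overlapping_pairs):
    """Group sequence IDs so that any two IDs linked by a chain of
    overlapping pairs share a cluster (union-find with path halving)."""
    parent = {s: s for s in seq_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in overlapping_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for s in seq_ids:
        groups[find(s)].append(s)
    return sorted(sorted(members) for members in groups.values())
```

Each returned cluster would then be assembled separately, which is why a missed overlap at this stage cannot be recovered by the per-cluster assembly alone.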

Refinement

This step is also currently perhaps unique to our assembly procedure. The above PaCE parameters were chosen to balance sensitivity (clustering all overlapping sequences together), specificity (clustering based on meaningful end-to-end or global overlap), and performance (clustering speed). There is no guarantee that the PaCE result is free of false negatives (sequences that should be clustered together but are spread across different clusters). Any such false negatives are propagated into the CAP3 assembly step, since CAP3 operates only on individual PaCE clusters.
Therefore, to minimize such potential false negatives, the resulting CAP3 contigs/singlets are self-clustered using the Vmatch program (Vmatch options: -d -p -seedlength 15 -l 50 -exdrop 1 -identity 95 -selfun end2end-match.so -dbcluster 0 0, where end2end-match.so is our own customized Vmatch selection function that ensures the clustering is based on end-to-end overlap). If any CAP3 contigs/singlets are clustered together (e.g., the previous PaCE/CAP3 false negatives), their member mRNA sequences are pooled for re-assembly by CAP3. In other words, every set of potentially overlapping sequences gets an opportunity to be grouped together and to have its consensus sequence rebuilt.
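The pooling step of the refinement can be sketched as follows. The data structures are hypothetical: `members` maps each first-round contig/singlet ID to its member mRNA IDs, and `contig_clusters` stands for the groups reported by the self-clustering.

```python
def pool_for_reassembly(members, contig_clusters):
    """For every cluster holding two or more contigs/singlets, pool the
    member mRNA IDs so they can be re-assembled together.
    Clusters of size one need no re-assembly and are skipped."""
    pools = []
    for cluster in contig_clusters:
        if len(cluster) < 2:
            continue
        pooled = set()
        for contig_id in cluster:
            pooled.update(members[contig_id])
        pools.append(sorted(pooled))
    return pools
```

Each returned pool corresponds to one additional assembly run; contigs/singlets that did not cluster with anything keep their first-round result.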

Final results

After the above refinement, we obtain a final set of CAP3 contigs/singlets with the assurance that they overlap minimally with each other. These final CAP3 contigs/singlets are designated PlantGDB-assembled Unique Transcripts (PUTs). The unique transcripts are then subjected to automated functional annotation by BLASTX (BLAST option: -e 1e-20) against the UniProt protein database to identify significantly similar proteins. Besides the individual record display on the web, the final unique transcripts and their functional annotations can be accessed at: