This step-by-step tutorial covers the basic concepts behind setting up a GeneSeqer@PlantGDB gene structure analysis, understanding the results of an analysis, and applying these results to further investigations. It illustrates the individual steps necessary to use the GeneSeqer@PlantGDB web service and offers tips to make its use more efficient and rewarding.
The GeneSeqer@PlantGDB web service is intended primarily for performing spliced alignment of query sequences (sequences representing transcribed genes, i.e. ESTs, cDNAs, and proteins) with a target sequence (genomic DNA). The results of such an alignment are useful in many studies. Because a number of questions may be addressed by spliced alignment, the process by which you use GeneSeqer@PlantGDB may vary. For instance, you can choose input sequences based on similarity to other sequences (both genomic and transcribed), or you may already possess an uncharacterized sequence you wish to know more about. In this example, the latter case applies. The form elements found in this tutorial are identical to those of the standard web service. You may wish to open GeneSeqer@PlantGDB in a new window and follow along in parallel with this tutorial; however, this is not required.
To demonstrate a typical use of this service, we have chosen to characterize a segment of genomic sequence from Sorghum bicolor (GenBank accession AF503433). This bacterial artificial chromosome (BAC) represents approximately 142,000 bases of genomic sequence. While this sequence could be pasted into the large text area below or even saved to a local file and uploaded, we have chosen to simply enter the accession number in the appropriate field as shown. To finish this step, we select the GenBank format. A detailed description of optional formats can be found at the Select format link.
While an exhaustive alignment of "All Plant" ESTs and cDNAs is possible, the size of our genomic sequence in this example calls for a more efficient approach. Generally, when characterizing a large genomic sequence (as demonstrated here), the detection of genic regions is the primary goal. Only after their detection are these regions examined in greater detail. Note the options for alignment of specific data types as well as individual species or logical species groups. These functions, as well as alignment of sequences of your own choosing, are discussed in the refined analysis section later in this tutorial. For now, we choose to align only the representative TUG collection of all plants. This sequence collection represents the Tentative Unique Gene clusters assembled using the PlantGDB contiging method.
We now choose the maize splicing model parameter, since maize is the most closely related model available.
Finally, our sequences are submitted. For a large sequence such as this, analysis may take as long as 30 minutes, so results are also available via email. The results of this demonstration have been cached and are therefore immediately available by clicking the submit button.
If necessary, click here to open the results of the GeneSeqer@PlantGDB analysis described above. NOTE: For large files, most browsers will take a while to correctly process all location tags. This may cause links on the graphic to appear non-functional. After the browser has completely loaded the page, however, all links will be fully functional.
Large genomic sequences are broken into fragments of 60,000 bases for visualization. Each of these segments can be viewed by selecting it in the drop-down menu on the left of the results window. The corresponding graphic summary for each segment is displayed in the upper pane of the results window. The summary graphic is clickable: when you select a structure (colored arrow) within the graphic, the alignment file in the lower pane scrolls to the section dealing with the element you selected. Colored arrows represent aligned sequences and predicted gene structures, distinguished by color. In this example, red arrows represent predicted open reading frames; green arrows represent possible gene structures (possibly alternative structures); and blue arrows represent the alignment of EST or cDNA sequences. In all arrow drawings, exons are represented as colored rectangles connected by thin lines, which depict introns. A legend for the color scheme is shown when you move your cursor over the "PREDICTION SUMMARY" title above the graphic.
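The segmentation described above can be pictured with a short Python sketch. It is an illustration only: the function name segment_ranges is hypothetical, and it assumes consecutive, non-overlapping segments with 1-based inclusive coordinates, which this tutorial does not state explicitly. The sequence length of 142,000 is the approximate BAC size given earlier.

```python
def segment_ranges(sequence_length, segment_size=60000):
    """Return 1-based inclusive (start, end) ranges that cover a sequence.

    Assumption: segments are consecutive and do not overlap; the web
    service may place its segment boundaries differently.
    """
    if sequence_length < 0 or segment_size <= 0:
        raise ValueError("length must be >= 0 and segment size must be > 0")
    ranges = []
    start = 1
    while start <= sequence_length:
        end = min(start + segment_size - 1, sequence_length)
        ranges.append((start, end))
        start = end + 1
    return ranges


# Approximate size of the Sorghum bicolor BAC used in this tutorial.
for start, end in segment_ranges(142000):
    print(f"segment {start}-{end}")
```

Under these assumptions, a sequence of about 142,000 bases yields three entries in the drop-down menu, the last one shorter than the others.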
The alignment file found in the lower pane of the results window is the heart of the GeneSeqer@PlantGDB output. This text shows the base-to-base alignment of the expressed sequence(s) with the genomic DNA. Predicted introns are shown as strings of periods '.'. Score statistics for the alignment quality as well as the predicted splice site quality are shown for each aligned sequence. In addition, links to the source of each sequence are provided above their respective alignments.
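To make the period notation concrete, the following Python sketch locates runs of periods in a single transcript row of an alignment and splits the row into its exon segments. The row used here is an invented toy string, not real GeneSeqer@PlantGDB output, and the function names are hypothetical. Actual alignment text is wrapped over several lines and carries coordinates, so it would need to be reassembled before a scan like this could be applied.

```python
import re


def intron_runs(alignment_row):
    """Return (start, end) index pairs for each run of periods.

    Indices are 0-based and end-exclusive, relative to the row string.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\.+", alignment_row)]


def exon_segments(alignment_row):
    """Return the aligned stretches that lie between runs of periods."""
    return [seg for seg in re.split(r"\.+", alignment_row) if seg]


# Toy example: three aligned stretches separated by two predicted introns.
row = "ATGGCC.....GTTAAC...TGA"
print(intron_runs(row))
print(exon_segments(row))
```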
The culmination of the GeneSeqer@PlantGDB analysis is the prediction of an accurate gene structure. The quality of this prediction can be assessed by predicting a probable open reading frame (ORF) and comparing it to known proteins. Predicted ORFs are shown as red arrows in the summary graphic. Additionally, the longest ORF as well as its translation frame is displayed in the alignment file. The NCBI blastp link following the translated ORF sequence in the alignment file makes it easier to find putative homologs for the predicted gene.
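The idea of a longest ORF and its translation frame can be illustrated with a minimal Python sketch. This is not the procedure used by GeneSeqer@PlantGDB, which this tutorial does not describe. The sketch assumes a spliced (intron-free) sequence, scans only the three forward frames, and counts an ORF as an ATG codon followed in frame by the first TAA, TAG, or TGA codon. The function name longest_orf is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(spliced_sequence):
    """Return (frame, start, end) of the longest ATG-to-stop ORF, or None.

    frame is 0, 1, or 2; start and end are 0-based, end-exclusive indices,
    and the stop codon is included in the reported span.
    Assumption: forward strand only, complete ORFs only.
    """
    seq = spliced_sequence.upper()
    best = None
    best_length = 0
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon == "ATG":
                orf_start = i  # first start codon opens the ORF
            elif orf_start is not None and codon in STOP_CODONS:
                length = i + 3 - orf_start
                if length > best_length:
                    best_length = length
                    best = (frame, orf_start, i + 3)
                orf_start = None  # look for the next ORF in this frame
    return best


# Toy sequence: ATG AAA TTT GGG TAA sits in frame 2.
print(longest_orf("CCATGAAATTTGGGTAACC"))
```

A production tool would typically also consider the reverse strand and ORFs that run off the end of an incomplete transcript, which is one reason the sketch should not be read as equivalent to the ORF shown in the alignment file.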
Interesting gene regions found through the process described above can be further refined through various methods. One such method, demonstrated in this paragraph, involves a detailed look at the evidence (ESTs and cDNAs) supporting a given gene structure. Spliced alignment of "All Plants" ESTs and cDNAs to the restricted region gives insight into possible alternative gene structures, polymorphisms, and differential transcription. To demonstrate this concept, we have chosen the 15 kb region extending from base 7500 to base 22500 of the Sorghum bicolor BAC analyzed above. The results are available here. This analysis was done in the same manner as above, with the exceptions that the 7500 to 22500 range was input in step 2 and the "All Plants" EST and cDNA options were chosen in step 3.
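If you prefer to cut the region out yourself and paste it into the sequence text area, the range restriction corresponds to a simple slice. The Python sketch below assumes that the range is 1-based and includes both end positions, which is the usual GenBank convention but is not confirmed by this tutorial for the web form. The function name extract_region is hypothetical.

```python
def extract_region(sequence, start, end):
    """Return the bases from start to end, 1-based and inclusive.

    Assumption: both boundary positions belong to the region.
    """
    if start < 1 or end > len(sequence) or start > end:
        raise ValueError("region must lie within the sequence")
    return sequence[start - 1:end]


# Toy example on a ten-base sequence.
print(extract_region("ACGTACGTAC", 3, 6))
```

Under this convention, the range 7500 to 22500 covers 15,001 bases, so the "15 kb" figure is a rounded description of the region.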
As shown by the summary graphic, three distinct gene regions have been characterized. These three gene regions putatively represent a mitochondrial carrier protein, subunit 1 of a cleavage stimulation factor, and a serine threonine kinase, based on BlastP queries with the NCBI non-redundant database as described in the next section. Interestingly, spliced alignment of non-native (non-Sorghum) transcripts alone is responsible for the characterization of the mitochondrial carrier protein in the 7800 to 11800 region shown to the left. Also noteworthy is the apparent alternative gene structure represented by an exon in the 9438 to 9477 region of this gene. Because of the low local alignment similarity of the homologous sequence alignments, the native transcript presumably encoded by this gene region is assumed either to lack this exon or to express it as an alternatively spliced product. Investigation of the origin of the transcripts corresponding to each gene structure reveals two (2) transcripts arising from monocotyledons (Secale cereale (rye) gi:10093099; Oryza sativa (rice) gi:27547342) and two (2) transcripts arising from dicotyledons (Solanum tuberosum (potato) gi:17074557; Lycopersicon esculentum (tomato) gi:18260535). In this example, the gene structure lacking the exon in question is supported by spliced alignment of the monocot homologs and thus, as assumed above, most likely represents the native Sorghum gene transcript.
In some cases, the complete gene structure of a gene, representing its entire coding region, cannot be determined using the alignment of transcribed sequences alone. As mentioned above, inclusion of homologous transcripts can increase the coverage of these alignments but is not always sufficient to produce a complete gene structure. For this reason, GeneSeqer@PlantGDB includes an interface allowing the alignment of homologous proteins. These homologs may be determined through the use of the NCBI blastp link provided in the ORF section of the web service results. The results shown here represent such alignments in the 7800 to 11800 region of the Sorghum BAC used throughout this demonstration.
The 10 putatively homologous proteins aligned in this example were obtained using the external BlastP link provided in the GeneSeqer@PlantGDB text results. Each ORF prediction found in the text results is followed by this link to facilitate searches against the NCBI non-redundant database. In this example, all 10 proteins had E-values of at most 8e-66. As was shown in the preceding section, homologous alignments of two (2) putative Arabidopsis thaliana proteins suggest an alternative gene structure, while alignment of the other eight (8) protein sequences supports the predicted native gene structure.