Frequently Asked Questions
Questions are organized by category. Click a category to expand it, then select a question.
If you don't find the answer to your question, please use our feedback form (top).
PlantGDB provides sequence data for >70,000 plant species, custom EST assemblies (PUT) for over 150 species, web tools and plant genome browsers, as well as an outreach portal for plant genomics. For more information on PlantGDB, visit our About page or take a brief tour on our Help Home Page.
Use the 'feedback link' at the top right corner of any PlantGDB web page. We will contact you within 24 hours. You are also welcome to contact any of the PlantGDB contacts listed under About.
PlantGDB has been optimized for use with Firefox 3, Safari, or Internet Explorer 7 / 8. Many advanced features require that JavaScript be enabled. If you encounter problems viewing any page at PlantGDB.org, please contact us using our feedback page. Please include a description of what didn't work as expected, and which web browser and operating system you were using. We will do our best to address the problem.
- PlantGDB's Public Plant Sequence data is updated every four months, coinciding with every other GenBank Version Release (December, April, and August). Transcript assemblies (PUT) are updated at this time and are typically made available 2-4 weeks after version update.
- Genome data at PlantGDB are updated periodically when a new genome assembly becomes available, or when transcript data are significantly increased.
- For more information, see FAQ categories 'Plant Sequence and PUT assemblies' and 'Genome Browsers' below.
- Sequence data and metadata are stored on our servers in three primary forms: 1) in MySQL databases, which store metadata and links to other data types; 2) in multiFASTA-formatted sequence files, for sequence retrieval using FASTACMD; 3) in indices for BLAST and GeneSeqer analysis.
- For more information about how to access and download PlantGDB sequence data, see FAQ categories 'Plant Sequence and PUT assemblies' and 'Genome Browsers' below.
For an overview of sources and methods, each GDB has a Data and Methods page, accessed from the left menubar or via the Data and Methods Portal.
- We obtain both genome sequence and gene model (annotation) data from original source repositories which may differ for each genome. The data source is identified on the genome browser home page under "Genome/Gene Models". In all cases, we provide both links to the original data source and local copies of all data files used in compiling our genome databases (see next f.a.q. item).
- Other data displayed for each genome consists of splice-aligned transcript (EST, cDNA, TSA), splice-aligned related-species protein data, and microarray probe sequences (which are first matched to PUT assemblies and then positioned on the genome). Spliced alignments and probe positioning are carried out at PlantGDB using primary data downloaded from GenBank (transcript), genome repositories (proteins), and PLEXdb (probes).
- Specialized datasets (e.g. Genome Survey Sequence assemblies, masking datasets) are obtained from original source databases as specified in the track header. More information can usually be gleaned by clicking on a track glyph and viewing the Description and/or Notes for that sequence.
PlantGDB's genome focus is on accurate spliced alignments of transcript to genomes, a critical component of accurate genome annotation. The xGDB genome browser platform used at PlantGDB has unique features that make it useful for viewing and annotating genomes:
- All splicing evidence can be viewed online and reproduced using web tools provided at PlantGDB.
- A community annotation tool (yrGATE) and gene model incongruence-detection system (GAEVAL) are built in, to facilitate genome annotation.
- Each xGDB has powerful BLAST tools and search tools to retrieve upstream sequence for motif analysis.
- xGDB supports the DAS (Distributed Annotation Service) standard for cross-platform data display, and provides both DAS client and DAS server capabilities.
- The complete xGDB code is available as open source software and can be custom-installed on a Linux server.
For more information, see the Genome Browser Help Page.
Likely reasons: the selected region is too large, or the region is very heavily annotated with one track type (typically EST). In either case, the load on the graphics engine causes a long delay in track display. Solutions:
- Re-enter a set of coordinates that span a narrower region and try again.
- If the problem remains, try deselecting the EST track type using the track control and resubmit the region request.
- If you are unable to solve the problem, please contact us using the Feedback form, describing the region you were attempting to view.
Each genome has a "Downloads" page, accessible from the left panel on the GDB home page. Or, access it directly using this url: http://www.plantgdb.org/XGDB/phplib/download.php?GDB=Xx where Xx is the Genus/species abbreviation. On this page you will find:
- FASTA files containing all the genomic and aligned data from the current GDB version
- A GFF2- or GFF3-formatted file containing all annotations with their chromosomal locations and features.
- The complete MySQL database, in a flat file format that can be used to recreate the database locally.
- For some genomes, a 0README file is included to describe special data.
The genome coordinate information you want is contained in the GFF2- or GFF3-formatted file that accompanies each genome annotation (example: Gmax_109_gene.gff3.gz). These files are available in the same location as the other download data: either from each genome page (GDB left menu → Search/Download → Download - Data; e.g. http://www.plantgdb.org/XGDB/phplib/download.php?GDB=Gm), or from the PlantGDB FTP site (Top Menu → Sequence → FTP Server; ftp://ftp.plantgdb.org/download/Genomes/).
Information on the original GFF (Generic Feature Format) can be found here: http://www.sanger.ac.uk/resources/software/gff/spec.html, and the GFF3 format is described here: http://www.sequenceontology.org/gff3.shtml.
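As a minimal sketch of how gene coordinates could be pulled from such a file, the Python below reads GFF3 lines and keeps the rows whose type column is "gene". It assumes the standard nine tab-separated GFF3 columns. The function name and the sample records are made up for illustration and are not taken from any PlantGDB file.

```python
def parse_gff3_genes(lines, feature_type="gene"):
    """Yield (seqid, start, end, strand, attributes) for matching GFF3 rows.

    Hypothetical helper; assumes the nine tab-separated GFF3 columns.
    """
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, directives and comments.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = {}
        for pair in cols[8].split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                attrs[key.strip()] = value.strip()
        yield cols[0], int(cols[3]), int(cols[4]), cols[6], attrs


# Made-up records for illustration only.
sample = [
    "##gff-version 3",
    "Gm01\texample\tgene\t1000\t2500\t.\t+\t.\tID=gene0001;Name=demo",
    "Gm01\texample\tmRNA\t1000\t2500\t.\t+\t.\tID=mrna0001;Parent=gene0001",
]
genes = list(parse_gff3_genes(sample))
for seqid, start, end, strand, attrs in genes:
    print(attrs.get("ID"), seqid, start, end, strand)
```

For a downloaded .gff3.gz file, the same function should accept the file object returned by gzip.open(path, "rt") in place of the sample list.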
You can accomplish this by using the BLAST tool and then the "download region" tool:
- Paste your sequence at http://www.plantgdb.org/MtGDB/cgi-bin/blastGDB.pl (accessed via MtGDB left menu -> Blast MtGDB), select Mt pseudochromosomes as the target dataset and blastn as the search tool, then click "Run Blast".
- Based on the top BLAST alignment, enter the chromosome number and the left and right genomic coordinates into the "chr - start - end" inputs at the top of the MtGDB page and click 'Genome Context'.
- This will display the genome context of the hit region. Use Zoom button if desired to pad the region with additional sequence left and right, or enter desired coordinates above as before.
- From the Genome Context submenu click the green "Download" button to load the "Search/Download From Region" page, pre-configured with the current coordinates
- Click "Display Genomic Sequence for Download".
Alternatively, if you have an accession number for the cDNA, search for it using the MtGDB left menu -> Search ID/Keyword tool.
- If a result is returned, click "Retrieve Sequences:" to see options for retrieving up and/or downstream sequence.
- If no result is returned, then the sequence is either of more recent origin than this GDB version, or else its alignment was insufficiently good to be accepted for display.
A. Yes, you can retrieve selected up/downstream sequences using the Search ID/Keyword tool:
- From any GDB home page, click Search ID/Keyword on the left side menu
- Enter IDs for one or more sequences (either aligned transcripts/proteins or gene models), or a keyword in quotes
- Optionally, limit search to relevant data type(s) by clicking appropriate selections under Limit Search
- Click Search to retrieve records. This may take up to a minute or more for large searches.
- On the results page under Retrieve Sequences, select 5' region, enter desired range, and select whether you want to exclude other overlapping genes
- Click the Sequence ID column header checkbox to select all sequences for retrieval (or click individual checkboxes to select a subset). [Note: if the retrieval set is too large the program will error out]
- Click Retrieve FASTA to retrieve the desired sequences. This may take a minute or more for large datasets
B. If you need to retrieve ALL the upstream or downstream sequences from an annotated genome, you will need to download the genome data from PlantGDB and use appropriate tools on your local machine.
Below is a step-by-step guide to the process you will need to follow (you will need access to MySQL and NCBI blastall or a similar package):
- Download the FASTA genome sequence and the genome database .sql from e.g. http://www.plantgdb.org/XGDB/phplib/download.php?GDB=Zm
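- Create a local MySQL database from the .sql file and write a MySQL query to retrieve the flanking coordinates for each gene model. You will use the table called chr_gene_annotation. Assuming l_pos is the lower coordinate on both strands (as the 3'UTR query in this FAQ implies), the upstream region lies to the left of l_pos on the forward strand ("f") and to the right of r_pos on the reverse strand ("r"). For 1000 bases of upstream sequence, your queries will look something like this:
select geneId, chr, l_pos - 1000 as f_seq_start, l_pos - 1 as f_seq_end from chr_gene_annotation where strand="f";
select geneId, chr, r_pos + 1 as r_seq_start, r_pos + 1000 as r_seq_end from chr_gene_annotation where strand="r";
For downstream sequence, swap the coordinate expressions (r_pos + 1 to r_pos + 1000 for "f"; l_pos - 1000 to l_pos - 1 for "r").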
- Format the genome FASTA using e.g. formatdb with -o T (see Note below)
- Create scripts to retrieve each sequence range as a FASTA file from each genome/chromosome using blastall's fastacmd (http://www.ncbi.nlm.nih.gov/BLAST/docs/fastacmd.html) or equivalent package.
- For fastacmd, the following options apply for blastall versions before 2.2.21. [Note that NCBI has recently updated blast to BLAST+ 2.2.23 (View new blast information) and the command line syntax has changed].
- use -d to specify the indexed genome data target
- use -s to specify the chromosome in a multifasta file
- use the -L option to specify the range
- use the -S option to get appropriate strand from f_seq and r_seq if that's important
- use the -o option to give the output file a name according to the geneId (or use some other naming scheme as appropriate)
Example: fastacmd -d /path/to/genome_data -s chr1 -L1000,2000 -S2 -o filename1.fasta
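The retrieval step can be scripted. The Python below is a sketch under stated assumptions: it computes an upstream flank from l_pos, r_pos and strand (taking l_pos to be the lower coordinate on both strands) and builds a fastacmd argument list using only the options listed above. The function names, the 1000-base flank and the sample gene are hypothetical, and the flank is not clipped at the right-hand end of the chromosome.

```python
def upstream_flank(l_pos, r_pos, strand, flank=1000):
    """Return (start, end, fastacmd_strand) for the region upstream of a gene.

    Assumes l_pos < r_pos on both strands, with strand given as "f" or "r".
    """
    if strand == "f":
        if l_pos <= 1:
            raise ValueError("gene starts at position 1; no upstream sequence")
        # Upstream of a forward-strand gene lies to the left of l_pos.
        return max(1, l_pos - flank), l_pos - 1, 1
    if strand == "r":
        # Upstream of a reverse-strand gene lies to the right of r_pos;
        # fastacmd strand 2 returns the bottom strand.
        return r_pos + 1, r_pos + flank, 2
    raise ValueError("strand must be 'f' or 'r'")


def fastacmd_args(db, gene_id, chrom, l_pos, r_pos, strand, flank=1000):
    """Build a fastacmd argument list (pre-BLAST+ syntax) for one gene."""
    start, end, strand_code = upstream_flank(l_pos, r_pos, strand, flank)
    return [
        "fastacmd",
        "-d", db,
        "-s", chrom,
        "-L", f"{start},{end}",
        "-S", str(strand_code),
        "-o", f"{gene_id}.fasta",
    ]


# Hypothetical gene record; the path is the placeholder used in the example above.
args = fastacmd_args("/path/to/genome_data", "gene1", "chr1", 5000, 8000, "r")
print(" ".join(args))
```

Each argument list could then be passed to subprocess.run on a machine where fastacmd is installed and the genome has been formatted with formatdb.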
We don't make this data directly available but you can derive it easily from our database tables which are available for download, if you have access to MySQL and a scripting language.
First, download the appropriate genome MySQL database from http://www.plantgdb.org/XGDB/phplib/download.php?GDB=Xx where Xx is the Genus/species abbreviation, e.g. Zm for maize. Once you have created the database locally, you can derive the coordinates as follows:
- Find the table that stores gene model information; it is named either chr_gene_annotation (for chromosome-based browsers) or gseg_gene_annotation (for BAC or scaffold-based browsers).
- The relevant columns are chr (or gseg_gi), l_pos, r_pos, CDSstart, CDSstop and strand.
- A query such as the following will build a tabular output featuring the 3'UTR chr/coordinates, length and direction:
mysql>SELECT geneID, chr, IF(strand="f", CDSstop, l_pos) AS left_position, IF(strand="f", r_pos, CDSstop) AS right_position, IF(strand="f", r_pos-CDSstop, CDSstop-l_pos) AS length, strand FROM chr_gene_annotation;
Once you have the coordinates you can build a script to retrieve the data from the genome sequence (which is also available from the same download page referenced above), using fastacmd or perl, python or similar scripting language.
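If you prefer to compute the same values in a script, the logic of the query can be mirrored in a few lines. The Python below is only a sketch: the function name and the sample coordinates are made up, and the arithmetic copies the IF() expressions of the query term by term, so the reported interval starts or ends at CDSstop itself, exactly as the SQL does.

```python
def three_prime_utr(l_pos, r_pos, cds_stop, strand):
    """Return (left_position, right_position, length) of the 3'UTR.

    Mirrors the IF() expressions in the MySQL query; strand is "f" or "r".
    """
    if strand == "f":
        # Forward strand: the 3'UTR runs from CDSstop to the right end.
        left, right = cds_stop, r_pos
    else:
        # Reverse strand: the 3'UTR runs from the left end to CDSstop.
        left, right = l_pos, cds_stop
    return left, right, right - left


# Made-up coordinates for illustration only.
forward = three_prime_utr(100, 2000, 1800, "f")
reverse = three_prime_utr(100, 2000, 300, "r")
print(forward, reverse)
```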
xGDB supports the DAS (Distributed Annotation Service) standard for cross-platform data display, and provides both DAS client and DAS server capabilities. Several PlantGDB genomes have DAS-served data; see DAS Services for details.
For more information on DAS, see the Genome Browser Help Page.
PlantGDB downloads GenBank and UniProt sequence data approximately every four months, corresponding to every other GenBank Release. Sequence data is parsed according to a database schema, and individual sequence files are filtered to detect vector and repeat sequence. When you download FASTA-formatted sequence data from PlantGDB, you may see differences in the masking of repeat or vector regions, but the sequence is otherwise identical.
- PUT = PlantGDB-assembled Unique Transcript. PlantGDB regularly assembles transcript sequences (EST and cDNA) and TSA (Transcriptome Shotgun Assemblies) for species with >10,000 sequences in GenBank, as well as by request for smaller or combined datasets. The resulting sequence assemblies (PUTs) are made available for search, download, BLAST, and spliced alignment using GeneSeqer.
- PUT assemblies include both contigs (comprising multiple sequences) and singletons. They are named according to version number, genus_species, and sequence number.
For more information visit the EST Assembly Page (Home>Left Menu>EST Assembly).
TSAs are Transcriptome Shotgun Assemblies, and are computationally derived from a combination of ESTs and short reads submitted to the Short Read Archive. The submitter of the TSA sequences, not NCBI, is responsible for their generation, and all sequences in a TSA must originate with the submitter. You can read more about the TSA submission process here: http://www.ncbi.nlm.nih.gov/genbank/TSA.html.
Where available, PlantGDB uses TSAs as part of its PUT assembly. You can read more about the PUT assembly process here: http://www.plantgdb.org/prj/ESTCluster/PUT_procedure.php
You can download sequence for any plant species by going to the Download portal (Home>Download>Sequence). Enter Genus/species and click 'Search'. (For popular species, use the shortcut "Featured Species" on the Home Page left menubar.)
To download PUT assemblies, go to the EST contig Download portal (Home>EST Assembly>Download)
To download large datasets, visit our ftp site at ftp.plantgdb.org where you can download all PUT assemblies or plant sequences using ftp.
There are two ways you can assess PUT directionality:
A) Evaluate the PUT's tblastn orientation to top hit protein:
- Download the "Similar Proteins" table from our Download Portal.
Path: Home Page -> Download -> EST Assemblies -> [click a species directory] -> current version -> [genus_species.Similar.Protein.txt]
Example: http://www.plantgdb.org/download/download.php?dir=/Sequence/ESTcontig/Actinidia_chinensis/current_version
- Search for the PUT ID of interest and check columns 8 and 9 (start/end of query sequence). If start > end, the orientation is reverse with respect to the tblastn hit protein.
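The column check in option A is easy to script once the table is downloaded. The Python below is a rough sketch that assumes the file is tab-delimited and that columns 8 and 9 hold the start and end described above. The exact file layout has not been verified here, and the function name and the sample row are invented for illustration.

```python
def put_orientation(row, start_col=8, end_col=9):
    """Classify one table row as "forward" or "reverse".

    Assumes a tab-delimited row; start_col and end_col are 1-based column
    numbers, defaulting to columns 8 and 9 as described in the text.
    """
    fields = row.rstrip("\n").split("\t")
    start = int(fields[start_col - 1])
    end = int(fields[end_col - 1])
    # start > end means the PUT is reversed relative to the hit protein.
    return "reverse" if start > end else "forward"


# Invented row: only columns 8 and 9 carry meaning in this sketch.
row = "\t".join(["PUT-example-1", "a", "b", "c", "d", "e", "f", "950", "120", "g"])
orientation = put_orientation(row)
print(orientation)
```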
B) If PUT is splice-aligned to a genome in our Genomes list, view the PUT alignment details and assess its direction of transcription (from GeneSeqer analysis) versus its strand.
- Open a Search ID/Keyword window in any genome browser (e.g. http://www.plantgdb.org/ZmGDB/).
Path: Home Page -> Genomes -> Search ID/Keyword (left menu) -> [Search page] paste PUT ID (e.g. PUT-1-171a-Zea_mays-10395) -> Click 'Search' -> [Result page] Click highlighted PUT ID -> [Record page] Click 'GeneSeqer Alignment' -> [output].
- Click the "?" icon on the [Record page] next to the GeneSeqer link for hints on how to interpret the GeneSeqer output.
PlantGDB's sequence data is updated every 4 months, coinciding with every other GenBank Release (odd numbers). For example, recent updates included V.165 (April 2008) and V.163 (December 2007).
If you visit the Download page for any species, you can retrieve files named as:
- Genus_species.PUT_member.txt
- Genus_species.alignment.txt
Both files provide the mapping of ESTs to a PUT.
Alternatively, you can view or retrieve the EST components of an individual PUT from the "Search" page, e.g.
http://www.plantgdb.org/search/display/data.php?Seq_ID=PUT-157a-Oryza_sativa-6232
PlantGDB's taxonomic conventions will always reflect NCBI's current naming system since our data source is GenBank. Check the current taxonomic name for your species using GenBank's Taxonomy browser. It is possible that the genus and/or species name has changed.