Searching RefSeq

RefSeq, produced by the National Center for Biotechnology Information (NCBI), provides the best non-redundant and comprehensive collection of naturally occurring DNA, RNA, and protein molecules for major organisms. A step-by-step example illustrating how to use RefSeq appears at the end of this article.

Is this a familiar scenario? You are searching the online GenBank database, perhaps looking for an mRNA sequence for your gene of interest. When the search results appear, you are overwhelmed by the sheer number, and don’t know which one to choose. If you have experienced this challenge, here is a solution.

Although GenBank offers the most comprehensive data source for nucleotide sequences (38,989,342,565 bases as of April 2004 and increasing exponentially), the database suffers from redundancy. When you search GenBank without using the limit option, it retrieves all records that match your query term in any field in the database record, producing a result set with many duplicate and non-significant records.

A much more efficient alternative to searching all of GenBank is the Reference Sequence (RefSeq) database <www.ncbi.nlm.nih.gov/RefSeq/>, which supplies a concise results list containing just one record for each splice variant of your gene.

While RefSeq is substantially based on GenBank sequence records, a useful analogy from the NCBI Handbook <www.ncbi.nlm.nih.gov/books/bookres.fcgi/handbook/ch18d1.pdf> clarifies the differences: "RefSeq records include attribution to the original sequence data; however, RefSeq differs from GenBank in the same way that a review article differs from the relevant collection of primary research articles on the same subject. RefSeq represents a synthesis and summary of information by a person or group based on the primary information that was gathered by others…. GenBank represents the sequence and annotations that are supplied by the original authors and is never changed by others. GenBank remains the primary sequence repository. RefSeq is one of many possible 'review articles' based on that essential archive."

RefSeq accession numbers are alphanumeric, consisting of a two-letter prefix followed by an underscore and six digits. The two-letter prefix indicates the molecule type, as shown in the following table.

ACCESSION PREFIX        MOLECULE TYPE

NM_                     mRNA
NP_                     Protein
NR_                     RNA
NC_                     Complete genomic molecule
NT_                     Genomic contig (Computed)
XP_                     Protein (Computed)
XM_                     mRNA (Computed)
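
The prefix table can also be applied programmatically. The following is a minimal Python sketch, not an NCBI tool: the function name classify_accession is hypothetical, the mapping is copied from the table above, and the pattern assumes that an accession may carry more than six digits and an optional ".N" version suffix. The accessions used in the usage lines are made up for illustration.

```python
import re

# Prefix-to-molecule-type mapping, copied from the table in this article.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "Protein",
    "NR": "RNA",
    "NC": "Complete genomic molecule",
    "NT": "Genomic contig (Computed)",
    "XP": "Protein (Computed)",
    "XM": "mRNA (Computed)",
}

# Two uppercase letters, an underscore, then at least six digits.
# Assumption: longer digit runs and a trailing ".N" version are tolerated.
ACCESSION_PATTERN = re.compile(r"^([A-Z]{2})_(\d{6,})(?:\.(\d+))?$")


def classify_accession(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    match = ACCESSION_PATTERN.match(accession.strip())
    if match is None:
        return None  # not in the prefix_digits format described above
    return REFSEQ_PREFIXES.get(match.group(1))  # None for unlisted prefixes


# Made-up accessions, used only to show the behavior.
print(classify_accession("NM_123456"))
print(classify_accession("XP_123456.2"))
print(classify_accession("AB123456"))
```

A record whose prefix is not in the table, or that lacks the underscore, returns None, which is a quick way to separate RefSeq-style accessions from other GenBank records in a mixed list.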

How to Use RefSeq: a step-by-step example

Question: Find the mRNA sequence for human Epidermal Growth Factor Receptor (EGFR).

A GenBank search for "EGFR" as a text word produces a result set of 14,219 records. Even a search for "human EGFR" reduces the results only to 13,320 records, because many nonhuman records are still included. However, limiting your search to "RefSeq" and specifying "human" as the organism narrows the results to only four records, one for each splice variant of human EGFR.

Go to
1) Type "EGFR" in the Search box
2) Click on "Limits"

Limit your search by selecting
3) "Gene Name" from the Fields Options
4) "Exclude All of the Above"
5) "mRNA" from the Molecule options
6) "RefSeq" from the "only from" option
7) Click on "Preview/Index" to limit your search to a specified organism

8) Type "Human"
9) Select "Organism"
10) Click "AND"
11) Click "GO"

12) View the search results. Note that only four records, representing all splice variants of EGFR, appear in the results.
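
The limits chosen in the steps above can also be written as a single fielded query. The following Python sketch only builds the query string and a request URL for NCBI's E-utilities esearch service; it does not contact NCBI. The function names are hypothetical, and the field tags and filters ([Gene Name], [Organism], biomol_mrna[PROP], srcdb_refseq[PROP]) are assumptions about Entrez query syntax that should be checked against the current Entrez help pages. The number of records returned today may differ from the four reported in this article.

```python
from urllib.parse import urlencode

# Base address of the E-utilities esearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_refseq_mrna_query(gene, organism):
    """Combine the limits from the steps above into one query string."""
    clauses = [
        f"{gene}[Gene Name]",      # steps 1 and 3: gene name field
        f"{organism}[Organism]",   # steps 8 and 9: organism field
        "biomol_mrna[PROP]",       # step 5: mRNA molecules only (assumed filter)
        "srcdb_refseq[PROP]",      # step 6: RefSeq records only (assumed filter)
    ]
    return " AND ".join(clauses)   # step 10: combine the terms with AND


def build_esearch_url(term, database="nuccore"):
    """Return an esearch URL for the term; no network request is made."""
    return ESEARCH_URL + "?" + urlencode({"db": database, "term": term})


query = build_refseq_mrna_query("EGFR", "human")
url = build_esearch_url(query)
print(query)
print(url)
```

The printed query can be pasted into the search box as an alternative to clicking through the Limits and Preview/Index pages.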

For more information about RefSeq, or any other HSLS molecular biology and genetics resource, go to <www.hsls.pitt.edu/guides/genetics/> or contact Ansuman Chattopadhyay (412-648-1297 or ansuman@pitt.edu).

--Ansuman Chattopadhyay


Links and information are up-to-date when published but are not updated after publication.

The Health Sciences Library System supports the Health Sciences at the University of Pittsburgh and the
UPMC | University of Pittsburgh Medical Center.

© 1996 - 2006 Health Sciences Library System, University of Pittsburgh. All rights reserved.