Searching RefSeq

RefSeq, produced by the National Center for Biotechnology Information (NCBI), provides the best non-redundant and comprehensive collection of naturally occurring DNA, RNA, and protein molecules for major organisms. A step-by-step example illustrating how to use RefSeq appears at the end of this article.

Is this a familiar scenario? You are searching the online GenBank database, perhaps looking for an mRNA sequence for your gene of interest. When the search results appear, you are overwhelmed by the sheer number of records and don’t know which one to choose. If you have experienced this challenge, here is a solution.

Although GenBank offers the most comprehensive data source for nucleotide sequences (38,989,342,565 bases as of April 2004 and increasing exponentially), the database suffers from redundancy. When you search GenBank without using the limit option, it retrieves every record that matches your query term in any field of the database record, producing a result set with many duplicate and non-significant records.

A much more efficient search is available through the Reference Sequences (RefSeq) database <www.ncbi.nlm.nih.gov/RefSeq/>, which supplies a concise results list containing just one record for each splice variant of your gene.

While RefSeq is substantially based on GenBank sequence records, a useful analogy from the NCBI Handbook <www.ncbi.nlm.nih.gov/books/bookres.fcgi/handbook/ch18d1.pdf> clarifies the differences: "RefSeq records include attribution to the original sequence data; however, RefSeq differs from GenBank in the same way that a review article differs from the relevant collection of primary research articles on the same subject.
RefSeq represents a synthesis and summary of information by a person or group based on the primary information that was gathered by others…. GenBank represents the sequence and annotations that are supplied by the original authors and is never changed by others. GenBank remains the primary sequence repository. RefSeq is one of many possible 'review articles' based on that essential archive."

RefSeq accession numbers are alphanumeric, consisting of a two-letter prefix followed by an underscore and six digits. The two-letter prefix indicates the molecule type, as presented in the following table.

ACCESSION PREFIX    MOLECULE TYPE
NM_                 mRNA
NP_                 Protein
NR_                 RNA
NC_                 Complete genomic molecule
NT_                 Genomic contig (Computed)
XP_                 Protein (Computed)
XM_                 mRNA (Computed)

How to Use RefSeq: a step-by-step example

Question: Find the mRNA sequence for human Epidermal Growth Factor Receptor (EGFR).

A GenBank search of "EGFR" as a text word produces a result set of 14,219 records. Even a search of "human EGFR" only reduces the results to 13,320 records, because many nonhuman records are still included in the results. However, limiting your search to "RefSeq" and specifying "human" as the organism narrows the search results to only four records, one for each splice variant of human EGFR.

[Figures: screenshots illustrating the search steps, through Step 12.]

For more information about RefSeq, or any other HSLS molecular biology and genetics resource, go to <www.hsls.pitt.edu/guides/genetics/> or contact Ansuman Chattopadhyay (412-648-1297 or ansuman@pitt.edu).

--Ansuman Chattopadhyay
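
For readers who script their searches, the accession prefix table and the limited EGFR search described above can be sketched in a few lines of Python. This is only an illustration, not an NCBI tool: the function names `classify_accession` and `build_refseq_query` are hypothetical, and the accession strings are made up to show the format. The pattern also accepts more than six digits and an optional version suffix such as ".2", which goes beyond the format described in this article. The query string assumes the Entrez field tags `[Gene Name]` and `[Organism]` and the `srcdb_refseq[Properties]` filter; check the current Entrez help pages before relying on them.

```python
import re

# Prefix-to-molecule-type mapping, copied from the table in this article.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "Protein",
    "NR": "RNA",
    "NC": "Complete genomic molecule",
    "NT": "Genomic contig (Computed)",
    "XP": "Protein (Computed)",
    "XM": "mRNA (Computed)",
}

# Two capital letters, an underscore, then digits.
# Assumption: six or more digits and an optional ".version" suffix are allowed.
_ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d{6,})(?:\.(\d+))?$")


def classify_accession(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    match = _ACCESSION_RE.match(accession.strip())
    if match is None:
        return None  # not shaped like a RefSeq accession
    # Prefixes that are not in the article's table also yield None.
    return REFSEQ_PREFIXES.get(match.group(1))


def build_refseq_query(gene, organism):
    """Build an Entrez query string limited to RefSeq records.

    Assumption: the field tags and the srcdb_refseq filter are as named here.
    """
    return (
        f"{gene}[Gene Name] AND {organism}[Organism] "
        "AND srcdb_refseq[Properties]"
    )


if __name__ == "__main__":
    print(classify_accession("NM_123456"))  # hypothetical accession
    print(build_refseq_query("EGFR", "Homo sapiens"))
```

Pasting the generated query into the Entrez search box should have the same effect as setting the RefSeq and organism limits by hand, although the number of records returned today will differ from the 2004 figures quoted above.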