Chapter 3: Bioinformatics Network Resources (1)

Introduction to NCBI
National Center for Biotechnology Information (NCBI)
www.ncbi.nlm.nih.gov
Topics: the Entrez system; the NCBI integrated databases

• The National Center for Biotechnology Information (NCBI) of the United States was established in 1988.
• In 1991, NCBI developed the Entrez database query system for searching GenBank and other molecular biology databases, as well as biomedical literature abstract databases such as MEDLINE (Schuler et al., 1996).

Using the Entrez system
www.ncbi.nlm.nih.gov

All Databases integrates:
• the scientific literature
• DNA and protein sequence databases
• 3D protein structure data
• population study data sets
• assemblies of complete genomes

All Databases is a search and retrieval system that integrates the NCBI databases.

Accessing information on molecular sequences
Accession numbers are labels for sequences.

NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to start from a query such as the name of a protein of interest, or from the raw nucleotides of a DNA sequence of interest.

DNA sequences and other molecular data are tagged with accession numbers, which identify a sequence or another record relevant to molecular data.

What is an accession number?
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
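Because an accession number is just a structured string, its format often hints at the database it comes from. The following is a minimal sketch of that idea, not an official NCBI tool: `classify_accession` is a hypothetical helper, and the patterns are simplified assumptions based on the RefSeq prefixes (NM_, NP_, NT_, NC_), dbSNP "rs" identifiers, and four-character Protein Data Bank codes. Real accession formats have many more variants.

```python
import re

# Simplified, illustrative patterns (assumptions); they do not cover
# every accession format used by NCBI or other databases.
PATTERNS = [
    ("RefSeq mRNA", re.compile(r"^NM_\d+(\.\d+)?$")),
    ("RefSeq protein", re.compile(r"^NP_\d+(\.\d+)?$")),
    ("RefSeq genomic contig", re.compile(r"^NT_\d+(\.\d+)?$")),
    ("RefSeq complete genome or chromosome", re.compile(r"^NC_\d+(\.\d+)?$")),
    ("dbSNP", re.compile(r"^rs\d+$", re.IGNORECASE)),
    ("PDB structure", re.compile(r"^\d[A-Za-z0-9]{3}$")),
]


def classify_accession(label):
    """Return a rough guess of the source database for an accession label."""
    label = label.strip()
    for name, pattern in PATTERNS:
        if pattern.match(label):
            return name
    # GenBank and SwissProt labels are not distinguished in this sketch.
    return "other"


if __name__ == "__main__":
    for acc in ["NM_006744", "NT_030059", "rs7079946", "1KT7", "X02775"]:
        print(acc, "->", classify_accession(acc))
```

Labels that match none of the patterns fall through to "other", so the helper never guesses beyond the prefixes it knows.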
Examples (all for retinol-binding protein, RBP4):

X02775       GenBank genomic DNA sequence
NT_030059    Genomic contig
rs7079946    dbSNP (single nucleotide polymorphism)
N91759.1     An expressed sequence tag (1 of 170)
NM_006744    RefSeq DNA sequence (from a transcript)
NP_006735    RefSeq protein
AAC02945     GenBank protein
Q28369       SwissProt protein
1KT7         Protein Data Bank structure record

Four ways to access DNA and protein sequences
[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System (separate from NCBI)

[1] Entrez Gene with RefSeq
Entrez Gene is a good starting point: it collects key information on each gene/protein from the major databases, and it covers all major organisms.
RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_006735).

NCBI's RefSeq project: best representative sequences
RefSeq (accessible via the main page of NCBI) holds the reference sequences of the NCBI databases: a curated, non-redundant collection that includes genomic DNA contigs and the mRNAs and proteins of known genes.

RefSeq accession number formats:

Complete genome       NC_######
Complete chromosome   NC_######
Genomic contig        NT_######
mRNA (DNA format)     NM_######   e.g. NM_006744
Protein               NP_######   e.g. NP_006735

From the NCBI home page, type "rbp4" and hit "Go".
By applying limits, there are now just two entries.

[Figure: a GenBank-format record of the sequence information, with labels for the accession code, source organism, references, expert comments, and features]
[Figure: the same record in FASTA format]

Entrez Gene (top of page): note that links to many other RBP4 database entries are available.
Entrez Gene (middle of page)
Entrez Gene (bottom of page)

Example of how to access sequence data: HIV-1 pol
There are many possible approaches.
Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol

Searching for HIV-1 pol: following the "genome" link yields a manageable four results.

For the Entrez query hiv-1 pol there are about 40,000 nucleotide or protein records (and more than 100,000 records for a search for "hiv-1"), but these can be reduced in two easy steps:
• specify the organism, e.g. hiv-1[organism]
• limit the output to RefSeq

[Figure captions: only 1 RefSeq entry; over 100,000 nucleotide entries for HIV-1]

[2] UniGene

UniGene: unique genes via ESTs
• Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
• UniGene holds ESTs and full-length mRNA sequences organized into clusters, each representing a specific known or hypothetical human gene, with mapping and expression information and cross-references to other resources. Each record mainly contains the sequences related to the gene (cDNA, ESTs, etc.), its chromosomal location, and its expression profile. The ESTs it contains come from complete cDNA libraries.
• The UniGene database automatically partitions GenBank sequences into many clusters. Each record represents one cluster, and each cluster represents one unique gene.

Cluster sizes in UniGene
A gene with 1 EST associated has a cluster size of 1.
A gene with 10 ESTs associated has a cluster size of 10.

Cluster sizes in UniGene (human)

Cluster size      Number of clusters
1                 ≈ 8,100
2                 38,200
3-4               23,300
5-8               12,000
9-16              5,600
17-32             3,700
≈500-1000         1,050
≈2000-4000        100
≈8000-16,000      12
≈16,000-30,000    2

(UniGene build 172, 8/04)

UniGene: unique genes via ESTs
Conclusion: UniGene is a useful tool for looking up information about expressed genes. UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression (e.g. brain vs. liver).

Exercise
• Use Entrez to find the nucleotide and protein RefSeq sequences of the human CCL18 gene, save them in FASTA format, and record the RefSeq accession numbers.

Access to Biomedical Literature

PubMed is:
• the National Library of Medicine's search service
• 12 million citations in MEDLINE
• links to participating online journals
• PubMed tutorial (via "Education" on the sidebar)

PubMed at NCBI to find literature information
PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries. It has 12 million records dating back to 1966.

PubMed search strategies
• Try the tutorial ("Education" on the left sidebar)
• Use Boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease
• Try using "Limits"
• Try "Links" to find Entrez information and external resources
• Obtain articles online via the Welch Medical Library (and download PDF files): http://www.welch.jhu.edu/

PubMed services
• Journal Database: browse journals.
• MeSH Database: browse the medical subject headings hierarchically.
• Single Citation Matcher: enter journal information to find a single article or the entire contents of a journal.
• Batch Citation Matcher: enter journal information in a specific format to search for several articles at once.
• Clinical Queries: designed for clinicians; filters restrict the retrieved literature to four categories: therapy, diagnosis, etiology, and prognosis.

Related Resources
• Order Documents: lets users obtain the full text of articles locally, although some of them require payment.
• NLM Mobile: a link to another web-based NLM search system.

Exercise
• Search PubMed for reports on the human CCL18 gene (published after 2000), list the retrieved titles, and try to find the full text of one or two of them.

BLAST is:
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases
• 80,000 searches per day

OMIM is:
• Online Mendelian Inheritance in Man
• a catalog of human genes and genetic disorders
• edited by Dr. Victor McKusick and others at JHU

Books is:
• a searchable resource of online books

TaxBrowser is:
• a browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms

The Structure site includes:
• the Molecular Modeling Database (MMDB)
• biopolymer structures obtained from the Protein Data Bank (PDB)
• Cn3D (a 3D-structure viewer)
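• the Vector Alignment Search Tool (VAST)

Homework 1
• Use Entrez to find the nucleotide and protein RefSeq sequences of the human CCL18 and human CXCL1 genes, save them in FASTA format, and record the sequence information obtained from GenBank.
• Search PubMed for reports on the human CCL18 gene (published after 2000), list the retrieved titles, and try to find the full text of one or two of them.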
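The homework is meant to be done through the Entrez web pages, but the same two lookups can also be scripted. The sketch below assumes NCBI's E-utilities web service (`efetch.fcgi` and `esearch.fcgi`), which is not covered in these slides. The helper names `efetch_fasta_url`, `pubmed_search_url`, and `parse_fasta` are hypothetical, and the accession used is the RBP4 example NM_006744 rather than a homework answer. The code only builds URLs and parses FASTA text; it makes no network requests.

```python
from urllib.parse import urlencode

# Base address of NCBI E-utilities (an assumption of this sketch; the
# slides themselves only use the Entrez web interface).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_fasta_url(accession, db="nuccore"):
    """Build a URL that requests one record in FASTA format.

    Use db="nuccore" for nucleotide records and db="protein" for proteins.
    """
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def pubmed_search_url(term, min_year, max_year=3000, retmax=20):
    """Build a PubMed search URL limited by publication year."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",  # filter on publication date
        "mindate": str(min_year),
        "maxdate": str(max_year),
        "retmax": str(retmax),
    }
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def parse_fasta(text):
    """Split FASTA text into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    print(efetch_fasta_url("NM_006744"))
    print(pubmed_search_url("CCL18 AND human", 2000))
    print(parse_fasta(">demo sequence\nACGT\nTTGA\n"))
```

Opening the printed URLs in a browser returns the FASTA record and a list of PubMed identifiers, respectively. The Boolean operator in the search term is capitalized, following the PubMed search strategies above.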