Chapter 3
Bioinformatics Web Resources (1)
Introduction to NCBI
National Center for Biotechnology
Information (NCBI)
www.ncbi.nlm.nih.gov
The Entrez system
NCBI integrated databases
· The National Center for Biotechnology Information (NCBI)
  was established in 1988.
· In 1991, NCBI developed the Entrez database query system
  for searching molecular biology databases such as GenBank
  and biomedical literature abstract databases such as
  Medline (Schuler et al., 1996).
How to use the Entrez system
www.ncbi.nlm.nih.gov
All Database integrates…
· the scientific literature;
· DNA and protein sequence databases;
· 3D protein structure data;
· population study data sets;
· assemblies of complete genomes
All Database is a search and retrieval system
that integrates NCBI databases
Accessing information on
molecular sequences
Accession numbers
are labels for sequences
NCBI includes databases (such as GenBank) that contain
information on DNA, RNA, or protein sequences.
You may want to acquire information beginning with a
query such as the name of a protein of interest, or the
raw nucleotides comprising a DNA sequence of interest.
DNA sequences and other molecular data are tagged with
accession numbers that are used to identify a sequence
or other record relevant to molecular data.
What is an accession number?
An accession number is a label used to identify a
sequence. It is a string of letters and/or numbers that
corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
X02775       GenBank genomic DNA sequence
NT_030059    Genomic contig
rs7079946    dbSNP (single nucleotide polymorphism)
N91759.1     An expressed sequence tag (1 of 170)
NM_006744    RefSeq DNA sequence (from a transcript)
NP_006735    RefSeq protein
AAC02945     GenBank protein
Q28369       SwissProt protein
1KT7         Protein Data Bank structure record
[Figure labels: protein, DNA, RNA]
Four ways to access DNA and
protein sequences
[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI)
and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System
(separate from NCBI)
Four ways to access protein and DNA sequences
[1] Entrez Gene with RefSeq
Entrez Gene is a great starting point: it collects
key information on each gene/protein from
major databases. It covers all major organisms.
RefSeq provides a curated, optimal accession
number for each DNA (NM_006744)
or protein (NP_006735)
NCBI’s important RefSeq project:
best representative sequences
RefSeq (accessible via the main page of NCBI)
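Reference sequences from the NCBI databases: a curated,
non-redundant collection that includes genomic DNA contigs,
mRNAs of known genes, and proteins.
Formats of RefSeq accession numbers: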
Complete genome NC_######
Complete chromosome NC_######
Genomic contig NT_######
mRNA (DNA format) NM_###### e.g. NM_006744
Protein NP_###### e.g. NP_006735
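The prefix table above can be turned into a small lookup. The following is a minimal sketch, assuming only the four prefixes listed on this slide; the function name `classify_refseq` and the regular expression are illustrative choices, not part of any NCBI tool, and RefSeq uses further prefixes that are not covered here.

```python
import re

# Prefix-to-molecule mapping copied from the slide's table.
# Assumption: only these four prefixes are handled.
REFSEQ_PREFIXES = {
    "NC": "complete genome or complete chromosome",
    "NT": "genomic contig",
    "NM": "mRNA (DNA format)",
    "NP": "protein",
}

# Two-letter prefix, underscore, digits, optional ".version" suffix.
_REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession: str) -> str:
    """Return the molecule type implied by a RefSeq-style accession prefix."""
    match = _REFSEQ_PATTERN.match(accession.strip().upper())
    if match is None:
        return "not a RefSeq-style accession"
    return REFSEQ_PREFIXES.get(match.group(1), "prefix not covered by the table")


if __name__ == "__main__":
    # Accessions taken from the RBP4 examples in these slides.
    for acc in ("NM_006744", "NP_006735", "NT_030059", "X02775"):
        print(acc, "->", classify_refseq(acc))
```

A GenBank accession such as X02775 has no two-letter prefix followed by an underscore, so it is reported as not RefSeq-style.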
From the NCBI home
page, type “rbp4”
and hit “Go”
By applying limits, there are now just two entries
Sequence information recorded in GenBank format
[Figure labels: identifier, source organism, references, expert comments, features]
FASTA format
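A FASTA record is a header line starting with ">" followed by one or more lines of sequence. As a minimal sketch of how such a file could be read, the parser below maps each identifier to its sequence; the function name `parse_fasta` and the toy record are hypothetical, and the toy sequence is made up for illustration rather than taken from any database.

```python
from typing import Dict, Optional


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each record identifier (first word after '>') to its sequence."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the identifier is the first whitespace-separated token.
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are joined without their line breaks.
            records[current_id] += line
    return records


# Made-up toy record, NOT a real database entry.
EXAMPLE = """>toy_record_1 hypothetical example, not real data
ATGGCCAAG
TTGACC
"""

if __name__ == "__main__":
    for identifier, sequence in parse_fasta(EXAMPLE).items():
        print(identifier, len(sequence), sequence)
```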
Entrez Gene (top of page)
Note that links to
many other RBP4
database entries
are available
Entrez Gene (middle of page)
Entrez Gene (bottom of page)
Example of how to access sequence data:
HIV-1 pol
There are many possible approaches. Begin at the main
page of NCBI, and type an Entrez query: hiv-1 pol
Searching for HIV-1 pol:
Following the “genome” link yields
a manageable four results
Example of how to access sequence data:
HIV-1 pol
For the Entrez query: hiv-1 pol
there are about 40,000 nucleotide or protein records
(and >100,000 records for a search for “hiv-1”),
but these can be narrowed down in two easy steps:
--specify the organism, e.g. hiv-1[organism]
--limit the output to RefSeq!
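The two narrowing steps can also be expressed as a single Entrez query string. The sketch below only builds the query and a search URL and makes no network request. It assumes the NCBI E-utilities `esearch` endpoint with its `db`, `term`, and `retmax` parameters, the `nuccore` database name for nucleotide records, and `refseq[filter]` as the way to limit results to RefSeq; the helper names are illustrative.

```python
from urllib.parse import urlencode

# Assumed base URL of the NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_refseq_query(organism: str, keyword: str = "") -> str:
    """Combine an organism restriction, an optional keyword, and a RefSeq filter."""
    terms = [f"{organism}[organism]"]       # step 1: specify the organism
    if keyword:
        terms.append(keyword)
    terms.append("refseq[filter]")          # step 2: limit the output to RefSeq
    return " AND ".join(terms)


def build_esearch_url(term: str, db: str = "nuccore", retmax: int = 20) -> str:
    """Return an esearch URL for the query; the URL is built, not fetched."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


if __name__ == "__main__":
    query = build_refseq_query("hiv-1")
    print(query)
    print(build_esearch_url(query))
```

Opening the printed URL in a browser returns an XML list of matching record identifiers, which should correspond to the limited result set described on the slide.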
only 1 RefSeq
over 100,000
nucleotide entries
for HIV-1
Four ways to access DNA and
protein sequences
[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI)
and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System
(separate from NCBI)
[Figure labels: DNA, RNA, complementary DNA (cDNA), protein]
UniGene
UniGene: unique genes via ESTs
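· Find UniGene at NCBI:
  www.ncbi.nlm.nih.gov/UniGene
· UniGene holds ESTs and full-length mRNA sequences organized
  into clusters. Each cluster represents a specific known or
  hypothetical human gene and carries mapping and expression
  information, plus cross-references to other resources. A
  record mainly contains the gene's related sequences (cDNA,
  ESTs, etc.), its chromosomal location, and its expression
  profile. The ESTs in a cluster come from complete cDNA
  libraries.
· The UniGene database automatically partitions GenBank
  sequences into many clusters. Each record represents one
  cluster, and each cluster represents one unique gene.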
Cluster sizes in UniGene
This is a gene with
1 EST associated;
the cluster size is 1
Cluster sizes in UniGene
This is a gene with
10 ESTs associated;
the cluster size is 10
Cluster sizes in UniGene (human)
Cluster size       Number of clusters
1                  ≈8,100
2                  38,200
3-4                23,300
5-8                12,000
9-16               5,600
17-32              3,700
≈500-1000          1,050
≈2000-4000         100
≈8000-16,000       12
≈16,000-30,000     2
UniGene build 172, 8/04
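The smaller bins in the table (1, 2, 3-4, 5-8, 9-16, 17-32) double in width at each step. The sketch below shows one way such a tally could be reproduced from a list of cluster sizes, assuming power-of-two bin edges throughout; the larger bins in the table appear to be rounded, so labels for big clusters will not match it exactly. The function names and the toy sizes are made up for illustration.

```python
from collections import Counter
from typing import Iterable


def cluster_size_bin(size: int) -> str:
    """Return the power-of-two bin label (e.g. '9-16') for a cluster size."""
    if size < 1:
        raise ValueError("a cluster contains at least one sequence")
    if size <= 2:
        return str(size)
    upper = 1 << (size - 1).bit_length()  # smallest power of two >= size
    lower = upper // 2 + 1                # one above the previous power of two
    return f"{lower}-{upper}"


def size_histogram(sizes: Iterable[int]) -> Counter:
    """Count how many clusters fall into each bin."""
    return Counter(cluster_size_bin(s) for s in sizes)


if __name__ == "__main__":
    # Toy cluster sizes, not UniGene data.
    toy_sizes = [1, 10, 2, 3, 4]
    for label, count in size_histogram(toy_sizes).items():
        print(label, count)
```

A gene with 10 associated ESTs, as in the earlier slide, falls into the 9-16 bin.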
UniGene: unique genes via ESTs
Conclusion: UniGene is a useful tool to look up
information about expressed genes. UniGene
displays information about the abundance of a
transcript (expressed gene), as well as its regional
distribution of expression (e.g. brain vs. liver).
Exercise
· Use Entrez to find the nucleotide and protein RefSeq
  sequences of the human CCL18 gene, save them in FASTA
  format, and record their RefSeq accession numbers.
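Once an accession number has been recorded, the FASTA file can also be retrieved programmatically. This is a minimal sketch, assuming the NCBI E-utilities `efetch` endpoint with `rettype=fasta` and `retmode=text`, and the database names `nuccore` and `protein`. It uses the RBP4 accessions from the earlier slides as stand-ins; the accessions found in the exercise would be substituted. The helper names are illustrative, and `download_fasta` needs network access and is not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the NCBI E-utilities fetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_fasta_url(accession: str, db: str = "nuccore") -> str:
    """Return an efetch URL that asks for one record in FASTA text format."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


def fasta_filename(accession: str) -> str:
    """Derive a local file name from the accession (hypothetical convention)."""
    return accession.replace(".", "_") + ".fasta"


def download_fasta(accession: str, db: str = "nuccore") -> str:
    """Fetch the record and save it; requires network access."""
    path = fasta_filename(accession)
    with urlopen(build_fasta_url(accession, db)) as response:
        text = response.read().decode("utf-8")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path


if __name__ == "__main__":
    # Only print the URLs; nothing is downloaded here.
    print(build_fasta_url("NM_006744"))
    print(build_fasta_url("NP_006735", db="protein"))
```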
Access to Biomedical Literature
PubMed is…
· National Library of Medicine's search service
· 12 million citations in MEDLINE
· links to participating online journals
· PubMed tutorial (via “Education” on side bar)
PubMed at NCBI
to find literature
information
PubMed is the NCBI gateway to MEDLINE.
MEDLINE contains bibliographic citations
and author abstracts from over 4,600 journals
published in the United States and in 70 foreign
countries.
It has 12 million records dating back to 1966.
PubMed search strategies
Try the tutorial (“education” on the left sidebar)
Use boolean queries (capitalize AND, OR, NOT)
lipocalin AND disease
Try using “limits”
Try “Links” to find Entrez information and external resources
Obtain articles on-line via Welch Medical Library
(and download pdf files):
http://www.welch.jhu.edu/
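· Journal Database: browse journals.
· MeSH Database: browse the medical subject headings
  hierarchically.
· Single Citation Matcher: enter journal information to find
  a single article or the full contents of a journal.
· Batch Citation Matcher: enter journal information in a
  specific format to search for several articles at once.
· Clinical Queries: designed for clinicians; filters restrict
  the retrieved literature to four categories: therapy,
  diagnosis, etiology, and prognosis.
Related Resources
· Order Documents: lets users obtain the full text of
  articles locally, although some are charged for.
· NLM Mobile: a link to another web-based NLM query system.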
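Exercise
· Search PubMed for reports of research on the human CCL18
  gene (published after 2000), list the retrieved titles,
  and try to obtain the full text of one or two of them.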
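The Boolean queries and date limits described for PubMed can be combined in one search request. The sketch below only constructs the URL and does not contact NCBI. It assumes the E-utilities `esearch` endpoint with `db=pubmed` and the `datetype`, `mindate`, and `maxdate` parameters for a publication-date range; the helper names and the upper year of 2025 are arbitrary illustrative choices.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed base URL of the NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def boolean_query(*terms: str, operator: str = "AND") -> str:
    """Join search terms with a capitalized Boolean operator."""
    op = operator.upper()
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR, or NOT")
    return f" {op} ".join(terms)


def build_pubmed_url(term: str,
                     min_year: Optional[int] = None,
                     max_year: Optional[int] = None,
                     retmax: int = 20) -> str:
    """Return a PubMed esearch URL, optionally limited by publication year."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    if min_year is not None and max_year is not None:
        # Both ends of the range are supplied together.
        params.update({"datetype": "pdat",
                       "mindate": min_year,
                       "maxdate": max_year})
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_pubmed_url(boolean_query("lipocalin", "disease")))
    print(build_pubmed_url(boolean_query("human", "CCL18"), 2000, 2025))
```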
BLAST is…
· Basic Local Alignment Search Tool
· NCBI's sequence similarity search tool
· supports analysis of DNA and protein databases
· 80,000 searches per day
OMIM is…
· Online Mendelian Inheritance in Man
· catalog of human genes and genetic disorders
· edited by Dr. Victor McKusick, others at JHU
Books is…
· searchable resource of on-line books
TaxBrowser is…
· browser for the major divisions of living organisms
  (archaea, bacteria, eukaryota, viruses)
· taxonomy information such as genetic codes
· molecular data on extinct organisms
Structure site includes…
· Molecular Modelling Database (MMDB)
· biopolymer structures obtained from
  the Protein Data Bank (PDB)
· Cn3D (a 3D-structure viewer)
· vector alignment search tool (VAST)
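Assignment 1
· Use Entrez to find the nucleotide and protein RefSeq
  sequences of the human CCL18 and human CXCL1 genes, save
  them in FASTA format, and record the sequence information
  obtained from GenBank.
· Search PubMed for reports of research on the human CCL18
  gene (published after 2000), list the retrieved titles,
  and try to obtain the full text of one or two of them.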