2009-8-20 高级人工智能 史忠植 1
第九章 知识发现和数据开采数据库中知识发现史忠植中科院计算所
2009-8-20 高级人工智能 史忠植 2
概述在数据库基础上实现的知识发现系统,
通过综合运用统计学、粗糙集、模糊数学、
机器学习,和专家系统等多种学习的手段和方法,从大量的数据中提炼出抽象的知识,从而揭示出蕴涵在这些数据背后的客观世界的内在联系和本质规律,实现知识的自动获取,这是一个富有挑战性、应用前景广阔的研究课题。
2009-8-20 高级人工智能 史忠植 3
Evolution of Database Technology:
from data management to data analysis
1960s:
– Data collection,database creation,IMS and network
DBMS.
1970s,
– Relational data model,relational DBMS implementation.
1980s,
– RDBMS,advanced data models (extended-relational,OO,
deductive,etc.) and application-oriented DBMS (spatial,
scientific,engineering,etc.).
1990s,
– Data mining and data warehousing,multimedia databases,
and Web technology.
2009-8-20 高级人工智能 史忠植 4
Motivations
“Necessity is the Mother of Invention”
Data explosion problem:
– Automated data collection tools,mature database
technology and internet lead to tremendous amounts of
data stored in databases,data warehouses and other
information repositories.
We are drowning in information,but starving for
knowledge! (John Naisbett)
Data warehousing and data mining,
– On-line analytical processing
– Extraction of interesting knowledge (rules,regularities,
patterns,constraints) from data in large databases.
2009-8-20 高级人工智能 史忠植 5
The Value Chain
Data
Customer data
Store data
Demographical Data
Geographical data
Information
X lives in Z
S is Y years old
X and S moved
W has money in Z
Knowledge
A quantity Y of product A is used
in region Z
Customers of class Y use x% of
C during period D
Decision
Promote product A in region Z.
Mail ads to families of profile P
Cross-sell service B to clients C
2009-8-20 高级人工智能 史忠植 6
What is DM useful for?
Marketing
Database
Marketing
Data
Warehousing
KDD &
Data Mining
Increase knowledge
to base decision
upon.
E.g.,impact on
marketing
2009-8-20 高级人工智能 史忠植 7
Application Areas and
Opportunities
Marketing,segmentation,customer targeting,...
Finance,investment support,portfolio management
Banking & Insurance,credit and policy approval
Security,fraud detection
Science and medicine,hypothesis discovery,
prediction,classification,diagnosis
Manufacturing,process modeling,quality control,
resource allocation
Engineering,simulation and analysis,pattern
recognition,signal processing
Internet,smart search engines,web marketing
2009-8-20 高级人工智能 史忠植 8
Classes of applications
Market analysis
target marketing,customer relation management,market
basket analysis,cross selling,market segmentation.
Risk analysis
Forecasting,customer retention,improved underwriting,
quality control,competitive analysis.
Fraud detection
Text (news group,email,documents) and Web analysis.
2009-8-20 高级人工智能 史忠植 9
Selection and
Preprocessing
Data Mining
Interpretation
and Evaluation
Data
Consolidation
Knowledge
p(x)=0.02
Warehouse
Data Sources
Patterns &
Models
Prepared Data
Consolidated
Data
The KDD process
2009-8-20 高级人工智能 史忠植 10
The selection and processing of data for:
– the identification of novel,accurate,and useful
patterns,and
– the modeling of real-world phenomena.
Data mining is a major component of the KDD
process - automated discovery of patterns and the
development of predictive and explanatory models.
What is KDD? A process!
2009-8-20 高级人工智能 史忠植 11
Learning the application domain:
– relevant prior knowledge and goals of application
Data consolidation,Creating a target data set
Selection and Preprocessing
– Data cleaning,(may take 60% of effort!)
– Data reduction and projection:
find useful features,dimensionality/variable reduction,invariant representation.
Choosing functions of data mining
– summarization,classification,regression,association,
clustering.
Choosing the mining algorithm(s)
Data mining,search for patterns of interest
Interpretation and evaluation,analysis of results.
– visualization,transformation,removing redundant patterns,…
Use of discovered knowledge
The steps of the KDD process
2009-8-20 高级人工智能 史忠植 12
C o g N o v a
T ech n o lo g ies
9
T h e K DD P r o c e s s
S e l e c t i on a nd
P r e pr oc e s s i ng
D a t a M i n i n g
I nt e r pr e t a t i on
a nd E va l ua t i on
D a t a
C on s ol i da t i on
K n o w l e d g e
p (x )= 0,0 2
W a re h o u se
D a t a S ourc e s
P a t t e rns &
M ode l s
P re pa re d D a t a
Cons o l i da t e d
D a t a
Identify
Problem or
Opportunity
Measure effect
of Action
Act on
Knowledge
Knowledge
ResultsStrategy
Problem
The virtuous cycle
2009-8-20 高级人工智能 史忠植 13
Applications,operations,
techniques
2009-8-20 高级人工智能 史忠植 14
Graphical User Interface
Data
Consolidation
Selection
and
Preprocessing
Data
Mining
Interpretation
and Evaluation
Warehouse KnowledgeData Sources
Architecture of a KDD system
2009-8-20 高级人工智能 史忠植 15
Garbage in Garbage out
The quality of results relates directly to quality of
the data
50%-70% of KDD process effort is spent on data
consolidation and preparation
Major justification for a corporate data warehouse
Data consolidation and
preparation
2009-8-20 高级人工智能 史忠植 16
From data sources to consolidated data repositoryRDBMS
Legacy
DBMS
Flat Files
Data
Consolidation
and Cleansing
Warehouse
Object/Relation DBMS
Multidimensional DBMS
Deductive Database
Flat files
External
Data consolidation
2009-8-20 高级人工智能 史忠植 17
Determine preliminary list of attributes
Consolidate data into working database
– Internal and External sources
Eliminate or estimate missing values
Remove outliers (obvious exceptions)
Determine prior probabilities of categories and
deal with volume bias
Data consolidation
2009-8-20 高级人工智能 史忠植 18
Generate a set of examples
– choose sampling method
– consider sample complexity
– deal with volume bias issues
Reduce attribute dimensionality
– remove redundant and/or correlating attributes
– combine attributes (sum,multiply,difference)
Reduce attribute value ranges
– group symbolic discrete values
– quantize continuous numeric values
Transform data
– de-correlate and normalize values
– map time-series data to static representation
OLAP and visualization tools play key role
Data selection and preprocessing
2009-8-20 高级人工智能 史忠植 19
Data mining tasks and methods
Automated Exploration/Discovery
– e.g.,discovering new market segments
– clusteringanalysis
Prediction/Classification
– e.g.,forecasting gross sales given current factors
– regression,neural networks,genetic algorithms,decision trees
Explanation/Description
– e.g.,characterizing customers by demographics and
purchase history
– decision trees,association rules
x1
x2
f(x)
x
if age > 35
and income < $35k
then,..
2009-8-20 高级人工智能 史忠植 20
Clustering,partitioning a set of data into a set of classes,
called clusters,whose members share some interesting
common properties.
Distance-based numerical clustering
– metric grouping of examples (K-NN)
– graphical visualization can be used
Bayesian clustering
– search for the number of classes which result in best fit
of a probability distribution to the data
– AutoClass (NASA) one of best examples
Automated exploration and
discovery
2009-8-20 高级人工智能 史忠植 21
Clustering,partitioning a set of data into a set of classes,
called clusters,whose members share some interesting
common properties.
Distance-based numerical clustering
– metric grouping of examples (K-NN)
– graphical visualization can be used
Bayesian clustering
– search for the number of classes which result in best fit
of a probability distribution to the data
– AutoClass (NASA) one of best examples
Automated exploration and
discovery
2009-8-20 高级人工智能 史忠植 22
Learning a predictive model
Classification of a new case/sample
Many methods:
– Artificial neural networks
– Inductive decision tree and rule systems
– Genetic algorithms
– Nearest neighbor clustering algorithms
– Statistical (parametric,and non-parametric)
Prediction and classification
2009-8-20 高级人工智能 史忠植 23
The objective of learning is to achieve good
generalization to new unseen cases.
Generalization can be defined as a mathematical
interpolation or regression over a set of training
points
Models can be validated with a previously unseen
test set or using cross-validation methods
f(x)
x
Generalization and regression
2009-8-20 高级人工智能 史忠植 24
Objective,Develop a general model or
hypothesis from specific examples
Function approximation (curve fitting)
Classification (concept learning,pattern
recognition) x1
x2
A B
f(x)
x
Summarizing,inductive
modeling = learning
2009-8-20 高级人工智能 史忠植 25
Learn a generalized hypothesis (model) from
selected data
Description/Interpretation of model provides new
knowledge
Methods:
– Inductive decision tree and rule systems
– Association rule systems
– Link Analysis
– …
Explanation and description
2009-8-20 高级人工智能 史忠植 26
Generate a model of normal activity
Deviation from model causes alert
Methods:
– Artificial neural networks
– Inductive decision tree and rule systems
– Statistical methods
– Visualization tools
Exception/deviation detection
2009-8-20 高级人工智能 史忠植 27
Outlier and exception data
analysis
Time-series analysis (trend and deviation),
– Trend and deviation analysis,regression,
sequential pattern,similar sequences,trend and
deviation,e.g.,stock analysis.
– Similarity-based pattern-directed analysis
– Full vs,partial periodicity analysis
Other pattern-directed or statistical analysis
2009-8-20 高级人工智能 史忠植 28
A data mining system/query may generate thousands of
patterns,not all of them are interesting.
Interestingness measures:
– easily understood by humans
– valid on new or test data with some degree of certainty.
– potentially useful
– novel,or validates some hypothesis that a user seeks to
confirm
Objective vs,subjective interestingness measures
– Objective,based on statistics and structures of patterns,e.g.,
support,confidence,etc.
– Subjective,based on user’s beliefs in the data,e.g.,
unexpectedness,novelty,etc.
Are all the discovered pattern
interesting?
2009-8-20 高级人工智能 史忠植 29
Find all the interesting patterns,Completeness.
– Can a data mining system find all the interesting patterns?
Search for only interesting patterns,Optimization.
– Can a data mining system find only the interesting patterns?
– Approaches
First generate all the patterns and then filter out the
uninteresting ones.
Generate only the interesting patterns - mining query
optimization.
Completeness vs,optimization
2009-8-20 高级人工智能 史忠植 30
Evaluation
Statistical validation and significance testing
Qualitative review by experts in the field
Pilot surveys to evaluate model accuracy
Interpretation
Inductive tree and rule models can be read directly
Clustering results can be graphed and tabled
Code can be automatically generated by some
systems (IDTs,Regression models)
Interpretation and evaluation
2009-8-20 高级人工智能 史忠植 31
Visualization tools can be very helpful
– sensitivity analysis (I/O relationship)
– histograms of value distribution
– time-series plots and animation
– requires training and practice
Response
Velocity
Temp
Interpretation and evaluation
2009-8-20 高级人工智能 史忠植 32
1989 IJCAI Workshop on KDD
– Knowledge Discovery in Databases (G,Piatetsky-Shapiro
and W,Frawley,eds.,1991)
1991-1994 Workshops on KDD
– Advances in Knowledge Discovery and Data Mining (U,
Fayyad,G,Piatetsky-Shapiro,P,Smyth,and R,Uthurusamy,
eds.,1996)
1995-1998 AAAI Int,Conf,on KDD and DM
(KDD’95-98)
– Journal of Data Mining and Knowledge Discovery (1997)
1998 ACM SIGKDD
1999 SIGKDD’99 Conf.
Important dates of data mining
2009-8-20 高级人工智能 史忠植 33
典型的知识发现系统
CoverStory:超级市场数据分析系统。
分析超级市场的数据,产生有关重要事件的报告。
标识并排列产品量跨时间跨地区的富有意义的变化,并对这些变化提供一些可能的解释。
并且以自然语言的形式输出结果。
EXPLORA,一个用于概念性的分析数据和查找感兴趣关系的集成化系统。
KDW,交互式的大型数据库的分析工具
Forty-Niner(49er),通用的 KDD系统
SKICAT ( SKy Image Cataloging and Analysis Tool):
用于分析大规模天空观测数据的自动系统。
2009-8-20 高级人工智能 史忠植 34
典型的知识发现系统
SAS公司的 SAS Enterprise Miner
IBM公司的 Intelligent Miner
Solution公司的 Clementine
中科院计算技术研究所的 MSMiner
2009-8-20 高级人工智能 史忠植 35
知识发现
从数据库中抽取和精化新的模式。
范围非常广泛,数据结构也各不相同。目前,数据开采主要把关系型数据库作为实施对象。
发现的知识可以表示成各种形式,包括规则、法则、科学规律、方程或概念网。
抽取模式的算法是归纳与演译的结合。
2009-8-20 高级人工智能 史忠植 36
信息源
用户发出的对控制器的高级命令;
数据库提供的原始数据;
来自各个方面的存入系统知识库的领域知识。
2009-8-20 高级人工智能 史忠植 37
模式
1)概念描述
2)聚类
3)分类
4)依赖关系分析
5)异常检测
6)模式生成
2009-8-20 高级人工智能 史忠植 38
KDD的技术难点
1)动态变化的数据
2)噪声
3)数据不完整
4)冗余信息
5)数据稀疏
6)超大数据量
2009-8-20 高级人工智能 史忠植 39
数据仓库数据仓库和数据开采之间有着非常密切的关系。将数据开采扩充到它的数据仓库系统环境中,
可以增强用户的决策支持能力。用户从数据仓库中开采信息时有多种不同的方式,一种是较低层次上的由用户制导的被动方式,另一种是高层次上的主动式自动发现方式。前者称验证驱动数据开采,后者称发现驱动数据开采。
2009-8-20 高级人工智能 史忠植 40
数据仓库数据仓库是面向主题的、集成的、稳定的、
不同时间的数据集合,用以支持经营管理中的决策制定过程。从体系结构上看,数据仓库系统由三部分组成:数据仓库、数据仓库管理系统和数据仓库工具。由于建立数据仓库的主要目标是为了提供决策支持,因此,在整个系统中,数据仓库虽然居于核心地位,但它只是进行进一步信息开采的基础。这样一来,数据仓库的工具是否强大,将对整个系统的能力起着非常重要的作用。
2009-8-20 高级人工智能 史忠植 41
关联规则发现任务给定一个事务数据库 D,求出所有满足最小支持度和最小可信度的联想规则。该问题可以分解为两个子问题:
1) 求出 D中满足最小支持度的所有高频物品集;
2) 利用高频物品集生成满足最小可信度的所有联想规则。
2009-8-20 高级人工智能 史忠植 42
关联规则发现的基本思路
1) 首先生成长度为 1的常用物品集(即单个物品),
记为 L[1];
2) 在 L[k]的基础上生成候选物品集 C[k+1],候选物品集必须保证包括所有的常用物品集。
3) 用事务数据库 D中的事务对 C[k+1]进行支持度测试以生成长度为 k+1的常用物品集 L[k+1],计算每个候选物品集的支持度,如果大于 minsup,则加入到 L[k+1]中。
4) 如果 L[k+1]为空集,则结束,L[1]∪ L[2]∪ … 即为结果;否则转 (2),继续。
2009-8-20 高级人工智能 史忠植 43
Apriori算法
(1) L[1]={large 1-itemsets};
(2) for (k=2; L[k-1]不为空 ; k++) do begin
(3) C[k]=apriori-gen(L[k-1]); // 新候选物品
(4) forall transactions t∈ D do begin
(5) C=subset(C[k],t); // t中的候选物品
(6) forall candidates c∈ C do
(7) c.count++;
(8) end;
(9) L[k]={c∈ C[k]|c.count>=minsup};
(10) end;
(11) Answer = L[1]∪ L[2]∪ …
2009-8-20 高级人工智能 史忠植 44
join算法
insert into C[k]
select p.item1,p.item2,...,p.item(k-1),
q.item(k-1)
from L[k-1]p,L[k-1]q
where p.item1=q.item1,...,p.item(k-
2)=q.item(k-2),p.item(k-1)=q.item(k-1)
prune算法
(1) forall itemsets c∈ C[k] do
(2) forall (k-1)-subsets s of c do
(3) if (s L[k-1])
(4) then delete c from C[k]
2009-8-20 高级人工智能 史忠植 45
知识发现工具 SAS
SAS公司的 SAS Enterprise Miner是一种通用的数据挖掘工具。
通过收集分析各种统计资料和客户购买模式,SAS Enterprise Miner
可以帮助您发现业务的趋势,解释已知的事实,预测未来的结果,并识别出完成任务所需的关键因素,以实现增加收入、降低成本。
SAS Enterprise Miner提供 "抽样 -探索 -转换 -建模 -评估 "(SEMMA)的处理流程。数据挖掘算法有:
·聚类分析,SOM/KOHONEN神经网络分类算法
·关联模式 /序列模式分析
·多元回归模型
·决策树模型 (C45,CHAID,CART)
·神经网络模型 (MLP,RBF)
·SAS/STAT,SAS/ETS等模块提供的统计分析模型和时间序列分析模型也可嵌入其中。
2009-8-20 高级人工智能 史忠植 46
知识发现工具
IBM公司的 Intelligent Miner具有典型数据集自动生成、关联发现、序列规律发现、
概念性分类和可视化显示等功能。它可以自动实现数据选择、数据转换、数据发掘和结果显示。若有必要,对结果数据集还可以重复这一过程,直至得到满意结果为止。
2009-8-20 高级人工智能 史忠植 47
知识发现工具
Solution公司的 Clementine 提供了一个可视化的快速建立模型的环境。它由数据获取( Data Access),探查
( Investigate),整理( Manipulation)、
建模( Modeling) 和报告( Reporting) 等部分组成。都使用一些有效、易用的按钮表示,用户只需用鼠标将这些组件连接起来建立一个数据流,可视化的界面使得数据挖掘更加直观交互,从而可以将用户的商业知识在每一步中更好的利用。
2009-8-20 高级人工智能 史忠植 48
知识发现工具 MSMiner
中科院计算技术研究所智能信息处理开放实验室开发的 MSMiner是一种多策略知识发现平台,能够提供快捷有效的数据挖掘解决方案,提供多种知识发现方法。
MSMiner具有下列特点:
·提出了一种面向对象的元数据结构,
·设计实现了一种简单但有效的数据仓库平台
·提出了一种面向对象的数据挖掘任务模型
·设计了一种可扩展算法库