Chapter 7: Descriptive Statistics
I. Central Tendency
1. What is the most typical value?
• The Average: a typical value for quantitative data
• The Weighted Average: adjusting for importance
• The Median: a typical value for quantitative and ordinal data
• The Mode: a typical value even for nominal data
2. What percentile is it?
• Extremes, Quartiles, and Box Plots
• The cumulative distribution function displays the percentiles
Average or Mean
• Add the data, divide by n or N (the number of elementary units)
• Divides the total equally; the only such summary
• A representative, central number (if the data set is approximately normal)
• Summation notation
  - Σ is the capital Greek sigma
Sample average:
$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
Population average:
$$\mu = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{1}{N}\sum_{i=1}^{N} X_i$$
Example: Number of Defects
• Defects measured for each of 10 production lots
4, 1, 3, 7, 3, 0, 7, 14, 5, 9
[Figure: histogram of defects per lot (horizontal axis: defects per lot, 0 to 20; vertical axis: frequency in lots)]
• Average is 5.3 defects per lot (the ten listed values sum to 53)
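The "add the data, divide by n" rule can be checked directly on the ten lot counts listed above. This is a minimal sketch; the function name `sample_average` is chosen here for illustration only.

```python
from statistics import mean

# Defects per lot for the 10 production lots listed above.
defects = [4, 1, 3, 7, 3, 0, 7, 14, 5, 9]

def sample_average(values):
    """Add the data and divide by n, the number of elementary units."""
    if not values:
        raise ValueError("average of empty data is undefined")
    return sum(values) / len(values)

average = sample_average(defects)
print(f"n = {len(defects)}, total = {sum(defects)}, average = {average}")

# The standard library gives the same result.
print(mean(defects))
```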
Median
• Also summarizes the data
• The middle one (note that it is a positional measure!)
  - Put the data in order (sort first)
  - Pick the middle one (or average the middle two if n is even)
  - Median(9, 4, 5) = Median(4, 5, 9) = 5
  - Median(9, 4, 5, 7) = Median(4, 5, 7, 9) = (5+7)/2 = 6
• Rank of the median is (1+n)/2
  - If n = 3, rank is (1+3)/2 = 2
  - If n = 4, rank is (1+4)/2 = 2.5 (so average the 2nd and 3rd)
  - If n = 262, rank is (1+262)/2 = 131.5 (so average the 131st and 132nd)
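The rank rule above translates almost line for line into code. The sketch below is one possible implementation; `median_by_rank` is a name introduced here, not something from the slides.

```python
def median_by_rank(values):
    """Median via the rank rule: rank = (1 + n) / 2 on the sorted data."""
    ordered = sorted(values)          # put the data in order first
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    rank = (1 + n) / 2                # ends in .5 when n is even
    lower = int(rank)                 # 1-based rank at or just below `rank`
    if rank == lower:                 # odd n: a single middle value
        return ordered[lower - 1]
    # even n: average the two values on either side of the fractional rank
    return (ordered[lower - 1] + ordered[lower]) / 2

print(median_by_rank([9, 4, 5]))      # 5
print(median_by_rank([9, 4, 5, 7]))   # 6.0
```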
Median (continued)
• A representative, central number
  - If the data set has a center
• Less sensitive to outliers than the average
• For skewed data, represents the "typical case" (the majority) better than the average does
  - e.g., incomes
  - Average income for a country equally divides the total, which may include some very high incomes
  - Median income chooses the middle person (half earn less, half earn more), giving less influence to high incomes (if any)
Example: Spending
• Customers plan to spend ($thousands)
3.8, 1.4, 0.3, 0.6, 2.8, 5.5, 0.9, 1.1
  - Ordered from smallest to largest, with ranks:
Value: 0.3, 0.6, 0.9, 1.1, 1.4, 2.8, 3.8, 5.5
Rank:  1    2    3    4    5    6    7    8
• Rank of median = (1+8)/2 = 4.5
• Median is (1.1+1.4)/2 = 1.25
  - Smaller than the average, 2.05
  - Due to slight skewness?
[Figure: plot of the spending data on a 0 to 5 scale ($thousands), with the median and the average marked]
Example: The Crash of October 19, 1987
The market lost about 20% of its value in one day
• Dow-Jones Industrials: stock-price changes as each stock began trading that fateful morning
• Fairly normal (approximately normally distributed)
• Mean and median are similar
  - Average = -8.2%
  - Median = -8.6%
[Figure: histogram of percent change at opening (horizontal axis: -20% to 0%; vertical axis: frequency)]
Example: Incomes (many small values, some moderate values, and a few large and very large values)
• Personal income of 100 people
• Average is higher than median due to skewness
  - Average = $38,710
  - Median = $27,216
[Figure: histogram of income (horizontal axis: $0 to $200,000; vertical axis: frequency, 0 to 50)]
Mode
• Also summarizes the data
• Most common data value
  - Middle of the tallest histogram bar
• Problems:
  - Depends on how you draw the histogram (bin width)
  - Might be more than one mode (two tallest bars)
• Good if most data values are "correct"
• Good for nominal data (e.g., elections)
[Figure: histogram with two modes marked]
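Because the mode only counts occurrences, it works on category labels as well as numbers, and it can return more than one value. The sketch below uses hypothetical ballots (the candidate labels are made up for illustration) and a helper named `modes` introduced here.

```python
from collections import Counter

def modes(values):
    """Return every most-common value (there may be more than one mode)."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

votes = ["A", "B", "A", "C", "B", "A"]   # hypothetical nominal data
print(modes(votes))                      # ['A']
print(modes([1, 2, 2, 3, 3]))            # [2, 3]  (two modes)
```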
Normal Distribution
• Average, median, and mode are identical
  - If the data come from a normal distribution
[Figure: normal curve; the average, median, and mode coincide at the center]
Skewed Distribution
• Average, median, and mode are different
  - The few large (or small) values influence the mean more than the median
  - The highest point is not in the center
[Figure: skewed curve with the average, median, and mode marked at different positions]
Which Summary of Central Tendency?
• Average
  - Best for normal data
  - Preserves totals (retains the information from every observation)
• Median
  - Good for skewed data or data with outliers, provided you do not need to preserve or estimate total amounts
• Mode
  - Best for categories (nominal data)
  - The mode is the only summary computable for nominal data!
Which Summary? (continued)
• Average requires quantitative data (numbers)
• Median works with quantitative or ordinal data
• Mode works with quantitative, ordinal, or nominal data

          Quantitative   Ordinal   Nominal
Average   Yes            -         -
Median    Yes            Yes       -
Mode      Yes            Yes       Yes
Weighted Average
• Ordinary average gives the same weight to all elementary units
$$\bar{X} = \frac{1}{n}X_1 + \frac{1}{n}X_2 + \cdots + \frac{1}{n}X_n$$
• Weighted average allows different weights
$$\text{Weighted average} = w_1X_1 + w_2X_2 + \cdots + w_nX_n$$
• Weights must add up to 1
$$w_1 + w_2 + \cdots + w_n = 1$$
  - If not, divide each weight by their total (normalization)
Weighted Average (continued)
• Average is per elementary unit
  - The average of your course grades is your "average per course"
• Weighted average is per unit of weight
  - Your GPA (grade point average) is a weighted average, using credit hours to define the weights. The weighted average is your "average per credit hour"
Example: Portfolio Rate of Return
• Portfolio expected return (an interest rate, indicating performance) is the weighted average of the expected rates of return of the assets in the portfolio, weighted by dollars invested
$$E(R_P) = w_1E(R_1) + w_2E(R_2) + \cdots + w_nE(R_n)$$
• Portfolio contains three stocks. One ($1,000 invested) is expected to return 20%. Another ($1,800 invested) is expected to return 15%. The third is $2,200 invested with a 30% expected return.
• Total invested is 1,000 + 1,800 + 2,200 = $5,000
Example (continued)
• Weights are (each asset's share of the amount invested)
w1 = $1,000/$5,000 = 0.20
w2 = $1,800/$5,000 = 0.36
w3 = $2,200/$5,000 = 0.44
• Weighted average is
0.20×(20%) + 0.36×(15%) + 0.44×(30%) = 22.6%
  - The expected return for the portfolio
  - Each stock is represented in proportion to dollars invested
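The portfolio calculation can be reproduced by normalizing the dollar amounts into weights and then summing weight times value. This is a sketch under the slide's numbers; `weighted_average` is a helper name introduced here.

```python
def weighted_average(values, raw_weights):
    """Weighted average; raw weights are divided by their total so they sum to 1."""
    if len(values) != len(raw_weights):
        raise ValueError("values and weights must have the same length")
    total = sum(raw_weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    weights = [w / total for w in raw_weights]      # normalization step
    return sum(w * x for w, x in zip(weights, values))

invested = [1000, 1800, 2200]          # dollars invested in each stock
expected_returns = [20.0, 15.0, 30.0]  # expected return of each stock, in percent

portfolio_return = weighted_average(expected_returns, invested)
print(round(portfolio_return, 1))      # 22.6
```

Passing the raw dollar amounts instead of precomputed weights keeps the normalization in one place, so the weights always add up to 1.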
Percentiles
• Landmark summaries in the same measurement units as the data
  - e.g., dollars, people, miles per gallon, ...
• Some familiar percentiles
  - Smallest data value is the 0th percentile
  - Median is the 50th percentile
  - Largest data value is the 100th percentile
  - The 90th percentile is larger than 90% of the elementary units
• Finding percentiles
  - Difficult to see from a histogram
  - Easy using the CDF (Cumulative Distribution Function)
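Cumulative Distribution Function
(In effect, it is the accumulated area under the frequency polygon)
• Data axis horizontally (as in histogram)
• Cumulative percent vertically
• Equal vertical jump at each data value
0.3, 0.6, 0.9, 1.1, 1.4, 2.8, 3.8, 5.5
• Reading across from 80% on the vertical axis: the 80th percentile is 3.8 ($thousands)
[Figure: CDF of spending (horizontal axis: spending, $0 to $6 thousand; vertical axis: cumulative percent, 0% to 100%)]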
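Reading a percentile off the CDF amounts to finding the first data value at which the cumulative percent reaches the target level. The sketch below follows that reading; other percentile conventions exist and can give slightly different answers, and the function names are introduced here for illustration.

```python
import math

def empirical_cdf(values):
    """Pairs (data value, cumulative fraction): an equal jump of 1/n at each value."""
    ordered = sorted(values)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]

def percentile_from_cdf(values, percent):
    """Smallest data value at which the cumulative percent reaches `percent`."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    k = max(1, math.ceil(percent * n / 100))   # 1-based rank of that value
    return ordered[k - 1]

spending = [3.8, 1.4, 0.3, 0.6, 2.8, 5.5, 0.9, 1.1]   # $thousands
for value, cumulative in empirical_cdf(spending):
    print(f"{value:>4}  {cumulative:6.1%}")
print(percentile_from_cdf(spending, 80))   # 3.8
```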
Example: Business Failures
• Per million people, by state
  - 50th percentile is 260.2
  - 90th percentile is 432.4
[Figure: CDF of business failures (horizontal axis: failures, 0 to 700; vertical axis: cumulative percent, 0% to 100%)]
Five-Number Summary
• Selected landmarks to represent the entire data set
  - Median = 50th percentile
  - Quartiles
    - LQ = Lower Quartile = 25th percentile
      Rank of LQ:
$$\text{Rank of LQ} = \frac{1 + \operatorname{int}\left(\frac{1+n}{2}\right)}{2}$$
      Here (1+n)/2 is the rank of the median, and int is the integer-part function: discard the decimal, if any. For example, int(10.5) = 10 and int(35) = 35.
    - UQ = Upper Quartile = 75th percentile
      Rank of UQ is n + 1 - [rank of lower quartile]
  - Extremes
    - Smallest = 0th percentile
    - Largest = 100th percentile
Five-Number Summary (continued)
• Provides information about
  - Central summary
    - Median
  - Range of the data
    - Largest - Smallest
  - "Middle half" of the data
    - From LQ to UQ
  - Skewness
    - If the median is not approximately halfway between the quartiles
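Box Plot
• Displays the five-number summary
• Less detail than a histogram
  - Easier to compare many groups
[Figure: box plot on a 0 to 8 scale, labeling the smallest value, lower quartile, median, upper quartile, and largest value; the box from the lower quartile to the upper quartile spans the middle half of the data]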
Example: Spending
• Spending, rank ordered from smallest to largest
Value: 0.3, 0.6, 0.9, 1.1, 1.4, 2.8, 3.8, 5.5
Rank:  1    2    3    4    5    6    7    8
• Rank of median = (1+8)/2 = 4.5
• Rank of LQ = (1+4)/2 = 2.5, where 4 = int(4.5)
  - LQ is (0.6+0.9)/2 = 0.75
• Rank of UQ = 8+1-2.5 = 6.5
  - UQ is (2.8+3.8)/2 = 3.3
Example: Spending (continued)
• Five-number summary
Smallest, LQ, Median, UQ, Largest
0.3, 0.75, 1.25, 3.3, 5.5
• Box plot
  - Shows some skewness (lack of symmetry)
[Figure: box plot of spending ($thousands), on a 0 to 5 scale]
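The quartile rank rules can be combined into a small routine that produces the whole five-number summary. This sketch follows the rank definitions used in this chapter; statistical packages often use other quartile conventions, so their results may differ slightly. The function names are introduced here.

```python
def value_at_rank(ordered, rank):
    """Value at a 1-based rank; a rank ending in .5 averages its two neighbors."""
    lower = int(rank)
    if rank == lower:
        return ordered[lower - 1]
    return (ordered[lower - 1] + ordered[lower]) / 2

def five_number_summary(values):
    """(smallest, LQ, median, UQ, largest) using the chapter's rank rules."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("summary of empty data is undefined")
    median_rank = (1 + n) / 2
    lq_rank = (1 + int(median_rank)) / 2      # int() discards the decimal
    uq_rank = n + 1 - lq_rank
    return (
        ordered[0],
        value_at_rank(ordered, lq_rank),
        value_at_rank(ordered, median_rank),
        value_at_rank(ordered, uq_rank),
        ordered[-1],
    )

spending = [3.8, 1.4, 0.3, 0.6, 2.8, 5.5, 0.9, 1.1]   # $thousands
print([round(x, 2) for x in five_number_summary(spending)])
```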
Example: Executive Compensation
• Box plots to compare firms within industry groups
  - Utilities group generally shows lower compensation
  - Highest-paid are in the Financial Services group
[Figure: box plots of CEO compensation ($thousands, 0 to 5,000) for four industry groups: Banks, Drugs, Financial Services, Utilities]
Example (continued)
• Detailed box plot (with outliers and most extreme non-outliers named)
[Figure: detailed box plots of CEO compensation ($thousands, 0 to 5,000) for Banks, Drugs, Financial Services, and Utilities. Firms labeled in the plot: Panhandle Eastern, Wheelabrator Technologies, Bear Stearns, Equitable, First Financial Mgmt., Merrill Lynch, Enron, Houston Industries, FPL Group, U.S. Bancorp, Citicorp, Becton Dickinson, Pfizer, Berkshire Hathaway, Travelers, Scana, Sonat]
Comparing Three Displays
• Compare histogram, box plot, and CDF
[Figure: three panels sharing the horizontal axis Failures (0 to 500): histogram (frequency, 0 to 10), box plot, and CDF (cumulative percent, 0% to 100%)]
II. Dispersion
Variability
• Also known as dispersion, spread, uncertainty, diversity, risk
• Example data: 2, 2, 2, 2, 2, 2, 2
  - Variability = 0
• Example data: 1, 3, 2, 2, 1, 2, 3
  - How much variability?
  - Look at how far each data value is from the average, $\bar{X} = 2$:
  - Deviations from average are -1, 1, 0, 0, -1, 0, 1
  - Variability should be between 0 and 1
Examples
• Stock market: the daily change is uncertain
  - Not the same, day after day!
• Risk of a business venture
  - There are potential rewards, but possible losses
• Uncertain payoffs and risk aversion
  - Which would you rather have?
    - $1,000,000 for sure
    - $0 or $2,000,000, each outcome equally likely
  - Both have the same average ($1,000,000)!
  - Most would prefer the choice with less uncertainty
Standard Deviation (S)
• Measures variability by answering:
  - "Approximately how far from average are the data values?" (same measurement units as the data)
  - The square root of the average squared deviation
  - (dividing by n-1 instead of n for a sample)
• For a sample
$$S = \sqrt{\frac{(X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n-1}}$$
• For a population
$$\sigma = \sqrt{\frac{(X_1-\mu)^2 + (X_2-\mu)^2 + \cdots + (X_N-\mu)^2}{N}}$$
Example: Spending
• Customers plan to spend ($thousands)
3.8, 1.4, 0.3, 0.6, 2.8, 5.5, 0.9, 1.1
• Average is 2.05. Sum of squared deviations is
$$(3.8-2.05)^2 + (1.4-2.05)^2 + \cdots + (1.1-2.05)^2 = 23.34$$
• Divide by 8 - 1 = 7 and take the square root:
$$S = \sqrt{\frac{23.34}{7}} = \sqrt{3.334286} \approx 1.83 = \text{standard deviation}$$
• Customers plan to spend about 1.83 (thousand, i.e., $1,830) more or less than the average, 2.05.
  - Some plan to spend more, others less than average
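The same calculation can be traced step by step in code and compared against the standard library's sample standard deviation. This is a minimal sketch using the spending data above.

```python
import math
from statistics import stdev

spending = [3.8, 1.4, 0.3, 0.6, 2.8, 5.5, 0.9, 1.1]   # $thousands

n = len(spending)
average = sum(spending) / n
sum_sq_dev = sum((x - average) ** 2 for x in spending)   # sum of squared deviations
s = math.sqrt(sum_sq_dev / (n - 1))                      # divide by n-1 for a sample

print(f"average = {average:.2f}")
print(f"sum of squared deviations = {sum_sq_dev:.2f}")
print(f"S = {s:.2f}")

# statistics.stdev also divides by n-1, so it should agree.
print(f"statistics.stdev = {stdev(spending):.2f}")
```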
Example: Spending (continued)
• On the histogram
  - Average is located near the center of the distribution
  - Standard deviation is a distance away from the average
  - Standard deviation is the typical distance from average
[Figure: histogram of spending (horizontal axis: spending, 0 to 7; vertical axis: frequency, 0 to 3), with the average $\bar{X} = 2.05$ marked and a distance of S = 1.83 shown on each side of it]
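The Normal Distribution and the Standard Deviation
• For a normal distribution only (and only approximately)
  - About 2/3 of the data are within one standard deviation of the average (either above or below), that is, in the interval from one standard deviation below to one standard deviation above
  - About 95% are within 2 standard deviations (1.96 for exactly 95%)
  - About 99.7% are within 3 standard deviations (2.58 for exactly 99%)
[Figure: normal curve with intervals of one, two, and three standard deviations on each side of the average, containing 2/3, 95%, and 99.7% of the data]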
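These percentages can be checked against the standard normal distribution. The sketch below uses `statistics.NormalDist` (Python 3.8 or later); `share_within` is a helper name introduced here.

```python
from statistics import NormalDist

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of its mean."""
    standard_normal = NormalDist()          # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 1.96, 2, 2.58, 3):
    print(f"within {k} std. devs.: {share_within(k):.4f}")
```

The "2/3" figure is a rounding of roughly 68%, and the rule applies only when the data are approximately normal.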
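Skewed Distributions and the Standard Deviation
• No simple rule for the percentages within one, two, or three standard deviations of the average
• These percentages can only be approximated from the cumulative frequencies of a frequency table (or the spread can be described with the interquartile range)
• Standard deviation retains its interpretation as the standard measure of typically how far from average the observations are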
Example: Quality Control Charts
• Control limits are often set at 3 standard deviations from the average
• If the process is normally distributed, then
  - Over the long run, observations will stay within the control limits 99.7% of the time
• If the process goes out of control, you will know
[Figure: control chart of quality (vertical axis: 0 to 100), with one observation outside the limits labeled "Out of control"]
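A control chart of this kind can be sketched as: estimate the average and standard deviation from in-control data, set limits at three standard deviations, and flag anything outside. All measurements below are hypothetical values invented for illustration, as are the function names.

```python
from statistics import mean, stdev

def control_limits(baseline, k=3):
    """Lower and upper limits at k standard deviations from the baseline average."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - k * spread, center + k * spread

def out_of_control(observations, limits):
    """Observations that fall outside the control limits."""
    low, high = limits
    return [x for x in observations if x < low or x > high]

# Hypothetical quality measurements from a period when the process was in control.
baseline = [50, 52, 48, 51, 49, 50, 53, 47, 50, 50]
limits = control_limits(baseline)

# Hypothetical new measurements to monitor.
new_observations = [51, 49, 58, 50]
flagged = out_of_control(new_observations, limits)

print(f"limits: {limits[0]:.2f} to {limits[1]:.2f}")
print("out of control:", flagged)
```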
Example: The Stock Market
• Daily stock market returns, S&P 500 index, first half of 1995
  - Standard deviation is 0.487%
  - Average daily percent change: 0.137%
  - Typical day: about half a percentage point up or down
[Figure: histogram of daily stock market returns (horizontal axis: -2% to 2%; vertical axis: frequency in days, 0 to 50), with the average marked and one standard deviation shown on each side]
The Range
• The difference: Largest - Smallest
• Good features
  - Easy and fast to compute
  - Describes the data
  - Checks the data: is the range too big to be reasonable?
• Problem
  - Very sensitive to just two data values
  - Compare to the standard deviation, which combines all data values
  - Main drawbacks: it uses only part of the information and is easily affected by extreme values
Example: Spending
• $Thousands: 3.8, 1.4, 0.3, 0.6, 2.8, 5.5, 0.9, 1.1
• The range is 5.5 - 0.3 = 5.2
  - Larger than the standard deviation, 1.83
[Figure: histogram of spending (horizontal axis: spending, 0 to 7; vertical axis: frequency, 0 to 3), showing the average, one standard deviation, and the range]
Coefficient of Variation
• A relative measure of variability (a dimensionless ratio)
• The ratio: standard deviation divided by average
  - For a sample, $S/\bar{X}$
  - For a population, $\sigma/\mu$
• No measurement units; a pure number. Answers:
  - "Typically, in percentage terms, how far are data values from average?"
• Useful for comparing situations of different sizes
  - To see how variability compares after adjusting for size
  - It is the amount of variability per unit of the average, so it can be used to compare the performance of several projects with different investment sizes.
Adding a Constant to Each Data Value
• If the same number is added to each data value:
  - The average changes by this same number
    - The center of the distribution shifts by the same amount
  - The standard deviation is unchanged
    - Each data value stays the same distance from average
• Example: order amounts $3, 6, 9, 5, 8
  - Average is $6.20, std. dev. is $2.39
  - Now add shipping and handling, $1 per order: $4, 7, 10, 6, 9
  - Average rises by $1 to $7.20, but std. dev. is still $2.39
Example: Portfolio Performance
• You have invested $100 in each of 5 stocks
  - Results: $116, 83, 105, 113, 98
  - Average is $103, std. dev. is $13.21
• Your friend has invested $1,000 in each stock
  - Results: $1,160, 830, 1,050, 1,130, 980
  - Average is $1,030, std. dev. is $132.10
• Coefficients of variation are identical
13.21/103 = 132.10/1,030 = 0.128 = 12.8%
• Typically, results for these 5 stocks were approximately 12.8% from their average value
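The two portfolios differ in scale by a factor of 10, yet the ratio of standard deviation to average is the same. A minimal sketch, with `coefficient_of_variation` as a helper name introduced here:

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample average (a pure number)."""
    average = mean(values)
    if average == 0:
        raise ValueError("coefficient of variation is undefined when the average is 0")
    return stdev(values) / average

yours = [116, 83, 105, 113, 98]            # results from $100 per stock
friends = [1160, 830, 1050, 1130, 980]     # results from $1,000 per stock

print(f"{coefficient_of_variation(yours):.1%}")     # 12.8%
print(f"{coefficient_of_variation(friends):.1%}")   # 12.8%
```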
Multiplying the Data by a Constant (Rescaling)
• If each data value is multiplied by some number:
  - The average is multiplied by this same number
    - The center of the distribution shifts by the same multiple
  - The standard deviation is also multiplied by this same number (after ignoring any minus sign)
    - The distribution is widened (or narrowed) by this factor
• Example: order amounts $3, 6, 9, 5, 8
  - Average is $6.20, std. dev. is $2.39
  - Add 10% sales tax: $3.30, $6.60, $9.90, $5.50, $8.80
  - Average rises by 10% to $6.82
  - Std. dev. also rises by 10%, to $2.63
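Both rules, for adding a constant and for multiplying by a constant, can be verified on the order amounts. This is a minimal sketch using the chapter's numbers ($1 shipping, 10% sales tax).

```python
import math
from statistics import mean, stdev

orders = [3, 6, 9, 5, 8]                  # order amounts in dollars

shipped = [x + 1 for x in orders]         # add $1 shipping and handling
taxed = [x * 1.10 for x in orders]        # multiply by 1.10 for 10% sales tax

# Adding a constant shifts the average but leaves the standard deviation alone.
print(f"{mean(orders):.2f} -> {mean(shipped):.2f}, "
      f"std. dev. {stdev(orders):.2f} -> {stdev(shipped):.2f}")

# Multiplying by a constant scales both the average and the standard deviation.
print(f"{mean(orders):.2f} -> {mean(taxed):.2f}, "
      f"std. dev. {stdev(orders):.2f} -> {stdev(taxed):.2f}")
```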
Example: Foreign Exchange Rates
• Suppose $1 is worth 5.6 French francs
  - Assume for now that this rate is constant
• Your firm is anticipating
  - Average profits worth 850,000 francs
  - Standard deviation (uncertainty) of 100,000 francs
• In dollars, after conversion, your firm anticipates
  - Average profits worth 850,000/5.6 = $151,786
  - Standard deviation of 100,000/5.6 = $17,857
• Relative risk is the same in dollars and in francs
  - Coefficient of variation is 100,000/850,000 = 17,857/151,786 = 11.8%