Data Reduction
“It’s Poetic”
16.621
March 18, 2003
Introduction
• A primary goal of your efforts in this course will be to gather empirical data so as to prove (or disprove) your hypothesis
• Typically the data that you gather will not directly satisfy this goal
• Rather, it will be necessary to “reduce” the data, to put it into an appropriate form, so that you can draw valid conclusions
• In our discussion today we will examine some typical methods for processing empirical data
• Caution: garbage in, garbage out still applies
Hiawatha Designs an Experiment
by
Maurice G. Kendall
From The American Statistician
Vol. 13, No. 5, 1959, pp 23-24
Verses 1 through 6
Deyst’s 16.62X Project
• I have performed a very simple experiment
• The hypothesis was: my driving route distance, from West Garage to my driveway in Arlington, is eight miles
• On a number of trips I recorded the mileage, as indicated by the odometer of my automobile
• I now wish to reduce the data and draw some conclusions
Experimental Project (cont.)
• My experimental procedure was: at the exit from West Garage I zeroed my trip odometer, and when I reached my driveway at home I recorded the odometer reading
• On each of ten trips I took the same route home
Error Sources
Random errors
Odometer readout resolution
Odometer mechanical variations
Route path variations
Tire slippage
Systematic errors
Bias in the odometer readings
Odometer scale factor error
Tire diameter decreases due to wear
Error Sources (cont.)
• The resolution I achieved in reading the odometer was within ±0.025 miles
• The best knowledge I have about the other random errors is that they were all in the range of ±0.10 miles
• I zeroed the odometer at the beginning of each trip, so any bias in the measurements is small (i.e., about ±0.005 miles)
Error Sources (cont.)
• I did a scale factor calibration by driving 28 miles, according to mileage markers on Interstate 95, and in both directions I recorded 27.425 miles on my odometer
• Thus, the scale factor is
$$\mathrm{S.F.} = \frac{27.425\ \text{odometer-indicated miles}}{28\ \text{actual miles}} \approx 0.980$$
• And any error in the scale factor due to readout resolution is
$$e_{SF} = \pm\frac{0.025}{\sqrt{2}\cdot 28} \approx \pm 0.0006$$
Recorded Data
Trip Number    Mileage Reading    S.F. Corrected Mileage Reading
1              7.825              7.985
2              7.850              8.010
3              7.875              8.036
4              7.900              8.061
5              7.850              8.010
6              7.825              7.985
7              7.875              8.036
8              7.850              8.010
9              7.875              8.036
10             7.825              7.985
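The corrected column can be reproduced by dividing each odometer reading by the scale factor. The sketch below assumes the rounded value 0.980 quoted on the calibration slide (the unrounded ratio 27.425/28 is closer to 0.9795); the variable names are illustrative only.

```python
# Odometer readings (miles) for the ten trips, copied from the table.
readings = [7.825, 7.850, 7.875, 7.900, 7.850, 7.825, 7.875, 7.850, 7.875, 7.825]

# Scale factor as quoted on the calibration slide (assumption: the rounded
# value 0.980 is the one used to build the corrected column).
scale_factor = 0.980

# Indicated miles divided by (indicated miles per actual mile) gives actual miles.
corrected = [round(r / scale_factor, 3) for r in readings]

for trip, (raw, fixed) in enumerate(zip(readings, corrected), start=1):
    print(f"{trip:2d}  {raw:.3f}  {fixed:.3f}")
```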
Mileage Data Analysis
• My system model is that the route distance is constant
• To minimize the effect of random errors, take the sample mean (average) of the data to obtain an estimate
$$\hat{d} = \frac{1}{n}\sum_{i=1}^{n} d_i = 8.015\ \text{miles}$$
Mileage Data Analysis (cont.)
• Variations of the individual measurements about this estimate are
$$e_i = d_i - \hat{d}$$
• The sample mean of these variations is
$$\hat{e} = \frac{1}{n}\sum_{i=1}^{n} e_i = \frac{1}{n}\sum_{i=1}^{n} (d_i - \hat{d}) = 0$$
• So the estimate is unbiased
Mileage Data Analysis (cont.)
• Assuming the variations are statistically independent, we can also compute the sample standard deviation of these variations as
$$\hat{\sigma} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (d_i - \hat{d})^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} e_i^2} = 0.026\ \text{miles}$$
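As a check on the numbers quoted above, the following sketch recomputes the sample mean, the mean of the variations, and the sample standard deviation from the scale-factor-corrected readings. It uses only the Python standard library, and the variable names are illustrative.

```python
import math

# Scale-factor-corrected mileage readings (miles) from the recorded data table.
corrected = [7.985, 8.010, 8.036, 8.061, 8.010, 7.985, 8.036, 8.010, 8.036, 7.985]
n = len(corrected)

# Sample mean: the estimate of the constant route distance.
d_hat = sum(corrected) / n

# Variations of each measurement about the estimate, and their sample mean.
variations = [d - d_hat for d in corrected]
mean_variation = sum(variations) / n

# Sample standard deviation with the (n - 1) divisor used on the slide.
sigma_hat = math.sqrt(sum(e * e for e in variations) / (n - 1))

print(f"d_hat = {d_hat:.3f} miles")          # d_hat = 8.015 miles
print(f"sigma_hat = {sigma_hat:.3f} miles")  # sigma_hat = 0.026 miles
```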
Mileage Data Analysis (cont.)
• Since my experiment consisted of a number of independent trials, it is reasonable to assume that the route distance, as determined by my measurements, is Gaussian
[Figure: probability density of route distance versus miles (7.85 to 8.15), with the 0.954 confidence interval marked]
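The 0.954 figure is the probability that a Gaussian variable falls within two standard deviations of its mean. The sketch below assumes the density is centered on the sample mean (8.015 miles) with the sample standard deviation (0.026 miles) as its width; that reading of the plot is an assumption, not something the slide states.

```python
import math

d_hat = 8.015      # sample mean of the corrected readings (miles)
sigma_hat = 0.026  # sample standard deviation (miles)
k = 2.0            # half-width of the interval in standard deviations (assumed)

# Probability that a Gaussian variable lies within k standard deviations.
coverage = math.erf(k / math.sqrt(2.0))

low = d_hat - k * sigma_hat
high = d_hat + k * sigma_hat

print(f"coverage = {coverage:.3f}")
print(f"interval = [{low:.3f}, {high:.3f}] miles")
print("hypothesis of 8 miles inside interval:", low <= 8.0 <= high)
```

Under these assumptions the hypothesized eight miles falls inside the interval.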
Linear System Models
• The system model for my experiment assumed that the route distance is constant
• In many instances the system model is not constant but is a linear function
• Define a linear system model as
$$y = c_0 + c_1 x$$
where
x ≡ independent variable
y ≡ dependent variable
Linear System Models (cont.)
Typically, for a number of values of the
independent variable (x), the corresponding
values of the dependent variable (y) are
measured
[Figure: measured values of the dependent variable (y) plotted against the independent variable (x)]
Straight Line Fit
• For a linear model, the object is to find the best straight line fit to the measured data
• We can characterize each measurement as
$$y_i = c_0 + c_1 x_i + e_i$$
where
$e_i$ = error or variation of the ith measurement from a straight line model
Straight Line Fit (cont.)
• To characterize the complete set of n measurements, define the following arrays
$$y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad c = \begin{bmatrix} c_0 \\ c_1 \end{bmatrix}, \quad e = \begin{bmatrix} e_1 \\ \vdots \\ e_n \end{bmatrix}$$
• So the measurement equation becomes
$$y = Xc + e$$
Straight Line Fit (cont.)
• Recall that we wish to find the best straight line fit to the measured data array y
• A useful criterion for the best fit is to minimize the sum of the squared errors
$$\|e\|^2 = \sum_{i=1}^{n} e_i^2 = e^T e$$
Straight Line Fit (cont.)
• And upon substitution from above
$$\|e\|^2 = (y - Xc)^T (y - Xc) = y^T y - 2 y^T X c + c^T X^T X c$$
• Our goal is to find the array c so that the sum squared error is minimized
• First determine the gradient of the sum squared error with respect to c
$$\frac{\partial \|e\|^2}{\partial c} = -2 y^T X + 2 c^T X^T X$$
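The gradient expression can be sanity-checked numerically. The sketch below uses a small made-up data set and a trial coefficient pair (both hypothetical) and compares the analytic gradient, written in the equivalent column form $-2X^T(y - Xc)$, against a central finite difference of the sum squared error.

```python
# Hypothetical measurements, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 2.0, 5.0]


def sum_squared_error(c0, c1):
    """e^T e for the straight-line model y = c0 + c1 * x."""
    return sum((y - (c0 + c1 * x)) ** 2 for x, y in zip(xs, ys))


def gradient(c0, c1):
    """Analytic gradient -2 y^T X + 2 c^T X^T X, evaluated as -2 X^T (y - X c)."""
    residuals = [y - (c0 + c1 * x) for x, y in zip(xs, ys)]
    g0 = -2.0 * sum(residuals)                             # column of ones in X
    g1 = -2.0 * sum(x * r for x, r in zip(xs, residuals))  # column of x values in X
    return g0, g1


def numerical_gradient(c0, c1, h=1e-4):
    """Central finite-difference approximation of the same gradient."""
    g0 = (sum_squared_error(c0 + h, c1) - sum_squared_error(c0 - h, c1)) / (2 * h)
    g1 = (sum_squared_error(c0, c1 + h) - sum_squared_error(c0, c1 - h)) / (2 * h)
    return g0, g1


analytic = gradient(0.5, 1.0)
numeric = numerical_gradient(0.5, 1.0)
print(analytic)  # (-6.0, -10.0)
print(numeric)
```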
Straight Line Fit (cont.)
• Setting the gradient to zero yields the optimum
$$\hat{c} = (X^T X)^{-1} X^T y$$
• Since the required inverse matrix is only 2 × 2, we can readily solve for the two elements of $\hat{c}$
$$\hat{c}_0 = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum (y_i x_i)}{n \sum x_i^2 - \left(\sum x_i\right)^2} \qquad \hat{c}_1 = \frac{n \sum (x_i y_i) - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$$
• These are the equations used in your calculator or computer to get a best straight line fit to data as
$$\hat{y}(x) = \hat{c}_0 + \hat{c}_1 x$$
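The two closed-form expressions translate directly into a few running sums. The function below is a minimal sketch (the name `fit_line` and the sample data are hypothetical); it is checked on points that lie exactly on y = 1 + 2x, so the fit should return those coefficients.

```python
def fit_line(xs, ys):
    """Least-squares intercept c0 and slope c1 for y = c0 + c1 * x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Determinant of the 2 x 2 matrix X^T X; zero when all x values coincide.
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("x values must not all be equal")

    c0 = (sum_xx * sum_y - sum_x * sum_xy) / denom
    c1 = (n * sum_xy - sum_x * sum_y) / denom
    return c0, c1


# Hypothetical data lying exactly on y = 1 + 2x.
c0, c1 = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(c0, c1)  # 1.0 2.0
```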
Beam Deflection Example
• A cantilever beam deflects downward when a mass is attached to its free end. A beam model predicts that the deflection will be a linear function of the mass.
• A student places various masses on the end of the beam and records the deflections
• The masses are measured to within ±0.11 grams
• The error in reading the deflections is within ±0.23 millimeters
Excerpted from: Beckwith, T.G., Marangoni, R.D., and Lienhard V, J.H., Mechanical Measurements, Fifth Edition, Addison Wesley, Reading, MA, 1993, pp. 113-115
Beam Deflection Data
x value, load mass (g)    y value, beam deflection (mm)
0                         0
50.15                     0.6
99.90                     1.8
150.15                    3.0
200.05                    3.6
250.20                    4.8
299.95                    6.0
350.05                    6.2
401.00                    7.5
Straight Line Fit to Beam Data
[Figure: beam deflection (mm) versus load mass (g), with the fitted line $\hat{y}(x) = -0.076 + 0.019x$]
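The fitted line can be reproduced from the tabulated beam data with the closed-form least-squares expressions for the intercept and slope. This is a sketch with illustrative variable names, not the textbook's own computation.

```python
# Beam deflection data from the table: load mass (g) and deflection (mm).
masses = [0.0, 50.15, 99.90, 150.15, 200.05, 250.20, 299.95, 350.05, 401.00]
deflections = [0.0, 0.6, 1.8, 3.0, 3.6, 4.8, 6.0, 6.2, 7.5]

n = len(masses)
sum_x = sum(masses)
sum_y = sum(deflections)
sum_xx = sum(x * x for x in masses)
sum_xy = sum(x * y for x, y in zip(masses, deflections))

# Closed-form solution of the 2 x 2 normal equations.
denom = n * sum_xx - sum_x ** 2
c0 = (sum_xx * sum_y - sum_x * sum_xy) / denom  # intercept (mm)
c1 = (n * sum_xy - sum_x * sum_y) / denom       # slope (mm per g)

print(f"y_hat(x) = {c0:.3f} + {c1:.3f} x")  # y_hat(x) = -0.076 + 0.019 x
```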
Hiawatha
Verses 7 through 12
Linear Fit Analysis
• Recall that the best fit to y(x) is
$$\hat{y}(x) = \hat{c}_0 + \hat{c}_1 x$$
• The variations or errors from the fit, at each measurement point, are then
$$e_i = y_i - \hat{y}(x_i) = y_i - (\hat{c}_0 + \hat{c}_1 x_i)$$
• So the array of measurement errors is
$$e = y - \hat{y} = y - X\hat{c} = y - X (X^T X)^{-1} X^T y = \left(I - X (X^T X)^{-1} X^T\right) y$$
Straight Line Fit (cont.)
• A useful result is obtained by premultiplying both sides of this equation by the matrix $X^T$
$$X^T e = \begin{bmatrix} \sum e_i \\ \sum x_i e_i \end{bmatrix} = \left(X^T - X^T X (X^T X)^{-1} X^T\right) y = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$
• Thus, the sample mean and the x-weighted sample mean of the errors are both zero
Straight Line Fit
Simple Example
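[Figure: three measurements $y_1$, $y_2$, $y_3$ at $x_1 = -l$, $x_2 = 0$, $x_3 = l$, with errors $d$, $-2d$, $d$ from the fitted line]
$$\sum e_i = d - 2d + d = 0$$
$$\sum x_i e_i = (-l)\cdot d + (0)\cdot(-2d) + l\cdot d = 0$$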
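The three-point example can be reproduced numerically. The sketch below picks hypothetical values l = 2 and d = 0.5 and an arbitrary underlying line y = 1 + 0.25x, offsets the three points by d, -2d, d, and confirms that the least-squares fit returns that line, leaving exactly those offsets as errors.

```python
l, d = 2.0, 0.5  # hypothetical spacing and error size
xs = [-l, 0.0, l]
offsets = [d, -2.0 * d, d]

# Points on the assumed line y = 1 + 0.25 x, displaced by the offsets.
ys = [1.0 + 0.25 * x + e for x, e in zip(xs, offsets)]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xx = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))

denom = n * sum_xx - sum_x ** 2
c0 = (sum_xx * sum_y - sum_x * sum_xy) / denom
c1 = (n * sum_xy - sum_x * sum_y) / denom

errors = [y - (c0 + c1 * x) for x, y in zip(xs, ys)]
sum_e = sum(errors)
sum_xe = sum(x * e for x, e in zip(xs, errors))

print(c0, c1)         # 1.0 0.25
print(errors)         # [0.5, -1.0, 0.5]
print(sum_e, sum_xe)  # 0.0 0.0
```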
Straight Line Fit (cont.)
• We can also derive an expression for the sample standard deviation, in terms of the measured data, by noting that
$$\hat{\sigma} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} e_i^2} = \sqrt{\frac{1}{n-1}\, e^T e} = \sqrt{\frac{1}{n-1}\, (\hat{y} - y)^T (\hat{y} - y)} = \sqrt{\frac{1}{n-1}\, y^T \left(I - X (X^T X)^{-1} X^T\right) y}$$
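To illustrate the last equality, the sketch below evaluates the sum of squared errors for the beam data two ways: directly from the residuals, and from the quadratic form, which for the least-squares solution reduces to $y^T y - y^T X \hat{c}$. It keeps the (n - 1) divisor shown on the slide; the variable names are illustrative.

```python
import math

# Beam deflection data: load mass (g) and deflection (mm).
xs = [0.0, 50.15, 99.90, 150.15, 200.05, 250.20, 299.95, 350.05, 401.00]
ys = [0.0, 0.6, 1.8, 3.0, 3.6, 4.8, 6.0, 6.2, 7.5]

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xx = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_yy = sum(y * y for y in ys)

# Least-squares coefficients from the 2 x 2 normal equations.
denom = n * sum_xx - sum_x ** 2
c0 = (sum_xx * sum_y - sum_x * sum_xy) / denom
c1 = (n * sum_xy - sum_x * sum_y) / denom

# Route 1: sum of squared residuals, e^T e.
residuals = [y - (c0 + c1 * x) for x, y in zip(xs, ys)]
sse_direct = sum(e * e for e in residuals)

# Route 2: y^T (I - X (X^T X)^-1 X^T) y = y^T y - y^T X c_hat,
# where y^T X = [sum(y), sum(x * y)].
sse_quadratic = sum_yy - (c0 * sum_y + c1 * sum_xy)

sigma_hat = math.sqrt(sse_direct / (n - 1))
print(f"e^T e = {sse_direct:.4f} (direct), {sse_quadratic:.4f} (quadratic form)")
print(f"sigma_hat = {sigma_hat:.3f} mm")
```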
Nonlinear System Models
• In many instances the system model will not be linear
• Often it is still possible to use a linear fit to analyze data
• For example, suppose the system is an electronic circuit, for which we measure the output voltage over time in response to an initial condition
Nonlinear System Models (cont.)
• The system model might be
$$v(t) = v(0)\, e^{-\alpha t}$$
where
v(0) = initial condition voltage
α = 1/τ = inverse time constant
• In this case the independent variable is time and the dependent (measured) variable is output voltage
Nonlinear System Models (cont.)
• To linearize, take the natural log of both sides of this equation
$$\ln(v(t)) = \ln(v(0)) - \alpha t$$
• And we can obtain our previous linear equation
$$y = c_0 + c_1 x$$
by identifying
$$y \equiv \ln(v(t)), \quad c_0 \equiv \ln(v(0)), \quad c_1 \equiv -\alpha, \quad x \equiv t$$
Nonlinear System Models (cont.)
• Thus the exponential system model is converted into a linear model
• The measured data is converted using the identities
$$x_i = t_i \quad \text{and} \quad y_i = \ln(v_i)$$
• These values are used to obtain $\hat{c}$ as before
$$X = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad \hat{c} = (X^T X)^{-1} X^T y$$
• The best exponential fit to the data is then
$$\hat{v}(t) = \exp(\hat{y}) = \exp(\hat{c}_0 + \hat{c}_1 t) = \exp(\hat{c}_0)\cdot\exp(\hat{c}_1 t)$$
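The log-transform procedure can be sketched as follows. The function name `fit_exponential` and the noise-free sample data (v(0) = 2, α = 0.5) are hypothetical, chosen so that the fit should recover the generating values.

```python
import math


def fit_exponential(ts, vs):
    """Fit v(t) = v0 * exp(-alpha * t) by a straight-line fit to ln(v) versus t.

    Requires every measured voltage to be positive so that ln(v) is defined.
    """
    xs = list(ts)
    ys = [math.log(v) for v in vs]  # y_i = ln(v_i)

    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    denom = n * sum_xx - sum_x ** 2
    c0 = (sum_xx * sum_y - sum_x * sum_xy) / denom
    c1 = (n * sum_xy - sum_x * sum_y) / denom

    # Undo the identifications c0 = ln(v(0)) and c1 = -alpha.
    return math.exp(c0), -c1


# Hypothetical noise-free samples of v(t) = 2 * exp(-0.5 * t).
times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
volts = [2.0 * math.exp(-0.5 * t) for t in times]

v0, alpha = fit_exponential(times, volts)
print(f"v(0) = {v0:.3f}, alpha = {alpha:.3f}")  # v(0) = 2.000, alpha = 0.500
```

Note that this minimizes the squared errors in ln(v) rather than in v itself, so with noisy data the small-voltage points carry relatively more weight than they would in a direct fit.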
Nonlinear System Model
Example: Exponential Fit
[Figure: example exponential fit; plot axes span 0 to 1.2 vertically and 0 to 12 horizontally]
Power Series Approximations
• Often, in cases where such a simple transformation is not available, the data may be fit by a power series
• Suppose the dependent variable y(x) can be approximated to sufficient accuracy by a finite power series in x, of degree m
$$y(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_m x^m$$
Power Series Approximations (cont.)
• Also, if we have n measurements of the dependent variable y, corresponding to n values of the independent variable x, then define the linear model as before so
$$y = Xc + e$$
• Where now
$$X = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^m \\ 1 & x_2 & x_2^2 & \cdots & x_2^m \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^m \end{bmatrix}, \quad c = \begin{bmatrix} c_0 \\ \vdots \\ c_m \end{bmatrix}$$
Power Series Approximations (cont.)
• And our previous result can now be applied once again to get the best linear fit for c as
$$\hat{c} = (X^T X)^{-1} X^T y$$
• So the best linear fit is
$$\hat{y}(x) = \hat{c}_0 + \hat{c}_1 x + \hat{c}_2 x^2 + \cdots + \hat{c}_m x^m$$
• The solution is somewhat more difficult because the required inverse is (m+1) × (m+1), but for most situations the problem is still tractable
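A minimal sketch of the power series fit follows. Rather than forming the inverse explicitly, it solves the (m+1) × (m+1) normal equations $X^T X \hat{c} = X^T y$ by Gaussian elimination, which gives the same $\hat{c}$. The helper names and the noise-free quadratic test data are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [c0, ..., cm] of a degree-m power series."""
    # Each row of X is [1, x, x^2, ..., x^m].
    X = [[x ** k for k in range(degree + 1)] for x in xs]
    cols = degree + 1
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(cols)]
           for i in range(cols)]
    xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(cols)]
    return solve(xtx, xty)


# Hypothetical noise-free samples of y = 1 - 2x + 0.5x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1.0 - 2.0 * x + 0.5 * x * x for x in xs]

coeffs = fit_polynomial(xs, ys, 2)
print([round(c, 6) for c in coeffs])
```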
Fourier Series Approximations
• Sometimes the model may be periodic in nature and a truncated Fourier series can approximate the function
• If y(x) is an odd periodic function of x, with first harmonic wavelength 2L, then a Fourier sine series approximation to y(x) is
$$y(x) \approx c_1 \sin(\pi x / L) + c_2 \sin(2\pi x / L) + \cdots + c_m \sin(m\pi x / L)$$
Fourier Series Approx. (cont.)
From: Beckwith, T.G., Marangoni, R.D., and Lienhard V, J.H., Mechanical Measurements, Fifth Edition, Addison Wesley, Reading, MA, 1993, p. 141
Fourier Series Approx. (cont.)
• Thus, if n measurements $y_i$ are taken at various values $x_i$ of the independent variable, then the X matrix can be defined as
$$X = \begin{bmatrix} \sin(\pi x_1 / L) & \sin(2\pi x_1 / L) & \sin(3\pi x_1 / L) & \cdots & \sin(m\pi x_1 / L) \\ \sin(\pi x_2 / L) & \sin(2\pi x_2 / L) & \sin(3\pi x_2 / L) & \cdots & \sin(m\pi x_2 / L) \\ \vdots & \vdots & \vdots & & \vdots \\ \sin(\pi x_n / L) & \sin(2\pi x_n / L) & \sin(3\pi x_n / L) & \cdots & \sin(m\pi x_n / L) \end{bmatrix}$$
Fourier Series Approx. (cont.)
• And, as before, the array $\hat{c}$ is obtained from
$$\hat{c} = (X^T X)^{-1} X^T y$$
• So the best linear fit for y(x) is
$$\hat{y}(x) \approx \hat{c}_1 \sin(\pi x / L) + \hat{c}_2 \sin(2\pi x / L) + \cdots + \hat{c}_m \sin(m\pi x / L)$$
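As a small illustration, the sketch below fits only the first two harmonics (m = 2), so that $X^T X$ is 2 × 2 and can be inverted by hand. The function name, the choice L = 2, the sample points, and the noise-free test signal are all hypothetical.

```python
import math


def fit_two_sines(xs, ys, L):
    """Least-squares c1, c2 for y ~ c1 sin(pi x / L) + c2 sin(2 pi x / L)."""
    s1 = [math.sin(math.pi * x / L) for x in xs]        # first column of X
    s2 = [math.sin(2.0 * math.pi * x / L) for x in xs]  # second column of X

    # Entries of the symmetric 2 x 2 matrix X^T X and of the vector X^T y.
    a11 = sum(u * u for u in s1)
    a12 = sum(u * v for u, v in zip(s1, s2))
    a22 = sum(v * v for v in s2)
    b1 = sum(u * y for u, y in zip(s1, ys))
    b2 = sum(v * y for v, y in zip(s2, ys))

    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("sample points do not determine both coefficients")

    # Explicit 2 x 2 inverse applied to X^T y.
    c1 = (a22 * b1 - a12 * b2) / det
    c2 = (a11 * b2 - a12 * b1) / det
    return c1, c2


# Hypothetical noise-free samples of y = 3 sin(pi x / L) - sin(2 pi x / L).
L = 2.0
xs = [0.2, 0.5, 0.9, 1.3, 1.6, 2.3, 2.9, 3.5]
ys = [3.0 * math.sin(math.pi * x / L) - math.sin(2.0 * math.pi * x / L) for x in xs]

c1, c2 = fit_two_sines(xs, ys, L)
print(f"c1 = {c1:.3f}, c2 = {c2:.3f}")  # c1 = 3.000, c2 = -1.000
```

For more harmonics the same normal equations apply, with one additional sine column per harmonic and a general linear solver in place of the hand-written 2 × 2 inverse.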
Hiawatha
Verse 13