SIFT SLAM Vision Details
MIT 16.412J Spring 2004
Vikash K. Mansinghka
1
Outline
• Lightning Summary
• Black Box Model of SIFT SLAM Vision System
• Challenges in Computer Vision
• What these challenges mean for visual SLAM
• How SIFT extracts candidate landmarks
• How landmarks are tracked in SIFT SLAM
• Alternative vision-based SLAM systems
• Open questions
2
Lightning Summary
• Motivation: SLAM without modifying the environment
• Landmark candidates are extracted by the SIFT process
• Candidates matched between cameras to get 3D positions
• Candidates pruned according to consistency w/ robot's expectations
• Survivors sent off for statistical processing
3
Review of Robot Specifications
• Triclops 3-camera "stereo" vision system
• Odometry system which produces [p, q, θ]
• Center camera is "reference"
4
Black Box Model of Vision System
• For now, based on black magic (SIFT). Produces landmarks.
• Assume landmarks globally indexed by i.
• Per frame inputs:
  – [p, q, θ] - odometry input (x, z, bearing deltas)
  – List of (i, x_i) - new landmark pos (from SLAM)
• Per frame output is a list of (i, x̂_i, x_i, r_i, c_i) for each visible landmark i where:
  – x̂_i is its measured 3D pos (w.r.t. camera pos)
  – x_i is its map 3D pos (w.r.t. initial robot pos), if it isn't new
  – (r_i, c_i) is its pixel coordinates in center camera
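This interface can be summarized as a pair of simple records. Below is a minimal Python sketch of those records; all names (VisionInput, LandmarkObservation, the field names) are illustrative and not taken from the actual SIFT SLAM implementation.

```python
# Minimal sketch of the per-frame vision interface described above.
# All names here are illustrative, not from the original system.
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class VisionInput:
    """Per-frame input to the vision system."""
    p: float                       # x translation delta (odometry)
    q: float                       # z translation delta (odometry)
    theta: float                   # bearing delta (odometry)
    map_positions: Dict[int, Vec3] = field(default_factory=dict)  # i -> x_i from SLAM

@dataclass
class LandmarkObservation:
    """One entry (i, x̂_i, x_i, r_i, c_i) of the per-frame output list."""
    i: int                 # global landmark index
    x_meas: Vec3           # x̂_i: measured 3D pos, w.r.t. camera pos
    x_map: Optional[Vec3]  # x_i: map 3D pos w.r.t. initial robot pos, None if new
    r: float               # pixel row in the center (reference) camera
    c: float               # pixel column in the center camera
```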
5
Challenges in Computer Vision
• Intuitively appealing ≠ computationally realizable
• Stable feature extraction is hard; results rarely general
• Extracted features are sparse
• Matching requires exponential time
• Matches are often wrong
6
Implications for Visual SLAM
• Hard to reliably find landmarks
• Really Hard to reliably find landmarks
• Really Really Hard to reliably find landmarks
• Data association is slow and unreliable
• False matches introduce substantial errors
• Accurate probabilistic models unavailable
7
Remarks on SIFT approach
• For visual SLAM, landmarks must be identifiable across:
  – Large changes in distance
  – Small changes in view direction
  – (Bonus) Changes in illumination
• Solution:
  – Produce "scale-invariant" image representation
  – Extract points with associated scale information
  – Use matcher empirically capable of handling small displacements
8
The Scale-Invariant Feature Transform
• Described in Lowe, IJCV 2004 (preprint; use Google)
• Four stages:
  – Scale-space extrema extraction
  – Keypoint pruning and localization (not used in SLAM)
  – Orientation assignment
  – Keypoint descriptor (not used in SLAM)
9
Lightning Introduction to Scale Space
• Motivation:
  – Objects can be recognized at many levels of detail
  – Large distances correspond to low l.o.d.
  – Different kinds of information are available at each level
10
Lightning Introduction to Scale Space
• Idea: Extract information content from an image at each l.o.d.
• Detail reduction typically done by Gaussian blurring
• Long history in both machine and human vision
  – Marr in late 1970s
  – Henkel in 2000
• Analogous concepts used in speech processing
[Figure: surface plot of a 2D Gaussian blurring kernel]
11
Scale Space in SIFT
• I(x, y) is the input image. L(x, y, σ) is its representation at scale σ.
• G(x, y, σ) is a 2D Gaussian with variance σ²
• L(x, y, σ) = G(x, y, σ) ∗ I(x, y) ("only" choice; see Koenderink 1984)
• D(x, y, σ) = L(x, y, kσ) − L(x, y, σ)
• D approximates σ²∇²G ∗ I (see Mikolajczyk 2002 for significance)
• D is also edge-detector-like; newest SIFT "corrects" for this
• Details of discretization (e.g. resampling, choice of k) unimportant
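A minimal NumPy/SciPy sketch of one octave of this construction follows. The starting σ, the value of k, and the number of scales are illustrative defaults, since the slide treats these discretization details as unimportant here.

```python
# Sketch of the scale-space / difference-of-Gaussians construction above,
# using SciPy's Gaussian filter. Octave handling and resampling are omitted.
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_scales(image, sigma0=1.6, k=2 ** 0.5, num_scales=5):
    """Return sigmas, blurred images L(., ., sigma), and DoG images D."""
    image = np.asarray(image, dtype=np.float64)
    sigmas = [sigma0 * k ** n for n in range(num_scales)]
    L = [gaussian_filter(image, s) for s in sigmas]        # L = G * I
    D = [L[n + 1] - L[n] for n in range(num_scales - 1)]   # D = L(k*sigma) - L(sigma)
    return sigmas, L, D
```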
12
Scale Space in SIFT
• Compute local extrema of D as above
• Each such (x, y, σ) is a feature
• (x, y) part "should" be scale and planar rotation invariant
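A brute-force sketch of the extremum test, assuming D is the list of same-shaped DoG images from the previous sketch; real implementations vectorize this and handle octave boundaries.

```python
# Keep (x, y, sigma) if D there is the maximum or the minimum over its
# 3x3x3 neighbourhood in space and scale.
import numpy as np

def scale_space_extrema(D, sigmas):
    """D: list of same-shaped DoG arrays, one per sigma. Returns (x, y, sigma) tuples."""
    features = []
    for s in range(1, len(D) - 1):
        below, here, above = D[s - 1], D[s], D[s + 1]
        for y in range(1, here.shape[0] - 1):
            for x in range(1, here.shape[1] - 1):
                cube = np.stack([a[y - 1:y + 2, x - 1:x + 2]
                                 for a in (below, here, above)])
                v = here[y, x]
                if v == cube.max() or v == cube.min():
                    features.append((x, y, sigmas[s]))
    return features
```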
13
SIFT Orientation Assignment
• For each feature (x, y, σ):
  – Find fixed-pixel-area patch in L(x, y, σ) around (x, y)
  – Compute gradient orientation histogram; call its bins b_i
  – For each b_i within 80% of max, make feature (x, y, σ, b_i)
• Enables matching by including illumination-invariant feature content (Sinha 2000)
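A sketch of this step, assuming a magnitude-weighted 36-bin histogram over a fixed patch (17×17 here); both choices are illustrative, not confirmed details of the system described on this slide.

```python
# Orientation assignment: gradient orientation histogram over a fixed patch of
# L at the feature's scale, one oriented feature per bin within 80% of the peak.
import numpy as np

def assign_orientations(L_sigma, x, y, sigma, patch_radius=8, num_bins=36):
    """L_sigma: blurred image at this feature's scale. Returns (x, y, sigma, angle) tuples."""
    patch = L_sigma[y - patch_radius:y + patch_radius + 1,
                    x - patch_radius:x + patch_radius + 1].astype(np.float64)
    gy, gx = np.gradient(patch)
    hist, edges = np.histogram(np.arctan2(gy, gx), bins=num_bins,
                               range=(-np.pi, np.pi), weights=np.hypot(gx, gy))
    features = []
    for b in np.flatnonzero(hist >= 0.8 * hist.max()):    # bins within 80% of max
        angle = 0.5 * (edges[b] + edges[b + 1])            # bin-center orientation
        features.append((x, y, sigma, angle))
    return features
```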
14
SIFT Stereopsis
• Apply SIFT to image from each camera.
• Match center feature (x, y, σ, θ) and right feature (x', y', σ', θ') if:
  1. |y − y'| ≤ 1
  2. 0 < |x' − x| ≤ 20
  3. |θ − θ'| ≤ 20 degrees
  4. 2/3 ≤ σ'/σ ≤ 3/2
  5. No other matches consistent with above exist
• Match similarly for left and top; discard all not matched twice
• Compute 3D positions (trig) as average from horiz. and vert.
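A sketch of criteria 1-4 as a single predicate, assuming features are (x, y, σ, θ) tuples with θ in degrees and thresholds as quoted above. Criterion 5 would be enforced afterwards by discarding features with more than one partner passing this test.

```python
# Pairwise match test for the center/right feature pair.
def stereo_match_ok(center, right):
    x, y, sigma, theta = center
    xr, yr, sigmar, thetar = right
    return (abs(y - yr) <= 1                        # nearly the same scanline
            and 0 < abs(xr - x) <= 20               # plausible horizontal disparity
            and abs(theta - thetar) <= 20           # orientations within 20 degrees
            and 2 / 3 <= sigmar / sigma <= 3 / 2)   # scales within a factor of 1.5
```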
15
Recapitulation (in G Minor)
• Procedure so far:
1. For each image:
(a) Produce scale-space representation
(b) Find extrema
(c) Compute gradient orientation histograms
2. Match features from center to right and center to top
3. Compute relative 3D positions for survivors
• This gives us potential features from a given frame
• How do we use them?
17
Landmark Tracking
• Predict where landmarks should appear (reliability, speed)
• Note: Robot moves in the xz plane
• Given [p, q, θ] and old relative position [X, Y, Z], find expected position [X', Y', Z'] by:

  X' = (X − p) cos(θ) − (Z − q) sin(θ)
  Y' = Y
  Z' = (X − p) sin(θ) + (Z − q) cos(θ)

• By the pinhole camera model ((u₀, v₀) image center coords, I interocular distance, f focal length), the expected row, column, disparity, and scale are:

  r' = v₀ − f Y'/Z'
  c' = u₀ + f X'/Z'
  d' = f I/Z'
  σ' = σ Z/Z'
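A sketch of these two steps as one function; u0, v0, f, and I stand for the calibration quantities named above, and θ is assumed to be in radians.

```python
# Expected-view computation: shift and rotate the stored relative position by
# the odometry deltas, then project with the pinhole model.
import math

def predict_view(X, Y, Z, sigma, p, q, theta, u0, v0, f, I):
    """Return rotated position [X', Y', Z'] and expected (row, col, disparity, scale)."""
    Xp = (X - p) * math.cos(theta) - (Z - q) * math.sin(theta)
    Yp = Y                                   # robot moves in the xz plane
    Zp = (X - p) * math.sin(theta) + (Z - q) * math.cos(theta)
    r = v0 - f * Yp / Zp                     # expected pixel row
    c = u0 + f * Xp / Zp                     # expected pixel column
    d = f * I / Zp                           # expected stereo disparity
    s = sigma * Z / Zp                       # expected SIFT scale
    return (Xp, Yp, Zp), (r, c, d, s)
```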
19
Landmark Tracking
• V is the camera field of view angle (60 degrees)
• A landmark is expected to be in view if:

  Z' > 0
  tan⁻¹(|X'| / Z') < V/2
  tan⁻¹(|Y'| / Z') < V/2

• An expected landmark matches an observed landmark if:
  – Obs. center within a 10x10 region around expected
  – Obs. scale within 20% of expected
  – Obs. orientation within 20 degrees of expected
  – Obs. disparity within 20% of expected
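A sketch of the visibility test and the expected-vs-observed comparison, assuming the predicted (row, column, disparity, scale) from the previous slide plus an orientation; the tuple layout here is illustrative.

```python
# Visibility test (V in radians) and match test against a predicted view.
import math

def in_view(Xp, Yp, Zp, V=math.radians(60)):
    return (Zp > 0
            and math.atan(abs(Xp) / Zp) < V / 2
            and math.atan(abs(Yp) / Zp) < V / 2)

def matches_expected(obs, exp):
    """obs, exp: (row, col, scale, orientation_deg, disparity) tuples."""
    r_o, c_o, s_o, o_o, d_o = obs
    r_e, c_e, s_e, o_e, d_e = exp
    return (abs(r_o - r_e) <= 5 and abs(c_o - c_e) <= 5   # within a 10x10 pixel region
            and abs(s_o - s_e) <= 0.2 * s_e               # scale within 20%
            and abs(o_o - o_e) <= 20                      # orientation within 20 degrees
            and abs(d_o - d_e) <= 0.2 * d_e)              # disparity within 20%
```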
20
Landmark Tracking
• A SIFT view is: (SIFT feature, relative 3D pos, absolute view dir)
• Each landmark is: (3D position, list of views, misses)
• Algorithm:

  For each frame, find expected landmarks w/ odometry
  For each observed view v:
      If v matches an expected landmark l:
          Set l.misses = 0
          Add v to the view list for l
      Else:
          Add a new landmark for v to the DB
  For each expected but unobserved landmark l:
      If one of l's view directions is within 20 degrees of the current view direction:
          l.misses++
          If l.misses >= 20, delete l from DB
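A Python sketch of this loop, with the tests from the previous slides passed in as helper functions; all names here are illustrative, not the original implementation.

```python
# Landmark-database update loop. predict_expected, matches_expected, and
# view_dir_close stand for the prediction and matching tests sketched earlier;
# observed views are assumed to carry a .relative_position attribute.
from dataclasses import dataclass, field

@dataclass
class Landmark:
    position: tuple                              # 3D position w.r.t. initial robot pose
    views: list = field(default_factory=list)    # list of SIFT views
    misses: int = 0

def update_landmark_db(db, observed_views, odometry,
                       predict_expected, matches_expected, view_dir_close,
                       max_misses=20):
    # Expected landmarks: those predicted to be in view given the odometry.
    expected = {id(l): predict_expected(l, odometry) for l in db}
    expected = {k: e for k, e in expected.items() if e is not None}
    matched_ids = set()
    for v in observed_views:
        match = next((l for l in db
                      if id(l) in expected and matches_expected(v, expected[id(l)])),
                     None)
        if match is not None:
            match.misses = 0                     # seen again: reset the miss count
            match.views.append(v)
            matched_ids.add(id(match))
        else:
            db.append(Landmark(position=v.relative_position, views=[v]))
    for l in list(db):                           # expected but not observed this frame
        if id(l) in expected and id(l) not in matched_ids:
            if view_dir_close(l, odometry):      # a stored view dir within 20 degrees
                l.misses += 1
                if l.misses >= max_misses:
                    db.remove(l)
```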
22
Other Examples of Vision-based SLAM
• Ceiling lamp based (Panzieri et al. 2003)
• Bearing-only visual SLAM (lots; unclear to me)
• Monocular visual SLAM, e.g. by optical flow (lots; unclear to me)
• Shi and Tomasi feature based (Davison and Murray 1998)
23
Open Questions
• How to speed vision processing? (Move to hardware?)
• Should full SLAM (not decorrelated SLAM) be used?
• If full SLAM, how can numerics be sped up? (FMM?)
• Can thresholds be automatically tuned? (Maximum information?)
• Will movement confuse SIFT SLAM? (Fix by optical flow?)
• What environments foil SIFT SLAM?
• How to get dense features? (Geometric model fitting?)
• Can this handle large environments/long trajectories? (Fix by qualitative navigation?)
• Why does David Lowe rock so hard?
24
Acknowledgements
• Technical content primarily from Lowe 2004 and Se, Lowe, Little 2002.*
• Various images liberated from their papers and from the internet.
• Others produced and/or processed by MATLAB, Lowe's SIFT demo and ImageMagick.
• My sincerest apologies to Terry Pratchett.

* Se, S., D. Lowe and J. Little, "Mobile Robot Localization and Mapping with Uncertainty using Scale-Invariant Visual Landmarks", The International Journal of Robotics Research, Volume 21, Issue 8, 2002.
25