SIFT SLAM Vision Details
MIT 16.412J Spring 2004
Vikash K. Mansinghka

Outline
• Lightning Summary
• Black Box Model of SIFT SLAM Vision System
• Challenges in Computer Vision
• What these challenges mean for visual SLAM
• How SIFT extracts candidate landmarks
• How landmarks are tracked in SIFT SLAM
• Alternative vision-based SLAM systems
• Open questions

Lightning Summary
• Motivation: SLAM without modifying the environment
• Landmark candidates are extracted by the SIFT process
• Candidates are matched between cameras to get 3D positions
• Candidates are pruned according to consistency with the robot's expectations
• Survivors are sent off for statistical processing

Review of Robot Specifications
• Triclops 3-camera "stereo" vision system
• Odometry system which produces [p, q, θ]
• Center camera is the "reference"

Black Box Model of Vision System
• For now, based on black magic (SIFT). Produces landmarks.
• Assume landmarks are globally indexed by i.
• Per-frame inputs:
  – [p, q, θ]: odometry input (x, z, bearing deltas)
  – List of (i, x_i): new landmark positions (from SLAM)
• Per-frame output is a list of (i, x̂_i, x_i, r_i, c_i) for each visible landmark i, where:
  – x̂_i is its measured 3D position (w.r.t. the camera position)
  – x_i is its map 3D position (w.r.t. the initial robot position), if it isn't new
  – (r_i, c_i) are its pixel coordinates in the center camera

Challenges in Computer Vision
• Intuitively appealing ≠ computationally realizable
• Stable feature extraction is hard; results are rarely general
• Extracted features are sparse
• Matching requires exponential time
• Matches are often wrong

Implications for Visual SLAM
• Hard to reliably find landmarks
• Really Hard to reliably find landmarks
• Really Really Hard to reliably find landmarks
• Data association is slow and unreliable
• False matches introduce substantial errors
• Accurate probabilistic models are unavailable

Remarks on the SIFT Approach
• For visual SLAM, landmarks must be identifiable across:
  – Large changes in distance
  – Small changes in view direction
  – (Bonus) Changes in illumination
• Solution:
  – Produce a "scale-invariant" image representation
  – Extract points with associated scale information
  – Use a matcher empirically capable of handling small displacements

The Scale-Invariant Feature Transform
• Described in Lowe, IJCV 2004 (preprint; use Google)
• Four stages:
  – Scale-space extrema extraction
  – Keypoint pruning and localization (not used in SLAM)
  – Orientation assignment
  – Keypoint descriptor (not used in SLAM)

Lightning Introduction to Scale Space
• Motivation:
  – Objects can be recognized at many levels of detail
  – Large distances correspond to a low level of detail (l.o.d.)
  – Different kinds of information are available at each level

Lightning Introduction to Scale Space
• Idea: extract the information content of an image at each l.o.d.
• Detail reduction is typically done by Gaussian blurring
• Long history in both machine and human vision
  – Marr in the late 1970s
  – Henkel in 2000
• Analogous concepts are used in speech processing
[Figure: surface plot of a 2D Gaussian kernel]

Scale Space in SIFT
• I(x, y) is the input image. L(x, y, σ) is the representation at scale σ.
• G(x, y, σ) is the 2D Gaussian with variance σ²
• L(x, y, σ) = G(x, y, σ) ∗ I(x, y) (the "only" choice; see Koenderink 1984)
• D(x, y, σ) = L(x, y, kσ) − L(x, y, σ)
• D approximates σ²∇²G ∗ I (see Mikolajczyk 2002 for its significance)
• D is also edge-detector-like; the newest SIFT "corrects" for this
• Details of discretization (e.g. resampling, choice of k) are unimportant
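To make the scale-space construction above concrete, here is a minimal single-octave Python sketch of the L and D stacks and of the extrema scan described on the next slide. It is not Lowe's implementation: octave resampling, keypoint refinement, and the edge-response correction are omitted, and the function names (build_dog_stack, find_extrema), the choice k = 2^(1/3), and the contrast threshold are illustrative assumptions.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def build_dog_stack(image, sigma0=1.6, k=2 ** (1.0 / 3.0), levels=5):
        """Single-octave scale space: L(x, y, sigma) = G(x, y, sigma) * I(x, y),
        D(x, y, sigma) = L(x, y, k*sigma) - L(x, y, sigma)."""
        image = image.astype(np.float64)
        sigmas = [sigma0 * k ** i for i in range(levels)]
        L = np.stack([gaussian_filter(image, s) for s in sigmas])
        D = L[1:] - L[:-1]  # difference-of-Gaussians stack
        return D, sigmas

    def find_extrema(D, sigmas, threshold=0.02):
        """Return (x, y, sigma) triples where D is a local extremum over its
        26 neighbours in space and scale (threshold assumes intensities in [0, 1])."""
        features = []
        for s in range(1, D.shape[0] - 1):
            for y in range(1, D.shape[1] - 1):
                for x in range(1, D.shape[2] - 1):
                    v = D[s, y, x]
                    if abs(v) < threshold:
                        continue  # reject low-contrast responses
                    patch = D[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
                    if v == patch.max() or v == patch.min():
                        features.append((x, y, sigmas[s]))
        return features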
Scale Space in SIFT (continued)
• Compute local extrema of D as above
• Each such (x, y, σ) is a feature
• The (x, y) part "should" be scale and planar-rotation invariant

SIFT Orientation Assignment
• For each feature (x, y, σ):
  – Find a fixed-pixel-area patch in L(x, y, σ) around (x, y)
  – Compute a gradient orientation histogram; call its bins b_i
  – For each b_i within 80% of the max, make a feature (x, y, σ, b_i)
• Enables matching by including illumination-invariant feature content (Sinha 2000)

SIFT Stereopsis
• Apply SIFT to the image from each camera.
• Match a center feature (x, y, σ, θ) and a right feature (x′, y′, σ′, θ′) if:
  1. |y − y′| ≤ 1
  2. 0 < |x′ − x| ≤ 20
  3. |θ − θ′| ≤ 20 degrees
  4. 2/3 ≤ σ′/σ ≤ 3/2
  5. No other matches consistent with the above exist
• Match similarly for the left and top cameras; discard all features not matched twice
• Compute 3D positions (trigonometry) as the average from the horizontal and vertical matches

Recapitulation (in G Minor)
• Procedure so far:
  1. For each image:
     (a) Produce the scale-space representation
     (b) Find extrema
     (c) Compute gradient orientation histograms
  2. Match features from center to right and center to top
  3. Compute relative 3D positions for the survivors
• This gives us potential features from a given frame
• How do we use them?

Landmark Tracking
• Predict where landmarks should appear (reliability, speed)
• Note: the robot moves in the xz plane
• Given [p, q, θ] and an old relative position [X, Y, Z], find the expected position [X′, Y′, Z′] by:
  X′ = (X − p) cos(θ) − (Z − q) sin(θ)
  Y′ = Y
  Z′ = (X − p) sin(θ) + (Z − q) cos(θ)
• By the pinhole camera model ((u₀, v₀) image-center coordinates, I interocular distance, f focal length):
  r′ = v₀ − f Y′ / Z′
  c′ = u₀ + f X′ / Z′
  d′ = f I / Z′
  σ′ = σ Z / Z′

Landmark Tracking
• V is the camera field-of-view angle (60 degrees)
• A landmark is expected to be in view if:
  Z′ > 0
  tan⁻¹(|X′| / Z′) < V/2
  tan⁻¹(|Y′| / Z′) < V/2
• An expected landmark matches an observed landmark if:
  – Obs. center is within a 10×10 region around the expected position
  – Obs. scale is within 20% of the expected scale
  – Obs. orientation is within 20 degrees of the expected orientation
  – Obs. disparity is within 20% of the expected disparity

Landmark Tracking
• A SIFT view is: (SIFT feature, relative 3D position, absolute view direction)
• Each landmark is: (3D position, list of views, miss count)
• Algorithm:

    For each frame, find expected landmarks using odometry
    For each observed view v:
        If v matches an expected landmark l:
            Set l.misses = 0
            Add v to the view list for l
        Else:
            Add v to the DB as a new landmark
    For each expected but unobserved landmark l:
        If one view direction is within 20 degrees of the current one:
            l.misses++
            If l.misses >= 20, delete l from the DB

• (The prediction and matching steps are sketched in code below.)
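As a concrete companion to the prediction equations and match thresholds on the tracking slides above, here is a minimal Python sketch of the per-landmark prediction and the matching test. The camera parameters (u₀, v₀, f, interocular distance I, field of view V) are placeholder values rather than the Triclops calibration, reading the "10×10 region" as ±5 pixels is an interpretation, and all function names are illustrative assumptions.

    import math

    # Placeholder camera parameters (assumed, not the Triclops calibration).
    U0, V0 = 160.0, 120.0       # image centre (pixels)
    F = 400.0                   # focal length (pixels)
    I_OCULAR = 0.10             # interocular distance I (metres)
    FOV = math.radians(60.0)    # field-of-view angle V

    def predict_relative(X, Y, Z, p, q, theta):
        """Expected relative position after the robot moves [p, q, theta]
        in the xz plane (planar transform from the tracking slide)."""
        Xp = (X - p) * math.cos(theta) - (Z - q) * math.sin(theta)
        Yp = Y
        Zp = (X - p) * math.sin(theta) + (Z - q) * math.cos(theta)
        return Xp, Yp, Zp

    def in_view(Xp, Yp, Zp):
        """Landmark is expected to be visible: in front of the camera and
        within half the field of view both horizontally and vertically."""
        return (Zp > 0
                and math.atan(abs(Xp) / Zp) < FOV / 2
                and math.atan(abs(Yp) / Zp) < FOV / 2)

    def project(Xp, Yp, Zp, Z_old, sigma_old):
        """Pinhole prediction of pixel position, disparity and SIFT scale.
        Call only when in_view(Xp, Yp, Zp) is true (so Zp > 0)."""
        r = V0 - F * Yp / Zp            # expected row
        c = U0 + F * Xp / Zp            # expected column
        d = F * I_OCULAR / Zp           # expected disparity
        s = sigma_old * Z_old / Zp      # expected scale
        return (r, c), d, s

    def matches(obs_rc, obs_scale, obs_orient, obs_disp,
                exp_rc, exp_scale, exp_orient, exp_disp):
        """Observed view matches the prediction within the slide thresholds
        (10x10 pixel region, 20% scale, 20 degrees, 20% disparity)."""
        return (abs(obs_rc[0] - exp_rc[0]) <= 5
                and abs(obs_rc[1] - exp_rc[1]) <= 5
                and abs(obs_scale - exp_scale) <= 0.2 * exp_scale
                and abs(obs_orient - exp_orient) <= math.radians(20)
                and abs(obs_disp - exp_disp) <= 0.2 * exp_disp)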
Other Examples of Vision-based SLAM
• Ceiling-lamp based (Panzieri et al. 2003)
• Bearing-only visual SLAM (lots; unclear to me)
• Monocular visual SLAM, e.g. by optical flow (lots; unclear to me)
• Shi and Tomasi feature based (Davison and Murray 1998)

Open Questions
• How to speed up vision processing? (Move to hardware?)
• Should full SLAM (not decorrelated SLAM) be used?
• If full SLAM, how can the numerics be sped up? (FMM?)
• Can thresholds be automatically tuned? (Maximum information?)
• Will movement confuse SIFT SLAM? (Fix by optical flow?)
• What environments foil SIFT SLAM?
• How to get dense features? (Geometric model fitting?)
• Can this handle large environments and long trajectories? (Fix by qualitative navigation?)
• Why does David Lowe rock so hard?

Acknowledgements
• Technical content primarily from Lowe 2004 and Se, Lowe, Little 2002.*
• Various images liberated from their papers and from the internet.
• Others produced and/or processed by MATLAB, Lowe's SIFT demo, and ImageMagick.
• My sincerest apologies to Terry Pratchett.

* Se, S., D. Lowe and J. Little, "Mobile Robot Localization and Mapping with Uncertainty using Scale-Invariant Visual Landmarks", The International Journal of Robotics Research, 21(8), 2002.