Using the Forest to See the Trees: Context-based Object Recognition
Bill Freeman
Computer Science and Artificial Intelligence Laboratory, MIT
Joint work with Antonio Torralba and Kevin Murphy

A computer vision goal
- Recognize many different objects under many viewing conditions in unconstrained settings.
- There has been progress on restricted cases:
  – one object and one pose (frontal-view faces)
  – isolated objects on uniform backgrounds
- But the general problem is difficult and unsolved.

How we hope to make progress on this hard problem
- Local (bottom-up) approach to object detection: classify image patches/features at each location and scale, p(car | V_L), with various technical improvements.
- Exploit scene context: "if this is a forest, these must be trees."

Problem 1: Local features can be ambiguous
- Identical local image features can belong to a car or a pedestrian; even high-resolution images can be locally ambiguous. An object in context is easy to recognize; the same object isolated is not. (Images by Antonio Torralba; object-in-context figures courtesy of Fredo Durand and William Freeman. Used with permission.)

Solution 1: Context can disambiguate local features.

Problem 2: The search space is HUGE
- We need to search over x, y locations and scales s: ~10,000 patches/object/image, times ~1,000,000 images/day, and we want to do this for ~1,000 objects.
- "Like finding needles in a haystack":
  – slow (many patches to examine)
  – error prone (the classifier must have a very low false positive rate)

Solution 2: Context can provide a prior on what to look for, and where to look for it
- Given the scene, people are most likely in particular image regions, and each object class (cars, pedestrians, computers, desks) gets its own prior; computers/desks are unlikely outdoors. (Torralba, IJCV 2003)

Talk outline
- Context-based vision
  – combine with bottom-up object detection
  – training set acquisition
- Feature-based object detection
- Graphical model to combine both sources
Contextual machine-vision system
- Measure the overall scene context, or "gist."
- Use that scene context for:
  – location identification
  – location categorization
  – top-down information for object recognition

The "Visual Gist" System (Oliva & Torralba, IJCV 2001)
- Low-dimensional representation of the overall scene:
  – Gabor-filter outputs at multiple scales, orientations, and locations
  – dimensionality reduction via PCA
- Feature vector for an image, the "gist" of the scene: compute a 12 x 30 = 360-dimensional feature vector, or use a steerable filter bank (6 orientations, 4 scales), averaged over 4 x 4 regions, for a 384-dimensional feature vector; then reduce to ~80 dimensions using PCA.
- Result: each image maps to an 80-dimensional context representation.

Hardware set-up
- Wearable system:
  – gives immediate feedback to the user
  – must handle a general camera view
  – capable of a wireless link for audience display
- Computer: Sony laptop.
- Designed for utility, not fashion…
- Our mobile rig, version 1: Kevin Murphy; version 2: Antonio Torralba. (Photos courtesy of Kevin Murphy and Antonio Torralba. Used with permission.)

Location recognition for the mobile vision system
- Train: rooms and halls on the 9th floor of 200 Tech. Square, plus outdoors.
- Test:
  – 9th floor of 200 Tech. Square (seen in training)
  – interior of 400 Tech. Square (unseen)
  – outdoors (unseen places)
- Goals:
  – identify previously seen locations
  – identify the category of previously unseen locations
- The system reports the specific location, the location category, and indoor/outdoor, compared against ground truth.

Classifying isolated scenes can be hard
[Figure: single-frame corridor and office recognition curves, correct recognitions vs. misses.]

Scene recognition over time
[Figure: graphical model linking the scene category at successive frames, C_{t-1} -> C_t, each with its gist observation v_G, object nodes O and patch nodes P_1 … P_n with local features V.]
- P(C_t | C_{t-1}) is a transition matrix; P(v_G | C) is a mixture of Gaussians.
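As a concrete sketch of the steerable-filter variant of the gist described above (24 oriented filters, responses averaged over a 4 x 4 grid, PCA down to 80 dimensions). The filter shapes and sizes, and the use of circular FFT convolution, are assumptions for illustration, not the authors' exact implementation:

```python
import numpy as np
from numpy.fft import fft2, ifft2

def gabor_bank(size=15, n_orient=6, n_scale=4):
    """Oriented Gabor-like filters: 6 orientations x 4 scales = 24 filters.
    (Filter shapes and sizes are assumptions, not the original parameters.)"""
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    bank = []
    for s in range(n_scale):
        sigma = 2.0 * (1.6 ** s)
        for o in range(n_orient):
            theta = np.pi * o / n_orient
            u = xs * np.cos(theta) + ys * np.sin(theta)
            g = np.exp(-(xs**2 + ys**2) / (2 * sigma**2)) \
                * np.cos(2 * np.pi * u / (4 * sigma))
            bank.append(g - g.mean())          # zero-mean, band-pass
    return bank

def gist(image, bank, grid=4):
    """Mean filter-response magnitude over a grid x grid partition:
    24 filters x 16 cells = a 384-dimensional raw gist vector."""
    h, w = image.shape
    feats = []
    for g in bank:
        # circular convolution via FFT (a simplification)
        resp = np.abs(ifft2(fft2(image) * fft2(g, s=image.shape)))
        for i in range(grid):
            for j in range(grid):
                feats.append(resp[i * h // grid:(i + 1) * h // grid,
                                  j * w // grid:(j + 1) * w // grid].mean())
    return np.array(feats)

def pca_reduce(X, n_components=80):
    """Project the row-vector features onto the top principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
images = [rng.random((64, 64)) for _ in range(100)]
bank = gabor_bank()
X = np.stack([gist(im, bank) for im in images])   # (100, 384)
G = pca_reduce(X)                                 # (100, 80) "gist" vectors
```

The 80-dimensional rows of `G` play the role of v_G in the slides that follow.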
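The temporal model just described (a transition matrix P(C_t | C_{t-1}) over places, plus a per-place emission density for the gist) is filtered with the standard HMM forward recursion. The sketch below uses an invented "sticky" transition matrix and generic per-frame log-likelihoods standing in for the mixture-of-Gaussians emissions:

```python
import numpy as np

def forward_filter(log_lik, trans, prior):
    """HMM forward filtering: p(q_t | v_{1:t}) from per-frame emission
    log-likelihoods, a place-transition matrix, and a prior over places."""
    T, Q = log_lik.shape
    belief = np.empty((T, Q))
    b = prior
    for t in range(T):
        if t > 0:
            b = b @ trans                    # predict: p(q_t | v_{1:t-1})
        b = b * np.exp(log_lik[t])           # update with the gist likelihood
        b /= b.sum()                         # normalize
        belief[t] = b
    return belief

# Toy example: 3 places, sticky transitions (all numbers invented).
Q = 3
trans = np.full((Q, Q), 0.05 / (Q - 1)) + np.eye(Q) * (0.95 - 0.05 / (Q - 1))

rng = np.random.default_rng(1)
true_path = [0] * 10 + [1] * 10 + [2] * 10
# Emission log-likelihoods: the true place is slightly favored, with noise.
log_lik = rng.normal(0.0, 1.0, (30, Q))
for t, q in enumerate(true_path):
    log_lik[t, q] += 2.0

belief = forward_filter(log_lik, trans, np.ones(Q) / Q)
estimate = belief.argmax(axis=1)             # filtered place estimate per frame
```

The sticky diagonal of `trans` is what produces the "benefit of temporal integration" on the next slide: single-frame errors get smoothed away.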
Cf. topological localization in robotics. (Torralba, Murphy, Freeman, Rubin, ICCV 2003)

Benefit of using temporal integration
[Place recognition demo: instantaneous detection p(q_t | v_t^G) vs. P(q_t | v_{1:t}^G) from the HMM over time; per-frame estimates of specific location, location category, and indoor/outdoor, including categorization of new places.]

Talk outline
- Context-based vision (top-down information for object detection)
- Feature-based object detection
- Graphical model to combine both sources

Bottom-up object recognition
- Use a labelled training set (each view of an object is treated separately).
- Use local features to categorize each object.

Training data
- Hand-annotated 1,200 frames of video from a wearable webcam.
- Trained detectors for 9 types of objects: bookshelf, desk, screen (frontal), steps, building facade, etc.
- 100-200 positive patches and >10,000 negative patches per object.

Feature vector for a patch: step 1
- Convolve with a bank of 12 filters: Gaussian, derivatives, Laplacian, corner, and long-edge filters.

Feature vector for a patch: step 2
- Exponentiate: γ = 2 (variance) or γ = 4 (4th moment, related to kurtosis; useful for texture analysis).

Feature vector for a patch: step 3
- Multiply elementwise (.*) by one of a dictionary of 30 spatial masks; the mask characterizes the spatial layout of the filter response.

Feature vector for a patch: step 4
- Average the masked response: one number per (filter, mask, exponent).

Summary: features
- The k'th feature of the i'th patch is a mask-weighted average of the exponentiated filter response,

    f_i^k = Σ_x w_k(x) |(I_i * g_k)(x)|^{γ_k},

  over a dictionary of 12 filters and 30 masks: 12 x 30 x 2 = 720 features per patch.
- Special cases include:
  – w_k a delta function: a pointwise filter response;
  – the ratio of the γ = 4 feature to the squared γ = 2 feature gives the kurtosis, for texture analysis;
  – w_k a mask that captures the spatial arrangement of parts;
  – Haar wavelets (Viola & Jones; Poggio et al.); rectangular masks support the integral-image trick for fast computation.

Classifier: boosted features
- Strong classifier H(f) = Σ_t α_t h_t(f), where
  – h_t(f) = output of the weak classifier at round t
  – α_t = weight assigned by boosting
  – h_t(f) picks the best feature and threshold.
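The four-step, 720-dimensional patch descriptor (12 filters x 30 masks x 2 exponents) can be sketched as below. The random filter bank and the rectangular-mask dictionary here are stand-ins for the paper's hand-designed dictionaries, and circular FFT convolution is a simplification:

```python
import numpy as np
from numpy.fft import fft2, ifft2

def patch_features(patch, filters, masks, gammas=(2, 4)):
    """720-dim descriptor: for each of 12 filters, 30 spatial masks and two
    exponents, a mask-weighted average of |I * g|^gamma."""
    feats = []
    for g in filters:
        # step 1: convolve with the filter (circular, via FFT)
        resp = np.abs(ifft2(fft2(patch) * fft2(g, s=patch.shape)))
        for w in masks:
            for gamma in gammas:
                # steps 2-4: exponentiate, mask, average
                feats.append(float((w * resp**gamma).sum() / w.sum()))
    return np.array(feats)

rng = np.random.default_rng(0)
P = 16                                            # patch side (assumed)
# Stand-ins for the paper's hand-designed dictionaries:
filters = [f - f.mean() for f in (rng.standard_normal((5, 5)) for _ in range(12))]
masks = []
for _ in range(30):                               # rectangular spatial masks
    m = np.zeros((P, P))
    r, c = rng.integers(0, P - 4, size=2)
    m[r:r + 4, c:c + 4] = 1.0
    masks.append(m)

f = patch_features(rng.random((P, P)), filters, masks)
# 12 filters x 30 masks x 2 exponents = 720 features
```

Because the masks here are rectangles, the per-mask sums could also be computed with the integral-image trick mentioned on the slide.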
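The boosted classifier H(f) = Σ_t α_t h_t(f), with single-feature threshold weak learners, can be sketched as discrete AdaBoost on toy data. This is generic AdaBoost with decision stumps, not necessarily the authors' exact variant, and the data are invented:

```python
import numpy as np

def fit_stump(X, y, w):
    """Best single-feature threshold classifier under weights w (y in {-1,+1})."""
    best = (0, 0.0, 1, np.inf)                # (feature, threshold, polarity, error)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = pol * np.where(X[:, j] > thr, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, thr, pol, err)
    return best

def adaboost(X, y, rounds=20):
    n = len(y)
    w = np.ones(n) / n
    stumps = []
    for _ in range(rounds):
        j, thr, pol, err = fit_stump(X, y, w)
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)  # weight assigned by boosting
        pred = pol * np.where(X[:, j] > thr, 1, -1)
        w *= np.exp(-alpha * y * pred)         # reweight hard examples
        w /= w.sum()
        stumps.append((alpha, j, thr, pol))
    return stumps

def predict(stumps, X):
    """H(f) = sum_t alpha_t h_t(f); the sign gives the class."""
    H = np.zeros(len(X))
    for alpha, j, thr, pol in stumps:
        H += alpha * pol * np.where(X[:, j] > thr, 1, -1)
    return H

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))             # toy "patch feature" vectors
y = np.where(X[:, 3] + 0.5 * X[:, 7] > 0, 1, -1)
stumps = adaboost(X, y, rounds=20)
acc = (np.sign(predict(stumps, X)) == y).mean()
```

Each round picks one feature and one threshold, exactly the weak-learner form described on the classifier slide; the real system runs hundreds of rounds over the 720-dimensional patch features.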
Classifier details (Viola & Jones, IJCV 2001)
- Weak learners are single features; f = feature vector for the patch.
- ~500 rounds of boosting; ~200 positive patches, ~10,000 negative patches.
- No cascade (yet).
[Boosting demo; examples of learned features; example detections for desk, screen, and bookshelf; bottom-up detection ROC curves.]

Talk outline
- Context-based vision
- Feature-based object detection
- Graphical model to combine both sources

Probabilistic models: graphical models
- "Tinker toys" for probabilistic models: build up complex models from simple components describing conditional-independence assumptions (e.g., a chain A -> B -> C).
- Standard inference algorithms let you combine evidence from different parts of the model.

Combining global scene information with local detectors using a probabilistic graphical model (Murphy, Torralba & Freeman, NIPS 2003)
[Figure: a tree with the scene category C at the root, observing the global image features v_G (the gist); object-class nodes O_k, O_s, … below it; and under each of those the particular-object patch nodes P_1 … P_n with their local image features.]
- We use a conditional (discriminative) model, since the local and global features are not independent.

Scene categorization using the gist
- Scene category C ∈ {street, office, corridor, …}; global gist v_G (output of PCA).
- P(C | v_G) is modeled using multi-class boosting or a mixture of Gaussians.

Local patches for object detection and localization
- ~6,000 nodes for the outputs of the screen detector, ~9,000 nodes for the outputs of the keyboard detector.
- P_i^s = 1 iff there is a screen in patch i.

Location-invariant object detection
- O_s = 1 iff there is one or more screens anywhere in the image: modeled as a (non-noisy) OR of the patch nodes.
- The O nodes are useful for image retrieval, scene categorization, and object priming.

Probability of object given scene: P(O | C).
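Because the model is a tree, exact inference is cheap. The toy below collapses it to one scene variable, one object variable, one detector measurement and one gist measurement (all probabilities invented), and computes the posterior by enumeration. It shows the key behavior of the model: a street-like gist can veto a firing screen detector:

```python
import numpy as np

# A tiny, made-up instance of the tree model: C in {office, street},
# O = "a screen is present somewhere", one detector likelihood v_o,
# one gist likelihood v_G.  All numbers are invented for illustration.
p_C = np.array([0.5, 0.5])                    # P(C): office, street
p_O_given_C = np.array([0.9, 0.01])           # P(O=1 | C)
lik_det = np.array([0.4, 3.0])                # P(v_o | O=0), P(v_o | O=1): detector fired
lik_gist = np.array([0.2, 0.8])               # P(v_G | C): the gist looks street-like

# Exact posterior by enumerating the joint P(C, O | v_o, v_G)
joint = np.zeros((2, 2))
for c in range(2):
    for o in range(2):
        p_o = p_O_given_C[c] if o == 1 else 1.0 - p_O_given_C[c]
        joint[c, o] = p_C[c] * p_o * lik_det[o] * lik_gist[c]
joint /= joint.sum()

post_C = joint.sum(axis=1)                    # scene belief: objects + gist combined
post_O = joint[:, 1].sum()                    # object belief after the top-down pass

# Detector-only belief (ignoring the gist), for comparison
prior_O = (p_C * p_O_given_C).sum()
post_O_local = prior_O * lik_det[1] / (prior_O * lik_det[1]
                                       + (1 - prior_O) * lik_det[0])
```

With these numbers the detector alone is quite confident a screen is present, while the full posterior, informed by the street-like gist, is noticeably lower; that trade-off is exactly the message-passing schedule narrated on the next slides.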
Inference in the model
- P(O | C) is estimated from co-occurrence counts.
- Messages pass bottom-up, from the leaves to the root, then top-down, from the root to the leaves:
1. Run the detectors and classify patches. ("Boy, exhaustive search is tiring!")
2. Infer object presence given the detectors. ("Some screen detectors fired, so there's probably a screen somewhere.")
3. Classify the scene using its parts (objects). ("I think I saw a screen and a car, so I may be in an office or a street.")
4. Classify the scene holistically (gist). ("This looks like a street to me.")
5. Update the object estimates using the scene. ("Since I've decided I'm in a street, there is no screen in the image.")
6. Update the patch estimates using the objects. ("Since there's no screen in the image, this patch is a false positive.")

Effect of context on object detection
[Figure: detection performance for bookshelf and building, comparing the detector alone, detector + v_G^t, detector + v_G^{1:t} (HMM), and detector + a scene oracle.]

Example of object priming using the gist / Pruning false positives using the gist
- Pipeline: P(C | v_G) gives P(O | v_G), which is combined with the detector output P(O | v_O) to give P(O | v_O, v_G).
- For each type of object, we plot the single most probable detection if it is above a threshold (set to give an 80% detection rate).
- If we know we are in a street, we can prune false positives such as chair and coffee machine (which are hard to detect, and hence must have low thresholds to reach an 80% hit rate).
[Figure: raw detector outputs vs. pruned detector outputs.]

Top-down and bottom-up object detection
[Video demo: video input; likely location for a car, given the current context; raw detector outputs; detected car.]

Best training set wins
- Computer vision training set options:
  – Real-world data, hand labeled. Example: the Sowerby/BAE database. In general: expensive and slow.
  – Real-world data, partially labeled (cf. character recognition and speech recognition).
  – A synthetic world, automatically labeled. Will training there transfer to the real world?

Research goals
- Scale up: develop an efficient system to recognize 1,000 different objects, generalizing current feature-detection cascades.
- Train exhaustively.
- Apply in wearable or other real-world systems (Lifelog, VACE).

Future directions
- Improve local-feature-based object detection: the training set, and efficient use of local feature information.
- Include temporal information, more than just a single HMM for the global scene context.
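The pruning rule above (keep, per object class, the single most probable detection whose score, scaled by the gist-based prior P(O | v_G), clears a threshold calibrated for an 80% hit rate) might look like the sketch below; the scores, priors, boxes, and threshold values are all invented:

```python
import numpy as np

def calibrate_threshold(pos_scores, hit_rate=0.8):
    """Score threshold that keeps `hit_rate` of the true (positive) detections."""
    return float(np.quantile(pos_scores, 1.0 - hit_rate))

def prune(detections, p_obj_given_gist, thresholds):
    """Per object class, keep the single most probable detection whose score,
    scaled by the gist-based prior P(O | v_G), clears the class threshold."""
    kept = {}
    for cls, score, box in detections:
        s = score * p_obj_given_gist[cls]
        if s >= thresholds[cls] and s > kept.get(cls, (0.0, None))[0]:
            kept[cls] = (s, box)
    return kept

# Calibration on invented positive-example scores:
rng = np.random.default_rng(0)
thr = calibrate_threshold(rng.uniform(0.2, 1.0, 500))

# Invented detections in a street scene: (class, detector score, box)
dets = [("car", 0.9, (10, 20, 40, 60)),
        ("car", 0.6, (80, 20, 110, 55)),
        ("chair", 0.7, (5, 5, 25, 30))]
p_obj_given_gist = {"car": 0.9, "chair": 0.05}   # the gist says: street
thresholds = {"car": 0.3, "chair": 0.3}
kept = prune(dets, p_obj_given_gist, thresholds)
```

Here the moderately confident chair detection is pruned because chairs are improbable in a street, while the best car detection survives: the behavior shown in the raw-vs-pruned figure.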
Why scene context?
- Disambiguate local features.
- Reduce the search space.

Overview
- Data set
- Object detection: features, classifier
- Scene categorization and object priming
- Combining global scene information with local detectors using a probabilistic graphical model
- Scene categorization over time
- Location/scale priors using the scene and other objects
- Summary / future work

Classifier: based on boosting
- H(f) = Σ_t α_t h_t(f), where
  – h_t(f) = output of the weak classifier at round t
  – α_t = weight assigned by boosting
  – f = feature vector for the patch
- h_t(f) picks the best feature f_k and the corresponding threshold to minimize the classification error on a validation set.
- 100-500 rounds of boosting.

Converting boosting output to a probability distribution
- P(P_i^k = 1 | b) = σ(λ^T [1; b]): a sigmoid of the weighted weak-classifier outputs b, with an offset/bias term.

Combining top-down with bottom-up: a graphical model showing the assumed statistical relationships between variables: scene category, visual "gist" observations, object classes, particular objects, and local image features.

Pruning false positives using the gist
- Scene categories: kitchen, office, lab, conference room, open area, corridor, elevator, and street.
- For each type of object, we plot the single most probable detection if it is above a threshold (set to give an 80% detection rate).
- Using the gist, we figured out we're in a street, so the probability of chair and coffee machine drops below threshold.

Place and scene recognition using the gist
[Figure: specific location, location category, and indoor/outdoor over frame number; ground truth vs. system estimate; labels include Building 400, outdoor, AI-lab.]

Place recognition demo
[Figure: corridor recognition (targets = 400, distractors = 2806) and office recognition (targets = 400, distractors = 2940); correct recognitions vs. number of false alarms, boosting vs. Gaussian classifiers.]

Scene categorization using the gist
- Estimate P(C | v_G) using multi-class boosting or a mixture of Gaussians, for 7 pre-chosen categories.
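The sigmoid conversion P(P_i^k = 1 | b) = σ(λ^T [1; b]) above is logistic regression on the weak-classifier outputs b. The gradient-ascent fit below is a generic sketch with invented data and learning rate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_sigmoid(b, y, lr=0.1, iters=2000):
    """Fit P(y=1 | b) = sigmoid(lambda^T [1; b]) by gradient ascent on the
    log-likelihood; b holds weak-classifier outputs, y in {0, 1}."""
    Xb = np.column_stack([np.ones(len(b)), b])   # prepend the offset/bias term
    lam = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = sigmoid(Xb @ lam)
        lam += lr * Xb.T @ (y - p) / len(y)      # gradient of the log-likelihood
    return lam

rng = np.random.default_rng(0)
b = rng.normal(0, 1, (500, 1))                   # stand-in boosting outputs
y = (b[:, 0] + rng.normal(0, 0.5, 500) > 0).astype(float)
lam = fit_sigmoid(b, y)
p = sigmoid(np.column_stack([np.ones(500), b]) @ lam)
acc = ((p > 0.5) == (y > 0.5)).mean()
```

The resulting `p` values are proper probabilities in (0, 1), which is what lets the detector outputs be plugged into the graphical model as likelihoods.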
Object priming using scene category
[Figure: for office, corridor, and street scenes, P(O | v_G) computed for 9 object classes: pedestrian, bookshelf, car, chair, coffee machine, desk, keyboard, mouse, screen.]
- P(O | C) is estimated by counting co-occurrences in labeled images.
- "Cars are likely in streets, but not in offices or corridors."
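Estimating P(O | C) by counting co-occurrences in labeled images can be sketched as follows; the labels are toy data, and the add-one (Laplace) smoothing is an assumption, not necessarily what the authors used:

```python
import numpy as np

# Estimate P(O | C) by counting object/scene co-occurrences in labeled
# images (toy labels; the real system counts over annotated video frames).
scenes = ["street", "office", "corridor"]
objects = ["car", "screen"]

# Each labeled image: (scene, set of objects present)
labeled = ([("street", {"car"})] * 40 + [("street", set())] * 10
           + [("office", {"screen"})] * 30 + [("office", set())] * 5
           + [("corridor", set())] * 15)

counts = np.ones((len(scenes), len(objects)))     # add-one smoothing (assumed)
totals = np.full(len(scenes), 2.0)
for scene, present in labeled:
    c = scenes.index(scene)
    totals[c] += 1
    for o in present:
        counts[c, objects.index(o)] += 1

p_O_given_C = counts / totals[:, None]            # P(object present | scene)
# "Cars are likely in streets, but not in offices or corridors."
```

Rows of `p_O_given_C` are the per-scene object priors that the priming figure plots for the 9 object classes.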