Using the Forest to See the Trees:
Context-based Object Recognition
Bill Freeman
Computer Science and Artificial Intelligence
Laboratory
MIT
Joint work with Antonio Torralba and Kevin Murphy
A computer vision goal
• Recognize many different objects under many viewing conditions in unconstrained settings.
• There has been progress on restricted cases:
– one object and one pose (frontal view faces)
– isolated objects on uniform backgrounds.
• But the general problem is difficult and unsolved.
How we hope to make progress on this hard problem
• Various technical improvements
• Exploit scene context:
– “if this is a forest, these must be trees”.
Local (bottom-up) approach to object detection
Classify image patches/features at each location and scale
[Figure: local features $V_L$ feed a classifier that outputs $p(\text{car} \mid V_L)$; example result: “No car”]
Problem 1:
Local features can be ambiguous
Solution 1: Context can
disambiguate local features
Effect of context on object detection
car
pedestrian
Identical local image features!
Even high-resolution images can
be locally ambiguous
Images by Antonio Torralba
Object in context
(Courtesy of Fredo Durand and William Freeman. Used with permission.)
Isolated object
Object in context
Problem 2: search space is HUGE
“Like finding needles in a haystack”
Need to search over x,y locations and scales s
- Slow (many patches to examine)
- Error prone (classifier must have very low false positive rate)
10,000 patches/object/image
1,000,000 images/day
Plus, we want to do this for ~ 1000 objects
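To make the size of the search concrete, the figures on this slide can be multiplied out. The sketch below uses only the three numbers quoted above; the target of one false alarm per object per image is a hypothetical requirement chosen for illustration, not a figure from the talk.

```python
# Figures quoted on the slide.
patches_per_object_per_image = 10_000
num_objects = 1_000
images_per_day = 1_000_000

# Patch classifications per day if every object detector examines every patch.
classifications_per_day = (
    patches_per_object_per_image * num_objects * images_per_day
)

# Hypothetical target (an assumption, not from the talk): on average at most
# one false alarm per object per image.
target_false_alarms_per_image = 1.0
max_false_positive_rate = (
    target_false_alarms_per_image / patches_per_object_per_image
)

print(f"{classifications_per_day:.0e} patch classifications per day")
print(f"per-patch false positive rate must stay below {max_false_positive_rate:.0e}")
```

Under these numbers an exhaustive search needs 10^13 patch classifications per day, which is why a prior on what to look for, and where, is attractive.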
Solution 2: context can provide a
prior on what to look for,
and where to look for it
People most likely here
Torralba, IJCV 2003
[Figure: bar chart of object probabilities (0.0 to 1.0) for cars, pedestrian, computer, desk]
Computers/desks unlikely outdoors
Talk outline
• Context-based vision
• Feature-based object detection
• Graphical model to combine both sources
Context-based vision
• Measure overall scene context or “gist”
• Use that scene context for:
– Location identification
– Location categorization
– Top-down info for object recognition
• Combine with bottom-up object detection
• Future focus: training set acquisition.
Contextual machine-vision system
The “Visual Gist” System
• Low-dimensional representation of overall scene:
– Gabor-filter outputs at multiple scales, orientations, locations
– Dimensionality reduction via PCA
Feature vector for an image: the “gist” of the scene
– Compute 12 x 30 = 360 dim. feature vector
– Or use steerable filter bank, 6 orientations, 4 scales, averaged over 4x4 regions = 384 dim. feature vector
– Reduce to ~ 80 dimensions using PCA
Oliva & Torralba, IJCV 2001
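The 384-dimensional count above comes from 6 orientations x 4 scales = 24 filter response maps, each averaged over a 4x4 grid of regions. The following is a minimal sketch of that pooling step only. The response maps are random stand-ins rather than real steerable-filter outputs, the function names are hypothetical, and the PCA reduction to about 80 dimensions is not shown.

```python
import random

def pool_to_grid(response, grid=4):
    """Average a 2-D response map (list of rows) over grid x grid blocks."""
    height, width = len(response), len(response[0])
    bh, bw = height // grid, width // grid  # assumes size divisible by grid
    pooled = []
    for gy in range(grid):
        for gx in range(grid):
            block = [response[y][x]
                     for y in range(gy * bh, (gy + 1) * bh)
                     for x in range(gx * bw, (gx + 1) * bw)]
            pooled.append(sum(block) / len(block))
    return pooled

def gist_descriptor(response_maps, grid=4):
    """Concatenate the pooled values of every response map."""
    descriptor = []
    for response in response_maps:
        descriptor.extend(pool_to_grid(response, grid))
    return descriptor

random.seed(0)
NUM_ORIENTATIONS, NUM_SCALES, SIZE = 6, 4, 32
# Stand-in for filter-bank magnitudes: random numbers, not real filter outputs.
maps = [[[random.random() for _ in range(SIZE)] for _ in range(SIZE)]
        for _ in range(NUM_ORIENTATIONS * NUM_SCALES)]
gist = gist_descriptor(maps)
print(len(gist))  # 6 * 4 * 16 = 384
```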
Low-dimensional representation
for image context
Images
80-dimensional
representation
Hardware set-up
• Wearable system
– Gives immediate feedback to the user
– Must handle general camera view
– Capable of wireless link for audience display
• Computer: Sony laptop
Designed for utility, not fashion…
Our mobile rig, version 1
Kevin Murphy
(Courtesy of Kevin Murphy. Used with permission.)
Our mobile rig, version 2.
Antonio Torralba
(Courtesy of Antonio Torralba. Used with permission.)
Location recognition for mobile vision system
Experiments
• Train:
– Rooms and halls on 9th floor of 200 Tech. Square
– Outdoors
• Test:
– Interior of 200 Tech. Square, 9th floor (seen in training)
– Interior of 400 Tech. Square (unseen)
– Outdoors (unseen places)
• Goals:
– Identify previously seen locations
– Identify category of previously unseen locations
[Figure: ground truth vs. system estimate for specific location, location category, and indoor/outdoor]
Classifying isolated scenes
can be hard
Corridors
Offices
Correct recognition
misses
Correct recognition
misses
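Scene recognition over time
[Figure: graphical model unrolled over time. Scene categories $C_{t-1}$ and $C_t$ are linked; each has object-class nodes $O^k \ldots O^s$, patch nodes $P^k_1 \ldots P^k_n$ and $P^s_1 \ldots P^s_n$, local features $V^k_1 \ldots V^k_n$ and $V^s_1 \ldots V^s_n$, and global features $V^G$]
$P(C_t \mid C_{t-1})$ is a transition matrix, $P(v^G \mid C)$ is a mixture of Gaussians
Cf. topological localization in robotics
Torralba, Murphy, Freeman, Rubin, ICCV 2003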
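The temporal model above is a hidden Markov model over scene categories, so the belief $P(C_t \mid v^G_{1:t})$ can be updated recursively with the standard forward (filtering) algorithm. The sketch below is a generic forward filter, not the authors' code. The two states, the transition matrix, and the per-frame likelihood values are hypothetical; in the talk the likelihoods would come from a mixture of Gaussians on the gist.

```python
def forward_filter(prior, transition, likelihoods):
    """Return the filtered beliefs P(C_t | v_{1:t}) for every frame t.

    prior[c]          : P(C_1 = c)
    transition[i][j]  : P(C_t = j | C_{t-1} = i)
    likelihoods[t][c] : P(v_t | C_t = c)
    """
    n = len(prior)
    belief = list(prior)
    beliefs = []
    for t, lik in enumerate(likelihoods):
        if t > 0:
            # Predict: push the previous belief through the transition matrix.
            belief = [sum(belief[i] * transition[i][j] for i in range(n))
                      for j in range(n)]
        # Update: weight by the observation likelihood and renormalize.
        unnormalized = [belief[c] * lik[c] for c in range(n)]
        total = sum(unnormalized)
        belief = [u / total for u in unnormalized]
        beliefs.append(belief)
    return beliefs

# Hypothetical two-scene example: state 0 = office, state 1 = corridor.
prior = [0.5, 0.5]
transition = [[0.9, 0.1],
              [0.1, 0.9]]          # scenes tend to persist between frames
likelihoods = [[0.8, 0.2],         # frame 1 looks like an office
               [0.4, 0.6],         # frame 2 is ambiguous, leaning corridor
               [0.8, 0.2]]         # frame 3 looks like an office again
beliefs = forward_filter(prior, transition, likelihoods)
for t, b in enumerate(beliefs, start=1):
    print(t, [round(p, 3) for p in b])
```

In this toy run the second frame on its own favors the corridor, but the filtered belief still favors the office because the previous frame and the sticky transition matrix carry evidence forward.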
Benefit of using temporal integration
Place recognition demo
Instantaneous detection: $p(q_t \mid v^G_t)$
Using HMM over time: $P(q_t \mid v^G_{1:t})$
Categorization of new places
[Figure: specific location, location category, and indoor/outdoor estimates plotted per frame]
Top-down information for object
detection
Talk outline
• Context-based vision
• Feature-based object detection
• Graphical model to combine both sources
Bottom-up object recognition
• Use labelled training set
• Use local features to categorize each object (each view of an object)
Training data
• Hand-annotated 1200 frames of video from a wearable webcam
• Trained detectors for 9 types of objects: bookshelf, desk, screen (frontal), steps, building facade, etc.
• 100-200 positive patches, > 10,000 negative patches
Feature vector for a patch: step 1
Convolve with a bank of 12 filters: Gaussian derivatives, Laplacian, corner, long edges
Feature vector for a patch: step 2
Exponentiate: γ = 2 (variance) or 4 (4th moment)
Kurtosis: useful for texture analysis
Feature vector for a patch: step 3
Multiply (.*) by a mask from a dictionary of 30 spatial masks
The mask characterizes the shape of the filter response
Feature vector for a patch: step 4
Average the response (example value in the figure: 57.3)
[Figure: bank of 12 filters, dictionary of 30 spatial masks, $\gamma_k$ = 2 (variance) or 4 (4th moment)]
Summary: Features
[Equation figure: the k’th feature of the i’th patch, computed from the image using a dictionary of 12 filters and a dictionary of 30 masks]
12 x 30 x 2 = 720 features. Special cases include:
- $g_k$ = delta function, $w_k$ = Haar wavelets – Viola & Jones, Poggio et al
- $f_i(\gamma=4) / f_i(\gamma=2)$ gives kurtosis for texture analysis
- $w_k$ mask to capture spatial arrangement of parts
Rectangular masks support integral image trick for fast computation
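The four steps (filter, exponentiate, mask, average) can be strung together for a single filter/mask/exponent triple. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses "valid" cross-correlation instead of a true convolution (identical for symmetric kernels), takes the absolute value before exponentiating, and requires the mask to have the same size as the filter response. The function names, the toy patch, and the two-tap derivative kernel are all hypothetical.

```python
def correlate_valid(patch, kernel):
    """2-D 'valid' cross-correlation of a patch with a small kernel."""
    ph, pw = len(patch), len(patch[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ph - kh + 1):
        row = []
        for x in range(pw - kw + 1):
            row.append(sum(patch[y + dy][x + dx] * kernel[dy][dx]
                           for dy in range(kh) for dx in range(kw)))
        out.append(row)
    return out

def patch_feature(patch, kernel, mask, gamma):
    """One scalar feature: mean over positions of mask * |response| ** gamma."""
    response = correlate_valid(patch, kernel)                  # step 1: filter
    powered = [[abs(v) ** gamma for v in row] for row in response]  # step 2
    masked = [[p * m for p, m in zip(prow, mrow)]              # step 3: mask
              for prow, mrow in zip(powered, mask)]
    values = [v for row in masked for v in row]
    return sum(values) / len(values)                           # step 4: average

# Toy 4x4 patch containing a vertical edge.
patch = [[0, 0, 1, 1] for _ in range(4)]
kernel = [[-1, 1]]                         # horizontal derivative (hypothetical)
full_mask = [[1, 1, 1] for _ in range(4)]  # response is 4 rows x 3 columns
left_mask = [[1, 0, 0] for _ in range(4)]  # looks only at the left column

edge_energy = patch_feature(patch, kernel, full_mask, gamma=2)
left_energy = patch_feature(patch, kernel, left_mask, gamma=2)
print(edge_energy, left_energy)
```

Repeating this for every combination of 12 filters, 30 masks, and 2 exponents would give the 720 features counted above.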
Classifier: boosted features
• Output is $h(f) = \sum_t \alpha_t h_t(f)$, where
– f = feature vector for patch
– $h_t(f)$ = output of weak classifier at round t
– $\alpha_t$ = weight assigned by boosting
• Weak learners are single features: $h_t(f)$ picks best feature and threshold
• ~500 rounds of boosting
• ~200 positive patches, ~ 10,000 negative patches
• No cascade (yet)
Viola & Jones, IJCV 2001
Boosting demo
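A boosted classifier whose weak learners each threshold a single feature can be sketched with textbook discrete AdaBoost and decision stumps. This is an illustration of the general technique, not the detector from the talk: stumps are chosen here by weighted training error, the exhaustive threshold search is only practical for tiny data, and the four two-dimensional training vectors are made up.

```python
import math

def stump_predict(f, k, threshold, polarity):
    """Weak classifier: threshold feature k, output +1 or -1."""
    return polarity if f[k] >= threshold else -polarity

def best_stump(features, labels, weights):
    """Return (weighted error, feature index, threshold, polarity) of the best stump."""
    best = None
    for k in range(len(features[0])):
        for threshold in sorted({f[k] for f in features}):
            for polarity in (1, -1):
                error = sum(w for f, y, w in zip(features, labels, weights)
                            if stump_predict(f, k, threshold, polarity) != y)
                if best is None or error < best[0]:
                    best = (error, k, threshold, polarity)
    return best

def train_adaboost(features, labels, rounds):
    """Discrete AdaBoost; labels must be +1 or -1."""
    n = len(features)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, k, threshold, polarity = best_stump(features, labels, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, k, threshold, polarity))
        # Re-weight: misclassified examples gain weight, correct ones lose it.
        weights = [w * math.exp(-alpha * y * stump_predict(f, k, threshold, polarity))
                   for f, y, w in zip(features, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def boosted_output(ensemble, f):
    """h(f) = sum_t alpha_t * h_t(f)."""
    return sum(alpha * stump_predict(f, k, t, p) for alpha, k, t, p in ensemble)

# Made-up patches: only the second feature separates the classes.
features = [[0.2, 1.0], [0.9, 0.8], [0.4, 0.1], [0.7, 0.2]]
labels = [1, 1, -1, -1]
ensemble = train_adaboost(features, labels, rounds=3)
print([boosted_output(ensemble, f) > 0 for f in features])
```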
Examples of learned features
Example detections
desk
screen
Example detections
desk
screen
bookshelf
Bottom-up detection: ROC curves
Talk outline
• Context-based vision
• Feature-based object detection
• Graphical model to combine both sources
Probabilistic models: graphical models
• Tinker toys for probabilistic models
• Build up complex models from simple components describing conditional independence assumptions.
• Standard inference algorithms let you combine evidence from different parts of the model.
[Figure: example graph with nodes A, B, C]
Combining global scene information with local detectors using a probabilistic graphical model
Murphy, Torralba & Freeman, NIPS 2003
[Figure: tree-structured graphical model. Scene category $C$ at the root, with global image features $V^G$. Below $C$: object-class nodes $O^k \ldots O^s$. Below each object class: particular-object nodes $P^k_1 \ldots P^k_n$ and $P^s_1 \ldots P^s_n$, each with local image features $V^k_1 \ldots V^k_n$ and $V^s_1 \ldots V^s_n$]
We use a conditional (discriminative) model since the local and global features are not independent
Scene categorization using the gist
[Figure: the same graphical model, highlighting the scene category $C$ (street, office, corridor,…) and the global gist $V^G$ (output of PCA)]
$P(C \mid v^G)$ modeled using multi-class boosting or by a mixture of Gaussians
Local patches for object detection and localization
[Figure: the same graphical model, highlighting the patch nodes: 6000 nodes (outputs of screen detector), 9000 nodes (outputs of keyboard detector)]
$P^s_i = 1$ iff there is a screen in patch i
Location invariant object detection
[Figure: the same graphical model, highlighting the object-class nodes]
$O^s = 1$ iff there is one or more screens anywhere in the image
Modeled as a (non-noisy) OR function
O nodes useful for image retrieval, scene categorization and object priming
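The deterministic OR that ties an object-presence node to its patch nodes is simple to write down. The first function below is exactly that OR. The second shows what the OR implies for probabilities if the patch variables are treated as independent, which is an assumption made only for this sketch and not a statement about how inference is done in the talk.

```python
def object_present(patch_labels):
    """Deterministic OR: O = 1 iff at least one patch variable P_i equals 1."""
    return 1 if any(patch_labels) else 0

def prob_object_present(patch_probs):
    """P(O = 1) under the sketch's assumption that the P_i are independent."""
    prob_none = 1.0
    for p in patch_probs:
        prob_none *= (1.0 - p)   # probability that this patch holds no object
    return 1.0 - prob_none

print(object_present([0, 0, 1, 0]))
print(prob_object_present([0.5, 0.5]))
```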
Probability of object given scene
[Figure: the same graphical model, highlighting the links between scene category $C$ and the object-class nodes]
Estimated from co-occurrence counts
Inference in the model
• Bottom-up, from leaves to root
• Top-down, from root to leaves
[Figure, repeated for each step below: the same graphical model with the active nodes highlighted]
1. Run detectors and classify patches
“Boy, exhaustive search is tiring!”
2. Infer object presence given detectors
“Some screen detectors fired, so there’s probably a screen somewhere”
3. Classify scene using parts (objects)
“I think I saw a screen and a car, so I may be in an office or a street”
4. Classify scene holistically (gist)
“This looks like a street to me”
5. Update object estimates using scene
“Since I’ve decided I’m in a street, there is no screen in the image”
6. Update patch estimates using objects
“Since there’s no screen in the image, this patch is a false positive”
Effect of context on object detection
[Figure: plots for bookshelf and building, comparing: detector; detector + $v^G_t$; detector + $v^G_{1:t}$ (hmm); detector + scene oracle]
Example of object priming using gist
Pruning false positives using gist
[Figure, two examples: $P(C \mid v^G)$, $P(O \mid v^O)$, $P(O \mid v^G)$, $P(O \mid v^O, v^G)$; raw detector outputs vs. pruned detector outputs]
For each type of object, we plot the single most probable detection if it is above a threshold (set to give 80% detection rate)
If we know we are in a street, we can prune false positives such as chair and coffee-machine (which are hard to detect, and hence must have low thresholds to get 80% hit rate)
Top-down and bottom-up object detection
[Figure: video input; likely location for a car, given current context; detected car]
Best training set wins
• Character recognition
• Speech recognition
Computer vision training set options
• Real world data, hand labeled
– Example: Sowerby/BAE database
– In general: expensive and slow.
• Real world, partially labeled
• Synthetic world, automatically labeled.
– Will training there transfer to the real world?
Research goals
• Scale up: develop efficient system to recognize 1000 different objects, generalizing current feature detection cascades.
• Train exhaustively.
• Apply in wearable or other real-world systems
– Lifelog
– VACE
end
Future directions
• Improve local-feature-based object detection.
– Training set
– Efficient use of local feature information.
• Include temporal information, more than just a single HMM for the global scene context.
Overview of talk
• Why scene context?
– Disambiguate local features
– Reduce search space
• Data set
• Object detection
– Features
– Classifier
• Scene categorization and object priming
• Combining global scene information with local detectors using a probabilistic graphical model
• Scene categorization over time
• Location/scale priors using the scene and other objects
• Summary/ future work
Classifier: based on boosting
• Weighted output is $h = \sum_t \alpha_t h_t(f)$, where
– f = feature vector for patch
– $h_t(f)$ = output of weak classifier at round t
– $\alpha_t$ = weight assigned by boosting
• $h_t(f)$ picks best feature $f_k$ and corresponding threshold to minimize classification error on validation set
• 100-500 rounds of boosting
Converting boosting output to a probability distribution
$P(P^k_i = 1 \mid b) = \sigma(\lambda^T [1; b])$
where $\sigma$ is the sigmoid, $\lambda$ are the weights, and the 1 is the offset/bias term
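The sigmoid mapping from a boosting score b to a probability is a two-parameter logistic function. The sketch below evaluates it directly; the weight values in `lam` are hypothetical placeholders, since the talk does not give fitted values, and in practice they would be fit on held-out data.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def patch_probability(b, lam):
    """P(P = 1 | b) = sigmoid(lam^T [1; b]) = sigmoid(lam[0] + lam[1] * b)."""
    return sigmoid(lam[0] * 1.0 + lam[1] * b)

lam = (-1.0, 2.0)   # hypothetical (offset, slope); not values from the talk
for b in (-2.0, 0.5, 3.0):
    print(b, round(patch_probability(b, lam), 4))
```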
Combining top-down with bottom-up: graphical model showing assumed statistical relationships between variables
[Figure: graphical model with layers: scene category (kitchen, office, lab, conference room, open area, corridor, elevator and street); visual “gist” observations; object class; particular objects; local image features]
Pruning false positives using gist
For each type of object, we plot the single most probable detection if it is above a threshold (set to give 80% detection rate)
Using the gist, we figured out we’re in a street, so probability of chair and coffee machine drops below threshold.
Place and scene recognition using gist
[Figure: ground truth vs. system estimate of specific location, location category, and indoor/outdoor, plotted against frame number for the Building 400, Outdoor, and AI-lab sequences]
Place recognition demo
[Figure: Corridor recognition (Targets=400, Distractors=2806): correct recognition (0 to 400) vs. number of false alarms, for boosting and gaussian]
[Figure: Office recognition (Targets=400, Distractors=2940): correct recognition (0 to 400) vs. number of false alarms, for boosting and gaussian]
Scene categorization using the gist
Estimate $P(C \mid v^G)$ using multi-class boosting or mixture of Gaussians for 7 pre-chosen categories.
[Figure: example scenes: office, corridor, street]
Object priming using scene category
[Figure: bar charts for office, corridor, and street scenes over the 9 object classes: pedestrian, bookshelf, car, chair, coffee m/c, desk, keyboard, mouse, screen]
Compute [equation] for 9 object classes O, where $P(O \mid C)$ is estimated by counting co-occurrences in labeled images
“Cars are likely in streets, but not in offices or corridors”
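Estimating $P(O \mid C)$ by counting, and then using it to prime object detectors, can be sketched in a few lines. The counting step follows the slide directly. The priming formula $P(O \mid v^G) = \sum_c P(O \mid C=c)\, P(C=c \mid v^G)$ is an assumption here: it is what the tree structure of the model implies, but the slide's own equation did not survive extraction. The labeled images, the scene posterior, and the function names are made up, and no smoothing is applied to the counts.

```python
from collections import Counter

def estimate_object_given_scene(labeled_images):
    """labeled_images: list of (scene, set of objects present).
    Returns P(O = 1 | C) as a nested dict: table[scene][object]."""
    scene_counts = Counter()
    cooccurrence = Counter()
    objects = set()
    for scene, present in labeled_images:
        scene_counts[scene] += 1
        for obj in present:
            cooccurrence[(scene, obj)] += 1
            objects.add(obj)
    return {scene: {obj: cooccurrence[(scene, obj)] / n for obj in objects}
            for scene, n in scene_counts.items()}

def prime_objects(p_obj_given_scene, p_scene_given_gist):
    """Assumed priming rule: P(O | v^G) = sum_c P(O | C = c) * P(C = c | v^G)."""
    objects = next(iter(p_obj_given_scene.values())).keys()
    return {obj: sum(p_obj_given_scene[c][obj] * p
                     for c, p in p_scene_given_gist.items())
            for obj in objects}

# Made-up labeled images.
labeled = [("street", {"car", "pedestrian"}),
           ("street", {"car"}),
           ("office", {"screen", "desk"}),
           ("office", {"screen"})]
table = estimate_object_given_scene(labeled)

# Made-up scene posterior from the gist: "this looks like a street".
priors = prime_objects(table, {"street": 0.9, "office": 0.1})
for obj in sorted(priors):
    print(obj, round(priors[obj], 3))
```

With these toy counts a street-like gist gives cars a high prior and screens a low one, matching the quoted intuition.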