Using Prediction to Build a Compact Visual Memex Memory for Rapid Analysis and Understanding of Egocentric Video Data
PI: Kayvon Fatahalian, Carnegie Mellon University (now at Stanford University)
Funding: NSF IIS-1422767
This project develops new data-driven techniques for processing
large-scale video streams, exploiting the structure and redundancy in
streams captured over days, months, and even years to improve video
analysis accuracy or reduce processing costs. Although initial efforts
focused on the capture and processing of egocentric video streams,
project activities also include processing other forms of video, such
as stationary webcams and sports broadcasts. While the focus of this
research is the design of core algorithms and systems, success stands
to enable the development of new classes of applications (in domains
such as navigation, personal assistance, health/behavior monitoring)
that use the extensive visual history of a camera to intelligently
interpret continuous visual data sources and immediately respond to
the observed input. A further output of this research is the
collection and organization of two video datasets: an egocentric video
database from the life of a single individual, and a set of
long-running video streams (each from a single camera).
Research Activities
We record, analyze, and present to the community KrishnaCam, a
large (7.6 million frames, 70 hours) egocentric video stream along
with GPS position, acceleration and body orientation data spanning
nine months of the life of a computer vision graduate student. We
explore and exploit the inherent redundancies in this rich visual data
stream to answer simple scene understanding questions such as: How
much novel visual information does the student see each day? Given a
single egocentric photograph of a scene, can we predict where the
student might walk next? We find that given our large video database,
simple nearest-neighbor methods are surprisingly adept baselines for
these tasks, even in scenes and scenarios where the camera wearer has
never been before. For example, we demonstrate the ability to predict
the near-future trajectory of the student in a broad set of outdoor
situations, including following sidewalks, stopping to wait for a
bus, taking a daily path to work, and remaining stationary while
eating.
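
As a rough illustration of the nearest-neighbor baseline described above, the sketch below (our own Python example, not the paper's code; the function names, feature dimensions, and random stand-in data are all hypothetical) retrieves the most visually similar database frames for a query image and averages the near-future GPS trajectories recorded after them.

# Hypothetical sketch of a nearest-neighbor trajectory-prediction baseline.
# Frame descriptors (e.g., CNN embeddings) and the GPS displacements recorded
# after each frame are assumed to be precomputed; all names are illustrative.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_index(database_features):
    """Index every frame of the egocentric video database by its descriptor."""
    index = NearestNeighbors(n_neighbors=5, metric="euclidean")
    index.fit(database_features)
    return index

def predict_trajectory(query_feature, index, database_trajectories):
    """Predict the camera wearer's near-future path for a single query frame
    by averaging the paths recorded after its visually nearest frames."""
    _, neighbor_ids = index.kneighbors(query_feature[np.newaxis, :])
    # database_trajectories[i] holds the (T, 2) sequence of future GPS offsets
    # observed after frame i; the mean over the retrieved neighbors is the
    # predicted trajectory for the query.
    return np.mean(database_trajectories[neighbor_ids[0]], axis=0)

# Example usage with random placeholder data (stand-ins for real features/GPS):
features = np.random.rand(10000, 512)        # one 512-D descriptor per frame
trajectories = np.random.rand(10000, 10, 2)  # 10 future GPS offsets per frame
index = build_index(features)
predicted_path = predict_trajectory(np.random.rand(512), index, trajectories)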
The KrishnaCam video dataset is available here.
K. Singh, K. Fatahalian, A. Efros
WACV 2016
There is growing interest in improving the design of deep network architectures
to be both accurate and low cost. This paper explores semantic specialization as
a mechanism for improving the computational efficiency (accuracy-per-unit-cost)
of inference in the context of image classification. Specifically, we propose a
network architecture template called HydraNet, which enables state-of-the-art
architectures for image classification to be transformed into dynamic
architectures that exploit conditional execution for efficient inference.
HydraNets are wide networks containing distinct components specialized to
compute features for visually similar classes, but they retain efficiency by
dynamically selecting only a small number of components to evaluate for any one
input image. This design is made possible by a soft gating mechanism that
encourages component specialization during training and accurately performs
component selection during inference. We evaluate the HydraNet approach on both
the CIFAR-100 and ImageNet classification tasks. On CIFAR, applying the HydraNet
template to the ResNet and DenseNet family of models reduces inference cost by
2-4 times while retaining the accuracy of the baseline architectures. On
ImageNet, applying the HydraNet template improves accuracy by up to 2.5% when
compared to an efficient baseline architecture with similar inference cost.
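
The gating idea can be sketched as follows. This is a minimal PyTorch-style illustration written for this page (the HydraNetSketch class, layer sizes, branch count, and top-k value are all assumptions, not the published architecture): a shared stem feeds a softmax gate, the gate picks the top-k specialized branches for each input, and only those branches are evaluated and combined.

# Minimal sketch of a gated, multi-branch classifier in the spirit of HydraNet.
import torch
import torch.nn as nn

class HydraNetSketch(nn.Module):
    def __init__(self, num_branches=8, top_k=2, feat_dim=64, num_classes=100):
        super().__init__()
        self.top_k = top_k
        # Shared stem computes features used by the gate and by every branch.
        self.stem = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Each branch is intended to specialize to visually similar classes.
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
             for _ in range(num_branches)])
        self.gate = nn.Linear(feat_dim, num_branches)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        feats = self.stem(x)                                   # (B, feat_dim)
        gate_scores = torch.softmax(self.gate(feats), dim=1)   # soft gating
        topk_scores, topk_ids = gate_scores.topk(self.top_k, dim=1)
        weights = topk_scores / topk_scores.sum(dim=1, keepdim=True)
        combined = torch.zeros_like(feats)
        # Evaluate only branches selected for at least one input in the batch,
        # and combine their features weighted by the renormalized gate scores.
        for b, branch in enumerate(self.branches):
            sel = (topk_ids == b)                   # (B, top_k) selection mask
            rows = sel.any(dim=1)
            if rows.any():
                w = (weights * sel.float())[rows].sum(dim=1, keepdim=True)
                combined[rows] = combined[rows] + w * branch(feats[rows])
        return self.classifier(combined)

model = HydraNetSketch()
logits = model(torch.randn(4, 3, 32, 32))   # e.g., CIFAR-sized inputs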
A portion of this project was also supported by IIS-1539069.
R. Mullapudi, W. R. Mark, N. Shazeer, K. Fatahalian
CVPR 2018
High-quality computer vision models typically address the problem of
understanding the general distribution of real-world images. However,
most cameras observe only a very small fraction of this
distribution. This offers the possibility of achieving more efficient
inference by specializing compact, low-cost models to the specific
distribution of frames observed by a single camera. In this paper, we
employ the technique of model distillation (supervising a low-cost
student model using the output of a high-cost teacher) to specialize
accurate, low-cost semantic segmentation models to a target video
stream. Rather than learn a specialized student model on offline data
from the video stream, we train the student in an online fashion on
the live video, intermittently running the teacher to provide a target
for learning. Online model distillation yields semantic segmentation
models that closely approximate their Mask R-CNN teacher with
7 to 17 times lower inference runtime cost (11 to 26 times in
FLOPs), even when the target video's distribution is non-stationary.
Our method requires no offline pretraining on the target video stream,
and achieves higher accuracy and lower cost than solutions based on
flow or video object segmentation. We also provide a new video
dataset for evaluating the efficiency of inference over long-running
video streams.
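
A minimal sketch of such an online distillation loop is shown below (our own illustration, not the released code; the teacher stride, number of gradient steps per update, and model interfaces are assumptions). It assumes teacher and student are segmentation models that map a frame tensor to per-pixel class logits; the expensive teacher runs only intermittently, and its output supervises the compact student on the live stream.

# Illustrative online-distillation loop (hypothetical interfaces).
import torch
import torch.nn.functional as F

def online_distillation(student, teacher, video_frames, optimizer,
                        teacher_stride=64, steps_per_update=4):
    """Process a live stream, intermittently distilling the teacher's output
    into the low-cost student and yielding the student's per-frame prediction."""
    for i, frame in enumerate(video_frames):           # frame: (1, 3, H, W)
        if i % teacher_stride == 0:
            # Periodically run the expensive teacher to obtain a training target.
            with torch.no_grad():
                target = teacher(frame).argmax(dim=1)  # (1, H, W) class labels
            # A few gradient steps keep the student tracking the current
            # (possibly non-stationary) distribution of the stream.
            for _ in range(steps_per_update):
                optimizer.zero_grad()
                loss = F.cross_entropy(student(frame), target)
                loss.backward()
                optimizer.step()
        # The cheap student produces the segmentation for every frame.
        with torch.no_grad():
            yield student(frame).argmax(dim=1)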
The Long Video Streams Dataset introduced in this project is available here.
R. Mullapudi, S. Chen, K. Zhang, D. Ramanan, K. Fatahalian
arXiv:1812.02699
Open Source Infrastructure for Large-Scale Video Analysis
Early on in our efforts to process large amounts of egocentric video
(e.g., the KrishnaCam project), we learned that we lacked system
infrastructure support for this task. Simply put, many graduate
students did not have the systems implementation skills to do computer
vision research on video at scale. These early experiences motivated
the design of the open-source Scanner project, which became the focus
of IIS-1539069.
Outreach and Dissemination
A Quintillion Live Pixels:
The Challenge of Continuously Interpreting, Organizing, and Generating the World's Visual Information, ISCA 2016, Arch2030 Keynote (slides, youtube)
Towards a Worldwide Exapixel Per Second: Efficient Visual Data Analysis at Scale, HPG 2017 Keynote
Support
This project is supported by the National Science Foundation:
Title: Using Prediction to Build a Compact Visual Memex Memory for Rapid Analysis and Understanding of Egocentric Video Data
Funding for research assistants working on the project is also provided by NVIDIA, Intel, and Google.
This material is based upon work supported by the National Science
Foundation under Grant No. IIS-1422767. Any opinions, findings, and
conclusions or recommendations expressed in this material are those of
the author(s) and do not necessarily reflect the views of the National
Science Foundation.
Last updated June 2019.