Using Prediction to Build a Compact Visual Memex Memory for Rapid Analysis and Understanding of Egocentric Video Data

PI: Kayvon Fatahalian, Carnegie Mellon University (IIS-1422767)

This project develops new data-driven techniques for processing large-scale video streams that exploit the structure and redundancy in streams captured over days, months, and even years, to significantly reduce the size of these datasets without losing the most useful visual information. Simultaneously, the research team is developing parallel programming frameworks that simplify expression and acceleration of these video analysis algorithms at scale. While the focus of this research is the design of core algorithms and systems, success stands to enable the development of new classes of applications (in domains such as navigation, personal assistance, health/behavior monitoring) that use the extensive visual history of a camera to intelligently interpret continuous visual data sources and immediately respond to the observed input. A further output of this research is the collection and organization of an egocentric video database from the life of a single individual.

The KrishnaCam data will be available here.

Research Activities

KrishnaCam: Using a Longitudinal, Single-Person, Egocentric Dataset for Scene Understanding Tasks.
We record, analyze, and present to the community KrishnaCam, a large (7.6 million frames, 70 hours) egocentric video stream, along with GPS position, acceleration, and body orientation data, spanning nine months of the life of a computer vision graduate student. We explore and exploit the inherent redundancies in this rich visual data stream to answer simple scene understanding questions such as: How much novel visual information does the student see each day? Given a single egocentric photograph of a scene, can we predict where the student might walk next? We find that, given our large video database, simple nearest-neighbor methods are surprisingly strong baselines for these tasks, even in scenes and scenarios the camera wearer has never encountered before. For example, we demonstrate the ability to predict the near-future trajectory of the student in a broad set of outdoor situations, including following sidewalks, stopping to wait for a bus, taking a daily path to work, and remaining stationary while eating.
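The nearest-neighbor baseline described above can be sketched in a few lines: retrieve the database frames whose features are closest to the query frame and average their recorded future motion. This is a minimal illustration, not the paper's actual pipeline; the feature dimensionality, the trajectory representation (8 future (x, y) offsets), and all variable names are assumptions for the example.

```python
import numpy as np

def predict_trajectory(query_feat, db_feats, db_futures, k=5):
    # L2 distance from the query frame's feature to every database frame
    dists = np.linalg.norm(db_feats - query_feat, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest frames
    return db_futures[nearest].mean(axis=0)  # average their future paths

# Toy database: 1000 frames with 16-d features and 8-step (x, y) futures
rng = np.random.default_rng(0)
db_feats = rng.standard_normal((1000, 16))
db_futures = rng.standard_normal((1000, 8, 2))
pred = predict_trajectory(db_feats[3], db_feats, db_futures, k=5)
```

With a large enough database, even this retrieve-and-average scheme produces plausible trajectories, because most everyday situations (sidewalks, bus stops, daily commutes) recur many times in the stream.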

HydraNets: Specialized Dynamic Architectures for Efficient Inference.
There is growing interest in improving the design of deep network architectures to be both accurate and low cost. This paper explores semantic specialization as a mechanism for improving the computational efficiency (accuracy-per-unit-cost) of inference in the context of image classification. Specifically, we propose a network architecture template called HydraNet, which enables state-of-the-art architectures for image classification to be transformed into dynamic architectures that exploit conditional execution for efficient inference. HydraNets are wide networks containing distinct components specialized to compute features for visually similar classes, but they retain efficiency by dynamically selecting only a small number of components to evaluate for any one input image. This design is made possible by a soft gating mechanism that encourages component specialization during training and accurately performs component selection during inference. We evaluate the HydraNet approach on both the CIFAR-100 and ImageNet classification tasks. On CIFAR, applying the HydraNet template to the ResNet and DenseNet families of models reduces inference cost by 2-4x while retaining the accuracy of the baseline architectures. On ImageNet, applying the HydraNet template improves accuracy by up to 2.5% when compared to an efficient baseline architecture with similar inference cost. (This activity is also supported by PI Fatahalian's grant IIS-1539069.)

R. Mullapudi, W. R. Mark, N. Shazeer, K. Fatahalian
HydraNets: Specialized Dynamic Architectures for Efficient Inference
CVPR 2018 (to appear)
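The gating-and-selection idea above can be sketched as follows: a shared stem computes features, a soft gate scores each specialized branch, and only the top-k branches are actually evaluated. This is an illustrative toy, not the paper's architecture; the linear gate, tanh stem, branch shapes, and function names are all assumptions made for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hydranet_forward(x, stem, gate_w, branches, k=2):
    h = stem(x)                       # shared stem features
    scores = softmax(gate_w @ h)      # soft gate: one score per branch
    topk = np.argsort(scores)[-k:]    # keep the k highest-scoring branches
    # evaluate ONLY the selected branches -- this is the inference savings
    return sum(scores[i] * branches[i](h) for i in topk)

# Toy instance: 4 "specialized" branches, each a linear map of stem output
rng = np.random.default_rng(0)
stem = lambda x: np.tanh(x)                 # stand-in feature extractor
gate_w = rng.standard_normal((4, 8))        # scores 4 branches from 8 features
calls = []                                  # record which branches actually ran
def make_branch(i, w):
    def branch(h):
        calls.append(i)
        return w @ h
    return branch
branches = [make_branch(i, rng.standard_normal((10, 8))) for i in range(4)]
logits = hydranet_forward(rng.standard_normal(8), stem, gate_w, branches, k=2)
```

Note that only k of the four branches execute per input, while all branches (and the gate) still receive gradients during training in the real system via the soft scores.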

Scanner: Efficient Video Analysis at Scale.
A growing number of visual computing applications depend on the analysis of large video collections. The challenge is that scaling applications to operate on these datasets requires efficient systems for pixel data access and parallel processing across large numbers of machines. Few programmers have the capability to operate efficiently at these scales, limiting the field's ability to explore new applications that leverage big video data. In response, we have created Scanner, a system for productive and efficient video analysis at scale. Scanner organizes video collections as tables in a data store optimized for sampling frames from compressed video, and executes pixel processing computations, expressed as dataflow graphs, on these frames. Scanner schedules video analysis applications expressed using these abstractions onto heterogeneous throughput computing hardware, such as multi-core CPUs, GPUs, and media processing ASICs, for high-throughput pixel processing. We demonstrate the productivity of Scanner by authoring a variety of video processing applications, including the synthesis of stereo VR video streams from multi-camera rigs, markerless 3D human pose reconstruction from video, and data mining of big video datasets such as hundreds of feature-length films or over 70,000 hours of TV news. These applications achieve near-expert performance on a single machine and scale efficiently to hundreds of machines, enabling formerly long-running big video data analysis tasks to be carried out in minutes to hours.

This work is jointly supported by NSF IIS-1253530.

A. Poms, W. Crichton, P. Hanrahan, K. Fatahalian
Scanner: Efficient Video Analysis at Scale
SIGGRAPH 2018 (to appear)
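The dataflow-graph style of per-frame pixel processing described above can be illustrated with a toy graph runner: sample frames from a clip, then chain per-frame stages. This mimics the style of computation Scanner supports, not its actual API; the `Op` class, the grayscale/brightness pipeline, and the 10-frame sampling stride are hypothetical choices for the example.

```python
import numpy as np

class Op:
    """A node in a toy dataflow graph over video frames (illustrative only)."""
    def __init__(self, fn, parent=None):
        self.fn, self.parent = fn, parent
    def then(self, fn):
        return Op(fn, parent=self)           # chain another per-frame stage
    def run(self, frames):
        data = frames if self.parent is None else self.parent.run(frames)
        return [self.fn(f) for f in data]    # apply this stage to every frame

# Hypothetical pipeline: sample every 10th frame of a 100-frame clip,
# convert to grayscale, then reduce each frame to its mean brightness.
frames = [np.full((4, 4, 3), i, dtype=np.float32) for i in range(100)]
sampled = frames[::10]                       # strided frame sampling
graph = Op(lambda f: f.mean(axis=2)).then(lambda g: float(g.mean()))
brightness = graph.run(sampled)
```

In the real system, expressing the computation as a graph of stages like this is what lets the scheduler map stages onto CPUs, GPUs, and media ASICs and parallelize them across machines.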

Outreach and Dissemination

A Quintillion Live Pixels: The Challenge of Continuously Interpreting, Organizing, and Generating the World's Visual Information, ISCA 2016, Arch2030 Keynote (youtube)

Towards a Worldwide Exapixel Per Second: Efficient Visual Data Analysis at Scale, HPG 2017 Keynote


This project is supported by the National Science Foundation:

Proposal: IIS-1422767
Title: Using Prediction to Build a Compact Visual Memex Memory for Rapid Analysis and Understanding of Egocentric Video Data
PI: Kayvon Fatahalian, Carnegie Mellon University

Funding for research assistants working on the project is also provided by NVIDIA, Intel, and Google.

This material is based upon work supported by the National Science Foundation under Grant No. IIS-1422767. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Last updated April 2018.