Using Prediction to Build a Compact Visual Memex Memory for Rapid Analysis and Understanding of Egocentric Video Data

PI: Kayvon Fatahalian, Carnegie Mellon University (now at Stanford University)

Funding: NSF IIS-1422767

This project develops new data-driven techniques for processing large-scale video streams that exploit the structure and redundancy in streams (captured over days, months, and even years) to improve video analysis accuracy or reduce processing costs. While initial efforts focused on the capture and processing of egocentric video streams, project activities also include processing other forms of video, such as stationary webcam footage and sports broadcasts. Although the focus of this research is the design of core algorithms and systems, its success stands to enable new classes of applications (in domains such as navigation, personal assistance, and health/behavior monitoring) that use the extensive visual history of a camera to intelligently interpret continuous visual data sources and respond immediately to observed input. A further output of this research is the collection and organization of two video datasets: an egocentric video database from the life of a single individual, and a collection of long-running video streams (each from a single camera).

Research Activities

KrishnaCam: Using a Longitudinal, Single-Person, Egocentric Dataset for Scene Understanding Tasks.
We record, analyze, and present to the community KrishnaCam, a large (7.6 million frames, 70 hours) egocentric video stream, along with GPS position, acceleration, and body orientation data, spanning nine months of the life of a computer vision graduate student. We explore and exploit the inherent redundancies in this rich visual data stream to answer simple scene understanding questions such as: How much novel visual information does the student see each day? Given a single egocentric photograph of a scene, can we predict where the student might walk next? We find that, given our large video database, simple nearest-neighbor methods are surprisingly strong baselines for these tasks, even in scenes and scenarios the camera wearer has never visited before. For example, we demonstrate the ability to predict the near-future trajectory of the student in a broad set of outdoor situations, including following sidewalks, stopping to wait for a bus, taking a daily path to work, and remaining stationary while eating.
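The nearest-neighbor baseline described above can be sketched as follows. This is an illustrative toy, not the paper's implementation: the feature vectors, distance metric, and database layout are all hypothetical stand-ins for whatever image representation and retrieval scheme is actually used.

```python
import math

def nearest_neighbor_trajectory(query_feature, database):
    """Predict a near-future trajectory by retrieving the most similar
    past frame and returning the trajectory that followed it.

    `database` is a list of (feature_vector, future_trajectory) pairs,
    where a trajectory is a list of (x, y) offsets relative to the
    frame's position. All names here are illustrative.
    """
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    best = min(database, key=lambda entry: distance(entry[0], query_feature))
    return best[1]

# Toy example: two remembered scenes and the paths that followed them.
db = [
    ([0.9, 0.1], [(0, 1), (0, 2), (0, 3)]),  # walking straight ahead
    ([0.1, 0.9], [(0, 0), (0, 0), (0, 0)]),  # standing still (e.g., at a bus stop)
]
print(nearest_neighbor_trajectory([0.85, 0.2], db))  # retrieves the "walking" path
```

The key observation is that with a large enough personal video history, such simple retrieval works even in novel scenes, because visually similar scenes tend to afford similar motion.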

The KrishnaCam video dataset is available here.


HydraNets: Specialized Dynamic Architectures for Efficient Inference.
There is growing interest in improving the design of deep network architectures to be both accurate and low cost. This paper explores semantic specialization as a mechanism for improving the computational efficiency (accuracy-per-unit-cost) of inference in the context of image classification. Specifically, we propose a network architecture template called HydraNet, which enables state-of-the-art architectures for image classification to be transformed into dynamic architectures which exploit conditional execution for efficient inference. HydraNets are wide networks containing distinct components specialized to compute features for visually similar classes, but they retain efficiency by dynamically selecting only a small number of components to evaluate for any one input image. This design is made possible by a soft gating mechanism that encourages component specialization during training and accurately performs component selection during inference. We evaluate the HydraNet approach on both the CIFAR-100 and ImageNet classification tasks. On CIFAR, applying the HydraNet template to the ResNet and DenseNet families of models reduces inference cost by 2-4 times while retaining the accuracy of the baseline architectures. On ImageNet, applying the HydraNet template improves accuracy by up to 2.5% when compared to an efficient baseline architecture with similar inference cost.
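The conditional-execution idea can be illustrated with a minimal sketch: a shared stem, a gate that scores every specialized component, and a forward pass that evaluates only the top-k components and combines their softmax-weighted outputs. This is a schematic of the general technique, not the paper's architecture; the scalar "features" and all function names are hypothetical.

```python
import math

def soft_gate(gate_scores, k):
    """Pick the top-k components and softmax-normalize their scores,
    in the style of a HydraNet gate. Returns {component_index: weight}."""
    top = sorted(range(len(gate_scores)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    exps = {i: math.exp(gate_scores[i]) for i in top}
    total = sum(exps.values())
    return {i: exps[i] / total for i in top}

def hydranet_forward(x, stem, components, combiner, gate, k=2):
    """Run the shared stem, select k branches via the gate, and combine
    only their weighted outputs -- the unselected branches are never run."""
    h = stem(x)
    weights = soft_gate(gate(h), k)
    outputs = [(w, components[i](h)) for i, w in weights.items()]
    return combiner(outputs)

# Toy instantiation with scalar "features":
stem = lambda x: x * 2.0
components = [lambda h: h + 1, lambda h: h * h, lambda h: -h, lambda h: h / 2]
gate = lambda h: [h, 1.0, -1.0, 0.5]   # for large h, favors components 0 and 1
combiner = lambda outs: sum(w * o for w, o in outs)
print(hydranet_forward(3.0, stem, components, combiner, gate, k=2))
```

The cost savings come from the fact that only k of the component branches are ever evaluated per input, while the softmax weighting keeps the selection differentiable-in-spirit during training.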

This work is jointly supported by IIS-1539069.

R. Mullapudi, W. R. Mark, N. Shazeer, K. Fatahalian
HydraNets: Specialized Dynamic Architectures for Efficient Inference
CVPR 2018


Online Model Distillation for Efficient Video Inference.
High-quality computer vision models typically address the problem of understanding the general distribution of real-world images. However, most cameras observe only a very small fraction of this distribution. This offers the possibility of achieving more efficient inference by specializing compact, low-cost models to the specific distribution of frames observed by a single camera. In this paper, we employ the technique of model distillation (supervising a low-cost student model using the output of a high-cost teacher) to specialize accurate, low-cost semantic segmentation models to a target video stream. Rather than learn a specialized student model on offline data from the video stream, we train the student in an online fashion on the live video, intermittently running the teacher to provide a target for learning. Online model distillation yields semantic segmentation models that closely approximate their Mask R-CNN teacher with 7 to 17 times lower inference runtime cost (11 to 26 times in FLOPs), even when the target video's distribution is non-stationary. Our method requires no offline pretraining on the target video stream, and achieves higher accuracy and lower cost than solutions based on flow or video object segmentation. We also provide a new video dataset for evaluating the efficiency of inference over long-running video streams.
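The online training loop above has a simple shape: the cheap student runs on every frame, and the expensive teacher runs only intermittently to supply a fresh learning target. The sketch below captures that structure with a toy scalar "model"; the real system uses Mask R-CNN as the teacher and a compact segmentation network as the student, and all names here are illustrative.

```python
def online_distill(stream, teacher, student_predict, student_update,
                   teacher_period=8):
    """Online model distillation sketch: predict with the cheap student
    on every frame; every `teacher_period` frames, run the expensive
    teacher and update the student toward its output."""
    outputs = []
    for t, frame in enumerate(stream):
        if t % teacher_period == 0:
            target = teacher(frame)             # expensive, run rarely
            student_update(frame, target)       # one gradient-step analogue
        outputs.append(student_predict(frame))  # cheap, run every frame
    return outputs

# Toy instantiation: the "student" is a single bias chasing the teacher.
state = {"bias": 0.0}
teacher = lambda frame: frame + 1.0
student_predict = lambda frame: frame + state["bias"]

def student_update(frame, target):
    # Move the student's bias a step toward the teacher's answer.
    state["bias"] += 0.5 * ((target - frame) - state["bias"])

preds = online_distill([0.0] * 20, teacher, student_predict, student_update,
                       teacher_period=4)
```

Because updates happen continuously on the live stream, the student can track a non-stationary frame distribution without any offline pretraining, which is the central point of the approach.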

The Long Video Streams Dataset introduced in this project is available here.

R. Mullapudi, S. Chen, K. Zhang, D. Ramanan, K. Fatahalian
Online Model Distillation for Efficient Video Inference
arXiv:1812.02699


Scanner: Efficient Video Analysis at Scale.
Early on in our efforts to process large amounts of egocentric video (e.g., the KrishnaCam project), we learned that we lacked system infrastructure support for this task. Simply put, many grad students did not have the system implementation skillset to do computer vision research on video at scale. The Scanner project, which is jointly supported by (and in fact the focus of) IIS-1539069, began as a result of this observation.

A growing number of visual computing applications depend on the analysis of large video collections. The challenge is that scaling applications to operate on these datasets requires efficient systems for pixel data access and parallel processing across large numbers of machines. Few programmers have the capability to operate efficiently at these scales, limiting the field's ability to explore new applications that leverage big video data. In response, we have created Scanner, a system for productive and efficient video analysis at scale. Scanner organizes video collections as tables in a data store optimized for sampling frames from compressed video, and executes pixel processing computations, expressed as dataflow graphs, on these frames. Scanner schedules video analysis applications expressed using these abstractions onto heterogeneous throughput computing hardware, such as multi-core CPUs, GPUs, and media processing ASICs, for high-throughput pixel processing. We demonstrate the productivity of Scanner by authoring a variety of video processing applications including the synthesis of stereo VR video streams from multi-camera rigs, markerless 3D human pose reconstruction from video, and data-mining big video datasets such as hundreds of feature-length films or over 70,000 hours of TV news. These applications achieve near-expert performance on a single machine and scale efficiently to hundreds of machines, enabling formerly long-running big video data analysis tasks to be carried out in minutes to hours.
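The programming model described above, sampled frames flowing through a graph of pixel-processing operations, can be conveyed with a small sketch. This is NOT Scanner's actual API (see the GitHub site for that); the stage names and sampling scheme below are purely illustrative.

```python
def run_graph(frames, graph):
    """Evaluate a tiny dataflow graph over sampled frames.
    `graph` is an ordered list of (name, fn) stages; each stage consumes
    the previous stage's output for every frame. A real system like
    Scanner would schedule these stages across CPUs, GPUs, and media
    ASICs on many machines."""
    results = []
    for frame in frames:
        value = frame
        for _name, fn in graph:
            value = fn(value)
        results.append(value)
    return results

# Toy pipeline: sample every 3rd frame, halve pixel values, take the max.
stride_sample = lambda frames, n: frames[::n]
graph = [
    ("resize", lambda f: [p // 2 for p in f]),
    ("detect", lambda f: max(f)),
]
sampled = stride_sample([[10, 20], [30, 40], [50, 60], [70, 80]], 3)
print(run_graph(sampled, graph))
```

The value of expressing work this way is that the system, not the programmer, decides how to decode compressed video, sample frames, and map stages onto heterogeneous hardware, which is what lets the same program run on one machine or hundreds.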

Please see the Scanner Github site for code, documentation, and examples.

This work is jointly supported by NSF IIS-1539069.

A. Poms, W. Crichton, P. Hanrahan, K. Fatahalian
Scanner: Efficient Video Analysis at Scale
ACM Transactions on Graphics (Proceedings of SIGGRAPH 2018)

Outreach and Dissemination

A Quintillion Live Pixels: The Challenge of Continuously Interpreting, Organizing, and Generating the World's Visual Information, ISCA 2016, Arch2030 Keynote (slides, YouTube)

Toward a Worldwide Exapixel Per Second: Efficient Visual Data Analysis at Scale, HPG 2017 Keynote

Support

This project is supported by the National Science Foundation:

Proposal: IIS-1422767
Title: Using Prediction to Build a Compact Visual Memex Memory for Rapid Analysis and Understanding of Egocentric Video Data
PI: Kayvon Fatahalian, Carnegie Mellon University (now at Stanford University)

Funding for research assistants working on the project is also provided by NVIDIA, Intel, and Google.

This material is based upon work supported by the National Science Foundation under Grant No. IIS-1422767. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


Last updated March 2019.