The Visual Computing Database: A Platform for Visual Data Processing and Analysis at Internet Scale
Funding: NSF IIS-1539069, Google 2015 Faculty Fellowship, Intel Machine Learning ISRA
Today, it is clear that the next generation of visual computing applications will require efficient analysis and mining of large repositories of visual data (images, videos, RGBD). But scaling visual data analysis to operate on collections the size of all public photos and videos on Facebook, all security video cameras in a major city, or petabytes of images in an astronomy sky survey, presents supercomputing-scale storage and computation challenges. Very few programmers have the capability to operate efficiently at these scales, inhibiting the field's ability to explore advanced data-driven visual computing applications. To meet this challenge, we are developing a distributed computing platform -- combining ideas from high-performance image processing languages, data analytics, and database functionality -- that facilitates the development of applications that process, query, analyze, and data mine video collections at scale.
This ongoing work presents two significant questions:
- What is the programming system for large-scale visual data analytics? What are scalable primitives for expressing visual data analysis applications and describing visual concepts of interest?
- What are the requirements of an efficient distributed computing runtime system for executing visual analysis pipelines, and what is the ideal hardware platform for such computations?
Scanner: Efficient Video Processing at Scale. Scanner is a distributed computing platform for productive and efficient video analysis at scale. In brief, Scanner provides support for organizing large video collections as tables in a data store that is optimized for providing frame-level access (including sparse access) to compressed video. Scanner allows application developers to create video processing applications by defining dataflow graphs of image processing kernels that operate on streams of data extracted from the data store. Scanner efficiently schedules video analysis applications expressed using these abstractions onto heterogeneous throughput computing hardware, such as multi-core CPUs, GPUs, and media processing ASICs.
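The dataflow-graph model described above can be sketched as follows. This is an illustrative sketch only, not Scanner's actual API: the kernel names, the list-based "table" of frames, and the linear graph structure are all stand-ins chosen for brevity.

```python
# Illustrative sketch (not the actual Scanner API): a linear dataflow graph
# of per-frame kernels applied to a stream of frames from a video "table".

def resize_kernel(frame, size):
    # Stand-in for an image-resize op; frames here are just lists of pixels.
    return frame[:size]

def histogram_kernel(frame):
    # Stand-in for a per-frame color histogram.
    counts = {}
    for px in frame:
        counts[px] = counts.get(px, 0) + 1
    return counts

def run_graph(frames, kernels):
    """Apply a linear graph of kernels to each frame in the stream."""
    results = []
    for frame in frames:
        out = frame
        for k in kernels:
            out = k(out)
        results.append(out)
    return results

# A "table" of two decoded frames (each a list of pixel values).
table = [[1, 1, 2, 3], [2, 2, 2, 4]]
graph = [lambda f: resize_kernel(f, 3), histogram_kernel]
hists = run_graph(table, graph)
```

In the real system, the runtime (rather than a Python loop) schedules such graphs across CPUs, GPUs, and media ASICs, and frames are decoded on demand from compressed video.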
We have used Scanner to scale video processing applications to thousands of CPUs and hundreds of GPUs on the Google Cloud Platform, and demonstrated applications ranging from VR video processing, to 3D human pose reconstruction, to supporting data analytics tasks on large video databases (hundreds of feature-length movies or thousands of hours of TV News footage).
Please see the Scanner Github site for code, documentation, and examples.
The Esper video analytics framework. Esper is a framework for exploratory analysis of large video collections. Esper takes as input a set of videos and a database of metadata about the videos (e.g., bounding boxes, poses, tracks) and provides a web UI (shown below) and a programmatic interface (Jupyter notebook) for visualizing and analyzing this metadata. Esper was originally intended as a framework for data scientists and analysts seeking to extract insight and value from video collections, but computer vision researchers may also find Esper a useful tool for understanding and debugging the accuracy of their trained models.
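The style of programmatic metadata analysis Esper supports can be illustrated with a toy example. The record schema and field names below are hypothetical, not Esper's actual data model; they only convey the idea of querying per-frame annotations.

```python
# Hypothetical per-frame metadata records; Esper's actual schema and API
# differ -- this only illustrates the style of exploratory analysis.
faces = [
    {"video": "news.mp4", "frame": 10, "bbox_h": 0.42, "label": "face"},
    {"video": "news.mp4", "frame": 11, "bbox_h": 0.08, "label": "face"},
    {"video": "news.mp4", "frame": 12, "bbox_h": 0.55, "label": "face"},
]

# Example query: frames containing a large face (e.g., close-up shots).
closeups = [r["frame"] for r in faces if r["bbox_h"] > 0.3]
```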
Esper source code is available on Github.
HydraNets: Specialized Dynamic Architectures for Efficient Inference. In working with large visual data collections, it quickly became apparent that scaling to larger datasets required faster algorithms for inference. HydraNets were one effort to reduce the cost of inference via improved DNN architecture design. (This activity is also supported by PI Fatahalian's grant IIS-1422767).
HydraNets explore semantic specialization as a mechanism for improving the computational efficiency (accuracy-per-unit-cost) of inference in the context of image classification. HydraNets are a DNN design pattern that enables state-of-the-art architectures for image classification to be transformed into dynamic architectures which exploit conditional execution for efficient inference. HydraNets are wide networks containing distinct components specialized to compute features for visually similar classes, but they retain efficiency by dynamically selecting only a small number of components to evaluate for any one input image. This design is made possible by a soft gating mechanism that encourages component specialization during training and accurately performs component selection during inference. We evaluate the HydraNet approach on both the CIFAR-100 and ImageNet classification tasks. On CIFAR, applying the HydraNet template to the ResNet and DenseNet family of models reduces inference cost by 2-4x while retaining the accuracy of the baseline architectures. On ImageNet, applying the HydraNet template improves accuracy by up to 2.5% when compared to an efficient baseline architecture with similar inference cost.
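The gate-then-select pattern described above can be sketched in a few lines. This is a minimal illustration, not the trained HydraNet model: the gate, components, and combiner below are stand-in functions, and a real network would use learned features rather than toy arithmetic.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of gate scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def hydranet_forward(x, gate, components, combiner, k=2):
    """Evaluate only the top-k components chosen by the gate (illustrative)."""
    weights = softmax(gate(x))
    topk = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    feats = [components[i](x) for i in topk]  # only k branches are executed
    return combiner(feats)

# Toy usage: 4 "components", of which only the top-2 by gate score run.
gate = lambda x: [0.1, 2.0, 0.5, 1.5]               # stand-in gating scores
components = [lambda x, i=i: x * (i + 1) for i in range(4)]
out = hydranet_forward(3, gate, components, sum, k=2)
```

The key efficiency property is visible in the sketch: of the four components, only the two selected by the gate are ever evaluated, so inference cost scales with k rather than with the total number of components.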
Automatically Scheduling Halide Programs. In recent years, the Halide image processing language has proven to be an effective system for authoring high-performance image processing code, as evidenced by its use at companies like Google to author popular computational photography applications used in hundreds of millions of smartphones. However, although Halide enables programmers to work more quickly, obtaining high performance still requires programmers to have expertise in modern code optimization techniques and hardware architectures. We have developed an algorithm for automatically generating high-performance implementations of Halide image processing programs. In seconds, the algorithm generates schedules for a wide set of image processing benchmarks that are competitive with (and often better than) schedules manually authored by expert Halide developers on both server and mobile platforms. (This activity is also supported by PI Fatahalian's CAREER grant IIS-1253530.)
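The scheduling decisions the autoscheduler makes can be illustrated with a toy two-stage pipeline. This sketch is in plain Python rather than Halide, and the two "schedules" (storing the intermediate vs. inlining it) are a simplified stand-in for the store/recompute trade-offs Halide schedules control; both produce identical output but with different memory and recomputation costs.

```python
def stage1(i, inp):
    # First pipeline stage: a 2-tap sum over the input.
    return inp[i] + inp[i + 1]

# "Schedule" A: compute stage1 over the whole input first, storing the
# intermediate buffer (analogous to compute_root in Halide).
def pipeline_staged(inp):
    tmp = [stage1(i, inp) for i in range(len(inp) - 1)]
    return [tmp[i] + tmp[i + 1] for i in range(len(tmp) - 1)]

# "Schedule" B: inline stage1 into stage2, recomputing values instead of
# storing them (analogous to inlining a Func in Halide).
def pipeline_inlined(inp):
    return [stage1(i, inp) + stage1(i + 1, inp) for i in range(len(inp) - 2)]
```

Choosing among such schedules (and tilings, loop orders, and vectorization strategies) for every stage of a real pipeline is the search problem the autoscheduler solves automatically.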
The autoscheduler is now available to the public as part of the mainline Halide distribution. Please see the Halide website or source on Github. Tutorials 21 and 22 describe usage of the autoscheduler.
Lantern: A Query Language for Visual Concept Retrieval. Lantern addresses a rapidly growing need to efficiently explore and mine massive visual datasets for information, for tasks like locating people in a video or determining similarity between images. A number of recent top-performing computer vision tools for these tasks rely on machine learning methods, specifically end-to-end training and evaluation, which can take days or weeks to produce effective concept detectors. The language provides an abstraction, the spatial concept hierarchy, for combining existing vision algorithms with coarse-grained rules for quickly developing new queries and interactively exploring visual data. Lantern compiles queries into operations on distributed collections to enable rapid execution on large clusters. We demonstrate the use of Lantern by building an interactive system for exploration of visual datasets, an object detector error analysis platform, and a tool to blur faces in videos. We show Lantern queries running on a cluster on the Google Cloud Platform.
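The idea of composing detector outputs with coarse-grained spatial rules can be sketched as follows. The detection records, the `near` rule, and the composite "cyclist" concept below are all hypothetical examples; Lantern's actual query language and spatial concept hierarchy differ.

```python
# Hypothetical detector outputs (normalized box centers); Lantern's actual
# query language differs -- this only sketches composing detectors with
# coarse-grained spatial rules.
detections = [
    {"label": "person", "x": 0.2, "y": 0.5},
    {"label": "bicycle", "x": 0.25, "y": 0.6},
    {"label": "person", "x": 0.9, "y": 0.1},
]

def near(a, b, eps=0.15):
    # Coarse-grained spatial rule: two detections whose centers are close.
    return abs(a["x"] - b["x"]) < eps and abs(a["y"] - b["y"]) < eps

# Composite concept query: people near a bicycle (a "cyclist" concept built
# from existing detectors plus a spatial rule, with no new training).
cyclists = [p for p in detections if p["label"] == "person"
            and any(near(p, d) for d in detections if d["label"] == "bicycle")]
```

The appeal of this style is that a new concept is defined in seconds from existing detectors and rules, rather than by training a new end-to-end detector over days or weeks.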
This project is supported by the National Science Foundation (IIS-1539069).
Funding for research assistants working on the project is also provided by Google through a 2016 Faculty Fellowship, and Intel through the Machine Learning ISRA program.
Last updated November 2018.