Automatic Generation of High-Efficiency Graphics Systems

PI: Kayvon Fatahalian, Carnegie Mellon University (IIS-1253530)

This project seeks to develop new algorithms, programming abstractions, and compiler frameworks that enable the creation of extensible, productive graphics systems that automatically (or semi-automatically under developer supervision) optimize themselves to parallel machine details or specific scene content. Success stands to significantly increase the efficiency of interactive graphics applications, enabling synthesis of increasingly realistic virtual environments across future computing platforms ranging from energy-constrained mobile devices to collections of machines in the cloud.

As of June 2018, this project is complete. Notable impacts of the project's research include the Halide autoscheduler (Mullapudi et al. SIGGRAPH 2016), which is now part of the mainline Halide distribution and in frequent use at Google, and the design of the Slang shading language (He et al. SIGGRAPH 2018), which is now the sole shader compilation system for NVIDIA's Falcor rendering framework.

Research Activities

The Slang Shading Language. Slang is an extension of HLSL that includes first-class language constructs for authoring clear, extensible real-time shading systems while also retaining high levels of performance. In the past, to meet these two goals, engine architects have established design patterns for authoring shading systems, and developed engine-specific code synthesis tools, ranging from preprocessor hacking to domain-specific shading languages, to productively implement these patterns. The problem is that proprietary tools add significant complexity to modern engines, lack advanced language features, and create additional challenges for learning and adoption. In this project we argue that the advantages of engine-specific code generation tools can be achieved using the underlying GPU shading language directly, provided the shading language is extended with a small number of best-practice principles from modern, well-established programming languages. We identify that adding generics with interface constraints, associated types, and interface/structure extensions to existing C-like GPU shading languages enables real-time renderer developers to build shading systems that are extensible, maintainable, and execute efficiently on modern GPUs without the need for additional domain-specific tools. We've embodied these ideas in an extension of HLSL called Slang, and provide a reference design for a large, extensible shader library implemented using Slang's features. We rearchitect an open source renderer to use this library and Slang's compiler services, and demonstrate that the resulting shading system is substantially simpler, easier to extend with new features, and achieves higher rendering performance than the original HLSL-based implementation.

Slang is now the shader compiler for NVIDIA's Falcor open source research rendering framework.

Please see the Slang Github site for code and examples.

The most thorough documentation of the design decisions underlying Slang can be found in Yong He's CMU Ph.D. dissertation.

Y. He, K. Fatahalian, T. Foley
Slang: language mechanisms for extensible real-time shading systems
ACM Transactions on Graphics (Proceedings of SIGGRAPH 2018)

Shader Components: modular shader development, while retaining efficient parameter binding. Modern game engines seek to balance the conflicting goals of high rendering performance and productive software development. To address the issue of parameter binding performance, while maintaining the modular shader code structure that is desirable in today's high-end game engines, we proposed shader components, a design pattern based on a first-class unit of modularity in a shader program. A shader component encapsulates a unit of shader logic and the parameters that must be bound when that logic is in use. We show that by building sophisticated shaders out of components, we can retain essential aspects of performance (static specialization of the shader logic in use and efficient update of parameters at component granularity) while maintaining the modular shader code structure.

Y. He, T. Foley, T. Hofstee, H. Long, K. Fatahalian
Shader Components: Modular and High Performance Shader Development
ACM Transactions on Graphics (Proceedings of SIGGRAPH 2017)

NVIDIA's Falcor rendering framework has since adopted the shader components design pattern.
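
The shader components pattern can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the paper or from Falcor: the names ShaderComponent, ShaderCache, and bind are hypothetical, "compilation" is replaced by string concatenation, and a dirty flag stands in for a GPU parameter upload. It shows the two performance properties named above: one specialized variant per combination of components, and parameter updates at component granularity.

```python
from dataclasses import dataclass, field


@dataclass
class ShaderComponent:
    """A unit of shader logic plus the parameters that logic needs bound."""
    name: str
    source: str
    params: dict = field(default_factory=dict)
    dirty: bool = True  # True when the parameter block must be re-uploaded

    def set_param(self, key, value):
        self.params[key] = value
        self.dirty = True


class ShaderCache:
    """Caches one specialized variant per combination of component names."""

    def __init__(self):
        self.variants = {}
        self.compiles = 0

    def get_variant(self, components):
        key = tuple(c.name for c in components)
        if key not in self.variants:
            # Stand-in for real compilation: concatenate the component sources.
            self.variants[key] = "\n".join(c.source for c in components)
            self.compiles += 1
        return self.variants[key]


def bind(components):
    """Upload only the parameter blocks that changed; return their names."""
    uploaded = [c.name for c in components if c.dirty]
    for c in components:
        c.dirty = False
    return uploaded
```

In this sketch, changing one material parameter marks only the material component as dirty, so the next call to bind re-uploads a single parameter block and the cached shader variant is reused.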

Spire: A new real-time graphics shading language that enables rapid exploration of shader optimization choices. We are developing new graphics programming abstractions and compiler frameworks for graphics engines of the future -- engines that will deliver rich virtual worlds to platforms ranging from beefy GPUs powering VR headsets to energy-limited operation on mobile devices. Specifically, our language extends concepts from prior work in rate-based shader programming with new language features that expand the scope of shader execution beyond traditional GPU hardware pipelines, and enable a diverse set of shader optimizations to be described by a single mechanism: the placement of overloaded shader terms at various spatio-temporal computation rates provided by the pipeline. We demonstrate use of this language and compiler framework to author complex shaders for different rendering pipelines and rapidly explore shader optimization decisions that impact logic spanning CPU, GPU, and preprocessing computations. We further demonstrate the utility of the proposed system by developing a shader level-of-detail library and shader auto-tuning system on top of its abstractions, and demonstrate rapid re-optimization of shaders for different target hardware platforms.

Y. He, T. Foley, K. Fatahalian
A System for Rapid Exploration of Shader Optimization Choices
ACM Transactions on Graphics (Proceedings of SIGGRAPH 2016)

Code and examples on Github. However, our work on Spire has been subsumed by our more recent work on Slang. (See links above.)

Automatic Shader Level-of-Detail. We developed a fully automatic, end-to-end system for generating level-of-detail policies for unmodified GLSL input shaders. The system operates on shaders used in both forward and deferred rendering pipelines, requires no additional semantic information beyond input shader source code, and in only seconds to minutes generates LOD policies (consisting of simplified shaders, the desired LOD distance set, and transition generation) with performance and quality characteristics comparable to custom hand-authored solutions. Our design contributes new shader simplification transforms such as approximate common subexpression elimination and movement of GPU logic to parameter bind-time processing on the CPU, and it uses a greedy search algorithm that employs extensive caching and upfront collection of input shader statistics to rapidly identify simplified shaders with desirable performance-quality trade-offs.

Y. He, T. Foley, N. Tatarchuk, K. Fatahalian
A System for Rapid, Automatic Shader Level-of-Detail
ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2015)

Automatically Scheduling Halide Programs. In recent years, the Halide image processing language has proven to be an effective system for authoring high-performance image processing code, as evidenced by its use at companies like Google to author popular computational photography applications used in hundreds of millions of smartphones. However, although Halide enables programmers to work more quickly, obtaining high performance still requires programmers to have expertise in modern code optimization techniques and hardware architectures. We have developed an algorithm for automatically generating high-performance implementations of Halide image processing programs. In seconds, the algorithm generates schedules for a wide set of image processing benchmarks that are competitive with (and often better than) schedules manually authored by expert Halide developers on both server and mobile platforms. (This project is also part of the NSF/VEC Visual Computing Database project, IIS-1539069.)

R. T. Mullapudi, J. Ragan-Kelley, A. Adams, D. Sharlet, K. Fatahalian
Automatically Scheduling Halide Image Processing Pipelines
ACM Transactions on Graphics (Proceedings of SIGGRAPH 2016)

The autoscheduler is now available to the public as part of the mainline Halide distribution. Please see the Halide website or source on Github. Tutorials 21 and 22 describe usage of the autoscheduler.
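
To make concrete what a schedule is, here is a minimal pure-Python sketch of one two-stage 1D blur run under two hand-written schedules. It is not Halide code and it is not the autoscheduler's algorithm; the function names and the tile size are assumptions chosen for illustration. Both schedules produce the same output and differ only in loop structure, intermediate storage, and redundant work, which is the kind of choice the autoscheduler makes automatically.

```python
def blur3(src, i):
    """Algorithm: 3-tap box filter at index i (valid for 1 <= i <= len(src) - 2)."""
    return (src[i - 1] + src[i] + src[i + 1]) / 3.0


def schedule_breadth_first(inp):
    """Compute all of stage 1, store it, then compute all of stage 2."""
    n = len(inp)
    # stage1[k] is the blur centered at inp[k + 1].
    stage1 = [blur3(inp, i) for i in range(1, n - 1)]
    return [blur3(stage1, k) for k in range(1, len(stage1) - 1)]


def schedule_tiled(inp, tile=4):
    """Compute stage 2 tile by tile, recomputing the stage-1 values each tile needs."""
    n = len(inp)
    out_len = n - 4
    out = []
    for start in range(0, out_len, tile):
        end = min(start + tile, out_len)
        # Output j reads stage1[j .. j + 2], so this tile needs stage1[start .. end + 1].
        local = [blur3(inp, k + 1) for k in range(start, end + 2)]
        out.extend(blur3(local, j - start + 1) for j in range(start, end))
    return out


signal = [float((i * i) % 7) for i in range(12)]
same = schedule_breadth_first(signal) == schedule_tiled(signal)  # True
```

The breadth-first schedule stores the whole intermediate stage, while the tiled schedule keeps only tile + 2 intermediate values live at the cost of recomputing two stage-1 values at each tile boundary.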

Data Specialization of a Staged Lambda Calculus. Staged computations, i.e., computations that can be structured into multiple phases, are common in graphics. For example, the graphics pipeline has a number of stages, each of which operates at a different rate (e.g., primitives, vertices, fragments, pixels). Another example is a ray tracer, where the first stage of computation constructs an acceleration structure, and the second stage uses the acceleration structure to rapidly trace rays. We have developed a "splitting" algorithm (also called data specialization) that automatically splits a staged program (where computation in each of the stages may be interleaved) into a set of unstaged programs that run entirely within each pipeline stage. Unlike prior work in graphics that used staged shader programming languages (e.g., shading languages such as RTSL or Spark), our algorithm is able to perform this splitting transformation on a full lambda calculus. One outcome of this algorithm is that, for a certain class of recursive program inputs, the staged language compiler is able to automatically synthesize a data structure (like a bounding volume hierarchy tree in a ray tracer) that can be used to improve the asymptotic complexity of the original program.

N. Feltman, C. Angiuli, U. Acar, K. Fatahalian
Automatically Splitting a Two-Stage Lambda Calculus
European Symposium on Programming (ESOP) 2016
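
As a hand-written illustration of the input and output of splitting (this is not the algorithm from the paper, and all function names are hypothetical), consider a membership test whose list argument is known in stage one and whose query arrives only in stage two. The first function interleaves work from both stages; the other two are the unstaged programs a split could yield, with the sorted list playing the role of the data structure passed between stages.

```python
from bisect import bisect_left


def contains_staged(xs, q):
    """Original two-stage program: xs is known in stage one, q only in stage two."""
    ordered = sorted(xs)            # depends only on stage-one data
    i = bisect_left(ordered, q)     # depends on stage-two data
    return i < len(ordered) and ordered[i] == q


def contains_stage1(xs):
    """Stage-one program: runs once and returns the data passed across stages."""
    return sorted(xs)


def contains_stage2(ordered, q):
    """Stage-two program: runs per query, using only the precomputed data."""
    i = bisect_left(ordered, q)
    return i < len(ordered) and ordered[i] == q


ordered = contains_stage1([7, 3, 9, 1])
results = [contains_stage2(ordered, q) for q in (3, 4, 9)]  # [True, False, True]
```

Running the staged program once per query sorts the list every time, while the split version sorts once and answers each later query with a binary search.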

Adaptive, Multi-Rate Shading. Due to complex shaders and high-resolution displays (particularly on mobile graphics platforms), fragment shading often dominates the cost of rendering in games. To reduce the cost of shading on GPUs, we developed new algorithms for robustly determining when parts of the shading function can be evaluated at coarser than per-pixel rates without significant loss of visual quality. The solution required modification of the GPU pipeline to support multiple rates of fragment shader execution as well as the design of a new shading language for productively authoring multi-rate shaders.

Y. He, Y. Gu, K. Fatahalian
Extending the Graphics Pipeline with Adaptive, Multi-Rate Shading
ACM Transactions on Graphics (Proceedings of SIGGRAPH 2014)
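
The basic idea of multi-rate evaluation can be sketched on the CPU as follows. This is an illustration only, not the pipeline or shading language from the paper: the function name shade_multirate, the fixed block size, and the split of the shading function into a coarse_term and a fine_term are assumptions, and the adaptive decision of when coarse evaluation is safe is omitted.

```python
def shade_multirate(width, height, coarse_term, fine_term, block=2):
    """Evaluate coarse_term once per block x block tile and fine_term once per pixel."""
    image = [[0.0] * width for _ in range(height)]
    coarse_evals = 0
    for by in range(0, height, block):
        for bx in range(0, width, block):
            # Sample the coarse term at the nominal tile center and reuse it
            # for every pixel in the tile.
            c = coarse_term(bx + block / 2.0, by + block / 2.0)
            coarse_evals += 1
            for y in range(by, min(by + block, height)):
                for x in range(bx, min(bx + block, width)):
                    image[y][x] = c + fine_term(x + 0.5, y + 0.5)
    return image, coarse_evals


# A 4x4 image with 2x2 tiles evaluates the coarse term 4 times instead of 16.
image, coarse_evals = shade_multirate(4, 4, lambda x, y: 0.25 * x, lambda x, y: 1.0)
```

If the coarse term is the expensive, slowly varying part of the shader, the per-pixel cost drops to the fine term plus a quarter of a coarse evaluation in this configuration.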

Self-Refining Games. Automatic system optimization need not be limited to optimizing the code of a graphics system; it also encompasses optimization and generation of the content used in the interactive system. In this project we asked the question: how can a system automatically determine the most important parts of an interactive experience, so that these parts can be presented to the user with the highest-quality graphics? We find that by recording statistics of users playing a game, we can build a model of user behavior, and then concentrate large-scale, cloud-based precomputation of graphics and physics around the states that users are most likely to encounter. The result is a self-refining game whose dynamics improve with play, ultimately providing realistically rendered, rich fluid dynamics in real time on a mobile device.

M. Stanton, B. Humberston, B. Kase, J. O'Brien, K. Fatahalian, A. Treuille
Self-Refining Games using Player Analytics
ACM Transactions on Graphics (Proceedings of SIGGRAPH 2014)

Education and Curriculum Development

15-418/618: Parallel Computer Architecture and Programming. Only a few years ago, parallel programming was considered an advanced skill mastered only by elite software developers. Today hundreds of thousands of programmers have written massively-parallel programs for GPUs. There are reasons for this. First, high-performance graphics is an engaging way to introduce parallel computing topics. Second, it is engaging to introduce parallel computing using familiar applications running on multi-core CPUs, GPUs, mobile SoCs, and the cloud. We have updated the curriculum of CMU's course 15-418/618: Parallel Computer Architecture and Programming to embody these principles. All lecture materials, recorded videos, and programming assignments/exercises are online and open for the public to view on the course web site. NVIDIA, Intel, Qualcomm, Apple, and Amazon have all provided direct support for this course through gifts and by sending representatives to judge student final projects. In addition to CMU offerings, the course was also given by PI Fatahalian at Tsinghua University in Summer 2017.

15-769: Visual Computing Systems. To encourage collaboration between the computer systems and computer graphics communities at CMU, I have created a new course called Visual Computing Systems. This unique course covers fundamental algorithms and hardware architectures used in real-time graphics pipelines, camera image processing pipelines, and emerging systems for real-time vision. The course advocates for a holistic view of system design and promotes the value of co-design of new algorithms and parallel architectures based on deep knowledge of visual computing characteristics. Students are encouraged to work together on cross-disciplinary projects. All lecture materials are online and available to the public.


This project is supported by the National Science Foundation's CAREER Award Program:

Proposal: IIS-1253530
Title: CAREER: Automatic Generation of High Efficiency Graphics Systems
PI: Kayvon Fatahalian, Carnegie Mellon University

Funding for research assistants working on the project was also provided by grants and gifts from NVIDIA, Intel, and Google.

Last updated June 2018.