GPUs: the next must-have

21st October 2013
Nat Bowers

Like the CPU or video processing unit found in today’s consumer and embedded systems, the GPU has turned from a nice-to-have feature in previous generations into a must-have necessity for any competitive solution in its target application. Benson Tao, Product Marketing Manager at Vivante Corporation, explores further in this ES Design magazine article.

Architected as a massively parallel SIMD (single instruction, multiple data) processing engine built for parallel workloads, a graphics processing unit or GPU is a specialised visual processor that is designed to accelerate graphics applications by rendering and creating images; 3D games and dynamic graphical user interfaces (GUIs) are examples of GPU workloads.

Graphics is the best-known use of parallelism, due to the need to process billions of pixels or vertices concurrently in every frame. Deep down, GPUs process independent vertices, primitives and fragments (pixels) in great numbers using shader units (SIMD processing units). A shader is a basic computing element that executes graphics programs on a per-vertex, per-pixel or other primitive basis. Vertex programs manipulate object properties to enable fine-grained control over positions, movement, vertex lighting and colour. Pixel or fragment programs compute pixel colour, shadows, object texture and lighting, and can be written to enhance images by adding real-life effects like blurring, edge enhancement or filtering. Other types of shader programs include geometry and tessellation shaders. GPUs can scale from one shader to thousands of shaders, which gives an idea of the parallelism the architecture can achieve. Leading edge, high performance shaders can run at over 1GHz, executing billions of instructions per second. Figure 1 summarises the main architectural differences between a CPU and GPU; heterogeneous systems feature both, partitioning the workload to take advantage of each core’s strengths.

Figure 1: A summary of the main architectural differences between a CPU and GPU
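To make the per-pixel parallelism concrete, here is a minimal sketch of a fragment-style computation, written in CUDA purely for illustration (a mobile GPU would expose the same pattern through an OpenGL ES shader or an OpenCL kernel). Every thread shades one pixel independently; the gradient maths is a hypothetical stand-in for real lighting or texture work.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// One thread shades one pixel, the way a fragment shader does.
__global__ void shadePixels(unsigned char* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Toy "shader": a diagonal gradient standing in for lighting/texture maths.
    out[y * width + x] = (unsigned char)(((x + y) * 255) / (width + height - 2));
}

int main()
{
    const int w = 256, h = 256;
    unsigned char* image;
    cudaMallocManaged(&image, w * h);        // buffer visible to CPU and GPU

    dim3 block(16, 16);                      // 256 threads per block
    dim3 grid((w + 15) / 16, (h + 15) / 16); // enough blocks to cover the image
    shadePixels<<<grid, block>>>(image, w, h);
    cudaDeviceSynchronize();

    printf("centre pixel = %d\n", image[(h / 2) * w + w / 2]);
    cudaFree(image);
    return 0;
}
```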

GPU pipeline

The latest GPUs are built around unified shaders, which make the best use of hardware resources across multiple types of shader programs to balance workloads. Each unified shader core can be assigned to vertex or pixel processing, minimising hardware resource bottlenecks when an image is vertex-heavy or pixel-heavy. In non-unified shader architectures there are fixed vertex (VS) and pixel (PS) shaders, causing stalls when hardware resources are locked in VS-bound or PS-bound cases. From a high level, the mobile GPU pipeline consists of multiple key blocks.

Host Interface and MMU:

  • Communicates with the CPU through the ACE-Lite/AXI/AHB interfaces,
  • Processes commands from the CPU and accesses geometry data from the frame buffer or system memory,
  • Outputs a stream of vertices that is sent to the next stage (vertex shader),
  • Manages all GPU transactions, including instruction/data dispatch to shaders, resource allocation/deallocation, security for secure transactions and data compression.

Programmable Unified Shaders:

  • Vertex shaders (VS) perform vertex transformation, lighting and interpolation, and can range from simple (linear) to complex (morphing effects),
  • Pixel/fragment shaders (PS) compute the final pixel value, taking into account lighting, shadows and other attributes, including texture mapping,
  • Geometry shaders (GS) take a primitive (line, point or triangle) and create additional vertices to increase the level of detail of an object,
  • Tessellation shaders (TS) add a pair of fixed function units, the Hull and Domain shaders, to the TS pipeline. The TS takes a curved surface (rectangular or triangular) and converts it into a polygon representation whose count can be varied to meet quality requirements: a higher count creates more detail on an object, a lower count less.
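As an illustration of the vertex-shader stage described above, the sketch below applies a 4x4 model-view-projection matrix to each vertex of a triangle, one thread per vertex. It is again CUDA rather than a real shading language, and the identity matrix, kernel name and launch shape are illustrative assumptions rather than anything from the article.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

struct Vec4 { float x, y, z, w; };

__constant__ float mvp[16];                  // column-major 4x4 matrix

// Each thread transforms one vertex, independently of all the others.
__global__ void transformVertices(const Vec4* in, Vec4* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    Vec4 v = in[i];
    out[i].x = mvp[0]*v.x + mvp[4]*v.y + mvp[8]*v.z  + mvp[12]*v.w;
    out[i].y = mvp[1]*v.x + mvp[5]*v.y + mvp[9]*v.z  + mvp[13]*v.w;
    out[i].z = mvp[2]*v.x + mvp[6]*v.y + mvp[10]*v.z + mvp[14]*v.w;
    out[i].w = mvp[3]*v.x + mvp[7]*v.y + mvp[11]*v.z + mvp[15]*v.w;
}

int main()
{
    const int n = 3;                         // one triangle
    Vec4 *in, *out;
    cudaMallocManaged(&in, n * sizeof(Vec4));
    cudaMallocManaged(&out, n * sizeof(Vec4));
    in[0] = {0, 0, 0, 1}; in[1] = {1, 0, 0, 1}; in[2] = {0, 1, 0, 1};

    // Identity matrix stands in for a real model-view-projection transform.
    float identity[16] = {1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1};
    cudaMemcpyToSymbol(mvp, identity, sizeof(identity));

    transformVertices<<<1, n>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("v0 = (%.1f, %.1f, %.1f, %.1f)\n", out[0].x, out[0].y, out[0].z, out[0].w);
    cudaFree(in); cudaFree(out);
    return 0;
}
```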

In addition, the rasteriser converts objects from geometric to pixel form and culls (removes) any back-facing or hidden surfaces. The memory interface removes unseen or hidden pixels using the Z-buffer and stencil/alpha tests before writing pixels to the frame buffer, and performs compression of the Z and colour buffers.
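A minimal sketch of that Z-buffer test follows, again in CUDA for illustration only: an incoming fragment is kept only if it is nearer the camera than the depth already stored for its pixel. The buffer sizes and test values are made up for the example.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Per-pixel depth test: a fragment survives only if it is nearer the
// camera than the depth already stored; otherwise it is discarded.
__global__ void depthTest(float* zbuf, unsigned int* color,
                          const float* fragZ, const unsigned int* fragColor,
                          int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (fragZ[i] < zbuf[i]) {     // nearer than what is stored?
        zbuf[i]  = fragZ[i];      // keep the new depth
        color[i] = fragColor[i];  // and write the pixel to the frame buffer
    }
}

int main()
{
    const int n = 4;
    float *zbuf, *fragZ; unsigned int *color, *fragColor;
    cudaMallocManaged(&zbuf, n * sizeof(float));
    cudaMallocManaged(&fragZ, n * sizeof(float));
    cudaMallocManaged(&color, n * sizeof(unsigned int));
    cudaMallocManaged(&fragColor, n * sizeof(unsigned int));
    for (int i = 0; i < n; ++i) {
        zbuf[i] = 1.0f; color[i] = 0;        // cleared depth/frame buffers
        fragZ[i] = (i % 2) ? 0.5f : 2.0f;    // alternate near/far fragments
        fragColor[i] = 0xFFFFFFFF;
    }
    depthTest<<<1, n>>>(zbuf, color, fragZ, fragColor, n);
    cudaDeviceSynchronize();
    for (int i = 0; i < n; ++i) printf("pixel %d: colour %08X\n", i, color[i]);
    return 0;
}
```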

Immediate Mode vs Deferred Rendering

Currently in the market there are two main GPU architectures and methods of rendering an image. Both methods use the same general pipeline described above, but differ in the mechanism they use to draw. One method is called Tile Based Deferred Rendering (TBDR) and the other is Immediate Mode Rendering (IMR). Both have pros and cons based on their respective use cases.

In 1995 (before the days of smartphones and tablets) there were many graphics companies supporting both methods in the PC and game console markets. In the TBDR corner were companies like Intel, Microsoft (Talisman), Matrox, PowerVR, Oak and others. On the IMR side were names like SGI, S3, Nvidia, ATI, 3dfx and many more. By 2012 there were no TBDR survivors left in the PC and game console markets; all PC and console architectures, including the PS3/PS4, Xbox 360/One and Wii, were IMR based. This happened because the IMR architecture allowed game quality and realism to improve to a point where TBDRs could not keep up, owing to architectural bottlenecks.

Fast forward to the mobile market today, which closely mirrors the historical trends of the PC and game console market. In the TBDR camp there are two UK-based companies: Imagination and ARM. On the IMR side there are California-based companies: Vivante, Qualcomm, Nvidia, Intel and AMD. Game and application developers used to seeing their assets run on complex GPUs want to port advanced features to mobile devices. Since mobile devices have stringent limitations on power, thermals and die area, the best way to close the performance gap and minimise application porting or significant code changes is to use an architecture similar to the one used for development. IMR gives developers that choice, in addition to equivalent, high quality PC-like rendering, all packed into a smaller die area than TBDR solutions. IMR also makes better use of internal system and external memory bandwidth as the industry moves to OpenGL ES 3.0, DirectX 11.x and newer application programming interfaces (APIs).

Beyond graphics

Over the last several years, industry and university researchers found that the computational resources of modern GPUs were suitable for certain general parallel computations, thanks to the inherent parallel processing capabilities of the architecture. The calculation speed-up demonstrated on SIMD processors was quickly recognised throughout the industry, and another area of high performance computing (HPC) built on the vast processing power of GPUs was born. GPUs that go beyond graphics are referred to as GPU Compute cores or GPGPUs (general purpose GPUs). Industry standard APIs like OpenCL, Google RenderScript/FilterScript and Microsoft DirectCompute have come to fruition, letting task and instruction parallelism be optimised to take advantage of different processing cores. In the near future, mobile devices will make better use of system resources by offloading the CPU, DSP or custom cores and using the GPU to achieve the highest compute performance, calculation density, time savings and overall system speed-up.
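The article names OpenCL, RenderScript and DirectCompute; the data-parallel pattern is essentially the same in all of them. The sketch below uses CUDA simply as a stand-in to show what offloading an array-wide computation looks like (here SAXPY, y = a*x + y); nothing about it is specific to any vendor mentioned above.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Classic GPGPU building block: each thread computes one array element.
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];       // one element per thread
}

int main()
{
    const int n = 1 << 16;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();
    printf("y[0] = %.1f (expect 4.0)\n", y[0]);
    cudaFree(x); cudaFree(y);
    return 0;
}
```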

The table in Figure 2 shows a comparison of CPU versus GPU when looking at different GPGPU factors including memory latency, thread management and execution parallelism. The winning solution uses a hybrid approach where both types of processors can be tightly interleaved and integrated to hit performance targets.

Figure 2: A comparison of GPGPU performance factors offered by CPUs and GPUs
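A hedged sketch of that hybrid partitioning, with hypothetical workloads: a throughput-heavy, data-parallel task is launched on the GPU while the CPU concurrently runs the kind of branchy, serial work it is better suited for, and the two halves join at a synchronisation point.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Data-parallel half of the frame: massively parallel on the GPU.
__global__ void scaleArray(float* data, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= k;
}

int main()
{
    const int n = 1 << 20;
    float* data;
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    // Kernel launches are asynchronous: the CPU is free immediately.
    scaleArray<<<(n + 255) / 256, 256>>>(data, 2.0f, n);

    // Meanwhile the CPU handles branchy, serial work it is better suited for.
    long long serial = 0;
    for (int i = 0; i < 1000; ++i) serial += (i % 3 == 0) ? i : -i;

    cudaDeviceSynchronize();                 // join the two halves
    printf("gpu: %.1f, cpu: %lld\n", data[0], serial);
    cudaFree(data);
    return 0;
}
```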

Many computational problems like image processing, vision processing, analytics, mathematical calculations and other parallel algorithms map well to the GPU SIMD architecture. The GPU can even be programmed to use a hybrid approach, where a frame uses multiple GPGPU and graphics contexts to improve the user experience. An example would be GPU cores 1 and 2 assigned to render images while cores 3 and 4 are dedicated to GPGPU functions like game enhancements (realistic particle effects such as smoke/water, or game physics to mimic real world motion) or natural user interface (NUI) processing for in-game gesture support. Other areas and markets where GPGPU usage is being adopted include:

  • Feature extraction – this is vital to many vision algorithms since image ‘interest points’ and descriptors need to be created so the GPU knows what to process. SURF (Speeded Up Robust Features) and SIFT are examples of algorithms that can be parallelised effectively on the GPU. Object recognition and sign recognition are forms of this application,
  • Point cloud processing – includes feature extraction to create 3D images to detect shapes and segment objects in a cluttered image. Uses could include adding augmented reality to street view maps,
  • Advanced Driver Assistance Systems (ADAS) – multiple safety features are constantly calculated in real time, including line detection/lane assist (Hough Transform, Sobel/Canny algorithms; see the Sobel sketch after this list), pedestrian detection (Histogram of Oriented Gradients – HOG), image de-warping, blind spot detection and others,
  • Security and surveillance – includes face recognition that goes through face landmark localisation (Haar feature classifiers), face feature extraction, face feature classification and object recognition,
  • Motion processing – natural user interfaces like hand gesture recognition which separates the hand from background (colour space conversion to the HSV colour space) and then performs structural analysis on the hand to process motion,
  • Video processing – HEVC video co-processing using shader programs and high speed integer/floating point computations,
  • Image processing – combining GPGPU with image signal processors (ISP) for a streamlined image processing pipeline.
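As one concrete example from the list above, here is a minimal Sobel edge filter, one thread per output pixel, in CUDA for illustration. A real ADAS pipeline would wrap smoothing, thresholding and non-maximum suppression around this core operation; the synthetic step-edge input is just a test pattern.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// 3x3 Sobel edge filter: each thread computes the gradient at one pixel.
__global__ void sobel(const unsigned char* in, unsigned char* out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1) return; // skip borders

    // Horizontal and vertical Sobel gradients.
    int gx = -in[(y-1)*w + x-1] + in[(y-1)*w + x+1]
             - 2*in[y*w + x-1]  + 2*in[y*w + x+1]
             - in[(y+1)*w + x-1] + in[(y+1)*w + x+1];
    int gy = -in[(y-1)*w + x-1] - 2*in[(y-1)*w + x] - in[(y-1)*w + x+1]
             + in[(y+1)*w + x-1] + 2*in[(y+1)*w + x] + in[(y+1)*w + x+1];

    // Cheap gradient magnitude (|gx| + |gy|), clamped to 8 bits.
    int mag = (gx < 0 ? -gx : gx) + (gy < 0 ? -gy : gy);
    out[y*w + x] = (unsigned char)(mag > 255 ? 255 : mag);
}

int main()
{
    const int w = 64, h = 64;
    unsigned char *in, *out;
    cudaMallocManaged(&in, w * h);
    cudaMallocManaged(&out, w * h);
    for (int i = 0; i < w * h; ++i)
        in[i] = (i % w < w / 2) ? 0 : 255;   // vertical step edge as test input
    cudaMemset(out, 0, w * h);

    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    sobel<<<grid, block>>>(in, out, w, h);
    cudaDeviceSynchronize();
    printf("edge response at step: %d\n", out[(h / 2) * w + w / 2]);
    cudaFree(in); cudaFree(out);
    return 0;
}
```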

The rapid adoption of GPUs in the mobile market, driven by the smartphone revolution, has generated a significant opportunity for vendors in the mobile ecosystem, especially GPU IP providers like Vivante. As consumers and the industry push leading edge technology forward, IP providers need to innovate constantly to keep up with the latest trends, APIs and use cases, all while maximising performance and minimising the power and die area increases that advanced API features demand of a fully programmable GPU. New industry initiatives and markets are looking at ways to bring the latest graphics and GPGPU applications to all consumer products, not just high end ones. Vivante GPUs, specifically designed for mobile applications and architected from the algorithm level to achieve the industry’s smallest integrated design, enable SoC partners and mobile OEMs to support the latest APIs in hardware. These plug-and-play GC cores scale to fit any product, from entry level, high volume mass market devices to high end products that speed up the adoption of the latest technologies.

Author profile: Benson Tao is a Product Marketing Manager at Vivante Corporation, responsible for marketing initiatives, product planning and business development. Tao has over 12 years’ experience in the GPU and video industry, including designing notebook and desktop GPU platforms for OEM systems, as well as low power mobile/handheld and embedded systems.
