Hardware-accelerated video stitching on GPU

Vincent Jordan
Published in Level Up Coding · 5 min read · Oct 29, 2020


Stitching with OpenCV is all nice, but what if you wanted to do it all in hardware, on the GPU, with vanilla OpenGL?

Introduction

OpenCV comes with an advanced sample implementation that produces great results on still images. However, running this program on every single frame of a video stream is, unsurprisingly, extremely slow. This is the solution that was used for 360° video with Raspberry Pis.

This article follows the one mentioned above, where video frames were transformed on the CPU with OpenCV, and introduces a full GPU pipeline.

It should be noted that OpenCV has GPU support for many operations, but enabling it still results in many inefficient copies of data back and forth between the CPU and the GPU.

Why OpenGL instead of OpenCV?

To achieve low-latency, real-time video stitching, the OpenCV 2D pixel transformation is replaced with a mini OpenGL 3D engine.

It has multiple benefits:

  • The camera ISP or hardware video decoder can deliver video frames directly to an OpenGL texture in the GPU (avoiding a copy of the buffer through the CPU)
  • The GPU has dedicated hardware acceleration for pixel processing and texture sampling
  • The stitched frame is already in the GPU and can be pushed to the display with almost no latency
  • OpenGL is an open standard and has good support on most embedded targets

It has drawbacks too:

  • Reading back an OpenGL framebuffer to CPU memory is often slow and inefficient
  • OpenGL rendering capabilities (esp. OpenGL ES) often depend on the attached screen (e.g., rendering cannot be faster than 60fps)
  • OpenGL’s API is large and complex to understand

An architecture for embedded video stitching

Stitching pipeline involving CPU, video hardware and GPU

🅐 — The CPU reads the stream from an MJPEG file (i.e., concatenated JPEG files).
🅑 — The JPEG parser finds individual frames in the stream and associates them with their capture timestamps.
🅒 — The hardware JPEG decoder produces a bitmap frame into an OpenGL texture buffer.
🅓 — The GPU samples each texture to apply the 2D perspective transformation (with a pixel shader).
🅔 — The GPU output framebuffer is read back by the JPEG encoding hardware.
🅕 — The CPU gets a callback and appends the JPEG buffer to the output file.
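
To illustrate step 🅑, here is a minimal sketch (not the actual inastitch parser; the function name is made up) of how individual JPEG frames can be located in an MJPEG byte stream by scanning for the JPEG start-of-image (FF D8) and end-of-image (FF D9) markers:

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Minimal sketch: split a buffer of concatenated JPEG files (MJPEG) into
// individual frames by scanning for the SOI (FF D8) and EOI (FF D9) markers.
// A real parser is more careful, e.g. embedded EXIF thumbnails contain
// their own SOI/EOI pairs.
std::vector<std::pair<std::size_t, std::size_t>> findJpegFrames(const std::uint8_t* buf, std::size_t len) {
    std::vector<std::pair<std::size_t, std::size_t>> frames; // (offset, size) per frame
    std::size_t start = 0;
    bool inFrame = false;
    for (std::size_t i = 0; i + 1 < len; ++i) {
        if (!inFrame && buf[i] == 0xFF && buf[i + 1] == 0xD8) {
            start = i;
            inFrame = true;
        } else if (inFrame && buf[i] == 0xFF && buf[i + 1] == 0xD9) {
            frames.emplace_back(start, i + 2 - start);
            inFrame = false;
        }
    }
    return frames;
}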

Introducing Inatech’s open-source stitcher

inastitch is an open-source project that aims to implement this stitching pipeline:
https://github.com/inastitch/inastitch

Note: at the time of writing this article, hardware JPEG decoding is not implemented, and decoding/encoding is done on the CPU with libturbojpeg (still pretty fast, thanks to CPU vectorization).
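
For reference, here is a minimal sketch of what CPU decoding with the TurboJPEG API looks like (error handling omitted; this is not the actual inastitch code):

#include <turbojpeg.h>
#include <vector>

// Minimal sketch: decode one JPEG frame into an RGB buffer with libturbojpeg.
std::vector<unsigned char> decodeJpeg(const unsigned char* jpegBuf, unsigned long jpegSize) {
    tjhandle tj = tjInitDecompress();
    int width = 0, height = 0, subsamp = 0, colorspace = 0;
    tjDecompressHeader3(tj, jpegBuf, jpegSize, &width, &height, &subsamp, &colorspace);
    std::vector<unsigned char> rgb(static_cast<std::size_t>(width) * height * 3);
    // Pitch 0 means "width * bytes per pixel"; TJPF_RGB is 3 bytes per pixel.
    tjDecompress2(tj, jpegBuf, jpegSize, rgb.data(), width, 0 /* pitch */, height, TJPF_RGB, 0 /* flags */);
    tjDestroy(tj);
    return rgb;
}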

Inside inastitch

inastitch takes as input a set of video streams with their frame timestamps, as well as a 2D transformation matrix for each stream.

The synchronized video streams and timestamps are generated by a modified Raspberry Pi camera tool: raspivid-inatech
https://github.com/inastitch/raspivid-inatech

Calibration tool

The transformation matrix is generated by a calibration tool using OpenCV libraries: inastitch_cal
https://github.com/inastitch/inastitch/tree/master/tools/calibration

The purpose of the calibration phase is to find the homography matrix.

The homography matrix is the transformation that warps the point of view of a second camera to make it look like the point of view of the first camera.
See “What is a homography matrix?” in the OpenCV homography tutorial.

Note: In the calibration tool, the left image is kept as the reference point of view (the first camera) and the transformation matrix applies to the right image (the second camera).
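
As a rough illustration of what such a calibration step does (a sketch with OpenCV, not the actual inastitch_cal code), a homography mapping points of the right image into the coordinate frame of the left image can be estimated from matched feature points:

#include <opencv2/calib3d.hpp>
#include <opencv2/features2d.hpp>
#include <vector>

// Minimal sketch: estimate the homography mapping the right view onto the left view.
cv::Mat estimateHomography(const cv::Mat& leftGray, const cv::Mat& rightGray) {
    // Detect and describe features in both images.
    auto orb = cv::ORB::create();
    std::vector<cv::KeyPoint> kpLeft, kpRight;
    cv::Mat descLeft, descRight;
    orb->detectAndCompute(leftGray, cv::noArray(), kpLeft, descLeft);
    orb->detectAndCompute(rightGray, cv::noArray(), kpRight, descRight);

    // Match descriptors (Hamming distance for ORB's binary descriptors).
    cv::BFMatcher matcher(cv::NORM_HAMMING, /*crossCheck=*/true);
    std::vector<cv::DMatch> matches;
    matcher.match(descRight, descLeft, matches);

    // Collect matched point pairs and fit a homography with RANSAC.
    std::vector<cv::Point2f> ptsRight, ptsLeft;
    for (const auto& m : matches) {
        ptsRight.push_back(kpRight[m.queryIdx].pt);
        ptsLeft.push_back(kpLeft[m.trainIdx].pt);
    }
    return cv::findHomography(ptsRight, ptsLeft, cv::RANSAC);
}

Depending on whether the matrix is applied to output pixels or to texture lookups, it may need to be inverted and rescaled before being handed to the renderer.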

Description of the OpenGL rendering scene

Each video frame is turned into a texture, then bound to a simple two-triangle flat rectangle, so that matching frames are rendered side by side.
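
As an illustration (hypothetical names, not the actual inastitch geometry), such a flat rectangle can be described by four vertices with interleaved positions and texture coordinates, drawn as a triangle strip:

#include <GLES2/gl2.h>

// Hypothetical sketch: a flat rectangle made of two triangles, with
// interleaved position (x, y) and texture coordinates (s, t).
// Drawn with glDrawArrays(GL_TRIANGLE_STRIP, 0, 4) after binding the frame texture.
static const GLfloat quad[] = {
    // x,     y,    s,    t
    -1.0f, -1.0f, 0.0f, 1.0f,  // bottom-left
     1.0f, -1.0f, 1.0f, 1.0f,  // bottom-right
    -1.0f,  1.0f, 0.0f, 0.0f,  // top-left
     1.0f,  1.0f, 1.0f, 0.0f,  // top-right
};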

The perspective 2D transformation is applied to the texture with a very simple pixel shader:

varying vec2 texCoordVar;
uniform sampler2D texture1;
uniform mat3 warp; // homography matrix from OpenCV, normalized to OpenGL coordinates

void main() {
    // Warp the texture coordinate in homogeneous coordinates
    vec3 dst = warp * vec3(texCoordVar.x + 1.0, texCoordVar.y, 1.0);
    // Perspective divide, then sample the video frame at the warped position
    gl_FragColor = texture2D(texture1, vec2(dst.x / dst.z, dst.y / dst.z));
}

warp is the homography matrix from OpenCV, normalized to OpenGL coordinates.
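
For illustration, here is one possible way (a sketch, not the actual inastitch code) to upload such a 3×3 matrix to the warp uniform, assuming the normalization to OpenGL coordinates has already been applied:

#include <GLES2/gl2.h>
#include <opencv2/core.hpp>

// Minimal sketch: upload a 3x3 homography (already normalized to texture
// coordinates) as the 'warp' uniform of the fragment shader above.
void setWarpUniform(GLuint program, const cv::Mat& H /* 3x3, CV_64F */) {
    cv::Mat Hf;
    H.convertTo(Hf, CV_32F); // OpenCV stores doubles; the shader wants floats
    cv::Mat Hcol = Hf.t();   // OpenCV is row-major, glUniformMatrix3fv expects column-major
    const GLint loc = glGetUniformLocation(program, "warp");
    // In OpenGL ES 2.0 the 'transpose' argument must be GL_FALSE, hence the manual transpose.
    glUniformMatrix3fv(loc, 1, GL_FALSE, Hcol.ptr<GLfloat>(0));
}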

It’s demo time!

Here is a demo of inastitch stitching three synchronized streams recorded at 100fps.

First test drive with inastitch

The same video with inside view:

A portion of the same video playing at 25fps instead of 100fps (i.e., 4× slow motion):

Note: YouTube scaled the original video down to 60fps.

Conclusion

For stitching still images, OpenCV is great and very customizable, but it is not well suited to video processing, where the whole pipeline needs hardware acceleration.
A mini 3D engine in OpenGL ES is a better choice here, with standardized support for hardware acceleration from embedded SoCs to datacenter GPUs.
Of course, Vulkan would be even nicer than OpenGL, but at the cost of being compatible with less GPU hardware.

Next step

The next step is real-time stitching using the same GPU pipeline.

The main challenge here will be to use live video streams over Ethernet. Synchronized video frames need to be delivered together and with consistent latency to avoid buffering on the stitcher side of the network. A plain Ethernet network has no support for such a use case.

Time-Sensitive Networking (TSN) is a set of extensions to Ethernet designed to address the issues of time-sensitive packet delivery.

Going further

Use 3D transformation instead of 2D

You may have noticed that inastitch does a 2D perspective transformation in a texture using a pixel shader, instead of a 3D rendering with a vertex shader producing the same perspective, which would be a better and simpler use of the GPU hardware.

One idea is to decompose the homography matrix into simpler 3D rotations and translations, which could then be applied to the vertices of the video rectangle.

See “Demo 4: Decompose the homography matrix” in the OpenCV homography tutorial, and the function cv::decomposeHomographyMat.
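
For reference, a minimal sketch of that decomposition with OpenCV (the camera intrinsic matrix K must be known; selecting the physically valid solution among the candidates is left out):

#include <opencv2/calib3d.hpp>
#include <vector>

// Minimal sketch: decompose a homography into candidate rotations and translations.
// K is the 3x3 camera intrinsic matrix; several physically possible solutions are returned.
void decomposeWarp(const cv::Mat& H, const cv::Mat& K) {
    std::vector<cv::Mat> rotations, translations, normals;
    int solutions = cv::decomposeHomographyMat(H, K, rotations, translations, normals);
    // Each rotations[i]/translations[i] pair could then be applied to the
    // vertices of the video rectangle instead of warping pixels in the shader.
    (void)solutions;
}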

Better stitching

As explained in detail in the previous article, since the camera transformation is not a pure rotation, the stitching cannot be perfect. Seam blending and masking can be performed efficiently in OpenGL using the alpha channel and a mesh for the video texture.
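
On the OpenGL side, enabling standard alpha blending is only a couple of state calls; the actual masking logic would live in the alpha values of the mesh or texture (a sketch, not a complete blending implementation):

#include <GLES2/gl2.h>

// Minimal sketch: enable standard alpha blending before drawing the
// overlapping video quads; the alpha channel then acts as the seam mask.
void enableSeamBlending() {
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
}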

Make an automotive-friendly prototype

inastitch would make a nice rear-view mirror, wouldn’t it?
Here is some advice:

Beware the dog!
