Don’t stare at me

Virtual camera rotation with face tracking on live stitched stream

Vincent Jordan
5 min read · Feb 9, 2021

Using simple OpenGL 3D view rotation

Previously

Do you remember inastitch, the video stitcher for live stream?

Quick summary: inastitch merges three live video streams into a single wider stream.

In that previous project, homography of the input streams was achieved in 2D by warping pixels directly in an OpenGL pixel shader. This article introduces homography in 3D with a vertex shader, which makes it very easy to rotate the view.

Concave camera setup

The first evolution from the previous demo is to replace the convex camera setup with a concave camera setup.

Concave three camera setup for inastitch

This step is not required for rotation to work, but it improves stitching quality by reducing the translation between the cameras and the center of rotation. Result: stitching works better at short range (especially on a desk).
➥ See this article about why translation should be avoided when stitching panorama pictures.

Stitching in 3D space

The second and main evolution compared to the previous version is found in rendering itself. Stitching is now done by scaling and rotating each input camera in 3D in order to achieve the same warp operation previously obtained by moving pixels directly.

OpenCV’s stitching_detailed is the sample implementation (docs here) used as a model for stitching calibration (i.e., finding the transformation matrices).

Like before, the transformation matrices are used in OpenGL instead of OpenCV for better performance.

Getting K and R

stitching_detailed shows how the homography matrix is a combination of a matrix of camera-specific parameters (K) and a matrix of camera rotation (R).
➥ Detailed explanations are found in a nice OpenCV homography tutorial.
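As a reminder from that tutorial: for a camera undergoing a pure rotation R (no translation), the homography mapping one image onto the other is H = K · R · K⁻¹, which is why the calibration only needs to recover K and R for each camera.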

// Find camera K and R matrices for each input image
std::vector<cv::detail::CameraParams> cameraParams(inputImageCount);
auto estimator = cv::detail::HomographyBasedEstimator();
const bool isSuccess = estimator(
    features,        // IN: image features for each cam (e.g. SIFT)
    pairwiseMatches, // IN: feature matches between images
    cameraParams     // OUT: K and R matrices for each cam
);
for(uint32_t imgIdx = 0; imgIdx < inputImageCount; imgIdx++)
{
    std::cout << "Cam" << imgIdx << ": K=" << cameraParams.at(imgIdx).K() << std::endl;
    std::cout << "Cam" << imgIdx << ": R=" << cameraParams.at(imgIdx).R << std::endl;
}
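
For context, the features and pairwiseMatches inputs come from OpenCV’s feature detection and pairwise matching step, roughly as in stitching_detailed. A minimal sketch (inputImages and the matcher parameters are assumptions; SIFT needs OpenCV ≥ 4.4 or the contrib build):

#include <opencv2/features2d.hpp>
#include <opencv2/stitching/detail/matchers.hpp>

// Detect features in every input image (inputImages: std::vector<cv::Mat>)
std::vector<cv::detail::ImageFeatures> features(inputImageCount);
cv::detail::computeImageFeatures(cv::SIFT::create(), inputImages, features);

// Match features between all image pairs
std::vector<cv::detail::MatchesInfo> pairwiseMatches;
cv::detail::BestOf2NearestMatcher matcher(false /*tryUseGpu*/, 0.3f /*matchConf*/);
matcher(features, pairwiseMatches);
matcher.collectGarbage();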

The matrix K depends on the camera and lens. It remains the same unless the focus is changed (i.e., by screwing/unscrewing the camera lens). In other words, it is intrinsic to the camera: it does not change with what is captured by the camera or with its position.
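
As a reminder of what K contains (standard pinhole camera model, not inastitch-specific code):

//     | fx   0  cx |   fx, fy: focal length in pixels
// K = |  0  fy  cy |   cx, cy: principal point (usually the image center)
//     |  0   0   1 |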

The matrix R depends on the rotation of the camera. It changes with its position in the scene: it is extrinsic to the camera. Stitching can happen closer to or further from the camera, which also implies a different rotation (even if the physical angle of the camera on the 3D-printed holder does not change).

Using K and R in OpenGL

camera setup

Camera texture 1 (center image) is used as a reference (rotation = 0°).
Cameras 2 and 3 are rotated by about 30° on the Y-axis in each direction.

After the OpenCV computation of the K and R matrices, they map to OpenGL as follows:

The scale of each input camera depends on the focal distance. This value is found in the camera matrix K:

// K[0][0] is the focal length in pixels (see matrix K above)
const float focalDistance = kMatrices.at(camIdx)[0][0];
const float invFocalDistance = 1.0f / focalDistance;
// Scale the camera quad by 1/focal; the negative value mirrors it on the X axis
openGlModelMat[camIdx] = glm::scale(centerShift, glm::vec3(-invFocalDistance, invFocalDistance, 1.0f));

The rotation of each input camera is directly the rotation matrix R:

openGlViewMat[camIdx] = rMatrices.at(camIdx);
// Note: this is simplified; keep in mind OpenCV and OpenGL matrices
// are stored in different formats.
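
For illustration, one way to handle that difference (a sketch, not the actual inastitch code): cv::Mat is row-major, while glm::mat4 is column-major, so the elements have to be mapped explicitly.

#include <opencv2/core.hpp>
#include <glm/glm.hpp>

// Sketch: copy a 3x3 OpenCV rotation matrix into a 4x4 GLM matrix.
// glm::mat4 is indexed [column][row]; cv::Mat is accessed (row, col).
glm::mat4 cvRotationToGlm(const cv::Mat& rCv) // assuming a 3x3 CV_64F matrix
{
    glm::mat4 rGl(1.0f); // start from identity (keeps the 4th row/column)
    for (int row = 0; row < 3; row++) {
        for (int col = 0; col < 3; col++) {
            rGl[col][row] = static_cast<float>(rCv.at<double>(row, col));
        }
    }
    return rGl;
}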

For fun, you can recover the rotation vector by using Rodrigues:

// Turn the rotation matrix into a rotation vector
cv::Mat rotationVect;
cv::Rodrigues(rMatricesCv.at(camIdx), rotationVect);
// transform to degrees for printing
const double RAD_TO_DEG = 180.0 / CV_PI;
const cv::Mat rotationVectDeg = rotationVect * RAD_TO_DEG;
std::cout << "RotationVect(degree)=" << rotationVectDeg << std::endl;

Example of result:

RotationVect(degree)=[-1.3412853; -33.804646; 3.6750107]

This matches the camera angle of 30 degrees (see the 3D-printed model).

Rotation on the Y axis

Just add an additional rotation on the Y axis to the scene (i.e., to OpenGL’s view matrix) and it simulates a rotation of the camera point of view.
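
A minimal sketch of that extra rotation with GLM (yawAngleRad and viewMat are illustrative names, not the actual inastitch variables):

#include <glm/gtc/matrix_transform.hpp>

// Extra rotation of the whole scene around the Y axis
const glm::mat4 yawMat = glm::rotate(
    glm::mat4(1.0f),
    yawAngleRad,                   // angle from the keyboard (or the face, below)
    glm::vec3(0.0f, 1.0f, 0.0f));  // Y axis
const glm::mat4 viewMat = yawMat * openGlViewMat[camIdx];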

Here I rotate manually with the keyboard:

keyboard rotation

…but why use a keyboard when you can use your face?

Rotation with face tracking

Face detection is easily achieved with the dlib library.
➥ See dlib sample code.
Note: the face needs to be detected in each camera image and its position corrected with the R matrix (computed previously).
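
For illustration, a minimal detection-to-angle sketch with dlib (cvFrame, maxYawRad and the linear mapping are assumptions; the per-camera correction with R mentioned above is omitted):

#include <dlib/image_processing/frontal_face_detector.h>
#include <dlib/opencv.h>
#include <opencv2/core.hpp>

dlib::frontal_face_detector detector = dlib::get_frontal_face_detector();

// Wrap the OpenCV frame (BGR) for dlib without copying
dlib::cv_image<dlib::bgr_pixel> dlibImg(cvFrame);
std::vector<dlib::rectangle> faces = detector(dlibImg);
if (!faces.empty())
{
    // Horizontal face center, normalized to [-0.5, 0.5]
    const double faceX = (faces[0].left() + faces[0].right()) / 2.0;
    const double normX = faceX / cvFrame.cols - 0.5;
    // Map it linearly to a yaw angle for the view matrix
    yawAngleRad = static_cast<float>(normX * maxYawRad);
}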

final result

Some comments about this video:

  • Latency is low (it can be used as a webcam with v4l2loopback); this video was recorded live (no post-processing) at 30 fps
  • Once calibration is done (with OpenCV), it only needs OpenGL and can run on a Raspberry Pi (but face detection would need external acceleration)
  • This demo only works with one face
  • Yes, it snows outside

Conclusion

If you want a more dynamic camera for your video meetings, you can build one with a bunch of synchronized Raspberry Pi cameras and a mini OpenGL engine.

Use ideas

Meeting camera: with face recognition (example with dlib here), it is possible to follow a specific face among others (as opposed to only detecting them like in this demo).
Bonus point for tracking the speaking face.

TikTok camera (idea from Arnaud): with image segmentation, track the head instead of just the face, and let the camera follow you while you record your next dance mini-clip.

the end
