Robust Face Segmentation

PUBLISHED ON NOV 28, 2020 — CATEGORIES: utilities
Please note that the system described in this post is a fusion of 2 preexisting Deep Learning systems. For several reasons this post does not provide the actual code, but the given descriptions, references and discussion should provide enough information to get started.


Whenever collecting a dataset with human faces, chances are that these need to be blurred out. This can be e.g. for the sake of anonymization, or to analyze the information that facial expressions provide.

In either case, face segmentation (i.e. detecting the pixels that belong to a face) can be an extremely time-consuming task if done manually. Luckily, Deep Learning has nowadays removed most of this burden. In fact, looking at existing open-source solutions (October 2020), it seems that the field has advanced to more complex scenarios like face parsing (i.e. segmentation of the different facial regions). Furthermore, models are not only able to discriminate but also to generate photorealistic faces, the most notable example being StyleGANs (see


Among the few open-source solutions tested, the following one proved to be the most reliable and robust: Deep Face Segmentation in Extremely Hard Conditions (many thanks to the authors, Yuval Nirkin et al.). The corresponding paper, On Face Segmentation, Face Swapping and Face Perception, can be found online and is also hosted here. Here is a small teaser from the paper (Figure 1): can you recognize who that is?

Given a close-up, well-illuminated image of a face, the system performed quite well against many forms of occlusion. The image at the beginning of this post (from the COCO dataset) is an example of such an ideal output. But there were still 2 technicalities to address:

  • Detection of the faces in a broad and potentially complex context
  • Segmentation of the faces at different (smaller) scales

The first one can be addressed robustly if we first look for human keypoints, and then keep the ones related to the head. Note that, if the model is able to detect any noses, eyes and/or ears, the chances of missing a face (even sideways or slightly from the back) are pretty low. Luckily, I had been doing quite some work on Human Keypoint Estimation (see e.g. here) and already had a state-of-the-art HigherHRNet up and running:

The second issue is a little trickier. In our setup, the distance between the person and the camera was fixed, so fixing the size of the bounding boxes ended up providing quite robust results.

For variable scales, a naive approach is to infer the size of the head from the distances between keypoints, as follows:

  1. Perform keypoint estimation and grouping
  2. For each person, keep all nose, eye and ear keypoints above a confidence threshold $t$.
  3. Center the bounding box on the average of all head keypoints
  4. Assuming a square bounding box, measure the maximal pixel distance $d$ between any 2 head keypoints. Multiply this by a factor to obtain the side length of the box.
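The four steps above could be sketched as follows. This is a minimal NumPy illustration and not the code used for this post: it assumes that step 1 already produced, for each person, an array of `(x, y, confidence)` rows in COCO keypoint order (where indices 0 to 4 are nose, eyes and ears). The names `head_bounding_box`, `conf_threshold` (the $t$ above), `side_factor` and `min_side` are hypothetical, and the default values are placeholders.

```python
import numpy as np

# Assumption: COCO keypoint order, where indices 0-4 are
# nose, left eye, right eye, left ear, right ear.
HEAD_KEYPOINT_IDXS = [0, 1, 2, 3, 4]


def head_bounding_box(keypoints, conf_threshold=0.5, side_factor=2.0,
                      min_side=None):
    """Return a square head box (x0, y0, x1, y1) for one person, or None.

    keypoints: array-like of shape (K, 3) with rows (x, y, confidence).
    conf_threshold, side_factor, min_side: hypothetical tuning parameters.
    """
    kps = np.asarray(keypoints, dtype=float)
    head = kps[HEAD_KEYPOINT_IDXS]
    # Step 2: keep head keypoints above the confidence threshold t.
    head_xy = head[head[:, 2] > conf_threshold, :2]
    if len(head_xy) == 0:
        return None  # no head keypoints found for this person
    # Step 3: center the box on the average of the head keypoints.
    center = head_xy.mean(axis=0)
    # Step 4: maximal pairwise pixel distance d, scaled to a side length.
    diffs = head_xy[:, None, :] - head_xy[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=-1)).max()
    side = side_factor * d
    if min_side is not None:
        side = max(side, min_side)
    if side <= 0:
        return None  # a single keypoint does not determine the box size
    half = side / 2.0
    box = (center[0] - half, center[1] - half,
           center[0] + half, center[1] + half)
    return tuple(float(v) for v in box)
```

With a single detected head keypoint the distance $d$ is zero, so the sketch returns `None` unless a fallback `min_side` is given.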

As you may see, this is not robust against occlusions: if only a single keypoint is detected (e.g. the nose), the side length of the box cannot be determined, and more refined approaches (e.g. head detectors) are needed. This problem is inherent to using keypoint estimation.

Brief Experiment

To illustrate the performance, the system is run below on the COCO 2017 validation dataset, presenting different kinds of scales, occlusions, contexts and faces.

After running the keypoint estimation, segmentations were extracted using a bounding-box radius of 3 times the maximal distance between the detected head keypoints (or at least 20 pixels, whichever is larger). The images in the bounding boxes were resized to (400, 400) pixels and mean-centered by subtracting the following RGB value: (104.00698793, 116.66876762, 122.67891434) before passing them to the face segmentation network. The value was taken from the original code and left unchanged.
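The cropping and normalization just described might look roughly like the sketch below. It is a NumPy-only illustration under several assumptions: `crop_radius` and `preprocess_crop` are hypothetical helper names, the radius rule is read as "3 times the distance, but never below 20 pixels", resizing uses nearest-neighbor sampling to keep the example dependency-free (a real pipeline would more likely use an image library's interpolation), pixels outside the image are filled with the mean value, the image channels are assumed to already be in the same order as the mean vector, and the output is channels-first, as Caffe networks commonly expect.

```python
import numpy as np

# Mean value quoted in the post; the channel order of the input image
# is assumed to match the order of this vector.
MEAN = (104.00698793, 116.66876762, 122.67891434)


def crop_radius(head_xy, factor=3.0, min_radius=20.0):
    """Radius = factor * max pairwise keypoint distance, at least min_radius."""
    pts = np.asarray(head_xy, dtype=float).reshape(-1, 2)
    diffs = pts[:, None, :] - pts[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=-1)).max()
    return float(max(factor * d, min_radius))


def preprocess_crop(image, center, radius, out_size=400, mean=MEAN):
    """Crop a square around center, resize it and subtract the mean.

    image: (H, W, 3) array; center: (x, y) in pixels.
    Returns a float32 array of shape (3, out_size, out_size).
    """
    h, w = image.shape[:2]
    cx, cy = center
    mean = np.asarray(mean, dtype=np.float32)
    # Nearest-neighbor sampling grid covering [c - radius, c + radius).
    offsets = (np.arange(out_size) + 0.5) * (2.0 * radius / out_size)
    xs = np.floor(cx - radius + offsets).astype(int)
    ys = np.floor(cy - radius + offsets).astype(int)
    valid = ((ys >= 0) & (ys < h))[:, None] & ((xs >= 0) & (xs < w))[None, :]
    # Clip indices so the lookup is safe; invalid pixels are replaced below.
    yy = np.clip(ys, 0, h - 1)
    xx = np.clip(xs, 0, w - 1)
    sampled = image[yy[:, None], xx[None, :]].astype(np.float32)
    # Out-of-image pixels get the mean value, i.e. zero after centering.
    crop = np.where(valid[..., None], sampled, mean)
    crop = crop - mean
    # Channels-first layout (C, H, W).
    return np.ascontiguousarray(crop.transpose(2, 0, 1))
```

A batch dimension would still have to be added before feeding the result to a network, and the predicted mask has to be mapped back to the original box coordinates afterwards.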

Qualitative Discussion

As already mentioned, due to the particularities of the face segmentation model, the system is very sensitive to the bounding box size. The following images illustrate various out-of-scale segmentations:

Also, segmentation seems to be sensitive to challenging lighting conditions and low resolutions (pixelated input). For example, changing the normalization RGB vector had a notable impact on performance. This may also be improved with better preprocessing. It remains to be quantified to what extent this could entail biases against people of different genders, ages and skin colors:

In general, there is potential for improvement if both the bounding-box size and preprocessing steps are refined. In our simple and well-lit scene, it turned out to work very well. In case you are considering doing something similar, the following images (also from COCO) provide an idea of the system’s performance on various scenes:

Last but not least, we went for the best quality that can fit on a single 8GB GPU, regardless of runtime. The resulting performance is far from real-time (expect between 1 and 5 fps on an RTX2… series), but still much faster than manual annotation. This was in any case good enough for our purpose of curating/preprocessing a mid-sized dataset.

Deployment Bonus: Docker Image

You may have noted that the face segmentation system runs on Caffe, and the human keypoint estimation system on PyTorch. Not a problem :)

But if you ever tried to get PyCaffe running on your system, chances are that it wasn’t a smooth ride, mainly due to dependency and version compatibility issues.

As of November 2020 (Ubuntu 20.04), the challenge was to coordinate the compilation of Caffe and the face segmentation system together with OpenCV4 and the miniconda environment, while keeping compatibility with relatively recent CUDA and CUDNN versions.

These kinds of problems can be greatly reduced by containerizing the application. To that end, I’ve prepared a CUDA-compatible Docker image with the following components:

  • Ubuntu 18
  • CUDA 11.1
  • CUDNN 8
  • OpenCV 4
  • TensorFlow 2
  • PyTorch 1.6

The corresponding container is rather big (around 20GB in memory), but likely worth it. Feel free to pull/build it or simply take a look at the Dockerfile on my Docker Hub profile! (Note that the container does not have the face segmentation system installed.)

Another concern was whether both systems would fit on the same GPU. HigherHRNet is quite large (it takes around 7GB of GPU memory). Luckily, the Caffe model did fit in the remaining space and the Python code was remarkably stable (no hiccups whatsoever).

The images presented here are derived from the COCO dataset, released under a Creative Commons license. I do not own the copyright. Please see here for more details.

TAGS: c++, caffe, computer vision, docker, machine learning, pytorch