Real-Time Multiperson Human Pose Estimation

PUBLISHED ON MAY 18, 2020 — CATEGORIES: explorations

This ongoing project aims to achieve real-time, multi-person, keypoint-based pose estimation with a competitive trade-off between runtime/size and performance, and potentially to extend it to 3D estimation.

The applications for such a setup are countless, not only in industry, security and sports but also in the arts, since new ways of interfacing between humans and systems always leave room for novelty.

The code and main lines of work, as well as detailed documentation on how to run the system, can be found in this GitHub repository. Feel free to check it out!

Human Pose Estimation

The problem of human pose estimation has seen great advances in the last few years thanks to diverse deep learning (DL) techniques and the availability of large datasets like MS COCO (with over 110,000 annotated images). All of the current best approaches identify the keypoints by first using a neural network to predict the corresponding heatmaps. In the following picture you can see how, for a given image, the dataloader provides a mask and a set of such heatmaps. In some setups like mine, the “spread” of the heatmaps is configurable, which can help in some coarse-to-fine strategies like this one (section 3.3).
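To make the idea of a configurable "spread" concrete, here is a minimal sketch of how a ground-truth heatmap for a single keypoint could be rendered. It assumes a Gaussian bump whose standard deviation `sigma` plays the role of the spread; the function name and parameters are hypothetical and are not taken from the repository.

```python
import math

def gaussian_heatmap(height, width, center_xy, sigma):
    """Return a height x width grid with a Gaussian bump at center_xy.

    sigma (in pixels) is the assumed "spread": larger values give wider blobs.
    """
    cx, cy = center_xy
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

# Same keypoint, two spreads: a coarse target and a fine one.
coarse = gaussian_heatmap(9, 9, center_xy=(4, 4), sigma=3.0)
fine = gaussian_heatmap(9, 9, center_xy=(4, 4), sigma=1.0)
```

Both maps peak at the keypoint location, but the coarse one keeps a noticeable response a few pixels away, which is the kind of target a coarse-to-fine schedule might start from before tightening the spread.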

When training, the mask can be used to exclude regions of the image from the loss, so that the network learns nothing from them, either as positive or as negative examples. This is useful if, for example, a person in the image hasn't been annotated: we don't want to teach the network that this is background, so we mask the region out and the network ignores it altogether.
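One common way to implement this kind of masking is to multiply the per-pixel loss by the mask before averaging. The sketch below assumes a mean squared error loss and a binary mask (1 = use the pixel, 0 = ignore it); both choices are illustrative assumptions rather than the exact loss used in the project.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over the pixels where mask is 1.

    Pixels with mask 0 contribute nothing, so no gradient would flow from them.
    """
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            total += m * (p - t) ** 2
            count += m
    return total / count if count else 0.0

pred = [[0.9, 0.8], [0.1, 0.0]]
target = [[1.0, 0.0], [0.0, 0.0]]
mask = [[1, 0], [1, 1]]  # top-right pixel: unannotated person, ignored

loss = masked_mse(pred, target, mask)

# Changing the prediction inside the masked region leaves the loss untouched.
pred_changed = [[0.9, 0.1], [0.1, 0.0]]
loss_changed = masked_mse(pred_changed, target, mask)
```

Here the strong response at the masked pixel is neither rewarded nor punished, which matches the behaviour described above.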

The DL-based approaches can be classified into two groups: top-down, where a detector first frames every individual person in the image and then the keypoints are estimated for each person, and bottom-up, where all separate keypoints are estimated first and then assembled into whole humans.

Bottom-up models have the advantage of not depending on a separate human detector and not having to run once per person. On the other hand, they perform worse than top-down methods on COCO, since top-down methods can adjust the size of the input human crop, which makes the problem simpler in terms of scale (a crucial factor). On crowded datasets like CrowdPose, bottom-up methods seem to be not only orders of magnitude faster but also more promising in terms of performance.

I included a much more detailed review of the literature here.


My conclusion after the literature review was that neural distillation using a state-of-the-art bottom-up teacher is the most plausible approach to achieve these goals. In short, teacher-student distillation is a model compression technique in which the teacher learns from the ground truth (here called the hard labels), whereas the student combines that with the teacher's predictions (or soft labels). Since the teacher's predictions tend to yield more informative gradients, this usually allows the student to approximate the performance of the teacher with much lower memory and computational requirements. Apart from the paper, a good intuitive explanation can be found in this post by Prakhar Ganesh:

In the following image we can see a clear example of that: the figures on the left aren't humans, but a neural network of limited capacity would have a lot of trouble distinguishing posters from people. The soft labels help to mitigate that:

Last but not least, to help generalization, many setups include different forms of data augmentation, e.g. random rotation, rescaling and flipping. This is what the final augmented distillation dataloader looks like in our setup, for 3 different ground truth spreads:

For the distillation, HigherHRNet was chosen as teacher due to its high performance (best existing bottom-up approach), manageable size and availability of software.
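As an illustration of how a student can combine hard and soft labels, here is a minimal sketch of a distillation loss. It assumes a mean squared error on heatmap values and a mixing weight `alpha`; both the loss type and the weighting scheme are assumptions made for this example, not a description of the loss used in the repository.

```python
def mse(a, b):
    """Mean squared error between two equally long lists of heatmap values."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(student, hard, soft, alpha=0.5):
    """Blend the loss against the ground truth with the loss against the teacher.

    alpha = 1.0 uses only the hard labels, alpha = 0.0 only the soft labels.
    """
    return alpha * mse(student, hard) + (1.0 - alpha) * mse(student, soft)

student = [0.2, 0.7]  # student predictions for two pixels
hard = [0.0, 1.0]     # ground-truth (hard) labels
soft = [0.3, 0.8]     # teacher predictions (soft labels)

loss = distillation_loss(student, hard, soft, alpha=0.5)
```

With these toy numbers the student is closer to the teacher than to the ground truth, so the soft-label term pulls the combined loss down relative to training on hard labels alone.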

Since most of the models use a backbone/stem for representation learning (and many of them transfer from ImageNet), I considered that this representation may be beneficial for the student as well, and planned a 3-stage distillation, aiming for maximal compression:

  1. Fix the teacher-stem, and train the student-detector on top.
  2. Once good detection is achieved, replace the stem with a student-stem, fix the student-detector on top and train the stem. Explore the usage of HSV/LAB color spaces and change from fp16 to fp32 to support CPU computation.
  3. Expand/transfer the student architecture into 3D, exploring the techniques covered at the end of the literature review.


The project is currently at the end of phase 1. The best-performing HigherHRNet has been successfully integrated and reproduced, as can be seen in the following images, which illustrate the output of the script. Note that the heatmaps have been extracted with the pretrained HigherHRNet model, unmodified:

The model has been used to generate ahead-of-time predictions for all the COCO train2017 and val2017 images (around 400GB of results). This avoids re-running the teacher on the same image multiple times, which would be very inefficient in terms of time and energy. Dataloaders and optimizers for distillation have been implemented, as shown above. Infrastructure for distillation (logging, model serialization, minival…) has been completed and is up and running, as can be seen in the TensorBoard screenshots below.
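The idea of computing the teacher's output once per image and reusing it in every epoch can be sketched as follows. The helper names, the one-file-per-image layout and the use of `pickle` are hypothetical choices for this example; the actual storage format used in the project may differ.

```python
import os
import pickle
import tempfile

def precompute_teacher_predictions(image_ids, teacher_fn, cache_dir):
    """Run the (expensive) teacher once per image and store the result on disk."""
    for image_id in image_ids:
        path = os.path.join(cache_dir, f"{image_id}.pkl")
        if not os.path.exists(path):  # skip images that are already cached
            with open(path, "wb") as f:
                pickle.dump(teacher_fn(image_id), f)

def load_teacher_prediction(image_id, cache_dir):
    """Read a stored teacher prediction instead of re-running the teacher."""
    with open(os.path.join(cache_dir, f"{image_id}.pkl"), "rb") as f:
        return pickle.load(f)

# Toy usage with a stand-in "teacher" that records every call it receives.
calls = []

def fake_teacher(image_id):
    calls.append(image_id)
    return [[0.0, 1.0], [0.5, 0.0]]  # placeholder heatmap

with tempfile.TemporaryDirectory() as cache_dir:
    precompute_teacher_predictions([1, 2], fake_teacher, cache_dir)
    precompute_teacher_predictions([1, 2], fake_teacher, cache_dir)  # no new calls
    for epoch in range(3):
        soft_label = load_teacher_prediction(1, cache_dir)
```

In this toy run the teacher is called exactly once per image, no matter how many epochs read the prediction back, which is the saving the ahead-of-time approach is after.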

Here we can see that, as the training progresses, the distillation loss decreases swiftly. We use SGDR as the learning rate scheduler, together with a regular stochastic gradient descent optimizer with momentum and weight decay.
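For readers unfamiliar with SGDR (stochastic gradient descent with warm restarts), the schedule anneals the learning rate along a cosine curve and periodically resets it to its maximum. The sketch below follows the standard formula; the concrete values for `lr_max`, `period` and `period_mult` are placeholders, not the settings used in these experiments.

```python
import math

def sgdr_learning_rate(step, lr_max, lr_min=0.0, period=10, period_mult=2):
    """Cosine annealing with warm restarts; `step` counts from 0.

    The first cycle lasts `period` steps and each later cycle is
    `period_mult` times longer than the previous one.
    """
    t_cur, t_i = step, period
    while t_cur >= t_i:  # find the cycle this step falls into
        t_cur -= t_i
        t_i *= period_mult
    cos_term = 1.0 + math.cos(math.pi * t_cur / t_i)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term

# Learning rate for the first 30 steps: one 10-step cycle, then a 20-step cycle.
schedule = [sgdr_learning_rate(s, lr_max=0.1) for s in range(30)]
```

The rate starts at `lr_max`, decays towards `lr_min` within each cycle and jumps back up at steps 10 and 30, where a new, longer cycle begins.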

In line with the loss decay, the gradients also shrink towards zero, showing that the model is responding to gradient descent as expected.

The parameters also adapt to the training. Note that many of the histograms don't seem to be changing, but the quantities are in fact changing by small amounts.

So far, a few student architectures have been explored (see code), all of them inspired by different ideas from the literature review. The current best approach (heavily based on the Context Aware Modules) features an attention pipeline, trained with the human segmentation masks, that is able to capture the silhouettes. As is usual with attention mechanisms, it filters out the responses that don't correspond to human predictions. This helps the keypoint pipeline use its capacity more efficiently.
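A simple way to picture this filtering is an element-wise product between the keypoint responses and an attention map with values between 0 and 1. This gating form is an assumption made for illustration; the student architecture in the repository may combine the two pipelines differently.

```python
def apply_attention(keypoint_map, attention_map):
    """Scale each keypoint response by the attention value at the same pixel."""
    return [
        [k * a for k, a in zip(k_row, a_row)]
        for k_row, a_row in zip(keypoint_map, attention_map)
    ]

keypoint_map = [[0.9, 0.8]]   # two strong raw responses
attention_map = [[1.0, 0.0]]  # only the first pixel lies on a human silhouette

filtered = apply_attention(keypoint_map, attention_map)
```

The response outside the silhouette is suppressed, so the keypoint pipeline does not need to spend capacity on rejecting it by itself.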

The following image illustrates the attention ground truth and prediction for a given COCO training instance:

And this is the attention together with the keypoint detector that goes on top. At its current capacity (around 8MB size), the detector shows some ability to differentiate parts of the body:

The model also generalizes well: the performance on the 5000 COCO validation images is very consistent with the training results.

Capacity will be added gradually to refine the predictions. Currently, the size of the smallest model in the TensorFlow detection model zoo is about 25MB, so there is still room for improvement.

Original media in this post is licensed under CC BY-NC-ND 4.0. Software licenses are provided separately.