An Empirical Study of the Collapsing Problem in Semi-Supervised 2D Human Pose Estimation
Semi-supervised learning aims to boost the accuracy of a model by exploring unlabeled images. The state-of-the-art methods are consistency-based which learn about unlabeled images by encouraging the model to give consistent predictions for images under different augmentations. However, when applied to pose estimation, the methods degenerate and predict every pixel in unlabeled images as background. This is because contradictory predictions are gradually pushed to the background class due to highly imbalanced class distribution. But this is not an issue in supervised learning because it has accurate labels. This inspires us to stabilize the training by obtaining reliable pseudo labels. Specifically, we learn two networks to mutually teach each other. In particular, for each image, we compose an easy-hard pair by applying different augmentations and feed them to both networks. The more reliable predictions on easy images in each network are used to teach the other network to learn about the corresponding hard images. The approach successfully avoids degeneration and achieves promising results on public datasets. The source code will be released.