Humans in 4D: Reconstructing and Tracking Humans with Transformers

Jargon Explained

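HMR: Human Mesh Recovery
HMR 2.0 is the transformer-based network in this paper that recovers a human mesh from a single image; the time-sequence knowledge from video comes in at the tracking stage, which links the per-frame reconstructions over time
HMR 1.0 likewise works from only one picture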

Background Knowledge Explained

This paper builds on ViTPose, a simple baseline for pose estimation using the Vision Transformer (ViT). One can think of it as a ViT-based PoseNet. Here is the link to that project:

https://github.com/ViTAE-Transformer/ViTPose

This paper also builds on PHALP:

https://openaccess.thecvf.com/content/CVPR2022/papers/Rajasegaran_Tracking_People_by_Predicting_3D_Appearance_Location_and_Pose_CVPR_2022_paper.pdf

and here is the project page:

http://people.eecs.berkeley.edu/~jathushan/PHALP/

Tracking People by Predicting 3D Appearance, Location and Pose

Problem Definition

The problem addressed in this paper is recovering the 3D meshes of human bodies from single images and tracking them over time in video.
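To make the two halves of the problem concrete, here is a minimal sketch of the "track over time" part, assuming each frame has already been turned into a list of per-person 3D results. Everything in it is hypothetical and for illustration only: the `Person3D` class, the `track` function, and the `max_dist` threshold are made up, and the greedy nearest-neighbour matching on 3D location is a toy stand-in, not the tracker used in the paper (PHALP also uses appearance and pose).

```python
from dataclasses import dataclass
from math import dist


@dataclass(eq=False)  # compare by identity so list.remove() drops the exact object
class Person3D:
    """Hypothetical per-person output of a single-image mesh recovery step."""
    location: tuple[float, float, float]  # assumed 3D position in camera space
    track_id: int = -1                    # filled in by the tracker


def track(frames: list[list[Person3D]], max_dist: float = 0.5) -> int:
    """Greedy nearest-neighbour association across consecutive frames.

    Assigns track_id in place and returns the number of tracks created.
    Toy illustration only; not the paper's tracking method.
    """
    next_id = 0
    previous: list[Person3D] = []
    for people in frames:
        unmatched = list(previous)
        for person in people:
            # Closest still-unmatched person from the previous frame, if any.
            best = min(
                unmatched,
                key=lambda p: dist(p.location, person.location),
                default=None,
            )
            if best is not None and dist(best.location, person.location) <= max_dist:
                person.track_id = best.track_id  # continue the existing track
                unmatched.remove(best)
            else:
                person.track_id = next_id        # start a new track
                next_id += 1
        previous = people
    return next_id


# Two people who swap their order in the detection list between frames.
frames = [
    [Person3D((0.0, 0.0, 3.0)), Person3D((2.0, 0.0, 4.0))],
    [Person3D((2.1, 0.0, 4.0)), Person3D((0.1, 0.0, 3.0))],
]
num_tracks = track(frames)
```

The point of the sketch is only the shape of the problem: a per-image step produces 3D people, and a separate step decides which person in frame t is the same identity as which person in frame t+1.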