Pipeline

Untitled

  1. In their words, when we first observed the person in the 2D frame, we will lift them to a 3D space
  2. The so called 3D space is actually the 3D dynamic model. After hat, we want to construct a 3D dynamic model by actually connecting single frame result. When we have the time sequence result, we than can have a prediction according to the time sequence
  3. The 3D dynamic model can infer person’s location and appearance at the next time say t+1.
  4. Then we use the inferred result to do the association among multiple person

Lifting Procedure

This paper, author used HMAR model. This is applied on every detected bounding box of the input video. Therefore, they also used some object detection strategy such as YOLO(I guess). This HMAR give initial, single frame, and location observations for 3D pose.

Build Dynamic Model (Used for predicting future Steps)