Skeleton-Aware Networks for Deep Motion Retargeting

Abstract

The paper proposes a deep learning framework that is aware of skeleton structure. The architecture is built from differentiable convolution, pooling, and unpooling operators.

Motivation

In motion capture, different companies usually use different equipment, configurations, and software. “Motion retargeting” is introduced to transfer motion between the resulting skeletons.
Current convolutional networks are usually applied to grid-like data rather than to articulated structures, and the skeletons of different characters exhibit irregular connectivity.

Contribution

In this paper, we introduce a new motion processing framework consisting of a representation for motion of articulated skeletons designed for deep learning, and several differentiable operators, including convolution, pooling and unpooling, that operate on this representation.
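To make the idea of a convolution that follows skeleton connectivity concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the paper's operator: the parent-array encoding, the function names `neighbors_within` and `skeleton_conv`, the uniform averaging weights, and the absence of a temporal axis are all hypothetical simplifications.

```python
from collections import deque

def neighbors_within(parents, joint, max_dist):
    """Return the joints at most max_dist edges away from `joint` (inclusive)."""
    # Build an undirected adjacency list from the parent array (-1 marks the root).
    adjacency = {j: [] for j in range(len(parents))}
    for child, parent in enumerate(parents):
        if parent >= 0:
            adjacency[child].append(parent)
            adjacency[parent].append(child)
    # Breadth-first search that stops expanding at max_dist.
    dist = {joint: 0}
    queue = deque([joint])
    while queue:
        cur = queue.popleft()
        if dist[cur] == max_dist:
            continue
        for nxt in adjacency[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return sorted(dist)

def skeleton_conv(parents, features, max_dist=1):
    """Average each joint's scalar feature over its skeleton neighborhood.

    Assumption: uniform weights stand in for learned kernel weights.
    """
    out = []
    for j in range(len(parents)):
        hood = neighbors_within(parents, j, max_dist)
        out.append(sum(features[n] for n in hood) / len(hood))
    return out

# Hypothetical toy skeleton: root 0 -> joint 1 -> two children (2 and 3).
example_parents = [-1, 0, 1, 1]
example_features = [0.0, 4.0, 2.0, 6.0]
example_output = skeleton_conv(example_parents, example_features)
```

Unlike a grid convolution, the neighborhood size differs per joint here (joint 1 has four members in its neighborhood, the others have two), which is the irregular connectivity the notes refer to.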

Problem Formulation

We treat the retargeting problem as a multimodal translation between unpaired domains. All skeletons are assumed to be homeomorphic, i.e. topologically equivalent.

Previous work states that multimodal unpaired image translation tasks can be carried out effectively using a shared latent space. See the paper listed below:

Image-to-image translation for cross-domain disentanglement

The method mentioned above does not carry over directly to graphs. Images of different resolutions can be brought to a common resolution with simple interpolation or pooling, and once they share a resolution the same backbone can map them into the same latent space.
For graphs, however, there is no comparably simple up-sampling or down-sampling method.
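The image case can be sketched in a few lines. This is only an illustration of why grid data is easy to resample; the function name `avg_pool_2x2` and the toy images are hypothetical, and images are plain nested lists rather than tensors.

```python
def avg_pool_2x2(image):
    """Halve the resolution of a 2D grid by averaging non-overlapping 2x2 blocks.

    A trailing odd row or column is dropped.
    """
    height, width = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, width - 1, 2)
        ]
        for r in range(0, height - 1, 2)
    ]

# Two hypothetical images at different resolutions.
image_a = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
image_b = [
    [0.0, 1.0],
    [1.0, 0.0],
]

# After one pooling step image_a has the same 2x2 shape as image_b,
# so a single backbone could consume both.
image_a_small = avg_pool_2x2(image_a)
```

The operation works because every pixel has the same regular neighborhood. A skeleton graph has no such fixed block structure, which is why the same trick does not apply to it directly.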
The authors' basic idea rests on an intuitive observation: every skeleton can be reduced to a common primal skeleton. More precisely, adjacent edges (armatures) can be merged.
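Merging adjacent edges can be sketched as a pooling step over the skeleton tree. The code below is an assumption-laden illustration rather than the authors' implementation: the parent-array encoding, the names `chain_groups`, `pool`, and `unpool`, the rule of pairing consecutive edges along chains of degree-2 joints, the use of averaging for pooling, and the use of copying for unpooling are all choices made for this example.

```python
def chain_groups(parents):
    """Group edges into pairs of consecutive edges along unbranched chains.

    parents[j] is the parent joint of joint j (-1 for the root).
    Each edge is identified by its child joint index.
    """
    children = [[] for _ in parents]
    for joint, parent in enumerate(parents):
        if parent >= 0:
            children[parent].append(joint)
    groups = []
    for joint, parent in enumerate(parents):
        if parent < 0:
            continue  # the root has no incoming edge
        parent_is_root = parents[parent] < 0
        if not parent_is_root and len(children[parent]) == 1:
            continue  # this edge continues a chain that started higher up
        # Follow the chain while the current joint has exactly one child.
        chain = [joint]
        while len(children[chain[-1]]) == 1:
            chain.append(children[chain[-1]][0])
        # Merge consecutive pairs; an odd leftover edge stays on its own.
        for i in range(0, len(chain), 2):
            groups.append(chain[i:i + 2])
    return groups

def pool(features, groups):
    """Average the feature vectors of the edges in each group."""
    pooled = []
    for group in groups:
        columns = zip(*(features[e] for e in group))
        pooled.append([sum(col) / len(group) for col in columns])
    return pooled

def unpool(pooled, groups):
    """Copy each pooled feature vector back to every edge it was built from."""
    return {e: list(pooled[i]) for i, group in enumerate(groups) for e in group}

# Hypothetical skeleton: a 3-edge spine (0-1-2-3) that splits into two 2-edge arms.
toy_parents = [-1, 0, 1, 2, 3, 4, 3, 6]
toy_features = {
    1: [1.0, 0.0], 2: [3.0, 2.0], 3: [5.0, 5.0],
    4: [2.0, 2.0], 5: [4.0, 6.0],
    6: [10.0, 0.0], 7: [20.0, 4.0],
}
toy_groups = chain_groups(toy_parents)
toy_pooled = pool(toy_features, toy_groups)
toy_restored = unpool(toy_pooled, toy_groups)
```

In this toy case seven edges are reduced to four. Skeletons with the same branching pattern but different numbers of joints per limb would shrink toward the same coarse structure under repeated pooling, which is the intuition behind a shared primal skeleton.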
The authors claim that the latent representation consists of temporal features associated with the joints of the primal skeleton.
The authors also claim that they disentangle the feature space into two parts: their encoder splits the latent code into a dynamic part, which carries the motion features, and a static part, which does not.