SMPLX-Lite: A Realistic and Drivable Avatar Benchmark with Rich Geometry and Texture Annotations

ICME 2024


Yujiao Jiang1, Qingmin Liao1,2, Zhaolong Wang1,3, Xiangru Lin3, Zongqing Lu1, Yuxi Zhao1, Hanqing Wei4, Jingrui Ye1, Yu Zhang3, Zhijing Shao3,5*

1Shenzhen International Graduate School, Tsinghua University
2Department of Electronic Engineering, Tsinghua University
3Prometheus Vision Technology Co., Ltd.
4Beijing University of Aeronautics and Astronautics
5The Hong Kong University of Science and Technology (Guangzhou)
* Corresponding author     

Abstract




We present the SMPLX-Lite dataset, the most comprehensive clothed avatar dataset, with multi-view RGB sequences, 3D keypoint annotations, textured scan meshes, and textured SMPLX-Lite-D models. SMPLX-Lite provides over 20k high-resolution scan models of 5 subjects performing 15 types of actions.


Figure panels: (a) Color Image; (b) Keypoints; (c) SMPL-X; (d) Scanned Mesh; (e) Scanned Texture; (f) Lite-D; (g) Lite-D Texture.

We propose a new parametric model, SMPLX-Lite-D, which can fit the detailed geometry of the scanned mesh while maintaining stable geometry in the nose, mouth, and foot areas and reasonable shapes of the face and fingers. We present the SMPLX-Lite dataset, the most comprehensive clothed avatar dataset, with multi-view RGB sequences, 3D keypoint annotations, textured scan meshes, and textured SMPLX-Lite-D models. The reconstructed meshes and models come with fine geometric details and 4K-resolution texture atlases. We use SMPLX-Lite to train a conditional variational autoencoder (CVAE) model that takes human pose and facial keypoints as input and generates a photorealistic drivable human avatar.



SMPLX-Lite-D Model


Figure panels: (a) SMPL; (b) SMPL-X; (c) SMPLX-Lite; (d) SMPLX-Lite-D; (e) Scanned Mesh.

Fitting Results of Different Models. (a) The SMPL model cannot control facial expressions or hand movements. (b) The SMPL-X model has overly complex faces and toes, making it unsuitable for vertex fitting. (c) The SMPLX-Lite model, plus vertex displacements (d), can fit the scanned mesh (e) perfectly, especially in the hand regions.
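As a rough illustration of how panels (c) and (d) relate, the sketch below adds per-vertex displacements to a base mesh. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name apply_displacements, the frozen_mask argument, and the idea of zeroing offsets on selected vertices (for example in the nose, mouth, and foot areas) are hypothetical and chosen only for illustration.

```python
import numpy as np

def apply_displacements(base_vertices, displacements, frozen_mask=None):
    """Add per-vertex offsets to a base mesh.

    base_vertices: (N, 3) vertices of the base model (hypothetical input).
    displacements: (N, 3) per-vertex offsets, e.g. obtained by fitting a scan.
    frozen_mask:   optional (N,) booleans; True marks vertices whose offset
                   is forced to zero (assumption: used to keep some regions stable).
    """
    base_vertices = np.asarray(base_vertices, dtype=float)
    displacements = np.asarray(displacements, dtype=float)
    if base_vertices.shape != displacements.shape:
        raise ValueError("base_vertices and displacements must have the same shape")
    if frozen_mask is not None:
        frozen = np.asarray(frozen_mask, dtype=bool)[:, None]  # (N, 1), broadcasts over xyz
        displacements = np.where(frozen, 0.0, displacements)
    return base_vertices + displacements
```

Keeping the displacement as a separate array, rather than overwriting the base vertices, leaves the underlying parametric model untouched, so the same offsets could in principle be reused across poses.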

 

Data Process


Data Process Pipeline. Our pipeline produces a variety of data annotations, including 3D keypoints, SMPL-X parameters, textured scanned models, and textured SMPLX-Lite-D models.

 

Data Visualization


 

Multi-View Capture. SMPLX-Lite uses 24 standard cameras and 8 telephoto cameras to capture synchronized multi-view RGB sequences. We show several frames from a subset of these cameras.
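To show how 3D keypoint annotations relate to the individual camera views, the sketch below projects world-space points into one image with a standard pinhole model. It is only an illustration: the intrinsics K, rotation R, translation t, and the function name project_points are assumptions, and the dataset's actual calibration format is not described here.

```python
import numpy as np

def project_points(points_world, K, R, t):
    """Project (N, 3) world points to (N, 2) pixel coordinates.

    K: (3, 3) camera intrinsics, R: (3, 3) world-to-camera rotation,
    t: (3,) world-to-camera translation. All three are hypothetical inputs.
    Lens distortion is ignored in this sketch.
    """
    points_world = np.asarray(points_world, dtype=float)
    K = np.asarray(K, dtype=float)
    R = np.asarray(R, dtype=float)
    t = np.asarray(t, dtype=float)
    points_cam = points_world @ R.T + t      # world frame -> camera frame
    uvw = points_cam @ K.T                   # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]          # perspective divide by depth
```

With 32 cameras, the same function would simply be called once per camera with that camera's own K, R, and t.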

 

Method


 

Method Overview. The CVAE decoder generates a mesh and a texture map. It takes pose and face keypoints as driving signals, incorporates camera view information, and uses latent codes sampled from the distribution predicted by the encoder. The output mesh, obtained by linear blend skinning (LBS), is passed together with the texture map and camera parameters through a differentiable renderer to produce photorealistic rendered images. The entire training process is end-to-end, and the mesh, texture, and final rendered images can all be supervised.
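Two steps named in this overview, sampling a latent code from the encoder's distribution and posing a mesh with LBS, can be sketched in a few lines of NumPy. This is a generic textbook illustration, not the authors' network: it assumes a diagonal Gaussian posterior parameterized by mu and log_var, and the names sample_latent and linear_blend_skinning, along with all array shapes, are hypothetical.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).

    Assumes the encoder outputs the mean and log-variance of a diagonal Gaussian.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.exp(0.5 * np.asarray(log_var, dtype=float))
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def linear_blend_skinning(vertices, weights, transforms):
    """Pose a mesh by blending per-joint rigid transforms.

    vertices:   (N, 3) rest-pose vertices.
    weights:    (N, J) skinning weights, each row summing to 1.
    transforms: (J, 4, 4) homogeneous joint transforms.
    """
    vertices = np.asarray(vertices, dtype=float)
    weights = np.asarray(weights, dtype=float)
    transforms = np.asarray(transforms, dtype=float)
    ones = np.ones((vertices.shape[0], 1))
    homogeneous = np.concatenate([vertices, ones], axis=1)        # (N, 4)
    blended = np.einsum("nj,jab->nab", weights, transforms)       # (N, 4, 4) per-vertex transform
    posed = np.einsum("nab,nb->na", blended, homogeneous)         # (N, 4)
    return posed[:, :3]

# Usage: one vertex at the origin, influenced equally by two joints.
identity = np.eye(4)
shift_x = np.eye(4)
shift_x[0, 3] = 2.0  # second joint translates by 2 along x
posed = linear_blend_skinning([[0.0, 0.0, 0.0]], [[0.5, 0.5]], np.stack([identity, shift_x]))
z = sample_latent(np.zeros(8), np.zeros(8), np.random.default_rng(0))
```

Both operations are differentiable with respect to their inputs, which is what allows gradients from a rendering loss to flow back to the decoder in an end-to-end setup.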

 



Driving Results


 

Driving results of 5 models driven by the same signals. Each column corresponds to a different driving signal.

 



Citation



  @inproceedings{SMPLX-Lite,
    title = {{SMPLX-Lite: A Realistic and Drivable Avatar Benchmark with Rich Geometry and Texture Annotations}},
    author = {Yujiao Jiang and Qingmin Liao and Zhaolong Wang and Xiangru Lin and Zongqing Lu and Yuxi Zhao and Hanqing Wei and Jingrui Ye and Yu Zhang and Zhijing Shao},
    booktitle = {ICME 2024},
    year = {2024}
  }