1Shenzhen International Graduate School, Tsinghua University
2Department of Electronic Engineering, Tsinghua University
3Prometheus Vision Technology Co., Ltd.
4Beijing University of Aeronautics and Astronautics
5The Hong Kong University of Science and Technology (Guangzhou)
*Corresponding author
We propose a new parametric model, SMPLX-Lite-D, which can fit the detailed geometry of scanned meshes while maintaining stable geometry in the nose, mouth, and foot regions and reasonable shapes for the face and fingers. We present the SMPLX-Lite dataset, the most comprehensive clothed-avatar dataset to date, with multi-view RGB sequences, 3D keypoint annotations, textured scanned meshes, and textured SMPLX-Lite-D models. The reconstructed meshes and models come with fine geometric detail and 4K-resolution texture atlases. Using SMPLX-Lite, we train a conditional variational autoencoder that takes human pose and facial keypoints as input and generates a photorealistic, drivable human avatar.
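For intuition, parametric models in this family pose a corrected template via linear blend skinning. Following SMPL-X notation, one plausible formulation of SMPLX-Lite-D adds a per-vertex displacement field D on top of the shape, pose, and expression correctives; this is an assumption for illustration, not the authors' exact definition:

M(\beta, \theta, \psi, D) = \mathrm{LBS}\big(T_P(\beta, \theta, \psi) + D,\; J(\beta),\; \theta,\; \mathcal{W}\big),
\quad T_P(\beta, \theta, \psi) = \bar{T} + B_S(\beta) + B_P(\theta) + B_E(\psi),

where \beta, \theta, \psi are shape, pose, and expression parameters, J(\beta) gives the joint locations, \mathcal{W} are the skinning weights, and D is the displacement field fit to each scan.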
Fitting Results of Different Models. (a) The SMPL model cannot control facial expressions or hand movements. (b) The SMPL-X model has overly complex face and toe regions, making it unsuitable for vertex-level fitting. (c) The SMPLX-Lite model, with per-vertex displacements (d), fits the scanned mesh (e) perfectly, especially in the hand regions.
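As a sketch of how such displacement fitting might work, one common recipe optimizes a per-vertex offset field against the scan with a one-sided Chamfer term plus an edge-based smoothness regularizer. The function names below are hypothetical and this is an illustrative sketch, not the authors' pipeline:

import torch

def edges_from_faces(faces):
    """Collect unique undirected edges (E, 2) from a triangle list (F, 3)."""
    e = torch.cat([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]], dim=0)
    e, _ = torch.sort(e, dim=1)
    return torch.unique(e, dim=0)

def fit_displacements(model_verts, scan_points, faces, steps=500, lr=1e-3, w_smooth=10.0):
    """Fit per-vertex displacements D so that (model_verts + D) matches the scan,
    one plausible way to obtain an SMPLX-Lite-D style registration.
    model_verts: (V, 3) vertices of the fitted parametric model
    scan_points: (P, 3) points sampled from the scanned mesh"""
    D = torch.zeros_like(model_verts, requires_grad=True)
    opt = torch.optim.Adam([D], lr=lr)
    edges = edges_from_faces(faces)
    for _ in range(steps):
        opt.zero_grad()
        v = model_verts + D
        # One-sided Chamfer: each displaced vertex snaps to its nearest scan point.
        data = torch.cdist(v, scan_points).min(dim=1).values.mean()
        # Edge-based smoothness keeps neighboring displacements similar.
        smooth = (D[edges[:, 0]] - D[edges[:, 1]]).pow(2).sum(dim=1).mean()
        (data + w_smooth * smooth).backward()
        opt.step()
    return D.detach()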
Data Processing Pipeline. Our pipeline produces a variety of data annotations, including 3D keypoints, SMPL-X parameters, textured scanned meshes, and textured SMPLX-Lite-D models.
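One standard way a calibrated multi-camera pipeline turns 2D keypoint detections into 3D keypoint annotations is direct linear transform (DLT) triangulation. The NumPy sketch below is a generic illustration of that technique under this assumption, not the authors' implementation:

import numpy as np

def triangulate_dlt(proj_mats, points_2d):
    """Triangulate one 3D point from N calibrated views via DLT.
    proj_mats: (N, 3, 4) camera projection matrices P = K [R | t]
    points_2d: (N, 2) pixel coordinates of the same keypoint in each view
    Returns the least-squares 3D point (3,)."""
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on the homogeneous point X.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]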
Multi-View Capture. Our capture system deploys 24 standard cameras and 8 telephoto cameras to record multi-view synchronized RGB sequences. We show several frames captured by a subset of these cameras.
Method Overview. The CVAE model generates the mesh and texture maps via a decoder, which takes pose and facial keypoints as driving signals, incorporates camera-view information, and uses latent codes sampled from the distribution produced by the encoder. The output mesh, posed by LBS, is passed together with the texture map and camera parameters through a differentiable renderer to produce photorealistic rendered images. The entire training process is end-to-end, and the mesh, texture, and final rendered images can all be supervised.
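Reading the overview literally, a minimal training step might look like the PyTorch sketch below. The encoder/decoder widths, the lbs and render callables (stand-ins for linear blend skinning and any differentiable renderer in the spirit of PyTorch3D or nvdiffrast), and the loss weights are all assumptions made for illustration; this shows the conditioning and end-to-end supervision pattern, not the authors' code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AvatarCVAE(nn.Module):
    """CVAE sketch: pose + face keypoints condition the decoder, which predicts
    per-vertex offsets (posed into a mesh by LBS) and a texture map."""
    def __init__(self, n_verts, cond_dim, z_dim=64, tex_res=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_verts * 3 + cond_dim, 512), nn.ReLU(),
            nn.Linear(512, 2 * z_dim))          # -> (mu, logvar)
        self.geom_dec = nn.Sequential(
            nn.Linear(z_dim + cond_dim, 512), nn.ReLU(),
            nn.Linear(512, n_verts * 3))        # per-vertex offsets in the T-pose
        self.tex_dec = nn.Sequential(           # tiny texture head for illustration
            nn.Linear(z_dim + cond_dim, tex_res * tex_res * 3))
        self.tex_res = tex_res

    def forward(self, gt_verts, cond):
        h = self.encoder(torch.cat([gt_verts.flatten(1), cond], dim=1))
        mu, logvar = h.chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        x = torch.cat([z, cond], dim=1)
        offsets = self.geom_dec(x).view(gt_verts.shape)
        tex = self.tex_dec(x).view(-1, 3, self.tex_res, self.tex_res).sigmoid()
        return offsets, tex, mu, logvar

def training_step(model, lbs, render, batch, kl_w=1e-3):
    """End-to-end step: mesh, texture, and rendered image are all supervised.
    `lbs` and `render` are hypothetical callables, as noted above."""
    cond = torch.cat([batch["pose"], batch["face_kpts"], batch["view"]], dim=1)
    offsets, tex, mu, logvar = model(batch["gt_verts_tpose"], cond)
    verts = lbs(batch["template"] + offsets, batch["pose"])    # posed mesh
    img = render(verts, tex, batch["camera"])                  # differentiable render
    return (F.l1_loss(verts, batch["gt_verts_posed"])
            + F.l1_loss(tex, batch["gt_texture"])
            + F.l1_loss(img, batch["gt_image"])
            - kl_w * 0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean())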
Driving results of five models under shared driving signals. Each column corresponds to a different driving signal, applied to all five models.
@inproceedings{SMPLX-Lite,
title = {{SMPLX-Lite: A Realistic and Drivable Avatar Benchmark with Rich Geometry and Texture Annotations}},
author = {Jiang, Yujiao and Liao, Qingmin and Wang, Zhaolong and Lin, Xiangru and Lu, Zongqing and Zhao, Yuxi and Wei, Hanqing and Ye, Jingrui and Zhang, Yu and Shao, Zhijing},
booktitle = {IEEE International Conference on Multimedia and Expo (ICME)},
year = {2024}
}