Estimating Body and Hand Motion in an Ego-sensed World

Brent Yi1 Vickie Ye1 Maya Zheng1 Lea Müller1 Georgios Pavlakos2 Yi Ma1 Jitendra Malik1 Angjoo Kanazawa1
1 UC Berkeley
2 UT Austin

TL;DR: We use egocentric SLAM poses and images to estimate 3D human body pose, height, and hands in the world.

Some results from our method, with the input view shown on the top left:

Visualized scenes are outputs from Nerfstudio and Project Aria MPS.

Overview

Our system, EgoAllo, uses egocentric observations to estimate the actions of a head-mounted device's wearer in the allocentric scene coordinate frame. To do this, we condition a human motion prior on the device's estimated head motion and incorporate image-based hand observations via diffusion guidance.

Head Pose Conditioning

Our key insight for improving estimation lies in the choice of conditioning representation.

We aim to condition a human motion prior on head motion. Naively conditioning on absolute poses, however, would introduce sensitivity to arbitrary world frame choices. Consider these trajectories, which have identical local body motion but completely different absolute head poses:

Training a model with these poses as conditioning would result in poor generalization, since the inputs can be shifted by any of infinitely many possible world frames.
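To make this concrete, here is a minimal sketch (illustrative, not the authors' code) showing that absolute head poses change under an arbitrary world-frame shift while frame-to-frame relative motion does not. Poses are 4x4 homogeneous matrices, and all names below are ours.

import numpy as np

def random_se3(rng):
    # Sample a random rigid transform: QR-based rotation plus a random translation.
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.linalg.det(q))  # flip to a proper rotation (det = +1)
    T = np.eye(4)
    T[:3, :3] = q
    T[:3, 3] = rng.normal(size=3)
    return T

rng = np.random.default_rng(0)
traj = [random_se3(rng) for _ in range(5)]      # head poses in world frame A
shift = random_se3(rng)                         # arbitrary change of world frame
traj_b = [shift @ T for T in traj]              # same motion, world frame B

# Absolute poses differ arbitrarily between the two world frames...
print(np.allclose(traj[1], traj_b[1]))          # False
# ...but relative motion between consecutive frames is identical.
print(np.allclose(np.linalg.inv(traj[0]) @ traj[1],
                  np.linalg.inv(traj_b[0]) @ traj_b[1]))  # True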

Prior works have solved this by aligning trajectories with their first frame. We observe, however, that canonicalizing sequences this way leads to sensitivity to when a sequence begins. Consider two slices of the same motion:

Head poses from canonicalized sequences can still differ significantly, even for the same body motion (circled):

This also hinders generalization: networks must "re-learn" outputs for each slice of the input.
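A small continuation of the sketch above (again illustrative, reusing random_se3 and traj from before) shows why: when each window is canonicalized by its first frame, the same timestep is represented differently depending on where the window starts.

def first_frame_canonicalize(window):
    # Express every pose in the window relative to the window's first pose.
    T0_inv = np.linalg.inv(window[0])
    return [T0_inv @ T for T in window]

canon_a = first_frame_canonicalize(traj[0:4])   # slice starting at t=0
canon_b = first_frame_canonicalize(traj[1:5])   # overlapping slice starting at t=1

# traj[2] appears in both slices, but its canonicalized value differs, so a
# network must "re-learn" the same motion for every possible window offset.
print(np.allclose(canon_a[2], canon_b[1]))      # False in general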

Motivated by this, our paper proposes (1) spatial and temporal invariance properties that are desirable for head pose conditioning, and (2) an alternative parameterization that achieves them.

Using the central pupil frame (CPF) to measure head motion, our invariant parameterization couples relative CPF motion $\Delta\mathbf{T}_\text{cpf}^t$ with per-timestep canonicalized pose $\mathbf{T}_{\text{canonical},\text{cpf}}^t$.

These transformations have improved invariance properties over prior methods, while fully defining head pose trajectories relative to the floor plane.
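As a rough sketch of these two terms (ours, not the released implementation, and assuming a gravity-aligned world frame with +Z up, the floor at z = 0, and the CPF's +Z axis as the head's forward direction):

import numpy as np

def relative_cpf_motion(T_cpf_prev, T_cpf_curr):
    # Delta T^t_cpf: CPF motion between consecutive timesteps; invariant to the world frame.
    return np.linalg.inv(T_cpf_prev) @ T_cpf_curr

def canonicalized_cpf_pose(T_cpf):
    # T^t_{canonical,cpf}: the CPF pose expressed in a per-timestep canonical frame
    # placed on the floor directly below the CPF, with +Z up and heading taken from
    # the head's forward axis projected onto the floor. This keeps height above the
    # floor and head tilt, while discarding absolute yaw and horizontal position.
    forward = T_cpf[:3, :3] @ np.array([0.0, 0.0, 1.0])   # assumed CPF forward axis
    yaw = np.arctan2(forward[1], forward[0])
    c, s = np.cos(yaw), np.sin(yaw)
    T_canonical = np.eye(4)
    T_canonical[:3, :3] = np.array([[c, -s, 0.0],
                                    [s,  c, 0.0],
                                    [0.0, 0.0, 1.0]])
    T_canonical[:2, 3] = T_cpf[:2, 3]                      # below the CPF, on z = 0
    return np.linalg.inv(T_canonical) @ T_cpf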

Quantitatively, this choice of conditioning accounts for differences of between 5% and 18% in joint position estimation error.

Qualitatively, we observe consistent improvements in realism. For example, see the subtle but critical improvements in foot motion for this dynamic sequence:

Trajectory source: EgoExo4D, unc_soccer_09-22-23_01_27.

Hand Guidance

Head motion encodes significant information about body motion, but articulated hands require a richer input. In EgoAllo, we extract visual hand observations using HaMeR and (optionally) Project Aria's wrist and palm estimator. We then incorporate these into sampling via diffusion guidance.
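The sketch below gives the flavor of this guidance step. It is a simplification under our own assumptions rather than the released EgoAllo code: denoise_step, forward_kinematics, and the hand_obs keypoints are hypothetical placeholders.

import torch

def guided_sample(model, cond, hand_obs, num_steps, guidance_weight=1.0):
    # `denoise_step` and `forward_kinematics` are hypothetical helpers, not real APIs.
    # Start from noise over the body + hand parameters being sampled.
    x = torch.randn_like(cond["init_noise"])
    for t in reversed(range(num_steps)):
        x = denoise_step(model, x, t, cond)     # standard reverse-diffusion update

        # Guidance: compare the hands implied by the current sample against the
        # detected hand keypoints, then take a gradient step on that cost.
        x = x.detach().requires_grad_(True)
        pred_joints = forward_kinematics(x, cond)
        cost = ((pred_joints - hand_obs["keypoints_3d"]) ** 2).sum()
        (grad,) = torch.autograd.grad(cost, x)
        x = (x - guidance_weight * grad).detach()
    return x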

We observe that jointly estimating human hands with bodies (purple) reduces ambiguities and errors when compared to single-frame monocular estimates (blue):

Compared to HaMeR alone, EgoAllo with head pose + HaMeR inputs reduces world-frame hand joint errors by as much as 40%.

Related links

If this problem is interesting to you, here are some papers that you might enjoy!

Citation

@article{yi2024egoallo,
    title={Estimating Body and Hand Motion in an Ego-sensed World},
    author={Brent Yi and Vickie Ye and Maya Zheng and Lea M\"uller and Georgios Pavlakos and Yi Ma and Jitendra Malik and Angjoo Kanazawa},
    year={2024},
    journal={arXiv preprint arXiv:2410.03665},
}