Whole-Body Compliance for Heavy Humanoids via Force Latent Estimation and Residual Impedance Targets


CompliantWBC teaser: a whole-body compliant policy demonstrated on a heavy humanoid across a range of real-world force-aware tasks

A heavy humanoid recruits its lower body to yield compliantly to external forces at arbitrary contact sites — static and dynamic force reaction, board wiping, squatting under payload, and cooperative payload transport.

Abstract

Whole-body compliant control is essential for deploying heavy humanoids under high payload, yet prior force-aware learning pipelines stop at the end-effector or upper body, leaving arbitrary-site perturbations with lower-body engagement unaddressed. CompliantWBC closes this gap with three pieces: a multi-site whole-body impedance reference controller whose per-link virtual targets supervise an RL policy, a Force Latent Encoder that infers a wrench-grounded latent of the unobserved contact state, and a bounded residual that edits the impedance equilibrium — preserving classical Cartesian-impedance passivity by construction. We validate on a real 70 kg humanoid across static and dynamic force reaction, board wiping, squat under payload, and cooperative payload transport.

Overview

Video

Approach

Method

The two-stage CompliantWBC pipeline. Stage 1 jointly trains the force-aware base policy and the Force Latent Encoder against the analytical controller's compliance-fidelity reward; Stage 2 trains a bounded residual on the frozen base that edits the impedance equilibrium to compensate for physics discrepancies.

Whole-Body Impedance Reference Controller

An analytical controller composes centroidal-momentum impedance distributed to support contacts by a balancing QP, per-link Cartesian impedance at arbitrary sites, and an energy tank. It produces reference torques and virtual-target trajectories used as a reward signal — never cloned.

Force Latent Encoder & Residual on Equilibrium

A variational latent encoder, grounded in privileged wrench supervision and clustered by perturbation class behind a gradient barrier, conditions a bounded residual that edits the impedance equilibrium — not motor commands or gains — preserving analytic compliance under sim-to-real drift.

Phong-Weighted Sampling & Axis-Decoupled Pelvis Anchor

A Phong-weighted force-origin sampler with an axis-decoupled pelvis anchor — base-relative horizontal, gravity-anchored vertical — induces multi-contact, lower-body-inclusive compliance without a hand-crafted site or magnitude curriculum.

Reference Controller

A Multi-Site Whole-Body Impedance Heuristic

Classical Cartesian impedance couples a single end-effector to its commanded equilibrium. We extend it to a hierarchy of links — hands, elbows, knees, pelvis, torso — coupled by balance: per-link impedance is projected through a centroidal-balance null-space and realised by support contacts through a balancing QP. Knee contacts enter as extra support sites, generalising two-footed compliance to lower-body-inclusive postures. It runs only as a reference — its per-link virtual targets supervise the policy reward and are never cloned.
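As a minimal sketch of the per-link term only, assuming a standard spring-damper Cartesian impedance law and omitting the balance null-space projection, the QP, and the energy tank, the contributions of several links can be summed as below. The function name, dictionary keys, gains, and Jacobian are illustrative placeholders, not the authors' implementation.

```python
import numpy as np

def multi_site_impedance_torque(sites):
    """Sum J_i^T (K_i (x_eq_i - x_i) - D_i v_i) over all sites.

    Each site is a dict with hypothetical keys:
      J (3 x n Jacobian), K and D (3 x 3 gains), x_eq, x, v (3-vectors).
    """
    n_joints = sites[0]["J"].shape[1]
    tau = np.zeros(n_joints)
    for s in sites:
        # Virtual spring toward the equilibrium plus a damper on link velocity.
        force = s["K"] @ (s["x_eq"] - s["x"]) - s["D"] @ s["v"]
        # Map the Cartesian force to joint torques through the site Jacobian.
        tau += s["J"].T @ force
    return tau

# Toy example: one site on a 2-joint mechanism, offset 1 cm from its equilibrium.
site = {
    "J": np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]),
    "K": np.diag([400.0, 400.0, 400.0]),   # N/m, assumed value
    "D": np.diag([40.0, 40.0, 40.0]),      # N s/m, assumed value
    "x_eq": np.array([0.0, 0.01, 0.0]),
    "x": np.zeros(3),
    "v": np.zeros(3),
}
tau = multi_site_impedance_torque([site])
```

In the toy example the spring produces 4 N along the second axis, which the Jacobian maps onto the second joint only.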

Force-Aware Base Policy

Force Latent Encoder

A variational encoder maps a 160 ms window of proprioception, torques, and motion command to a 32-D force latent. Its mean is detached before the policy — a gradient barrier so PPO never reshapes it — while four auxiliary losses (wrench reconstruction, contrastive clustering by perturbation class, a KL bottleneck, and temporal smoothness) keep it physically grounded.
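Three of the four auxiliary terms can be illustrated numerically. The sketch below assumes a diagonal-Gaussian posterior, a standard-normal prior, and mean-squared-error forms for the reconstruction and smoothness terms; these forms, the function names, and the toy data are assumptions, and the contrastive term is omitted. In an autodiff framework the gradient barrier would be a stop-gradient on the latent mean (for example `.detach()` in PyTorch), which plain NumPy cannot express.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), averaged over the batch."""
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)
    return kl.mean()

def temporal_smoothness(mu_seq):
    """Mean squared change of the latent mean between consecutive steps."""
    return np.mean(np.sum(np.diff(mu_seq, axis=0) ** 2, axis=-1))

def wrench_reconstruction(wrench_pred, wrench_true):
    """Mean squared error against the privileged 6-D wrench."""
    return np.mean((wrench_pred - wrench_true) ** 2)

rng = np.random.default_rng(0)
mu = rng.normal(size=(16, 32))          # a short rollout of 32-D latent means
log_var = np.zeros((16, 32))            # unit variance for the toy example
wrench_true = rng.normal(size=(16, 6))  # toy privileged wrenches
wrench_pred = wrench_true + 0.1         # constant prediction error

losses = {
    "kl": kl_to_standard_normal(mu, log_var),
    "smooth": temporal_smoothness(mu),
    "wrench": wrench_reconstruction(wrench_pred, wrench_true),
}
```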

The result is a wrench-grounded latent that clusters by perturbation class — none, left, right, center, mixed — as the t-SNE shows.

t-SNE of the force latent: embeddings cluster by perturbation class (none, left, right, center, mixed)

Residual on Impedance Targets

A Bounded, Passivity-Constrained Residual

A second 3-layer MLP sits on top of the frozen base policy and frozen Force Latent Encoder. It reads the latent embedding and the decoded wrench estimate and outputs a bounded delta on the impedance equilibrium — a 5 cm cap on the virtual-target shift, with per-link stiffness held fixed by the curriculum. When the frozen wrench estimate is biased or noisy, the residual edits the impedance equilibrium to absorb the discrepancy without changing the base policy and without modulating gains.

Editing the equilibrium rather than the torque preserves the base controller's analytic compliance: a 5 cm shift induces at most stiffness × 5 cm of force — an explicit physical bound the residual cannot violate. Because the stiffness is fixed, the closed loop inherits classical Cartesian-impedance passivity by construction — a safety property for human–robot contact that action-residual baselines editing torque directly do not admit.
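The force bound can be made concrete with a small numerical sketch. It assumes the 5 cm cap applies to the Euclidean norm of the shift (the text does not say whether the cap is per-axis or on the norm) and uses an illustrative stiffness matrix; the function names are hypothetical.

```python
import numpy as np

MAX_SHIFT_M = 0.05  # 5 cm cap on the virtual-target shift (from the text)

def bound_equilibrium_shift(delta, max_shift=MAX_SHIFT_M):
    """Scale a raw 3-D residual so its Euclidean norm never exceeds max_shift."""
    norm = np.linalg.norm(delta)
    if norm <= max_shift:
        return delta
    return delta * (max_shift / norm)

def max_added_force(stiffness, max_shift=MAX_SHIFT_M):
    """Upper bound on the extra spring force: largest stiffness eigenvalue times the cap."""
    return np.max(np.linalg.eigvalsh(stiffness)) * max_shift

K = np.diag([400.0, 400.0, 600.0])     # N/m, assumed stiffness
raw = np.array([0.10, 0.0, 0.0])       # raw residual output of 10 cm
delta = bound_equilibrium_shift(raw)   # rescaled to 5 cm
extra_force = K @ delta                # force induced by the bounded shift
bound = max_added_force(K)             # 600 N/m * 0.05 m = 30 N
```

Whatever the residual outputs, the extra force stays within `bound` because the stiffness is fixed and the shift is capped.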

Force-Origin Sampling

Phong-Weighted Sampling & Compliant Virtual Targets

Upper-body links draw force origins from a reachability point cloud; the pelvis instead draws from a continuous Phong-weighted distribution lobed on the downward axis, modelling gravity acting on a carried load.

The anchor is axis-decoupled — base-relative horizontal, gravity-pinned vertical — so the policy learns to sink under load rather than fight it, with no hand-crafted site or magnitude schedule.
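One possible reading of the sampler and anchor is sketched below. It assumes the Phong weighting is a power-cosine lobe about the world downward axis, sampled by inverse CDF, and that "gravity-pinned vertical" means a fixed world-frame height; the exponent, offsets, and function names are all illustrative assumptions rather than the paper's definitions.

```python
import numpy as np

def sample_phong_directions(n_samples, exponent, rng):
    """Draw unit vectors with density proportional to cos(theta)**exponent,
    where theta is the angle from the downward axis (0, 0, -1)."""
    u = rng.random(n_samples)
    v = rng.random(n_samples)
    cos_t = u ** (1.0 / (exponent + 1.0))   # inverse CDF of the power-cosine lobe
    sin_t = np.sqrt(1.0 - cos_t**2)
    phi = 2.0 * np.pi * v
    # The lobe axis is -z, so the z component is -cos(theta).
    return np.stack([sin_t * np.cos(phi), sin_t * np.sin(phi), -cos_t], axis=1)

def pelvis_anchor(base_xy, base_yaw, offset_xy, anchor_height):
    """Horizontal part follows the base (offset rotated by yaw); vertical part
    is a fixed world-frame height, independent of the current pelvis height."""
    c, s = np.cos(base_yaw), np.sin(base_yaw)
    rot = np.array([[c, -s], [s, c]])
    xy = base_xy + rot @ offset_xy
    return np.array([xy[0], xy[1], anchor_height])

rng = np.random.default_rng(0)
dirs = sample_phong_directions(1000, exponent=8.0, rng=rng)
anchor = pelvis_anchor(np.array([1.0, 2.0]), np.pi / 2, np.array([0.1, 0.0]), 0.9)
```

A larger exponent concentrates the samples more tightly around the downward axis; an exponent of zero would give a uniform lower hemisphere.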

Training

A Two-Stage Reinforcement-Learning Pipeline

The pipeline deliberately avoids joint optimisation: under a single PPO loss the residual would absorb compliance behaviour the base could have learnt, making its bounded form a constraint on the joint system rather than an interpretable discrepancy compensator. The policy is deployed directly after Stage 2 with no real-rollout fine-tuning.

Stage 1

Force-Aware Base Policy & Force Latent Encoder

PPO jointly trains the latent encoder, the wrench-reconstruction head, and the base policy against the compliance-fidelity reward, internalising multi-site compliance behaviour.

Stage 2

Residual on the Frozen Base

With the base policy and wrench estimator frozen, PPO trains the bounded residual; the only way to increase the reward is to edit the impedance interface to absorb wrench-prediction error.

Training configuration

Base policy — 3-layer MLP, width 512
Force Latent Encoder — 3-layer MLP (512→512→256)
Latent — dim 32, history 16 steps (160 ms @ 100 Hz)
Residual — 3-layer MLP, width 256, 5 cm equilibrium bound
Aux-loss weights — λ_w 1.0, λ_s 0.3, λ_k 1e−3, λ_m 0.1
Optimiser — PPO (both stages)
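For readers who want the settings in machine-readable form, the list above could be captured roughly as follows. The class and field names are hypothetical; only the numeric values come from the list, and the meaning of each λ is left as stated there.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CompliantWBCConfig:
    # Values copied from the configuration list; field names are illustrative.
    base_layers: int = 3
    base_width: int = 512
    encoder_widths: tuple = (512, 512, 256)
    latent_dim: int = 32
    history_steps: int = 16
    history_rate_hz: float = 100.0
    residual_layers: int = 3
    residual_width: int = 256
    residual_bound_m: float = 0.05
    lambda_w: float = 1.0
    lambda_s: float = 0.3
    lambda_k: float = 1e-3
    lambda_m: float = 0.1

    @property
    def history_window_ms(self) -> float:
        """Length of the encoder history window in milliseconds."""
        return 1000.0 * self.history_steps / self.history_rate_hz

cfg = CompliantWBCConfig()
```

The derived `history_window_ms` property reproduces the 160 ms window from 16 steps at 100 Hz.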

Evaluation

Experiments

Positioning

Compliant Humanoid Whole-Body-Control Landscape

Among the pipelines compared below, CompliantWBC is the only one that combines lower-body compliance, multi-site contact, an explicit learned force latent, structured force sampling, and a residual on the impedance equilibrium. FALCON and GentleHumanoid can both be recovered from it as restricted special cases.

Method	Lower-body compliance	Multi-site	Force latent	Residual architecture	Humanoid platform
FALCON	–	EE	–	–	G1/Booster
GentleHumanoid	–	upper only	implicit	–	G1
FACET	quadruped	CoM	–	impedance ref.	Go2/G1
SoftMimic	✓	whole-body	–	–	G1
CHIP	–	EE	–	hindsight goal	G1
ResMimic	–	–	–	on motion	G1
ASAP	–	–	–	on action	G1
CompliantWBC (ours)	✓	✓	✓	on equilibrium	in-house, 70 kg

Ablation Study

Compliance–Tracking Trade-off in Simulation

We compare the full method against two baselines (the stiff whole-body tracker TWIST2 and the upper-body-compliant GentleHumanoid) and two ablations, each removing one component. Each controller runs 100 paired rollouts under a ramp–hold–release force protocol across a fixed pool of contact sites (wrist, elbow, torso, pelvis, hip, knee), split evenly between static and dynamic motions.

Controller	E_cmd^free (×10⁻²) ↓	E_cmd^force (×10⁻²) ↓	ρ_τ (×10⁻²) ↓	R_LB ↑	S ↑
TWIST2	2.18±0.31	3.14±0.42	2.74±0.38	0.16±0.04	0.59
GentleHumanoid	5.22±0.47	7.86±0.61	2.12±0.29	0.14±0.03	0.91
Ours w/o pelvis force sampling	4.12±0.38	5.89±0.51	1.87±0.24	0.16±0.03	0.93
Ours w/o residual policy	3.89±0.35	5.43±0.47	1.56±0.21	0.20±0.03	0.94
CompliantWBC (full)	4.65±0.41	6.57±0.53	0.93±0.15	0.31±0.04	0.98
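The ramp–hold–release protocol used for these rollouts can be pictured as a piecewise-linear force magnitude over time. The sketch below is one plausible form; the peak force and phase durations are made-up values, since they are not stated here, and the function name is hypothetical.

```python
import numpy as np

def ramp_hold_release(t, peak, t_ramp, t_hold, t_release):
    """Force magnitude at time t: linear ramp up, constant hold, linear ramp down."""
    t = np.asarray(t, dtype=float)
    up = peak * t / t_ramp
    down = peak * (1.0 - (t - t_ramp - t_hold) / t_release)
    force = np.where(t < t_ramp, up, np.where(t < t_ramp + t_hold, peak, down))
    # Clamp to zero before the start and after the release has finished.
    return np.clip(force, 0.0, peak)

times = np.array([0.0, 0.5, 1.0, 2.0, 3.5, 4.0, 5.0])
profile = ramp_hold_release(times, peak=50.0, t_ramp=1.0, t_hold=2.0, t_release=1.0)
```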

TWIST2 tracks best in free space but fights every perturbation (S = 0.59). GentleHumanoid yields with its upper body but shows the lowest lower-body participation (R_LB = 0.14). CompliantWBC (full) more than doubles R_LB over GentleHumanoid (0.31 vs. 0.14) and attains the lowest joint saturation ρ_τ and the highest success S = 0.98, at a free-tracking cost (E_cmd^free of 4.65 vs. 2.18 for TWIST2, both ×10⁻²) that stays below GentleHumanoid's 5.22. Removing pelvis force sampling collapses R_LB toward the upper-body-only regime (0.16). Removing the residual raises ρ_τ from 0.93 to 1.56 and lowers R_LB to 0.20, although both ablations track the command slightly more closely than the full method.

Joint torque under a pelvis perturbation. The compliant policy (orange) keeps lower-body torques far smaller than the non-compliant tracker (blue) and stays clear of the saturation bound, making concrete the resist-and-fail behaviour behind TWIST2's low success rate.

Qualitative Results

Compliant Yielding and Multi-Site Flexibility in Simulation

Two short simulated demonstrations make the compliance behaviour visible. The first is a side-by-side comparison of our compliant policy against a stiff tracker under the same upward arm force; the second shows the same single policy yielding to forces applied at three different contact sites.

Compliant policy vs. stiff policy

Under an identical upward force at the arm, the compliant policy recruits the whole body to yield, while the stiff policy resists and is driven off balance.

Ours (compliant)

Stiff policy

Multi-site flexibility

A single policy yields compliantly to forces applied at the shoulder, the wrist, and the pelvis — no per-site tuning.

Shoulder

Wrist

Pelvis

Real-World

Hardware Experiments

We deploy the trained policy on a real heavy humanoid via teleoperation across five hardware tasks. Static and dynamic force reaction are quantitative and directly comparable to the simulation ablation; board wiping, squat under payload, and cooperative payload transport are qualitative demonstrations of sustained whole-body compliance and lower-body participation.

Static force reaction

Measured, ramped pushes at wrist, torso, and pelvis while the robot holds a fixed pose — the primary quantitative real-world result.

Static and dynamic force reaction mirror the simulation ablation; full quantitative hardware tables are reported in the paper and will be added here for the camera-ready version.

Cite

BibTeX

Reference

BibTeX will be available once the paper is published.