2026-02-19 Papers


Paper 1

Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation

Published: 2026-02-18

Link: http://arxiv.org/pdf/2602.16705

1. 📘 Topic and Domain: The paper focuses on humanoid robot control for open-vocabulary visual loco-manipulation, combining perception and whole-body control for object grasping tasks.
2. 💡 Previous Research and New Ideas: The paper builds on existing humanoid tracking methods and vision-language models, proposing HERO, a modular system that combines accurate residual-aware end-effector tracking with large pre-trained vision models for generalization.
3. ❓ Problem: The paper addresses the challenge of enabling humanoid robots to accurately manipulate novel objects in novel environments using only onboard sensors, which requires both precise end-effector control and strong visual generalization.
4. 🛠️ Methods: The authors use a modular approach combining inverse kinematics, learned neural forward models for accurate pose estimation, motion planning with replanning, goal adjustment, and integration with pre-trained vision models (Grounding DINO, SAM, AnyGrasp).
5. 📊 Results and Evaluation: HERO achieves 2.5 cm end-effector tracking error (3.2× better than baselines) and an 83.8% success rate in real-world open-vocabulary object grasping across diverse objects and environments.
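The residual-aware tracking idea in the methods above can be sketched as analytical forward kinematics plus a learned correction. This is a minimal illustration, not the paper's implementation: the FK chain and the residual model here are hypothetical stand-ins (a toy FK map and a linear residual), and the function names are invented for this sketch.

```python
import numpy as np

def analytical_fk(joint_angles):
    # Placeholder analytical forward kinematics mapping joint angles to an
    # end-effector position. A real humanoid would chain link transforms here.
    return np.array([np.sum(np.cos(joint_angles)),
                     np.sum(np.sin(joint_angles)),
                     0.1 * np.sum(joint_angles)])

def residual_model(joint_angles, weights):
    # Stand-in for the learned residual network η: a linear map from joint
    # angles to a small 3-D position correction, trained on real robot data.
    return weights @ joint_angles

def corrected_fk(joint_angles, weights):
    # Residual-aware FK: analytical estimate plus a learned correction that
    # absorbs systematic hardware errors (reported as ~1.76 cm on the G1).
    return analytical_fk(joint_angles) + residual_model(joint_angles, weights)

q = np.zeros(7)
w = np.zeros((3, 7))  # untrained residual: correction is zero
assert np.allclose(corrected_fk(q, w), analytical_fk(q))
```

The point of the decomposition is that the analytical model carries most of the structure, so the learned component only has to fit a small, smooth error term.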

Figure: HERO workflow. RGB-D input and a language query feed open-vocabulary perception (Grounding DINO, SAM-3) and grasp synthesis (AnyGrasp); the resulting grasp is retargeted to an end-effector goal pose, converted to upper-body goals via inverse kinematics, planned into a reference trajectory with cuRobo, corrected by neural forward models (FK correction η, odometry ξ), and tracked by a whole-body MLP policy πt over 29 DoF. Goal adjustment and replanning run every 6 seconds until the object is grasped, combining systematic error correction with closed-loop replanning.
Q1. What was the primary innovation that enabled HERO to achieve 3.2× better end-effector tracking accuracy compared to previous methods?
a) Using a larger neural network with more parameters for direct end-to-end control
b) Combining residual neural forward models with classical robotics components like inverse kinematics
c) Training exclusively on real-world data collected from human demonstrations

Q2. Why did the authors find it necessary to develop learned neural forward kinematics models instead of using analytical forward kinematics?
a) Analytical forward kinematics on the G1 humanoid had systematic errors of 1.76 cm due to hardware inaccuracies
b) The robot's geometry was too complex to compute analytical forward kinematics
c) Neural models were faster to compute than analytical methods during real-time control

Q3. What was a surprising limitation discovered about the humanoid's egocentric vision during whole-body reaching motions?
a) The RGB-D camera resolution was too low to detect small objects accurately
b) Objects often went out of view due to large body movements, making closed-loop visual adjustment infeasible
c) The vision models couldn't process depth information fast enough for real-time control

Paper 2

Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?

Published: 2026-02-15

Link: http://arxiv.org/pdf/2602.14111

1. 📘 Topic and Domain: Evaluating Sparse Autoencoders (SAEs) for neural network interpretability, specifically testing whether SAEs genuinely learn meaningful feature decompositions in language models.
2. 💡 Previous Research and New Ideas: Based on prior work using SAEs to decompose neural activations into interpretable features (Bricken et al., 2023), the paper proposes novel sanity checks using random baselines to test if SAEs truly discover meaningful features or merely optimize metrics.
3. ❓ Problem: The paper addresses the fundamental question of whether SAEs actually recover true underlying features in neural networks or if their apparent success on standard metrics is misleading.
4. 🛠️ Methods: Two complementary approaches: (1) synthetic experiments with known ground-truth features to test SAE recovery, and (2) comparing fully-trained SAEs against three frozen random baselines (Frozen Decoder, Soft-Frozen Decoder, Frozen Encoder) on real LLM activations.
5. 📊 Results and Evaluation: SAEs recover only 9% of true features despite 71% explained variance in synthetic tests; frozen random baselines match fully-trained SAEs on interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72), suggesting SAEs don't reliably learn meaningful features.
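The Frozen Decoder baseline in the methods above can be sketched as a TopK-style sparse autoencoder whose decoder is fixed at its random initialization while only the encoder would be trained. This is a minimal illustration under assumed dimensions and a simplified single-input TopK rule; the variable names and sizes are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, k = 16, 64, 4  # illustrative sizes, not the paper's

# Encoder weights would be trained; decoder weights stay frozen at their
# random initialization (the "Frozen Decoder" baseline).
W_enc = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_model, d_sae))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)  # unit-norm columns

def encode_topk(x):
    # TopK-style sparsity: keep only the k largest pre-activations,
    # rectified, and zero out the rest.
    pre = W_enc @ x
    z = np.zeros_like(pre)
    idx = np.argsort(pre)[-k:]
    z[idx] = np.maximum(pre[idx], 0.0)
    return z

def reconstruct(x):
    # Reconstruction through the frozen random decoder.
    return W_dec @ encode_topk(x)

x = rng.normal(size=d_model)
z = encode_topk(x)
assert np.count_nonzero(z) <= k   # sparsity constraint holds
assert reconstruct(x).shape == (d_model,)
```

The paper's finding is that baselines of this shape, where the decoder directions are never learned, match fully-trained SAEs on downstream interpretability metrics, which is why reconstruction quality alone is a weak sanity check.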

Figure: SAE sanity-check workflow. The research question (do SAEs learn meaningful features?) is tested in two case studies: (1) synthetic data with known ground-truth features, training BatchTopK and JumpReLU SAEs; and (2) real LLM activations, comparing fully-trained SAEs against three frozen random baselines (Frozen Decoder, Soft-Frozen Decoder, Frozen Encoder). Evaluation covers feature recovery, explained variance, interpretability, sparse probing, and causal editing. Synthetic results: 71% explained variance but only 9% feature recovery. Real LLM results: frozen baselines match fully-trained SAEs. Conclusion: SAEs fail to reliably decompose model mechanisms; reconstruction quality does not imply meaningful feature learning, and random components achieve comparable performance. Recommendation: use frozen baselines as sanity checks for SAEs.
Q1. In the synthetic experiment with known ground-truth features, what paradoxical result did the authors discover about SAE performance?
a) SAEs achieved 71% explained variance but only recovered 9% of true features
b) SAEs recovered 71% of features but only achieved 9% explained variance
c) SAEs failed completely with both 0% explained variance and 0% feature recovery

Q2. What is the 'Soft-Frozen Decoder' baseline designed to test?
a) Whether SAEs can function without any decoder weights at all
b) Whether SAEs operate in a 'lazy training' regime where decoder vectors remain close to random initialization
c) Whether SAEs require perfectly orthogonal decoder vectors to achieve interpretability

Q3. What surprising finding emerged when comparing frozen random baselines to fully-trained SAEs on real LLM activations?
a) Frozen baselines completely failed on all metrics, proving SAEs are essential
b) Frozen baselines matched fully-trained SAEs across interpretability, sparse probing, and causal editing metrics
c) Frozen baselines outperformed fully-trained SAEs by learning better features through random initialization

Paper 3

Experiential Reinforcement Learning

Published: 2026-02-14

Link: http://arxiv.org/pdf/2602.13949

1. 📘 Topic and Domain: The paper focuses on reinforcement learning for language models, specifically in the domain of agentic reasoning and sparse-reward environments.
2. 💡 Previous Research and New Ideas: The paper builds on standard reinforcement learning with verifiable rewards (RLVR) and proposes Experiential Reinforcement Learning (ERL), which adds an explicit experience-reflection-consolidation loop where models generate self-reflections to guide improved second attempts.
3. ❓ Problem: The paper addresses the challenge of learning from sparse and delayed environmental feedback in reinforcement learning, where models struggle to implicitly infer how failures should translate into behavioral improvements.
4. 🛠️ Methods: ERL employs a four-stage process: initial attempt, self-reflection generation based on feedback, refined second attempt guided by reflection, and internalization through selective distillation to consolidate improvements into the base policy.
5. 📊 Results and Evaluation: ERL outperforms RLVR across all tested environments, achieving gains of up to +81% in Sokoban, +27% in FrozenLake, and +11% in HotpotQA, demonstrating improved learning efficiency and final performance.
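The four-stage ERL loop above, including the gated reflection that only fires when the first attempt fails (r^(1) < τ), can be sketched as a single rollout function. The `policy` and `env` interfaces below are hypothetical stand-ins invented for this sketch, not the paper's API, and the RL and distillation updates are only indicated in comments.

```python
def erl_episode(policy, env, task, tau=1.0):
    """One ERL rollout, sketched: attempt -> (gated) reflect -> retry.

    `policy.act`, `policy.reflect`, and `env.evaluate` are illustrative
    interfaces assumed for this example.
    """
    y1 = policy.act(task)                  # first attempt y^(1)
    feedback, r1 = env.evaluate(task, y1)  # sparse environment reward r^(1)
    if r1 >= tau:
        # Gated reflection: successful attempts skip the reflection step,
        # keeping the on-policy learning signal clean and avoiding
        # reward hacking on trajectories that already succeeded.
        return [(y1, r1)]
    # Self-reflection conditioned on the failed attempt and its feedback.
    delta = policy.reflect(task, y1, feedback, r1)
    y2 = policy.act(task, reflection=delta)  # refined second attempt y^(2)
    _, r2 = env.evaluate(task, y2)
    # Both attempts would feed the RL update L_policy; a selective
    # distillation loss L_distill later internalizes the
    # reflection-conditioned behavior into the base policy.
    return [(y1, r1), (y2, r2)]
```

A trainer would collect these rollouts and apply the policy and distillation losses over them; the gating threshold τ decides which trajectories ever enter the reflection branch.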

Figure: ERL workflow. Given an input task x, the policy π_θ produces a first attempt y^(1) ~ π_θ(·|x); the environment returns feedback (f^(1), r^(1)); the model generates a self-reflection Δ ~ π_θ(·|x, y^(1), f^(1), r^(1), m) conditioned on memory m; a second attempt y^(2) ~ π_θ(·|x, Δ) is evaluated by the environment; the RL update L_policy(θ) and an internalization loss L_distill(θ) then consolidate the improvement into the base policy, closing the experience → reflection → consolidation loop.
Q1. What is the key innovation that distinguishes Experiential Reinforcement Learning (ERL) from standard RLVR?
a) ERL uses larger language models with more parameters to improve performance
b) ERL embeds an experience-reflection-consolidation loop where models generate self-reflections to guide improved attempts
c) ERL provides denser rewards by adding intermediate checkpoints in the environment

Q2. Why does ERL employ a 'gated reflection' mechanism that only triggers when the first attempt fails (r^(1) < τ)?
a) To reduce computational costs by limiting the number of reflection steps
b) To prevent reward hacking and maintain stable on-policy learning signals for successful trajectories
c) To ensure the model always generates exactly two attempts per task

Q3. In which environment did ERL show the most dramatic improvement over RLVR, and what was the likely reason?
a) HotpotQA with +11% improvement, because it had the most complex language understanding requirements
b) FrozenLake with +27% improvement, because it had the simplest grid structure
c) Sokoban with up to +81% improvement, because it required long-horizon planning and recovery from compounding errors