2025-07-17 Papers


Paper 1

PhysX: Physical-Grounded 3D Asset Generation

Published: 2025-07-16

Link: http://arxiv.org/pdf/2507.12465

1. 📘 Topic and Domain: Physical-grounded 3D asset generation, combining computer vision, 3D modeling, and physics simulation.
2. 💡 Previous Research and New Ideas: Building on existing 3D datasets such as ShapeNet and PartNet, which focus mainly on geometry and appearance, this paper introduces the first comprehensive physics-annotated 3D dataset and generation framework.
3. ❓ Problem: Current 3D generative models overlook physical properties of objects, limiting their real-world applications in simulation and embodied AI.
4. 🛠️ Methods: Developed PhysXNet (a physics-annotated 3D dataset with 26K objects) via a human-in-the-loop annotation pipeline, and PhysXGen (a dual-branch framework) that jointly models geometry and physics during generation.
5. 📊 Results and Evaluation: The framework outperformed baselines across multiple metrics including geometry quality (PSNR, CD, F-Score) and physics predictions (scale, material, affordance, kinematics, descriptions), while maintaining good generalization capability.
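The five physical property groups annotated per object (point 4 above) could be carried on a record like the following hypothetical schema; the field names and units are illustrative assumptions, not the dataset's actual format:

```python
from dataclasses import dataclass

@dataclass
class PhysicsAnnotation:
    # Hypothetical per-object record mirroring PhysXNet's five property groups
    absolute_scale: float   # object size (metres assumed for illustration)
    material: str           # e.g. "wood", "metal", "plastic"
    affordance_rank: dict   # affordance name -> priority score
    kinematics: dict        # joint type and motion range parameters
    description: str        # free-text functional description

asset = PhysicsAnnotation(
    absolute_scale=0.45,
    material="plastic",
    affordance_rank={"grasp": 0.9, "press": 0.3},
    kinematics={"joint": "revolute", "range_deg": (0, 120)},
    description="kettle lid that rotates open",
)
```

A record like this is what a human-in-the-loop pipeline (VLM proposal plus expert correction) would fill in per object.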

Workflow (figure summary):

Phase 1: PhysXNet dataset
• Raw 3D assets (PartNet) pass through a human-in-the-loop annotation pipeline (VLM + expert review).
• Annotated physical properties: absolute scale, material, affordance, kinematics, function description.
• Result: PhysXNet (26K objects) and PhysXNet-XL (6M).

Phase 2: PhysXGen framework
• Image input feeds a dual-branch VAE: structural encoder and physical encoder produce a structural latent and a physical latent.
• Latent generation: structural flow transformer and physical flow transformer, jointly trained with a CFM loss.
• Decoding: structural decoder and physical decoder with a residual connection and joint optimization.
• Output: a physical 3D asset with geometry & texture, material properties, kinematic parameters, physical dimensions, affordance ranking, and function descriptions.
• Applications: simulation, robotics, embodied AI.
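As a toy illustration of the residual connection between the two decoder branches, the physical branch below is conditioned on the structural branch's features, reflecting the idea that physics correlates with geometry. This is not the authors' implementation; both function bodies and all numbers are made up:

```python
# Toy sketch of a dual-branch decode with a residual connection.

def structural_decoder(z_struct):
    # stand-in decode: scale the structural latent into "geometry features"
    return [2 * v for v in z_struct]

def physical_decoder(z_phys, struct_feats):
    # residual connection: physics prediction conditioned on geometry features
    return [p + 0.5 * s for p, s in zip(z_phys, struct_feats)]

z_struct = [0.1, -0.2, 0.3]   # structural latent (toy values)
z_phys = [0.0, 0.5, -0.1]     # physical latent (toy values)

geometry = structural_decoder(z_struct)
physics = physical_decoder(z_phys, geometry)
```

Training both branches jointly, rather than bolting a physics head onto a frozen geometry model, is what lets the framework exploit this correlation.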
Q1
1. What is the key innovation of PhysXNet compared to previous 3D datasets?
It has more 3D objects than any previous dataset
It contains comprehensive physics-based annotations including material, scale, and kinematics
It focuses only on geometric properties of 3D objects
Q2
2. How does PhysXGen achieve both physical accuracy and geometric quality in generated 3D assets?
By completely replacing geometric features with physical properties
By using separate networks for physics and geometry
By using a dual-branch architecture that jointly models correlations between physical and structural features
Q3
3. What is the scale difference between the base PhysXNet and PhysXNet-XL datasets?
PhysXNet-XL has 10 times more objects
PhysXNet-XL has 100 times more objects
PhysXNet-XL has over 200 times more objects

Paper 2

MMHU: A Massive-Scale Multimodal Benchmark for Human Behavior Understanding

Published: 2025-07-16

Link: http://arxiv.org/pdf/2507.12463

1. 📘 Topic and Domain: This paper introduces MMHU, a large-scale multimodal benchmark dataset for understanding human behavior in autonomous driving scenarios.
2. 💡 Previous Research and New Ideas: Previous research focused on individual aspects of human behavior (motion, intention, trajectory) in driving, but this paper proposes the first unified comprehensive dataset combining multiple behavior aspects with rich annotations.
3. ❓ Problem: The lack of a unified benchmark dataset for evaluating algorithms that comprehensively understand human behaviors in autonomous driving scenarios, which is crucial for driving safety.
4. 🛠️ Methods: The authors developed a human-in-the-loop annotation pipeline to collect and label 57k human instances from diverse video sources (Waymo, YouTube, self-collected), providing motion data, trajectories, text descriptions, and critical behavior labels.
5. 📊 Results and Evaluation: Training on the dataset improved performance across multiple tasks: motion prediction error (MPJPE) dropped by 9.49, intention prediction accuracy increased by 7.4%, behavior QA accuracy rose by 15.96%, and motion generation showed clear qualitative improvements in driving scenarios.
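The human-in-the-loop pipeline (point 4 above) can be sketched as a seed-then-scale loop: label a small subset by hand, fine-tune a VLM on it, then auto-label the rest. The labeler functions below are hypothetical stand-ins, not the authors' actual models:

```python
# Hedged sketch of a human-in-the-loop annotation loop.

def human_label(clips):
    # stand-in for expert annotation of the seed subset
    return {c: f"human:{c}" for c in clips}

def finetune_vlm(seed_labels):
    # stand-in for VLM fine-tuning; returns an "auto-labeler"
    return lambda clip: f"vlm:{clip}"

def annotate(clips, human_fraction=0.1):
    n_seed = max(1, int(len(clips) * human_fraction))
    seed = human_label(clips[:n_seed])          # ~10% human-labeled subset
    vlm = finetune_vlm(seed)                    # fine-tune VLM on the seed
    auto = {c: vlm(c) for c in clips[n_seed:]}  # automated labeling at scale
    return seed, auto

seed, auto = annotate([f"clip{i}" for i in range(20)])
```

The payoff is scale: a 10% hand-labeled seed can bootstrap labels for the remaining 90% with only quality-assurance spot checks.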

Workflow (figure summary):

• Data collection: Waymo dataset (73K frames), YouTube videos (318K frames), self-collected (2.4M frames); total 1.73M frames, 57K instances.
• Video processing: human detection & filtering, video cutting (>10 seconds), individual tracking, frame-rate unification (10 FPS).
• Motion reconstruction: SMPL parameter extraction, motion completion (interpolation), trajectory generation, 3D human motion sequences.
• Hierarchical text annotation: low-level descriptions (joint-wise motion details) and high-level descriptions (semantic behavior summaries).
• Critical behavior recognition: 13 driving-critical behaviors (walking pets, talking, using phone, crossing street, wheelchair, bicycle, scooter, skateboard, motorcycle, umbrella, headphones, carrying items, using stroller).
• Human-in-the-loop annotation pipeline: 10% human-labeled subset, VLM fine-tuning, automated labeling, quality assurance, scalable annotation.
• Dataset splits: MMHU-V 47K (VLM-labeled), MMHU-H 9.5K (human-labeled), MMHU-T 840 (testing).
• Supported tasks: motion prediction (historical to future motion, MPJPE evaluation, trajectory forecasting, physical plausibility); motion generation (text to motion, FID & multi-modality, driving-scene specific, data augmentation); behavior VQA (multimodal understanding, 13 critical behaviors, binary classification, safety-oriented); intention prediction (street-crossing intent, temporal analysis, accuracy & F1-score).
• Key achievements: a comprehensive human behavior understanding benchmark for autonomous driving; significant performance improvements across all tasks when training with MMHU; a scalable annotation pipeline with minimal human effort.
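The "motion completion (interpolation)" step above can be illustrated on a 1-D track with missing frames; real SMPL completion operates on full pose parameters, so this is only a minimal sketch of the gap-filling idea:

```python
# Minimal linear-interpolation sketch for filling gaps in a tracked value.

def complete_track(frames):
    """Fill None gaps by linear interpolation between known frames."""
    out = list(frames)
    known = [i for i, v in enumerate(out) if v is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            t = (i - a) / (b - a)            # fractional position in the gap
            out[i] = out[a] + t * (out[b] - out[a])
    return out

track = [0.0, None, None, 3.0, None, 5.0]   # detections lost at frames 1, 2, 4
filled = complete_track(track)
```

After unifying clips to a common frame rate (10 FPS in the figure), interpolation like this yields continuous motion sequences suitable for prediction and generation tasks.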
Q1
1. What is the primary innovation of the MMHU dataset compared to previous datasets?
It has more video hours than any previous dataset
It unifies multiple aspects of human behavior with comprehensive annotations
It only focuses on accident scenarios
Q2
2. How many critical behaviors does MMHU recognize and label in its annotation system?
7 behaviors
10 behaviors
13 behaviors
Q3
3. What unique annotation approach did the authors use to bridge the gap between SMPL parameters and semantic descriptions?
Direct manual annotation by experts
Fully automated AI labeling
Hierarchical text annotation with low-level and high-level descriptions

Paper 3

SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?

Published: 2025-07-16

Link: http://arxiv.org/pdf/2507.12415

1. 📘 Topic and Domain: Evaluating Large Language Models' ability to optimize code performance in real-world software repositories through the SWE-Perf benchmark.
2. 💡 Previous Research and New Ideas: Builds on code-correctness benchmarks such as SWE-Bench and on function-level optimization work; introduces the first repository-level code performance optimization benchmark.
3. ❓ Problem: Addressing the gap in evaluating LLMs' capability to enhance code performance at the repository level, which requires more complex optimization than function-level improvements.
4. 🛠️ Methods: Created SWE-Perf benchmark with 140 curated instances from GitHub pull requests, including codebases, target functions, tests, and expert patches, evaluated under both file-level (oracle) and repo-level (realistic) settings.
5. 📊 Results and Evaluation: All tested LLMs showed significant performance gaps compared to expert-level optimization, with OpenHands performing best but still trailing expert performance by 8.59 percentage points (2.26% vs. 10.85%), highlighting substantial room for improvement in LLMs' code optimization capabilities.
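The benchmark's runtime-verification protocol (warm-up runs, 20 repetitions, outlier filtering) can be sketched as a timing harness. The 3x-median outlier rule below is an illustrative stand-in, not SWE-Perf's exact criterion:

```python
import statistics
import time

# Hedged sketch of a stability-oriented timing harness.

def measure(fn, reps=20, warmup=3):
    for _ in range(warmup):      # warm-up runs absorb cache/startup effects
        fn()
    times = []
    for _ in range(reps):        # 20 repetitions for statistical stability
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    med = statistics.median(times)
    # drop runs far above the median as outliers (illustrative 3x rule)
    kept = [t for t in times if t <= 3 * med]
    return statistics.mean(kept)

runtime = measure(lambda: sum(range(1000)))
```

Without warm-up and outlier filtering, single-run timings are too noisy to distinguish a genuine optimization from measurement jitter, which is why the pipeline discards unstable instances.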

Workflow (figure summary):

• Phase 1, collect pull requests: crawl PRs from popular GitHub repos and filter by attributes (102K reduced to 19.8K PRs).
• Phase 2, measure performance: build Docker environments, execute unit tests, record runtimes (34K codebases).
• Phase 3, identify optimizations: filter performance-related PRs, select relevant tests, apply a ratio < 0.3 threshold (1.7K instances).
• Phase 4, verify improvements: add warm-up, run 20 repetitions, filter outliers (140 final instances).
• Task formulation: input is a codebase plus target functions; output is a performance-optimization patch; two settings, oracle (file-level) and realistic (repo-level).
• Evaluation methods: the oracle setting uses direct LLM prompting across 10 popular models; the realistic setting uses Agentless (pipeline) and OpenHands (agent).
• Three-level evaluation metrics: Apply (patch applies successfully), Correctness (all unit tests pass), Performance (statistical runtime gain).
• Key findings: significant gap between LLMs and experts; OpenHands outperforms other methods; expert performance 10.85% vs. best model performance 2.26%.
• Dataset characteristics: 140 instances from 9 repositories; average 447 files and 170K lines per codebase; average performance ratio 10.9%; expert patches edit 131 lines on average.
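The three-level metric can be viewed as a gate: a patch only earns a performance score if it first applies cleanly and then passes every unit test. A toy version, with the gain formula as an illustrative assumption rather than the benchmark's exact statistic:

```python
# Toy gate mirroring the Apply -> Correctness -> Performance metric levels.

def evaluate(patch_applies, tests_pass, old_runtime, new_runtime):
    if not patch_applies:                       # level 1: patch must apply
        return {"apply": False, "correct": False, "gain": 0.0}
    if not tests_pass:                          # level 2: all tests must pass
        return {"apply": True, "correct": False, "gain": 0.0}
    # level 3: relative runtime gain, clamped so regressions score zero
    gain = max(0.0, (old_runtime - new_runtime) / old_runtime)
    return {"apply": True, "correct": True, "gain": gain}

result = evaluate(True, True, old_runtime=10.0, new_runtime=8.0)
```

Gating this way means a fast-but-broken patch scores nothing, which is what makes repository-level optimization harder than raw speedup hunting.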
Q1
1. What is the main innovation of SWE-Perf compared to previous code optimization benchmarks?
It focuses on individual function optimization
It evaluates repository-level performance optimization
It only tests code correctness
Q2
2. In the data collection process, how many repetitions were performed to ensure runtime stability?
3 repetitions
10 repetitions
20 repetitions
Q3
3. Which approach showed the best performance in optimizing code among the tested methods?
Agentless pipeline-based approach
Direct model oracle approach
OpenHands agent-based system