2025-11-18 Papers


Paper 1

P1: Mastering Physics Olympiads with Reinforcement Learning

Published: 2025-11-17

Link: http://arxiv.org/pdf/2511.13612

1. 📘 Topic and Domain: Development of large language models (P1) specialized in physics reasoning and solving Physics Olympiad problems, in the domain of artificial intelligence and scientific reasoning.
2. 💡 Previous Research and New Ideas: Building on recent advances in LLMs for scientific reasoning, the paper introduces reinforcement learning techniques for physics problem-solving and proposes a multi-stage training framework with adaptive learnability adjustment.
3. ❓ Problem: Addresses the challenge of developing open-source language models capable of mastering complex physics problems at the Olympiad level, requiring deep scientific reasoning rather than simple pattern matching.
4. 🛠️ Methods: Employs reinforcement learning with Group Sequence Policy Optimization (GSPO), adaptive learnability adjustment, and test-time scaling through an agentic framework called PhysicsMinions.
5. 📊 Results and Evaluation: P1-235B-A22B achieved gold-medal performance at IPhO 2025 and gold-medal scores in 12 of the 13 competitions evaluated, while P1-30B-A3B reached silver-medal performance, surpassing most open-source models; combining P1 with PhysicsMinions yields further improvements.
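The GSPO update at the heart of the training recipe can be sketched in a few lines. This is an illustrative reimplementation of the published GSPO idea (sequence-level, length-normalized importance ratios with group-normalized advantages and clipping), not the authors' code; the function name and argument layout are our own.

```python
import math

def gspo_objective(seq_logps_new, seq_logps_old, seq_lens, rewards, eps=0.2):
    """Sketch of a Group Sequence Policy Optimization (GSPO) objective.

    seq_logps_*: summed token log-probs of each sampled response under the
    new / behavior policy; seq_lens: response lengths in tokens; rewards:
    binary rewards r in {0, 1} for the group sampled from one problem.
    """
    G = len(rewards)
    # Group-normalized advantage (relative to the group's mean reward).
    mean_r = sum(rewards) / G
    std_r = (sum((r - mean_r) ** 2 for r in rewards) / G) ** 0.5
    objs = []
    for lp_new, lp_old, n, r in zip(seq_logps_new, seq_logps_old,
                                    seq_lens, rewards):
        adv = (r - mean_r) / (std_r + 1e-8)
        # Sequence-level importance ratio, length-normalized:
        # s = (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|)
        s = math.exp((lp_new - lp_old) / n)
        clipped = max(min(s, 1 + eps), 1 - eps)
        objs.append(min(s * adv, clipped * adv))
    return sum(objs) / G  # maximize this (negate for a loss)
```

With identical old and new policies the ratios are all 1 and the group-mean baseline makes the objective vanish, which is a quick sanity check on any implementation.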


Figure overview:

- Physics dataset: 5,065 problems (81% Olympiads, 19% textbooks) with rule-verifiable answers and expert solutions.
- RL formulation: MDP (S, A, P, r) with binary reward r ∈ {0, 1}; policy-gradient optimization via the GSPO algorithm.
- Multi-stage RL training: Stage 1 (G=16, 48k tokens), Stage 2 (G=32, 48k tokens), Stage 3 (G=32, 64k tokens), Stage 4 (235B only, 80k tokens).
- Adaptive learnability adjustment: pass-rate filtering (0 < pass ≤ 0.7), group-size expansion, generation-window expansion.
- Training stabilization: truncated importance sampling (TIS), train-inference mismatch mitigation.
- P1 models: P1-30B-A3B (silver medal), P1-235B-A22B (gold medal).
- PhysicsMinions agentic framework: Logic Studio (solver + introspector), Review Studio (physics + general verifier), iterative refinement (CV=2).
- Evaluation on the HiPhO benchmark (13 physics Olympiads, 2024-2025):
  - P1-235B-A22B: 21.2/30 at IPhO 2025; 12 gold, 1 silver; first open-source gold medal; rank #3 globally.
  - P1-30B-A3B: 18.5/30 at IPhO 2025; 8 gold, 4 silver, 1 bronze; silver-medal performance; rank #8 overall.
  - P1 + PhysicsMinions: 23.2/30 at IPhO 2025; overall #1 position, ahead of Gemini-2.5-Pro and GPT-5; average score 38.4.
- Generalizability: superior performance on math, STEM, and coding tasks vs. base models; transferable reasoning.
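The adaptive learnability adjustment listed above keeps only problems in a "learnable" pass-rate band. A minimal sketch of that filtering step (function and argument names are ours):

```python
def filter_by_learnability(problems, pass_rates, low=0.0, high=0.7):
    """Keep only problems in the 'learnable' band: the policy sometimes
    succeeds (pass rate > low) but has not saturated (pass rate <= high).

    Problems the model always fails give no positive signal under a
    binary reward; problems it nearly always solves give little gradient.
    """
    return [p for p, rate in zip(problems, pass_rates)
            if low < rate <= high]
```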
Q1
1. What unique approach did P1 take in its training process compared to other language models?
- It was trained exclusively using traditional supervised learning
- It was trained entirely through reinforcement learning with adaptive learnability adjustment
- It was trained using a combination of supervised learning and imitation learning
Q2
2. When the P1-235B-A22B model was integrated with PhysicsMinions, what was its most significant achievement?
- It reached bronze medal performance in Physics Olympiads
- It matched human performance but couldn't exceed it
- It achieved the overall No. 1 position across all models, outperforming leading closed-source models
Q3
3. What unexpected benefit was discovered about P1's training approach?
- It improved the model's performance only in physics
- It made the model perform worse in other domains
- It enhanced the model's reasoning abilities across multiple domains, including mathematics and coding

Paper 2

Part-X-MLLM: Part-aware 3D Multimodal Large Language Model

Published: 2025-11-17

Link: http://arxiv.org/pdf/2511.13647

1. 📘 Topic and Domain: A part-aware 3D multimodal large language model (Part-X-MLLM) for understanding and manipulating 3D shapes at the part level in computer vision and graphics.
2. 💡 Previous Research and New Ideas: Previous research focused on scene-level 3D understanding or holistic shape generation, while this paper proposes a unified framework that treats parts as first-class citizens and provides a single executable interface for part-based operations.
3. ❓ Problem: Addresses the lack of a native language model that can understand, name, and manipulate 3D object parts while providing precise spatial grounding and executable programs for downstream geometry tasks.
4. 🛠️ Methods: Uses a dual-encoder architecture (structure + semantics) with an autoregressive decoder to generate structured programs containing part-level bounding boxes and edit commands, followed by specialized geometry modules for execution.
5. 📊 Results and Evaluation: Achieved superior performance on UniPart-Bench across 11 task families, with significant improvements in bounding box generation (74.11% voxel recall, 48.74% voxel IoU) and consistent gains in part-level Q&A and grounding tasks over baseline models.
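The voxel recall and IoU figures above compare predicted and ground-truth occupancy. A minimal sketch of how such metrics can be computed from sets of occupied voxel indices (our own helper, not the benchmark's evaluation code):

```python
def voxel_recall_iou(pred_voxels, gt_voxels):
    """Compute (recall, IoU) between two occupancy grids given as
    iterables of occupied (x, y, z) voxel index tuples."""
    pred, gt = set(pred_voxels), set(gt_voxels)
    inter = len(pred & gt)
    recall = inter / len(gt) if gt else 0.0          # fraction of GT covered
    union = len(pred | gt)
    iou = inter / union if union else 0.0            # overlap / union
    return recall, iou
```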


Figure overview:

- Input data: point clouds of shape (40960, 6) with XYZ + normals and (10240, 6) with XYZ + RGB, plus a natural-language prompt.
- Structure encoder: XYZ + normals → geometry features; semantic encoder: XYZ + RGB → appearance features; the two streams are fused.
- Autoregressive decoder: Qwen2.5-VL model generating structured programs with <box> and <edit> tokens.
- Training: Stage 1 geometry pretraining (3.6M objects, 10 epochs, bounding-box prediction task, structure encoder only); Stage 2 full instruction tuning (dual encoder + LLM, 85,771 annotated objects, 11 task families).
- Part-aware generation: bounding-box-guided synthesis, OmniPart backend, high-fidelity meshes, semantic granularity control.
- Grounded Q&A: bounding-box token embedding, persistent references, part-level reasoning, spatial understanding.
- Localized editing: cuboid mask generation, Nano3D/VoxHammer backends, add/delete/modify operations, precise spatial control.
- Advanced applications: confidence-aware face segmentation, part clustering, hierarchical structure.
- Structured-planning grammar: <boxs>...<boxe> for bounding boxes, <adds>/<dels>/<mods> for edit operations, quantized coordinates (128 bins).
- UniPart-Bench: 30k entries across 11 task families; bounding-box IoU 42.55%, voxel recall 74.11%, SBERT 78.98.
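The grammar above quantizes box coordinates into 128 discrete bins so they can be emitted as tokens. A sketch of the round trip, assuming coordinates normalized to [0, 1] (the paper's exact binning convention may differ):

```python
def quantize(coord, bins=128):
    """Map a normalized coordinate in [0, 1] to a discrete bin index
    in {0, ..., bins - 1}, suitable for emission as a coordinate token."""
    return min(int(coord * bins), bins - 1)

def dequantize(idx, bins=128):
    """Recover the bin-center coordinate from a bin index."""
    return (idx + 0.5) / bins
```

With 128 bins the round-trip error is bounded by half a bin width (1/256), which is why coarse token grids can still ground boxes usefully.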
Q1
1. What is the key innovation in Part-X-MLLM's architecture that helps it better understand both structure and appearance of 3D objects?
- A single unified encoder for all features
- A dual-encoder design separating structure and semantics
- A triple-encoder system with dedicated texture analysis
Q2
2. How does Part-X-MLLM handle different levels of part detail in 3D objects?
- By using fixed predefined part categories
- Through manual user intervention
- By clustering part bounding boxes based on text similarity
Q3
3. What percentage improvement did Part-X-MLLM achieve in voxel recall compared to the PartField baseline?
- About 4.5%
- About 5.8%
- About 7.1%

Paper 3

PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image

Published: 2025-11-17

Link: http://arxiv.org/pdf/2511.13648

1. 📘 Topic and Domain: Generating simulation-ready 3D physical assets from single images, in the domain of computer vision and 3D graphics.
2. 💡 Previous Research and New Ideas: Based on previous 3D generation and physical modeling research, introduces a novel VLM-based approach with a new efficient 3D representation that reduces tokens by 193x while maintaining structural information.
3. ❓ Problem: Existing 3D generation methods lack physical and articulation properties needed for simulation, limiting their use in embodied AI applications.
4. 🛠️ Methods: Uses a multi-round VLM conversation to generate physical descriptions and geometry, with a controllable flow transformer for fine-grained details, and introduces PhysX-Mobility dataset with rich physical annotations.
5. 📊 Results and Evaluation: Achieves superior performance across geometric and physical metrics compared to state-of-the-art methods, with 99% improvement in absolute scale accuracy and strong generalization to real-world images.
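To make "simulation-ready" concrete: the pipeline exports URDF, the XML robot-description format consumed by common physics simulators. Below is a toy emitter for that format; it is not the paper's exporter, and the link/joint names in the usage are illustrative only.

```python
def make_urdf(object_name, parts, joints):
    """Emit a minimal URDF string for an articulated object.

    parts: list of link names; joints: list of
    (name, parent, child, joint_type) tuples, e.g. a revolute lid hinge.
    Real assets would also carry geometry, inertia, and joint limits.
    """
    lines = [f'<robot name="{object_name}">']
    for link in parts:
        lines.append(f'  <link name="{link}"/>')
    for name, parent, child, jtype in joints:
        lines.append(f'  <joint name="{name}" type="{jtype}">')
        lines.append(f'    <parent link="{parent}"/>')
        lines.append(f'    <child link="{child}"/>')
        lines.append('  </joint>')
    lines.append('</robot>')
    return "\n".join(lines)

urdf = make_urdf("box", ["base", "lid"],
                 [("hinge", "base", "lid", "revolute")])
```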


Figure overview:

- Input: a single real-world image.
- VLM: multi-round conversation with a fine-tuned Qwen2.5.
- Physical representation: overall information (structure, scale, materials) plus geometry information on a 32³ voxel grid (193x compression).
- Controllable flow transformer: fine-grained geometry synthesis.
- Physical representation decoder: mesh segmentation and format conversion.
- Simulation-ready outputs: 3D meshes, URDF files, XML files, 3D Gaussians, radiance fields, part-level meshes.
- PhysX-Mobility dataset: 47 categories, 2K+ objects, rich physical annotations, 2x category expansion.
- Applications: robotic policy learning, physics simulation, embodied AI.
- Key innovations: 193x token compression, no special tokens, VLM-based pipeline.
- Evaluation: PhysX-Mobility benchmark, in-the-wild testing, user studies.
- Pipeline: voxel tokenization → multi-round dialogue → flow transformer → format decoder → URDF/XML export.
- Performance highlights: 99% improvement in absolute scale, direct simulator deployment, contact-rich manipulation tasks.
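The token compression comes from representing geometry on a coarse 32³ occupancy grid and merging neighboring occupied indices. A sketch of such run-length merging over flattened voxel indices (our own illustration of the idea; the paper's exact merging scheme may differ):

```python
def merge_runs(occupied_indices):
    """Merge consecutive flattened voxel indices into (start, length) runs,
    so a long contiguous occupied stretch costs one (start, length) pair
    instead of one token per voxel."""
    runs = []
    for idx in sorted(occupied_indices):
        if runs and idx == runs[-1][0] + runs[-1][1]:
            # Extends the current run by one voxel.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((idx, 1))
    return runs
```

Solid objects voxelized at 32³ are dominated by long occupied runs, which is where the bulk of the savings comes from.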
Q1
1. What is the main innovation in PhysX-Anything's token compression strategy?
- Using a new tokenizer and special tokens
- Converting meshes to a 32³ voxel grid and merging neighboring indices
- Applying vertex quantization only
Q2
2. How does PhysX-Anything generate the geometric information for different parts of an object?
- Generates all parts simultaneously using parallel processing
- Uses the previous part's information to generate the next part
- Generates each part independently based only on shared overall information
Q3
3. What unique capability does PhysX-Anything demonstrate in the experimental results?
- Fastest processing speed among all 3D generation methods
- Direct deployment in physics simulators for robotic policy learning
- Perfect photorealistic rendering of 3D objects