👍 115
06/16 20:00
While 10B-level industrial foundation models have pushed the boundaries of image inpainting, their prohibitive computational costs severely hinder practical deployment. Constructing a highly optimized task-specific specialist offers a promising solution; however, extreme structural compression inevi
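Summary: This paper presents Moebius, a lightweight image inpainting framework that retains 10B-level performance while substantially reducing computational cost. It does so by building a highly optimized task-specific specialist. The work matters because it addresses the high compute overhead that keeps large foundation models from practical deployment.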
👍 69
06/12 20:00
Dexterous interaction with articulated objects is important for household, assistive, and humanoid manipulation, where multi-finger hands can provide compliant contact patterns beyond parallel-jaw grasping. However, articulated-object manipulation differs from static-object manipulation: the target
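Summary: This paper presents DragMesh-2, which enables physically plausible interaction between dexterous hands and articulated objects, targeting multi-finger manipulation in household and assistive settings. DragMesh-2 accounts for the dynamics of the interaction and demonstrates flexible grasping of articulated objects, offering new ideas for human-robot collaboration and assistive robots.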
👍 55
06/17 20:00
LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, constantly adding fresh problems to the set, and filtering them by release dates, LCB provides contamination-aware ev
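Summary: Multi-LCB extends LiveCodeBench to evaluate code generation across multiple programming languages. By curating competitive programming problems and continually adding new ones, it provides contamination-aware evaluation of large language models and offers a new view of model performance in multilingual settings.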
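The release-date filtering mentioned in the LiveCodeBench abstract above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the `Problem` record, its field names, the sample data, and the `training_cutoff` parameter are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; field names are illustrative, not LCB's schema.
    problem_id: str
    release_date: date


def contamination_free(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.release_date > training_cutoff]


# Made-up sample data for illustration only.
problems = [
    Problem("p1", date(2023, 5, 1)),
    Problem("p2", date(2024, 2, 10)),
    Problem("p3", date(2024, 9, 3)),
]

fresh = contamination_free(problems, training_cutoff=date(2024, 1, 1))
print([p.problem_id for p in fresh])
```

The idea is that a model cannot have seen a problem published after its training data was collected, so scoring only on that subset reduces the risk of measuring memorization.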
👍 48
06/16 20:00
Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency for perception tasks that require captioning multiple regions. In this work, we propose PerceptionDL
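Summary: PerceptionDLM is a multimodal diffusion language model for efficient region-level perception tasks. Compared with existing autoregressive models, it exploits parallel decoding to substantially improve efficiency when captioning multiple regions. This advance is significant for visual understanding and image annotation applications.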
👍 44
06/16 20:00
Current agentic robot systems can write executable Code-as-Policy programs, observe feedback, and revise behavior across multiple attempts, but they remain largely task-driven: reusable skills are acquired only after explicit instructions. We study Playful Agentic Robot Learning, where an embodied c
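Summary: This paper studies Playful Agentic Robot Learning, focusing on how robots can observe feedback and revise their behavior over multiple attempts. Unlike conventional task-driven methods, it lets robots learn reusable skills without explicit instructions, advancing autonomous self-improvement with broad potential applications.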
👍 38
06/17 20:00
Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce S-Agent, a spatial tool-use agentic paradigm for underst
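Summary: S-Agent is a spatial tool-use agent framework for real-world spatial reasoning. It supports reasoning over dynamic environments and, compared with conventional static inference, shows greater adaptability in visual understanding and spatial intelligence applications, pointing to a direction for spatial intelligence research.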
👍 30
06/17 20:00
Advances in radiance fields have enabled photorealistic novel view synthesis. In several domains, large-scale real-world datasets have been developed to support comprehensive benchmarking and to facilitate progress beyond scene-specific reconstruction. However, for distractor-free radiance fields, a
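Summary: DF3DV-1K is a large-scale dataset and benchmark for distractor-free novel view synthesis. It supports the development of radiance-field synthesis methods and provides a new standard for evaluating distractor-free radiance fields, advancing image synthesis and computer vision.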
👍 30
06/17 20:00
Agent benchmarks are growing fast, but no single benchmark touches more than four or five of the dimensions that deployment exposes. This paper aggregates the largest coordinated deep-dive of one MCP-based industrial-agent benchmark to date: fourteen parallel implementation studies covering new asse
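Summary: This paper examines the evaluation of LLM agents and proposes measures of predictive validity that go beyond static leaderboards. Through a coordinated deep-dive of 14 implementation studies, it builds a more comprehensive evaluation framework to support future testing and optimization of agent systems.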
👍 26
06/17 20:00
Style-content dual-reference generation aims to synthesize an image that preserves the structure and semantics of a content reference while adopting the style of a separate style reference. Despite recent progress, this setting remains challenging because models must balance content fidelity, style a
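Summary: FreeStyle is a framework for style-content dual-reference image generation, which synthesizes an image that preserves the structure and semantics of a content reference while adopting the style of a separate style reference. The setting remains challenging because models must balance content fidelity with the target style.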
👍 12
06/14 20:00
Personalized presentation generation requires more than conditioning on a current prompt or template: agents must preserve stable user preferences across tasks, retain newly introduced preferences and constraints during multi-turn revision, and carry out local edits reliably. We propose MemSlides, a
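Summary: MemSlides is a hierarchical memory-driven agent framework for personalized slide generation and multi-turn revision. It keeps user preferences stable across tasks as those preferences evolve, offering new ideas for building intelligent authoring tools.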
👍 11
06/14 20:00
Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that imp
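Summary: ContextRL is a context-aware reinforcement learning method that addresses the difficulty large language models have in extracting decisive evidence from complex contexts. By making use of context information, it substantially improves accuracy on complex questions and points to a direction for agent and reasoning systems.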
👍 11
06/17 20:00
Achieving dexterous robotic manipulation in the real world heavily relies on human supervision and algorithm engineering, which becomes a central bottleneck in the pursuit of general physical intelligence. Although emerging coding agents can generate code to automate algorithm search, their successe
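Summary: ENPIRE pursues dexterous real-world robot manipulation through autonomous policy improvement, addressing the bottleneck created by heavy reliance on human supervision. The robot improves its manipulation ability step by step through autonomous algorithm learning, a step toward general physical intelligence with strong practical value.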
👍 10
06/16 20:00
Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory pool and query it under different roles, scopes, and relationships, so
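Summary: GateMem benchmarks memory agents shared by multiple principals, focusing on how to manage a common memory pool across different roles and scopes. The study offers a detailed view of how shared assistants operate in multi-user settings and has important implications for the design of future assistants.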
👍 10
06/16 20:00
Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized
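Summary: FAPO performs fully autonomous prompt optimization for multi-step LLM pipelines, addressing failures that arise from interactions among retrieval, reasoning, and formatting steps. The framework improves the overall efficiency of the pipeline and supports the design and application of multi-task models.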
👍 10
06/14 20:00
Visual thinking should not only sound right; it should show its evidence. While recent vision-language models (VLMs) can produce natural-language reasoning traces, these traces often leave the supporting image regions implicit, making them hard to verify and difficult to supervise. We introduce visu
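Summary: This paper introduces the notion of visual grounding for reasoning, stressing that visual thinking should show its evidence. By developing new vision-language models, it makes the image regions that support each reasoning step explicit, which eases verification and lays groundwork for interpretable AI.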
👍 8
06/17 20:00
Embodied foundation models are expected to benefit from data scaling like large language models, but face a much tighter data bottleneck. Teleoperated real-robot trajectories remain the dominant pretraining source due to their precise action supervision and embodiment alignment, yet their scalabilit
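Summary: HumanScale examines video pretraining for embodied foundation models, noting that robots still depend on human-operated trajectories for effective pretraining. The study suggests that data from virtual environments may be more advantageous for improving real-robot performance, offering new ideas for how embodied agents are trained.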
👍 7
06/17 20:00
FP4 training promises substantial reductions in memory and computation cost for LLM pretraining, yet current FP4 hardware paths and recipes, including NVIDIA Blackwell/Rubin-class systems and AMD MI350-series GPUs, remain centered on E2M1 data elements. In this study, we identify a fundamental limit
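Summary: This paper analyzes shrinkage bias in FP4 pretraining of LLMs, examining its geometric origin and systematic effects. By identifying a fundamental limitation of current FP4 paths, the study offers practical guidance for improving training recipes and the efficiency and quality of large language models.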
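For readers unfamiliar with the E2M1 element format named in the abstract above, the sketch below enumerates its representable values and rounds inputs to the nearest one. It assumes the common FP4 layout of 1 sign bit, 2 exponent bits with bias 1, and 1 mantissa bit, with no infinities or NaNs. It illustrates only the number grid and is not the paper's analysis; the function names are hypothetical, and real training recipes also apply per-block scaling, which is omitted here.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (sign | 2 exponent bits | 1 mantissa bit)."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal range: the value is either 0 or 0.5.
        return sign * man * 0.5
    # Normal range: implicit leading 1, exponent bias of 1.
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)


# All distinct representable values (+0 and -0 collapse to one entry).
grid = sorted({decode_e2m1(c) for c in range(16)})
positive = [v for v in grid if v >= 0]


def quantize(x: float) -> float:
    """Round x to the nearest representable value, saturating at +/-6."""
    return min(grid, key=lambda v: abs(v - x))


print(positive)          # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print(quantize(2.4))     # 2.0
print(quantize(100.0))   # 6.0
```

The spacing between neighboring values grows with magnitude (0.5 near zero, 2.0 between 4 and 6), which is why the choice of scaling before quantization matters so much at this precision.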
👍 6
06/17 20:00
Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies. Task states consist of relevant facts, identifiers, constraints, and conditions observed through user interaction and tool calls. In standard agents
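Summary: LedgerAgent proposes structured state management for maintaining task state in tool-calling agents for customer service. By consolidating the information observed through user interaction, it improves task execution and marks a step toward more efficient service agents.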
👍 6
06/16 20:00
Progress in legal AI increasingly depends on access to authoritative legal text at scale. Yet one of the most consequential layers of American law remains largely absent from existing machine-readable corpora: local ordinances. Local codes govern zoning, housing, business licensing, public health, n
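Summary: LOCUS builds a corpus of local codes covering American local ordinances, filling a gap in legal AI's access to authoritative legal text. The work focuses on standardizing local regulations and provides a foundation for machine-readable legal text, supporting AI applications in law.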
👍 6
06/14 20:00
Recent retrieval-augmented generation (RAG) approaches have demonstrated strong capability in handling complex queries, yet current research overlooks a critical challenge: different retrievers require fundamentally different query formulation strategies for optimal performance. In this work, we pre
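Summary: This paper examines behavior understanding in environment-aware information retrieval and proposes a pre-retrieval strategy that matches query formulation to the needs of different retrievers. The work improves the handling of complex queries, advances retrieval-augmented generation (RAG), and has important implications for retrieval strategy design.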