👍 95
05/27 20:00
Scientific figures are among the most effective means of communicating complex research ideas, yet producing publication-quality illustrations remains one of the most labor-intensive parts of paper preparation. Existing automated systems each target a single figure type under text-only input, leavin
中文介绍 Crafter 提出了一个多智能体框架,用于从多样化输入生成可编辑的科学图形,以减轻制作高质量插图的劳动密集度。该方法结合了文本与非文本输入,通过设计灵活的生成策略来满足不同科学领域的需求,从而提高了图形生成的效率和可操作性。
👍 48
05/26 20:00
As agent capabilities advance, existing benchmarks, such as τ^2-Bench, are becoming increasingly saturated. Yet constructing new benchmark tasks remains complex, costly, and labor-intensive. Moreover, the standard approach, in which scenarios are first written in natural language and then mapped to
中文介绍 本研究提出了一种新的基准框架 A Matter of TASTE,以改善现有智能体基准的覆盖率和任务难度。论文指出,现有基准任务的构建既复杂又昂贵,并尝试通过简化任务生成过程来提高任务的多样性和实用性,以更好地评估智能体的能力。
👍 41
05/31 20:00
Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400
中文介绍 K-BrowseComp 是一个针对韩国语境的网页浏览智能体基准,包含400个任务。该基准评估了智能体在执行多步网页操作时的能力,以填补当前在韩语方面缺乏有效基准的空白,从而推动网页智能体的发展。
👍 24
05/27 20:00
Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, like EAGLE3 or DFlash is supervised fine-tuning (SFT) on target-generated trajectories. However
中文介绍 Draft-OPD 提出了一种新的策略,采用 On-Policy Distillation 来提升推测性解码模型的性能。该方法通过结合轻量级草稿模型与目标模型的协同验证,加快了大语言模型的推理速度,显著减少了推理时间,提高了效率。
👍 20
05/31 20:00
The recent "Reasoning with Video" paradigm utilizes Video Generation Models (VGMs) to generate temporally coherent visual trajectories to complete reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often struggle to understand and follow task-specific rules, leading to log
中文介绍 该论文探讨了 VLM 在视频推理中的应用,提出通过自适应的测试时优化来改善视频生成模型在特定任务中的表现。尽管现有的 VGM 在视觉质量上表现出色,但它们在遵循特定任务规则上仍然存在困难。因此,提出了一种新的优化方法来增强其推理能力。
👍 20
05/30 20:00
Large language model (LLM) agents increasingly rely on reusable external skills to solve long-horizon interactive tasks. Existing training-free skill adaptation pipelines usually update skills from full trajectories or session-level feedback, which makes failure attribution coarse and often produces
中文介绍 SkillAdaptor 提出了基于轨迹的自适应技能方法,以提高大语言模型智能体的任务解决能力。该方法通过有效利用完整的互动轨迹和会话级反馈,克服了现有技能适配过程中的粗糙失败归因问题,大幅提升了任务执行的效率和成功率。
👍 20
05/31 20:00
While video streaming understanding has made significant strides, real-world applications, such as live sports broadcasting, autonomous driving, and multi-screen collaboration, inherently demand continuous, multi-stream interactions. However, existing benchmarks are confined to single-stream paradig
中文介绍 X-Stream 研究了多流理解的可能性,尤其是在视频流理解的应用中。该研究解决了现有基准局限于单流交互的问题,通过探索多流数据的交互和处理,并为现实应用,如实时滚动播报,提供了新的评估框架。
👍 17
05/30 20:00
Humans can reproduce the viewpoint specified by a target image through active head and body motion, yet spatial intelligence in foundation models has largely been studied as passive understanding of pre-collected observations. We introduce Target Viewpoint Reproduction (TVR) -- an active task where
中文介绍 论文引入了目标视点复制(TVR)任务,以研究基础模型在主动探索中重现目标视点的能力。该研究通过人类的运动模仿挑战基础模型的空间智能,探索在动态环境中实现主动镜头控制的有效性。
👍 17
05/23 20:00
Standard next-token prediction (NTP) supervises language models solely through discrete labels in the output logit space. We argue that this sparse one-hot supervision leaves the latent representation space under-constrained, allowing hidden states to drift into degenerate and anisotropic configurat
中文介绍 NITP 提出了用于大语言模型预训练的下一隐含令牌预测方法。该研究指出,传统的稀疏监督方法限制了潜在表示空间,导致模型表示能力不足,提出的新方法有效改进了潜在表示的质量。
👍 13
05/21 20:00
Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-agent LLM workflows improves over their base model
中文介绍 此研究探讨了多智能体强化学习在大语言模型工作流程中的作用,分析了在不同工作流程、规模和策略共享条件下的表现。通过对比实验,揭示了强化学习的训练模式对模型准确性的提升影响。
👍 12
05/31 20:00
The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and has been rapidly adopted across personal applications and development platforms. However, existing benchmarks predominantly focus on generic
中文介绍 MCP-Persona 提出了通过环境模拟对大语言模型智能体进行真实世界个人应用的基准测试,适应模型上下文协议(MCP)以提高其在具体应用中的有效性。这一研究推动了个性化应用的评估标准化。
👍 11
05/31 20:00
In open-ended environments, exploration is fundamental for autonomous agents, yet current language model agents struggle with this. Effective exploration requires memory, but retaining raw interaction histories is computationally expensive over long trajectories. While latent memory offers a solutio
中文介绍 本论文探讨了开放环境中的智能体记忆与探索学习,提出通过新颖性信号促使智能体更好地进行探索。该方法克服了传统记忆方法的计算开销瓶颈,提高了长轨迹下的交互能力。
👍 11
05/24 20:00
Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive gene
中文介绍 StreamChar 开发了一种长时间流式角色音视频生成的方法,以实现角色动画的实时生成。该方法通过解耦的编排策略,克服了生成实时音视频所需满足的严格时限要求,提高了流畅度与一致性。
👍 9
05/31 20:00
Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This creates an irreversible generation trajectory:
中文介绍 LongLive-RAG 提出了一种通用的长视频生成增强检索框架,以应对自回归视频扩散中的长序列生成问题。研究表明,采用滑动窗口注意力的策略可有效减少生成过程中的错误累积,提高生成质量。
👍 9
05/31 20:00
Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-training over large coll
中文介绍 OpenWebRL 研究了在线多回合强化学习在视觉网页智能体中的应用,强调了长远推理与动态网站交互的重要性。该框架旨在通过开放源代码的方式促进视觉智能体的开发与普及。
👍 9
05/28 20:00
Long-horizon search agents accumulate large amounts of retrieved content across many tool calls, making context-budget efficiency increasingly important. A minimal intervention is to mask stale observations from the context as the trajectory progresses, but it remains unclear when this form of conte
中文介绍 该论文探索了在有效的长远搜索中,如何通过掩盖过时观察来提高智能体的上下文预算效率。通过研究不同场景下的效果,构建了一个机理地图,以揭示何时应采用此类策略。
👍 9
05/28 20:00
LLM agents increasingly retrieve externally curated skills-procedural instructions retrieved at decision time-to improve performance on long-horizon interactive tasks. Existing skill libraries are typically treated as model-agnostic, reusing the same skill formulations across backbones with substant
中文介绍 Skill is Not One-Size-Fits-All 讨论了针对长时交互任务优化智能体技能的方法,强调模型感知的技能对齐的重要性。研究表明,所提出的框架更好地适应了不同模型背景下的技能集成,提高了任务成功率。
👍 6
05/29 20:00
Vision-Language Models (VLMs) have shown strong visual understanding and are increasingly deployed in embodied AI systems, where reliable perception under real conditions is essential. However, existing benchmarks assess VLMs using clean images or isolated perturbations rather than stresses caused b
中文介绍 RoboStressBench 提供了一种基准,评估视觉语言模型在物理视觉压力下的鲁棒性。该研究通过引入真实复杂场景中的压力测试,推动了对视觉语言模型可靠性的深入理解。
👍 5
05/30 20:00
Reusable skills are a key mechanism for extending agent capabilities, allowing agents to accumulate experience and solve increasingly complex tasks. Yet most existing skill-learning methods store reusable experience as text-only assets, such as instructions, reasoning traces, or summarized trajector
中文介绍 本研究强调,智能体技能应超越文本,倡导引入视觉技能以扩展智能体的能力。通过分析现有技能学习方法的局限性,提供了将视觉经验与文本指令结合的新思路,以满足复杂任务的需求。
👍 5
05/28 20:00
Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and game-based benchmarks often compress interaction into short-horizon tas
中文介绍 MineExplorer 评估了多模态大型语言模型在 Minecraft 开放世界探索中的表现,研究指出当前的基准无法充分反映智能体在动态环境下的持久探索能力。该论文为改进未来智能体的开放性探索提供了新视角。