👍 104
06/22 20:00
Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolutio
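Summary: This paper examines progress in memory systems for large language model (LLM) agents, focusing on their evolution from simple retrieval-augmented mechanisms into data management systems that support persistent information storage, retrieval, and dynamic lifecycle governance. It emphasizes the ability to update and consolidate information during agent execution, which marks how far memory functionality has matured. Significance: an important signal for combining agents with memory systems, with implications for long-term memory management strategies during agent execution.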
👍 44
06/23 20:00
Real-world photography requires capture-time guidance for both camera framing and subject pose. Yet existing aesthetic cropping benchmarks mainly evaluate post-hoc crop prediction and overlook subject-side recommendations, leaving the capture-time guidance capabilities of multimodal large language m
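Summary: Proposes ShutterMuse, a method for capture-time photography guidance that covers both camera framing and subject pose. Existing work mainly performs aesthetic evaluation after the fact and overlooks guidance at capture time. By improving both theory and models, the study addresses the shortfall of multimodal large language models in capture-time guidance, which matters for photography applications.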
👍 42
06/24 20:00
Modern Vision-Language-Action (VLA) models often fail to generalize to novel setups, such as altered camera viewpoints or robot morphologies, because they are typically conditioned only on current observations and language instructions. By ignoring the underlying system configuration as a variable,
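Summary: The paper proposes a new Vision-Language-Action (VLA) model that, when controlling robots, takes the system configuration into account in context to improve generalization. Compared with conventional approaches, which condition only on current observations and language instructions, the model adapts better to novel setups and provides a stronger foundation for practical robot control. Significance: strengthens agents' ability to carry out tasks in dynamic environments and improves the effectiveness of robot control.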
👍 40
06/24 20:00
Outcome-based reinforcement learning provides a stable optimization backbone for language agents, but its sparse trajectory-level rewards provide little guidance on which intermediate decisions should be reinforced or suppressed. On-policy self-distillation offers dense token-level supervision, yet
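Summary: OPID (On-Policy Skill Distillation) is a new method for agentic reinforcement learning. It builds on outcome-based reinforcement learning and adds self-distillation to supply dense token-level supervision, aiming to overcome the lack of guidance on intermediate decisions caused by sparse rewards. The study shows the method's potential to improve learning efficiency and decision optimization, opening a new path for practical agentic reinforcement learning.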
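To make the sparse-reward problem in the OPID abstract concrete, here is a minimal sketch of how an outcome-only signal is typically spread over a trajectory. It assumes a group-normalized advantage (in the style of group-relative policy optimization), which is an illustrative choice and not taken from the paper; the function name `outcome_advantages` is hypothetical.

```python
from statistics import mean, pstdev


def outcome_advantages(rewards, lengths, eps=1e-6):
    """Broadcast one trajectory-level advantage to every token.

    rewards: one scalar outcome reward per sampled trajectory (assumed).
    lengths: number of generated tokens in each trajectory (assumed).
    Returns one list of per-token advantages per trajectory.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Every token in a trajectory receives the same value, so the signal
    # cannot distinguish good intermediate decisions from bad ones.
    return [[(r - mu) / (sigma + eps)] * n for r, n in zip(rewards, lengths)]


adv = outcome_advantages([1.0, 0.0], [3, 2])
print(adv)
```

Because each inner list is constant, any token-level credit has to come from another source, which is the role the abstract assigns to dense self-distillation supervision.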
👍 38
06/24 20:00
While text-to-image (T2I) models have achieved remarkable progress, they struggle with real-world requests that are often underspecified, implicit, or dependent on up-to-date knowledge. We identify this challenge as the Context Gap: the mismatch between the user context and the sufficient generation
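Summary: Qwen-Image-Agent addresses the Context Gap in text-to-image generation in order to handle real-world image requests. The study identifies the mismatch between the user's context and what generation requires, and proposes a new method to bridge it, making generated images more adaptable. Significance: the method improves text-driven image generation and may have an impact in areas such as advertising and games.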
👍 37
06/23 20:00
A classical intuition holds that verifying a solution is easier than producing one. For today's coding agents, this intuition is being inverted: as foundation models develop stronger reasoning capabilities and engineering harnesses grow more sophisticated, generating complex candidate solutions is n
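Summary: This study examines the challenges of rewarding coding agents, emphasizing how hard verification becomes once agents generate complex candidate solutions. As the reasoning capabilities of foundation models improve, agent design also needs to be revisited to suit more complex tasks, which offers a new perspective on evaluating coding agents. Significance: provides useful insight into assessing the capabilities of coding agents and advances research in the area.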
👍 37
06/24 20:00
A unified representation for text and vision is a natural pursuit, as it enables simpler multimodal modeling and more efficient training. However, representing images as discrete signals in the same way as text inevitably introduces severe information loss. Existing work struggles to balance low-lev
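Summary: ViQ is a text-aligned visual quantized representation that aims to unify the representation of text and vision and simplify multimodal modeling. The paper discusses how to avoid severe information loss when representing images as discrete signals, thereby improving the training efficiency of multimodal systems. Significance: improves visual understanding, with important implications for agents and reasoning.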
👍 34
06/23 20:00
Synthesizing a novel-view video from a monocular reference video along a target camera trajectory requires both geometric consistency and motion fidelity with respect to the reference video. Existing methods based on explicit 3D representations are limited by the accuracy of off-the-shelf reconstruc
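Summary: MVTrack4Gen proposes using multi-view point tracking as geometric supervision for 4D video generation, with a focus on preserving consistency with the reference video and motion fidelity. The study shows that controlling geometric consistency and motion fidelity improves novel-view video synthesis from a monocular video. Significance: offers new ideas for applying video synthesis in film production, virtual reality, and related fields.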
👍 26
06/24 20:00
Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling
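Summary: This paper proposes JetSpec, a parallel tree-drafting strategy for accelerating autoregressive large language models (LLMs). Speculative decoding improves speed, but it is limited by the draft budget: a larger budget helps only when the acceptance rate stays high and drafting overhead stays low. The results highlight the inherent challenges of improving throughput and offer useful insight for practical deployment.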
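The scaling limit described for speculative decoding can be illustrated with the standard back-of-the-envelope model for a single linear draft chain. This sketch is not JetSpec's tree-drafting method; it assumes each drafted token is accepted independently with probability `alpha`, and that one draft step costs `draft_cost` relative to one target-model forward pass. Both parameter names are illustrative.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass with k drafted tokens,
    assuming independent per-token acceptance probability alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per pass divided by the relative cost of that pass
    (k draft steps plus one target-model verification)."""
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1.0)


for k in (2, 4, 8):
    print(k, round(estimated_speedup(0.8, k, 0.1), 3),
          round(estimated_speedup(0.3, k, 0.1), 3))
```

Under these assumptions, a longer draft pays off at `alpha = 0.8` but lowers the estimated speedup at `alpha = 0.3`, which matches the abstract's point that a larger draft budget helps only while acceptance stays high and drafting stays cheap.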
👍 25
06/21 20:00
Computer-use agents can execute software tasks through either graphical interfaces or programmatic command interfaces, but existing evaluations confound interaction modality with differences in tasks, initial states, verifiers, and permitted actions. We introduce a matched execution-layer benchmark
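Summary: This study compares computer-use agents that execute software tasks through graphical interfaces with those that use command interfaces. It introduces a matched execution-layer benchmark that removes the confounding effects of differences in tasks, initial states, and permitted actions, yielding a more precise evaluation. Significance: provides a solid basis for evaluating and optimizing computer-use agents and may advance intelligent software agents.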
👍 25
06/23 20:00
Fine-grained visual reasoning requires multimodal large language models (MLLMs) to identify task-relevant visual evidence and ground their reasoning in local image regions. Existing agentic methods typically rely on reinforcement learning with verifiable rewards or supervised fine-tuning on large-sc
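Summary: V-Zero is a novel unsupervised visual reasoning framework that uses contrastive evidence gating to achieve fine-grained visual reasoning. It has multimodal large language models (MLLMs) identify task-relevant visual evidence and ground their reasoning in local image regions, overcoming existing methods' reliance on rewards or supervised fine-tuning. Significance: the method helps strengthen visual reasoning and agents' capacity for autonomous learning.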
👍 24
06/18 20:00
Generating a coherent multi-shot video requires structured cross-shot memory. Subject appearance, scene context, and speaker identity must persist across cuts. Existing approaches either train end-to-end over fixed-length sequences and cannot scale, generate shot-by-shot with memory banks that grow
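Summary: UnityShots proposes a memory-driven approach to multi-shot audio-video generation that centers on structured cross-shot memory. By introducing boundary-aware gating, it keeps subject appearance, scene context, and speaker identity consistent across cuts. Significance: brings a higher level of coherence to audio-video generation and advances multimodal generation.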
👍 21
06/23 20:00
Joint-Embedding Predictive Architectures (JEPAs), including recent LeWorldModel (LeWM), have become a promising foundation for reconstruction-free visual world models. For visual planning, however, LeWM evaluates candidate action sequences by repeatedly applying a local one-step latent transition mo
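Summary: Fast LeWorldModel builds on LeWorldModel (LeWM), a Joint-Embedding Predictive Architecture (JEPA) for reconstruction-free visual world modeling. LeWM evaluates candidate action sequences by repeatedly applying a local one-step latent transition model; Fast LeWorldModel aims to make visual planning more efficient. Significance: lays groundwork for seamless visual planning and gives an important boost to agent applications.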
👍 15
06/24 20:00
As agentic systems continue to evolve and are widely deployed in real-world scenarios, there is a growing demand to faithfully evaluate their capabilities. However, current benchmarks are typically built on popular applications with relatively simple tasks and focus on a narrow set of capabilities w
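Summary: This study re-evaluates agent capabilities in unfamiliar environments, pointing out that current benchmarks usually cover only a narrow set of capabilities and relatively simple tasks. The authors argue for more comprehensive evaluation frameworks that faithfully reflect how agents perform across diverse environments. Significance: offers a new direction for assessing agent capabilities in real-world use and supports standardization in the field.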
👍 15
06/23 20:00
Tool use enables large language models (LLMs) to perform complex tasks, and recent agentic reinforcement learning (RL) methods show promise for enhancing model capabilities. However, RL alone often leads to instability or limited gains in tool-use tasks. In our experiments, some models exhibit catas
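Summary: This paper examines the collapse of reinforcement learning for multi-step tool use and proposes repairing it by introducing supervised signals. The study finds that reinforcement learning alone often leads to instability in tool-use tasks, and new experimental results show that the improved method performs significantly better at tool use. Significance: offers new ideas for applying reinforcement learning to complex tasks.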
👍 13
06/24 20:00
The prevalent dual-branch paradigm, i.e., training a side network to encode visual conditions and fusing its intermediate-layer features to a frozen pretrained main network, has shown remarkable success in visual-condition controllable generation. Despite its widespread adoption, the role of the sid
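Summary: LISA is a novel method for visual-condition controllable generation that uses a dual-branch architecture with conditional probability alignment. The study examines how the side network's features are fused into the main network and clarifies the role this plays in visual-condition generation. Significance: offers a new perspective on controllable visual generation models and brings computer vision and generative modeling closer together.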
👍 9
06/24 20:00
Reasoning capability has advanced rapidly in large language models (LLMs), leading to an increasing size of key-value (KV) cache in both prefilling and decoding stages. Existing KV cache compression methods mainly rely on attention weights to estimate token importance. While attention effectively ca
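Summary: This study examines reasoning in large language models (LLMs) and the problem of compressing the key-value (KV) cache during long reasoning, emphasizing the limitations of how existing methods estimate token importance. It proposes a new compression method to address information loss during reasoning. Significance: valuable for research on improving LLM reasoning efficiency on complex tasks.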
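As a reference point for the baseline the abstract mentions, the following is a minimal sketch of attention-weight-based KV cache eviction. It illustrates the existing family of methods, not this paper's proposal; the scoring rule (accumulated attention per cached token), the `recent` window, and the function name are all assumptions made for illustration.

```python
def select_kv_to_keep(attn, budget, recent=0):
    """Pick which cached tokens to keep under a fixed budget.

    attn[q][t] is the attention weight from query q to cached token t.
    Tokens are scored by accumulated attention; the `recent` newest
    tokens are always kept (capped at the budget).
    Returns the sorted indices of the tokens to keep.
    """
    n = len(attn[0])
    scores = [sum(row[t] for row in attn) for t in range(n)]
    recent = min(recent, budget)
    keep = set(range(n - recent, n))
    # Fill the remaining budget with the highest-scoring tokens.
    for t in sorted(range(n), key=lambda i: scores[i], reverse=True):
        if len(keep) >= budget:
            break
        keep.add(t)
    return sorted(keep)


attn = [
    [0.10, 0.60, 0.10, 0.20],
    [0.05, 0.70, 0.05, 0.20],
]
print(select_kv_to_keep(attn, budget=2, recent=1))
```

A rule like this keeps whatever past queries attended to most, so a token that has received little attention so far is evicted even if a later reasoning step would need it.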
👍 9
06/24 20:00
Video reasoning language models implicitly assume that every input frame is equally reliable. This leads to what we term the Blind Trust Problem: under realistic perturbations such as motion blur, glare, or occlusion, frontier video reasoning models can suffer 15-30%p accuracy drops on real-world em
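Summary: Proposes a confidence-aware tool orchestration framework to strengthen video understanding. The study shows that existing video reasoning models assume every input frame is equally reliable, which causes accuracy to drop under real-world conditions. By taking frame reliability into account, the framework aims to improve model performance in real-world scenarios. Significance: contributes important ideas to video understanding and may influence later applications in multimedia content analysis.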
👍 9
06/24 20:00
We present PhysiFormer, a diffusion transformer for physically-plausible 3D object motion. Unlike video world models that operate in view-dependent pixel space, PhysiFormer represents objects as 3D meshes expressed in world coordinates. Given the initial vertex positions and velocities, as well as o
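Summary: PhysiFormer is a new diffusion transformer focused on simulating physically plausible 3D object motion in world space. Unlike view-dependent models, it represents objects in world coordinates, which markedly improves the realism of physical simulation. Significance: offers a new way to combine physical simulation with agent design and advances applied research on multi-agent environments.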
👍 9
06/23 20:00
We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent, so that it learns to create even stronger data. We describe the overall formulation, and a s
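Summary: Autodata is a method that enables AI agents to act as data scientists who build high-quality synthetic data. The study shows how to train such a data scientist agent so that it can create stronger datasets and raise the quality of training and evaluation data. Significance: offers a new path for data generation and optimization and advances the application of intelligent data science.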