👍 104
06/22 20:00
Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolutio
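Summary: The paper addresses how to build native memory systems for large language model (LLM) agents and proposes a data management framework covering persistent information storage, retrieval, update, and dynamic lifecycle management. By integrating more sophisticated memory mechanisms, it strengthens agent execution and supports persistent knowledge management. Significance: improves how agents handle information in dynamic environments.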
👍 47
06/24 20:00
Modern Vision-Language-Action (VLA) models often fail to generalize to novel setups, such as altered camera viewpoints or robot morphologies, because they are typically conditioned only on current observations and language instructions. By ignoring the underlying system configuration as a variable,
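Summary: Proposes In-Context World Modeling, a method aimed at the weak generalization of modern Vision-Language-Action (VLA) models to novel setups. By treating the system configuration as a variable, the method improves adaptation to new environments and shows stronger generality. Significance: broadens the potential for application in robot control.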
👍 45
06/24 20:00
Outcome-based reinforcement learning provides a stable optimization backbone for language agents, but its sparse trajectory-level rewards provide little guidance on which intermediate decisions should be reinforced or suppressed. On-policy self-distillation offers dense token-level supervision, yet
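Summary: This work proposes a framework that combines outcome-based reinforcement learning with on-policy self-distillation to provide denser token-level supervision. By better guiding the agent's intermediate decisions, it reports higher decision accuracy in complex environments and a marked improvement in model performance. Significance: advances the use of reinforcement learning for language agents and makes their decision-making more reliable.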
👍 44
06/23 20:00
Real-world photography requires capture-time guidance for both camera framing and subject pose. Yet existing aesthetic cropping benchmarks mainly evaluate post-hoc crop prediction and overlook subject-side recommendations, leaving the capture-time guidance capabilities of multimodal large language m
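Summary: Studies how multimodal large language models (MLLMs) can provide photography guidance at capture time. The proposed method gives real-time suggestions for camera framing and subject pose during shooting, addresses the limitations of existing models that focus on aesthetic cropping, and shows good practicality. Significance: offers a new direction for photography technology and for applications that combine computer vision with art.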
👍 41
06/24 20:00
While text-to-image (T2I) models have achieved remarkable progress, they struggle with real-world requests that are often underspecified, implicit, or dependent on up-to-date knowledge. We identify this challenge as the Context Gap: the mismatch between the user context and the sufficient generation
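Summary: The paper proposes Qwen-Image-Agent to address the Context Gap that text-to-image (T2I) models face on real-world requests. The method uses contextual information to match user needs more closely, improving the quality and relevance of generated images. Significance: opens a new direction for image generation, particularly for real-world scenarios.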
👍 39
06/23 20:00
A classical intuition holds that verifying a solution is easier than producing one. For today's coding agents, this intuition is being inverted: as foundation models develop stronger reasoning capabilities and engineering harnesses grow more sophisticated, generating complex candidate solutions is n
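Summary: For coding agents, the paper examines the shifting relationship between verifying a solution and generating one. By analyzing how current benchmarks and model capabilities have evolved, it proposes a more effective approach to generating complex candidate solutions that keeps pace with increasingly complex tasks. Significance: makes generative models more applicable to programming and advances coding capability.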
👍 38
06/24 20:00
A unified representation for text and vision is a natural pursuit, as it enables simpler multimodal modeling and more efficient training. However, representing images as discrete signals in the same way as text inevitably introduces severe information loss. Existing work struggles to balance low-lev
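Summary: ViQ proposes a quantized representation for aligning text and vision, aiming at more efficient multimodal modeling and training. The method addresses the information loss that can occur when images are converted into discrete signals and improves model performance on vision tasks. Significance: advances multimodal learning and lays groundwork for better cross-domain understanding.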
👍 34
06/23 20:00
Synthesizing a novel-view video from a monocular reference video along a target camera trajectory requires both geometric consistency and motion fidelity with respect to the reference video. Existing methods based on explicit 3D representations are limited by the accuracy of off-the-shelf reconstruc
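Summary: MVTrack4Gen proposes geometric supervision based on multi-view point tracking for synthesizing novel-view video. By preserving geometric consistency and motion fidelity, the method improves 4D video generation from a monocular reference video and shows potential for innovation in video generation. Significance: increases the value of generative models for video synthesis.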
👍 31
06/24 20:00
Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling
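Summary: JetSpec uses parallel tree drafting to break through the scaling limitation of speculative decoding and improve the speedup of autoregressive large language models (LLMs). The study shows that increasing the draft budget can raise speed significantly as long as the acceptance rate stays high, which offers a new perspective on inference efficiency. Significance: advances decoding-acceleration techniques and supports real-time inference in large-model applications.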
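The draft-then-verify loop that the abstract describes can be illustrated with a minimal sketch of generic greedy speculative decoding. This is not JetSpec's tree-drafting method; `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names, and the "models" are toy functions over integer tokens.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, budget: int) -> List[int]:
    """Run one draft-then-verify step and return the tokens to append."""
    # 1) Draft: the cheap model proposes `budget` tokens autoregressively.
    drafted: List[int] = []
    ctx = list(prefix)
    for _ in range(budget):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2) Verify: compare each drafted token with the target's greedy choice.
    #    A real system scores all positions in one batched forward pass;
    #    this sketch uses a loop for clarity.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # Every draft was accepted, so the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


def toy_target(ctx: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Toy draft model: agrees with the target except after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([0], toy_draft, toy_target, budget=4))
```

The output always matches what the target model alone would produce greedily; the gain comes from how many drafted tokens are accepted per verification step, which is why a larger draft budget helps only while acceptance stays high.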
👍 27
06/21 20:00
Computer-use agents can execute software tasks through either graphical interfaces or programmatic command interfaces, but existing evaluations confound interaction modality with differences in tasks, initial states, verifiers, and permitted actions. We introduce a matched execution-layer benchmark
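Summary: The paper analyzes execution bottlenecks of computer-use agents across graphical and command interfaces and introduces a matched execution-layer benchmark for evaluation. By standardizing the test environment and task variables, it exposes performance differences between interaction modalities. Significance: provides in-depth performance analysis and optimization guidance for human-computer interaction.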
👍 25
06/23 20:00
Fine-grained visual reasoning requires multimodal large language models (MLLMs) to identify task-relevant visual evidence and ground their reasoning in local image regions. Existing agentic methods typically rely on reinforcement learning with verifiable rewards or supervised fine-tuning on large-sc
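Summary: V-Zero proposes an on-policy distillation method that needs no answer labels and performs fine-grained visual reasoning through contrastive evidence gating. The method improves the ability of multimodal large language models to identify task-relevant visual evidence and grounds reasoning more tightly in local image regions. Significance: offers a new approach to visual reasoning and multimodal learning and makes agents more practical and effective.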
👍 24
06/18 20:00
Generating a coherent multi-shot video requires structured cross-shot memory. Subject appearance, scene context, and speaker identity must persist across cuts. Existing approaches either train end-to-end over fixed-length sequences and cannot scale, generate shot-by-shot with memory banks that grow
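Summary: UnityShots studies memory-based multi-shot audio-video generation and stresses the importance of structured cross-shot memory for producing coherent video. By keeping subject appearance and scene context consistent, the method achieves high-quality video generation. Significance: advances generative models in multimodal settings, especially film and television production.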
👍 22
06/23 20:00
Joint-Embedding Predictive Architectures (JEPAs), including recent LeWorldModel (LeWM), have become a promising foundation for reconstruction-free visual world models. For visual planning, however, LeWM evaluates candidate action sequences by repeatedly applying a local one-step latent transition mo
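Summary: Fast LeWorldModel, which builds on LeWorldModel (LeWM), proposes a fast way to build visual world models that evaluates candidate action sequences efficiently with a local latent transition model. The approach avoids reconstruction in visual planning and makes action selection more efficient. Significance: offers a new technical route for visual reasoning and planning and improves decision-making in complex environments.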
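The step-by-step evaluation of candidate action sequences mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's method: the transition function `toy_step`, the squared-distance cost to a goal latent, and all names are assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple

Latent = Tuple[float, ...]
Action = Tuple[float, ...]
Step = Callable[[Latent, Action], Latent]


def rollout(z0: Latent, actions: Sequence[Action], step: Step) -> Latent:
    """Apply the one-step latent transition once per action, in order."""
    z = z0
    for a in actions:
        z = step(z, a)
    return z


def plan(z0: Latent, goal: Latent,
         candidates: List[List[Action]], step: Step) -> List[Action]:
    """Return the candidate whose final latent is closest to the goal."""
    def cost(seq: List[Action]) -> float:
        z = rollout(z0, seq, step)
        # Assumed cost: squared Euclidean distance to the goal latent.
        return sum((zi - gi) ** 2 for zi, gi in zip(z, goal))
    return min(candidates, key=cost)


def toy_step(z: Latent, a: Action) -> Latent:
    """Hypothetical transition model: the action is added to the latent."""
    return tuple(zi + ai for zi, ai in zip(z, a))


if __name__ == "__main__":
    candidates = [
        [(1.0, 0.0), (1.0, 0.0)],
        [(0.0, 1.0), (0.0, 1.0)],
        [(1.0, 0.0), (0.0, 1.0)],
    ]
    print(plan((0.0, 0.0), (1.0, 1.0), candidates, toy_step))
```

Each candidate costs one transition call per action, so the total work grows with both the planning horizon and the number of candidates.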
👍 16
06/24 20:00
As agentic systems continue to evolve and are widely deployed in real-world scenarios, there is a growing demand to faithfully evaluate their capabilities. However, current benchmarks are typically built on popular applications with relatively simple tasks and focus on a narrow set of capabilities w
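Summary: The paper revisits how the capabilities of agentic systems are tested in real-world environments, responding to the limited complexity of current benchmarks. It designs more comprehensive evaluation criteria that cover a wider range of capabilities. Significance: offers a new perspective on evaluating agent capabilities comprehensively and supports practical deployment.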
👍 15
06/23 20:00
Tool use enables large language models (LLMs) to perform complex tasks, and recent agentic reinforcement learning (RL) methods show promise for enhancing model capabilities. However, RL alone often leads to instability or limited gains in tool-use tasks. In our experiments, some models exhibit catas
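Summary: The study examines the collapse phenomenon in reinforcement learning for multi-step tool use and proposes introducing supervision signals as a remedy. Experiments show that appropriate supervision can significantly improve stability and effectiveness on complex tasks. Significance: advances reinforcement learning for practical tool-use tasks and offers a new way to address instability.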
👍 13
06/24 20:00
The prevalent dual-branch paradigm, i.e., training a side network to encode visual conditions and fusing its intermediate-layer features to a frozen pretrained main network, has shown remarkable success in visual-condition controllable generation. Despite its widespread adoption, the role of the sid
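Summary: LISA proposes an alignment method for visual-condition controllable generation that improves results within the dual-branch architecture. By handling carefully how side-network features are fused into the main network, it improves generation quality. Significance: offers a new approach to visual control in generation tasks and promotes diversity in image generation.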
👍 13
06/23 20:00
We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent, so that it learns to create even stronger data. We describe the overall formulation, and a s
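Summary: Autodata proposes a method that lets AI agents act as data scientists and helps produce high-quality synthetic data. Through meta-optimization, the agent is trained to create stronger data, which in turn improves overall model performance. Significance: advances synthetic data generation and opens new possibilities for AI in data science.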
👍 9
06/24 20:00
Reasoning capability has advanced rapidly in large language models (LLMs), leading to an increasing size of key-value (KV) cache in both prefilling and decoding stages. Existing KV cache compression methods mainly rely on attention weights to estimate token importance. While attention effectively ca
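Summary: The study focuses on information-aware key-value (KV) cache compression for long-horizon reasoning and proposes an improved method that raises cache efficiency and overcomes the over-reliance of existing methods on attention weights. By optimizing how information is encoded, the model improves significantly in both inference speed and accuracy. Significance: offers a new method for optimizing inference systems and strengthens long-horizon reasoning.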
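The attention-weight baseline that the abstract attributes to existing KV cache compression methods can be sketched as below. This illustrates that baseline only, not the information-aware method the paper proposes; `select_kv_to_keep` and the example weights are hypothetical.

```python
from typing import List


def select_kv_to_keep(attn: List[List[float]], keep: int) -> List[int]:
    """Pick which cached positions to retain, scored by attention received.

    `attn[q][k]` is the attention weight that query q places on cached
    position k. Each position is scored by its total across queries, and
    the `keep` highest-scoring positions are returned in original order.
    """
    num_keys = len(attn[0])
    scores = [sum(row[k] for row in attn) for k in range(num_keys)]
    ranked = sorted(range(num_keys), key=lambda k: scores[k], reverse=True)
    return sorted(ranked[:keep])


if __name__ == "__main__":
    attn = [
        [0.7, 0.1, 0.1, 0.1],  # query 0 attends mostly to position 0
        [0.1, 0.1, 0.6, 0.2],  # query 1 attends mostly to position 2
    ]
    print(select_kv_to_keep(attn, keep=2))
```

A position that receives little attention from recent queries is evicted under this rule even if later steps would need it, which is the kind of limitation that motivates scoring by something other than attention alone.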
👍 9
06/24 20:00
Video reasoning language models implicitly assume that every input frame is equally reliable. This leads to what we term the Blind Trust Problem: under realistic perturbations such as motion blur, glare, or occlusion, frontier video reasoning models can suffer 15-30%p accuracy drops on real-world em
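Summary: The article proposes a flexible video understanding framework that accounts for differences in the reliability of input frames, targeting the accuracy drops seen under real-world conditions. Confidence-aware tool orchestration makes the model more robust to perturbations. Significance: improves stability in video understanding and offers a new approach to real-time video analysis.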
👍 9
06/24 20:00
We present PhysiFormer, a diffusion transformer for physically-plausible 3D object motion. Unlike video world models that operate in view-dependent pixel space, PhysiFormer represents objects as 3D meshes expressed in world coordinates. Given the initial vertex positions and velocities, as well as o
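Summary: PhysiFormer proposes a diffusion transformer for simulating the physical motion of objects that operates in world coordinates. Compared with view-dependent approaches, it models the motion of objects represented as meshes more effectively. Significance: offers a new way to combine physical simulation with computer vision and broadens the scope of simulation technology.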