👍 95
06/22 20:00
Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolutio
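Summary: The paper presents a memory system for large language model (LLM) agents that supports persistent storage, retrieval, and updating of information. Dynamic lifecycle governance lets the agent handle complex memory tasks during execution. The work advances memory management for agents, improving how efficiently they acquire and use knowledge, and is relevant to research on memory, reasoning, and agents.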
👍 38
06/23 20:00
Real-world photography requires capture-time guidance for both camera framing and subject pose. Yet existing aesthetic cropping benchmarks mainly evaluate post-hoc crop prediction and overlook subject-side recommendations, leaving the capture-time guidance capabilities of multimodal large language m
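Summary: The paper presents ShutterMuse, a photography guidance system built on multimodal large language models (MLLMs) that gives capture-time guidance on camera framing and subject pose. By adding subject-side recommendations, it fills a gap left by existing aesthetic cropping benchmarks. The study reports that the approach markedly improves capture-time guidance, with relevance to applications that combine photography and computer vision.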
👍 34
06/24 20:00
A unified representation for text and vision is a natural pursuit, as it enables simpler multimodal modeling and more efficient training. However, representing images as discrete signals in the same way as text inevitably introduces severe information loss. Existing work struggles to balance low-lev
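Summary: The paper proposes ViQ, a unified representation for text and vision that aims to reduce information loss and make multimodal modeling more efficient. It represents images in quantized form and targets the trade-off around low-level features that existing work struggles to balance, showing potential on generation tasks. The work may benefit text-and-vision applications such as visual question answering and cross-modal retrieval.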
👍 30
06/15 20:00
While Large Language Models (LLMs) have substantially advanced text-to-code synthesis, many real programming tasks specify intent through visual artifacts such as screenshots, charts, vector drawings, videos, and interactive states. These tasks require models to connect visual perception to executab
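Summary: The paper systematically reviews progress in multimodal code intelligence, noting that many programming tasks express intent through visual artifacts such as screenshots and charts. It emphasizes the link between visual perception and executable code and discusses how large language models (LLMs) can address this challenge, with implications for code generation and code understanding.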
👍 29
06/24 20:00
While text-to-image (T2I) models have achieved remarkable progress, they struggle with real-world requests that are often underspecified, implicit, or dependent on up-to-date knowledge. We identify this challenge as the Context Gap: the mismatch between the user context and the sufficient generation
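Summary: Qwen-Image-Agent identifies the "Context Gap" in text-to-image generation and proposes a method to narrow the gap between what users actually need and what the generation model produces. The study indicates that giving the model access to up-to-date knowledge makes it more effective on real-world requests.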
👍 29
06/23 20:00
Synthesizing a novel-view video from a monocular reference video along a target camera trajectory requires both geometric consistency and motion fidelity with respect to the reference video. Existing methods based on explicit 3D representations are limited by the accuracy of off-the-shelf reconstruc
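Summary: MVTrack4Gen uses multi-view point tracking as geometric supervision to synthesize novel-view video while preserving geometric consistency and motion fidelity with respect to the reference video. By working around the limitations of conventional 3D reconstruction, it offers a new approach to 4D video generation, with applications in computer vision and film production.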
👍 27
06/24 20:00
Outcome-based reinforcement learning provides a stable optimization backbone for language agents, but its sparse trajectory-level rewards provide little guidance on which intermediate decisions should be reinforced or suppressed. On-policy self-distillation offers dense token-level supervision, yet
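Summary: OPID proposes on-policy distillation within outcome-based reinforcement learning to give language agents denser, more efficient supervision. Adding on-policy self-distillation improves the agent's decision making and shows potential on complex tasks.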
👍 21
06/18 20:00
Generating a coherent multi-shot video requires structured cross-shot memory. Subject appearance, scene context, and speaker identity must persist across cuts. Existing approaches either train end-to-end over fixed-length sequences and cannot scale, generate shot-by-shot with memory banks that grow
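Summary: UnityShots proposes memory-driven multi-shot video generation, using structured cross-shot memory to keep subject appearance and scene context consistent across cuts. It addresses the limits of existing methods on long sequences and may improve multi-shot video quality for film production and video creation.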
👍 21
06/23 20:00
Fine-grained visual reasoning requires multimodal large language models (MLLMs) to identify task-relevant visual evidence and ground their reasoning in local image regions. Existing agentic methods typically rely on reinforcement learning with verifiable rewards or supervised fine-tuning on large-sc
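Summary: V-Zero proposes on-policy distillation without answer labels for fine-grained visual reasoning, focusing on how multimodal large language models (MLLMs) identify task-relevant visual evidence. The method adds contrastive evidence gating to improve reasoning quality, with relevance to visual recognition and reasoning applications.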
👍 20
06/23 20:00
A classical intuition holds that verifying a solution is easier than producing one. For today's coding agents, this intuition is being inverted: as foundation models develop stronger reasoning capabilities and engineering harnesses grow more sophisticated, generating complex candidate solutions is n
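Summary: The paper examines the challenge of reward verification for today's coding agents. It argues that the classical intuition that verifying a solution is easier than producing one is being inverted: as reasoning capabilities and engineering harnesses improve, generating candidate solutions is becoming easier relative to verifying them. The inversion has implications for research on coding agents and for deployed agent systems.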
👍 18
06/24 20:00
Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling
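Summary: JetSpec introduces parallel tree drafting to break through the scaling limitation of speculative decoding and speed up autoregressive large language models (LLMs). In standard speculative decoding, a larger draft budget improves speed only while acceptance stays high and drafting overhead stays low; JetSpec targets that ceiling.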
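The scaling limitation described in the abstract can be illustrated with the standard analysis of chain (non-tree) speculative decoding. This is a minimal sketch, not JetSpec's method: it assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per step, and that one draft-model pass costs `cost_ratio` times one target-model pass. All three names are illustrative assumptions.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumes i.i.d. per-token acceptance probability `alpha` and a linear
    draft of `gamma` tokens (the usual chain speculative-decoding model).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus one token from the target.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected wall-time speedup over plain autoregressive decoding.

    One step costs `gamma` draft passes (each `cost_ratio` of a target
    pass) plus one target pass that verifies all drafted tokens.
    """
    step_cost = gamma * cost_ratio + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost


if __name__ == "__main__":
    # Compare a small and a large draft budget under two regimes.
    for alpha, cost_ratio in [(0.9, 0.05), (0.5, 0.2)]:
        for gamma in (2, 8):
            s = expected_speedup(alpha, gamma, cost_ratio)
            print(f"alpha={alpha} cost_ratio={cost_ratio} gamma={gamma} speedup={s:.2f}")
```

Under this model, raising `gamma` helps when `alpha` is high and `cost_ratio` is low, and hurts otherwise, which is the ceiling the abstract refers to.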
👍 11
06/21 20:00
The Hitchhiker's Guide to Agentic AI is a comprehensive practitioner's reference for building autonomous AI systems. The book covers the full stack from first principles to production deployment, organized around a central thesis: building great agentic systems requires understanding every layer of
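Summary: The book "The Hitchhiker's Guide to Agentic AI" is a comprehensive guide to building autonomous AI systems, covering every layer from first principles to production deployment. It helps readers understand the layered architecture of agentic systems and serves as a reference for work on agent technology.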
👍 10
06/23 20:00
Tool use enables large language models (LLMs) to perform complex tasks, and recent agentic reinforcement learning (RL) methods show promise for enhancing model capabilities. However, RL alone often leads to instability or limited gains in tool-use tasks. In our experiments, some models exhibit catas
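Summary: The study examines failures of reinforcement learning for multi-step tool use and proposes supervised signals as a remedy for the instability. Experiments show that supervised signals improve effectiveness on tool-use tasks, offering guidance for designing robust reinforcement learning systems for agents.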
👍 9
06/24 20:00
As agentic systems continue to evolve and are widely deployed in real-world scenarios, there is a growing demand to faithfully evaluate their capabilities. However, current benchmarks are typically built on popular applications with relatively simple tasks and focus on a narrow set of capabilities w
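Summary: The paper revisits what agentic systems must be capable of in today's real-world environments, points out the limitations of current benchmarks, and argues for more comprehensive evaluation. The work matters for validating agents' practical capabilities and for raising the bar on capability testing.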
👍 9
06/23 20:00
We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent, so that it learns to create even stronger data. We describe the overall formulation, and a s
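Summary: Autodata is a method that lets AI agents act as data scientists who build high-quality training and evaluation data. The paper describes how such an agent is trained (meta-optimized) and shows how this improves data quality, with potential applications in synthetic data generation and data science.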
👍 8
06/20 20:00
Chain-of-Thought (CoT) has become a standard method for improving reasoning capabilities in large language models (LLMs) by eliciting step-by-step thinking, but its effectiveness in multimodal tasks remains unclear. In this paper, we aim to systematically investigate the key question: What can multi
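Summary: The paper examines how effective Chain-of-Thought (CoT) is for multimodal reasoning, systematically investigating its applicability across tasks. The findings indicate that CoT improves reasoning to some extent, but its effects and limits on multimodal tasks need further study.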
👍 6
06/21 20:00
Computer-use agents can execute software tasks through either graphical interfaces or programmatic command interfaces, but existing evaluations confound interaction modality with differences in tasks, initial states, verifiers, and permitted actions. We introduce a matched execution-layer benchmark
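Summary: The study analyzes execution bottlenecks of computer-use agents across graphical and command-line interfaces and introduces a matched execution-layer benchmark. The results should help improve agent performance across interface types, with relevance to human-computer interaction and intelligent system design.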
👍 6
06/24 20:00
Video reasoning language models implicitly assume that every input frame is equally reliable. This leads to what we term the Blind Trust Problem: under realistic perturbations such as motion blur, glare, or occlusion, frontier video reasoning models can suffer 15-30%p accuracy drops on real-world em
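Summary: The paper proposes confidence-aware tool orchestration to address the Blind Trust Problem in video understanding, stressing that input frames differ in reliability. The study reports that the method improves accuracy on realistic footage, which matters for video reasoning and video processing applications.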
👍 6
06/23 20:00
While Video Virtual Try-on (VVT) has achieved remarkable progress in synthesizing realistic garment overlays on dynamic subjects, existing paradigms remain fundamentally constrained by a passive dependency on source camera trajectories, failing to accommodate the requisite interactive freedom for o
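Summary: TryOnCrafter unlocks camera trajectories for video virtual try-on through a renderable 4D try-on proxy, with the aim of making try-on more interactive. The work offers a new approach to video virtual try-on, with applications in fashion e-commerce and augmented reality.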
👍 6
06/22 20:00
Existing low-bit KV-cache quantizers often treat each cached key as a flat vector. Under RoPE, however, a key's contribution to a future attention logit decomposes into a position-dependent sum over two-dimensional frequency blocks. This makes key-cache quantization a block-wise bit-allocation probl
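Summary: The paper proposes a RoPE-aware KV-cache quantization method that addresses limitations of existing low-bit KV-cache quantizers. By optimizing block-wise bit allocation for keys, it offers a way to improve memory efficiency and compute performance, with relevance to Transformer-based models.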
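The decomposition mentioned in the abstract follows from the standard definition of RoPE, and can be written out as below. This is a sketch of the general identity, not the paper's quantizer. The symbols are assumptions introduced here: $q$ is a query at position $m$, $k$ is a cached key at position $n$, $q^{(i)}, k^{(i)} \in \mathbb{R}^2$ are their $i$-th two-dimensional blocks, and $\theta_i$ is the rotation frequency of block $i$.

```latex
\[
R(\alpha) =
\begin{pmatrix}
\cos\alpha & -\sin\alpha \\
\sin\alpha & \cos\alpha
\end{pmatrix},
\qquad
\big\langle \mathrm{RoPE}(q, m),\, \mathrm{RoPE}(k, n) \big\rangle
= \sum_{i=1}^{d/2} \big(q^{(i)}\big)^{\top} R\big((n-m)\,\theta_i\big)\, k^{(i)} .
\]
```

Each term depends on the key only through one 2D block and on position only through the offset $n-m$, so quantization error in different blocks enters the logit with different position-dependent weights. That is consistent with the abstract's framing of key-cache quantization as a block-wise bit-allocation problem.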