Accepted by ECCV 2026

Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

Introducing VG-GUI-Bench — a video-guided GUI agent benchmark — and TASKER, a task-driven and scene-aware keyframe search algorithm.

Sunqi Fan¹, Qingle Liu¹, Runqi Yin¹, Meng-Hao Guo¹, Shuojin Yang¹,✉
¹Tsinghua University, Beijing, China
Two progressive levels of video understanding: low-level VideoQA and high-level video-guided agentic tasks.
We advance video understanding from the low-level VideoQA paradigm toward the high-level Video-Guided Agentic Task paradigm. The latter requires models to learn procedural knowledge from a tutorial video and transfer it to long-horizon decision making — a form of video in-context learning.

Abstract

Recent Multimodal Large Language Models (MLLMs) achieve remarkable performance on Video Question Answering (VideoQA) benchmarks, but existing benchmarks primarily test shallow visual perception and rarely examine whether MLLMs can learn deeper procedural skills from video tutorials and generalize them to long-horizon agentic tasks. To address this gap, we introduce VG-GUI-Bench (Video-Guided GUI Benchmark), a new benchmark that evaluates whether MLLM-based GUI agents can follow video tutorials to complete corresponding interactive GUI tasks.

We further observe that performance on both VideoQA and video-guided agentic tasks critically depends on effective keyframe extraction. Based on this, we propose TASKER (Task-driven and Scene-aware Keyframe searcher), a keyframe extraction algorithm that jointly considers task relevance and scene dynamics. TASKER yields significant gains on both VideoQA and video-guided agentic benchmarks, outperforming the best baseline by 2.0% on the EgoSchema fullset and 1.8% on NExT-QA, highlighting the potential of generalized keyframe extraction for video understanding.

Contributions

Three key contributions bridging perception and action in video understanding.

A Two-Level Taxonomy

We identify a key limitation of existing video benchmarks and propose a taxonomy connecting low-level VideoQA with high-level video-guided agentic tasks, highlighting the role of video in-context learning.

VG-GUI-Bench

A new benchmark that pairs tutorial videos with GUI agent tasks to evaluate procedural knowledge transfer from videos, with 1,000 long-horizon test cases and four complementary metrics.

TASKER

A task-driven and scene-aware keyframe extraction algorithm, formulated as a generalized graph search, that improves both accuracy and frame efficiency across VideoQA and agentic tasks.

VG-GUI-Bench

A dedicated benchmark for evaluating MLLM-based GUI agents on long-horizon tasks guided by video tutorials, built upon the MONDAY dataset.

Overview of VG-GUI-Bench: the benchmark pipeline, the standardized action space, and the evaluation metrics with their formulas.

1,000 test cases
10.71 average steps per episode
6 standardized actions
4 evaluation metrics

Standardized Action Space

| Action | Format | Description |
|---|---|---|
| CLICK | CLICK(x, y) | Tap at coordinate (x, y) |
| SCROLL | SCROLL(x1, y1, x2, y2) | Swipe / drag gesture between two points |
| TYPE | TYPE(content) | Input a text string |
| PRESS | PRESS(key) | System key ∈ {BACK, HOME, ENTER} |
| ZOOM | ZOOM() | Pinch-to-zoom gesture |
| FINISH | FINISH() | Task completed |
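
To make the action formats above concrete, here is a minimal parsing sketch. It assumes that an agent emits each action as a single string in exactly the listed format, that coordinates are plain numbers, and that TYPE content is unquoted free text; none of these serialization details are specified on this page, and the names `parse_action`, `ARITY`, and `PRESS_KEYS` are hypothetical.

```python
import re
from typing import List, Tuple

# Number of arguments per action, taken from the Format column above.
ARITY = {"CLICK": 2, "SCROLL": 4, "TYPE": 1, "PRESS": 1, "ZOOM": 0, "FINISH": 0}
PRESS_KEYS = {"BACK", "HOME", "ENTER"}

_PATTERN = re.compile(r"^\s*([A-Z]+)\((.*)\)\s*$", re.DOTALL)


def parse_action(text: str) -> Tuple[str, List[object]]:
    """Parse a string such as 'CLICK(120, 340)' into (action type, arguments)."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not an action string: {text!r}")
    name, raw = match.group(1), match.group(2).strip()
    if name not in ARITY:
        raise ValueError(f"unknown action type: {name}")
    if name == "TYPE":
        # Assumption: everything inside the parentheses is the text to type,
        # so embedded commas are kept as-is.
        return name, [raw]
    args = [part.strip() for part in raw.split(",")] if raw else []
    if len(args) != ARITY[name]:
        raise ValueError(f"{name} expects {ARITY[name]} argument(s), got {len(args)}")
    if name == "PRESS":
        if args[0] not in PRESS_KEYS:
            raise ValueError(f"unsupported system key: {args[0]}")
        return name, args
    # CLICK and SCROLL carry numeric coordinates; ZOOM and FINISH carry none.
    return name, [float(value) for value in args]
```

Separating the action type from its arguments in this way also matches how the Accuracy metric credits the two parts independently.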

Evaluation Metrics

| Metric | Meaning |
|---|---|
| Accuracy | Step correctness: 0.3 for the right action type, +0.7 for the right arguments. |
| Completion | Proportion of correctly executed steps per episode, averaged across episodes. |
| Efficiency | Average number of input frames consumed per prediction step (lower is better). |
| PIR | Performance Improvement Rate: relative accuracy gain from adding the video tutorial. |
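
As a rough illustration of the Accuracy and Efficiency definitions above, the sketch below scores individual steps and averages frame counts. The tuple representation of actions, the function names, the default exact-match comparison of arguments, and the pooling of scores over steps are all assumptions made for this example; the benchmark's actual argument-matching rule (for instance, how close a CLICK must land to its target) is not given on this page.

```python
from typing import Callable, Optional, Sequence, Tuple

Action = Tuple[str, tuple]  # (action type, arguments), e.g. ("CLICK", (120, 340))


def step_score(
    pred: Action,
    gold: Action,
    args_match: Optional[Callable[[tuple, tuple], bool]] = None,
) -> float:
    """Return 0.3 for the right action type and 1.0 (0.3 + 0.7) if the arguments match too."""
    if args_match is None:
        # Assumption: arguments must match exactly; a real scorer would likely
        # allow a spatial tolerance for coordinates.
        args_match = lambda p, g: p == g
    if pred[0] != gold[0]:
        # Assumption: arguments earn no credit when the action type is wrong.
        return 0.0
    return 1.0 if args_match(pred[1], gold[1]) else 0.3


def mean_accuracy(preds: Sequence[Action], golds: Sequence[Action]) -> float:
    """Average step score in percent (assumption: pooled over all steps)."""
    scores = [step_score(p, g) for p, g in zip(preds, golds)]
    return 100.0 * sum(scores) / len(scores)


def efficiency(frames_per_step: Sequence[int]) -> float:
    """Average number of input frames consumed per prediction step."""
    return sum(frames_per_step) / len(frames_per_step)
```

Under this reading, a step with the correct action type but wrong arguments still earns 0.3, which is why the Type Acc. columns in the result tables can sit well above Acc.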

TASKER Algorithm

TASKER reformulates keyframe extraction as a generalized graph-search problem. The video is split into segments (nodes), and an MLLM evaluates cost functions and termination confidence to decide which segments to expand — selecting a compact yet informative set of keyframes.

TASKER's cost-function evaluation and node-expansion steps. The GBFS variant evaluates distance based on question relevance; the Dijkstra variant evaluates distance based on scene dynamics.

TASKER-GBFS

Task-driven. The MLLM estimates which segment most likely contains the missing goal-critical information, then expands the node closest to the destination (the question).

TASKER-Dijkstra

Scene-aware. Without knowing the question, the MLLM selects the segment with the most significant scene change, prioritizing intrinsic visual structure.

TASKER-A*

Task-driven & scene-aware. Combines both signals — only segments that are goal-relevant and show large state changes are prioritized.

TASKER-BFS

Naive. A breadth-first variant that requires no MLLM cost evaluation, steadily expanding all segments to avoid overlooking information.
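
The four variants can be read as one best-first search with different cost functions. The sketch below is a simplified illustration under several assumptions, not the authors' implementation: the MLLM judgments are replaced by caller-supplied scoring callables, each expansion splits a segment into two halves, the keyframes are taken as segment midpoints, the A* cost simply sums the two signals, and termination confidence is a boolean callback. All names (`search_keyframes`, `gbfs_cost`, and so on) are hypothetical.

```python
import heapq
from typing import Callable, List, Tuple

Segment = Tuple[float, float]        # (start_sec, end_sec)
Score = Callable[[Segment], float]   # stand-in for an MLLM judgment; higher = more promising


def gbfs_cost(relevance: Score) -> Score:
    """Task-driven: segments judged more relevant to the question cost less."""
    return lambda seg: -relevance(seg)


def dijkstra_cost(dynamics: Score) -> Score:
    """Scene-aware: segments with larger scene changes cost less."""
    return lambda seg: -dynamics(seg)


def a_star_cost(relevance: Score, dynamics: Score) -> Score:
    """Both signals combined (assumption: a plain sum)."""
    return lambda seg: -(relevance(seg) + dynamics(seg))


def bfs_cost() -> Score:
    """No model call: longer segments first, which yields level-by-level expansion."""
    return lambda seg: -(seg[1] - seg[0])


def search_keyframes(
    duration: float,
    cost: Score,
    is_done: Callable[[List[Segment]], bool] = lambda segs: False,
    max_expansions: int = 8,
    min_len: float = 1.0,
) -> List[float]:
    """Expand the cheapest segment until done or out of budget; return keyframe times."""
    root = (0.0, duration)
    frontier = [(cost(root), 0, root)]   # entries are (cost, tie-breaker, segment)
    finished: List[Segment] = []         # segments too short to split further
    tie = 1
    expansions = 0
    while frontier and expansions < max_expansions:
        current = finished + [seg for _, _, seg in frontier]
        if is_done(current):             # stand-in for the termination-confidence check
            break
        _, _, (start, end) = heapq.heappop(frontier)
        if end - start <= min_len:
            finished.append((start, end))
            continue
        mid = (start + end) / 2.0
        for child in ((start, mid), (mid, end)):
            heapq.heappush(frontier, (cost(child), tie, child))
            tie += 1
        expansions += 1
    segments = finished + [seg for _, _, seg in frontier]
    return sorted((s + e) / 2.0 for s, e in segments)
```

For example, with a toy relevance score that fires only on segments containing t = 126 s, `search_keyframes(128.0, gbfs_cost(relevance), max_expansions=5)` repeatedly splits the segment that contains that moment and ends with one keyframe at 126 s plus one coarse keyframe for each segment it passed over.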

Results on VideoQA

In a training-free, zero-shot setting, TASKER consistently outperforms prior temporal-selection and video-agent methods on EgoSchema and NExT-QA. Gains over VideoTree are listed after each TASKER score.

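| Method | (M)LLM | EgoSchema Sub. | EgoSchema Full | NExT-QA Tem. | NExT-QA Cau. | NExT-QA Des. | NExT-QA Avg. |
|---|---|---|---|---|---|---|---|
| VideoAgent | GPT-4 | 60.2 | 54.1 | 64.5 | 72.7 | 81.1 | 71.3 |
| VideoTree | GPT-4 | 66.2 | 61.1 | 70.6 | 76.5 | 83.9 | 75.6 |
| TASKER (Ours) | GPT-4 | 68.0 (+1.8) | 63.1 (+2.0) | 72.3 (+1.7) | 78.2 (+1.7) | 85.4 (+1.5) | 77.4 (+1.8) |
| TASKER (Ours) | GPT-4o | 68.6 (+2.4) | 63.6 (+2.5) | 72.9 (+2.3) | 79.0 (+2.5) | 86.1 (+2.2) | 78.1 (+2.5) |
| VideoTree | Qwen3-VL | 77.2 | 76.7 | 81.9 | 85.1 | 86.3 | 84.3 |
| TASKER (Ours) | Qwen3-VL | 79.4 (+2.2) | 77.3 (+0.6) | 83.6 (+1.7) | 85.3 (+0.2) | 87.5 (+1.2) | 85.1 (+0.8) |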

TASKER outperforms all baselines while using only ~15% of the total frames.

Frame efficiency curve comparing TASKER with LLoVi, VideoAgent and VideoTree.
Frame efficiency. At the same accuracy level (66%), TASKER uses only about 1/4 of the frames required by VideoTree on EgoSchema, demonstrating a superior ability to identify key information.

🏆 VG-GUI-Bench Leaderboard

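Frontier MLLMs evaluated on VG-GUI-Bench without video input and with 10 uniformly sampled frames. Adding video improves overall accuracy and completion for every model, confirming the benefit of temporal visual guidance. Type accuracy also rises for every model except Claude-Sonnet-4.6, where it is essentially unchanged (67.81 vs. 67.71).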

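| # | Model | Setting | Acc. (%) | Type Acc. (%) | Comp. (%) | PIR |
|---|---|---|---|---|---|---|
| 1 | Gemini-3.1-Pro | No Video | 58.51 | 74.58 | 78.53 | - |
| 1 | Gemini-3.1-Pro | 10 Uniform Frames | 61.68 | 76.25 | 78.61 | 0.054 |
| 2 | GPT-5-mini | No Video | 55.76 | 71.93 | 75.97 | - |
| 2 | GPT-5-mini | 10 Uniform Frames | 58.40 | 75.27 | 78.96 | 0.047 |
| 3 | Kimi-K2.5 | No Video | 57.22 | 72.91 | 76.31 | - |
| 3 | Kimi-K2.5 | 10 Uniform Frames | 58.22 | 74.78 | 78.72 | 0.017 |
| 4 | Claude-Sonnet-4.6 | No Video | 44.35 | 67.81 | 72.24 | - |
| 4 | Claude-Sonnet-4.6 | 10 Uniform Frames | 45.80 | 67.71 | 72.35 | 0.033 |
| 5 | Seed-2.0-Pro | No Video | 35.93 | 70.66 | 74.75 | - |
| 5 | Seed-2.0-Pro | 10 Uniform Frames | 39.78 | 75.96 | 79.36 | 0.107 |
| 6 | Qwen3-VL-235B-A22B | No Video | 25.90 | 66.63 | 69.73 | - |
| 6 | Qwen3-VL-235B-A22B | 10 Uniform Frames | 26.88 | 67.42 | 71.69 | 0.038 |
| 7 | Gemini-3.1-Flash | No Video | 24.73 | 67.03 | 70.54 | - |
| 7 | Gemini-3.1-Flash | 10 Uniform Frames | 26.02 | 69.19 | 72.60 | 0.052 |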
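
The PIR column can be checked by hand. Reading "relative accuracy gain" as the accuracy with video minus the accuracy without, divided by the accuracy without, is an interpretation rather than a formula stated on this page, but it reproduces the reported values from the Acc. columns to three decimals. The helper name `pir` and the `LEADERBOARD` dictionary are hypothetical; the numbers are copied from the leaderboard above.

```python
def pir(acc_no_video: float, acc_with_video: float) -> float:
    """Relative accuracy gain from adding the video tutorial (assumed definition)."""
    return (acc_with_video - acc_no_video) / acc_no_video


# model: (Acc. without video, Acc. with 10 uniform frames, reported PIR)
LEADERBOARD = {
    "Gemini-3.1-Pro": (58.51, 61.68, 0.054),
    "GPT-5-mini": (55.76, 58.40, 0.047),
    "Kimi-K2.5": (57.22, 58.22, 0.017),
    "Claude-Sonnet-4.6": (44.35, 45.80, 0.033),
    "Seed-2.0-Pro": (35.93, 39.78, 0.107),
    "Qwen3-VL-235B-A22B": (25.90, 26.88, 0.038),
    "Gemini-3.1-Flash": (24.73, 26.02, 0.052),
}

for model, (base, with_video, reported) in LEADERBOARD.items():
    print(f"{model}: computed {pir(base, with_video):.3f}, reported {reported:.3f}")
```

Because PIR is normalized by the no-video accuracy, a model with a low baseline such as Seed-2.0-Pro can show the largest PIR even when its absolute accuracy stays well below the top entries.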

TASKER & Keyframe-Selection Methods on VG-GUI-Bench

All methods use Qwen3-VL-235B-A22B-Instruct as the base LLM.

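| Method | Acc. (%) | Type Acc. (%) | Comp. (%) | Eff. ↓ | PIR |
|---|---|---|---|---|---|
| No Video | 25.32 | 65.85 | 69.03 | 0 | - |
| Uniform Sampling | 39.82 | 66.34 | 70.64 | 10.88 | 0.573 |
| Oracle Keyframes | 44.32 | 73.31 | 76.32 | 1 | 0.750 |
| VideoTree | 40.79 | 67.52 | 71.93 | 10.00 | 0.611 |
| VideoAgent | 39.86 | 67.03 | 71.17 | 5.12 | 0.574 |
| TASKER-Dijkstra | 40.75 | 71.05 | 74.39 | 5.88 | 0.609 |
| TASKER-A* | 40.96 | 67.71 | 71.38 | 8.24 | 0.618 |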

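Among the non-oracle methods, TASKER-A* attains the highest accuracy (40.96%) and PIR (0.618), surpassing the strong VideoTree baseline (40.79%, 0.611), while TASKER-Dijkstra comes closest to the Oracle Keyframes upper bound on task completion (74.39% vs. 76.32%) at 5.88 frames per step.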

Search Visualization

A case study of TASKER solving a VideoQA example from EgoSchema. The key information lies between 126s–130s; TASKER traverses the video tree (yellow path) and converges to the correct keyframes (green leaf).

Tree-search and node-expansion process of TASKER on an EgoSchema case. Nodes on the key search path are marked in yellow; the final leaf node is marked in green.

BibTeX

@inproceedings{fan2026bridging,
  title     = {Bridging VideoQA and Video-Guided Agentic Tasks
               via Generalized Keyframe Extraction},
  author    = {Fan, Sunqi and Liu, Qingle and Yin, Runqi and
               Guo, Meng-Hao and Yang, Shuojin},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}