DRACULA: Hunting for the Actions Users Want Deep Research Agents to Execute

arXiv cs.CL / April 28, 2026


Key Points

  • DRACULA is presented as a new dataset that captures user feedback not just on final deep research reports, but on intermediate actions proposed by deep research (DR) agents.
  • In a five-week study, nineteen expert CS researchers selected preferred intermediate actions for DR-system outputs (e.g., adding a datasets section), producing 8,103 action preferences and 5,230 judgments about whether reports executed the chosen actions.
  • The authors evaluate how predictable user-preferred actions are, finding that LLMs initially struggle but improve most when given the user's full selection history rather than self-reported or extrapolated user-context signals.
  • They also show that users' preferred actions vary for the same query due to unstated goals. Informed by the simulation results, the authors design an online intervention that recommends new actions based on a user's past interactions; users pick these recommended actions most often in follow-up studies.
  • The work open-sources the DRACULA study design, feedback, and simulation tasks to encourage future research on action feedback for long-horizon agents, highlighting “choosing which actions to execute” as a core challenge.

Abstract

Scientific Deep Research (DR) agents answer user queries by synthesizing research papers into multi-section reports. User feedback can improve their utility, but existing protocols only score the final report, making it hard to study and learn which intermediate actions DR agents should take to improve reports. We collect DRACULA, the first dataset with user feedback on intermediate actions for DR. Over five weeks, nineteen expert CS researchers ask queries to a DR system that proposes actions (e.g., "Add a section on datasets"). Our users select actions they prefer, then judge whether an output report applied their selections successfully, yielding 8,103 action preferences and 5,230 execution judgments. After confirming a DR agent can execute DRACULA's actions, we study the predictability of user-preferred actions via simulation (how well LLMs predict the actions users select), a step toward learning to generate useful actions. We discover: (1) LLM judges initially struggle to predict action selections, but improve most when using a user's full selection history, rather than self-reported or extrapolated user context signals; (2) Users' selections for the same query differ based on unstated goals, bottlenecking simulation and motivating affordances that let users steer reports; and (3) Our simulation results inform an online intervention that generates new actions based on the user's past interactions, which users pick most often in follow-up studies. Overall, while prior work extensively studies execution, DRACULA reveals that a key challenge is deciding which actions to execute in the first place. We open-source DRACULA's study design, user feedback, and simulation tasks to spur future work on action feedback for long-horizon agents.
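To make the simulation task concrete, here is a minimal, hypothetical sketch of its shape: given a user's past action selections, predict which newly proposed action they will pick. A simple token-overlap (Jaccard) baseline stands in for the LLM judge, and all action strings below are invented examples, not items from the actual DRACULA dataset.

```python
# Hypothetical sketch of a DRACULA-style simulation task: predict which
# proposed action a user will select, given their selection history.
# A Jaccard token-overlap heuristic is used here instead of an LLM judge.

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two action strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def predict_selected_action(history: list[str], proposed: list[str]) -> str:
    """Score each proposed action by its best similarity to any past selection."""
    return max(
        proposed,
        key=lambda act: max((jaccard(act, past) for past in history), default=0.0),
    )

# Invented example: a user who previously asked for dataset-focused edits.
history = [
    "Add a section on datasets",
    "Add a table comparing dataset sizes",
]
proposed = [
    "Expand the related work discussion",
    "Add a section on benchmark datasets",
]
print(predict_selected_action(history, proposed))
# → Add a section on benchmark datasets
```

The point of the baseline is the task interface, not the method: the paper's finding (1) suggests that conditioning on the full selection history, as `history` does here, is what moves prediction accuracy, whereas self-reported or extrapolated user-context signals help less.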