Exploring and Testing Skill-Based Behavioral Profile Annotation: Human Operability and LLM Feasibility under Schema-Guided Execution

arXiv cs.CL / 4/17/2026


Key Points

  • The paper reframes Behavioral Profile (BP) annotation as a set of annotation “skills” and tests LLM-assisted annotation using a schema-guided, skill-file-driven pipeline rather than treating it as a single monolithic task.
  • Using a 14-feature BP schema and 3,134 Chinese concordance lines, the authors run a two-round schema-only protocol to classify each skill as directly operable, recoverable via focused re-annotation, or structurally underspecified.
  • GPT-5.4 can execute the subset of retained skills with substantial reliability (accuracy 0.678, κ 0.665, weighted F1 0.695), but this feasibility is selective: the model is not a wholesale replacement for human annotation.
  • Skill-level difficulty aligns strongly between humans and GPT (r = 0.881), while alignment breaks down at the instance level (r = 0.016) and lexical-item level (r = -0.142), implying shared taxonomy with independent execution behavior.
  • Open-source model failures are mainly attributed to schema-to-skill execution issues, leading the authors to argue that automatic annotation should be assessed by skill feasibility, not just task-level automation.
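The agreement metrics cited above (accuracy and Cohen's κ) are standard chance-corrected measures of annotator reliability. As an illustration only (the label sets and data here are invented, not the paper's), a minimal self-contained computation looks like this:

```python
from collections import Counter

def accuracy_and_kappa(gold, pred):
    """Compute raw accuracy and Cohen's kappa for two label sequences.

    Kappa corrects observed agreement p_o for chance agreement p_e,
    which is estimated from each annotator's marginal label distribution:
        kappa = (p_o - p_e) / (1 - p_e)
    """
    assert len(gold) == len(pred) and gold
    n = len(gold)
    p_o = sum(g == p for g, p in zip(gold, pred)) / n
    gold_marg, pred_marg = Counter(gold), Counter(pred)
    labels = set(gold) | set(pred)
    p_e = sum((gold_marg[l] / n) * (pred_marg[l] / n) for l in labels)
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return p_o, kappa

# Toy example: two annotators labeling six instances.
gold = ["A", "A", "B", "B", "C", "C"]
pred = ["A", "B", "B", "B", "C", "A"]
acc, kappa = accuracy_and_kappa(gold, pred)  # acc = 0.667, kappa = 0.5
```

A κ of 0.665, as reported for GPT-5.4, is conventionally read as "substantial" agreement, which matches the paper's wording.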

Abstract

Behavioral Profile (BP) annotation is difficult to automate because it requires simultaneous coding across multiple linguistic dimensions. We treat BP annotation as a bundle of annotation skills rather than a single task and evaluate LLM-assisted BP annotation from this perspective. Using 3,134 concordance lines of 30 Chinese metaphorical color-term derivatives and a 14-feature BP schema, we implement a skill-file-driven pipeline in which each feature is externally defined through schema files, decision rules, and examples. Two human annotators completed a two-round schema-only protocol on a 300-instance validation subset, enabling BP skills to be classified as directly operable, recoverable under focused re-annotation, or structurally underspecified. GPT-5.4 and three locally deployable open-source models were then evaluated under the same setup. Results show that BP annotation is highly heterogeneous at the skill level: 5 skills are directly operable, 4 are recoverable after focused re-annotation, and 5 remain structurally underspecified. GPT-5.4 executes the retained skills with substantial reliability (accuracy = 0.678, κ = 0.665, weighted F1 = 0.695), but this feasibility is selective rather than global. Human and GPT difficulty profiles are strongly aligned at the skill level (r = 0.881), but not at the instance level (r = 0.016) or lexical-item level (r = -0.142), a pattern we describe as shared taxonomy, independent execution. Pairwise agreement further suggests that GPT is better understood as an independent third skill voice than as a direct human substitute. Open-source failures are concentrated in schema-to-skill execution problems. These findings suggest that automatic annotation should be evaluated in terms of skill feasibility rather than task-level automation.
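The abstract's key design choice is that each of the 14 BP features is defined externally, by a schema file with decision rules and worked examples, and executed as its own focused "skill" rather than as part of one monolithic annotation task. A hypothetical sketch of what such a skill-file-driven setup might look like (all field names, labels, and example lines here are illustrative assumptions, not the paper's actual schema format):

```python
# Hypothetical skill files: each BP feature ("skill") is specified by
# allowed labels, decision rules, and worked examples. One focused
# prompt is built per feature, so each skill can be evaluated (and
# retained or dropped) independently.
SKILL_FILES = {
    "metaphoricity": {
        "labels": ["literal", "metaphorical"],
        "rules": ["Mark 'metaphorical' if the color term denotes a non-color property."],
        "examples": [("他气得脸都绿了", "metaphorical")],
    },
    "syntactic_function": {
        "labels": ["attributive", "predicative"],
        "rules": ["Mark 'attributive' if the term directly modifies a noun."],
        "examples": [("黑心的商家", "attributive")],
    },
}

def build_skill_prompt(skill: str, concordance_line: str) -> str:
    """Assemble a single-skill annotation prompt from its schema entry."""
    spec = SKILL_FILES[skill]
    rules = "\n".join(f"- {r}" for r in spec["rules"])
    shots = "\n".join(f"  {text!r} -> {label}" for text, label in spec["examples"])
    return (
        f"Skill: {skill}\n"
        f"Allowed labels: {', '.join(spec['labels'])}\n"
        f"Decision rules:\n{rules}\n"
        f"Examples:\n{shots}\n"
        f"Annotate: {concordance_line!r}\n"
        f"Answer with one label only."
    )

prompt = build_skill_prompt("metaphoricity", "这家公司被拉入了黑名单")
```

Framing each feature this way is what makes the paper's skill-level verdicts (directly operable / recoverable / structurally underspecified) possible: a skill whose schema entry cannot be executed consistently can be isolated without discarding the whole task.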