Abstract
This paper proposes a defense framework against indirect prompt injection attacks in agent systems that integrates structure-aware, attention-based detection with preference-aligned purification. Our method identifies and removes malicious instructions embedded in structured interaction data while preserving task utility and model security. Specifically, we design an end-to-end structured defense pipeline that combines supervised fine-tuning with reinforcement-based policy optimization to filter adversarial content accurately without compromising structural integrity. To train the purification module, we construct the first adversarial dataset tailored to structured indirect injection scenarios. We further introduce a novel attack variant that manipulates response data fields to simulate more deceptive and realistic threats to agent behavior. Experiments on the AgentDojo benchmark show that, compared with existing detection-based defenses, our method not only significantly reduces attack success rates but also substantially improves the agent’s task completion performance in interactive settings.