SafeGuider: Robust and Practical Content Safety Control for Text-to-Image Models

Qi, Peigui; Tang, Kunsheng; Zhou, Wenbo; Zhang, Weiming; Yu, Nenghai; Zhang, Tianwei; Guo, Qing; Zhang, Jie

Peigui Qi¹, Kunsheng Tang¹, Wenbo Zhou¹*, Weiming Zhang¹, Nenghai Yu¹, Tianwei Zhang², Qing Guo³, Jie Zhang³*

¹University of Science and Technology of China, Hefei, China
²Nanyang Technological University, Singapore
³CFAR and IHPC, A*STAR, Singapore
*Corresponding Authors

ACM CCS 2025

Framework Overview

Overview of SafeGuider. In Step I, SafeGuider passes the input prompt through a text encoder and uses the [EOS] token embedding for safety assessment. Prompts with a safety score above 0.5 are considered safe and are forwarded directly to image generation, while unsafe prompts (safety score ≤ 0.5) are passed to Step II. In Step II, a safety-aware feature erasure (SAFE) beam search with beam width K modifies the unsafe prompt to obtain a safe yet semantically meaningful embedding for image generation.
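
The routing decision in Step I can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `route_prompt` and `score_fn` are hypothetical names, and `score_fn` stands in for the recognition model that maps an [EOS] embedding to a safety score. Only the 0.5 threshold and the two outcomes come from the overview above.

```python
from typing import Callable, Sequence

SAFETY_THRESHOLD = 0.5  # from the overview: scores > 0.5 are treated as safe


def route_prompt(
    eos_embedding: Sequence[float],
    score_fn: Callable[[Sequence[float]], float],
    threshold: float = SAFETY_THRESHOLD,
) -> str:
    """Decide which path a prompt takes (hypothetical helper).

    Returns "generate" when the score is strictly above the threshold,
    otherwise "beam_search" (the prompt is handed to Step II).
    """
    score = score_fn(eos_embedding)
    return "generate" if score > threshold else "beam_search"
```

Note that a score of exactly 0.5 falls on the unsafe side, matching the "≤ 0.5" condition in the overview.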

Abstract

Text-to-image models have shown remarkable capabilities in generating high-quality images from natural language descriptions. However, these models are highly vulnerable to adversarial prompts, which can bypass safety measures and produce harmful content. Despite various defensive strategies, achieving robustness against attacks while maintaining practical utility in real-world applications remains a significant challenge.

To address this issue, we first conduct an empirical study of the text encoder in the Stable Diffusion (SD) model, which is a widely used and representative text-to-image model. Our findings reveal that the [EOS] token acts as a semantic aggregator, exhibiting distinct distributional patterns between benign and adversarial prompts in its embedding space. Building on this insight, we introduce SafeGuider, a two-step framework designed for robust safety control without compromising generation quality.

SafeGuider combines an embedding-level recognition model with a safety-aware feature erasure beam search algorithm. This integration enables the framework to maintain high-quality image generation for benign prompts while ensuring robust defense against both in-domain and out-of-domain attacks. SafeGuider keeps attack success rates low, with a maximum attack success rate of only 5.48% across various attack scenarios.
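
To make the idea of a beam search guided by a safety score concrete, here is a rough sketch under strong assumptions. It assumes that "erasure" means deleting one token per step and that candidates are ranked by safety score alone; the page does not specify either detail, and the actual algorithm also aims to preserve semantics, which this sketch only approximates by stopping at the shallowest depth that yields a safe candidate. `erase_beam_search` and `safety_fn` are hypothetical names; `safety_fn` stands in for "encode the tokens, then score the [EOS] embedding".

```python
from typing import Callable, Optional, Sequence, Tuple

Tokens = Tuple[str, ...]


def erase_beam_search(
    tokens: Sequence[str],
    safety_fn: Callable[[Tokens], float],
    beam_width: int = 3,
    threshold: float = 0.5,
    max_depth: Optional[int] = None,
) -> Tokens:
    """Remove tokens one at a time, keeping the `beam_width` safest candidates.

    Illustrative only: returns the first candidate whose score exceeds
    `threshold`, or the best candidate found once the depth limit is reached.
    """
    start: Tokens = tuple(tokens)
    if safety_fn(start) > threshold:
        return start  # already safe, nothing to erase

    # Never delete every token: at least one token must remain.
    limit = len(start) - 1
    if max_depth is not None:
        limit = min(limit, max_depth)

    beams = [start]
    for _ in range(limit):
        # Expand each beam by deleting exactly one token at each position.
        candidates = {b[:i] + b[i + 1:] for b in beams for i in range(len(b))}
        # Highest safety score first; the tuple itself breaks ties deterministically.
        ranked = sorted(candidates, key=lambda c: (-safety_fn(c), c))
        beams = ranked[:beam_width]
        if safety_fn(beams[0]) > threshold:
            return beams[0]
    return beams[0]
```

With a toy scorer such as `lambda t: 0.1 if "bad" in t else 0.9`, calling `erase_beam_search(["a", "bad", "cat"], ...)` removes only the offending token and keeps the rest of the prompt.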

Moreover, instead of refusing to generate or producing black images for unsafe prompts, SafeGuider generates safe and meaningful images, enhancing its practical utility. In addition, SafeGuider is not limited to the SD model and can be effectively applied to other text-to-image models, such as the Flux model, demonstrating its versatility and adaptability across different architectures.

Empirical Study

Identifying the Text Condition Feature Aggregation Token

Observation 1:

The [EOS] token serves as a text condition feature aggregator in CLIP's text encoder.
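
As an illustration of what "using the [EOS] token embedding" means in practice, the sketch below selects the hidden state at the [EOS] position from a batch of encoder outputs. It is written with NumPy on plain arrays so it runs without a model; `eos_embedding` is a hypothetical helper, and the token id is passed in as a parameter rather than assumed. The first occurrence is used on the assumption that some tokenizer configurations also pad with the same token.

```python
import numpy as np


def eos_embedding(
    hidden_states: np.ndarray,  # shape (batch, seq_len, dim)
    input_ids: np.ndarray,      # shape (batch, seq_len)
    eos_token_id: int,
) -> np.ndarray:
    """Return the hidden state at the first [EOS] position of each sequence."""
    is_eos = input_ids == eos_token_id
    if not is_eos.any(axis=1).all():
        raise ValueError("every sequence must contain the [EOS] token")
    # argmax on a boolean array gives the index of the first True per row.
    eos_pos = is_eos.argmax(axis=1)
    batch_idx = np.arange(hidden_states.shape[0])
    return hidden_states[batch_idx, eos_pos]  # shape (batch, dim)
```

One reason this position can act as an aggregator is that CLIP's text encoder uses causal attention, so the [EOS] position can attend to every token that precedes it.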

Observation 2:

The condition feature aggregation process follows a hierarchical pattern from shallow to deep layers.

Analyzing Embedding Representations in [EOS] Aggregation Token

Visualization

Maximum Mean Discrepancy (MMD)

             Benign   VS Attacks   SJ Attacks
Benign       0        0.496        0.993
VS Attacks   0.496    0            1.000
SJ Attacks   0.993    1.000        0
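
For readers unfamiliar with MMD, the following sketch computes the standard biased estimate of squared MMD between two sets of embeddings with an RBF kernel. The kernel choice, the bandwidth `gamma`, and any normalization are assumptions here; the page does not state which variant produced the table, so this code is not expected to reproduce those exact values. It does share the table's structural properties: zero on the diagonal and symmetric off-diagonal entries.

```python
import numpy as np


def rbf_kernel(x: np.ndarray, y: np.ndarray, gamma: float) -> np.ndarray:
    """Pairwise RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq_dist = (
        (x ** 2).sum(axis=1)[:, None]
        + (y ** 2).sum(axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clamp tiny negative values caused by floating-point cancellation.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def mmd2(x: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> float:
    """Biased estimate of squared MMD between samples x and y (rows are points)."""
    k_xx = rbf_kernel(x, x, gamma).mean()
    k_yy = rbf_kernel(y, y, gamma).mean()
    k_xy = rbf_kernel(x, y, gamma).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)
```

In use, `x` and `y` would be arrays of [EOS] embeddings for two prompt categories, one embedding per row; values near zero indicate similar distributions, and larger values indicate a wider gap.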

Observation 3:

Prompts within the same category exhibit clear clustering patterns in [EOS] token embedding space.

Observation 4:

Prompts across different categories demonstrate significant distributional gaps in [EOS] token embedding space.

Generalization Across Different Text Encoders

To investigate the generality of our findings, we extend our analysis to T2I models with different architectures and text encoders. Beyond the CLIP ViT-L/14 encoder in SD-V1.4, we examine models like SD-V2.1, which uses OpenCLIP ViT-H/14, and Flux.1, which employs both CLIP ViT-L/14 and T5-XXL encoders.

Observation 5:

The discovered aggregation token patterns generalize across different text encoders and model architectures.

Experimental Results

Sexually Explicit Content Detection (IND/OOD)
Other Unsafe Themes Detection (IND/OOD)
Generation Quality
Nudity Removal Rate
Harmful Content Removal Rate

Visual Comparisons

Comparison of Sexually Explicit Content Mitigation

Comparison of Other Unsafe Content Mitigation

Comparison of Generation Quality on Benign Prompts

Cross-Architecture Performance (SD-V2.1 and Flux.1)

BibTeX

@inproceedings{qi2025safeguider,
  title     = {SafeGuider: Robust and Practical Content Safety Control for Text-to-Image Models},
  author    = {Peigui Qi and Kunsheng Tang and Wenbo Zhou and Weiming Zhang and 
               Nenghai Yu and Tianwei Zhang and Qing Guo and Jie Zhang},
  booktitle = {Proceedings of the 2025 ACM SIGSAC Conference on Computer and
               Communications Security, {CCS} 2025, Taipei, October 13-17, 2025},
  publisher = {ACM},
  year      = {2025},
  url       = {https://doi.org/10.1145/3719027.3744835},
  doi       = {10.1145/3719027.3744835}
}