Projects

Selected Paper Showcases

My research spans two closely related directions: AI Security and AI for Security.

One track asks how to secure AI systems themselves. The other asks how modern AI can protect people, platforms, and online ecosystems. A third stream of related studies examines model limits, measures behavior, and evaluates risk in adjacent security settings.

Track 01: AI Security

Defending models, measuring robustness, guiding secure code generation, and designing stronger safety mechanisms for AI systems.

Track 02: AI for Security

Using multimodal and language models to detect abuse, moderate harmful content, and improve online safety operations.

Track 03: Related Studies

Studies that probe model behavior and examine how users perceive risk in security-relevant settings.

JBShield framework figure showing jailbreak detection and mitigation.

JBShield defends aligned large language models against jailbreak attacks by inspecting what happens inside the model rather than relying only on surface-level prompt filters. The framework identifies toxic and jailbreak-related concepts in hidden activations, then intervenes on those concepts to preserve the model's refusal behavior under adversarial prompting.

This makes the defense more mechanistic than keyword blocking or prompt-only safeguards. The paper combines representation-level analysis with mitigation and shows that concept-aware intervention can substantially reduce successful jailbreak attacks across diverse LLMs.
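
To make the idea of concept-aware intervention more concrete, here is a minimal sketch on toy activation vectors. It is not the JBShield implementation: the difference-of-means direction, the threshold of 1.0, the `strength` parameter, and all function names are assumptions chosen for illustration. The sketch estimates a concept direction from two groups of activations, scores a new activation by its projection onto that direction, and removes the projected component when the score is high.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean(vectors):
    # Column-wise mean of equally sized vectors.
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def concept_direction(with_concept, without_concept):
    # Assumption: the concept is approximated by a unit vector pointing
    # from the mean "without" activation to the mean "with" activation.
    diff = [p - q for p, q in zip(mean(with_concept), mean(without_concept))]
    norm = math.sqrt(dot(diff, diff))
    if norm == 0.0:
        raise ValueError("the two groups have identical means")
    return [x / norm for x in diff]

def concept_score(activation, direction):
    # Projection of the activation onto the concept direction.
    return dot(activation, direction)

def suppress_concept(activation, direction, strength=1.0):
    # Subtract the projected component; strength=1.0 removes it entirely.
    score = concept_score(activation, direction)
    return [a - strength * score * d for a, d in zip(activation, direction)]

# Toy 3-dimensional "hidden activations" (made up for this example).
jailbreak_acts = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
benign_acts = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.2]]
direction = concept_direction(jailbreak_acts, benign_acts)

new_activation = [1.5, 0.3, -0.2]
flagged = concept_score(new_activation, direction) > 1.0  # hypothetical threshold
patched = suppress_concept(new_activation, direction) if flagged else new_activation
```

In a real model the vectors would come from a chosen layer's hidden states, and the direction, threshold, and intervention strength would need to be calibrated on held-out prompts.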

GRASP overview figure from the SCPGraph paper.

SCPGraph addresses a different security challenge: LLMs that generate functional but insecure code. The project encodes secure coding practices as graph structures and uses those graphs to guide LLM reasoning during code generation, grounding the model in concrete security constraints instead of vague instructions to "write safe code."

By operationalizing relationships among secure design rules, the framework helps the model avoid common implementation mistakes and improve secure coding performance on realistic tasks. It offers a practical way to turn security knowledge into structured guidance for AI-assisted software development.
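
One way to picture graph-guided generation is the small sketch below. It is an illustrative assumption, not the SCPGraph data model: the `PRACTICES` graph, its trigger keywords, the `leads_to` edges, and the prompt wording are all hypothetical. The sketch matches a task description to practice nodes, follows edges to related practices, and turns the result into explicit constraints for a code-generation prompt.

```python
import re
from collections import deque

# Hypothetical practice graph: each node has guidance text, trigger
# keywords, and edges to related practices that should also be applied.
PRACTICES = {
    "validate_input": {
        "text": "Validate and bound all external input.",
        "triggers": {"form", "upload", "query"},
        "leads_to": ["parameterize_sql", "encode_output"],
    },
    "parameterize_sql": {
        "text": "Use parameterized queries; never build SQL by concatenation.",
        "triggers": {"sql", "database"},
        "leads_to": [],
    },
    "encode_output": {
        "text": "Encode untrusted data before rendering it.",
        "triggers": {"html", "template"},
        "leads_to": [],
    },
    "hash_passwords": {
        "text": "Store passwords with a slow, salted hash.",
        "triggers": {"password", "login"},
        "leads_to": ["validate_input"],
    },
}

def relevant_practices(task):
    # Start from nodes whose triggers appear in the task, then walk
    # the graph breadth-first to collect related practices.
    words = set(re.findall(r"[a-z]+", task.lower()))
    start = [name for name, node in PRACTICES.items() if node["triggers"] & words]
    seen, order, queue = set(start), [], deque(start)
    while queue:
        name = queue.popleft()
        order.append(name)
        for neighbor in PRACTICES[name]["leads_to"]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order

def build_prompt(task):
    rules = [f"- {PRACTICES[name]['text']}" for name in relevant_practices(task)]
    return f"Task: {task}\nFollow these secure coding practices:\n" + "\n".join(rules)

task = "Write a login handler that checks the password against the database."
prompt = build_prompt(task)
```

The point of the graph, compared with a flat checklist, is that selecting one practice can pull in the practices it depends on, so the prompt carries concrete constraints rather than a generic request for safe code.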

Figure from the MultimodelRobustness paper on robustness of vision-language multimodal models.

This paper studies the robustness of vision-language multimodal models, focusing on how brittle cross-modal systems become when their inputs or assumptions shift. It maps the reliability limits of these models before they are relied on in downstream security workflows.

The project is foundational because the failure modes it surfaces matter most in safety-critical and security-sensitive settings. It also provides an early bridge to later work on multimodal moderation and trustworthy AI behavior.

HVGuard framework figure showing multimodal reasoning and mixture-of-experts fusion.

HVGuard studies hateful video moderation, where harm is often conveyed jointly through speech, visuals, sarcasm, and pacing instead of a single explicit cue. The system combines transcripts and video frames in a multimodal LLM pipeline so the model can reason over cross-modal evidence rather than treating a clip as isolated text or isolated imagery.

This makes the approach better suited to implicit and context-dependent hate in real short-video settings than text-only or image-only analysis. The project shows how structured multimodal reasoning can improve practical moderation for video platforms where harmful intent is often concealed behind humor, editing style, or audiovisual mismatch.

HMGuard overview figure showing challenge identification, prompt design, and harmful meme detection.

HMGuard focuses on harmful memes, a moderation problem where a small amount of text and imagery can hide hateful, harassing, or propagandistic intent. The framework uses multimodal large language models to reason about the relationship between the image, the overlaid text, and the broader social meaning conveyed by the meme.

The work treats meme moderation as an understanding problem rather than a pure classification task. That perspective is important for real-world moderation because harmful meaning often depends on cultural references, visual composition, and subtle multimodal cues that standard detectors miss.

UGCG-Guard overview figure showing data collection, prompting, VLM detection, and moderation.

UGCG-Guard targets promotional content used to lure users into unsafe user-generated content games. The system screens social posts, screenshots, and game-related imagery with large vision-language models to detect sexualized, exploitative, or otherwise illicit promotion before it spreads across the platform ecosystem.

The project is designed for the messy moderation setting around creator-driven games, where harmful content mixes platform-native slang, visual signals, and rapidly shifting promotion styles. It shows how LVLM-based moderation can protect vulnerable users, especially minors, in large UGC game communities.

HateGuard overview figure from the NewWave paper.

NewWave studies hate speech that surges around breaking events, where static moderation policies and older classifiers quickly go stale. The paper introduces an LLM-based reasoning framework that captures the narratives, slogans, and contextual references tied to newly emerging hate waves triggered by real-world events.

Instead of assuming that the target classes are fixed, the project treats moderation as a continual adaptation problem. The result is a more responsive way to track event-driven abuse and update detection strategies without retraining a new model from scratch for every shift in online discourse.
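
The adaptation-without-retraining idea can be sketched as an updatable store of event notes that feeds the prompt. This is a simplified illustration under stated assumptions, not the NewWave framework: `EventMemory`, the keyword-overlap retrieval, the placeholder event text, and the prompt wording are all hypothetical.

```python
import re
from dataclasses import dataclass, field

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

@dataclass
class EventNote:
    event: str
    description: str

@dataclass
class EventMemory:
    notes: list = field(default_factory=list)

    def add(self, event, description):
        # New events are appended at run time; no model weights change.
        self.notes.append(EventNote(event, description))

    def retrieve(self, post, top_k=2):
        # Assumption: rank notes by simple keyword overlap with the post.
        post_tokens = tokens(post)
        scored = [
            (len(post_tokens & tokens(note.event + " " + note.description)), note)
            for note in self.notes
        ]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [note for _, note in scored[:top_k]]

def build_prompt(post, memory):
    notes = memory.retrieve(post)
    context = "\n".join(f"- {n.event}: {n.description}" for n in notes)
    if not context:
        context = "- (no event context found)"
    return (
        "Recent event context:\n" + context + "\n\n"
        "Post: " + post + "\n"
        "Does this post contain hate speech tied to the events above? "
        "Answer yes or no, then explain."
    )

memory = EventMemory()
post = "They keep chanting placeholder-slogan at us"
before = build_prompt(post, memory)

# A new wave appears: add one note instead of retraining a classifier.
memory.add("example event", "placeholder-slogan is used to demean a targeted group")
after = build_prompt(post, memory)
```

The same post produces a different, better-informed prompt once the note is added, which is the sense in which detection can track a shift in discourse without a new training run.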

Key risk-distribution figure from the RethinkingUGCG paper.

This project examines children's online safety in user-generated content games by comparing how parents and children perceive risk. Rather than assuming age gates alone are enough, the study highlights the gap between adult oversight models and the types of content, interaction, and social exposure children actually encounter inside creator-driven game ecosystems.

The paper argues for safety interventions that are more context-aware and experience-driven than blanket age-based restrictions. It connects platform design, moderation policy, and lived user behavior in a way that is directly relevant to safer online game environments.

Prompting strategy and results figure from the LLM4HateSpeech paper.

LLM4HateSpeech examines whether large language models can reliably detect hate speech in realistic, context-heavy settings. The work compares prompt strategies and studies how contextual clues, task framing, and external knowledge affect LLM judgments on subtle or ambiguous hateful language.

Its contribution is diagnostic as much as empirical: the paper identifies where LLMs help, where they remain brittle, and which prompting strategies make them more useful for moderation. That makes it a strong foundation for later projects on LLM-based online safety systems.
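
A comparison of prompt strategies can be organized as a small harness like the one below. It is a sketch under stated assumptions, not the paper's evaluation code: the two templates, the placeholder examples, and `stub_model` (a keyword rule standing in for a real LLM call) are hypothetical, and the numbers it produces say nothing about real model performance.

```python
def zero_shot(post, context):
    # Strategy 1: the post alone, with no surrounding context.
    return f"Is the following post hate speech? Answer yes or no.\nPost: {post}"

def with_context(post, context):
    # Strategy 2: the same question with contextual clues prepended.
    return (
        f"Context: {context}\n"
        "Is the following post hate speech given this context? "
        f"Answer yes or no.\nPost: {post}"
    )

def accuracy(strategy, examples, model):
    # `model` is any callable that maps a prompt string to an answer string.
    correct = 0
    for post, context, label in examples:
        answer = model(strategy(post, context)).strip().lower()
        correct += answer.startswith("yes") == label
    return correct / len(examples)

def stub_model(prompt):
    # Stand-in for an LLM call, used only to exercise the harness.
    return "yes" if "targets a protected group" in prompt else "no"

# Placeholder examples: (post, context, is_hateful).
examples = [
    ("placeholder post A", "The phrase targets a protected group.", True),
    ("placeholder post B", "Ordinary sports banter.", False),
]

scores = {
    "zero_shot": accuracy(zero_shot, examples, stub_model),
    "with_context": accuracy(with_context, examples, stub_model),
}
```

Swapping `stub_model` for a real model client and `examples` for an annotated dataset would keep the harness unchanged, which makes it easy to add further strategies and compare them on equal terms.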

Keyan Guo

AI Security

JBShield

SCPGraph

MultimodelRobustness

AI for Security

HVGuard

HMGuard

UGCG-Guard

NewWave

Related Studies

Beyond Age-Based Restrictions

LLM4HateSpeech