Notes on English papers read in preparation for graduate recommendation (保研) interviews
Exposing the Guardrails: Reverse-Engineering and Jailbreaking Safety Filters in DALL·E Text-to-Image Pipelines
Villa C, Mirza S, Pöpper C. Exposing the Guardrails: Reverse-Engineering and Jailbreaking Safety Filters in DALL·E Text-to-Image Pipelines[J].
We investigate the specific design and implementation of safety guardrails in black-box text-to-image (T2I) models, such as DALL·E, which are implemented to prevent potential misuse from generating harmful image content. Specifically, we introduce a novel timing-based side-channel analysis approach to reverse engineer the safety mechanisms of DALL·E models. By measuring and analyzing the differential response times of these systems, we reverse-engineer the architecture of previously unknown cascading safety filters at various stages of the T2I pipeline. Our analysis reveals key takeaways by contrasting safety mechanisms in DALL·E 2 and DALL·E 3: DALL·E 2 uses blocklist-based filtering, whereas DALL·E 3 employs an LLM-based prompt revision stage to improve image quality and filter harmful content. We find discrepancies between the LLM's language understanding and the CLIP embedding used for image generation, which we exploit to develop a negation-based jailbreaking attack. We further uncover gaps in the multilingual coverage of safety measures, which render DALL·E 3 vulnerable to a new class of low-resource language attacks for T2I systems. Lastly, we outline six distinct countermeasure techniques and research directions to address our findings. This work emphasizes the challenges of aligning the diverse components of these systems and underscores the need to improve the consistency and robustness of guardrails across the entire T2I pipeline.
术语 | 翻译 |
---|---|
black box | 黑盒 |
reverse engineer | 逆向工程 |
cascading safety filters | 级联安全过滤器 |
blocklist-based filtering | 黑名单过滤 |
negation-based jailbreaking attack | 基于否定的越狱攻击 |
low-resource language attacks | 低资源语言攻击 |
discrepancies | 差异 |
contrast | 对比 |
countermeasures | 对策 |
multilingual | 多语言的 |
render | 使成为，使处于（某种状态） |
guardrails | 防护栏 |
Text-to-image (T2I) models, such as DALL-E, Stable Diffusion, and Midjourney, have gained immense popularity by enabling users to generate realistic images from textual descriptions. These AI platforms have seen rapid adoption in real-world products, including Microsoft Designer and ad platforms from Google and Meta, revolutionizing the way users create and interact with visual content. However, the widespread use of T2I models has also raised concerns about their potential for generating harmful content. These models can produce sensitive Not-Safe-for-Work (NSFW) images, such as depicting violence, nudity, and child-inappropriate material, as well as disturbing, hateful, and politically charged images. Despite efforts by developers to implement safety guardrails, unsafe synthetic images continue to proliferate across both mainstream and fringe social networks. Communities such as Unstable Diffusion, which focus on generating sexual content, have attracted tens of thousands of members. Moreover, AI-generated variants of notorious memes are being used to spread hateful ideologies. As T2I models become more sophisticated, minimizing safety risks is paramount. Since the launch of DALL·E 2, users have created an average of 34 million images daily, and the recently introduced DALL·E 3 is accessible to millions of users through API and ChatGPT interfaces. However, little is known about the specific design and implementation of its safety filters, as this information has not been publicly documented by the developers. Prior work on red teaming of safety guardrails has primarily focused on open-source models such as Stable Diffusion and concluded the black-box security-by-obscurity approach to be insufficient. Given the enterprise-grade hardware and model capabilities accessible to users who bypass safety mechanisms in frontier models such as DALL·E, the potential for harm is significantly amplified compared to open-source alternatives. Therefore, understanding and evaluating the effectiveness of DALL·E’s safety measures is crucial to mitigate the risks associated with its misuse and ensure the responsible deployment of this powerful technology.
In this paper, we present a novel approach to reverse-engineer and empirically map the cascading safety guardrails of DALL·E models using time-based side-channel analysis. Our methodology allows us to gain insights into the multi-stage filtering process, from user prompting to the final generated output. Through our analysis, we identify previously unknown filters and shed light on the differences in safety mechanisms between DALL·E 2 and DALL·E 3. Notably, we discover that DALL·E 3 incorporates a large language model (LLM)-based implicit filter to soften harmful prompts, while DALL·E 2 relies on conventional block-list and other more traditional filtering mechanics. Building upon our reverse-engineering of safety guardrails, we explore potential vulnerabilities and propose novel jailbreaking attacks specific to T2I models. Using low-resource-language and negation attacks, we exploit the limitations of the safety filters in handling less common languages and negated phrases. Finally, we draw upon our experimental findings to produce tangible countermeasure solutions that mitigate the timing side-channel and jailbreaking attacks.
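A minimal sketch of the timing measurement behind such a side-channel analysis, assuming a hypothetical `generate_image()` wrapper around a black-box T2I endpoint (this is not the authors' actual harness): prompts rejected by an early text filter should return noticeably faster than prompts that reach image generation, so latency clusters hint at which pipeline stage handled a probe.

```python
import statistics
import time

def generate_image(prompt: str):
    """Hypothetical wrapper around a black-box T2I API (e.g., a DALL·E endpoint)."""
    raise NotImplementedError

def measure_latency(prompt: str, trials: int = 5) -> float:
    """Median wall-clock response time for one probe prompt."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        try:
            generate_image(prompt)
        except Exception:
            pass  # rejected prompts still take a measurable amount of time
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Probe prompts chosen to trigger different (suspected) filter stages.
probes = {
    "benign": "a watercolor painting of a lighthouse",
    "blocklisted keyword": "<prompt containing a suspected blocklisted word>",
    "harmful, no keyword": "<paraphrased harmful prompt avoiding blocklisted words>",
}

for label, prompt in probes.items():
    print(f"{label:>20}: {measure_latency(prompt):.2f} s")
# Very fast rejections suggest an early blocklist; slower ones suggest a later
# embedding-, LLM-, or image-level filter in the cascade.
```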
In summary, our contributions in this work are:
1. We present the first reverse-engineering of the black-box cascading safety guardrails in DALL·E models using a novel time-based side-channel, providing insights into a multi-stage filtering process, identifying previously unknown blocking or modifying filters, and enabling a feedback channel that adaptive attacks may exploit.
2. We synthesize key takeaways for T2I system security by juxtaposing safety mechanisms present in DALL·E 2 and DALL·E 3, notably the incorporation of an LLM-based implicit filter in DALL·E 3 to soften harmful prompts, in contrast to the conventional blocklist and similarity-based filtering in DALL·E 2.
3. We introduce novel jailbreaking attacks specific to T2I models, namely T2I negation and low-resource-language attacks, which exploit the limitations of safety filters in handling negated phrases and less common languages.
4. We provide an actionable list of six countermeasure recommendations for T2I systems to prevent attacks and enumerate directions for future defense research.
术语 | 翻译 |
---|---|
red teaming | 红队测试 |
enterprise-grade | 企业级 |
feedback channel | 反馈通道 |
adaptive attacks | 自适应攻击 |
depict | 描绘 |
nudity | 裸露 |
disturbing | 恐怖的,令人不安的 |
synthetic | 合成的,人工的 |
proliferate | 激增,蔓延 |
fringe | 边缘的,极端的 |
ideologies | 思想意识形态 |
sophisticated | 复杂的,精密的 |
paramount | 极为重要的 |
frontier | 前沿,边疆 |
mitigate | 减轻,缓解 |
empirically | 实证地，经验上 |
incorporate | 纳入,包含 |
implicit | 隐含的 |
negated | 否定的 |
phrase | 短语,措辞 |
tangible | 具体的,切实的 |
juxtaposing | 并置,比较 |
We begin by providing contextual background on harmful content generation, the T2I (Text-to-Image) model architecture, and safety filters. The interpretation of harm in image-based content varies across cultures, regions, and countries, making it difficult to classify and categorize content. Previous work has highlighted the lack of research on the taxonomy of AI-generated harms in imagery. T2I models have traditionally been deployed in a typical architecture: input text (prompts) are delivered to a pre-trained model, which processes the text using models like CLIP or BERT. The input is then encoded into vector-based embedding representations. These embeddings serve as input for image generation models, including diffusion and autoregressive models, to generate the final output image. In this paradigm, numerous safety filters are integrated throughout the components of the T2I pipeline, as shown in Figure 1. These filters can be configured to either outright reject problematic inputs or trigger transformations to the provided prompt.
1. Text-based safety filters operate on the input prompt or its embedding representation (Filters 1-3 in Fig. 1). Simple filtering strategies work on a keyword basis, where certain words such as "bloody" or "naked" are always rejected. While easy to implement, these filters can be circumvented by using grammatical negations or finding similar uncensored words. More sophisticated strategies can use multidimensional vector embeddings of the input text and perform similarity checks with sensitive content. These similarity-based filtering mechanisms can be bypassed using strategies outlined by Rando et al.
2. Post-processing filters, the second type of safety filter commonly implemented in deployment models, operate on the output generated by the T2I model (Filter 4 in Fig. 1). These safety filters can be image classifiers designed to detect harmful content in generated images and prevent them from being delivered to the user. More sophisticated filtering strategies at this stage may also incorporate the input text alongside the output image to determine whether an image is harmful. This filter type attempts to mitigate attacks that bypass the textual filter and produce harmful results.
The introduction of the state-of-the-art DALL·E 3 model extends the previous deployment architecture by adding a language model into the image generation pipeline. This language model is instructed to expand the user prompt to make it more descriptive for the image model, and additional image descriptions have been shown to significantly improve the generation quality. These revised prompts are included in the response returned to the user, allowing for debugging and prompt engineering. A filtering characteristic introduced by the language model is its ability to directly refuse problematic prompts according to its own alignment. These LLM refusals can be detected (Filter 2 in Fig. 1) and result in an error message returned to the user. This refusal functionality is described in the DALL·E 3 System Card, which details the adversarial evaluation performed by OpenAI before releasing the model. Given the stochastic nature of large language models, this content refusal filter may cause non-deterministic variance between identical prompts when the model temperature is set to a high threshold.
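The cascade described above can be pictured with a small sketch. Assumptions: `embed()` and `image_classifier()` are hypothetical stand-ins for a CLIP-like text encoder and a post-hoc NSFW classifier, and the blocklist/threshold values are illustrative; the real DALL·E filters are not public.

```python
import numpy as np

BLOCKLIST = {"bloody", "naked"}   # Filter 1: simple keyword blocklist
SENSITIVE_EMBEDDINGS = []         # precomputed embeddings of known-harmful concepts

def embed(text: str) -> np.ndarray:
    """Hypothetical text encoder (e.g., a CLIP-like embedding)."""
    raise NotImplementedError

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def text_filters(prompt: str, threshold: float = 0.8) -> bool:
    """Filters 1-3: reject on blocklisted keywords or embedding similarity to sensitive concepts."""
    if any(word in BLOCKLIST for word in prompt.lower().split()):
        return False
    if SENSITIVE_EMBEDDINGS:
        e = embed(prompt)
        if any(cosine(e, s) > threshold for s in SENSITIVE_EMBEDDINGS):
            return False
    return True

def image_classifier(image) -> bool:
    """Filter 4 (post-processing): hypothetical classifier over the generated image."""
    raise NotImplementedError

def pipeline(prompt: str, generate):
    if not text_filters(prompt):
        return "rejected by text filter"
    image = generate(prompt)            # diffusion / autoregressive generator
    if not image_classifier(image):
        return "rejected by output filter"
    return image
```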
术语 | 翻译 |
---|---|
state-of-the-art | 最先进的,最新的 |
interpretations | 解释,诠释 |
complicate | 使复杂化 |
taxonomy | 分类法 |
autoregressive | 自回归的 |
outright | 彻底的,完全的 |
bloody | 血腥的 |
circumvent | 避开,规避 |
uncensored | 未审查的 |
presumably | 据推测,可能 |
adversarial | 对抗性的,敌对的 |
stochastic | 随机的 |
non-deterministic | 非确定性的 |
Rosetta: Enabling Robust TLS Encrypted Traffic Classification in Diverse Network Environments with TCP-Aware Traffic Augmentation
Xie R, Wang Y, Cao J, et al. Rosetta: Enabling Robust TLS Encrypted Traffic Classification in Diverse Network Environments with TCP-Aware Traffic Augmentation[C]//Proceedings of the ACM Turing Award Celebration Conference - China 2023. 2023: 131-132.
The majority of Internet traffic is encrypted using the Transport Layer Security (TLS) protocol. Recent advancements have leveraged Deep Learning (DL) models to classify encrypted traffic by automatically extracting complex and informative features from the packet length sequences of TLS flows. While existing DL models have demonstrated excellent classification results on encrypted traffic, our comprehensive study reveals that they experience significant performance degradation in real-world, diverse network environments. Upon systematically investigating the causes, we discovered that the packet length sequences of flows can change drastically due to various TCP mechanisms used for reliable transmission in different network conditions. To address this, we propose Rosetta, a solution that enhances the robustness of TLS encrypted traffic classification for existing DL models. Rosetta utilizes TCP-aware traffic augmentation mechanisms and self-supervised learning to capture implicit TCP semantics, allowing it to extract more robust features from TLS flows. Extensive experiments show that Rosetta significantly improves the classification performance of existing DL models on TLS traffic in diverse network environments.
术语 | 翻译 |
---|---|
Transport Layer Security | TLS |
packet length sequences | 数据包长度序列 |
TCP-aware traffic augmentation mechanism | 基于TCP的流量增强机制 |
degradation | 降级 |
augmentation | 增强 |
semantics | 语义 |
Network traffic classification aims to organize various traffic into different categories, which is fundamental and vital for network management and security. A number of network security tasks have been built on top of it, such as application identification [53, 56, 65], website fingerprinting [46, 49, 51, 52], malicious flow detection [33, 37, 38], and user profiling [16, 23].
With the fast-growing need of user privacy protection and the wide usage of the Transport Layer Security (TLS) protocol, a majority of the Internet traffic has been encrypted [12]. Traditional rule-based methods that examine packet payloads are becoming increasingly ineffective in classifying encrypted network traffic [6, 34].
Recent advances [16, 33, 38, 49, 51, 52] are leveraging deep learning (DL) techniques to conduct generic traffic classification. Particularly, as the packet payloads are converted into pseudorandom values after TLS encryption [36], a number of studies [7, 35, 46, 49–51] design various deep learning models to automatically extract complicated and high-level features from packet length sequences, which possess rich and discriminating implicit information of the encrypted flows.
Besides, it is convenient and low-cost to measure and derive packet length sequences in real-world large-scale networks, even supporting real-time traffic classification tasks [5, 6]. Though these DL models have been reported to achieve excellent classification results on encrypted traffic [3, 7, 13, 14, 41, 49], e.g., 98% classification accuracy [49], the performance of these models for various traffic classification tasks in the real-world diverse network environments is still not clear.
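For intuition, a minimal PyTorch-style sketch of this kind of model: a small 1D CNN over a padded packet length sequence. The layer sizes and input length are illustrative, not taken from any of the cited papers.

```python
import torch
import torch.nn as nn

class FlowCNN(nn.Module):
    """Toy classifier over packet length sequences (signed lengths can encode direction)."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, num_classes),
        )

    def forward(self, lengths: torch.Tensor) -> torch.Tensor:
        # lengths: (batch, seq_len) packet sizes, zero-padded and scaled
        return self.net(lengths.unsqueeze(1))

model = FlowCNN(num_classes=10)
dummy_flows = torch.randn(4, 128)   # 4 flows, 128 packet lengths each
logits = model(dummy_flows)         # shape (4, 10)
```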
We should note that when they are deployed in a real network for TLS traffic classification, they will face diverse network environments that are time-varying and unpredictable. For example, the packet loss rate and network delay may suddenly rise due to bursts of network traffic [48, 63]. In practice, the environment of a network can change in complex ways due to the joint effect of multiple factors, such as traffic bursts [48, 63], traffic engineering [22, 54], partial network failures [2, 64], and network updates [11, 45].
In this paper, we conduct a systematic study to check if existing deep learning models can effectively classify TLS encrypted traffic in diverse network environments. We study six different DL models including Deep Fingerprinting (DF) [51], FS-Net [35], Transformer [57], SDAE [4, 7, 46, 51], CNN [7, 46, 49, 51, 59], and LSTM [7, 46, 51, 59] that rely on packet length sequences to classify encrypted traffic.
We conduct experiments not only with the replayed traffic from two typical TLS traffic datasets [19, 39] in diverse network environments, but also with real TLS traffic that is generated by visiting popular websites and running online network applications in diverse Internet environments.
Our experiments confirm that all these DL models can achieve excellent results with the offline TLS traffic dataset for various classification tasks, including website fingerprinting, malicious flow identification, VPN traffic identification, and application fingerprinting. However, the performance of all models drops remarkably when they are tested in different network environments, e.g., about 53% accuracy drop at worst.
术语 | 翻译 |
---|---|
application identification | 应用识别 |
website fingerprinting | 网站指纹识别 |
malicious flow detection | 恶意流量检测 |
user profiling | 用户画像 |
real-time traffic classification tasks | 实时流量分类任务 |
malicious flow identification | 恶意流量识别 |
VPN traffic identification | VPN 流量识别 |
application fingerprinting | 应用指纹识别 |
traffic engineering | 流量工程 |
partial network failures | 部分网络故障 |
network updates | 网络更新 |
traffic burst | 流量突发 |
burst | 爆发 |
payloads | 数据负载 |
discriminating | 区分性的 |
implicit | 隐含的 |
replayed | 重放的 |
We find that the remarkable performance degradation results from the dramatic change of packet length sequences of the same flow in different network environments. For example, a TLS encrypted flow with the packet length sequence [q1, q2, q3, q4] may change to [q3, q2, q1, q4] due to high packet loss in another network environment. However, existing DL models fail to understand that the two different packet length sequences in different network environments actually originate from the same flow. Furthermore, we notice that the changes of packet length sequences follow the TCP specifications in different network environments, since TLS connections are built on the TCP protocol. Consequently, different TCP mechanisms ensuring reliable transmission in diverse network environments cause three major changes of packet length sequences, i.e., packet subsequence shift, packet subsequence duplication, and packet size variation. Thus, if a model can be aware of these regular packet sequence changes with TCP semantics, robust TLS encrypted traffic classification in diverse network environments may be achieved.
To this end, we develop Rosetta that is capable of enhancing robust TLS encrypted traffic classification for existing deep learning models. The main idea is to learn implicit TCP semantics from carefully crafted traffic and generate effective feature vectors that represent robust features of TLS flows in diverse network environments. Hence, existing deep learning models can leverage these feature vectors to achieve robust TLS encrypted traffic classification. Rosetta consists of two modules: TCP-aware traffic augmentation and traffic invariant extractor. We develop TCP-aware traffic augmentation algorithms based on a thorough understanding of TCP mechanisms that may affect packet length sequences of flows. Hence, we can generate massive flows that reflect how TLS flows may change in various network environments. The traffic invariant extractor applies self-supervised learning to extract robust features by projecting flow variants into a proper hidden space, reducing the distance among feature vectors of flow variants from the same flow. Consequently, flow variants coming from the same flow will have similar feature vectors.
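A minimal sketch of what such TCP-aware augmentations could look like on a packet length sequence; the span lengths, MSS value, and exact transformations below are illustrative assumptions, not Rosetta's actual algorithms.

```python
import random

def subsequence_shift(seq, max_span=3):
    """Emulate reordering (e.g., after retransmission): move a short span elsewhere in the flow."""
    if len(seq) < 2:
        return list(seq)
    i = random.randrange(len(seq) - 1)
    span = seq[i:i + max_span]
    rest = seq[:i] + seq[i + max_span:]
    j = random.randrange(len(rest) + 1)
    return rest[:j] + span + rest[j:]

def subsequence_duplication(seq, max_span=2):
    """Emulate spurious retransmission: repeat a short span of packets."""
    if not seq:
        return list(seq)
    i = random.randrange(len(seq))
    span = seq[i:i + max_span]
    return seq[:i + max_span] + span + seq[i + max_span:]

def packet_size_variation(seq, mss=1460):
    """Emulate segmentation changes: split packets larger than the (assumed) MSS."""
    out = []
    for size in seq:
        while size > mss:
            out.append(mss)
            size -= mss
        out.append(size)
    return out

flow = [517, 2920, 1460, 980, 36]
variants = [subsequence_shift(flow), subsequence_duplication(flow), packet_size_variation(flow)]
```

The invariant extractor would then be trained (self-supervised, e.g., with a contrastive objective) so that `flow` and each of its `variants` map to nearby feature vectors.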
We conduct extensive experiments to evaluate the effectiveness of Rosetta. The results show that Rosetta significantly improves the performance of existing deep learning models on traffic classification in diverse network environments with both replayed and real TLS flows. We further evaluate its classification robustness under different packet loss rates and different delays. Without enabling Rosetta, the classification accuracy of existing models drops remarkably when packet loss rates and delays increase. For example, the accuracy drops from 99% to 55% when the delay is increased from 0 to 50 ms. When Rosetta is enabled, the accuracy remains above 86%. Moreover, we compare our TCP-aware traffic augmentation algorithms with classical data augmentation methods, including Random Mask (RM) [17] and Random Swap (RS) [60], which have been widely used in the domains of Natural Language Processing (NLP) and Computer Vision (CV). With RM and RS, the average F1-score is less than 47% in six different network environments. With our TCP-aware traffic augmentation, the average F1-score is 87%. The results demonstrate that TCP-aware traffic augmentation is more effective at extracting robust features of TLS flows in different network environments.
术语 | 翻译 |
---|---|
Natural Language Processing | 自然语言处理 |
Computer Vision | 计算机视觉 |
crafted | 精心制作 |
massive | 大量的 |
Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions
Hou X, Zhao Y, Wang S, et al. Model context protocol (mcp): Landscape, security threats, and future research directions[J]. arXiv preprint arXiv:2503.23278, 2025.
The Model Context Protocol (MCP) is a standardized interface designed to enable seamless interaction between AI models and external tools and resources, breaking down data silos and facilitating interoperability across diverse systems. This paper provides a comprehensive overview of MCP, focusing on its core components, workflow, and the lifecycle of MCP servers, which consists of three key phases: creation, operation, and update. We analyze the security and privacy risks associated with each phase and propose strategies to mitigate potential threats. The paper also examines the current MCP landscape, including its adoption by industry leaders and various use cases, as well as the tools and platforms supporting its integration. We explore future directions for MCP, highlighting the challenges and opportunities that will influence its adoption and evolution within the broader AI ecosystem. Finally, we offer recommendations for MCP stakeholders to ensure its secure and sustainable development as the AI landscape continues to evolve.
术语 | 翻译 |
---|---|
data silos | 数据孤岛 |
seamless | 无缝的 |
interoperability | 互操作性 |
phases | 阶段 |
landscape | 局势;格局 |
integration | 集成 |
broader | 更广泛的 |
stakeholders | 利益相关者 |
sustainable | 可持续的 |
In recent years, the vision of autonomous AI agents capable of interacting with a wide range of tools and data sources has gained significant momentum. This progress accelerated in 2023 with the introduction of function calling by OpenAI, which allowed language models to invoke external APIs in a structured way [38]. This advancement expanded the capabilities of LLMs, enabling them to retrieve real-time data, perform computations, and interact with external systems. As function calling gained adoption, an ecosystem formed around it. OpenAI introduced the ChatGPT plugin [37], allowing developers to build callable tools for ChatGPT. LLM app stores such as Coze [4] and Yuanqi [50] have launched their plugin stores, supporting tools specifically designed for their platforms. Frameworks like LangChain [26] and LlamaIndex [29] provided standardized tool interfaces, making it easier to integrate LLMs with external services. Other AI providers, including Anthropic, Google, and Meta, introduced similar mechanisms, further driving adoption. Despite these advancements, integrating tools remains fragmented. Developers must manually define interfaces, manage authentication, and handle execution logic for each service. Function calling mechanisms vary across platforms, requiring redundant implementations. Additionally, current approaches rely on predefined workflows, limiting AI agents’ flexibility in dynamically discovering and orchestrating tools.
In late 2024, Anthropic introduced the Model Context Protocol (MCP)[3], a general-purpose protocol standardizing AI-tool interactions. Inspired by the Language Server Protocol (LSP) [22], MCP provides a flexible framework for AI applications to communicate with external tools dynamically. Instead of relying on predefined tool mappings, MCP allows AI agents to autonomously discover, select, and orchestrate tools based on task context. It also supports human-in-the-loop mechanisms, enabling users to inject data or approve actions as needed. By unifying interfaces, MCP simplifies the development of AI applications and improves their flexibility in handling complex workflows. Since its release, MCP has rapidly grown from a niche protocol to a key foundation for AI-native application development. A thriving ecosystem has emerged, with thousands of community-driven MCP servers enabling model access to systems like GitHub [41], Slack [42], and even 3D design tools like Blender [1]. Tools like Cursor [12] and Claude Desktop [2] demonstrate how MCP clients can extend their capabilities by installing new servers, turning developer tools, productivity platforms, and creative environments alike into multi-modal AI agents.
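As a rough illustration of what a unified interface means in practice, the sketch below builds MCP-style JSON-RPC messages for discovering and calling a tool. The method names, fields, and the `create_issue` tool are assumptions for illustration; the authoritative details live in the official MCP specification, not in this sketch.

```python
import itertools
import json

_ids = itertools.count(1)

def jsonrpc(method: str, params: dict | None = None) -> str:
    """Build a JSON-RPC 2.0 request, the wire format MCP is built on."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg)

# 1. A client asks an MCP server which tools it exposes.
list_tools = jsonrpc("tools/list")

# 2. The model (via the client) invokes one of the discovered tools.
call_tool = jsonrpc("tools/call", {
    "name": "create_issue",                       # hypothetical tool on a GitHub MCP server
    "arguments": {"repo": "octo/demo", "title": "Crash on startup"},
})

print(list_tools)
print(call_tool)
```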
Despite the rapid adoption of MCP, its ecosystem is still in the early stages, with key areas such as security, tool discoverability, and remote deployment lacking comprehensive solutions. These issues present untapped opportunities for further research and development. Although MCP is widely recognized for its potential in the industry, it has not yet been extensively analyzed in academic research. This gap in research motivates this paper, which provides the first analysis of the MCP ecosystem, examining its architecture and workflow, defining the lifecycle of MCP servers, and identifying potential security risks at each stage, such as installer spoofing and tool name conflict. Through this study, we present a thorough exploration of MCP’s current landscape and offer a forward-looking vision that highlights key implications, outlines future research directions, and addresses the challenges that must be overcome to ensure its sustainable growth.
Our contributions are as follows:
- We provide the first analysis of the MCP ecosystem, detailing its architecture, components, and workflow.
- We identify the key components of MCP servers and define their lifecycle, encompassing the stages of creation, operation, and update. We also highlight potential security risks associated with each phase, offering insights into safeguarding AI-to-tool interactions.
- We examine the current MCP ecosystem landscape, analyzing the adoption, diversity, and use cases across various industries and platforms.
- We discuss the implications of MCP’s rapid adoption, identify key challenges for stakeholders, and outline future research directions on security, scalability, and governance to ensure its sustainable growth.
术语 | 翻译 |
---|---|
general-purpose | 通用的 |
human-in-the-loop | 人类参与其中 |
multi-modal AI agent | 多模态 AI 智能体 |
installer spoofing | 安装程序伪装 |
autonomous | 自主的 |
momentum | 动力;势头 |
plugin | 插件 |
callable | 可调用的 |
fragmented | 支离破碎的 |
redundant | 冗余的 |
orchestrating | 协调;编排 |
approve | 批准 |
niche | 小众的 |
thriving | 蓬勃发展的 |
untapped | 尚未开发的 |
extensively | 广泛地 |
implication | 含义;影响 |
encompass | 包括;涵盖 |
safeguard | 保护;防范 |
scalability | 可扩展性 |
governance | 治理;管理 |
I Know What You Asked: Prompt Leakage via KV-Cache Sharing in Multi-Tenant LLM Serving
Wu G, Zhang Z, Zhang Y, et al. I know what you asked: Prompt leakage via kv-cache sharing in multi-tenant llm serving[C]//Proceedings of the 2025 Network and Distributed System Security (NDSS) Symposium. San Diego, CA, USA. 2025.
Abstract—Large Language Models (LLMs), which laid the groundwork for Artificial General Intelligence (AGI), have recently gained significant traction in academia and industry due to their disruptive applications. In order to enable scalable applications and efficient resource management, various multitenant LLM serving frameworks have been proposed, in which the LLM caters to the needs of multiple users simultaneously. One notable mechanism in recent works, such as SGLang and vLLM, is sharing the Key-Value (KV) cache for identical token sequences among multiple users, saving both memory and computation.
This paper presents the first investigation on security risks associated with multi-tenant LLM serving. We show that the state-of-the-art mechanisms of KV cache sharing may lead to new side channel attack vectors, allowing unauthorized reconstruction of user prompts and compromising sensitive user information among mutually distrustful users. Specifically, we introduce our attack, PROMPTPEEK, and apply it to three scenarios where the adversary, with varying degrees of prior knowledge, is capable of reverse-engineering prompts from other users.
This study underscores the need for careful resource management in multitenant LLM serving and provides critical insights for future security enhancement.
术语 | 翻译 |
---|---|
laid groundwork | 打下基础 |
traction | 吸引力 |
disruptive | 颠覆性的 |
simultaneously | 同时的 |
identical | 相同的 |
compromising | 危及，泄露 |
mutually | 互相地 |
The rise of Large Language Models (LLMs) like GPT [25] or Llama [44] has enabled a variety of new applications, including universal chatbots [5], virtual assistants [4], and code generators [6], applicable to both large-scale cloud deployments and small-scale local setups. As LLM applications become widespread, effectively serving concurrent requests from multiple users has become a non-trivial research question [32], [61], [34].
In fact, processing a single LLM request is already costly, as it generates Key-Value (KV) cache [37] for each token during the inference phase, occupying a considerable amount of GPU memory [12]. With limited GPU memory capacity, the extensive size of the KV cache [12] restricts the ability to serve concurrent requests, becoming a critical bottleneck in multi-tenant scenarios.
One promising solution proposed by recent work (e.g., vLLM, SGLang) [32], [61], [53] is to share the KV cache across the requests to reduce both computation and memory usage. The rationale is that identical tokens in different requests can generate the same KV cache if their preceding tokens are also identical. Figure 1 illustrates an instance of KV cache sharing.
When the first user submits the query “Imagine you are an IT expert and tell me how to install Windows”, the KV cache for each token is computed and stored on the LLM server. Thereafter, if another user issues a query “Imagine you are an IT expert and tell me how to install Linux”, the initial segment of the sentence—“Imagine you are an IT expert and tell me how to install”—has an identical KV cache. Hence, the second user can directly utilize the KV cache previously computed for the first user, recalculating only the differing segment, “Linux”.
KV cache sharing prevents duplicate KV storage on the GPU, allowing more user requests to be served concurrently. More importantly, it reduces the serving time for individual requests by eliminating unnecessary calculations.
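A minimal sketch of the prefix-matching idea behind this sharing (a toy dictionary cache, not vLLM's or SGLang's actual paged/radix-tree implementations): the server reuses cached KV entries for the longest shared token prefix and computes KV only for the remaining tokens.

```python
def longest_shared_prefix(a: list[int], b: list[int]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class ToyKVCache:
    def __init__(self):
        self.entries: list[tuple[list[int], list[str]]] = []  # (token ids, per-token KV)

    def serve(self, tokens: list[int]) -> tuple[int, int]:
        """Return (#tokens whose KV was reused, #tokens recomputed)."""
        best = max((longest_shared_prefix(tokens, t) for t, _ in self.entries), default=0)
        reused, computed = best, len(tokens) - best
        kv = [f"KV({tok})" for tok in tokens]   # stand-in for real key/value tensors
        self.entries.append((tokens, kv))
        return reused, computed

cache = ToyKVCache()
print(cache.serve([11, 22, 33, 44]))   # (0, 4): first request computes everything
print(cache.serve([11, 22, 33, 55]))   # (3, 1): shared prefix reused, only the last token recomputed
```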
术语 | 翻译 |
---|---|
universal | 普适的 |
applicable | 适用的 |
setup | 配置 |
concurrent | 并发的 |
non-trivial | 非平凡的 |
inference | 推理 |
extensive | 广泛的 |
bottleneck | 瓶颈 |
promising | 有前景的 |
rationale | 原则,理由 |
preceding | 前面的 |
duplicate | 重复的 |
However, in this paper, we point out that the KV cache sharing mechanism is not secure. Our key insight is that the KV cache sharing may inadvertently create side channel information, which can be leveraged by the adversary to carefully craft requests sent to the LLM server to determine if its requests match the other users’, thereby recovering other users’ prompts.
In this paper, we dig into the current KV cache sharing strategies and demonstrate how these can be exploited to reconstruct user input prompts. In particular, we propose PROMPTPEEK, which leverages the changes of serving order as side channel information, to repeatedly extract other users’ prompts from the LLM service. More specifically, PROMPTPEEK utilizes the side channel information to monitor the KV cache hits and extracts one token at a time from another user’s prompt. By iteratively repeating this process, PROMPTPEEK can reconstruct the entire prompt from another user.
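Conceptually, the extraction loop has the structure sketched below. `cache_hit_signal()` is a hypothetical stand-in for the actual side channel (PROMPTPEEK infers hits from changes in serving order, not from any API); the loop only illustrates the one-token-at-a-time shape of the attack.

```python
def cache_hit_signal(probe_tokens: list[str]) -> bool:
    """Hypothetical oracle: True if the server reuses its KV cache for this whole prefix.
    In the paper this is inferred from scheduling/serving-order changes."""
    raise NotImplementedError

def recover_prompt(vocab: list[str], max_len: int = 50) -> list[str]:
    recovered: list[str] = []
    for _ in range(max_len):
        extended = None
        for candidate in vocab:                      # try every candidate next token
            if cache_hit_signal(recovered + [candidate]):
                extended = candidate
                break
        if extended is None:                         # no candidate hits: prompt fully recovered
            break
        recovered.append(extended)                   # lock in the token and continue
    return recovered
```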
We assess PROMPTPEEK across three scenarios, where the adversary possesses different levels of background knowledge: knowledge of the prompt template, knowledge of the prompt input, and no background knowledge. Our results show that the adversary can achieve an average success rate of 99% in fully or partially reversing the prompt input, 98% in reversing the prompt template, and 95% without additional background knowledge, when tested on a Llama2-13B model on an A100 80G GPU.
At a higher level, our results show that the attack depends on three key factors: memory capacity, concurrent users’ requests, and attack requests. Memory capacity determines the feasibility of the attack, while both users’ and attack requests accelerate memory depletion. Besides, the attacker’s background knowledge minimizes the number of requests needed to recover prompts.
Our contributions. To summarize, we make the following contributions in this paper:
- We are the first to touch on the security risks in multi-tenant LLM serving, identifying it as a new attack surface in LLM security. We not only investigate the risks associated with KV cache sharing but also highlight the broader implications for any future LLM serving frameworks. Our research emphasizes the need for careful management of shared resources in these environments.
- We propose PROMPTPEEK and assess its feasibility across three scenarios. Unlike previous studies that only approximate prompt content, PROMPTPEEK accurately reconstructs prompts, which significantly increases privacy risks, as prompts may contain sensitive information like bank account numbers or health records. More importantly, we recognize that KV cache sharing is still in its early stages, so we outline three critical attack conditions that service providers and framework developers should consider in case of potential security risks.
- We simulate real-world LLM scenarios to evaluate the effectiveness and cost of our attack across three different environments, utilizing four distinct datasets on an A100 80G GPU. Our results reveal that our attack not only successfully uncovers prompt secrets but also at a low cost. For example, knowing the prompt template allows the adversary to uncover the prompt’s secrets, including gender, age, weight, and height, with just 60 requests in total.
术语 | 翻译 |
---|---|
inadvertently | 不经意地 |
iteratively | 迭代地 |
depletion | 耗尽,枯竭 |
touch on | 提及,涉及 |
broader implications | 更广泛的影响 |
CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement Learning
He G, Yoneki E. CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement Learning[C]//Proceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization. 2025: 493-506.
Large language models (LLMs) are notable for their substantial computational requirements. To mitigate the cost, researchers develop specialized CUDA kernels, which often fuse several tensor operations to maximize the utilization of GPUs. However, those specialized kernels may still leave performance on the table: CUDA assembly experts have shown that manual optimization of GPU SASS schedules can lead to better performance, and trial-and-error is largely employed to manually find the best GPU SASS schedules.
In this work, we employ an automatic approach to optimize GPU SASS schedules, which thus can be integrated into existing compiler frameworks. The key to automatic optimization is training an RL agent to mimic how human experts perform manual scheduling. To this end, we formulate an assembly game, where RL agents can play to find the best GPU SASS schedules. The assembly game starts from a -O3 optimized SASS schedule, and the RL agents can iteratively apply actions to mutate the current schedules. Positive rewards are generated if the mutated schedules get higher throughput by executing on GPUs.
Experiments show that CuAsmRL can further improve the performance of existing specialized CUDA kernels transparently by up to 26%, and on average 9%. Moreover, it is used as a tool to reveal potential optimization moves learned automatically.
LLMs are transformer-based deep neural networks (DNNs) consisting of many layers of self-attention [43] and linear projections. Since their appearance, state-of-the-art performance has been achieved across various domains, such as image generation [29] and natural language processing [41]. To date, OpenAI [30, 31] reports that more than 100 billion words are generated every day. As such, LLMs have become a significant workload in the deep learning community and have attracted much attention.
However, training and serving LLMs are computationally expensive because they typically consist of multiple layers of transformer backbone comprising billions of parameters. As a result, researchers have developed specialized CUDA kernels to accelerate LLM computation, instead of relying on high-level languages to generate CUDA kernels. For example, fused attention (flash-attention) [5] is developed such that the attention computation achieves better utilization of the shared memory of NVIDIA GPUs. Fused feed-forward is a kernel implementation that fuses multiple operators for LLAMA [41], and root-mean-square layer normalization is a popular layer normalization operator for transformers [46]. We observe that those works are typically implemented as handwritten hardware-efficient code, i.e. CUDA kernels for NVIDIA GPUs, for the flexibility and efficiency of hardware-vendor-provided programming models.
In this work, we investigate the possibility of further improving the handwritten kernels by exploring optimization at a lower level, i.e. hardware-native assembly. Specifically, we focus on NVIDIA CUDA kernels. Optimizing at a lower level allows us to further optimize existing specialized CUDA kernels, and this approach has been employed by previous works [12, 45], which show that manual optimization of GPU-native assembly schedules can lead to better performance. However, these works rely on trial-and-error to manually find the best GPU SASS schedules, which is a tedious process even for CUDA experts and cannot keep up with the development of new deep learning operators. Moreover, manual optimization cannot be integrated into existing compilation pipelines.
We propose CuAsmRL, an automatic optimizer for optimizing NVIDIA GPU SASS schedules. The idea of automatic optimization is achieved by training an RL agent, which mimics how human experts perform manual scheduling, to learn to find the optimized SASS schedule. To the best of our knowledge, we are the first to formulate the optimization of SASS schedules as an assembly game.
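A minimal sketch of how such an assembly game could be set up, as a toy gym-style environment. The action space (swap two instructions), the `benchmark()` stub, and the reward shaping are illustrative assumptions; CuAsmRL's real environment also has to handle instruction-legality constraints.

```python
def benchmark(schedule: list[str]) -> float:
    """Hypothetical: execute the kernel with this SASS schedule on a GPU, return throughput."""
    raise NotImplementedError

class AssemblyGame:
    """Toy environment: state = SASS instruction order, action = swap two instructions."""
    def __init__(self, o3_schedule: list[str]):
        self.initial = list(o3_schedule)     # game starts from the -O3 optimized schedule

    def reset(self) -> list[str]:
        self.schedule = list(self.initial)
        self.best = benchmark(self.schedule)
        return self.schedule

    def step(self, action: tuple[int, int]):
        i, j = action
        self.schedule[i], self.schedule[j] = self.schedule[j], self.schedule[i]
        throughput = benchmark(self.schedule)    # only legal schedules should be measured
        reward = throughput - self.best          # positive if the mutation runs faster
        self.best = max(self.best, throughput)
        return self.schedule, reward

# Usage sketch: an RL policy (here, imagine random.sample picking two indices) proposes swaps,
# and positive rewards reinforce mutations that increase measured throughput.
# env = AssemblyGame(o3_sass_instructions); state = env.reset()
# state, r = env.step((3, 7))
```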
Being able to automatically optimize SASS schedules enables us to integrate CuAsmRL into OpenAI Triton [40], an MLIR-based compiler for writing GPU kernels. CuAsmRL first uses an autotuner to find the optimal kernel configurations, then reuses the compilation pipeline of Triton but intercepts the generated cubin, disassembles it into SASS instructions, performs the optimization, and finally assembles it back into an optimized cubin.
By evaluating on characteristic LLM kernels, we find that CuAsmRL automatically discovers better schedules than the -O3 SASS schedule, leading to a geometric-mean throughput improvement of 1.09x. As this optimization takes place at a lower level, it is transparent to CUDA kernel developers. Given that LLM training and serving can easily consume millions of GPU hours, we expect this kernel-level improvement to be significant.
In summary, this paper makes the following contributions:
- We formulate optimizing SASS schedules as an assembly game, and we implement CuAsmRL, an automatic optimizer for optimizing NVIDIA GPU SASS schedules.
- We integrate CuAsmRL into an existing compiler framework, OpenAI Triton, as a SASS-to-SASS optimizer, and it is transparent to CUDA kernel developers.
- Our evaluation shows that representative specialized kernels for LLMs can be further accelerated by up to 26% and on average 9% on Ampere GPUs.
- We demonstrate CuAsmRL can be used as a tool to reveal optimization moves learned automatically, which can bring new insights into the optimization of SASS instructions.
GPUs are hardware accelerators that can perform highly parallel computation and therefore tensor operations can be executed efficiently. To program GPUs, programmers must follow the programming model provided by CUDA [20]. Conceptually, a CUDA kernel consists of a grid of thread blocks running concurrently, and inside each thread block are multiple threads. Each thread block is mapped to a GPU streaming multiprocessor and is executed individually and in parallel.
CUDA kernel developers often program in a high-level programming language, such as C++ or Python, and then compilers compile the kernel code to device code. In the case of C++, the compilation is done by NVIDIA’s compiler (NVCC), while for Python, Triton [40] can be used. The compilation process has several stages: first, the code is compiled to PTX, which is an intermediate language that is GPU-architecture independent [27]. Note that one can also directly embed PTX when programming with a high-level programming language.
Then, the PTX codes are compiled to SASS, which is only possible through NVIDIA’s proprietary compiler ptxas [27]. SASS is a native assembly language to the target GPU. That is, the SASS is specific to the target GPU’s architecture. In this work, we limit our discussion to Ampere GPUs. While the corresponding SASS codes of a CUDA kernel are obtainable by utilizing the CUDA binary utilities [28], the instruction set is only vaguely documented. As a result, the lowering and optimization at this stage are unknown and inaccessible.
Finally, the SASS codes are assembled into binary code (cubin) that can be executed directly on the GPU. The overall compilation process is shown in Figure 1.
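For reference, the stages above can be reproduced on a machine with the CUDA toolkit installed; the sketch below shells out to nvcc and cuobjdump (the file names and the sm_80/Ampere target are assumptions).

```python
import subprocess

# CUDA C++ -> cubin for an Ampere GPU (PTX is generated as the intermediate stage).
subprocess.run(
    ["nvcc", "-arch=sm_80", "-cubin", "kernel.cu", "-o", "kernel.cubin"],
    check=True,
)

# Disassemble the cubin back into SASS using the CUDA binary utilities.
sass = subprocess.run(
    ["cuobjdump", "--dump-sass", "kernel.cubin"],
    check=True, capture_output=True, text=True,
).stdout
print(sass[:500])
```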