LLM-PIEval: A benchmark for indirect prompt injection attacks in large language models
2024
Large Language Models (LLMs) have brought with them an unprecedented interest in AI in society. This has enabled their use in several day to day applications such as virtual assistants or smart home agents. This integration with external tools also brings several risk areas where malicious actors may try to inject harmful instruc-tions in the user query (direct prompt injection) or in the retrieved information payload of RAG systems (indirect prompt injection). Among these, indirect prompt injection attacks carry serious risks given the end users may not be aware of new attacks when they happen. However, detailed benchmarking of LLMs towards this risk is still limited. In this work, we develop a new framework named LLM-PIEval to measure any LLM candidate towards their risk for indirect prompt injection attacks. We leverage our framework to create a new test set and evaluate several state of the art LLMs using this test set, and observe strong attack success rates in most of them. We release our generated test set, along with API specifications and prompts to encourage wider assessment of this risk in current LLMs.
Research areas