'Bad Likert Judge' Exposes Gaps in AI Content Moderation and Safety
'Bad Likert Judge' Exposes Gaps in AI Content Moderation and Safety
Researchers at Unit 42 have unveiled a jailbreak technique for large language models (LLMs) known as "Bad Likert Judge." This method exploits the models' safety mechanisms by prompting them to evaluate and score potentially harmful responses using the Likert scale, a common psychometric tool. By leveraging the LLM's evaluation capabilities, attackers can coax the models into generating harmful content associated with the highest Likert scale scores. Unit 42’s research, conducted by Yongzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and Danny Tsechansky, tested the technique across six leading LLMs from prominent vendors, including OpenAI, Microsoft, and Google. Their findings revealed a notable increase in the attack success rate (ASR), improving "more than 60% compared to plain attack prompts on average," according to the researchers.
The attack unfolds in several steps, beginning with prompting the target LLM to act as a "judge" to assess the harmfulness levels of generated responses. Next, the LLM is instructed to produce examples corresponding to various Likert scale scores, with the highest-scoring responses often containing harmful content. Further refinement through iterative prompts can amplify the harmfulness of these responses. During the evaluation, Unit 42 tested categories such as harassment, self-harm promotion, malware generation, and system prompt leakage. Notably, the ASR for harassment content rose to 95% in one model, while categories like "indiscriminate weapons" and "sexual content" also saw significant increases. The findings underscore the potential risks associated with AI misuse when safety guardrails are bypassed.
A key takeaway from Unit 42’s study highlights the critical role of content filtering systems in mitigating jailbreak risks. Researchers noted that "content filters significantly reduce the ASR across the model, with an average ASR reduction of 89.2 percentage points." However, they also acknowledged the limitations of such filters, pointing out that no LLM is entirely immune to jailbreak attempts. These vulnerabilities stem from inherent computational constraints and attention mechanisms, which adversaries can manipulate to bypass safety protocols. "Despite the effectiveness of content filtering, it is essential to acknowledge that it is not a perfect solution. Determined adversaries could still find ways to circumvent these filters, and there is always the possibility of false positives or false negatives in the filtering process."