
Testing Next-Generation AI to Better Detect Antisemitism 

As part of our ongoing efforts to advance antisemitism research, the Blue Square Alliance Command Center has been testing how newer large language models (LLMs) perform at identifying antisemitic content online. The goal of this work is to evaluate whether modern AI systems can replicate the nuanced reasoning that trained human annotators apply when judging antisemitic expression, especially in ambiguous or coded language contexts that have historically challenged automated systems. While several academic studies have analyzed earlier models such as BERT, GPT-3.5, and GPT-4, to our knowledge there has been no published research evaluating the more recent models that became available in 2024 and 2025.

Our exploratory benchmark experiments indicate that, when guided by carefully structured prompts and annotation frameworks, these newer models achieve stronger accuracy and balance in classifying antisemitic language compared to earlier systems. 

 (A glossary of key terms is included at the end of this article.) 

The Prompt and Evaluation Design 

Each model was tested using a custom prompt developed by Command Center researchers to mirror the reasoning process of expert human annotators. The benchmark dataset was independently annotated by our team, using a schema adapted from the Decoding Antisemitism Lexicon. The Decoding Antisemitism project was designed to operationalize the IHRA Working Definition of Antisemitism for research purposes––translating a primarily policy-oriented definition into one that can be systematically applied to data annotation and AI training. The initiative aims to understand online antisemitism in all its conceptual and communicative diversity. Grounding our benchmark in this framework ensures that the evaluation aligns with one of the most comprehensive models of antisemitic discourse available to date.

The prompt instructed models to classify posts as either Antisemitic or Not Antisemitic, applying this framework through detailed annotation rules for handling ambiguity, coded language, and political discourse around Israel and Zionism. 

Models were instructed to: 

  • Apply nuanced, context-sensitive reasoning 
  • Give the benefit of the doubt in genuinely ambiguous cases 
  • Distinguish between policy-based criticism of Israel and rhetoric that denies Jewish self-determination 

A key instruction read: 

“If the language in a post is genuinely ambiguous and could reasonably be interpreted as either antisemitic or not, give the benefit of the doubt and classify it as Not Antisemitic.” 

This structure was designed to test whether new LLMs could reproduce the kind of judgment exercised by trained human annotators while avoiding over-classification of political speech—a limitation often observed in commercial moderation systems such as Perspective API (Decoding Antisemitism 2023). 
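To make the evaluation setup concrete, the sketch below shows how a benchmark script might wrap these instructions around each post and request a binary label. It is an illustration only: the exact prompt wording, the helper name classify_post, and the OpenAI-compatible endpoint are assumptions, not the Command Center's published pipeline; the models we tested are served through their own APIs.

```python
# Illustrative sketch only: prompt wording, endpoint, and helper names are hypothetical.
import os
from openai import OpenAI  # assumes an OpenAI-compatible chat-completions endpoint

SYSTEM_PROMPT = """You are an expert annotator of antisemitic content.
Apply nuanced, context-sensitive reasoning grounded in the IHRA Working Definition
and the Decoding Antisemitism annotation schema. Distinguish policy-based criticism
of Israel from rhetoric that denies Jewish self-determination. If the language in a
post is genuinely ambiguous and could reasonably be interpreted as either antisemitic
or not, give the benefit of the doubt and classify it as Not Antisemitic.
Reply with exactly one label: Antisemitic or Not Antisemitic."""

def classify_post(client: OpenAI, model: str, post_text: str) -> str:
    """Ask the model for a binary label on a single post."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output so benchmark runs are repeatable
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Post: {post_text}"},
        ],
    )
    return response.choices[0].message.content.strip()

# Example usage (hypothetical endpoint and model name):
# client = OpenAI(base_url="https://example-inference-host/v1", api_key=os.environ["API_KEY"])
# label = classify_post(client, "llama-3.3-70b-instruct", "post text here")
```

Holding the temperature at zero and constraining the output to a single label keeps the benchmark runs repeatable and makes the model's decisions easy to score against the human annotations.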

Benchmark Dataset 

The benchmark set contained approximately 450 co-text-independent social media posts––meaning posts that were not replies or comments, and that did not reference real-world events occurring after the models’ training cutoff or obscure incidents outside general world knowledge. The dataset comprised roughly 37% antisemitic and 63% non-antisemitic examples. Many required background knowledge—such as familiarity with antisemitic tropes, conspiracy references, or coded terms—rather than reliance on explicit hate speech. This design allowed researchers to evaluate each model’s ability to detect implicit antisemitism rather than surface-level toxicity. 

Results: Clear Gains Over Earlier Generations 

Across the benchmark, all three models (tested in four configurations) achieved Weighted-F1 scores above 0.88, substantially higher than the 0.69-0.77 range observed in earlier fine-tuned transformer or LLM systems (Steffen / Pustet / Mihaljević, 2024; Patel / Mehta / Blackburn, 2025).

The Weighted-F1 score is a common metric used to evaluate classification models. In practice, it reflects how well the model performs across both antisemitic and non-antisemitic examples, even when one category is less represented in the data than the other (i.e., reflecting real-world data where antisemitic content is relatively rare). 
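For readers who want to see how this metric is computed, the snippet below uses scikit-learn's f1_score with average="weighted". The toy labels are invented for illustration and are not drawn from our benchmark data.

```python
# Toy illustration of the Weighted-F1 metric; labels are invented, not benchmark data.
from sklearn.metrics import f1_score

y_true = ["antisemitic", "not", "not", "not", "antisemitic", "not", "not", "not"]
y_pred = ["antisemitic", "not", "not", "antisemitic", "antisemitic", "not", "not", "not"]

# F1 is computed separately for each class, then averaged with weights
# proportional to how often each class appears in y_true (its "support").
print(f1_score(y_true, y_pred, average="weighted"))

# Per-class F1 scores, comparable to the per-class columns in the table below.
print(f1_score(y_true, y_pred, average=None, labels=["antisemitic", "not"]))
```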

| Model | Weighted-F1 | F1 (Antisemitic) | F1 (Non-Antisemitic) | Precision (Antisemitic) | Recall (Antisemitic) |
| --- | --- | --- | --- | --- | --- |
| Gemini 2.5 Flash (Standard) | 0.91 | 0.88 | 0.93 | 0.90 | 0.86 |
| Gemini 2.5 Flash (Dynamic Thinking) | 0.92 | 0.89 | 0.94 | 0.91 | 0.88 |
| Llama-3.3-70B-Instruct Turbo | 0.90 | 0.86 | 0.92 | 0.87 | 0.84 |
| MoonshotAI Kimi-K2-Instruct | 0.89 | 0.85 | 0.91 | 0.85 | 0.83 |

(Results from Blue Square Alliance Command Center benchmark testing, 2025) 

These results show both high overall accuracy and balanced performance across classes, a long-standing challenge in antisemitism detection where antisemitic examples represent a small minority of online posts. 

Why These Results Matter 

The improvement is significant for several reasons: 

  1. Filling a research gap: To date, there has been no systematic evaluation of post-2024 models for antisemitism detection. These benchmarks begin to close that gap. 
  2. Improved reasoning on ambiguous content: Antisemitism often hides behind irony, moral critique, or political language. Newer models demonstrated better contextual reasoning and discrimination in such cases. 
  3. Grounded in established frameworks: Because both the dataset and prompt were derived from the Decoding Antisemitism schema, the evaluation aligns model reasoning with expert human interpretation. 
  4. Replicable methods: Using a clear prompt structure based on IHRA definitions ensures interpretability and mitigates the opacity common to proprietary moderation systems. 

Next Steps: Toward Context-Aware Antisemitism Detection 

The Command Center’s benchmark set was intentionally not representative of real-world social media data. It was curated to include only co-text-independent posts––that is, posts that were not replies to a parent post or thread and that did not reference specific real-world events occurring after the models’ training cutoffs. This design allowed the team to isolate how models interpret linguistic content alone, without external cues or thread-based context.

While this setup was essential for establishing a clean, controlled baseline, it does not reflect the messier nature of real social media environments, where meaning often depends on conversation history, author patterns, or current events. Our next phase of research therefore focuses on bridging this gap––developing systems capable of classifying antisemitic content in any online setting, regardless of conversational depth or contextual complexity. To achieve this, the team is now conducting follow-up benchmarks to test whether providing models with additional context––such as parent comments, author metadata, or references to current events––can further improve interpretive accuracy. 

This line of work moves toward what can be called context-engineered or complete-context architectures: systems that equip models with the same interpretive information that human experts use when evaluating antisemitic language. These architectures integrate multiple complementary information types, each capturing a different layer of meaning: 

  • Conversational co-text: the surrounding thread, enabling models to detect tone, irony, or the intended target of speech. 
  • Current and historical event context: retrieved through time-filtered semantic search to situate posts that respond to news, anniversaries, or historical analogies. 
  • Named-entity context: resolved through entity-linking databases to clarify who or what is being referenced. 
  • Situational metadata: including timestamps, platform, and engagement indicators, which anchor interpretation in chronology and relevance. 

Each of these co-textual and contextual layers can be retrieved through a combination of semantic search and structured lookup, then injected into model prompts in a standardized format. While this framework remains in the experimental stage, it represents a promising direction for building models that reason with multiple sources of evidence rather than isolated text fragments. 
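As a rough illustration of what such a standardized format could look like, the sketch below bundles the four layers into a single text block that can be prepended to the classification prompt. The dataclass fields and section headings are hypothetical design choices, not a description of our production pipeline.

```python
# Hypothetical sketch of assembling the four context layers into one
# standardized prompt block; field names and layout are illustrative.
from dataclasses import dataclass, field

@dataclass
class ContextBundle:
    post_text: str
    co_text: list[str] = field(default_factory=list)        # parent/thread messages
    event_context: list[str] = field(default_factory=list)  # retrieved news/event snippets
    entities: dict[str, str] = field(default_factory=dict)  # entity -> short description
    metadata: dict[str, str] = field(default_factory=dict)  # timestamp, platform, engagement

def build_context_block(bundle: ContextBundle) -> str:
    """Render all layers into a single text block for injection into the prompt."""
    lines = ["## Post", bundle.post_text, "", "## Conversational co-text"]
    lines += bundle.co_text or ["(none)"]
    lines += ["", "## Event context"]
    lines += bundle.event_context or ["(none)"]
    lines += ["", "## Named entities"]
    lines += [f"- {name}: {desc}" for name, desc in bundle.entities.items()] or ["(none)"]
    lines += ["", "## Situational metadata"]
    lines += [f"- {key}: {value}" for key, value in bundle.metadata.items()] or ["(none)"]
    return "\n".join(lines)
```

Keeping the layers in a fixed order with labeled sections means the model always receives evidence in the same shape, which makes its reasoning easier to audit and compare across posts.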

In this sense, retrieval-augmented generation (RAG) could evolve into a broader paradigm of context-engineered inference, where classification operates as a structured interpretive process approximating human expert reasoning. We are now developing prototypes that integrate external knowledge retrieval with these context layers––an important step toward reliable, context-sensitive antisemitism detection at scale. 
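The sketch below illustrates one way the time-filtered semantic search mentioned above could work: posts are matched against an event corpus by embedding similarity, with a date filter so the model only sees events that precede the post. The embedding model, the corpus structure, and the function name retrieve_event_context are assumptions for illustration, not our deployed retrieval stack.

```python
# Illustrative time-filtered semantic search over a small in-memory event corpus.
# Each corpus entry is assumed to be {"text": str, "date": datetime.date}.
from datetime import date
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary choice of embedding model

def retrieve_event_context(post: str, post_date: date,
                           corpus: list[dict], top_k: int = 3) -> list[str]:
    """Return the event snippets most similar to the post, restricted to events
    published on or before the post's date (no 'future' information)."""
    eligible = [doc for doc in corpus if doc["date"] <= post_date]
    if not eligible:
        return []
    doc_vecs = model.encode([doc["text"] for doc in eligible], normalize_embeddings=True)
    query_vec = model.encode([post], normalize_embeddings=True)[0]
    scores = doc_vecs @ query_vec  # cosine similarity, since vectors are normalized
    ranked = np.argsort(scores)[::-1][:top_k]
    return [eligible[i]["text"] for i in ranked]
```

In a full pipeline, the snippets returned here would populate the event-context layer of the context bundle shown earlier before classification.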

Toward Responsible, Context-Aware AI 

The Command Center’s findings highlight that newer large language models, when guided by expert-informed annotation and context-sensitive prompting, may approach near-human-level accuracy in detecting antisemitic expression. These benchmarks are exploratory, not definitive. We continue to expand our datasets and to prototype systems that integrate retrieval-augmented and context-layered reasoning. As antisemitism evolves across digital spaces, the tools designed to identify it must evolve as well.

Glossary of Key Terms 

LLM (Large Language Model) 

An advanced type of AI trained on massive amounts of text to perform tasks that mimic human intelligence, such as understanding and generating language, recognizing patterns, or making decisions.

Antisemitism Detection Model 

A system designed to automatically identify antisemitic language in social media posts or other text. 

Annotation 

The process of labeling data to train or evaluate AI models. 

Co-text 

The surrounding text that appears with a post — such as the parent comment, replies, or other nearby messages in a thread. Co-text helps reveal meaning that isn’t obvious from the post alone. For example, a short comment like “Of course they did” might seem neutral by itself, but when read alongside the message it’s responding to, it could clearly reference an antisemitic statement. 

Context-Aware Model 

A model that considers external information — such as who wrote the post, when, or what it was replying to — rather than judging the text in isolation. 

Precision 

A measure of how many items the model labeled as antisemitic were actually antisemitic. High precision means few false alarms. 

Recall 

A measure of how many of the actual antisemitic posts the model successfully caught. High recall means it missed fewer cases. 

F1 Score 

A single metric that combines precision and recall to summarize overall performance. 

Weighted F1 Score 

A version of the F1 score that averages each class’s F1 in proportion to how often that class appears in the data. The majority class (here, non-antisemitic posts) therefore counts more, so the score better reflects real-world data imbalance.

Balanced Performance 

Refers to a model performing consistently across both categories (antisemitic and non-antisemitic) rather than excelling in one and failing in the other. 

RAG (Retrieval-Augmented Generation) 

An AI technique in which the model retrieves relevant information from an external database before generating an answer or classification — improving accuracy and transparency. 

Semantic Search 

A search method that looks for meaning rather than exact words. Instead of matching specific keywords, semantic search uses AI to understand the intent and context behind a query. For example, if you search “Jewish stereotypes in media,” a semantic search system can also find posts that discuss related ideas such as “Hollywood control” or “media influence,” even if they don’t use the same words. 
