Scaling CSS Methods for Cross-Platform Discourse Analysis¶
Swapneel Mehta, SimPPL
NYU Center for Social Media and Politics, 2026
About Me¶
Three open problems in cross-platform research¶
- Searching: how do you find the right posts from millions of candidates across platforms that don't share data with each other, leaving cross-platform patterns invisible to researchers?
- Organizing: how do you use LLMs to structure post content into interpretable themes when the models tend to generate repetitive labels across clusters? (Ziems et al. 2024, Computational Linguistics)
- Investigating: how do you build AI agents that help researchers investigate discourse further, and how do you evaluate whether their tool-calling outputs are actually useful?
Three contributions¶
We built Arbiter, an open investigative platform by the nonprofit SimPPL, covering X, YouTube, Reddit, Bluesky, Telegram, Meta, and Instagram.
- Query expansion: we break research questions into structured retrieval facets and combine keyword search, semantic search, and neural reranking
- Theme discovery: our clustering pipeline uses embedding geometry to reduce label redundancy. In preliminary results across 8 strategies, contrastive prompting substantially lowered redundancy
- Agent evaluation: we built a continuous evaluation framework measuring how well an AI agent selects and uses CSS tools when researchers ask questions in natural language
Data collection across seven platforms (and counting)¶
| Platform | Source | Notes |
|---|---|---|
| YouTube | Official Data API v3 | Video metadata, comments, search results |
| Bluesky | AT Protocol firehose | Full public post stream, open protocol |
| X/Twitter | Official API (pay-as-you-go) + twitterapi.io for pilots | Public posts |
| Reddit | Public data dumps | Subreddit-level archives |
| Telegram | Channel aggregation services | Public channel content |
| Meta/Instagram | Third-party services | Public page and post data |
| TikTok | Research API | Application-based access |
Platform APIs appear to be unreliable for research¶
- YouTube's Search API returns inconsistent results between identical queries. Jaccard overlap drops to around 30% after 12 weeks, and the API favors shorter, more popular videos (Efstratiou 2025, arXiv)
- Videos on YouTube become progressively unfindable within 20 to 60 days of publication, with a 76-92% loss in recoverable results after 10 weeks (Rieder et al. 2025, ResearchGate)
- TikTok's Research API fails to return metadata for 1 in 8 videos, including official TikTok content, with no error codes explaining the gaps (Entrena-Serrano et al. 2025, arXiv)
- These findings suggest researchers cannot rely on any single platform API to provide complete or consistent data over time
Tradeoffs with sampling¶
- Free data access drives researchers to over-study whichever platform is easiest to collect from, and implicitly overgeneralize findings that may not transfer (Tufekci 2014, ICWSM)
- We collect from seven sources including random samples from Bluesky (full firehose via AT Protocol) and Reddit (monthly public dumps from 40,000 subreddits)
- Our system is a search and structured discourse analysis tool designed to surface posts and actors that may exert issue-specific influence on different platforms
- We can identify visible actors, dominant themes, and cross-platform narrative differences for a given issue, but we cannot claim population-level prevalence or representativeness
Why retrieval matters for journalists¶
A journalist types a question. Our system needs to find the right posts from millions of candidates across seven platforms, in multiple languages, matching entities the journalist did not explicitly name. The challenge is doing this without returning thousands of irrelevant results.
Part 1: Query Expansion & Retrieval¶
Prior work on query expansion¶
- Rocchio (1971) introduced relevance feedback: shift the query vector toward relevant documents, but this assumes long texts with rich vocabulary
- Jagerman et al. (2023, arXiv) showed that LLM-generated expansion terms outperform classical feedback on standard IR benchmarks like BEIR
- Wang et al. (2023, EMNLP) had the LLM generate pseudo-documents as expanded context, improving BM25 recall by 3 to 15 percent
- On social media, free-form expansion adds terms that match thousands of short noisy posts, so recall goes up but precision collapses
What a journalist needs beyond her search query¶
Example: A journalist in Kenya types "Ibrahim Traore Burkina Faso" into our platform
| She typed | What the system also needs to find |
|---|---|
| "Ibrahim Traore" | Actors: "Captain Traoré," "Capitaine Traoré," "IB" |
| "Burkina Faso" | Orgs: MPSR, Alliance of Sahel States, CNSP |
| (nothing typed) | Events: Wagner Group departure, Sahel sovereignty |
| (nothing typed) | Phrases: "military junta," "pan-African sovereignty" |
Two words in, and the system needs to find dozens of related concepts across YouTube and Twitter in multiple languages, so the journalist can see the full picture of who is promoting Traoré and how the narrative differs across platforms.
How we identify and categorize entities¶
- A constrained LLM call decomposes the query into typed facets: actors, organizations, geographies, topics, exact phrases, and a semantic search query
- The LLM produces structured output following a schema that enforces categories (actor terms must be named people, geographic terms must be actual places, not inferred regions)
- When recent news or Wikipedia content is available, we inject it as grounded context so the planner extracts entities it can verify rather than speculate about
- Related efforts in entity-organized news: GDELT Project monitors global news in 100+ languages with real-time entity extraction, and news.smol.ai organizes content by extracted entities
- We are continuously building an entity knowledge base from these sources so that relationships between actors, organizations, and events surface as they emerge
The structured retrieval plan¶
{
"actorTerms": ["ibrahim traoré", "captain traoré"],
"organizationTerms": ["MPSR", "alliance of sahel states"],
"geoTerms": ["burkina faso", "ouagadougou"],
"topicTerms": ["military junta", "sahel sovereignty"],
"phrases": ["burkina faso transition"],
"semanticQuery": "Ibrahim Traoré military leadership"
}
Each facet type feeds into a different search clause: actorTerms, organizationTerms, and geoTerms enter as required matches. topicTerms and phrases enter as boosted optional matches. The semanticQuery is embedded for vector search.
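The facet-to-clause mapping above can be sketched as a small function. This is a minimal sketch assuming an Elasticsearch-style bool query DSL; the field name `text` and the boost value are illustrative assumptions, not our production schema (the `semanticQuery` facet is handled separately by vector search):

```python
def plan_to_query(plan):
    """Map a typed retrieval plan to keyword-search clauses, per the scheme
    described above: entity terms are required (at least one must match);
    topic terms and phrases are optional matches that only boost the score."""
    entity_terms = plan["actorTerms"] + plan["organizationTerms"] + plan["geoTerms"]
    return {
        "bool": {
            # required: every result must contain at least one named entity
            "must": [{
                "bool": {
                    "should": [{"match": {"text": t}} for t in entity_terms],
                    "minimum_should_match": 1,
                }
            }],
            # optional, boosted: topics and exact phrases raise the rank
            "should": (
                [{"match": {"text": t, "boost": 2.0}} for t in plan["topicTerms"]]
                + [{"match_phrase": {"text": p}} for p in plan["phrases"]]
            ),
        }
    }
```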
Lexical search finds exact string matches¶
We run BM25 keyword search across post titles, extracted entity names, hashtags, and @-mentions.
- A post tagged "#BurkinaFaso" or mentioning "@ibaborey" has no meaningful equivalent in embedding space. Keyword matching is the only reliable way to find these exact references
- Phrase matching catches multi-word expressions like "Alliance of Sahel States" that standard tokenization would split apart
- Actor and organization terms from the retrieval plan enter as required matches, so at least one named entity must appear in every returned result
Semantic search finds meaning, not keywords¶
The problem: a French-language post "Le Capitaine Traoré consolide le pouvoir au Sahel" is relevant to our English query but shares zero keywords with it. Keyword search will miss it entirely.
"Ibrahim Traoré military leadership"
|
v
Voyage AI 3.5-lite encoder
|
v
1024-dimensional vector
|
v
K-nearest neighbor search
against all post embeddings
already stored in our index
- Every post in our system is embedded as a 1024-dimensional vector when it enters the pipeline
- At query time, we embed the semantic query and find the closest posts by vector distance, regardless of the language they were written in
Combining both with Reciprocal Rank Fusion¶
Neither keyword search nor semantic search alone is sufficient, so we combine their results using Reciprocal Rank Fusion (Cormack, Clarke, and Buettcher 2009).
$$\text{RRF}(d) = \sum_{i=1}^{n} \frac{1}{k + \text{rank}_i(d)}$$
Here $d$ is a post, $n$ is the number of search systems being combined (here two: keyword and semantic), and $\text{rank}_i(d)$ is the position of post $d$ in the ranked list from system $i$. $k = 60$ is a smoothing constant that prevents the highest-ranked items from completely dominating the fused score.
Example: post A is ranked #1 in both systems: $\frac{1}{61} + \frac{1}{61} = 0.033$. Post B is ranked #1 in keyword search only: $\frac{1}{61} = 0.016$. Posts that both systems agree on receive roughly double the score of posts found by only one system.
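The fusion step above fits in a few lines. A minimal sketch (the post IDs are illustrative; a post absent from a list simply contributes nothing for that system):

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1/(k + rank) for each post across the
    ranked lists, then sort by the fused score, highest first."""
    scores = defaultdict(float)
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Posts that appear near the top of both lists accumulate two large terms, which is why agreement between the systems roughly doubles the score.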
Why reranking matters: an example¶
A journalist searches for "Ibrahim Traore Burkina Faso." The retrieval system returns 1,000 posts. But which ones actually answer her question?
| Post | Keyword match | Meaning match | Reranker score |
|---|---|---|---|
| "Traoré inaugurates new gold mine in Ouagadougou" | Yes (exact names) | Partially relevant | 0.65 |
| "Le Capitaine consolide le pouvoir au Sahel" | No (French, zero keyword overlap) | Highly relevant | 0.82 |
| "Burkina Faso weather forecast for Friday" | Yes (place name) | Not relevant | 0.08 |
- Keyword search finds the first and third but misses the French post entirely
- Semantic search finds the first and second but also surfaces loosely related content
- The reranker reads each post alongside the query and assigns an absolute relevance score, letting us drop everything below 0.2
Why rerankers, not LLMs, for relevance scoring¶
- A neural reranker reads query and post text together and produces a relevance score from 0 to 1, calibrated so that 0.8 means the same thing regardless of the query topic
- Purpose-built rerankers outperform LLM-based reranking by roughly 4.4 nDCG@10 points on standard retrieval while being an order of magnitude smaller (Sun et al. 2024, MAIR)
- Voyage rerank-2.5 adds instruction-following to the reranker architecture, gaining 8-12% accuracy when task instructions are provided (Voyage AI, 2025)
- We drop posts scoring below 0.2, which filters 60-80% of candidates before expensive downstream processing
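The thresholding and batching described above (batch size 200, per appendix A3) can be sketched as follows; the reranker call itself is assumed to happen elsewhere, so this operates on already-scored `(post, score)` pairs:

```python
def filter_by_relevance(scored_posts, threshold=0.2):
    """Drop candidates below the absolute relevance threshold before
    expensive downstream processing (0.2 removes 60-80% of candidates)."""
    return [(post, score) for post, score in scored_posts if score >= threshold]

def batched(seq, size=200):
    """Yield fixed-size batches; the reranker is called 200 posts at a time."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]
```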
Part 2: Geometric Theme Discovery¶
LLMs produce repetitive cluster labels¶
- Huang and He (2024, arXiv) found LLMs produce "different descriptions for the same label," requiring a dedicated merging stage
- Janssens et al. (2025, ECML PKDD) found HDBSCAN on social media data produces "an excessive number of topics, many semantically overlapping"
- TopicGPT (Pham et al. 2024, NAACL) addresses this by merging near-duplicate topics after generation using pairwise similarity
- In our experience, labeling 53 clusters independently produced "IndiGo Flight Disruptions" for 28 of them because the model has no knowledge of what other clusters exist
Existing approaches to label diversity¶
- BERTopic (Grootendorst 2022, arXiv) uses MMR to diversify keywords within a single topic, but has no cross-topic label uniqueness mechanism when using LLMs
- LLooM (Lam et al. 2024, CHI) induces concepts using LLMs, but the scoring step alone is 79.9% of its $1.44-per-run cost and it does not verify cross-concept uniqueness
- Most approaches fix redundancy after generation by merging duplicates. Our approach prevents duplicates during generation by using embedding geometry to filter candidates
Why we use UMAP + HDBSCAN for clustering¶
We wanted to find natural groupings in social media posts without specifying the number of clusters in advance. We benchmarked 5 clustering algorithms against 4 dimensionality reduction methods on our datasets; the table shows silhouette scores for four of the algorithms, without reduction and with UMAP.
| Algorithm | No Reduction | With UMAP |
|---|---|---|
| K-Means | 0.059 | 0.550 |
| K-Medoids | 0.076 | 0.587 |
| HDBSCAN | 0.445 | 0.706 |
| Community Det. | 0.253 | — |
The silhouette score measures how well separated the clusters are, from -1 (wrong cluster) to 1 (perfectly separated). UMAP (McInnes et al. 2018, arXiv) + HDBSCAN produced the highest separation at 0.706.
- One tradeoff: HDBSCAN assigns 29-76% of posts as outliers rather than forcing them into clusters. We absorb these outliers into the nearest cluster by cosine similarity to centroids
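The outlier-absorption step above can be sketched in pure Python, assuming HDBSCAN's convention of labeling outliers `-1` and a dict of cluster centroids (the vectors below are toy 2-d examples, not real embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def absorb_outliers(embeddings, labels, centroids):
    """Reassign HDBSCAN outliers (label -1) to the nearest cluster centroid
    by cosine similarity, instead of discarding 29-76% of posts."""
    out = list(labels)
    for i, lab in enumerate(labels):
        if lab == -1:
            out[i] = max(centroids, key=lambda c: cosine(embeddings[i], centroids[c]))
    return out
```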
Our approach: filter labels geometrically¶
Clusters with embeddings
|
v
+---------------------------+
| 1. Compute centroids |
| 2. Merge if sim >= 0.85 |
+-------------+-------------+
v
For each cluster:
+---------------------------+
| 3. Find 2 nearest |
| neighbor clusters |
| 4. Show LLM posts from |
| this cluster + neighbors|
| 5. Generate 5 candidates |
| 6. Embed all 5 |
| 7. Filter: discard if |
| sim >= 0.70 to prior |
| 8. Pick closest to |
| centroid |
+---------------------------+
- Merge threshold (0.85) and filter threshold (0.70) are empirically determined heuristics
- On our datasets, this produces more distinct labels than the 7 alternative strategies we benchmarked
Example: how label filtering works¶
Cluster 12 is about agricultural development. A prior cluster was already labeled "Development Landscape." The LLM generates 5 candidates for cluster 12:
| Candidate | Similarity to "Development Landscape" | Result |
|---|---|---|
| Agricultural Development Initiatives | 0.74 | Filtered (>= 0.70) |
| Burkina Faso Farming Modernization | 0.68 | Passes |
| Rural Economic Transformation | 0.61 | Passes |
| Development Strategies | 0.82 | Filtered (>= 0.70) |
| Sahel Agricultural Programs | 0.55 | Passes |
The algorithm selects "Burkina Faso Farming Modernization" because it has the highest similarity to the cluster centroid among the candidates that passed the filter. If all 5 had been filtered, the algorithm would fall back to the best candidate regardless.
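The filter-then-select logic can be sketched on precomputed similarities. The prior-label similarities below come from the table; the centroid similarities are assumed values for illustration only:

```python
def select_label(candidates, prior_sim, centroid_sim, cutoff=0.70):
    """Discard candidates too similar (>= cutoff) to prior labels, then pick
    the survivor closest to the cluster centroid; if every candidate is
    filtered, fall back to the best candidate overall."""
    survivors = [c for c in candidates if prior_sim[c] < cutoff]
    pool = survivors or candidates  # fallback when all are filtered
    return max(pool, key=lambda c: centroid_sim[c])
```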
Preliminary: contrastive context reduces redundancy¶
We benchmarked 8 labeling strategies on a single dataset of 2,138 posts across 53 clusters. This is preliminary work and we are still exploring how well these results generalize to other corpora.
| Strategy | Unique labels | Redundancy |
|---|---|---|
| Contrastive (with neighbor context) | 98.1% | 0.216 |
| Negative constraints ("avoid these labels") | 92.5% | 0.509 |
| Post-hoc deduplication | 94.3% | 0.443 |
| Our hybrid pipeline | >98% | 0.231 |
Redundancy here means the average pairwise cosine similarity between all label embeddings (lower is better). Our hybrid pipeline is currently exceeding 98% uniqueness on the datasets we have tested, and we are working to ensure this holds consistently as we create new case studies.
Why changing clusters changes labels entirely¶
One challenge with LLM-based labeling that we want to be upfront about: when the clustering parameters change or a different random seed is used, the cluster boundaries shift and the labels change completely. This is a stability problem shared by all methods in this space.
- We use a hierarchical structure: leaf-level clusters (subconcepts) get grouped into parent themes (concepts) using K-Means on cluster centroids, and the LLM generates a summary label for each parent
- For large volumes of posts, the cost of LLM calls for labeling adds up. Our contrastive batched approach uses roughly 7 LLM API calls total for a typical analysis, compared to 265+ for the exhaustive diversity approach and $1.44 per run for LLooM
- Our ongoing work focuses on improving label stability across different runs and expanding the evaluation to more datasets and languages
Case study: Traoré discourse across platforms¶
We analyzed 974 YouTube posts and 1,117 Twitter posts about Ibrahim Traoré using our pipeline.
- YouTube is dominated by development and infrastructure themes: "Burkina Faso Leadership Dynamics" (290 posts, 29.8%), "Development Landscape" (217 posts, 22.3%)
- Twitter centers on sovereignty and anti-colonial rhetoric: "National Identity" (235 posts), "Colonial Legacy" (99 posts), "Diplomatic Invitation Rejections" (70 posts)
- The top YouTube actor, "PILLARS OF HISTORY 12," averages 5.3 million interactions per post. "Make Africa Great Together" published posts like "Ibrahim Traoré helps people have good jobs" reaching 5.19 million interactions
- Multiple fan accounts use identical hashtag sets (#ibrahimtraore #burkinafaso) and similar short-form titles, consistent with what Rogers and Righetti (2025, Platforms & Society) describe as manufactured attention patterns
Templated content across promotion accounts¶
These YouTube accounts use identical sentence structures with only the positive claim swapped out, a pattern consistent with AI-generated promotional content:
| Account | Post template | Interactions |
|---|---|---|
| Make Africa Great Together | "Ibrahim Traoré helps people have good jobs 🙏❤ #ibrahimtraore" | 5.19M |
| Make Africa Great Together | "Ibrahim Traore gives hope to people 🙏❤ #ibrahimtraore" | 5.08M |
| Make Africa Great Together | "Ibrahim Traore is ready to meet everyone 🙏❤ #ibrahimtraore" | 4.33M |
| TheNewTribe | "Ibrahim Traoré's leadership brings modern irrigation" | 2.02M |
| TheNewTribe | "Ibrahim Traoré's leadership gives Burkina Faso a strong voice at the UN" | 59.6K |
| MY LOVE AFRICA | "Ibrahim Traoré: The President Who Owns Nothing" | 3.54M |
| MY LOVE AFRICA | "Ibrahim Traoré: Africa's Youngest Revolutionary Leader" | 63.6K |
- The sentence structure, emoji patterns, and hashtag sets are identical within each account, with only the positive claim varying
- These accounts collectively generated over 20 million interactions on YouTube
What we want the agent to do¶
A journalist asks a question in plain language: "Which accounts are shaping climate skepticism narratives and what reach do they achieve?" The agent needs to select the right analytical tools, generate code, execute it, and return a visualization the journalist can use in her reporting.
Part 3: Agent Evaluation¶
What is a tool call?¶
A tool is a function that an AI agent can decide to run when it needs to do something it cannot do with language alone. Here is a simple example:
def get_theme_actors(platform, theme_name, limit=10):
    """Return the accounts posting most about a theme, ranked by engagement."""
    posts = search_by_theme(platform, theme_name)
    actors = count_by_author(posts)  # {author: total engagement}
    return sorted(actors, key=actors.get, reverse=True)[:limit]
- When a researcher asks "who is posting the most about gambling on Twitter," the agent decides to call get_theme_actors("twitter", "gambling") rather than trying to answer from memory
- The challenge is knowing which tool to call, with what arguments, and when to chain multiple tools together. Standard LLM benchmarks do not test this because they evaluate conversation quality, not tool selection
Tools available to our agent¶
| Category | Tools | What they do |
|---|---|---|
| Actor Profiling | 4 | Topic fingerprint, impact metrics, activity timeline, first mention |
| Topic Analysis | 3 | Stance distribution, velocity over time, top voices |
| Theme Analysis | 5 | Actors per theme, engagement leaderboard, cross-platform comparison |
| Claims Analysis | 9 | Claims by status, by type, by actor, search, cross-platform |
| Data Retrieval | 6 | Search posts, semantic search, posts by author or time range |
| Visualization | 7 | Tables, bar/line/pie charts, metric cards, post cards |
Each tool wraps a CSS method as a callable function. When a researcher asks a question, the agent selects from these 34 tools, generates code to call them, and returns a visualization.
Chaining tools into a research pipeline¶
A journalist asks: "Are there accounts coordinating to promote banned trading platforms, and do they operate across YouTube and Twitter?"
searchPosts --> getThemeActors --> getActorTimeline
(binary options) (promotion themes) (posting cadence)
|
+-------------------+
v
getTopicStance --> compareAcrossPlatforms
(promotional vs (same accounts or themes
critical stance) on other platforms?)
- Step 1 retrieves posts mentioning specific platform names (Exness, Quotex, Pocket Option) across all platforms
- Step 2 identifies which accounts are posting most within promotional themes
- Step 3 checks each account's posting frequency and cadence to surface coordinated scheduling
- Step 4 separates promotional content from critical or educational content using stance analysis
- Step 5 checks whether the same accounts or promotional themes appear on other platforms
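The five steps above can be chained as ordinary function composition. A sketch with stubbed tools: the function names follow the diagram, but the canned return values are invented for illustration; the real tools query our index:

```python
# Stubbed tool implementations so the chaining logic is runnable.
def search_posts(query):
    return [{"id": 1, "author": "acct_a", "platform": "youtube"}]

def get_theme_actors(posts):
    return ["acct_a"]

def get_actor_timeline(actor):
    return {"actor": actor, "cadence": "daily"}

def get_topic_stance(posts):
    return {"promotional": 0.8, "critical": 0.2}

def compare_across_platforms(actors):
    return {a: ["youtube", "twitter"] for a in actors}

def investigate(query):
    """Chain the five steps described above into one research pipeline."""
    posts = search_posts(query)                          # 1. retrieve mentions
    actors = get_theme_actors(posts)                     # 2. top promotional accounts
    timelines = [get_actor_timeline(a) for a in actors]  # 3. posting cadence
    stance = get_topic_stance(posts)                     # 4. promotional vs critical
    overlap = compare_across_platforms(actors)           # 5. cross-platform presence
    return {"actors": actors, "timelines": timelines,
            "stance": stance, "overlap": overlap}
```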
Banned trading platforms promoted on YouTube¶
We searched 6,345 posts across YouTube, Twitter, and Bluesky for mentions of specific trading platforms. The promotion ecosystem is concentrated almost entirely on YouTube.
| Trading platform | YouTube posts | Regulatory status |
|---|---|---|
| Exness | 102 | Banned by SEBI (India) |
| Quotex | 99 | Unregulated binary options, banned in EU |
| Pocket Option | 79 | Unregulated, not licensed in major jurisdictions |
| Binarium | 58 | Unregulated binary options broker |
| IQ Option | 41 | Binary options banned in EU, restricted in India |
Promotion disguised as education¶
| Trading platform | YouTube posts | Regulatory status |
|---|---|---|
| Deriv | 26 | Restricted in several jurisdictions |
| Olymp Trade | 12 | Banned by SEBI (India) |
| OctaFX | 10 | Banned by SEBI (India) |
| Expert Option | 2 | Unregulated |
- The content takes the form of "trading education" and "AI trading bot demonstrations" rather than direct ads, making it harder for platform moderation to detect
- Cross-tagging is common: individual posts reference multiple platform names in hashtags (e.g., #quotex #pocketoption #binaryoptions) to capture search traffic
- The majority of these YouTube creators appear to be based in India, promoting platforms their own regulator has banned
Each platform plays a different role in the ecosystem¶
| Platform | Posts | What we found |
|---|---|---|
| YouTube | 3,606 | Binary options promotion through "education" content. Indian creators dominate, promoting SEBI-banned platforms |
| Twitter | 1,941 | Crypto casino advertising (MetaWin: 36.9M interactions from 4 identical posts) alongside trading tip accounts and prop trading promotions |
| Bluesky | 335 | 82% focused on Polymarket insider trading allegations. Functions as the accountability layer |
- YouTube hosts the promotion in educational packaging. Twitter hosts the direct advertising. Bluesky hosts the critical analysis
- This division of labor across platforms is invisible to any single-platform monitoring tool
- The MetaWin posts on Twitter (identical copy: "30% extra on all deposits," posted Jan 22, 23, 29) are straightforward advertising. The YouTube ecosystem is more subtle: it wraps promotion inside trading tutorials and AI bot demonstrations
How our agent routes questions¶
User question
|
v
+-------------------+
| Primary Agent: |
| classify intent |
+----+----+----+----+
| | |
v | v
Direct | Planner: decompose
reply | into steps
v |
Executor v
(simple) Executor (each step)
|
success? --no--> Planner revises
|
v
Return result
- The Primary Agent classifies the request and routes it to the right handler
- For complex questions, the Planner selects tools for each step and the Executor writes and runs code in a sandbox
- If the code fails, the Planner gets a second attempt to revise the approach
Why we need domain-specific evaluation¶
- MT-Bench (Zheng et al. 2023, NeurIPS) showed that LLM-as-judge agrees with human evaluators at 80%+ for general conversational quality, but it does not test whether an agent picks the right analytical tool for a given research question
- No existing benchmark measures whether an AI agent's social media analysis is useful to an actual researcher
- Our approach: define question archetypes (content analysis, actor profiling, thematic discovery, stance detection, claims verification, temporal trends), hand-rate a sample of answers, then use a larger LLM as judge that can also rerun the code to verify that the output matches the response
Six evaluation dimensions¶
| Dimension | Scale | What it measures |
|---|---|---|
| Intent | 0-2 | Did the agent correctly understand what the user asked? |
| ToolSel | 0-2 | Did it pick the right tool? Using generic code when a specialized function exists is a failure |
| Code | 0-2 | Is the generated code functionally correct? Scored by comparing against hand-written implementations |
| Response | 0-2 | Is the answer relevant and grounded in evidence? Averaged across 5 axes by an LLM judge |
| Exec | 0-1 | Did the code run without errors? A binary check |
| Error | 0-2 | When something breaks, does the agent recover and still provide useful output? |
Failure modes across nine case studies¶
We evaluated 60 questions per case study across 9 case studies (540 total evaluations).
| Code | What happened | Count |
|---|---|---|
| NO-CODE | Should have written code, gave text only | 140 |
| TOOL-001 | Used generic code instead of a specialized function | 108 |
| RESP-003 | The written answer contradicts what the code produced | 88 |
| RESP-002 | Asked clarifying questions instead of attempting an answer | 52 |
| CODE-001 | Tried to access a data field that does not exist | 19 |
The two most common failures tell us the agent is too cautious about writing code and does not take advantage of the specialized analysis functions we built.
Content type predicts failure patterns¶
| Case Study | Top failure | Avg score |
|---|---|---|
| Ibrahim Traore | Skill Bypassed (13) | 8.0 |
| Digital Identity | Skill Bypassed (21) | highest |
| Bondi Beach Attack | No Code (27) | ~5.0 |
| Online Gambling | No Code (29) | ~6.0 |
| Online Safety Act | No Code (27) | ~5.5 |
- The agent tends to avoid generating code when the content involves violence or financial regulation, but performs well on political topics where it has more context to work with
- This pattern across our nine case studies suggests that content-aware routing could help: the agent should behave differently depending on the sensitivity and complexity of the topic it is analyzing
Prompt optimization: ongoing work¶
We applied GEPA (Agrawal et al. 2025, ICLR 2026 Oral), a method that uses natural-language reflection to evolve prompts. This is ongoing work with a small evaluation set of 39 examples from one case study, and we welcome feedback on the methodology.
- The evolved prompt improved 22 questions but also made 23 others worse on our broader evaluation set
- Useful patterns it identified: routing tables that map user intents to specific functions work better than prose descriptions of what each tool does. Negative code examples ("never do this") proved as instructive as positive ones. Multi-strategy search should be the default rather than a fallback
- The value so far is in the failure analysis rather than the overall score improvement. We are continuing to expand our evaluation dataset and refine the framework
Benchmark scores depend on the prompt, not just the model¶
- Meaning-preserving prompt formatting changes can swing accuracy by up to 76 points on the same model (Sclar et al. 2024, ICLR)
- GPT-3.5-turbo performance varies by up to 40% on code tasks depending on whether the prompt uses plain text, Markdown, JSON, or YAML (He et al. 2024, arXiv)
- A benchmark score is a property of the (model, prompt) pair, not the model alone
- Prompt optimization treats both as a joint artifact and searches for the prompt configuration where the model performs best on your specific task
The evaluation loop¶
Define question --> Evaluate agent --> Classify failures
archetypes on each type into taxonomy
^ |
| v
+------- Improve prompt <-----------+
or architecture
- This cycle used to require months of manual work, but with our evaluation framework one complete iteration runs in hours
- The bottleneck has moved from engineering to evaluation design: the quality of the loop depends on whether we are measuring the right things
- Related: DSPy (Khattab et al. 2023, arXiv) for declarative prompt optimization; Toolformer (Schick et al. 2023, NeurIPS) for self-supervised tool learning
Published research methods can become tool calls¶
- Hanley and Durumeric (2024, IEEE S&P) track 52,036 news narratives at scale using MPNet embeddings and DP-Means clustering. Their narrative detection method could become a tool call that identifies which stories are spreading across platforms in a case study
- Nakov et al. (2021, IJCAI) define the full fact-checking pipeline: claim detection, evidence retrieval, and verification. Each stage maps to a separate tool call. Their group at MBZUAI also released OpenFactCheck (EMNLP 2024), a modular framework designed to be integrated into downstream systems
- CSMAP's own work on ideology scoring (Eady et al. 2024, Political Analysis) maps the ideology of news outlets from link-sharing data. Their cross-lingual narrative similarity method (Waight et al. 2025, Sociological Methods & Research) distills texts to core claims for comparison across languages
- Each of these could be wrapped as a function that our agent calls when a journalist asks the right question. The goal is to make vetted social science methods accessible in real time on custom datasets, within weeks rather than years
Broader takeaways¶
- Cross-platform information is growing faster than our tools to study it. Practitioners protecting information integrity need shared infrastructure that works across platforms
- Our methods are not perfect. Our sampling is not random, our theme labels are heuristically filtered rather than formally guaranteed, and our agent optimization is still in early stages. Each of these is a meaningful step forward and we are working to improve them
- Published research methods can now become tool calls in a shared platform, making them accessible to journalists and researchers who work with social media data but do not write code
Journalists in eight countries use our platform¶
Ten journalists in Kenya run influence operation investigations through Deutsche Welle Akademie. We collaborate with the NEST Center for Journalism in Mongolia on media development and with Jagran New Media, one of India's largest digital newsrooms. Rappler in the Philippines, founded by Nobel laureate Maria Ressa, uses our platform for accountability reporting.
Our investigations have contributed to Meta's Q1 2024 Adversarial Threat Report and surfaced influence operations on YouTube before the platform's own systems flagged them.
Summary¶
- Query expansion: we decompose research questions into typed facets with context grounding from news and Wikipedia, combine keyword and semantic search, and use neural reranking to produce absolute relevance scores
- Theme discovery: our geometric filtering pipeline uses heuristic cosine similarity thresholds to reduce label redundancy during generation rather than fixing it after the fact. Preliminary benchmarks across 8 strategies show contrastive prompting substantially reduces redundancy
- Agent evaluation: a continuous six-dimension evaluation framework with a failure mode taxonomy that reveals content-type-specific weaknesses across nine case studies. This is ongoing work and we welcome collaboration on shared evaluation benchmarks
Team¶
References¶
Sampling & APIs: Tufekci (2014) ICWSM. Efstratiou (2025) arXiv. Rieder et al. (2025) ResearchGate. Entrena-Serrano et al. (2025) arXiv.
Retrieval: Jagerman et al. (2023) arXiv. Wang et al. (2023) EMNLP. Rodriguez & Spirling (2022) J. Politics. Cormack et al. (2009) SIGIR. Sun et al. (2024) MAIR. Voyage AI (2025) rerank-2.5.
Themes: Grimmer, Roberts & Stewart (2022) Text as Data, Princeton UP. Grootendorst (2022) arXiv. Lam et al. (2024) CHI. Huang & He (2024) arXiv. Janssens et al. (2025) ECML PKDD. Pham et al. (2024) NAACL.
Agents & Evaluation: Agrawal et al. (2025) ICLR. Zheng et al. (2023) NeurIPS. Khattab et al. (2023) arXiv. Starbird et al. (2019) CSCW. Rogers & Righetti (2025) P&S.
Research as Tool Calls: Hanley & Durumeric (2024) IEEE S&P. Nakov et al. (2021) IJCAI. Wang et al. (2024) EMNLP. Eady et al. (2024) Pol. Analysis. Waight et al. (2025) Soc. Methods & Research.
Thank you¶
swapneel@simppl.org | @swapneelm
Vetted access: arbiter.simppl.org
SimPPL is a US 501(c)(3) nonprofit, supported by Google, Mozilla, Ford Foundation, Omidyar Network, and Wikimedia.
Why we support journalists in this work¶
- Over 2,900 American news outlets have shut down since 2005, leaving 212 counties with no local news source at all (Medill, 2025)
- Platforms invest in trust and safety teams, but these internal mechanisms do not surface findings to the public
- Journalists are typically the first to respond to information integrity threats during elections, public health crises, and civic events
- Our goal is to make cross-platform investigation tools accessible to the people whose job it is to hold power accountable and report accurately on online discourse
Appendix: Mathematical Details¶
A1: RRF worked example¶
$$\text{RRF}(d) = \sum_{i=1}^{n} \frac{1}{k + \text{rank}_i(d)}, \quad k = 60$$
Post A ranked #1 in both systems: $\frac{1}{61} + \frac{1}{61} \approx 0.033$
Post B ranked #1 in keyword search only (not retrieved by semantic search): $\frac{1}{61} \approx 0.016$
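The worked example above can be reproduced in a few lines; a minimal sketch of Reciprocal Rank Fusion, where each system contributes $1/(k + \text{rank})$ only for documents it actually ranked:

```python
def rrf_score(ranks, k=60):
    """Reciprocal Rank Fusion: sum 1/(k + rank) over each system
    that ranked the document; systems that missed it contribute nothing."""
    return sum(1.0 / (k + r) for r in ranks)

# Post A: ranked #1 by both keyword and semantic search
score_a = rrf_score([1, 1])   # 1/61 + 1/61
# Post B: ranked #1 by keyword search only
score_b = rrf_score([1])      # 1/61

print(round(score_a, 3), round(score_b, 3))  # 0.033 0.016
```

Because ranks rather than raw scores are fused, the two systems need not produce comparable score scales.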
A2: Cosine similarity and thresholds¶
$$\cos(\mathbf{v}_1, \mathbf{v}_2) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\|\mathbf{v}_1\| \times \|\mathbf{v}_2\|}$$
| Use | Heuristic threshold |
|---|---|
| Cluster merging | >= 0.85 |
| Label filtering | >= 0.70 |
| Label post-hoc merge | > 0.92 |
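The similarity measure and the heuristic thresholds from the table can be sketched directly from the formula (pure Python, no dependencies; the vectors here are toy examples, not real embeddings):

```python
import math

def cosine(v1, v2):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2)

# Heuristic thresholds from the table above
CLUSTER_MERGE = 0.85       # merge clusters at >= 0.85
LABEL_FILTER = 0.70        # filter candidate labels at >= 0.70
LABEL_POSTHOC_MERGE = 0.92 # merge labels post hoc at > 0.92

a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(cosine(a, b))  # parallel vectors -> 1.0
```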
A3: Reranking¶
The reranker scores each post against the query text on a 0-to-1 scale. We retain posts scoring >= 0.2. Batch size: 200 posts. Timeout: 15 seconds.
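The batch-and-threshold logic can be sketched as follows. The reranker call itself is abstracted behind a placeholder `rerank_fn`; its interface here (query plus batch in, list of 0-to-1 scores out) is an assumption for illustration, not the actual Voyage API signature:

```python
def rerank_filter(posts, query, rerank_fn, threshold=0.2, batch_size=200):
    """Score posts against the query in batches and keep those at or
    above the threshold, sorted by descending relevance.
    `rerank_fn(query, batch) -> list of 0-1 scores` is a stand-in for
    the reranker API call (e.g. a neural reranker); interface assumed."""
    kept = []
    for start in range(0, len(posts), batch_size):
        batch = posts[start:start + batch_size]
        scores = rerank_fn(query, batch)
        kept.extend((p, s) for p, s in zip(batch, scores) if s >= threshold)
    return sorted(kept, key=lambda ps: ps[1], reverse=True)

# Toy scorer for illustration: longer posts score higher
demo = rerank_filter(["a", "abcd", "ab"], "q",
                     lambda q, b: [len(p) / 4 for p in b])
print([p for p, s in demo])  # ['abcd', 'ab', 'a']
```

A per-request timeout (15 seconds in the pipeline) would wrap the `rerank_fn` call; it is omitted here to keep the sketch minimal.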
A4: UMAP adaptive parameters¶
$$n_{\text{neighbors}} = \min(15,\; \max(2,\; \lfloor n/3 \rfloor))$$
$$n_{\text{components}} = \min(50,\; \max(2,\; \lfloor n/5 \rfloor))$$
min_dist = 0.1, metric = cosine, random_state = 42
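The adaptive parameter formulas translate directly to code; a minimal sketch producing the keyword arguments one would pass to `umap-learn` (the dict keys follow that library's parameter names):

```python
def umap_params(n):
    """Adaptive UMAP parameters per the formulas above; n = number of posts."""
    return {
        "n_neighbors": min(15, max(2, n // 3)),
        "n_components": min(50, max(2, n // 5)),
        "min_dist": 0.1,
        "metric": "cosine",
        "random_state": 42,
    }

print(umap_params(30))   # small corpus: n_neighbors=10, n_components=6
print(umap_params(500))  # caps bind: n_neighbors=15, n_components=50
```

Scaling with $n$ keeps small corpora from being over-smoothed while capping cost on large ones.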
A5: HDBSCAN¶
$$\text{min\_cluster\_size} = \max(8,\; \min(20,\; \lfloor\sqrt{n}\rfloor))$$
min_samples = 2, epsilon = 0.1, method = leaf. Outliers assigned to nearest cluster by cosine similarity to centroid.
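The HDBSCAN settings can be sketched the same way; the dict keys here follow the `hdbscan` library's parameter names (`cluster_selection_epsilon`, `cluster_selection_method`), which is an assumption about how the pipeline passes them:

```python
import math

def hdbscan_params(n):
    """min_cluster_size = max(8, min(20, floor(sqrt(n)))), per the formula
    above, plus the fixed settings; n = number of posts."""
    return {
        "min_cluster_size": max(8, min(20, math.isqrt(n))),
        "min_samples": 2,
        "cluster_selection_epsilon": 0.1,
        "cluster_selection_method": "leaf",
    }

print(hdbscan_params(100)["min_cluster_size"])   # sqrt(100) = 10
print(hdbscan_params(1000)["min_cluster_size"])  # capped at 20
print(hdbscan_params(30)["min_cluster_size"])    # floored at 8
```

Outlier reassignment (nearest centroid by cosine similarity) happens as a separate post-processing step after clustering.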
A6: Silhouette score¶
$$s(i) = \frac{b(i) - a(i)}{\max(a(i),\; b(i))}$$
$a(i)$ = mean distance to points in same cluster. $b(i)$ = mean distance to nearest other cluster. Range: -1 (wrong cluster) to 1 (perfectly separated).
A7: Label filtering¶
Discard candidate $l_c$ if $\max_{j \in \text{prior labels}} \cos(\mathbf{e}_c, \mathbf{e}_j) \geq 0.70$
Select: $l^* = \arg\max_{l_c \in \text{filtered}} \cos(\mathbf{e}_c, \boldsymbol{\mu}_k)$ where $\boldsymbol{\mu}_k$ is the cluster centroid.
Fallback: if all candidates filtered, pick the best from the unfiltered set.
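The filter, select, and fallback steps above compose into one function; a minimal sketch where labels and embeddings are toy 2-D values (real embeddings would be 1024-dim) and `cos` is plain cosine similarity:

```python
import math

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def select_label(candidates, prior_embs, centroid, threshold=0.70):
    """Discard candidates within `threshold` cosine similarity of any prior
    label, then pick the survivor closest to the cluster centroid. If every
    candidate is filtered out, fall back to the best of the unfiltered set.
    `candidates` is a list of (label, embedding) pairs."""
    kept = [
        (lab, emb) for lab, emb in candidates
        if not prior_embs or max(cos(emb, p) for p in prior_embs) < threshold
    ]
    pool = kept or candidates  # fallback: unfiltered set
    return max(pool, key=lambda le: cos(le[1], centroid))[0]

# "dup" nearly duplicates the prior label's embedding and is filtered;
# "novel" survives and wins on centroid similarity.
print(select_label([("dup", (0.9, 0.1)), ("novel", (0.0, 1.0))],
                   prior_embs=[(1.0, 0.0)], centroid=(0.2, 0.8)))  # novel
```

Filtering during generation, rather than merging duplicates afterward, is what keeps redundant labels from consuming candidate slots in the first place.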
LLM benchmarks are effectively rank 2¶
Papailiopoulos (2026, post, code) assembled an 83-model x 49-benchmark matrix and found:
- The first singular value captures 71% of variance. Five benchmark scores predict the other 44 to within ~5 points
- Component 1 separates frontier from small models: GPQA-D, LiveCodeBench, MMLU-Pro load highest
- Component 2 separates latest frontier from older frontier: SimpleQA, ARC-AGI-2, HLE load highest. This is almost a "recency of frontier" measure
- A simple SVD + ridge regression method (BenchPress) matches, and slightly beats, Claude Sonnet at predicting missing scores (5.8% vs 6.1% median error), in under a second for $0
- Claude actually gets worse with more data ("retrieval-augmented degradation"), while BenchPress improves monotonically
Evaluation is the new frontier for AI¶
- If 49 benchmarks collapse to two dimensions, most of what we call "evaluation" measures the same two things repeatedly. The benchmarks that resist prediction (SimpleBench, ARC-AGI-1, Terminal-Bench) are the ones testing capabilities the rest of the matrix does not capture
- The optimal 5-benchmark set that spans both dimensions is: HLE, AIME 2025, LiveCodeBench, SWE-bench Verified, SimpleQA
- For domain-specific applications like CSS agent tool-calling, standard benchmarks are even less informative because they were not designed to test whether an agent selects the right analytical method for a given research question
- Building robust, domain-specific evaluation frameworks that test genuinely new capabilities is more valuable than adding another general benchmark to an already saturated matrix
A9: The Cursor-Kimi attribution case¶
In March 2026, Cursor launched "Composer 2" claiming frontier-level coding intelligence at 1/10 the cost of Claude. Within 24 hours, a developer found the model ID in the API: kimi-k2p5-rl-0317.
- Composer 2 was built on Kimi K2.5, an open-weight model from Beijing-based Moonshot AI, with reinforcement learning applied on top (TechCrunch)
- Cursor's launch messaging implied a fully in-house model with no mention of Moonshot AI or Kimi anywhere
- Kimi K2.5's license requires prominent attribution in commercial products, which Cursor did not provide
- Cursor co-founder admitted: "It was a miss to not mention the Kimi base in our blog from the start"
The benchmark number was real. The origin story was not. This illustrates why transparency in evaluation matters as much as the scores themselves.
A10: AI models used in the pipeline¶
| Pipeline stage | Model | Why this model |
|---|---|---|
| Query planning | GPT-4 Turbo | Structured output for typed facet decomposition |
| Post embeddings | Voyage AI 3.5-lite (1024-dim) | Cost/quality tradeoff at scale |
| Neural reranking | Voyage rerank-2.5 | Absolute relevance scores, instruction-following |
| Entity extraction | Llama-4-scout via Groq | Faster inference, higher rate limits than direct API |
| Theme label generation | GPT-4o-mini | Best overall quality at lowest cost in our benchmarks |
| Entity narratives | GPT-5-nano | Lightweight narrative generation with rule-based fallback |
| Agent (task LLM) | GPT-5.2 | Primary reasoning model for tool selection and code generation |
| Agent evaluation (judge) | Claude Sonnet 4.6 | LLM-as-judge for code quality and response quality scoring |
| Prompt optimization | GEPA with Claude Sonnet 4.6 | Reflective prompt evolution via natural-language reflection |