Scaling CSS Methods for Cross-Platform Discourse Analysis¶
Swapneel Mehta, SimPPL
NYU Center for Social Media and Politics, 2026
About Me¶
Three open problems in cross-platform research¶
- Searching: how do you find the right posts from millions of candidates across platforms that don't share data with each other, leaving cross-platform patterns invisible to researchers?
- Organizing: how do you use LLMs to structure post content into interpretable themes when the models tend to generate repetitive labels across clusters? (Ziems et al. 2024, Computational Linguistics)
- Investigating: how do you build AI agents that help researchers investigate discourse further, and how do you evaluate whether their tool-calling outputs are actually useful?
Three contributions¶
We built Arbiter, an open investigative platform by the nonprofit SimPPL, covering X, YouTube, Reddit, Bluesky, Telegram, Meta, and Instagram.
- Query expansion: we break research questions into structured retrieval facets and combine keyword search, semantic search, and neural reranking
- Theme discovery: our clustering pipeline uses embedding geometry to reduce label redundancy. In preliminary results across 8 strategies, contrastive prompting substantially lowered redundancy
- Agent evaluation: we built a continuous evaluation framework measuring how well an AI agent selects and uses CSS tools when researchers ask questions in natural language
Data collection across seven platforms (and counting)¶
| Platform | Source | Notes |
|---|---|---|
| YouTube | Official Data API v3 | Video metadata, comments, search results |
| Bluesky | AT Protocol firehose | Full public post stream, open protocol |
| X/Twitter | Official API (pay-as-you-go) + twitterapi.io for pilots | Public posts |
| Reddit | Public data dumps | Subreddit-level archives |
| Telegram | Channel aggregation services | Public channel content |
| Meta/Instagram | Third-party services | Public page and post data |
| TikTok | Research API | Application-based access |
Platform APIs appear to be unreliable for research¶
- YouTube's Search API returns inconsistent results between identical queries. Jaccard overlap drops to around 30% after 12 weeks, and the API favors shorter, more popular videos (Efstratiou 2025, arXiv)
- Videos on YouTube become progressively unfindable within 20 to 60 days of publication, with a 76-92% loss in recoverable results after 10 weeks (Rieder et al. 2025, ResearchGate)
- TikTok's Research API fails to return metadata for 1 in 8 videos, including official TikTok content, with no error codes explaining the gaps (Entrena-Serrano et al. 2025, arXiv)
- These findings suggest researchers cannot rely on any single platform API to provide complete or consistent data over time
Tradeoffs with sampling¶
- Free data access drives researchers to over-study whichever platform is easiest to collect from, and implicitly overgeneralize findings that may not transfer (Tufekci 2014, ICWSM)
- We collect from seven sources including random samples from Bluesky (full firehose via AT Protocol) and Reddit (monthly public dumps from 40,000 subreddits)
- Our system is a search and structured discourse analysis tool designed to surface posts and actors that may exert issue-specific influence on different platforms
- We can identify visible actors, dominant themes, and cross-platform narrative differences for a given issue, but we cannot claim population-level prevalence or representativeness
Why retrieval matters for journalists¶
A journalist types a question. Our system needs to find the right posts from millions of candidates across seven platforms, in multiple languages, matching entities the journalist did not explicitly name. The challenge is doing this without returning thousands of irrelevant results.
Part 1: Query Expansion & Retrieval¶
Prior work on query expansion¶
- Rocchio (1971) introduced relevance feedback: shift the query vector toward relevant documents, but this assumes long texts with rich vocabulary
- Jagerman et al. (2023, arXiv) showed that LLM-generated expansion terms outperform classical feedback on standard IR benchmarks like BEIR
- Wang et al. (2023, EMNLP) had the LLM generate pseudo-documents as expanded context, improving BM25 recall by 3 to 15 percent
- On social media, free-form expansion adds terms that match thousands of short noisy posts, so recall goes up but precision collapses
What a journalist needs beyond her search query¶
Example: A journalist in Kenya types "Ibrahim Traore Burkina Faso" into our platform
| She typed | What the system also needs to find |
|---|---|
| "Ibrahim Traore" | Actors: "Captain Traoré," "Capitaine Traoré," "IB" |
| "Burkina Faso" | Orgs: MPSR, Alliance of Sahel States, CNSP |
| (nothing typed) | Events: Wagner Group departure, Sahel sovereignty |
| (nothing typed) | Phrases: "military junta," "pan-African sovereignty" |
Two words in, and the system needs to find dozens of related concepts across YouTube and Twitter in multiple languages, so the journalist can see the full picture of who is promoting Traoré and how the narrative differs across platforms.
How we identify and categorize entities¶
- A constrained LLM call decomposes the query into typed facets: actors, organizations, geographies, topics, exact phrases, and a semantic search query
- The LLM produces structured output following a schema that enforces categories (actor terms must be named people, geographic terms must be actual places, not inferred regions)
- When recent news or Wikipedia content is available, we inject it as grounded context so the planner extracts entities it can verify rather than speculate about
- Related efforts in entity-organized news: GDELT Project monitors global news in 100+ languages with real-time entity extraction, and news.smol.ai organizes content by extracted entities
- We are continuously building an entity knowledge base from these sources so that relationships between actors, organizations, and events surface as they emerge
The structured retrieval plan¶
{
"actorTerms": ["ibrahim traoré", "captain traoré"],
"organizationTerms": ["MPSR", "alliance of sahel states"],
"geoTerms": ["burkina faso", "ouagadougou"],
"topicTerms": ["military junta", "sahel sovereignty"],
"phrases": ["burkina faso transition"],
"semanticQuery": "Ibrahim Traoré military leadership"
}
Each facet type feeds into a different search clause: actorTerms, organizationTerms, and geoTerms enter as required matches. topicTerms and phrases enter as boosted optional matches. The semanticQuery is embedded for vector search.
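The facet-to-clause mapping above can be sketched as a small function. This is a minimal sketch assuming an Elasticsearch-style bool query DSL; the field name `text` and the boost value are illustrative assumptions, not our production schema (the `semanticQuery` facet is handled separately by vector search):

```python
def plan_to_query(plan):
    """Map a typed retrieval plan to keyword-search clauses, per the scheme
    described above: entity terms are required (at least one must match);
    topic terms and phrases are optional matches that only boost the score."""
    entity_terms = plan["actorTerms"] + plan["organizationTerms"] + plan["geoTerms"]
    return {
        "bool": {
            # required: every result must contain at least one named entity
            "must": [{
                "bool": {
                    "should": [{"match": {"text": t}} for t in entity_terms],
                    "minimum_should_match": 1,
                }
            }],
            # optional, boosted: topics and exact phrases raise the rank
            "should": (
                [{"match": {"text": t, "boost": 2.0}} for t in plan["topicTerms"]]
                + [{"match_phrase": {"text": p}} for p in plan["phrases"]]
            ),
        }
    }
```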
Lexical search finds exact string matches¶
We run BM25 keyword search across post titles, extracted entity names, hashtags, and @-mentions.
- A post tagged "#BurkinaFaso" or mentioning "@ibaborey" has no meaningful equivalent in embedding space. Keyword matching is the only reliable way to find these exact references
- Phrase matching catches multi-word expressions like "Alliance of Sahel States" that standard tokenization would split apart
- Actor and organization terms from the retrieval plan enter as required matches, so at least one named entity must appear in every returned result
Semantic search finds meaning, not keywords¶
The problem: a French-language post "Le Capitaine Traoré consolide le pouvoir au Sahel" is relevant to our English query but shares zero keywords with it. Keyword search will miss it entirely.
"Ibrahim Traoré military leadership"
|
v
Voyage AI 3.5-lite encoder
|
v
1024-dimensional vector
|
v
K-nearest neighbor search
against all post embeddings
already stored in our index
- Every post in our system is embedded as a 1024-dimensional vector when it enters the pipeline
- At query time, we embed the semantic query and find the closest posts by vector distance, regardless of the language they were written in
Combining both with Reciprocal Rank Fusion¶
Neither keyword search nor semantic search alone is sufficient, so we combine their results using Reciprocal Rank Fusion (Cormack, Clarke, and Buettcher 2009).
$$\text{RRF}(d) = \sum_{i=1}^{n} \frac{1}{k + \text{rank}_i(d)}$$
Here $d$ is a post, $n$ is the number of search systems being combined (here two: keyword and semantic), and $\text{rank}_i(d)$ is the position of post $d$ in the ranked list from system $i$. $k = 60$ is a smoothing constant that prevents the highest-ranked items from completely dominating the fused score.
Example: post A is ranked #1 in both systems: $\frac{1}{61} + \frac{1}{61} = 0.033$. Post B is ranked #1 in keyword search only: $\frac{1}{61} = 0.016$. Posts that both systems agree on receive roughly double the score of posts found by only one system.
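The fusion step above fits in a few lines. A minimal sketch (the post IDs are illustrative; a post absent from a list simply contributes nothing for that system):

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1/(k + rank) for each post across the
    ranked lists, then sort by the fused score, highest first."""
    scores = defaultdict(float)
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Posts that appear near the top of both lists accumulate two large terms, which is why agreement between the systems roughly doubles the score.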
Why reranking matters: an example¶
A journalist searches for "Ibrahim Traore Burkina Faso." The retrieval system returns 1,000 posts. But which ones actually answer her question?
| Post | Keyword match | Meaning match | Reranker score |
|---|---|---|---|
| "Traoré inaugurates new gold mine in Ouagadougou" | Yes (exact names) | Partially relevant | 0.65 |
| "Le Capitaine consolide le pouvoir au Sahel" | No (French, zero keyword overlap) | Highly relevant | 0.82 |
| "Burkina Faso weather forecast for Friday" | Yes (place name) | Not relevant | 0.08 |
- Keyword search finds the first and third but misses the French post entirely
- Semantic search finds the first and second but also surfaces loosely related content
- The reranker reads each post alongside the query and assigns an absolute relevance score, letting us drop everything below 0.2
Why rerankers, not LLMs, for relevance scoring¶
- A neural reranker reads query and post text together and produces a relevance score from 0 to 1, calibrated so that 0.8 means the same thing regardless of the query topic
- Purpose-built rerankers outperform LLM-based reranking by roughly 4.4 nDCG@10 points on standard retrieval while being an order of magnitude smaller (Sun et al. 2024, MAIR)
- Voyage rerank-2.5 adds instruction-following to the reranker architecture, gaining 8-12% accuracy when task instructions are provided (Voyage AI, 2025)
- We drop posts scoring below 0.2, which filters 60-80% of candidates before expensive downstream processing
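The thresholding and batching described above (batch size 200, per appendix A3) can be sketched as follows; the reranker call itself is assumed to happen elsewhere, so this operates on already-scored `(post, score)` pairs:

```python
def filter_by_relevance(scored_posts, threshold=0.2):
    """Drop candidates below the absolute relevance threshold before
    expensive downstream processing (0.2 removes 60-80% of candidates)."""
    return [(post, score) for post, score in scored_posts if score >= threshold]

def batched(seq, size=200):
    """Yield fixed-size batches; the reranker is called 200 posts at a time."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]
```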
Part 2: Geometric Theme Discovery¶
LLMs produce repetitive cluster labels¶
- Huang and He (2024, arXiv) found LLMs produce "different descriptions for the same label," requiring a dedicated merging stage
- Janssens et al. (2025, ECML PKDD) found HDBSCAN on social media data produces "an excessive number of topics, many semantically overlapping"
- TopicGPT (Pham et al. 2024, NAACL) addresses this by merging near-duplicate topics after generation using pairwise similarity
- In our experience, labeling 53 clusters independently produced "IndiGo Flight Disruptions" for 28 of them because the model has no knowledge of what other clusters exist
Existing approaches to label diversity¶
- BERTopic (Grootendorst 2022, arXiv) uses MMR to diversify keywords within a single topic, but has no cross-topic label uniqueness mechanism when using LLMs
- LLooM (Lam et al. 2024, CHI) induces concepts using LLMs, but the scoring step alone is 79.9% of its $1.44-per-run cost and it does not verify cross-concept uniqueness
- Most approaches fix redundancy after generation by merging duplicates. Our approach prevents duplicates during generation by using embedding geometry to filter candidates
Why we use UMAP + HDBSCAN for clustering¶
We wanted to find natural groupings in social media posts without specifying the number of clusters in advance. We benchmarked 5 clustering algorithms against 4 dimensionality reduction methods on our datasets; the table shows silhouette scores for four of the algorithms, without reduction and with UMAP.
| Algorithm | No Reduction | With UMAP |
|---|---|---|
| K-Means | 0.059 | 0.550 |
| K-Medoids | 0.076 | 0.587 |
| HDBSCAN | 0.445 | 0.706 |
| Community Det. | 0.253 | — |
The silhouette score measures how well separated the clusters are, from -1 (wrong cluster) to 1 (perfectly separated). UMAP (McInnes et al. 2018, arXiv) + HDBSCAN produced the highest separation at 0.706.
- One tradeoff: HDBSCAN assigns 29-76% of posts as outliers rather than forcing them into clusters. We absorb these outliers into the nearest cluster by cosine similarity to centroids
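The outlier-absorption step above can be sketched in pure Python, assuming HDBSCAN's convention of labeling outliers `-1` and a dict of cluster centroids (the vectors below are toy 2-d examples, not real embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def absorb_outliers(embeddings, labels, centroids):
    """Reassign HDBSCAN outliers (label -1) to the nearest cluster centroid
    by cosine similarity, instead of discarding 29-76% of posts."""
    out = list(labels)
    for i, lab in enumerate(labels):
        if lab == -1:
            out[i] = max(centroids, key=lambda c: cosine(embeddings[i], centroids[c]))
    return out
```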
Our approach: filter labels geometrically¶
Clusters with embeddings
|
v
+---------------------------+
| 1. Compute centroids |
| 2. Merge if sim >= 0.85 |
+-------------+-------------+
v
For each cluster:
+---------------------------+
| 3. Find 2 nearest |
| neighbor clusters |
| 4. Show LLM posts from |
| this cluster + neighbors|
| 5. Generate 5 candidates |
| 6. Embed all 5 |
| 7. Filter: discard if |
| sim >= 0.70 to prior |
| 8. Pick closest to |
| centroid |
+---------------------------+
- Merge threshold (0.85) and filter threshold (0.70) are empirically determined heuristics
- On our datasets, this produces more distinct labels than the 7 alternative strategies we benchmarked
Example: how label filtering works¶
Cluster 12 is about agricultural development. A prior cluster was already labeled "Development Landscape." The LLM generates 5 candidates for cluster 12:
| Candidate | Similarity to "Development Landscape" | Result |
|---|---|---|
| Agricultural Development Initiatives | 0.74 | Filtered (>= 0.70) |
| Burkina Faso Farming Modernization | 0.68 | Passes |
| Rural Economic Transformation | 0.61 | Passes |
| Development Strategies | 0.82 | Filtered (>= 0.70) |
| Sahel Agricultural Programs | 0.55 | Passes |
The algorithm selects "Burkina Faso Farming Modernization" because it has the highest similarity to the cluster centroid among the candidates that passed the filter. If all 5 had been filtered, the algorithm would fall back to the best candidate regardless.
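The filter-then-select logic can be sketched on precomputed similarities. The prior-label similarities below come from the table; the centroid similarities are assumed values for illustration only:

```python
def select_label(candidates, prior_sim, centroid_sim, cutoff=0.70):
    """Discard candidates too similar (>= cutoff) to prior labels, then pick
    the survivor closest to the cluster centroid; if every candidate is
    filtered, fall back to the best candidate overall."""
    survivors = [c for c in candidates if prior_sim[c] < cutoff]
    pool = survivors or candidates  # fallback when all are filtered
    return max(pool, key=lambda c: centroid_sim[c])
```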
Preliminary: contrastive context reduces redundancy¶
We benchmarked 8 labeling strategies on a single dataset of 2,138 posts across 53 clusters. This is preliminary work and we are still exploring how well these results generalize to other corpora.
| Strategy | Unique labels | Redundancy |
|---|---|---|
| Contrastive (with neighbor context) | 98.1% | 0.216 |
| Negative constraints ("avoid these labels") | 92.5% | 0.509 |
| Post-hoc deduplication | 94.3% | 0.443 |
| Our hybrid pipeline | >98% | 0.231 |
Redundancy here means the average pairwise cosine similarity between all label embeddings (lower is better). Our hybrid pipeline is currently exceeding 98% uniqueness on the datasets we have tested, and we are working to ensure this holds consistently as we create new case studies.
Why changing clusters changes labels entirely¶
One challenge with LLM-based labeling that we want to be upfront about: when the clustering parameters change or a different random seed is used, the cluster boundaries shift and the labels change completely. This is a stability problem shared by all methods in this space.
- We use a hierarchical structure: leaf-level clusters (subconcepts) get grouped into parent themes (concepts) using K-Means on cluster centroids, and the LLM generates a summary label for each parent
- For large volumes of posts, the cost of LLM calls for labeling adds up. Our contrastive batched approach uses roughly 7 LLM API calls total for a typical analysis, compared to 265+ for the exhaustive diversity approach and $1.44 per run for LLooM
- Our ongoing work focuses on improving label stability across different runs and expanding the evaluation to more datasets and languages
Case study: Traoré discourse across platforms¶
We analyzed 974 YouTube posts and 1,117 Twitter posts about Ibrahim Traoré using our pipeline.
- YouTube is dominated by development and infrastructure themes: "Burkina Faso Leadership Dynamics" (290 posts, 29.8%), "Development Landscape" (217 posts, 22.3%)
- Twitter centers on sovereignty and anti-colonial rhetoric: "National Identity" (235 posts), "Colonial Legacy" (99 posts), "Diplomatic Invitation Rejections" (70 posts)
- The top YouTube actor, "PILLARS OF HISTORY 12," averages 5.3 million interactions per post. "Make Africa Great Together" published posts like "Ibrahim Traoré helps people have good jobs" reaching 5.19 million interactions
- Multiple fan accounts use identical hashtag sets (#ibrahimtraore #burkinafaso) and similar short-form titles, consistent with what Rogers and Righetti (2025, Platforms & Society) describe as manufactured attention patterns
Templated content across promotion accounts¶
These YouTube accounts use identical sentence structures with only the positive claim swapped out, a pattern consistent with AI-generated promotional content:
| Account | Post template | Interactions |
|---|---|---|
| Make Africa Great Together | "Ibrahim Traoré helps people have good jobs 🙏❤ #ibrahimtraore" | 5.19M |
| Make Africa Great Together | "Ibrahim Traore gives hope to people 🙏❤ #ibrahimtraore" | 5.08M |
| Make Africa Great Together | "Ibrahim Traore is ready to meet everyone 🙏❤ #ibrahimtraore" | 4.33M |
| TheNewTribe | "Ibrahim Traoré's leadership brings modern irrigation" | 2.02M |
| TheNewTribe | "Ibrahim Traoré's leadership gives Burkina Faso a strong voice at the UN" | 59.6K |
| MY LOVE AFRICA | "Ibrahim Traoré: The President Who Owns Nothing" | 3.54M |
| MY LOVE AFRICA | "Ibrahim Traoré: Africa's Youngest Revolutionary Leader" | 63.6K |
- The sentence structure, emoji patterns, and hashtag sets are identical within each account, with only the positive claim varying
- These accounts collectively generated over 20 million interactions on YouTube
What we want the agent to do¶
A journalist asks a question in plain language: "Which accounts are shaping climate skepticism narratives and what reach do they achieve?" The agent needs to select the right analytical tools, generate code, execute it, and return a visualization the journalist can use in her reporting.
Part 3: Agent Evaluation¶
What is a tool call?¶
A tool is a function that an AI agent can decide to run when it needs to do something it cannot do with language alone. Here is a simple example:
def get_theme_actors(platform, theme_name, limit=10):
    """Return the accounts posting most about a theme, ranked by engagement."""
    posts = search_by_theme(platform, theme_name)
    actors = count_by_author(posts)  # {author: total engagement}
    return sorted(actors, key=actors.get, reverse=True)[:limit]
- When a researcher asks "who is posting the most about gambling on Twitter," the agent decides to call get_theme_actors("twitter", "gambling") rather than trying to answer from memory
- The challenge is knowing which tool to call, with what arguments, and when to chain multiple tools together. Standard LLM benchmarks do not test this because they evaluate conversation quality, not tool selection
Tools available to our agent¶
| Category | Tools | What they do |
|---|---|---|
| Actor Profiling | 4 | Topic fingerprint, impact metrics, activity timeline, first mention |
| Topic Analysis | 3 | Stance distribution, velocity over time, top voices |
| Theme Analysis | 5 | Actors per theme, engagement leaderboard, cross-platform comparison |
| Claims Analysis | 9 | Claims by status, by type, by actor, search, cross-platform |
| Data Retrieval | 6 | Search posts, semantic search, posts by author or time range |
| Visualization | 7 | Tables, bar/line/pie charts, metric cards, post cards |
Each tool wraps a CSS method as a callable function. When a researcher asks a question, the agent selects from these 34 tools, generates code to call them, and returns a visualization.
Chaining tools into a research pipeline¶
A journalist asks: "Are there accounts coordinating to promote banned trading platforms, and do they operate across YouTube and Twitter?"
searchPosts --> getThemeActors --> getActorTimeline
(binary options) (promotion themes) (posting cadence)
|
+-------------------+
v
getTopicStance --> compareAcrossPlatforms
(promotional vs (same accounts or themes
critical stance) on other platforms?)
- Step 1 retrieves posts mentioning specific platform names (Exness, Quotex, Pocket Option) across all platforms
- Step 2 identifies which accounts are posting most within promotional themes
- Step 3 checks each account's posting frequency and cadence to surface coordinated scheduling
- Step 4 separates promotional content from critical or educational content using stance analysis
- Step 5 checks whether the same accounts or promotional themes appear on other platforms
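The five steps above can be chained as ordinary function composition. A sketch with stubbed tools: the function names follow the diagram, but the canned return values are invented for illustration; the real tools query our index:

```python
# Stubbed tool implementations so the chaining logic is runnable.
def search_posts(query):
    return [{"id": 1, "author": "acct_a", "platform": "youtube"}]

def get_theme_actors(posts):
    return ["acct_a"]

def get_actor_timeline(actor):
    return {"actor": actor, "cadence": "daily"}

def get_topic_stance(posts):
    return {"promotional": 0.8, "critical": 0.2}

def compare_across_platforms(actors):
    return {a: ["youtube", "twitter"] for a in actors}

def investigate(query):
    """Chain the five steps described above into one research pipeline."""
    posts = search_posts(query)                          # 1. retrieve mentions
    actors = get_theme_actors(posts)                     # 2. top promotional accounts
    timelines = [get_actor_timeline(a) for a in actors]  # 3. posting cadence
    stance = get_topic_stance(posts)                     # 4. promotional vs critical
    overlap = compare_across_platforms(actors)           # 5. cross-platform presence
    return {"actors": actors, "timelines": timelines,
            "stance": stance, "overlap": overlap}
```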
Banned trading platforms promoted on YouTube¶
We searched 6,345 posts across YouTube, Twitter, and Bluesky for mentions of specific trading platforms. The promotion ecosystem is concentrated almost entirely on YouTube.
| Trading platform | YouTube posts | Regulatory status |
|---|---|---|
| Exness | 102 | Banned by SEBI (India) |
| Quotex | 99 | Unregulated binary options, banned in EU |
| Pocket Option | 79 | Unregulated, not licensed in major jurisdictions |
| Binarium | 58 | Unregulated binary options broker |
| IQ Option | 41 | Binary options banned in EU, restricted in India |
Promotion disguised as education¶
| Trading platform | YouTube posts | Regulatory status |
|---|---|---|
| Deriv | 26 | Restricted in several jurisdictions |
| Olymp Trade | 12 | Banned by SEBI (India) |
| OctaFX | 10 | Banned by SEBI (India) |
| Expert Option | 2 | Unregulated |
- The content takes the form of "trading education" and "AI trading bot demonstrations" rather than direct ads, making it harder for platform moderation to detect
- Cross-tagging is common: individual posts reference multiple platform names in hashtags (e.g., #quotex #pocketoption #binaryoptions) to capture search traffic
- The majority of these YouTube creators appear to be based in India, promoting platforms their own regulator has banned
Each platform plays a different role in the ecosystem¶
| Platform | Posts | What we found |
|---|---|---|
| YouTube | 3,606 | Binary options promotion through "education" content. Indian creators dominate, promoting SEBI-banned platforms |
| Twitter | 1,941 | Crypto casino advertising (MetaWin: 36.9M interactions from 4 identical posts) alongside trading tip accounts and prop trading promotions |
| Bluesky | 335 | 82% focused on Polymarket insider trading allegations. Functions as the accountability layer |
- YouTube hosts the promotion in educational packaging. Twitter hosts the direct advertising. Bluesky hosts the critical analysis
- This division of labor across platforms is invisible to any single-platform monitoring tool
- The MetaWin posts on Twitter (identical copy: "30% extra on all deposits," posted Jan 22, 23, 29) are straightforward advertising. The YouTube ecosystem is more subtle: it wraps promotion inside trading tutorials and AI bot demonstrations
How our agent routes questions¶
User question
|
v
+-------------------+
| Primary Agent: |
| classify intent |
+----+----+----+----+
| | |
v | v
Direct | Planner: decompose
reply | into steps
v |
Executor v
(simple) Executor (each step)
|
success? --no--> Planner revises
|
v
Return result
- The Primary Agent classifies the request and routes it to the right handler
- For complex questions, the Planner selects tools for each step and the Executor writes and runs code in a sandbox
- If the code fails, the Planner gets a second attempt to revise the approach
Why we need domain-specific evaluation¶
- MT-Bench (Zheng et al. 2023, NeurIPS) showed that LLM-as-judge agrees with human evaluators at 80%+ for general conversational quality, but it does not test whether an agent picks the right analytical tool for a given research question
- No existing benchmark measures whether an AI agent's social media analysis is useful to an actual researcher
- Our approach: define question archetypes (content analysis, actor profiling, thematic discovery, stance detection, claims verification, temporal trends), hand-rate a sample of answers, then use a larger LLM as judge that can also rerun the code to verify that the output matches the response
Six evaluation dimensions¶
| Dimension | Scale | What it measures |
|---|---|---|
| Intent | 0-2 | Did the agent correctly understand what the user asked? |
| ToolSel | 0-2 | Did it pick the right tool? Using generic code when a specialized function exists is a failure |
| Code | 0-2 | Is the generated code functionally correct? Scored by comparing against hand-written implementations |
| Response | 0-2 | Is the answer relevant and grounded in evidence? Averaged across 5 axes by an LLM judge |
| Exec | 0-1 | Did the code run without errors? A binary check |
| Error | 0-2 | When something breaks, does the agent recover and still provide useful output? |
Failure modes across nine case studies¶
We evaluated 60 questions per case study across 9 case studies (540 total evaluations).
| Code | What happened | Count |
|---|---|---|
| NO-CODE | Should have written code, gave text only | 140 |
| TOOL-001 | Used generic code instead of a specialized function | 108 |
| RESP-003 | The written answer contradicts what the code produced | 88 |
| RESP-002 | Asked clarifying questions instead of attempting an answer | 52 |
| CODE-001 | Tried to access a data field that does not exist | 19 |
The two most common failures tell us the agent is too cautious about writing code and does not take advantage of the specialized analysis functions we built.
Content type predicts failure patterns¶
| Case Study | Top failure | Avg score |
|---|---|---|
| Ibrahim Traore | Skill Bypassed (13) | 8.0 |
| Digital Identity | Skill Bypassed (21) | highest |
| Bondi Beach Attack | No Code (27) | ~5.0 |
| Online Gambling | No Code (29) | ~6.0 |
| Online Safety Act | No Code (27) | ~5.5 |
- The agent tends to avoid generating code when the content involves violence or financial regulation, but performs well on political topics where it has more context to work with
- This pattern across our nine case studies suggests that content-aware routing could help: the agent should behave differently depending on the sensitivity and complexity of the topic it is analyzing
Prompt optimization: ongoing work¶
We applied GEPA (Agrawal et al. 2025, ICLR 2026 Oral), a method that uses natural-language reflection to evolve prompts. This is ongoing work with a small evaluation set of 39 examples from one case study, and we welcome feedback on the methodology.
- The evolved prompt improved 22 questions but also made 23 others worse on our broader evaluation set
- Useful patterns it identified: routing tables that map user intents to specific functions work better than prose descriptions of what each tool does. Negative code examples ("never do this") proved as instructive as positive ones. Multi-strategy search should be the default rather than a fallback
- The value so far is in the failure analysis rather than the overall score improvement. We are continuing to expand our evaluation dataset and refine the framework
Benchmark scores depend on the prompt, not just the model¶
- Meaning-preserving prompt formatting changes can swing accuracy by up to 76 points on the same model (Sclar et al. 2024, ICLR)
- GPT-3.5-turbo performance varies by up to 40% on code tasks depending on whether the prompt uses plain text, Markdown, JSON, or YAML (He et al. 2024, arXiv)
- A benchmark score is a property of the (model, prompt) pair, not the model alone
- Prompt optimization treats both as a joint artifact and searches for the prompt configuration where the model performs best on your specific task
The evaluation loop¶
Define question --> Evaluate agent --> Classify failures
archetypes on each type into taxonomy
^ |
| v
+------- Improve prompt <-----------+
or architecture
- This cycle used to require months of manual work, but with our evaluation framework one complete iteration runs in hours
- The bottleneck has moved from engineering to evaluation design: the quality of the loop depends on whether we are measuring the right things
- Related: DSPy (Khattab et al. 2023, arXiv) for declarative prompt optimization; Toolformer (Schick et al. 2023, NeurIPS) for self-supervised tool learning
Published research methods can become tool calls¶
- Hanley and Durumeric (2024, IEEE S&P) track 52,036 news narratives at scale using MPNet embeddings and DP-Means clustering. Their narrative detection method could become a tool call that identifies which stories are spreading across platforms in a case study
- Nakov et al. (2021, IJCAI) define the full fact-checking pipeline: claim detection, evidence retrieval, and verification. Each stage maps to a separate tool call. Their group at MBZUAI also released OpenFactCheck (EMNLP 2024), a modular framework designed to be integrated into downstream systems
- CSMAP's own work on ideology scoring (Eady et al. 2024, Political Analysis) maps the ideology of news outlets from link-sharing data. Their cross-lingual narrative similarity method (Waight et al. 2025, Sociological Methods & Research) distills texts to core claims for comparison across languages
- Each of these could be wrapped as a function that our agent calls when a journalist asks the right question. The goal is to make vetted social science methods accessible in real time on custom datasets, within weeks rather than years
Broader takeaways¶
- Cross-platform information is growing faster than our tools to study it. Practitioners protecting information integrity need shared infrastructure that works across platforms
- Our methods are not perfect. Our sampling is not random, our theme labels are heuristically filtered rather than formally guaranteed, and our agent optimization is still in early stages. Each of these is a meaningful step forward and we are working to improve them
- Published research methods can now become tool calls in a shared platform, making them accessible to journalists and researchers who work with social media data but do not write code
Journalists in eight countries use our platform¶
Ten journalists in Kenya run influence operation investigations through Deutsche Welle Akademie. We collaborate with the NEST Center for Journalism in Mongolia on media development and with Jagran New Media, one of India's largest digital newsrooms. Rappler in the Philippines, founded by Nobel laureate Maria Ressa, uses our platform for accountability reporting.
Our investigations have contributed to Meta's Q1 2024 Adversarial Threat Report and surfaced influence operations on YouTube before the platform's own systems flagged them.
Summary¶
- Query expansion: we decompose research questions into typed facets with context grounding from news and Wikipedia, combine keyword and semantic search, and use neural reranking to produce absolute relevance scores
- Theme discovery: our geometric filtering pipeline uses heuristic cosine similarity thresholds to reduce label redundancy during generation rather than fixing it after the fact. Preliminary benchmarks across 8 strategies show contrastive prompting substantially reduces redundancy
- Agent evaluation: a continuous six-dimension evaluation framework with a failure mode taxonomy that reveals content-type-specific weaknesses across nine case studies. This is ongoing work and we welcome collaboration on shared evaluation benchmarks
Team¶
References¶
Sampling & APIs: Tufekci (2014) ICWSM. Efstratiou (2025) arXiv. Rieder et al. (2025) ResearchGate. Entrena-Serrano et al. (2025) arXiv.
Retrieval: Jagerman et al. (2023) arXiv. Wang et al. (2023) EMNLP. Rodriguez & Spirling (2022) J. Politics. Cormack et al. (2009) SIGIR. Sun et al. (2024) MAIR. Voyage AI (2025) rerank-2.5.
Themes: Grimmer, Roberts & Stewart (2022) Text as Data, Princeton UP. Grootendorst (2022) arXiv. Lam et al. (2024) CHI. Huang & He (2024) arXiv. Janssens et al. (2025) ECML PKDD. Pham et al. (2024) NAACL.
Agents & Evaluation: Agrawal et al. (2025) ICLR. Zheng et al. (2023) NeurIPS. Khattab et al. (2023) arXiv. Starbird et al. (2019) CSCW. Rogers & Righetti (2025) P&S.
Research as Tool Calls: Hanley & Durumeric (2024) IEEE S&P. Nakov et al. (2021) IJCAI. Wang et al. (2024) EMNLP. Eady et al. (2024) Pol. Analysis. Waight et al. (2025) Soc. Methods & Research.
Thank you¶
swapneel@simppl.org | @swapneelm
Vetted access: arbiter.simppl.org
SimPPL is a US 501(c)(3) nonprofit, supported by Google, Mozilla, Ford Foundation, Omidyar Network, and Wikimedia.
Why we support journalists in this work¶
- Over 2,900 American news outlets have shut down since 2005, leaving 212 counties with no local news source at all (Medill, 2025)
- Platforms invest in trust and safety teams, but these internal mechanisms do not surface findings to the public
- Journalists are typically the first to respond to information integrity threats during elections, public health crises, and civic events
- Our goal is to make cross-platform investigation tools accessible to the people whose job it is to hold power accountable and report accurately on online discourse
Appendix: Mathematical Details¶
A1: RRF worked example¶
$$\text{RRF}(d) = \sum_{i=1}^{n} \frac{1}{k + \text{rank}_i(d)}, \quad k = 60$$
Post A ranked #1 in both systems: $\frac{1}{61} + \frac{1}{61} \approx 0.033$
Post B ranked #1 in keyword search only (not retrieved by semantic search): $\frac{1}{61} \approx 0.016$
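The worked example above can be reproduced in a few lines; a minimal sketch of Reciprocal Rank Fusion, where each system contributes $1/(k + \text{rank})$ only for documents it actually ranked:

```python
def rrf_score(ranks, k=60):
    """Reciprocal Rank Fusion: sum 1/(k + rank) over each system
    that ranked the document; systems that missed it contribute nothing."""
    return sum(1.0 / (k + r) for r in ranks)

# Post A: ranked #1 by both keyword and semantic search
score_a = rrf_score([1, 1])   # 1/61 + 1/61
# Post B: ranked #1 by keyword search only
score_b = rrf_score([1])      # 1/61

print(round(score_a, 3), round(score_b, 3))  # 0.033 0.016
```

Because ranks rather than raw scores are fused, the two systems need not produce comparable score scales.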
A2: Cosine similarity and thresholds¶
$$\cos(\mathbf{v}_1, \mathbf{v}_2) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\|\mathbf{v}_1\| \times \|\mathbf{v}_2\|}$$
| Use | Heuristic threshold |
|---|---|
| Cluster merging | >= 0.85 |
| Label filtering | >= 0.70 |
| Label post-hoc merge | > 0.92 |
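The similarity measure and the heuristic thresholds from the table can be sketched directly from the formula (pure Python, no dependencies; the vectors here are toy examples, not real embeddings):

```python
import math

def cosine(v1, v2):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2)

# Heuristic thresholds from the table above
CLUSTER_MERGE = 0.85       # merge clusters at >= 0.85
LABEL_FILTER = 0.70        # filter candidate labels at >= 0.70
LABEL_POSTHOC_MERGE = 0.92 # merge labels post hoc at > 0.92

a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(cosine(a, b))  # parallel vectors -> 1.0
```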
A3: Reranking¶
The reranker scores each post against the query text on a 0-to-1 scale. We retain posts scoring >= 0.2. Batch size: 200 posts. Timeout: 15 seconds.
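The batch-and-threshold logic can be sketched as follows. The reranker call itself is abstracted behind a placeholder `rerank_fn`; its interface here (query plus batch in, list of 0-to-1 scores out) is an assumption for illustration, not the actual Voyage API signature:

```python
def rerank_filter(posts, query, rerank_fn, threshold=0.2, batch_size=200):
    """Score posts against the query in batches and keep those at or
    above the threshold, sorted by descending relevance.
    `rerank_fn(query, batch) -> list of 0-1 scores` is a stand-in for
    the reranker API call (e.g. a neural reranker); interface assumed."""
    kept = []
    for start in range(0, len(posts), batch_size):
        batch = posts[start:start + batch_size]
        scores = rerank_fn(query, batch)
        kept.extend((p, s) for p, s in zip(batch, scores) if s >= threshold)
    return sorted(kept, key=lambda ps: ps[1], reverse=True)

# Toy scorer for illustration: longer posts score higher
demo = rerank_filter(["a", "abcd", "ab"], "q",
                     lambda q, b: [len(p) / 4 for p in b])
print([p for p, s in demo])  # ['abcd', 'ab', 'a']
```

A per-request timeout (15 seconds in the pipeline) would wrap the `rerank_fn` call; it is omitted here to keep the sketch minimal.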
A4: UMAP adaptive parameters¶
$$n_{\text{neighbors}} = \min(15,\; \max(2,\; \lfloor n/3 \rfloor))$$
$$n_{\text{components}} = \min(50,\; \max(2,\; \lfloor n/5 \rfloor))$$
min_dist = 0.1, metric = cosine, random_state = 42
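The adaptive parameter formulas translate directly to code; a minimal sketch producing the keyword arguments one would pass to `umap-learn` (the dict keys follow that library's parameter names):

```python
def umap_params(n):
    """Adaptive UMAP parameters per the formulas above; n = number of posts."""
    return {
        "n_neighbors": min(15, max(2, n // 3)),
        "n_components": min(50, max(2, n // 5)),
        "min_dist": 0.1,
        "metric": "cosine",
        "random_state": 42,
    }

print(umap_params(30))   # small corpus: n_neighbors=10, n_components=6
print(umap_params(500))  # caps bind: n_neighbors=15, n_components=50
```

Scaling with $n$ keeps small corpora from being over-smoothed while capping cost on large ones.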
A5: HDBSCAN¶
$$\text{min\_cluster\_size} = \max(8,\; \min(20,\; \lfloor\sqrt{n}\rfloor))$$
min_samples = 2, epsilon = 0.1, method = leaf. Outliers assigned to nearest cluster by cosine similarity to centroid.
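The HDBSCAN settings can be sketched the same way; the dict keys here follow the `hdbscan` library's parameter names (`cluster_selection_epsilon`, `cluster_selection_method`), which is an assumption about how the pipeline passes them:

```python
import math

def hdbscan_params(n):
    """min_cluster_size = max(8, min(20, floor(sqrt(n)))), per the formula
    above, plus the fixed settings; n = number of posts."""
    return {
        "min_cluster_size": max(8, min(20, math.isqrt(n))),
        "min_samples": 2,
        "cluster_selection_epsilon": 0.1,
        "cluster_selection_method": "leaf",
    }

print(hdbscan_params(100)["min_cluster_size"])   # sqrt(100) = 10
print(hdbscan_params(1000)["min_cluster_size"])  # capped at 20
print(hdbscan_params(30)["min_cluster_size"])    # floored at 8
```

Outlier reassignment (nearest centroid by cosine similarity) happens as a separate post-processing step after clustering.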
A6: Silhouette score¶
$$s(i) = \frac{b(i) - a(i)}{\max(a(i),\; b(i))}$$
$a(i)$ = mean distance to points in same cluster. $b(i)$ = mean distance to nearest other cluster. Range: -1 (wrong cluster) to 1 (perfectly separated).
A7: Label filtering¶
Discard candidate $l_c$ if $\max_{j \in \text{prior labels}} \cos(\mathbf{e}_c, \mathbf{e}_j) \geq 0.70$
Select: $l^* = \arg\max_{l_c \in \text{filtered}} \cos(\mathbf{e}_c, \boldsymbol{\mu}_k)$ where $\boldsymbol{\mu}_k$ is the cluster centroid.
Fallback: if all candidates filtered, pick the best from the unfiltered set.
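The filter, select, and fallback steps above compose into one function; a minimal sketch where labels and embeddings are toy 2-D values (real embeddings would be 1024-dim) and `cos` is plain cosine similarity:

```python
import math

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def select_label(candidates, prior_embs, centroid, threshold=0.70):
    """Discard candidates within `threshold` cosine similarity of any prior
    label, then pick the survivor closest to the cluster centroid. If every
    candidate is filtered out, fall back to the best of the unfiltered set.
    `candidates` is a list of (label, embedding) pairs."""
    kept = [
        (lab, emb) for lab, emb in candidates
        if not prior_embs or max(cos(emb, p) for p in prior_embs) < threshold
    ]
    pool = kept or candidates  # fallback: unfiltered set
    return max(pool, key=lambda le: cos(le[1], centroid))[0]

# "dup" nearly duplicates the prior label's embedding and is filtered;
# "novel" survives and wins on centroid similarity.
print(select_label([("dup", (0.9, 0.1)), ("novel", (0.0, 1.0))],
                   prior_embs=[(1.0, 0.0)], centroid=(0.2, 0.8)))  # novel
```

Filtering during generation, rather than merging duplicates afterward, is what keeps redundant labels from consuming candidate slots in the first place.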
LLM benchmarks are effectively rank 2¶
Papailiopoulos (2026, post, code) assembled an 83-model x 49-benchmark matrix and found:
- The first singular value captures 71% of variance. Five benchmark scores predict the other 44 to within ~5 points
- Component 1 separates frontier from small models: GPQA-D, LiveCodeBench, MMLU-Pro load highest
- Component 2 separates latest frontier from older frontier: SimpleQA, ARC-AGI-2, HLE load highest. This is almost a "recency of frontier" measure
- A simple SVD + ridge regression method (BenchPress) matches, and slightly beats, Claude Sonnet at predicting missing scores (5.8% vs 6.1% median error), in under a second for $0
- Claude actually gets worse with more data ("retrieval-augmented degradation"), while BenchPress improves monotonically
Evaluation is the new frontier for AI¶
- If 49 benchmarks collapse to two dimensions, most of what we call "evaluation" measures the same two things repeatedly. The benchmarks that resist prediction (SimpleBench, ARC-AGI-1, Terminal-Bench) are the ones testing capabilities the rest of the matrix does not capture
- The optimal 5-benchmark set that spans both dimensions is: HLE, AIME 2025, LiveCodeBench, SWE-bench Verified, SimpleQA
- For domain-specific applications like CSS agent tool-calling, standard benchmarks are even less informative because they were not designed to test whether an agent selects the right analytical method for a given research question
- Building robust, domain-specific evaluation frameworks that test genuinely new capabilities is more valuable than adding another general benchmark to an already saturated matrix
A9: The Cursor-Kimi attribution case¶
In March 2026, Cursor launched "Composer 2" claiming frontier-level coding intelligence at 1/10 the cost of Claude. Within 24 hours, a developer found the model ID in the API: kimi-k2p5-rl-0317.
- Composer 2 was built on Kimi K2.5, an open-weight model from Beijing-based Moonshot AI, with reinforcement learning applied on top (TechCrunch)
- Cursor's launch messaging implied a fully in-house model with no mention of Moonshot AI or Kimi anywhere
- Kimi K2.5's license requires prominent attribution in commercial products, which Cursor did not provide
- Cursor co-founder admitted: "It was a miss to not mention the Kimi base in our blog from the start"
The benchmark number was real. The origin story was not. This illustrates why transparency in evaluation matters as much as the scores themselves.
A10: AI models used in the pipeline¶
| Pipeline stage | Model | Why this model |
|---|---|---|
| Query planning | GPT-4 Turbo | Structured output for typed facet decomposition |
| Post embeddings | Voyage AI 3.5-lite (1024-dim) | Cost/quality tradeoff at scale |
| Neural reranking | Voyage rerank-2.5 | Absolute relevance scores, instruction-following |
| Entity extraction | Llama-4-scout via Groq | Faster inference, higher rate limits than direct API |
| Theme label generation | GPT-4o-mini | Best overall quality at lowest cost in our benchmarks |
| Entity narratives | GPT-5-nano | Lightweight narrative generation with rule-based fallback |
| Agent (task LLM) | GPT-5.2 | Primary reasoning model for tool selection and code generation |
| Agent evaluation (judge) | Claude Sonnet 4.6 | LLM-as-judge for code quality and response quality scoring |
| Prompt optimization | GEPA with Claude Sonnet 4.6 | Reflective prompt evolution via natural-language reflection |