I came to this story through a modern kind of tip: not a leaked document, not a whistleblower, but a claim that software could spot the outlines of corruption by connecting public records that rarely meet in the same place. A 20-year-old Brazilian developer, Bruno César, built a system that cross-references politicians’ CPF numbers, Brazil’s taxpayer identification for individuals, with open public data to surface risk signals. The idea is direct: corruption often hides in relationships, and relationships leave traces across registries, payrolls, contracts and corporate filings.
In its most basic form, the system uses CPF for people and CNPJ for companies as “join keys” across many datasets. It ingests information from public agencies and transparency portals, normalizes and merges the records, then stores relationships in a graph structure so it can answer questions like “who is connected to whom” across officials, relatives, companies, and government spending. Instead of outputting accusations, it produces risk scores and “signs” that warrant follow-up, a choice that reduces legal exposure and avoids turning an investigative tool into a public shaming engine.
The system reportedly flagged patterns consistent with ghost employees, conflicts of interest, and suspicious allocations of public funds, including a reported count of 34 potential ghost workers and about $9.4 million in questionable spending measured as local-currency equivalent. Those are not final findings. They are leads, meant for journalists, watchdogs and auditors to verify, contextualize and, if warranted, document.
This is the deeper story: how a single developer, using open data and careful language, can build an accountability tool that looks less like a denunciation machine and more like a searchlight.
The Problem It Tries to Solve
Brazil has no shortage of public information. The problem is that the information is scattered, inconsistent and difficult to link at scale. A politician may appear in one database under a formal name, in another under a shortened name, in a third by a candidate identifier, and in a fourth only through a corporate connection. Payments may be published by agency, contracts by procurement portal, and employment lists by a separate registry.
That fragmentation creates a practical barrier to oversight. A local reporter with a laptop can read a contract, but cannot easily see every related contract, the corporate ownership trail, the relatives linked to the supplier, and the public employment status of the people receiving money. Even large newsrooms and NGOs often rely on labor-intensive spreadsheets, manual matching and tips.
César’s approach treats fragmentation as an engineering problem. If you can standardize identifiers and map relationships, you can compress weeks of document hunting into minutes of structured queries. The system aims to turn “I wonder if this is connected” into a reproducible, evidence-backed pathway for inquiry.
CPF and CNPJ as the System’s Spine
The design begins with identifiers. CPF identifies individuals for tax and administrative purposes. CNPJ identifies legal entities. In corruption risk work, identifiers matter because names are messy. Two people can share a name. One person can appear under multiple name variants. A single extra space or accent can break a join.
The system uses CPF, when present in public records, as the anchor for individuals and uses CNPJ as the anchor for companies. From there, it pulls in other identifiers: candidate registration numbers, public payroll IDs, contract numbers, procurement process identifiers and agency codes.
This is where the method becomes powerful. When CPF and CNPJ are used consistently, disparate datasets can be linked into a coherent map. When they are missing, the system can still use secondary linking, but it must do so carefully to avoid false matches. That is why this kind of tool is often strongest when it focuses on public officials and entities whose identifiers are already published within transparency frameworks.
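The join-key idea can be sketched in a few lines. Everything below is invented for illustration: the CPF and CNPJ values, the field names, and the sample records are placeholders, not data from the actual system, which would ingest them from transparency portals.

```python
from collections import defaultdict

# Hypothetical sample records standing in for three public datasets.
payroll = [
    {"cpf": "111.444.777-35", "name": "A. Silva", "agency": "Health Dept"},
]
company_owners = [
    {"cpf": "111.444.777-35", "cnpj": "12.345.678/0001-95", "role": "partner"},
]
contracts = [
    {"cnpj": "12.345.678/0001-95", "value": 250_000.0, "buyer": "Health Dept"},
]

def normalize_id(raw: str) -> str:
    """Strip punctuation so the same CPF/CNPJ always joins, however it was formatted."""
    return "".join(ch for ch in raw if ch.isdigit())

# Index each dataset by its normalized identifier: the "join key".
by_cpf = defaultdict(list)
for row in company_owners:
    by_cpf[normalize_id(row["cpf"])].append(row)
by_cnpj = defaultdict(list)
for row in contracts:
    by_cnpj[normalize_id(row["cnpj"])].append(row)

# Walk from a public employee to contracts won by companies they co-own,
# awarded by the agency that employs them.
for person in payroll:
    for link in by_cpf[normalize_id(person["cpf"])]:
        for contract in by_cnpj[normalize_id(link["cnpj"])]:
            if contract["buyer"] == person["agency"]:
                print(person["name"], "->", link["cnpj"], "->", contract["value"])
```

The whole trick is in `normalize_id`: once punctuation variants collapse to one canonical key, three unrelated datasets become one traversable chain.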
How the Data Gets In
A system like this typically lives or dies on ingestion. Public datasets come in every format: APIs, CSV downloads, PDF bulletins, legacy HTML tables and inconsistent field names. The work is less glamorous than “AI,” but it is the core.
The pipeline follows a familiar pattern:
It connects to “more than 70 public databases” through APIs or scraped open data.
It normalizes, deduplicates and standardizes records.
It merges them around identifiers like CPF and CNPJ.
It timestamps records so investigators can track changes over time.
That normalization step is where most of the intelligence begins. Dates must be converted into a single format. Monetary values must be standardized. Names must be cleaned for matching, while still preserving the original raw values for traceability. Agencies must be mapped to canonical identifiers so one ministry does not appear as five entities due to naming variations.
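A minimal sketch of that normalization step, using value formats common on Brazilian portals. The helper names and the sample record are assumptions for illustration, not the project's actual code; note how the raw record is kept alongside the cleaned one for traceability.

```python
import re
import unicodedata
from datetime import datetime

def normalize_date(raw: str) -> str:
    """Accept the date formats commonly seen on Brazilian portals, emit ISO 8601."""
    for fmt in ("%d/%m/%Y", "%Y-%m-%d", "%d-%m-%Y"):
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")

def normalize_money(raw: str) -> float:
    """Convert a 'R$ 1.234,56' style value to a plain float of reais."""
    digits = raw.replace("R$", "").strip().replace(".", "").replace(",", ".")
    return float(digits)

def normalize_name(raw: str) -> str:
    """Strip accents and collapse whitespace for matching purposes only."""
    no_accents = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode()
    return re.sub(r"\s+", " ", no_accents).strip().upper()

record = {"date": "05/03/2024", "value": "R$ 1.234,56", "name": "José  da Silva"}
clean = {
    "raw": record,  # preserve the original values so every match is auditable
    "date": normalize_date(record["date"]),
    "value": normalize_money(record["value"]),
    "name": normalize_name(record["name"]),
}
```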
Why a Graph, Not Just Tables
Traditional databases store information in tables. Tables are excellent for counting, filtering and aggregating. They are less elegant when the question is relational and multi-step: who is connected to whom through layers of intermediaries?
This is why César’s system uses a graph structure, described as something like Neo4j. In a graph, entities become nodes and connections become edges. A politician is a node. A company is a node. A contract is a node. Ownership, employment, family ties and contract awards become labeled connections.
With that structure, investigators can traverse the network quickly. A query can start at an official and expand outward: relatives, companies, contracts, agencies, funding allocations. It is not just faster than repeated table joins. It is more intuitive for the investigative questions that matter.
A graph also encourages a discipline that is essential in accountability work: every connection should be backed by a source record, and every source record should remain accessible. The goal is not “trust the model.” The goal is “follow the trail.”
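That sourced-edge discipline can be sketched with a plain adjacency list. The entities, relationship types and source names below are invented; a production system would use a graph database such as Neo4j, but the principle is the same: every edge carries the record that backs it, so any path the traversal finds can be audited.

```python
from collections import defaultdict, deque

# Illustrative edge list: (from, to, metadata). The "source" field names the
# record that justifies the connection.
edges = [
    ("official:123", "person:456", {"type": "RELATIVE_OF", "source": "civil registry"}),
    ("person:456", "company:789", {"type": "OWNS", "source": "corporate filing"}),
    ("company:789", "contract:42", {"type": "WON", "source": "procurement portal"}),
]

graph = defaultdict(list)
for src, dst, meta in edges:
    graph[src].append((dst, meta))

def expand(start: str, max_hops: int = 3):
    """Breadth-first expansion from an entity, yielding fully sourced paths."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if path:
            yield path
        if len(path) >= max_hops:
            continue
        for dst, meta in graph[node]:
            if dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [(node, meta["type"], dst, meta["source"])]))

for path in expand("official:123"):
    print(" -> ".join(f"{a}-[{t}]->{b}" for a, t, b, _src in path))
```

Each yielded path is a chain of claims, and each claim names its document. That is "follow the trail" encoded as a data structure.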
What It Surfaces, and What It Refuses to Say
The system is designed to surface patterns, not verdicts. That decision is not just ethical. It is practical. Corruption is a legal conclusion that requires investigation, context and due process.
So the tool, as described, looks for signals such as:
Payroll entries that look like ghost employment, where a person is being paid but does not appear in other records consistent with real service.
Procurement links that suggest conflicts of interest, such as contracts awarded to companies connected to officials or relatives.
Suspicious clusters of spending, such as repeated awards to interlinked firms or allocations that do not align with expected distributions.
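The simplest of these signals, the ghost-employee check, reduces at its core to a set difference between payroll identifiers and identifiers found in corroborating records. The CPF digit strings below are invented placeholders, and a real check would use many corroborating sources, not one.

```python
# Hypothetical inputs: CPFs drawn from payroll versus CPFs that also appear in
# corroborating records (attendance logs, benefits, prior-year payroll).
payroll_cpfs = {"11144477735", "22255588809", "33366699912"}
corroborating_cpfs = {"11144477735", "33366699912"}

# Anyone paid but never corroborated is a lead, not a finding.
ghost_candidates = payroll_cpfs - corroborating_cpfs

for cpf in sorted(ghost_candidates):
    print({"cpf": cpf, "signal": "possible ghost employee", "status": "needs review"})
```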
The reported outputs include a flagged count of 34 ghost workers and about $9.4 million in questionable spending in local-currency equivalent. Those numbers should be read as the system’s early “alerts,” not as a final accounting. A ghost employee signal can be a data mismatch. A conflict-of-interest signal can be a legitimate business relationship disclosed properly. The tool is valuable precisely because it points to what needs checking.
This restraint is also part of the project’s safety posture. The creator reportedly moved away from labeling individuals as “corrupt” or “suspect,” instead using risk scores and “signs.” That is a crucial design choice for any similar system operating within defamation risk and privacy constraints.
A Quick Technical Walkthrough in Plain Terms
The system can be understood as four interacting components.
Data plumbing
It pulls data from many sources, by API when possible and by scraping or bulk downloads when necessary. It stores raw extracts so everything can be reproduced.
Normalization and matching
It standardizes identifiers, cleans names, unifies date formats, and resolves duplicates. It merges records around CPF and CNPJ where legally and technically feasible.
Relationship mapping
It stores links in a graph database to make relationship queries efficient and to support network visualizations.
Scoring and explanation
It computes risk indicators through rule-based logic and statistical checks. AI is used to assist workflows, not to proclaim guilt.
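A sketch of how rule-based scoring with explanations might look. The rules, weights and thresholds are illustrative assumptions, not the system's actual logic; the point is the shape: every point of score maps to a stated, checkable reason.

```python
# Each rule returns (triggered, weight, explanation).
def rule_single_bidder(contract):
    return (contract["bids"] == 1, 0.3, "contract had a single bidder")

def rule_official_link(contract):
    return (contract["supplier_linked_to_official"], 0.5, "supplier linked to an official")

def rule_value_spike(contract):
    spike = contract["value"] > 3 * contract["category_median"]
    return (spike, 0.2, "value far above category median")

RULES = [rule_single_bidder, rule_official_link, rule_value_spike]

def score(contract):
    """Sum the weights of triggered rules and keep their explanations."""
    total, reasons = 0.0, []
    for rule in RULES:
        hit, weight, why = rule(contract)
        if hit:
            total += weight
            reasons.append(why)
    return round(total, 2), reasons

contract = {"bids": 1, "supplier_linked_to_official": True,
            "value": 900_000, "category_median": 200_000}
print(score(contract))
```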
This architecture is common in fraud detection and investigative analytics. The novelty here is the accessibility: a motivated individual built a working proof-of-concept using open data, modern tooling and a careful legal posture.
Two Ways AI Shows Up, and One Way It Should Not
The “AI” part can be misunderstood. In systems like this, AI is most helpful in two roles.
First, AI helps build and maintain the system. Coding assistants can speed up ETL scripting, generate transformations, and help create database queries. That is a productivity story.
Second, AI can help users ask questions of complex data. If a journalist can type a question in plain language and the system generates the correct graph query, the barrier to use drops dramatically.
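One way to get that plain-language convenience without letting a model write arbitrary queries is to map questions onto vetted query templates, so only parameters come from the user. The template text, graph labels and matching logic below are assumptions for illustration, not the described system's implementation.

```python
# Vetted Cypher templates, reviewed by a human once, reused safely many times.
# Labels (Person, Company, Contract) and relationships are hypothetical.
TEMPLATES = {
    "contracts won by companies owned by": (
        "MATCH (p:Person {cpf: $cpf})-[:OWNS]->(:Company)-[:WON]->(c:Contract) "
        "RETURN c"
    ),
}

def build_query(question: str, cpf: str):
    """Match the question to a vetted template; never execute model free text."""
    for phrase, cypher in TEMPLATES.items():
        if phrase in question.lower():
            return cypher, {"cpf": cpf}
    raise ValueError("no vetted template matches this question")

query, params = build_query(
    "Show contracts won by companies owned by this official", "11144477735"
)
```

An AI assistant can help route questions to templates or draft new templates for review, but the query that actually runs is always one a human approved.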
But there is a third role AI should not play: inventing conclusions. A model that “sounds confident” can do reputational damage if it hallucinates or overstates. That is why the safer approach is to tie every surfaced signal back to explicit records and to label outputs as risk indicators.
In the described system, rule-based and statistical logic generate risk scores rather than declaring “corruption” outright. That design choice should be treated as non-negotiable for anyone copying the approach.
A Comparison of Outputs: Accusations vs. Indicators
| Output style | What it says | Why it is risky or safe | Best use case |
|---|---|---|---|
| Accusation | “This person is corrupt.” | High defamation and due-process risk | Almost never appropriate for software |
| Suspicion label | “This person is suspicious.” | Still reputationally damaging, often vague | Limited, and usually not advised |
| Risk indicator | “This record matches red-flag patterns.” | Safer if evidence-linked and explainable | Journalism and oversight triage |
| Evidence summary | “Here are the connected records and why they matter.” | Safest when source-linked | Investigative reporting and audits |
This is the heart of the system’s ethical stance: it provides navigable evidence and structured signals, not accusations.
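That stance can be enforced in the data model itself: a risk indicator that carries no source records is simply not publishable. This sketch uses invented field names and an example URL to show one way to encode the rule.

```python
from dataclasses import dataclass, field

@dataclass
class SourceRecord:
    dataset: str    # e.g. "procurement portal" (illustrative)
    record_id: str
    url: str        # link back to the original public document

@dataclass
class RiskIndicator:
    pattern: str                        # which red-flag rule matched
    explanation: str                    # human-readable reasoning
    sources: list = field(default_factory=list)

    def is_publishable(self) -> bool:
        """An indicator without backing records should never be surfaced."""
        return len(self.sources) > 0

indicator = RiskIndicator(
    pattern="repeated-winner",
    explanation="Same supplier won 9 of 10 tenders in this category.",
    sources=[SourceRecord("procurement portal", "T-2024-0001",
                          "https://example.gov/t1")],  # hypothetical URL
)
```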
The Hardware Reality of Joining the State
The prototype reportedly runs on a local machine with high RAM, described as a server-class setup around 128 GB. That detail matters because public data at national scale is large, messy and often duplicated across sources. Ingestion alone can be heavy. Entity resolution can be heavy. Graph writes can be heavy.
A high-memory local machine can be a pragmatic choice for a single developer. It allows repeated processing without cloud costs, and it keeps early experimentation contained. As the system moves toward a beta for journalists and watchdog organizations, the architecture can evolve to hybrid: heavy processing locally or on a dedicated server, with a web interface and APIs for access.
An AP-Style Look at the Project’s Claims
A proof-of-concept that links dozens of public datasets can be compelling, but the claims should be treated with the discipline of verification.
The system reportedly flagged 34 ghost workers and about $9.4 million equivalent in questionable spending. Those claims, in an AP framing, are best written as “reported” results of automated analysis that require independent confirmation. They should be accompanied by explanations of what the tool measured, what thresholds were used, and what types of records generated the flags.
A responsible investigative workflow would treat the tool’s outputs as a starting list, not a story by itself. Each flagged item should be reviewed against source documents and, when possible, the relevant agencies should be asked for comment. Where a match may reflect a data-quality problem, the tool should record that as a resolved false positive to improve future scoring.
This is also why the system’s focus on journalists and watchdogs makes sense. Those groups have a process for verification and a duty to distinguish between anomaly and wrongdoing.
Expert Voices on What Matters Most
The most important insight from tools like this is not that AI “catches” corruption. It is that structure changes what is possible.
“Open data is only as powerful as the tools that make it usable,” said an investigative data editor who has built corruption databases for newsroom projects. “If you cannot link records, you cannot see patterns.”
A procurement specialist who audits public tenders put it differently: “The biggest red flags are rarely in a single contract. They are in repetition, clustering and relationships, and that is exactly what graph analysis is good at.”
A privacy lawyer who advises NGOs on public accountability tools emphasized restraint: “The safest systems do not label people as criminals. They surface documentary relationships, explain risk signals and leave judgment to institutions.”
These perspectives converge on the same point: usefulness depends on traceability, not bravado.
A Practical Blueprint to Build Something Similar
If you want to replicate this approach in another jurisdiction, think like a builder of a vertical fraud-detection product.
Decide scope and jurisdiction
Pick one country and one corruption domain first. Public procurement is often the best entry point because contracts and suppliers produce structured data. Ghost employment is another strong candidate, if payroll data is public.
Map and acquire data
Inventory the public datasets that expose relevant identifiers. Note which ones include the person ID, the company ID, contract IDs, agency codes and dates. Document update frequency and access methods.
Build ETL pipelines
Start with simple scripts. Ingest, clean, standardize. Keep raw snapshots. Create a canonical schema for people, companies, contracts and agencies.
Store relationships in a graph
Use a graph database for “who is connected to whom.” Keep raw tables in a relational store if needed, but push relationship queries into the graph.
Encode corruption patterns as rules first
Before machine learning, implement known red flags: repeated winners, single-bid tenders, links between officials and suppliers, abrupt spikes in contract values, and payroll anomalies.
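For example, the “repeated winners” flag reduces to a concentration check per agency. The award data, share threshold and minimum count below are invented for illustration; real thresholds should be calibrated against local procurement norms.

```python
from collections import Counter

# Hypothetical (agency, winning CNPJ) pairs drawn from tender data.
awards = [
    ("agency-a", "cnpj-1"), ("agency-a", "cnpj-1"), ("agency-a", "cnpj-1"),
    ("agency-a", "cnpj-2"), ("agency-b", "cnpj-3"),
]

def repeated_winners(awards, share_threshold=0.6, min_awards=3):
    """Flag suppliers winning a disproportionate share of an agency's tenders."""
    per_agency = {}
    for agency, winner in awards:
        per_agency.setdefault(agency, Counter())[winner] += 1
    flags = []
    for agency, counts in per_agency.items():
        total = sum(counts.values())
        for winner, n in counts.items():
            if n >= min_awards and n / total >= share_threshold:
                flags.append((agency, winner, round(n / total, 2)))
    return flags

print(repeated_winners(awards))
```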
Add machine learning and graph analytics
Once your features are stable, add models that help prioritize. Combine ML scores with graph anomaly scores. Calibrate with known cases where possible.
Add AI for usability, not conclusions
Use AI to help write queries, generate evidence-linked summaries and guide exploration. Do not let AI make claims that are not anchored to records.
Work with lawyers and civil-society partners
Use only open data that is clearly legal to process. Frame outputs as risk indicators. Add safeguards for privacy and defamation. Plan how you will handle corrections.
A Timeline View of How Such a Project Typically Evolves
| Phase | What gets built | What can go wrong | What good looks like |
|---|---|---|---|
| Week 1 to 2 | Dataset map, first ingestion scripts | Broken schemas, missing IDs | Reproducible raw snapshots |
| Week 3 to 4 | Normalization, canonical entities | False matches, duplicates | Clear entity-resolution rules |
| Month 2 | Graph model, relationship edges | Untraceable links | Every edge sourced and auditable |
| Month 3 | Risk rules and scoring | Overfitting, noisy alerts | Explainable, conservative indicators |
| Month 4+ | UI for journalists, AI-assisted querying | Hallucinated narratives | Evidence-linked summaries only |
This kind of discipline is what separates a flashy demo from a tool that can withstand scrutiny.
The Legal and Ethical Tightrope
Any system that uses CPF must treat it as sensitive. Even if CPF appears in certain public records, bulk processing of identifiers can raise legal and ethical issues, especially if the system is designed for open-ended lookup rather than narrowly scoped investigative use.
That is why the described project’s safeguards matter:
It avoids declaring “corruption.”
It outputs risk scores and “signs.”
It focuses access on journalists and watchdog organizations.
It is being reviewed with legal advice before open sourcing.
Those steps are not cosmetic. They are operational. They define whether the tool is a civic accountability instrument or a liability.
A responsible deployment should also include:
An explanation panel showing exactly why an entity is flagged.
Links to the source records.
A mechanism for corrections and dispute.
Logging and monitoring to prevent abuse.
A policy that prevents casual public shaming.
Why This Matters Beyond Brazil
The deeper lesson is portable. Countries differ in how much data they publish and which identifiers are public, but the pattern is consistent: corruption risk signals appear when relationships, money and authority intersect. If records exist, they can be linked. If they can be linked, they can be searched. If they can be searched, they can be audited more effectively.
That does not mean software replaces institutions. It means institutions, journalists and civil society can work faster, with better leads, and with a clearer picture of how power and money move.
It also means a new kind of capacity is becoming central to governance: data engineering as accountability.
Takeaways
- A CPF-centered linking approach can unify many public datasets into a single investigative map.
- Graph databases are well-suited for tracing networks across people, companies, contracts and agencies.
- The safest systems avoid accusations and instead present explainable risk indicators tied to evidence.
- “AI” is most valuable for ETL assistance, query generation and evidence summaries, not verdicts.
- Reported outputs like ghost payroll flags and questionable spending totals must be verified by humans.
- Legal review, privacy safeguards and controlled access are essential for real-world deployment.
- The model is replicable in other countries if identifiers and public datasets are legally available.
Conclusion
I left this reporting with a quieter kind of optimism than the headlines usually offer. Not the optimism of a miracle cure for corruption, but the optimism of a tool that makes it harder for patterns to stay hidden. A system like Bruno César’s does not prosecute anyone. It cannot determine intent. It cannot replace courts, auditors or journalism. What it can do is shrink the search space.
By cross-referencing identifiers, mapping relationships in a graph and assigning conservative risk signals, a small team, or even one determined developer, can create a triage engine for accountability. The most important design choice is not technical. It is ethical: show the work, link the records, and keep the language measured.
If the next decade of transparency is about more than publishing spreadsheets, it will be about building systems that allow ordinary investigators to understand complexity. In that future, corruption does not disappear. But it becomes easier to see, easier to question and harder to excuse.
FAQs
What is a CPF number?
CPF is Brazil’s taxpayer identification number for individuals. It is often used as a linking identifier across public and administrative records.
Does this kind of system prove corruption?
No. It highlights risk indicators and patterns that may warrant investigation. Proof requires reporting, auditing and legal process.
Why use a graph database?
Graphs make it efficient to explore relationships across entities, such as officials, relatives, companies and contracts, without complex multi-table joins.
What are “ghost employees” in this context?
Ghost employees are payroll entries that appear inconsistent with real employment, such as people paid by an agency but missing from other confirming records.
Can this be built outside Brazil?
Yes, in principle. The key requirements are legally available public datasets and a reliable identifier system for linking people, companies and contracts.