Pawan Singh Kapkoti — Data & Analytics Engineer

// 02 ABOUT

About Me

Yorkshire, UK · MSc Data Analytics · OSS maintainer

Yorkshire-based data and AI engineer. Governance tooling for SQL and AI agents, built in the open and runnable on-prem.

// SHIPPED

sql-steward — a semantic-layer MCP server where an AI agent queries a real database and never writes SQL. On PyPI and in the official MCP registry.
Its on-prem governance layer: schema-scout maps, query-warden gates access, pii-veil redacts, agent-blackbox audits. Seven packages on PyPI in all.
ETL from ERP systems into dbt marts with SLO monitoring

// ARCHITECTURE

LangGraph agents over MCP servers; ChromaDB and pgvector RAG; Ollama models
Schema-aware validation gate with a self-correcting repair loop
Every answer records its model and prompt — reproducible, audited
Fully on-prem; no batch data leaves the building

// UPSTREAM

Merged: pandas, dlt, dbt-core, Apache Superset, sqlglot, drt, fpdf2, scanapi
In review: ONS, NHS England, Octopus Energy, GOV.UK, DuckDB
Read the source, find the gap, ship the fix

// 03 EDUCATION & CERTS

Academic Background

Formal study in data analytics, plus the certifications that mattered for the day job.

01

MSc Data Analytics — Aston University

Birmingham, UK · 2022 – 2024

Coursework across statistical learning, big-data processing (Spark), and database systems, with a research dissertation applying machine learning (Random Forest) to the effects of ethnicity on student behaviour and outcomes.

02

BCA — Bachelor of Computer Applications

Amity University · Noida, India · 2017 – 2020

Algorithms, OOP, RDBMS, software engineering. Final-year project: a client-server inventory tracker with a JSP front end and a MySQL back end.

03

Microsoft Certified: Power BI Data Analyst Associate (PL-300)

Microsoft

Modelling, DAX, Power Query M, time intelligence, Import vs DirectQuery. Used daily for the manufacturing reporting layer.

04

Other certifications

Google · Microsoft Azure · AWS

Google Data Analytics Professional Certificate · Azure Data Engineering (DP-203 path) · AWS Cloud Practitioner.

// 04 PROJECTS

Projects

What I've shipped, and the adoption it has picked up.

2,000+

installs of sql-sop on PyPI, my rule-based SQL linter.

3

outside developers have merged 9 lint rules into it, six from @mvanhorn alone.

MCP

sql-steward accepted into the official Model Context Protocol registry.

sql-steward · governed execution

ask · "…"

compile · semantic layer → SELECT · ok

guard · policy check · ok

execute · 3 rows · ok

audit · append to ledger · chain ✓

audit ledger

Drive it. Ask a metric and the question is compiled to a bounded query, run, and logged. Request PII and the guard refuses it at compile time, before a row is ever read. See the full ungoverned-vs-governed side-by-side → · Try the policy live: type SQL, watch it get refused →

FLAGSHIP

_01

sql-steward

Code · Live demo · PyPI

The core product: a governed MCP server where an AI agent queries a real database and never writes SQL. Every query is compiled from a semantic layer you control — entities, joins, metrics, PII tags — so there is no run_sql tool to misuse. Blocked PII is refused before the query runs, declared data-quality checks gate the results, and init --from-db drafts a reviewable layer straight from a live schema, so it scales past a toy database. Same tools across SQL Server, Postgres, and SQLite. Listed in the official MCP registry.

semantic layer · no raw SQL · init --from-db · data-quality gates · 3 engines · MCP registry

Python · FastMCP · sqlglot · SQLAlchemy
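A minimal sketch of the "compiled from a semantic layer, no raw SQL" idea described above. Every name here (Metric, SemanticLayer, compile_query, Refused) is hypothetical and is not sql-steward's API; the sketch also assumes a dialect that supports LIMIT (Postgres or SQLite). The caller only ever names a metric and grouping columns, both of which are checked against an allowlist, so there is no free-form SQL to misuse.

```python
from dataclasses import dataclass, field


class Refused(Exception):
    """Raised at compile time, before any query is run."""


@dataclass(frozen=True)
class Metric:
    name: str
    expression: str  # trusted SQL written by the layer's author, not the agent
    table: str


@dataclass
class SemanticLayer:
    metrics: dict                 # metric name -> Metric
    dimensions: dict              # table -> set of groupable column names
    pii: set = field(default_factory=set)  # "table.column" tags


def compile_query(layer, metric, group_by=(), limit=100):
    """Turn a metric request into one bounded SELECT, or refuse."""
    m = layer.metrics.get(metric)
    if m is None:
        raise Refused(f"unknown metric: {metric}")
    for col in group_by:
        if f"{m.table}.{col}" in layer.pii:
            raise Refused(f"PII column blocked: {col}")
        if col not in layer.dimensions.get(m.table, set()):
            raise Refused(f"unknown dimension: {col}")
    limit = max(1, min(int(limit), 1000))  # always bounded
    select = [*group_by, f"{m.expression} AS {m.name}"]
    sql = f"SELECT {', '.join(select)} FROM {m.table}"
    if group_by:
        sql += " GROUP BY " + ", ".join(group_by)
    return f"{sql} LIMIT {limit}"


layer = SemanticLayer(
    metrics={"revenue": Metric("revenue", "SUM(amount)", "orders")},
    dimensions={"orders": {"region", "email"}},
    pii={"orders.email"},
)
print(compile_query(layer, "revenue", ["region"]))
# SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region LIMIT 100
```

Asking for group_by=["email"] raises Refused during compilation, so the refusal happens without touching the database.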

// The governance layer

Four small, on-prem tools that plug into sql-steward — each does one job, each ships on PyPI, none needs the cloud.

schema-scout · map

Reverse-engineers a 150+ table SQL Server database into an AI-ready catalog, inferring foreign keys and flagging PII. This is what init --from-db reads.

query-warden · access

YAML role-based access control for SQL, enforced before the query runs. A role can be denied a table or a single column and never sees a row of it.
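As a rough illustration of table- and column-level denial checked before execution: the policy below is a plain dict standing in for a parsed YAML file, and its schema (deny_tables, deny_columns) and the check_access function are assumptions made for this sketch, not query-warden's real format.

```python
# Stand-in for a parsed YAML policy file; the keys are hypothetical.
POLICY = {
    "analyst": {
        "deny_tables": {"payroll"},
        "deny_columns": {"customers": {"email"}},
    },
    "admin": {"deny_tables": set(), "deny_columns": {}},
}


def check_access(role, table, columns):
    """Return (allowed, reason). Runs before the query, so a denied
    role never receives a row. Unknown roles are denied by default."""
    rules = POLICY.get(role)
    if rules is None:
        return False, f"unknown role: {role}"
    if table in rules["deny_tables"]:
        return False, f"table denied: {table}"
    blocked = set(columns) & rules["deny_columns"].get(table, set())
    if blocked:
        return False, "columns denied: " + ", ".join(sorted(blocked))
    return True, "ok"


print(check_access("analyst", "customers", ["name", "email"]))
# (False, 'columns denied: email')
```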

pii-veil · redact

Column-level PII masking and refusal. Tagged columns are masked in the result or blocked outright at compile time, before a query ever touches the data.
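The two behaviours described here, blocking outright versus masking in the result, could be sketched as follows. The tag values ("mask", "block") and the function names are illustrative assumptions, not pii-veil's interface.

```python
def mask_value(value, keep=2):
    """Replace all but the last `keep` characters with asterisks."""
    if value is None:
        return None
    text = str(value)
    if len(text) <= keep:
        return "*" * len(text)
    return "*" * (len(text) - keep) + text[-keep:]


def check_columns(columns, tags):
    """Compile-time gate: refuse before any query is executed."""
    blocked = [c for c in columns if tags.get(c) == "block"]
    if blocked:
        raise PermissionError("blocked PII columns: " + ", ".join(blocked))


def mask_rows(columns, rows, tags):
    """Result-time masking for columns tagged 'mask'."""
    masked = {i for i, c in enumerate(columns) if tags.get(c) == "mask"}
    return [
        tuple(mask_value(v) if i in masked else v for i, v in enumerate(row))
        for row in rows
    ]


tags = {"phone": "mask", "ni_number": "block"}  # hypothetical tag map
check_columns(["name", "phone"], tags)          # passes silently
print(mask_rows(["name", "phone"], [("Asha", "07700900123")], tags))
# [('Asha', '*********23')]
```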

agent-blackbox · audit

Append-only, hash-chained ledger: each row stores the hash of the one before it, so any later edit breaks the chain and verify() points at it. One SQLite file, zero deps.
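A hash chain of this kind can be sketched with nothing but the standard library. This is an illustration of the technique under assumed names (open_ledger, append, the table layout); only the idea that each row stores the previous row's hash and that verify() points at the broken row comes from the description above.

```python
import hashlib
import json
import sqlite3

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first row


def _digest(prev_hash, payload):
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def open_ledger(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ledger ("
        "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, prev_hash TEXT NOT NULL, hash TEXT NOT NULL)"
    )
    return conn


def append(conn, event):
    """Append one event, chaining it to the hash of the previous row."""
    payload = json.dumps(event, sort_keys=True)
    row = conn.execute(
        "SELECT hash FROM ledger ORDER BY seq DESC LIMIT 1"
    ).fetchone()
    prev_hash = row[0] if row else GENESIS
    row_hash = _digest(prev_hash, payload)
    conn.execute(
        "INSERT INTO ledger (payload, prev_hash, hash) VALUES (?, ?, ?)",
        (payload, prev_hash, row_hash),
    )
    conn.commit()
    return row_hash


def verify(conn):
    """Return the seq of the first row that breaks the chain, else None."""
    expected_prev = GENESIS
    rows = conn.execute(
        "SELECT seq, payload, prev_hash, hash FROM ledger ORDER BY seq"
    )
    for seq, payload, prev_hash, row_hash in rows:
        if prev_hash != expected_prev or _digest(prev_hash, payload) != row_hash:
            return seq
        expected_prev = row_hash
    return None
```

Editing the payload of row 2 after the fact makes its stored hash no longer match its contents, so verify() returns 2; rewriting that row's hash as well would instead break the link from row 3.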


_02

UK Crime Pipeline

Code · Live

End-to-end data pipeline. Police UK API to PostgreSQL and BigQuery. 6 dbt marts (outcome analysis, YoY trends), 65 tests. Polars-based alternative ingestion. Declarative validation, SLO monitoring, pipeline maturity scorecard.

99,675 records · 10 cities · 6 dbt marts · 65 tests

Live · Python · PostgreSQL · BigQuery · dbt · Polars

_03

Manufacturing Compliance Dashboard

Code · Live

BRC/HACCP food safety compliance. MCP server exposes 5 compliance tools for LLM agents. NL query interface for auditors. SLO monitoring (temperature 95%, traceability 90%), z-score anomaly detection, PDF audit reports. Four Golden Signals /metrics endpoint.

MCP server · NL query · /metrics · SLO monitoring

Live · Streamlit · FastMCP · FastAPI · PySpark · Plotly
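For readers unfamiliar with z-score anomaly detection, a minimal standard-library sketch follows. The function name, the sample readings, and the threshold are assumptions for illustration; the dashboard's own implementation may differ.

```python
from statistics import mean, stdev


def zscore_anomalies(readings, threshold=3.0):
    """Return (index, value) pairs whose z-score exceeds the threshold.

    Note: with n samples the largest possible z-score is (n - 1) / sqrt(n),
    so small windows need a lower threshold than the usual 3.0.
    """
    if len(readings) < 2:
        return []
    mu = mean(readings)
    sigma = stdev(readings)  # sample standard deviation
    if sigma == 0:
        return []
    return [
        (i, x) for i, x in enumerate(readings)
        if abs(x - mu) / sigma > threshold
    ]


# Hypothetical chiller temperatures in degrees C, with one spike.
temps = [4.0, 4.1, 3.9, 4.0, 4.2, 3.8, 4.0, 4.1, 3.9, 12.0]
print(zscore_anomalies(temps, threshold=2.5))
# [(9, 12.0)]
```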

My Stack

Tools I reach for daily, grouped by what they're for.

Languages

Python

SQL

T-SQL

Bash

TypeScript

Data Engineering

dbt

PostgreSQL

SQL Server

DuckDB

BigQuery

Polars

AI & Agents

LangGraph

FastMCP

Ollama

pgvector

ChromaDB

Web & API

FastAPI

Streamlit

Next.js

BI & Visualisation

Power BI

Looker Studio

Infra & DevOps

Docker

GitHub Actions

Linux

Manufacturing & Compliance Domain

ERP data integration · BRC · HACCP · FEFO · SSRS

// 06 OPEN SOURCE

Open Source Contributions

Maintainer or substantive contributor — not just docs typo fixes.

01

drt-hub/drt

Collaborator — destinations, --threads, --quiet, json_columns review

Collaborator on the multi-source data sync engine across three releases. v0.5: destination connectors, including Mixpanel, and the official connector tutorial. v0.6: --threads N parallel orchestration with a thread-safe StateManager and 11 parallel-dispatch tests. v0.7: --quiet for CI/cron use cases. Also reviewed the json_columns config PR, where an early-validation suggestion shaped the final implementation.
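The general shape of parallel dispatch with lock-protected shared state might look like the sketch below. It is not drt's code: the StateManager methods, run_syncs, and the status strings are all assumed for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StateManager:
    """Shared sync state; every read and write goes through one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = {}

    def record(self, sync_name, status):
        with self._lock:
            self._state[sync_name] = status

    def snapshot(self):
        with self._lock:
            return dict(self._state)


def run_syncs(syncs, threads=4):
    """Run each named sync callable on a pool of `threads` workers."""
    state = StateManager()

    def _run(item):
        name, fn = item
        try:
            fn()
            state.record(name, "ok")
        except Exception as exc:  # one failed sync must not stop the rest
            state.record(name, f"failed: {exc}")

    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(_run, syncs.items()))  # drain to wait for completion
    return state.snapshot()
```

Catching the exception inside the worker keeps one failing sync from hiding the results of the others, which matters more once runs are no longer sequential.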

02

sql-sop

Maintainer — review queue, releases, contributor funnel

Three outside contributors have merged nine lint rules into it. Six came from @mvanhorn (W019, W016, W015, W023, W021, W012), with W011 and P005 from @tmchow and the OVER() window check from @Prabhu-1409. I run the review queue, publish to PyPI via Trusted Publishing (2,000+ installs, mirrors excluded), maintain the security and governance policy, and keep a public ROADMAP plus a one-file scaffold so a new contributor can add a rule without touching the core.

03

pr-sop

Creator and maintainer — three releases in 24 hours

Shipped v0.1.0, v0.1.1 (third-party rev: pin false-positive fix), and v0.1.2 (CI-merge-commit tag lookup fix) to PyPI in 24 hours. Full governance, security, contributing, and code-of-conduct documents published.

04

scanapi/scanapi

Active contributor — docstrings PR and pipx install-path PR

Added docstrings to spec_evaluator.py (PR #868, merged), opened the pipx install-path documentation PR (#907), and am triaging follow-on TOC issues for the wiki.

// 07 NOTABLE PRS

Signature Upstream Pull Requests

Merged fixes and features in the SQL, data and AI tools I use every day.

01

tobymao/sqlglot#7824

Fix(presto): preserve SHA256/SHA512 digest semantics · merged

Presto/Trino mapped native SHA256/512 to the string-hash expression, silently changing hash values in transpilation. Traced the root cause into the transpiler that underpins dbt and SQLMesh, mirrored the existing MD5 handling, and it merged the next morning.
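Why the distinction matters can be shown outside SQL entirely. The snippet below uses Python's hashlib, not sqlglot, and only illustrates the general point that a raw digest and its hex-string form are different values, so swapping one for the other changes what downstream comparisons and joins see.

```python
import hashlib

data = b"abc"
raw = hashlib.sha256(data).digest()        # 32 raw bytes
as_hex = hashlib.sha256(data).hexdigest()  # 64-character hex string

print(len(raw), len(as_hex))
# 32 64

# Same digest, two representations: they only agree after an explicit encode.
print(raw.hex() == as_hex)
# True
print(raw == as_hex.encode("ascii"))
# False
```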

02

apache/superset#39118

refactor(dashboard): download-permission rename · merged

Aligned the dashboard download permission with the explore path. Refined to a single-responsibility change after maintainer review, then merged into Apache Superset.

03

sqlfluff/sqlfluff#8088

TSQL: restore single-quote normalization · merged

A reported linter false-positive on quoted T-SQL aliases turned out to be a dialect-level bug: the patched single_quote lexer had dropped normalization for every single-quoted token. Fixed at the root with a regression test; all 545 dialect fixtures pass unregenerated.

04

drt-hub/drt#668 (+ #608, #678)

feat: sync.mask PII masking · Mixpanel destination · merged

Collaborator across releases. Shipped sync.mask (hash / redact / truncate PII before load) as a pure transform at the field-mapping seam, plus the Mixpanel destination connector and the official connector tutorial.
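The three masking modes named above can be pictured as a pure function over a record. This is a sketch under assumed names (mask_field, apply_mask, the rule mapping) and does not reproduce drt's sync.mask configuration syntax.

```python
import hashlib


def mask_field(value, mode, length=4):
    """Apply one masking mode to a single value."""
    if value is None:
        return None
    text = str(value)
    if mode == "hash":
        return hashlib.sha256(text.encode("utf-8")).hexdigest()
    if mode == "redact":
        return "[REDACTED]"
    if mode == "truncate":
        return text[:length]
    raise ValueError(f"unknown mask mode: {mode}")


def apply_mask(record, rules):
    """Return a new record with masked fields; the input is not mutated."""
    return {
        key: mask_field(value, rules[key]) if key in rules else value
        for key, value in record.items()
    }


# Hypothetical record and rules.
record = {"email": "a@b.co", "name": "Asha", "postcode": "LS1 4AP", "plan": "pro"}
rules = {"email": "hash", "name": "redact", "postcode": "truncate"}
masked = apply_mask(record, rules)
print(masked["name"], masked["postcode"], masked["plan"])
# [REDACTED] LS1  pro
```

Keeping the transform pure means it can sit between field mapping and load without either side knowing about it, and it is trivial to test in isolation.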

05

tobymao/sqlglot#7832

Fix(bigquery): SHA512 → SHA2Digest · merged

Follow-up to #7824, opened at the maintainer's invitation: mapped BigQuery's native SHA512 to the digest expression and gave Presto/Trino a type-aware encode, keeping byte-for-byte hash semantics across six dialects.

06

openfoodfacts/openfoodfacts-server#13892

taxonomy: add "groundnut" to peanut allergens · merged

A one-line fix to a safety-critical open dataset used worldwide: "groundnut", the standard UK and Indian term, was missing from the peanut allergen synonyms, so allergen cross-checks could silently miss it.

// Open, in review — six more opened this month

ONS · rdsa-utils#245 — human-readable pipeline run-ID generator, validator and parser
NHS England · #78 — cross-platform command detection, fixing setup on Windows
GOV.UK / CDDO · #98 — SharePoint list pagination bug silently truncating past 200 rows
DuckDB · duckdb-web#7002 — documented the full set of MySQL-extension settings
SQLMesh · #5888 — hex-string surrogate keys for SHA256/512 on Presto/Trino
Apache Superset · #41799 — GranularExportControls download-permission follow-up
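The SharePoint item in the list above is an instance of a common failure: reading only the first page of a paginated API. A generic sketch of the fix follows; the page shape ("value" and "next" keys) and the fetch_page callable are assumptions, not the actual SharePoint or CDDO code.

```python
def fetch_all(fetch_page, first_url):
    """Follow 'next' links until the server stops returning one."""
    items = []
    seen = set()
    url = first_url
    while url is not None:
        if url in seen:  # guard against a server that loops its links
            raise RuntimeError(f"pagination loop at {url}")
        seen.add(url)
        page = fetch_page(url)
        items.extend(page["value"])
        url = page.get("next")
    return items


# Fake three-page source: 200 + 200 + 50 rows.
pages = {
    "p1": {"value": list(range(0, 200)), "next": "p2"},
    "p2": {"value": list(range(200, 400)), "next": "p3"},
    "p3": {"value": list(range(400, 450))},
}
rows = fetch_all(pages.__getitem__, "p1")
print(len(rows))
# 450
```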

// 08 ACTIVITY

Activity

What the contribution graph and language mix look like right now.

// 09 CONTACT

Get in Touch

Open to data engineering, data ops, manufacturing analytics, and on-prem AI roles. Especially interested in roles where the data domain genuinely matters and the system has to keep running on a Sunday night.

// EMAIL

pawankapkoti3889@gmail.com

Best for first contact. Replies same day, weekdays.

// LINKEDIN

linkedin.com/in/pawan-singh-kapkoti

Connect for roles, OSS collaboration, or a manufacturing-data chat.

// COMMUNITY

discord.gg/gBr77yYPkD

The Discord for the governed stack. Questions, show and tell, release feed.

How I work

Three honest stages — same flow whether the work is a pipeline, a dashboard, or shipping a tool to PyPI.

01

Discovery

Read the source. Read the data. Talk to the people who actually run the line. Confirm the problem is the problem before writing code.

02

Architecture

Sketch the smallest thing that delivers value. Pick boring, well-documented tools. Write the tests first when the contract matters.

03

Delivery

Ship in CI behind a green build. Document the trade-offs. Set up a way for the next person (sometimes me) to understand it without paging me.
