// HELLO WORLD
Hi, I'm

PAWAN

~/import pawan_singh_kapkoti
Data Engineer

I build robust data pipelines, developer tools, and local AI for food manufacturing. Working in Yorkshire, shipping code on the side. I believe compliance shouldn't mean spreadsheets and AI shouldn't require the cloud.

// 02 ABOUT

About Me

Yorkshire, UK · MSc Data Analytics · OSS maintainer

Pawan Singh Kapkoti

// BUILD

I build data pipelines for food manufacturing — incremental ETL from ERP systems, dbt marts with SLO monitoring, and Power BI consumption layers. Four Python tools live on PyPI (sql-sop, sql-sop-mcp, pr-sop, morning-brief), each with a real problem behind it that I hit on the floor before writing a line of code.
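As an illustration of what "incremental ETL" means here, the sketch below pulls only rows changed since the last run by tracking a watermark. It is a minimal example under stated assumptions: the `erp_batches` table, its columns, and the `extract_incremental` helper are hypothetical, and an in-memory SQLite database stands in for the ERP source.

```python
import sqlite3


def extract_incremental(conn, last_watermark):
    """Return rows modified after last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, product, modified_at FROM erp_batches "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_watermark,),
    ).fetchall()
    # Advance the watermark only when new rows arrived.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


# In-memory stand-in for an ERP source table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE erp_batches (id INTEGER, product TEXT, modified_at TEXT)")
conn.executemany(
    "INSERT INTO erp_batches VALUES (?, ?, ?)",
    [
        (1, "flour", "2024-01-01T08:00:00"),
        (2, "yeast", "2024-01-02T09:30:00"),
        (3, "salt", "2024-01-03T07:15:00"),
    ],
)

# ISO-8601 timestamps sort correctly as plain strings.
rows, watermark = extract_incremental(conn, "2024-01-01T12:00:00")
```

Persisting `watermark` between runs is what keeps each load small: a second call with the returned value fetches nothing until the source changes again.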

// DEPLOY

Local AI for factories. LangGraph agents that talk to MCP servers, ChromaDB and pgvector RAG, Ollama-served models — all on-prem so no batch data leaves the building. A 5-stage SQL validation pipeline sits between the model and the database.

// CONTRIBUTE

I learn tools by reading their source: reverse-engineer the architecture, find the gap, ship the fix. Merged PRs into pandas, ChromaDB, scanapi, dlt, dbt-core, ollama-python, fpdf2, pyOpenSci, and Apache Superset.

// 03 EDUCATION & CERTS

Academic Background

Formal study in data analytics, plus the certifications that mattered for the day job.

01
Aston University

MSc Data Analytics — Aston University

Birmingham, UK · 2022 – 2024

Coursework across statistical learning, big-data processing (Spark), database systems, and a research dissertation on data quality in healthcare reporting pipelines.

02
Amity University

BCA — Bachelor of Computer Applications

Amity University · Noida, India · 2017 – 2020

Algorithms, OOP, RDBMS, software engineering. Final-year project: a client-server inventory tracker with a JSP front end and a MySQL back end.

03
Microsoft

Microsoft Certified: Power BI Data Analyst Associate (PL-300)

Microsoft

Modelling, DAX, Power Query M, time intelligence, Import vs DirectQuery. Used daily for the manufacturing reporting layer.

04
Other certs

Other certifications

Google · Microsoft Azure · AWS

Google Data Analytics Professional Certificate · Azure Data Engineering (DP-203 path) · AWS Cloud Practitioner.

// 04 PROJECTS

Projects

Things I built that are running in production or deployed publicly.

_01

OpsMind

On-premises AI for food manufacturing. LangGraph 6-node agent converts English to SQL in 5 seconds. MCP server architecture decouples database and doc search as tools. 5-stage SQL validation, runtime-loaded domain docs, structured JSONL audit logging. Runs Gemma 3 12B via Ollama — no data leaves the factory.

7 domains · MCP servers · 5-stage SQL validation · golden-set eval
Python · LangGraph · FastMCP · Ollama · Docker · ChromaDB
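To give a feel for staged SQL validation, here is a minimal sketch in which each stage either passes or returns a rejection reason, and the first failure stops the query before it reaches the database. The stages, the allow-list, and every function name are illustrative assumptions, not the five stages OpsMind actually runs. The regex checks are deliberately crude; a production validator would more likely lean on a real SQL parser.

```python
import re

ALLOWED_TABLES = {"batches", "temperature_log"}  # hypothetical allow-list


def check_single_statement(sql):
    body = sql.strip().rstrip(";")
    return "multiple statements" if ";" in body else None


def check_read_only(sql):
    ok = re.match(r"\s*(select|with)\b", sql, re.IGNORECASE)
    return None if ok else "not a read-only query"


def check_no_write_keywords(sql):
    hit = re.search(r"\b(insert|update|delete|drop|alter|truncate)\b", sql, re.IGNORECASE)
    return f"forbidden keyword: {hit.group(1).lower()}" if hit else None


def check_tables(sql):
    found = re.findall(r"\b(?:from|join)\s+([A-Za-z_]\w*)", sql, re.IGNORECASE)
    unknown = sorted({t.lower() for t in found} - ALLOWED_TABLES)
    return f"unknown tables: {unknown}" if unknown else None


STAGES = [check_single_statement, check_read_only, check_no_write_keywords, check_tables]


def validate(sql):
    """Run each stage in order; return the first failure reason, or None if all pass."""
    for stage in STAGES:
        reason = stage(sql)
        if reason:
            return reason
    return None
```

Ordering the cheap structural checks first means a malformed or multi-statement query is rejected before any table lookup is attempted.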
_02

Manufacturing Compliance Dashboard

BRC/HACCP food safety compliance. MCP server exposes 5 compliance tools for LLM agents. NL query interface for auditors. SLO monitoring (temp 95%, traceability 90%), z-score anomaly detection, PDF audit reports. Four Golden Signals /metrics endpoint.

MCP server · NL query · /metrics · SLO monitoring
Live · Streamlit · FastMCP · FastAPI · PySpark · Plotly
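The two monitoring ideas above can be sketched in a few lines of standard-library Python: a z-score check flags readings far from the mean, and an SLO check reports the share of readings inside an acceptable range. The sample temperatures, the 0 to 5 °C range, and both function names are assumptions for illustration; only the 95% temperature target comes from the project description.

```python
from statistics import mean, pstdev


def zscore_anomalies(readings, threshold=3.0):
    """Return (index, value) pairs whose z-score magnitude exceeds threshold."""
    mu = mean(readings)
    sigma = pstdev(readings)
    if sigma == 0:
        return []  # all readings identical, nothing to flag
    return [(i, x) for i, x in enumerate(readings) if abs(x - mu) / sigma > threshold]


def slo_compliance(readings, low, high):
    """Fraction of readings that fall inside [low, high]."""
    in_range = sum(1 for x in readings if low <= x <= high)
    return in_range / len(readings)


# Hypothetical chiller temperatures in degrees Celsius.
chiller_temps = [3.1, 3.0, 2.9, 3.2, 3.0, 3.1, 2.8, 3.0, 9.5, 3.1]

# With a population standard deviation, no z-score in a sample of n points
# can exceed sqrt(n - 1), so a small sample needs a threshold below 3.
anomalies = zscore_anomalies(chiller_temps, threshold=2.5)
compliance = slo_compliance(chiller_temps, low=0.0, high=5.0)
slo_met = compliance >= 0.95  # 95% temperature target from the description
```

Here the single 9.5 °C reading is flagged as an anomaly and also drags compliance to 90%, below the 95% target.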
_03

UK Crime Pipeline

End-to-end data pipeline. Police UK API to PostgreSQL and BigQuery. 6 dbt marts (outcome analysis, YoY trends), 65 tests. Polars-based alternative ingestion. Declarative validation, SLO monitoring, pipeline maturity scorecard.

99,675 records · 10 cities · 6 dbt marts · 65 tests
Live · Python · PostgreSQL · BigQuery · dbt · Polars
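"Declarative validation" here means the checks are written as data rather than as code. A minimal sketch, assuming a hypothetical rule schema (`required`, `allowed`, `pattern`) and made-up field names and city values rather than the pipeline's real configuration:

```python
import re

# Rules as data: adding a check means editing this mapping, not the validator.
RULES = {
    "crime_id": {"required": True},
    "city": {"required": True, "allowed": {"Leeds", "York"}},
    "month": {"required": True, "pattern": r"\d{4}-\d{2}"},
}


def validate_record(record, rules=RULES):
    """Return a list of human-readable errors; an empty list means the record passes."""
    errors = []
    for field, rule in rules.items():
        value = record.get(field)
        if value in (None, ""):
            if rule.get("required"):
                errors.append(f"{field}: missing")
            continue
        if "allowed" in rule and value not in rule["allowed"]:
            errors.append(f"{field}: unexpected value {value!r}")
        if "pattern" in rule and not re.fullmatch(rule["pattern"], str(value)):
            errors.append(f"{field}: does not match {rule['pattern']}")
    return errors
```

Because the rules are plain data, the same mapping could be loaded from a config file and reused across ingestion paths.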
_04

sql-sop

Fast rule-based SQL linter. 39 rules (5 T-SQL specific), 152 tests, libCST-based injection scanner, inline disable directives, .sql-guard.yml config, SARIF output for GitHub Code Scanning. v0.7 milestone in progress (Performance Rules Pack); three external authors merged. sql-sop-mcp ships the same linter as an MCP server for Claude Desktop / Cursor / ChatGPT. 500+ monthly downloads.

39 rules · 152 tests · libCST scanner · SARIF · v0.6.1 · v0.7 in progress
Python · libCST · Typer · Rich · pre-commit
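For readers new to rule-based linting, the sketch below shows the general shape of one rule plus an inline disable directive. It is not sql-sop's code: the regex approach, the `Finding` class, and the `-- lint: disable=...` comment syntax are all assumptions made for the example. Only the rule ID W011 (union-without-all) comes from the project itself.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    rule_id: str
    line: int
    message: str


# UNION not followed by ALL on the same line.
UNION_WITHOUT_ALL = re.compile(r"\bunion\b(?!\s+all\b)", re.IGNORECASE)
# Hypothetical directive syntax, e.g. "-- lint: disable=W011".
DISABLE = re.compile(r"--\s*lint:\s*disable=([\w,]+)")


def lint(sql):
    """Return W011 findings, honouring per-line disable directives."""
    findings = []
    for lineno, line in enumerate(sql.splitlines(), start=1):
        match = DISABLE.search(line)
        disabled = set(match.group(1).split(",")) if match else set()
        code = line.split("--", 1)[0]  # ignore comment text when matching
        if UNION_WITHOUT_ALL.search(code) and "W011" not in disabled:
            findings.append(
                Finding("W011", lineno, "UNION without ALL forces a de-duplication step")
            )
    return findings


sample = """SELECT id FROM a
UNION
SELECT id FROM b
UNION ALL
SELECT id FROM c
UNION  -- lint: disable=W011
SELECT id FROM d"""

findings = lint(sample)
```

Only the bare `UNION` on line 2 is reported; the `UNION ALL` passes and the last `UNION` is silenced by its directive.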
_05

pr-sop

Opinionated PR governance checks. CHANGELOG drift, version consistency between pyproject.toml and __init__.py, stale rev: pins in READMEs. Three config-driven checks turned on via a YAML block. CLI, pre-commit hook, or GitHub Action. First external consumer: sql-sop itself.

3 checks · 29 tests · pydantic v2 · under 1s/PR · v0.1.2
Python · Typer · Pydantic · GitHub Actions
_06

morning-brief

Rule-based daily Gmail triage. Zero LLM, read-only OAuth. Classifies the last day of mail into HIGH / MEDIUM / LOW / SPAM via YAML rules, writes a markdown digest, fires a desktop toast. v0.3.0 adds sub-day windows, thread collapse, and preview / why commands.

Read-only OAuth · sub-day windows · thread collapse · v0.3.0
Python · Typer · Gmail API
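Rule-based triage without an LLM can be as simple as "first matching rule wins". The sketch below assumes a hypothetical rule schema (shown as the Python structure that parsed YAML might produce) and invented senders and keywords; it is not morning-brief's actual rule format. Returning the reason alongside the priority is one way a "why" style command could explain a classification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mail:
    sender: str
    subject: str


# Hypothetical rules, in priority order; the first match wins.
RULES = [
    {"priority": "SPAM", "field": "subject", "contains": ["unsubscribe", "limited offer"]},
    {"priority": "HIGH", "field": "subject", "contains": ["recall", "urgent"]},
    {"priority": "HIGH", "field": "sender", "contains": ["@auditor.example"]},
    {"priority": "MEDIUM", "field": "subject", "contains": ["invoice", "meeting"]},
]


def classify(mail, rules=RULES):
    """Return (priority, reason) for the first matching rule; default to LOW."""
    for rule in rules:
        text = getattr(mail, rule["field"]).lower()
        for needle in rule["contains"]:
            if needle in text:
                return rule["priority"], f'{rule["field"]} contains "{needle}"'
    return "LOW", "no rule matched"
```

For example, `classify(Mail("qa@auditor.example", "Site visit"))` matches the sender rule and is rated HIGH, while a message that matches nothing falls through to LOW.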
_07

SQL Ops Reviewer

GitHub Action that auto-reviews .sql files in pull requests using local AI. Catches injection risks, performance anti-patterns, style violations. Posts structured review comments with fix suggestions. One YAML file to set up, runs on the CI runner, zero API keys.

10 categories · phi3:mini default · zero API keys
Python · Ollama · GitHub Actions
_08

MediAsk

Health Q&A platform for factory workers. NHS-verified guidance, Gemini AI responses, voice input, 18 languages. Flask + PostgreSQL, Dockerised.

NHS-verified · 18 languages · voice input
Live · Flask · PostgreSQL · Gemini
// 05 STACK

My Stack

Tools I reach for daily, grouped by what they're for.

Languages

Python · SQL · T-SQL · Bash · TypeScript

Data Engineering

dbt · Prefect · PostgreSQL · SQL Server · BigQuery · Polars · pandas · PySpark

AI / RAG

LangGraph · FastMCP · Ollama · ChromaDB · pgvector · Gemini

Web & API

FastAPI · Flask · Streamlit · Next.js · Pydantic

BI & Visualisation

Power BI · Looker Studio · Jupyter · Plotly

Infra & DevOps

Docker · OpenTofu · GitHub Actions · Linux · Google Cloud

Manufacturing & Compliance Domain

SI Integreater · OCM scanning · BRC · HACCP · FEFO · SSRS
// 06 OPEN SOURCE

Open Source Contributions

Maintainer or substantive contributor — not just docs typo fixes.

01

drt-hub/drt

Collaborator — destinations, --threads, --quiet, json_columns review

Collaborator on the multi-source data sync engine across three releases. v0.5: 5 destination connectors and the official connector tutorial. v0.6: --threads N parallel orchestration with a thread-safe StateManager, covered by 11 parallel-dispatch tests. v0.7: --quiet for CI/cron use cases (approved). Also reviewed the json_columns config PR, where my early-validation suggestion shaped the final implementation.
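The general pattern behind parallel orchestration with shared state is a worker pool plus a lock around every read and write of that state. The sketch below is a generic illustration, not drt's code: this `StateManager` is a hypothetical stand-in that only shares a name with the real class, and `run_sync` fakes the work.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StateManager:
    """Hypothetical stand-in: guards a shared dict with a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = {}

    def record(self, sync_name, rows):
        # The read-modify-write must happen under the lock to avoid lost updates.
        with self._lock:
            self._state[sync_name] = self._state.get(sync_name, 0) + rows

    def snapshot(self):
        with self._lock:
            return dict(self._state)


def run_sync(name, state):
    # Placeholder for real work: report 100 rows ten times.
    for _ in range(10):
        state.record(name, 100)
    return name


def run_all(sync_names, threads):
    """Run every sync on a pool of `threads` workers and return results plus totals."""
    state = StateManager()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() yields results in input order, regardless of completion order.
        finished = list(pool.map(lambda n: run_sync(n, state), sync_names))
    return finished, state.snapshot()


finished, totals = run_all(["orders", "stock", "suppliers"], threads=2)
```

Keeping the lock inside the state object means callers cannot forget to take it, which is usually the simplest way to make such a class safe to share across threads.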

02

sql-sop

Maintainer — review queue, releases, contributor funnel

Review and merge community PRs across the rule catalogue. Recent merges: W013 window-without-partition (Prabhu-1409), W019 count-distinct-unbounded (mvanhorn), W011 union-without-all and P005 sqlalchemy-text-fstring (tmchow). Publish to PyPI via Trusted Publishing, maintain governance + security policy, run a v0.7 milestone with public ROADMAP and a scaffold script for new contributors. Two more rule PRs in active review.

03

pr-sop

Creator and maintainer — three releases in 24 hours

Shipped v0.1.0, v0.1.1 (third-party rev: pin false-positive fix), and v0.1.2 (CI-merge-commit tag lookup fix) to PyPI in 24 hours. Full governance, security, contributing, and code-of-conduct documents published.

04

scanapi/scanapi

Active contributor — docstrings PR and pipx install-path PR

Added docstrings to spec_evaluator.py (PR #868, merged), opened the pipx install-path documentation PR (#907), and am triaging follow-on TOC issues for the wiki.

// 07 NOTABLE PRS

Signature Upstream Pull Requests

Merged contributions into projects I use every day.

01

scanapi/scanapi#868

docs: add missing docstrings to spec_evaluator.py · First Contribution

Comprehensive docstrings across the SpecEvaluator class, instance methods, and module-level singledispatch functions. Closed issue #442 (open since 2021).

02

pyOpenSci/python-package-guide#622

docs: CITATION.cff and software-citation guidance

Added Turing Way links and clearer guidance on writing CITATION.cff files for scientific Python packages.

03

dlt-hub/dlt#3830

docs: source count correction in intro

Updated the source count from 5,000 to 8,000+ in the intro docs to match current reality.

04

py-pdf/fpdf2#1805

i18n: Punjabi (pa) tutorial translation

Translated the fpdf2 tutorial into Punjabi to widen the reach of the documentation in South-Asian developer communities.

05

pandas-dev/pandas

docs: clarified str.cat() return type for Index

Documentation clarification on the return-type behaviour of str.cat() when called on an Index versus a Series.

06

chroma-core/chroma

docs: HNSW tuning guide and version compatibility

Added a 220-line HNSW parameter tuning guide and a runtime version-compatibility check between client and server.

// 08 ACTIVITY

Activity

What the contribution graph and language mix look like right now.

[Images: GitHub stats · Top languages · Contribution snake]
// 09 CONTACT

Get in Touch

Open to data engineering, data ops, manufacturing analytics, and on-prem AI roles. Especially interested in roles where the data domain genuinely matters and the system has to keep running on a Sunday night.

// EMAIL

pawankapkoti3889@gmail.com
Best for first contact. Replies same day, weekdays.

// LINKEDIN

linkedin.com/in/pawan-singh-kapkoti
Connect for roles, OSS collaboration, or a manufacturing-data chat.

How I work

Three honest stages — same flow whether the work is a pipeline, a dashboard, or shipping a tool to PyPI.

01

Discovery

Read the source. Read the data. Talk to the people who actually run the line. Confirm the problem is the problem before writing code.

02

Architecture

Sketch the smallest thing that delivers value. Pick boring, well-documented tools. Write the tests first when the contract matters.

03

Delivery

Ship in CI behind a green build. Document the trade-offs. Set up a way for the next person (sometimes me) to understand it without paging me.