PAWAN
I build robust data pipelines, developer tools, and local AI for food manufacturing. Working in Yorkshire, shipping code on the side. I believe compliance shouldn't mean spreadsheets and AI shouldn't require the cloud.
About Me
Yorkshire, UK · MSc Data Analytics · OSS maintainer
// BUILD
I build data pipelines for food manufacturing — incremental ETL from ERP systems, dbt marts with SLO monitoring, and Power BI consumption layers. Four Python tools live on PyPI (sql-sop, sql-sop-mcp, pr-sop, morning-brief), each with a real problem behind it that I hit on the floor before writing a line of code.
// DEPLOY
Local AI for factories. LangGraph agents that talk to MCP servers, ChromaDB and pgvector RAG, Ollama-served models — all on-prem so no batch data leaves the building. A 5-stage SQL validation pipeline sits between the model and the database.
// CONTRIBUTE
I learn tools by reading their source: reverse-engineer the architecture, find the gap, ship the fix. Merged PRs into pandas, ChromaDB, scanapi, dlt, dbt-core, ollama-python, fpdf2, pyOpenSci, and Apache Superset.
Academic Background
Formal study in data analytics, plus the certifications that mattered for the day job.
MSc Data Analytics — Aston University
Coursework across statistical learning, big-data processing (Spark), database systems, and a research dissertation on data quality in healthcare reporting pipelines.
BCA — Bachelor of Computer Applications
Algorithms, OOP, RDBMS, software engineering. Final-year project: a client-server inventory tracker with a JSP front end and a MySQL back end.
Microsoft Certified: Power BI Data Analyst Associate (PL-300)
Modelling, DAX, Power Query M, time intelligence, Import vs DirectQuery. Used daily for the manufacturing reporting layer.
Other certifications
Google Data Analytics Professional Certificate · Azure Data Engineering (DP-203 path) · AWS Cloud Practitioner.
Projects
Things I built that are running in production or deployed publicly.
OpsMind
On-premises AI for food manufacturing. LangGraph 6-node agent converts English to SQL in 5 seconds. MCP server architecture decouples database and doc search as tools. 5-stage SQL validation, runtime-loaded domain docs, structured JSONL audit logging. Runs Gemma 3 12B via Ollama — no data leaves the factory.
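A minimal sketch of what a staged SQL gate between a model and a database can look like. The five stages here are illustrative assumptions, not OpsMind's actual stages, and `validate_sql`, `ALLOWED_TABLES`, and the row cap are hypothetical names. The regex checks are deliberately naive (they would misread keywords inside string literals, for example); a production gate would more likely use a real SQL parser.

```python
import re

# Hypothetical allowlist; a real deployment would load this from config.
ALLOWED_TABLES = {"batches", "temperature_logs"}

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|grant|create)\b",
    re.IGNORECASE,
)
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def validate_sql(sql, max_rows=1000):
    """Return (ok, detail): the safe SQL on success, a reason on failure."""
    stmt = sql.strip().rstrip(";").strip()

    # Stage 1: exactly one non-empty statement.
    if not stmt or ";" in stmt:
        return False, "expected exactly one statement"

    # Stage 2: read-only entry point.
    if not re.match(r"select\b", stmt, re.IGNORECASE):
        return False, "only SELECT queries are allowed"

    # Stage 3: no write or DDL keywords anywhere in the statement.
    if FORBIDDEN.search(stmt):
        return False, "forbidden keyword"

    # Stage 4: every referenced table must be on the allowlist.
    unknown = {t.lower() for t in TABLE_REF.findall(stmt)} - ALLOWED_TABLES
    if unknown:
        return False, "unknown tables: " + ", ".join(sorted(unknown))

    # Stage 5: cap the result size if the model did not.
    if not re.search(r"\blimit\s+\d+\b", stmt, re.IGNORECASE):
        stmt = f"{stmt} LIMIT {max_rows}"

    return True, stmt
```

Ordering the cheap structural checks first means most bad generations are rejected before any table lookup happens, and the rejection reason can be fed back to the model for a retry.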
Manufacturing Compliance Dashboard
BRC/HACCP food safety compliance. MCP server exposes 5 compliance tools for LLM agents. NL query interface for auditors. SLO monitoring (temperature 95%, traceability 90%), z-score anomaly detection, PDF audit reports. A /metrics endpoint exposes the Four Golden Signals.
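The SLO and z-score checks can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not the dashboard's code: it reads "temperature 95%" as "at least 95% of readings fall inside the allowed range", and the function names, limits, and sample data are hypothetical.

```python
from statistics import mean, stdev


def slo_met(readings, low, high, target=0.95):
    """True if the share of readings within [low, high] meets the target."""
    if not readings:
        raise ValueError("no readings to evaluate")
    in_range = sum(low <= x <= high for x in readings)
    return in_range / len(readings) >= target


def zscore_anomalies(readings, threshold=3.0):
    """Return (index, value, z) for readings whose |z| exceeds the threshold."""
    if len(readings) < 2:
        return []
    mu = mean(readings)
    sigma = stdev(readings)  # sample standard deviation
    if sigma == 0:
        return []  # all readings identical, nothing to flag
    scored = ((i, x, (x - mu) / sigma) for i, x in enumerate(readings))
    return [item for item in scored if abs(item[2]) > threshold]


# Hypothetical chiller readings in degrees Celsius, with one excursion.
temps = [3.1, 3.0, 3.2, 2.9, 3.1, 3.0, 3.2, 9.5]
flagged = zscore_anomalies(temps, threshold=2.0)
within_slo = slo_met(temps, low=0.0, high=5.0)
```

With only eight samples, a single outlier cannot reach a z-score of 3 when the sample standard deviation is used, which is why the example lowers the threshold to 2.0; on short windows a robust alternative such as median absolute deviation may be a better fit.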
UK Crime Pipeline
End-to-end data pipeline. Police UK API to PostgreSQL and BigQuery. 6 dbt marts (outcome analysis, YoY trends), 65 tests. Polars-based alternative ingestion. Declarative validation, SLO monitoring, pipeline maturity scorecard.
sql-sop
Fast rule-based SQL linter. 39 rules (5 T-SQL specific), 152 tests, libCST-based injection scanner, inline disable directives, .sql-guard.yml config, SARIF output for GitHub Code Scanning. v0.7 milestone in progress (Performance Rules Pack); three external authors merged. sql-sop-mcp ships the same linter as an MCP server for Claude Desktop / Cursor / ChatGPT. 500+ monthly downloads.
pr-sop
Opinionated PR governance checks. CHANGELOG drift, version consistency between pyproject.toml and __init__.py, stale rev: pins in READMEs. Three config-driven checks turned on via a YAML block. CLI, pre-commit hook, or GitHub Action. First external consumer: sql-sop itself.
morning-brief
Rule-based daily Gmail triage. Zero LLM, read-only OAuth. Classifies the last day of mail into HIGH / MEDIUM / LOW / SPAM via YAML rules, writes a markdown digest, fires a desktop toast. v0.3.0 adds sub-day windows, thread collapse, and preview / why commands.
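A rough sketch of first-match-wins rule triage, with the rules held in Python rather than YAML to keep the example dependency-free. The rule schema, field names, and sample addresses are assumptions for illustration and do not reflect morning-brief's actual config format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    priority: str
    sender_contains: tuple = ()
    subject_contains: tuple = ()

    def matches(self, sender, subject):
        """Case-insensitive substring match on sender or subject."""
        s, subj = sender.lower(), subject.lower()
        return any(k in s for k in self.sender_contains) or any(
            k in subj for k in self.subject_contains
        )


# Order matters: the first matching rule decides the priority.
RULES = [
    Rule("SPAM", subject_contains=("unsubscribe", "limited offer")),
    Rule("HIGH", sender_contains=("@auditor.example",),
         subject_contains=("urgent", "non-conformance")),
    Rule("MEDIUM", subject_contains=("invoice", "meeting")),
]


def classify(sender, subject, rules=RULES, default="LOW"):
    """Return (priority, matched_rule); matched_rule is None for the default."""
    for rule in rules:
        if rule.matches(sender, subject):
            return rule.priority, rule
    return default, None


priority, why = classify("qa@auditor.example", "Site visit next week")
```

Returning the matched rule alongside the label is what makes an explain-style command cheap: the answer to "why was this HIGH?" is simply the rule that fired.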
SQL Ops Reviewer
GitHub Action that auto-reviews .sql files in pull requests using local AI. Catches injection risks, performance anti-patterns, style violations. Posts structured review comments with fix suggestions. One YAML file to set up, runs on the CI runner, zero API keys.
My Stack
Tools I reach for daily, grouped by what they're for.
Languages
Data Engineering
AI / RAG
Web & API
BI & Visualisation
Infra & DevOps
Manufacturing & Compliance Domain
Open Source Contributions
Maintainer or substantive contributor — not just docs typo fixes.
drt-hub/drt
Collaborator on the multi-source data sync engine across three releases. v0.5: 5 destination connectors and the official connector tutorial. v0.6: --threads N parallel orchestration with thread-safe StateManager, 11 parallel-dispatch tests. v0.7: --quiet for CI/cron use cases (approved). Also reviewed the json_columns config PR, where my early-validation suggestion shaped the final implementation.
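The parallel-orchestration pattern can be sketched as a lock-guarded state store shared by a thread pool. This is a generic illustration, not drt's code: the `StateManager` below only borrows the name, and `run_syncs` with its callable-per-sync shape is a hypothetical stand-in for the real dispatch logic.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StateManager:
    """Minimal thread-safe key-value store for per-sync cursors."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = {}

    def update(self, sync_name, cursor):
        with self._lock:
            self._state[sync_name] = cursor

    def snapshot(self):
        # Return a copy so callers never see a dict that is being mutated.
        with self._lock:
            return dict(self._state)


def run_syncs(syncs, threads=4):
    """Run each sync callable in a pool and record the cursor it returns."""
    state = StateManager()

    def _run(item):
        name, sync_fn = item
        state.update(name, sync_fn())

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Consuming the iterator re-raises any exception from a worker.
        list(pool.map(_run, syncs.items()))
    return state.snapshot()


# Hypothetical syncs that each return their final cursor value.
result = run_syncs({"orders": lambda: 120, "customers": lambda: 45}, threads=2)
```

Keeping every read and write behind one lock is the simplest correct option; finer-grained locking only pays off if state updates become a measurable bottleneck.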
sql-sop
Review and merge community PRs across the rule catalogue. Recent merges: W013 window-without-partition (Prabhu-1409), W019 count-distinct-unbounded (mvanhorn), W011 union-without-all and P005 sqlalchemy-text-fstring (tmchow). Publish to PyPI via Trusted Publishing, maintain governance + security policy, run a v0.7 milestone with public ROADMAP and a scaffold script for new contributors. Two more rule PRs in active review.
pr-sop
Shipped v0.1.0, v0.1.1 (third-party rev: pin false-positive fix), and v0.1.2 (CI-merge-commit tag lookup fix) to PyPI within 24 hours. Full governance, security, contributing, and code-of-conduct documents published.
scanapi/scanapi
Added docstrings to spec_evaluator.py (PR #868, merged), opened the pipx install-path documentation PR (#907), and am triaging follow-on TOC issues for the wiki.
Signature Upstream Pull Requests
Merged contributions into projects I use every day.
scanapi/scanapi#868
Comprehensive docstrings across the SpecEvaluator class, instance methods, and module-level singledispatch functions. Closed issue #442 (open since 2021).
pyOpenSci/python-package-guide#622
Added Turing Way links and clearer guidance on writing CITATION.cff files for scientific Python packages.
dlt-hub/dlt#3830
Updated the source count from 5,000 to 8,000+ in the intro docs to match current reality.
py-pdf/fpdf2#1805
Translated the fpdf2 tutorial into Punjabi to widen the reach of the documentation in South-Asian developer communities.
pandas-dev/pandas
Documentation clarification on the return-type behaviour of str.cat() when called on an Index versus a Series.
chroma-core/chroma
Added a 220-line HNSW parameter tuning guide and a runtime version-compatibility check between client and server.
Activity
What the contribution graph and language mix look like right now.
Get in Touch
Open to data engineering, data ops, manufacturing analytics, and on-prem AI roles. Especially interested in roles where the data domain genuinely matters and the system has to keep running on a Sunday night.
How I work
Three honest stages — same flow whether the work is a pipeline, a dashboard, or shipping a tool to PyPI.
Discovery
Read the source. Read the data. Talk to the people who actually run the line. Confirm the problem is the problem before writing code.
Architecture
Sketch the smallest thing that delivers value. Pick boring, well-documented tools. Write the tests first when the contract matters.
Delivery
Ship in CI behind a green build. Document the trade-offs. Set up a way for the next person (sometimes me) to understand it without paging me.