clarke
principal engineer / head of sre
About
Engineering leader based in London with 15+ years in the industry. Currently heading up Site Reliability Engineering at Man Group, one of the world's largest publicly traded hedge funds. Started out in sysadmin, grew through DevOps, and now focused on building resilient infrastructure at scale.
Outside of work I build things that scratch an itch — sports bots, game addons, automation tools. Mostly in Go, Python, and TypeScript. Gamer and football obsessive.
Career
Principal Engineer, Head of SRE · Man Group · 2025 – present
- Created the SRE function from scratch to address firm-wide stability concerns. Built the team and the entire platform in 12 months
- Built 15 MCP servers as the connective layer across the firm's infrastructure — Active Directory, Elasticsearch, Fleet, Metrics, Logs, Opsgenie, ServiceNow, Slack, TeamCity, and more. These became the foundation for everything below, giving AI tools structured access to the full operational estate
- Built Primer, a pre-release review system that uses MCP servers to concurrently assemble linked Jira tickets, Bitbucket PRs, Terraform plan analysis, Docker image diffs, Octopus Deploy promotion validation, and manifest parsing for any given change request. Deployed as both an MCP server and a standalone dashboard with embedded AI chat. The goal: catch issues before they become incidents
- Created Recap, an AI post-incident reporting tool built on top of the MCP layer. Scrapes Slack PRB channels, generates AI summaries, and stores them in Elasticsearch. A Slack bot auto-detects PRB numbers and posts condensed summaries. Quarterly and YTD analytics feed directly into the SRE dashboard for trend analysis
- Built Raven — AI-powered Slack bot that ties it all together. Routes queries to the right MCP servers dynamically, maintains thread awareness and Redis-backed memory. Includes a DR runbook UI with live session state, phase-based exercises, and auto-generated summaries. The interface for engineers to query infrastructure in natural language
- Led the firm's first FinOps initiative, costing an estate of ~5,000 servers. Built workload analytics and AI usage dashboards
- Built the flagship SRE dashboard covering ServiceNow (scheduled changes, AI-powered risk scoring, incidents, problems, PTASKs), Opsgenie (alert fatigue analysis, P1 reduction tracking, engineer load distribution), and Recap analytics
- Forecast storage capacity using Facebook Prophet across 7 storage platforms, with 30+ models running three forecast horizons
- Own the production Elasticsearch cluster (~170TB) and its ILM policies
- Ported operational tooling from PowerShell to Python — 50+ scripts across 17 technology categories
- Positioning Claude Code as a debugging sidekick for engineers across the firm
Principal Engineer, Head of DevOps · Man Group · 2020 – 2025
- Led or contributed to response for every major industry vulnerability during tenure — Heartbleed, Spectre/Meltdown (provided technical direction on hyperthreading risk vs performance impact, advising against blanket disablement based on our threat profile), Log4Shell (patched team-owned systems), and CrowdStrike (Jul 2024 — led firm-wide Windows recovery across the estate)
- Introduced Terraform at Man Group — VM provisioning via vSphere, Keycloak group management, AD service accounts. Groundwork led to a dedicated IaC team being created in 2022
- Scaled the team's platform to 1,000+ Octopus deployment projects, 300+ operational runbooks, and 1,300+ TeamCity build configurations across 600+ hosts and 15 platform groups
- Designed devops-metadata, a centralised schema defining hosts, DNS aliases, service accounts, SPNs, and HA status across the estate. It acted as a formal contract between the DevOps team and developers for what was being built, and as a data source for other internal systems to consume
- Designed and rolled out event_streams — a platform for capturing webhook events from Jira, Octopus Deploy, ServiceNow, TeamCity, and more, persisting them to Kafka so downstream applications could react in real time
- ELK clusters grew to ~170TB, ingesting ~140M docs/day. Metrics and logging infrastructure each processed ~60k events per second
- Delivered new Rosa and OMS environments entirely via IaC. Managed Server 2012 migrations and DR plans for all Tier 1 Windows applications
- Team provisioned and maintained HAProxy, Grafana, Kafka, Redis, Telegraf, Filebeat, and Linux bootstrap roles via Ansible
- Team built CI/CD pipelines and deployment infrastructure for Cortex on Azure
- Began ArgoCD exploration and the transition toward SRE
Head of DevOps · Man Group · 2016 – 2020
- Recruited and grew the DevOps team, created technical assessment frameworks, and built the operational tooling estate — a PowerShell module ecosystem for Octopus, TeamCity, and AD automation, ELK packaging for Windows, and the scripting infrastructure that held everything together
- The toolchain established in the DevOps Engineer years became the standard — 15+ application platform groups onboarded, including Rosa, OMS, Mole, Tomahawk, and Charles River. Releases moved out of developer hands entirely and into proper, repeatable pipelines
- Owned the MiFID II platform for the Mole trading system end-to-end — Kafka (6-environment topology, 5-node production cluster, Burrow consumer lag monitoring) and MongoDB. Built tooling to manage Kafka topics, Mongo collections, and schema migrations as code, giving developers a repeatable deployment pipeline for their data infrastructure
- Introduced Telegraf, InfluxDB, and Grafana as a unified observability stack — it spread firm-wide on its own merit. Designed the HAProxy layer providing active/active load balancing across InfluxDB, Elasticsearch, Logstash, APM Server, and TeamCity
- Re-platformed Rosa from Server 2008R2 to 2012/2016 and built out full testing environments spanning Prod, Dev, QA, UAT, SIT, Staging, and DR
- Team upgraded Elasticsearch through 6.x to 7.x, hardening it with OpenDistro security and SSL
- Led migration from SVN to Git and GitFlow across the team's consumer applications, continuing the modernisation started in the DevOps Engineer years
DevOps Engineer · Man Group · 2014 – 2016
- Established the DevOps function at Man Group and introduced the toolchain that would define the next decade — replacing SVN with Git, CruiseControl with TeamCity, and manual desktop builds and hand-rolled releases with Octopus Deploy. Migrated SVN Externals to NuGet packaging along the way
- Rosa was the first platform onboarded. By 2015 it was fully on the new toolchain and IT Operations owned all releases instead of dev teams
- Tomahawk — the new Electronic Trading platform for GLG — was built out on the toolchain from day one. New applications for the Numeric acquisition were handled the same way, with close collaboration with developers on infrastructure and guidance throughout
- Introduced the ELK stack for centralised logging
- Managed an existing ITRS Geneos monitoring setup
Systems Architect · eSpares · 2012 – 2014
- Led systems development for a complete overhaul of the eCommerce platform — both software and hardware. Launch achieved 100% uptime and a 35% reduction in page load times
- Developed and built on a genuine everything-as-code philosophy — before Terraform or cloud-init existed as mainstream concepts. VM provisioning, machine bootstrapping, Icinga monitoring, GPOs, application deployments, and switch configs were all version-controlled and automatically applied. Home-rolled tooling filled the gaps where nothing off-the-shelf existed yet
- Hardware estate included Dell PowerConnect switching, Dell EqualLogic SAN, Cisco ASA firewalls, and Netscaler load balancers — all configured and managed as code
- Led a full P2V migration to VMware and managed two datacentre migrations — one to a new colo facility, one to Birmingham including staff training
- Configured two-way domain trusts and Cisco ASA remote access, and provided remote working capability across the company
- Worked closely with the development team on CI pipelines and deployment tooling using TeamCity and Octopus Deploy
- Led a small engineering team and oversaw the handover during company buyout and wind-down
Systems Admin · eSpares · 2010 – 2012
- Responsible for the uptime of websites, internal applications, warehouse, mail, and phone systems across multiple sites in London
- Supported ~120 staff across warehouse and call centre operations
- Managed Windows Server estate, VMware virtualisation, Mitel phone systems, and Postfix mail
- Handled antivirus management, desktop patching, and software deployments