Orchestrating Machine Learning Models on Kubernetes at Capital One

Customer Challenge

A major banking institution ran its risk, fraud, and credit line increase machine learning (ML) models on virtual machines (VMs) in an on-premises data center. The setup had no on-demand code build and deployment capability, and the model training and production environments were not at parity. These constraints hurt model accuracy, leading to bias and misidentified patterns during recurring cycles. The client needed an enterprise-grade, end-to-end automated data science solution that could run on premises and across multiple public clouds while supporting every stage of the ML lifecycle.

Navitas Solution

Navitas led the development of a cloud-hosted, Kubernetes-based platform built entirely with open-source technologies. To accelerate the ML lifecycle, we implemented MLOps practices and established automated pipelines spanning model generation, orchestration, deployment, health monitoring, diagnostics, governance, and business metrics. The solution gave multiple application teams a consistent way to deploy, maintain, and monitor ML models. The platform was deployed on AWS EC2 instances, which allowed Kubernetes clusters with different feature sets to be created across different AWS accounts. Our team engineered a stable, resilient ML model serving capability within the Kubernetes clusters, with optional, customizable features. We continuously monitored model performance and stability, using metrics to detect model drift. Navitas also trained data scientists on Docker, optimized data sharing and cloud costs, and facilitated feature-based dataset access across teams.
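The case study does not say which drift metric the platform used. As an illustration only, the sketch below shows one common way to quantify drift in model scores, the Population Stability Index (PSI), which compares the binned distribution of a baseline sample with that of a recent sample. The function name, the bin count, and the smoothing constant are assumptions made for this example, not details of the Navitas implementation.

```python
import math
from typing import Sequence


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    bins: int = 10,      # assumed bin count; tune for the score range
    eps: float = 1e-6,   # assumed floor that avoids log(0) for empty bins
) -> float:
    """Return the PSI between a baseline sample and a recent sample."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    # Bin edges are derived from the baseline so both samples share them.
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def bin_fractions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range are clamped to the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    # PSI is the sum over bins of (actual - expected) * ln(actual / expected).
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


if __name__ == "__main__":
    baseline = [i / 100 for i in range(100)]       # hypothetical training scores
    recent = [0.5 + i / 200 for i in range(100)]   # hypothetical shifted scores
    print(population_stability_index(baseline, baseline))
    print(population_stability_index(baseline, recent))
```

In a monitoring pipeline, a value like this would typically be computed on a schedule and exported as a metric. Cutoffs such as 0.1 and 0.25 are commonly quoted rules of thumb for PSI, but the right alerting threshold is an assumption to validate per model.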

Results

By implementing a Kubernetes-based platform and MLOps practices, Navitas addressed the banking institution's core challenges and transformed its ML operations. The client now has on-demand code building and deployment, which eliminated previous bottlenecks, and full environment parity between model training and production. The improvement in infrastructure agility has increased model accuracy: the bias and pattern misidentification previously observed during recurring cycles have been reduced by over 30%, leading to more reliable risk, fraud, and credit line increase decisions. The automated, end-to-end data science solution has also cut the time to deploy new or updated ML models from weeks to hours, improving operational efficiency and responsiveness. Because the AWS-deployed platform is flexible, the institution can scale and operate models across hybrid cloud environments and support every stage of the ML lifecycle, strengthening its competitive edge and risk management capabilities.
