Platform Perseverance: Taming 1,000 Kubernetes Clusters

Introduction

Kubernetes has become the de facto standard for container orchestration. For organizations with complex infrastructure demands, managing on the order of a thousand Kubernetes clusters is both a technical challenge and an opportunity for innovation. This article explores how Pik, a global financial institution, scaled Kubernetes to roughly 1,000 clusters across diverse environments, using CNCF technologies to achieve stability, scalability, and developer productivity.

Core Challenges and Transformation

Pik, a Swiss private bank with 5,000 employees and 30 global offices, faced significant operational hurdles after adopting Kubernetes in 2018. The initial deployment ran Kubernetes v1.12 on bare-metal infrastructure, and the organization encountered challenges such as:

Upgrade complexity: Each upgrade required full-scale testing and careful version management, which drove up costs and caused delays.
Shared cluster risks: OS upgrades caused widespread DNS resolution failures, impacting critical applications.
Developer bottlenecks: While developers rapidly adopted Kubernetes, platform teams struggled to support custom Operators, limiting innovation.

To address these issues, Pik shifted toward a micro-segmentation strategy, requiring each environment to operate on its own Kubernetes cluster. This necessitated the creation of approximately 1,000 clusters, demanding a rethinking of automation, release management, and platform design.

Platform Design Principles

The transformation was guided by four foundational principles:

Platform as a Product: Developers became co-creators, with platform features evolving through continuous feedback to enhance productivity and usability.
Stability and Standardization: Consistent environments, deployments, and configurations ensured reliability across teams.
Day Two Operations: Long-term maintenance processes were embedded into platform design, including release management and iterative improvements.
Eat Your Own Dog Food: Operations teams used the same tools they recommended, ensuring alignment between user experience and platform capabilities.

Technical Implementation and Architecture

Cluster Model Evolution

Pik transitioned from a shared cluster model to an isolated cluster per product, enabling granular control and reducing cross-environment dependencies. This shift required robust automation and a scalable infrastructure.

Tooling and CNCF Integration

Custom Operators: Defined via CRDs, these Operators automated state reconciliation and supported deployments across AWS and VMware, including hybrid setups.
Argo CD: Enabled declarative, multi-cluster deployments with version-controlled configurations.
CNCF Ecosystem: Leveraged CNCF components for networking, security, and monitoring, ensuring consistency with industry standards.
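
The state reconciliation that the custom Operators perform can be pictured as a compare-and-act step: read the desired state, read the actual state, and emit whatever actions close the gap. The Python sketch below illustrates only that idea and is not Pik's implementation. The ClusterSpec fields and the reconcile function are hypothetical names chosen for the example, and a real Operator would watch custom resources through the Kubernetes API instead of receiving plain objects.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ClusterSpec:
    """State of one cluster (hypothetical fields, for illustration only)."""
    name: str
    provider: str      # e.g. "aws" or "vmware"
    version: str       # Kubernetes version string
    node_count: int


def reconcile(desired: ClusterSpec, actual: Optional[ClusterSpec]) -> List[str]:
    """Return the actions needed to move `actual` toward `desired`.

    An empty list means the cluster has already converged.
    """
    if actual is None:
        # Nothing exists yet, so the only action is to create the cluster.
        return [f"create {desired.name} on {desired.provider}"]

    actions: List[str] = []
    if actual.version != desired.version:
        actions.append(
            f"upgrade {desired.name} {actual.version} -> {desired.version}"
        )
    if actual.node_count != desired.node_count:
        actions.append(
            f"scale {desired.name} {actual.node_count} -> {desired.node_count}"
        )
    return actions
```

Because the function is idempotent (a converged cluster yields no actions), it can be run repeatedly on a timer or on every change event, which is the property that makes the Operator pattern safe to automate across many clusters.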

Cluster Lifecycle Management

A custom KubeCtl tool was developed to manage cluster lifecycle operations. In parallel, the CI/CD pipelines evolved from Bash scripts and Helm 2 to a unified Argo CD framework that supports both cloud and on-premises environments.
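
As an illustration of what a declarative, per-cluster deployment might look like, the sketch below generates one Argo CD Application manifest per cluster from a single template. The repository URL, path, cluster names, and the "platform" namespace are placeholders assumed for the example, not details of Pik's setup. The manifest fields follow the Argo CD Application resource (argoproj.io/v1alpha1), and the output is JSON, which is also valid YAML.

```python
import json
from typing import Dict, List


def argo_application(cluster: str, repo_url: str, path: str,
                     revision: str = "main") -> Dict:
    """Build an Argo CD Application manifest targeting one cluster.

    `cluster` is assumed to be the name under which the cluster is
    registered in Argo CD (used in spec.destination.name).
    """
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"{cluster}-platform", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "targetRevision": revision,
                "path": path,
            },
            "destination": {"name": cluster, "namespace": "platform"},
            # Automated sync keeps the cluster aligned with Git.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


def applications_for(clusters: List[str], repo_url: str, path: str) -> List[Dict]:
    """One manifest per cluster, all pointing at the same versioned config."""
    return [argo_application(c, repo_url, path) for c in clusters]


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    manifests = applications_for(
        ["payments-dev", "payments-prod"],
        "https://git.example.com/platform/config.git",
        "base",
    )
    print(json.dumps(manifests, indent=2))
```

Generating manifests from one template is a design choice that keeps every cluster on the same versioned configuration, so a change is reviewed once in Git and then rolled out everywhere by the sync mechanism.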

Platform Migration and Optimization

The migration spanned two years, with phased rollouts to minimize disruption. Key strategies included:

Developer Support: A "Genius Bar" model provided hands-on assistance, while feature flags enabled controlled rollouts of new capabilities.
Documentation and Training: High-quality documentation and onboarding programs ensured smooth adoption.
CI/CD Modernization: Transitioned to a declarative, multi-cluster pipeline, unifying developer experiences across environments.
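
The source does not describe how the feature flags mentioned above were implemented. One common approach, sketched here under that assumption, is a percentage rollout: each cluster is hashed into a stable bucket from 0 to 99, and a flag is on for a cluster when its bucket falls below the current rollout percentage. The function name, flag name, and cluster names are hypothetical.

```python
import hashlib


def flag_enabled(flag: str, cluster: str, rollout_percent: int) -> bool:
    """Decide whether `flag` is on for `cluster` at the given rollout level.

    The bucket is derived from a SHA-256 hash of the flag and cluster
    name, so the decision is deterministic and needs no stored state.
    """
    digest = hashlib.sha256(f"{flag}:{cluster}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable value in 0..99
    return bucket < rollout_percent


if __name__ == "__main__":
    clusters = [f"cluster-{i}" for i in range(10)]
    for percent in (0, 25, 100):
        enabled = [c for c in clusters if flag_enabled("new-ingress", c, percent)]
        print(percent, len(enabled))
```

Because a cluster's bucket never changes, raising the percentage only adds clusters to the enabled set and never removes one, which keeps a gradual rollout predictable.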

Platform Advantages

Efficient Upgrades: Independent cluster upgrades reduced risk, with monthly version releases validated through staged testing.
Scalability and Reliability: The platform scaled to roughly 1,000 clusters, with per-cluster isolation improving overall stability.
Security and Observability: Declarative configurations and standardized tools improved security and operational visibility.
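
Independent, staged upgrades can be sketched as a rollout in waves: upgrade a small group of low-risk clusters first, and only continue to the next group if every cluster in the current one succeeds. The code below is a minimal illustration under that assumption. The wave layout, the cluster names, and the upgrade callback are hypothetical, and the source does not specify how Pik orders or gates its releases.

```python
from typing import Callable, Dict, List


def rollout_in_waves(waves: List[List[str]],
                     upgrade: Callable[[str], bool]) -> Dict[str, List[str]]:
    """Upgrade clusters wave by wave, halting after the first failing wave.

    `upgrade` is a caller-supplied function that upgrades one cluster and
    returns True on success. Every cluster in a wave is attempted; if any
    of them fails, all later waves are skipped.
    """
    result: Dict[str, List[str]] = {"upgraded": [], "failed": [], "skipped": []}
    halted = False
    for wave in waves:
        if halted:
            result["skipped"].extend(wave)
            continue
        for cluster in wave:
            if upgrade(cluster):
                result["upgraded"].append(cluster)
            else:
                result["failed"].append(cluster)
                halted = True  # finish this wave, then stop
    return result


if __name__ == "__main__":
    waves = [["dev-1", "dev-2"], ["staging-1"], ["prod-1", "prod-2"]]
    # Simulated upgrade that fails only on staging-1.
    print(rollout_in_waves(waves, lambda c: c != "staging-1"))
```

In this simulated run the failure in the staging wave leaves both production clusters untouched, which is the risk reduction that per-cluster upgrades are meant to provide.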

Lessons Learned

A Single Shared Cluster Is Not Sustainable: Fine-grained, per-environment cluster management is essential for long-term reliability.
Built-in Testing: Embedding testing mechanisms into the platform reduces risk and accelerates innovation.
Developer Collaboration: Platform evolution must align with developer needs, fostering a culture of continuous improvement.
User Experience First: Infrastructure advancements must prioritize usability and developer workflows.

Conclusion

Managing 1,000 Kubernetes clusters requires a holistic approach that balances technical rigor with developer-centric design. By embracing CNCF standards, adopting declarative automation, and fostering collaboration between infrastructure and development teams, Pik achieved a scalable, secure, and resilient platform. This case underscores the importance of treating the platform as a product, with continuous feedback and iterative improvements driving long-term success.