Infosys Topaz Fabric
Project URL: https://www.infosys.com/services/topaz-fabric.html
Architecting the DevOps Foundation for Infosys Topaz Fabric: Building an Enterprise AI Platform
Enterprise AI transformation demands more than powerful models: it requires robust, scalable infrastructure that can support AI agents, services, and workflows at scale. As the DevOps Architect for Infosys Topaz Fabric, I designed and implemented the foundational infrastructure that powers this next-generation composable AI platform.
What is Infosys Topaz Fabric?
Infosys Topaz Fabric is a composable stack of AI agents, services, and models designed to accelerate IT and business service delivery for enterprises. It's a modular, integrated platform that delivers:
- AI-first operations with intelligent agents that learn, predict, and self-heal
- Transformation services for AI-driven modernization and engineering
- Quality engineering with automated validation and testing
- Cybersecurity embedded by design across the enterprise
The DevOps Challenge
Building infrastructure for an AI agent platform at enterprise scale presented unique challenges:
1. Multi-Tenancy at Scale
- Support hundreds of enterprise clients with isolated, secure environments
- Enable multiple AI agents running concurrently across different business contexts
- Ensure resource isolation while maintaining cost efficiency
2. Dynamic Workload Management
- Handle unpredictable AI inference workloads with varying compute requirements
- Support both deterministic digital workers and resource-intensive AI workers
- Scale automatically based on agent orchestration demands
3. Platform Flexibility
- Abstract models, prompts, and tools for seamless integration of emerging AI technologies
- Support multiple LLM providers (OpenAI, Claude, custom SLMs) without vendor lock-in
- Enable rapid deployment of new AI capabilities without infrastructure rebuilds
4. Production-Grade Reliability
- Achieve 99.9%+ uptime for mission-critical enterprise operations
- Implement self-healing capabilities and automated incident response
- Ensure data consistency across distributed AI agent workflows
The Architecture
Kubernetes-Native Architecture
At the heart of Topaz Fabric is a sophisticated Kubernetes-based platform:
Multi-Cluster Strategy
- Development Cluster: Rapid iteration for AI agent development
- QA/Testing Cluster: Automated validation and performance testing
- Pre-Production Cluster: Staging environment mirroring production
- Production Clusters: Multi-region deployment for high availability and disaster recovery
- Specialty Clusters: GPU-enabled clusters for AI model inference using vLLM
- Tenant-specific namespaces with strict RBAC policies
- Network policies enforcing inter-service communication rules
- Resource quotas preventing noisy neighbor issues
- Secure service-to-service communication with mutual TLS (mTLS)
- Advanced traffic management: canary deployments, A/B testing, blue-green rollouts
- Distributed tracing for AI agent workflow observability
- Circuit breaking and retry logic for resilient inter-service calls
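The tenant isolation described above can be sketched with standard Kubernetes objects. The names (`tenant-acme`) and quota values below are illustrative placeholders, not actual Topaz Fabric identifiers:

```yaml
# Hypothetical tenant namespace with a resource quota and a same-namespace-only
# network policy; all names and limits are illustrative.
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-acme
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-acme-quota
  namespace: tenant-acme
spec:
  hard:
    requests.cpu: "20"        # caps aggregate CPU requests, limiting noisy neighbors
    requests.memory: 64Gi
    pods: "200"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace-only
  namespace: tenant-acme
spec:
  podSelector: {}             # applies to every pod in the namespace
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector: {}     # allow traffic only from pods in this namespace
```

The empty `podSelector` on the policy is what makes it namespace-wide; without any NetworkPolicy, Kubernetes allows all cross-namespace traffic by default.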
CI/CD Pipeline Architecture
Designed fully automated pipelines enabling rapid, safe deployments:
Build Pipeline
- Multi-stage Docker builds with layer caching optimization
- Security scanning: vulnerability detection, secret scanning, license compliance
- Artifact versioning and storage in container registry
- Automated unit and integration testing
- GitOps-based deployments using declarative manifests
- Automated smoke tests and health checks
- Automated rollback on deployment failures
- Multiple production deployments daily with zero downtime
- Feature flags for gradual feature enablement
- Automated change management and audit trails
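A GitOps deployment of the kind listed above can be expressed as a declarative application definition. The article doesn't name the GitOps tool, so Argo CD is used here purely as an example; the repository URL and paths are placeholders:

```yaml
# Illustrative Argo CD Application; the tool choice, repo URL, and paths
# are assumptions for this sketch, not Topaz Fabric specifics.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: agent-runtime
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/agent-runtime.git
    targetRevision: main
    path: deploy/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: agent-runtime
  syncPolicy:
    automated:
      prune: true      # delete cluster resources removed from Git
      selfHeal: true   # revert manual drift back to the Git-declared state
```

With `selfHeal` enabled, the Git repository stays the single source of truth, which also gives the automated audit trail mentioned above for free via commit history.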
Autoscaling Strategy
Built intelligent scaling for AI workloads:
Knative Integration
- Event-driven autoscaling for AI agent invocations
- Scale-to-zero for idle agents, reducing infrastructure costs
- Burst capacity for sudden workload spikes
- Request-based and concurrency-based scaling metrics
- CPU/memory-based scaling for deterministic services
- Custom metrics (queue depth, request latency) for intelligent scaling
- Predictive scaling using historical workload patterns
- Automatic node provisioning based on pending pod requirements
- GPU node pools for AI inference workloads
- Cost-optimized instance selection
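The Knative scale-to-zero and concurrency-based scaling described above are configured through annotations on the service's revision template. The agent name and image below are placeholders:

```yaml
# Sketch of a Knative Service for a hypothetical AI agent endpoint.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: summarizer-agent
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"     # scale to zero when idle
        autoscaling.knative.dev/max-scale: "50"    # cap burst capacity
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "10"       # ~10 in-flight requests per pod
    spec:
      containers:
        - image: registry.example.com/agents/summarizer:1.4.0
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
```

Setting `min-scale: "0"` is what eliminates the cost of idle agents, at the price of a cold-start delay on the first request after scale-down.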
Observability Stack
Comprehensive monitoring and logging for AI operations:
Metrics & Monitoring (Prometheus + Grafana)
- Infrastructure metrics: CPU, memory, disk, network across all clusters
- Application metrics: request rates, latencies, error rates
- AI-specific metrics: inference time, model performance, agent success rates
- Custom dashboards for platform health, tenant usage, and cost analysis
- Aggregated logs from all services, AI agents, and infrastructure components
- Structured logging with correlation IDs for tracing agent workflows
- Log-based alerting for anomaly detection
- Compliance and audit logging
- End-to-end tracing of AI agent execution flows
- Performance bottleneck identification
- Root cause analysis for failed agent tasks
- Proactive alerts based on SLOs (Service Level Objectives)
- Integration with PagerDuty for on-call management
- Automated runbooks for common incident scenarios
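An SLO-based alert of the kind described above might look like the following Prometheus rule. The metric name, threshold, and labels are assumptions for the sketch:

```yaml
# Illustrative Prometheus alerting rule tied to a p99 latency SLO.
groups:
  - name: agent-slo
    rules:
      - alert: AgentLatencySLOBreach
        expr: |
          histogram_quantile(0.99,
            sum(rate(agent_request_duration_seconds_bucket[5m])) by (le, agent)
          ) > 2
        for: 10m                 # sustained breach only, to avoid flapping pages
        labels:
          severity: page         # routes to the on-call rotation (e.g. PagerDuty)
        annotations:
          summary: "p99 latency for {{ $labels.agent }} above 2s for 10m"
```

Alerting on a quantile over a sustained window, rather than on single slow requests, keeps pages aligned with the SLO instead of with noise.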
Security & Compliance
Embedded security throughout the infrastructure:
Zero Trust Architecture
- No implicit trust between services
- Mutual TLS for all inter-service communication
- Regular certificate rotation
- HashiCorp Vault integration for dynamic secrets
- Kubernetes secrets encrypted at rest
- API key rotation policies
- Audit logging for all infrastructure changes
- Policy enforcement using Open Policy Agent (OPA)
- Regular security scanning and penetration testing
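With Istio (listed in the technologies below), mesh-wide mutual TLS of the kind described here is a single policy object:

```yaml
# Mesh-wide STRICT mTLS: a PeerAuthentication named "default" in the Istio
# root namespace rejects all plaintext service-to-service traffic.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
```

Because sidecar certificates are issued and rotated automatically by Istio's control plane, this also covers the regular certificate rotation mentioned above without per-service configuration.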
Operational Excellence
Self-Healing Capabilities
- Automated pod restarts on health check failures
- Node failure detection and workload migration
- Database backup and disaster recovery automation
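The automated restarts on health check failures rely on standard Kubernetes probes. Paths, ports, and names below are illustrative:

```yaml
# Liveness probes trigger container restarts; readiness probes remove
# unhealthy pods from Service endpoints. Endpoints here are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: digital-worker
spec:
  replicas: 3
  selector:
    matchLabels: { app: digital-worker }
  template:
    metadata:
      labels: { app: digital-worker }
    spec:
      containers:
        - name: worker
          image: registry.example.com/workers/digital:2.1.0
          livenessProbe:              # failure -> kubelet restarts the container
            httpGet: { path: /healthz, port: 8080 }
            initialDelaySeconds: 10
            periodSeconds: 15
          readinessProbe:             # failure -> pod stops receiving traffic
            httpGet: { path: /ready, port: 8080 }
            periodSeconds: 5
```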
Cost Optimization
- Right-sizing recommendations based on actual usage
- Reserved instance planning for predictable workloads
- Spot instance utilization for non-critical workloads
- Chargeback reporting per tenant for cost transparency
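Steering non-critical work onto spot capacity can be done with taints and tolerations. The taint key below follows the AKS convention (the platform lists Azure among its technologies), but the exact key and workload are assumptions:

```yaml
# Hypothetical batch job pinned to spot nodes; the AKS spot taint
# (kubernetes.azure.com/scalesetpriority=spot:NoSchedule) is assumed.
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.azure.com/scalesetpriority: spot
      tolerations:
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule     # only tolerating pods can land on spot nodes
      containers:
        - name: report
          image: registry.example.com/jobs/nightly-report:0.3.0
      restartPolicy: OnFailure   # spot evictions simply rerun the job
```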
Multi-Environment Management
Successfully managing 7 Kubernetes clusters across:
- Development and experimentation environments
- QA and automated testing environments
- Pre-production staging
- Multi-region production clusters
Key Achievements
Performance Metrics
- 99.95% Platform Uptime: Exceeding enterprise SLA requirements
- Sub-second Deployment: GitOps-based continuous delivery
- 50% Cost Reduction: Through intelligent autoscaling and resource optimization
- 10x Faster Time-to-Market: For new AI agent capabilities
Scale Milestones
- Supporting 100+ enterprise clients on shared infrastructure
- Running 1000+ concurrent AI agents across various business contexts
- Processing millions of AI agent invocations daily
- Managing 7 production-grade Kubernetes clusters across environments
Developer Velocity
- Reduced infrastructure provisioning time from weeks to hours
- Enabled self-service deployments for AI agent developers
- Created reusable infrastructure templates accelerating new project onboarding
Future Roadmap
We're continuously evolving the platform:
- Multi-Cloud Strategy: Expanding beyond a single cloud provider for resilience
- FinOps Maturity: Enhanced cost allocation, forecasting, and optimization
Building the DevOps foundation for Infosys Topaz Fabric has been a rewarding journey in architecting enterprise-scale AI infrastructure. The platform now powers intelligent operations, transformations, and services for enterprises globally, demonstrating that robust DevOps practices are the backbone of successful AI adoption.
The key takeaway: AI transformation is as much about infrastructure excellence as it is about model intelligence. Without a solid DevOps foundation, even the most powerful AI capabilities cannot reach production at scale.
Technologies: Kubernetes, Docker, Istio, Knative, Jenkins, GitLab CI/CD, Prometheus, Grafana, Loki, ELK, Azure, vLLM, GitOps