Documentation
Everything you need to deploy and operate ITOps V4.
Getting Started
ITOps V4 consists of three components: Core (Go backend API), UI (Vue.js frontend), and Agent (K8s operator). Deploy the platform with Helm, then install agents on each cluster.
Quick Deploy
# 1. Add ITOps Helm repo
helm repo add itops https://charts.mlops.hu
helm repo update
# 2. Deploy ITOps platform
helm install itops itops/itops -n itops --create-namespace
# 3. Deploy agent on each K8s cluster
helm install itops-agent itops/itops-agent \
  --set node.id="myorg/myplatform/prod/cluster1" \
  --set itops.url="https://demo-api.mlops.hu" \
  -n itops --create-namespace
# 4. Verify
kubectl get pods -n itops
Architecture
Multi-agent architecture with 5-level hierarchy: organization/platform/environment/cluster/service.
K8s Clusters (N agents)          ITOps Platform
+------------------+             +------------------+
| Agent            |---sync----->| Core (Go API)    |
| - ConfigMap watch|    5 min    | - GraphQL        |
| - Pod monitor    |             | - REST API       |
| - Status report  |             | - SLA Aggregator |
+------------------+             +--------+---------+
                                          |
                                 +--------+---------+
                                 | PostgreSQL       |
                                 | Redis            |
                                 +------------------+
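The node ID passed to the agent (for example `myorg/myplatform/prod/cluster1`) appears to carry the first four hierarchy levels, with the service level added per service. A minimal sketch of splitting such an ID, assuming exactly four slash-separated segments; `parse_node_id` is a hypothetical helper, not part of ITOps:

```bash
#!/usr/bin/env bash
# Hypothetical helper: split a node ID into its hierarchy levels.
# Assumes the ID has exactly four non-empty, slash-separated segments.
parse_node_id() {
  local id="$1" org platform environment cluster extra
  IFS='/' read -r org platform environment cluster extra <<< "$id"
  if [[ -z "$org" || -z "$platform" || -z "$environment" || -z "$cluster" || -n "$extra" ]]; then
    echo "invalid node id: $id" >&2
    return 1
  fi
  printf 'organization=%s\nplatform=%s\nenvironment=%s\ncluster=%s\n' \
    "$org" "$platform" "$environment" "$cluster"
}

parse_node_id "myorg/myplatform/prod/cluster1"
```

Validating the ID before running `helm install` catches a missing or extra segment early, before an agent registers under the wrong branch of the hierarchy.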
Agent Deploy
The agent watches K8s resources and reports service status every 30 seconds. It discovers services from ConfigMaps labeled with itops.io/config: "true".
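To preview which ConfigMaps an agent would pick up, the same label can be used as a `kubectl` selector. This is only a sketch: `list_itops_configmaps` is a hypothetical wrapper, and it assumes the agent looks across all namespaces, which the text above does not state.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: list ConfigMaps carrying the discovery label.
# Assumption: discovery spans all namespaces.
list_itops_configmaps() {
  kubectl get configmaps --all-namespaces -l itops.io/config=true
}

# Usage: list_itops_configmaps
```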
Agent values.yaml
node:
  id: "myorg/myplatform/prod/cluster1"
  name: "production-cluster"
itops:
  url: "https://api.itops.example.com"
  apiKey:
    existingSecret: "itops-api-key"
    existingSecretKey: "api-key"
slaGroups:
  - name: "payment-system"
    displayName: "Payment System"
    tier: "critical"
    targets:
      uptime: 99.99
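The values above reference an existing Secret named `itops-api-key` with the key `api-key`. One way to create it is sketched below; `create_agent_api_key_secret` is a hypothetical helper, and the `itops` namespace is assumed from the Quick Deploy commands.

```bash
#!/usr/bin/env bash
# Hypothetical helper: create the Secret referenced by
# apiKey.existingSecret / apiKey.existingSecretKey.
# Assumption: the agent runs in the "itops" namespace unless told otherwise.
create_agent_api_key_secret() {
  local api_key="$1" namespace="${2:-itops}"
  kubectl create secret generic itops-api-key \
    --namespace "$namespace" \
    --from-literal=api-key="$api_key"
}

# Usage: create_agent_api_key_secret "$API_KEY"
```

Creating the Secret before installing the agent chart keeps the key out of `values.yaml` and out of shell history for `helm install`.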
Service Config
Each service is configured via values.yaml in its Helm chart. The agent reads the rendered it-ops.yaml from ConfigMaps.
Service values.yaml
# helmcharts/my-service/values.yaml
itops:
  criticality: "critical"      # critical / high / medium / low
  slaGroup: "payment-system"   # SLA group membership
  type: "database"             # service type
  team: "data"                 # owning team
  tags:
    - database
  backup:
    expected: true             # backup monitoring enabled
    maxAgeDays: 1              # alert if older than N days
ConfigMap Template
# templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Chart.Name }}-itops
  labels:
    itops.io/config: "true"
data:
  it-ops.yaml: |
    version: "1"
    hierarchy:
      organization: {{ .Values.itops.organization | default "myorg" }}
      platform: {{ .Values.itops.platform | default "myplatform" }}
      environment: {{ .Values.itops.environment | default "prod" }}
      cluster: {{ .Values.itops.cluster | default "cluster1" }}
      service: {{ .Chart.Name }}
    service:
      name: {{ .Chart.Name }}
      criticality: {{ .Values.itops.criticality | default "low" }}
      slaGroup: {{ .Values.itops.slaGroup }}
    operations:
      backup:
        expected: {{ .Values.itops.backup.expected | default false }}
        maxAgeDays: {{ .Values.itops.backup.maxAgeDays | default 1 }}
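Before deploying, it can help to render the template locally and inspect the resulting `it-ops.yaml`. The sketch below uses `helm template` with `--show-only`; `render_itops_configmap` is a hypothetical helper, and the chart path in the usage line is taken from the example comment above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: render only the ITOps ConfigMap of a chart.
# The release name is derived from the chart directory name.
render_itops_configmap() {
  local chart_dir="$1"
  helm template "$(basename "$chart_dir")" "$chart_dir" \
    --show-only templates/configmap.yaml
}

# Usage: render_itops_configmap ./helmcharts/my-service
```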
SLA Monitoring
SLA is measured from real agent data. Each agent sync writes a snapshot, and the aggregator groups snapshots into 5-minute buckets. Buckets are processed with a 15-minute delay so that all agents have had time to report.
How it works
| Component | Interval | Function |
|---|---|---|
| Agent Sync | 30s | Reports service status (OPERATIONAL/DEGRADED/DOWN) |
| Snapshot Store | Per sync | Writes to sla_snapshots table |
| Aggregator | 5 min | Bucketed aggregation with 15-min delay |
| Period Results | Per aggregation | Daily + monthly uptime % calculated |
| Cleanup | 1 hour | Deletes aggregated snapshots >90 days |
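The bucketing step can be illustrated with a small script. This is a sketch under stated assumptions, not the aggregator's actual logic: input lines are `epoch_seconds STATUS`, a 5-minute bucket counts as down if any snapshot in it is DOWN, and DEGRADED counts as up. `uptime_percent` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of bucketed uptime calculation.
# Assumptions: 300-second buckets; a bucket is "down" if any snapshot
# in it has status DOWN; DEGRADED is treated as up.
uptime_percent() {
  LC_ALL=C awk -v size=300 '
    {
      bucket = $1 - ($1 % size)        # start of the 5-minute bucket
      seen[bucket] = 1
      if ($2 == "DOWN") down[bucket] = 1
    }
    END {
      total = 0; bad = 0
      for (b in seen) { total++; if (b in down) bad++ }
      if (total == 0) { print "n/a"; exit }
      printf "%.2f\n", 100 * (total - bad) / total
    }'
}

# Four buckets (0, 300, 600, 900), one of which contains a DOWN snapshot.
printf '%s\n' \
  "0 OPERATIONAL" "30 OPERATIONAL" \
  "300 DOWN" "330 OPERATIONAL" \
  "600 DEGRADED" "900 OPERATIONAL" | uptime_percent
# prints 75.00
```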
SLA Tiers
| Tier | Uptime Target | Response Time | Resolution Time |
|---|---|---|---|
| Critical | 99.99% | 15 min | 4 hours |
| High | 99.9% | 60 min | 8 hours |
| Medium | 99.5% | 4 hours | 3 days |
| Low | 99.0% | 24 hours | 5 days |
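To make the uptime targets concrete, the allowed downtime for a period is (100 - target) / 100 times the period length. The sketch below applies this to a 30-day month; the 30-day default and the name `downtime_budget_minutes` are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: allowed downtime in minutes for an uptime target.
# Assumption: a 30-day period unless a different length is given.
downtime_budget_minutes() {
  local target="$1" days="${2:-30}"
  LC_ALL=C awk -v t="$target" -v d="$days" \
    'BEGIN { printf "%.2f\n", (100 - t) / 100 * d * 24 * 60 }'
}

for t in 99.99 99.9 99.5 99.0; do
  printf '%s%% -> %s min\n' "$t" "$(downtime_budget_minutes "$t")"
done
# prints:
# 99.99% -> 4.32 min
# 99.9% -> 43.20 min
# 99.5% -> 216.00 min
# 99.0% -> 432.00 min
```

Under these assumptions, a Critical service can be down for only about four minutes a month before missing its target, which is shorter than a single 5-minute aggregation bucket.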
Backup Monitoring
Any backup tool can report completion via webhook. ITOps tracks the last successful backup and alerts if it's older than maxAgeDays.
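The staleness rule can be sketched as a simple age comparison. This assumes timestamps in epoch seconds and a strict "older than" comparison; `backup_is_stale` is a hypothetical name, not an ITOps command.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: succeed (exit 0) if the last successful backup
# is older than max_age_days. Assumes epoch-second timestamps.
backup_is_stale() {
  local last_epoch="$1" max_age_days="$2" now_epoch="${3:-$(date +%s)}"
  (( now_epoch - last_epoch > max_age_days * 86400 ))
}

# Last backup 90000 seconds (25 hours) ago with maxAgeDays=1.
if backup_is_stale 1700000000 1 1700090000; then
  echo "stale"
else
  echo "fresh"
fi
# prints stale
```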
Webhook Call
# Service-level report
curl -X POST https://api.itops.example.com/api/v1/backup/report \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"service":"my-database","status":"success","sizeBytes":5242880}'
# SLA group-level report (propagates to all services with backup.expected=true)
curl -X POST https://api.itops.example.com/api/v1/backup/report \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"slaGroup":"payment-system","status":"success"}'
# Namespace-level report
curl -X POST https://api.itops.example.com/api/v1/backup/report \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"namespace":"production","status":"success"}'
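A backup job can wrap the service-level call so that it reports only when the backup itself succeeded. The following is a sketch: `backup_payload`, `report_backup_success`, the `ITOPS_URL` variable, and `./run-backup.sh` are hypothetical names, and the service name is assumed to need no JSON escaping.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for reporting a successful backup.
# Assumptions: ITOPS_URL and API_KEY are set in the environment;
# the service name contains no characters that need JSON escaping.
backup_payload() {
  printf '{"service":"%s","status":"success"}' "$1"
}

report_backup_success() {
  local service="$1"
  # -f makes curl exit non-zero on HTTP errors; -sS hides progress but shows errors.
  curl -fsS -X POST "${ITOPS_URL}/api/v1/backup/report" \
    -H "Authorization: Bearer ${API_KEY}" \
    -H "Content-Type: application/json" \
    -d "$(backup_payload "$service")"
}

# Usage: report only if the backup command exits successfully.
# ./run-backup.sh && report_backup_success "my-database"
```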
Auto Incidents
When a service goes DOWN, ITOps automatically:
- Creates an SLA incident (source: MONITORING)
- Generates an INCIDENT ticket (if the ticketing plugin is active)
- Updates the SLA dashboard in real time
- Closes the incident when the service recovers
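The open/close lifecycle above amounts to detecting status transitions. The sketch below illustrates the idea on a stream of `timestamp STATUS` lines; it assumes that only DOWN opens an incident and that any other status counts as recovery, which may differ from the real behaviour. `incident_events` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of incident open/close detection.
# Assumptions: only DOWN opens an incident; any non-DOWN status closes it.
incident_events() {
  awk '
    $2 == "DOWN" && !active { print $1, "open";  active = 1; next }
    $2 != "DOWN" &&  active { print $1, "close"; active = 0 }
  '
}

printf '%s\n' \
  "10:00:00 OPERATIONAL" \
  "10:00:30 DOWN" \
  "10:01:00 DOWN" \
  "10:01:30 OPERATIONAL" | incident_events
# prints:
# 10:00:30 open
# 10:01:30 close
```

Repeated DOWN reports do not open a second incident, because the `active` flag stays set until a recovery is seen.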
Helm Charts
| Chart | Purpose | Namespace |
|---|---|---|
| itops | Core platform (API + UI + Landing) | itops |
| itops-agent | K8s operator (per cluster) | itops |
| test-database | Example database service | itops |
Configuration
All backend settings are configurable via environment variables with the ITOPS_ prefix.
| Variable | Default | Description |
|---|---|---|
| ITOPS_SERVER_PORT | 8080 | API server port |
| ITOPS_DATABASE_HOST | localhost | PostgreSQL host |
| ITOPS_DATABASE_NAME | itops | Database name |
| ITOPS_REDIS_HOST | localhost | Redis host |
| ITOPS_JWT_SECRET | (required) | JWT signing secret |
| ITOPS_SECURITY_OPERATOR_API_KEY | (required) | Agent API key |
| ITOPS_PLUGINS_SLA_ENABLED | true | Enable SLA plugin |
| ITOPS_PLUGINS_TICKETING_ENABLED | true | Enable Ticketing plugin |
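Since two of the variables have no default, a startup script may want to check them before launching the backend. A minimal sketch; `check_required_env` is a hypothetical helper and not part of ITOps:

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check for the variables marked (required).
check_required_env() {
  local missing=0 name
  for name in ITOPS_JWT_SECRET ITOPS_SECURITY_OPERATOR_API_KEY; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [[ -z "${!name:-}" ]]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: check_required_env || exit 1
```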