Documentation

Everything you need to deploy and operate ITOps V4.

Getting Started

ITOps V4 consists of three components: Core (Go backend API), UI (Vue.js frontend), and Agent (K8s operator). Deploy the platform with Helm, then install agents on each cluster.

Quick Deploy

# 1. Add ITOps Helm repo
helm repo add itops https://charts.mlops.hu
helm repo update

# 2. Deploy ITOps platform
helm install itops itops/itops -n itops --create-namespace

# 3. Deploy agent on each K8s cluster
helm install itops-agent itops/itops-agent \
  --set node.id="myorg/myplatform/prod/cluster1" \
  --set itops.url="https://demo-api.mlops.hu" \
  -n itops --create-namespace

# 4. Verify
kubectl get pods -n itops

Architecture

ITOps uses a multi-agent architecture with a 5-level hierarchy: organization/platform/environment/cluster/service.

K8s Clusters (N agents)          ITOps Platform
+------------------+            +------------------+
| Agent            |---sync---->| Core (Go API)    |
| - ConfigMap watch|  30s       | - GraphQL        |
| - Pod monitor    |            | - REST API       |
| - Status report  |            | - SLA Aggregator |
+------------------+            +--------+---------+
                                         |
                                +--------+---------+
                                | PostgreSQL       |
                                | Redis            |
                                +------------------+
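
As an illustration of the 5-level hierarchy, the sketch below joins an agent's node ID with a service name to form a full path. It assumes that node.id always carries the first four levels (organization/platform/environment/cluster), as in the Quick Deploy example; the names ServicePath and service_path are hypothetical and not part of ITOps.

```python
from typing import NamedTuple


class ServicePath(NamedTuple):
    """One fully qualified position in the 5-level hierarchy."""
    organization: str
    platform: str
    environment: str
    cluster: str
    service: str


def service_path(node_id: str, service: str) -> ServicePath:
    """Combine a four-part node ID with a service name.

    Assumption: node_id has the form org/platform/env/cluster.
    """
    parts = node_id.strip("/").split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected org/platform/env/cluster, got {node_id!r}")
    return ServicePath(*parts, service)
```

For example, service_path("myorg/myplatform/prod/cluster1", "my-database") yields a path whose cluster field is "cluster1" and whose service field is "my-database".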

Agent Deploy

The agent watches K8s resources and reports service status every 30 seconds. It discovers services from ConfigMaps labeled with itops.io/config: "true".

Agent values.yaml

node:
  id: "myorg/myplatform/prod/cluster1"
  name: "production-cluster"

itops:
  url: "https://api.itops.example.com"
  apiKey:
    existingSecret: "itops-api-key"
    existingSecretKey: "api-key"

slaGroups:
  - name: "payment-system"
    displayName: "Payment System"
    tier: "critical"
    targets:
      uptime: 99.99

Service Config

Each service is configured via values.yaml in its Helm chart. The agent reads the rendered it-ops.yaml from ConfigMaps.

Service values.yaml

# helmcharts/my-service/values.yaml
itops:
  criticality: "critical"     # critical / high / medium / low
  slaGroup: "payment-system"  # SLA group membership
  type: "database"            # service type
  team: "data"                # owning team
  tags:
    - database
  backup:
    expected: true             # backup monitoring enabled
    maxAgeDays: 1              # alert if older than N days
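
A values file like the one above can be checked before it is committed. The following is a minimal sketch in Python, not an ITOps tool: it assumes the itops block has already been parsed into a dict, takes the allowed criticality levels from the comment above, and borrows the defaults ("low" and 1) from the ConfigMap template. The function name validate_itops_values is hypothetical.

```python
CRITICALITY_LEVELS = ("critical", "high", "medium", "low")


def validate_itops_values(itops: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    criticality = itops.get("criticality", "low")
    if criticality not in CRITICALITY_LEVELS:
        problems.append(
            f"criticality must be one of {CRITICALITY_LEVELS}, got {criticality!r}"
        )

    backup = itops.get("backup") or {}
    if backup.get("expected"):
        max_age = backup.get("maxAgeDays", 1)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(max_age, bool) or not isinstance(max_age, int) or max_age < 1:
            problems.append(
                f"backup.maxAgeDays must be a positive integer, got {max_age!r}"
            )

    return problems
```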

ConfigMap Template

# templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Chart.Name }}-itops
  labels:
    itops.io/config: "true"
data:
  it-ops.yaml: |
    version: "1"
    hierarchy:
      organization: {{ .Values.itops.organization | default "myorg" }}
      platform: {{ .Values.itops.platform | default "myplatform" }}
      environment: {{ .Values.itops.environment | default "prod" }}
      cluster: {{ .Values.itops.cluster | default "cluster1" }}
      service: {{ .Chart.Name }}
    service:
      name: {{ .Chart.Name }}
      criticality: {{ .Values.itops.criticality | default "low" }}
      slaGroup: {{ .Values.itops.slaGroup }}
    operations:
      backup:
        expected: {{ .Values.itops.backup.expected | default false }}
        maxAgeDays: {{ .Values.itops.backup.maxAgeDays | default 1 }}
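
To see which ConfigMaps the agent would pick up, the label filter can be reproduced offline. This is a sketch of the selection rule only, not the agent's actual code: it assumes its input is the JSON list produced by `kubectl get configmaps --all-namespaces -o json`, and the function name discover_service_configs is hypothetical.

```python
CONFIG_LABEL = "itops.io/config"
CONFIG_KEY = "it-ops.yaml"


def discover_service_configs(configmap_list: dict) -> dict:
    """Map "namespace/name" to the raw it-ops.yaml text of each labeled ConfigMap."""
    found = {}
    for item in configmap_list.get("items", []):
        meta = item.get("metadata", {})
        labels = meta.get("labels") or {}
        data = item.get("data") or {}
        # Only ConfigMaps labeled itops.io/config: "true" are considered,
        # and only if they actually carry the it-ops.yaml key.
        if labels.get(CONFIG_LABEL) == "true" and CONFIG_KEY in data:
            key = f"{meta.get('namespace', 'default')}/{meta['name']}"
            found[key] = data[CONFIG_KEY]
    return found
```

Feeding the parsed kubectl output to discover_service_configs returns one entry per service config, which makes a missing label or a misnamed data key easy to spot.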

SLA Monitoring

SLA is measured from real agent data, aggregated into 5-minute buckets. The aggregator processes each bucket with a 15-minute delay so that all agents have had time to report.

How it works

Component        Interval          Function
Agent Sync       30s               Reports service status (OPERATIONAL/DEGRADED/DOWN)
Snapshot Store   Per sync          Writes to sla_snapshots table
Aggregator       5 min             Bucketed aggregation with 15-min delay
Period Results   Per aggregation   Daily + monthly uptime % calculated
Cleanup          1 hour            Deletes aggregated snapshots >90 days
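
The bucketed aggregation can be sketched in a few lines of Python. This is an illustration of the idea, not the Core aggregator itself (which is written in Go): the function names are hypothetical, and counting DEGRADED as "up" is an assumption that the source does not confirm.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

BUCKET = timedelta(minutes=5)
DELAY = timedelta(minutes=15)
UP_STATES = {"OPERATIONAL", "DEGRADED"}  # assumption: only DOWN counts as downtime


def bucket_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 5-minute bucket."""
    return ts - timedelta(
        minutes=ts.minute % 5, seconds=ts.second, microseconds=ts.microsecond
    )


def aggregate(snapshots, now: datetime) -> dict:
    """Return {bucket start: uptime fraction} for buckets closed at least DELAY ago.

    snapshots is an iterable of (timestamp, status) pairs for one service.
    """
    cutoff = now - DELAY
    counts = defaultdict(lambda: [0, 0])  # bucket start -> [up, total]
    for ts, status in snapshots:
        start = bucket_start(ts)
        if start + BUCKET > cutoff:
            continue  # too recent: some agents may not have reported yet
        counts[start][1] += 1
        if status in UP_STATES:
            counts[start][0] += 1
    return {start: up / total for start, (up, total) in counts.items()}
```

With 30-second syncs a full bucket holds about ten snapshots per service, so a single DOWN report lowers that bucket's uptime fraction by roughly a tenth.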

SLA Tiers

Tier       Uptime Target   Response Time   Resolution Time
Critical   99.99%          15 min          4 hours
High       99.9%           60 min          8 hours
Medium     99.5%           4 hours         3 days
Low        99.0%           24 hours        5 days
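
An uptime target is easier to reason about as a downtime budget. The sketch below converts a target percentage into allowed minutes of downtime; the 30-day period is an assumption chosen for illustration, and the function name is hypothetical.

```python
def allowed_downtime_minutes(uptime_target_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an uptime target over a period."""
    total_minutes = period_days * 24 * 60
    return round(total_minutes * (1 - uptime_target_pct / 100), 2)


if __name__ == "__main__":
    for tier, target in [("Critical", 99.99), ("High", 99.9),
                         ("Medium", 99.5), ("Low", 99.0)]:
        print(f"{tier}: {allowed_downtime_minutes(target)} min per 30 days")
```

Over 30 days the Critical tier leaves about 4.32 minutes of downtime, High 43.2 minutes, Medium 216 minutes, and Low 432 minutes.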

Backup Monitoring

Any backup tool can report completion via webhook. ITOps tracks the last successful backup and alerts if it's older than maxAgeDays.
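
The staleness rule amounts to a single comparison. The following is a minimal sketch, assuming that a service with no recorded successful backup is treated as stale; that case is not specified in this document, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_is_stale(
    last_success: Optional[datetime], max_age_days: int, now: datetime
) -> bool:
    """True if the last successful backup is older than max_age_days."""
    if last_success is None:
        return True  # assumption: never backed up counts as stale
    return now - last_success > timedelta(days=max_age_days)
```

With maxAgeDays: 1, a backup that finished 23 hours ago is still fresh, while one that finished 25 hours ago would raise an alert.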

Webhook Call

# Service-level report
curl -X POST https://api.itops.example.com/api/v1/backup/report \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"service":"my-database","status":"success","sizeBytes":5242880}'

# SLA Group-level report (propagates to all services with backup.expected=true)
curl -X POST https://api.itops.example.com/api/v1/backup/report \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"slaGroup":"payment-system","status":"success"}'

# Namespace-level report
curl -X POST https://api.itops.example.com/api/v1/backup/report \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"namespace":"production","status":"success"}'
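
Backup jobs written in Python can send the same report without shelling out to curl. This sketch uses only the standard library and mirrors the endpoint, headers, and JSON fields from the curl calls above; the helper names build_backup_report and send_backup_report are hypothetical.

```python
import json
import urllib.request


def build_backup_report(
    base_url: str, api_key: str, report: dict
) -> urllib.request.Request:
    """Prepare a POST to /api/v1/backup/report with a JSON body."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/backup/report",
        data=json.dumps(report).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_backup_report(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the prepared request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping request construction separate from sending lets a backup script log or dry-run the report before it touches the network.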

Auto Incidents

When a service goes DOWN, ITOps automatically:

  1. Creates an SLA incident (source: MONITORING)
  2. Generates an INCIDENT ticket (if the ticketing plugin is active)
  3. Updates the SLA dashboard in real time
  4. Closes the incident when the service recovers
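
The open and close steps above behave like a small state machine keyed by service. The sketch below shows one way to model that lifecycle; it is not the ITOps implementation. It assumes that only a return to OPERATIONAL counts as recovery (DEGRADED leaves the incident open), and the class names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Incident:
    service: str
    source: str = "MONITORING"
    is_open: bool = True


class IncidentTracker:
    """Opens one incident per DOWN service and closes it on recovery."""

    def __init__(self) -> None:
        self.open_incidents: Dict[str, Incident] = {}
        self.closed_incidents: List[Incident] = []

    def on_status(self, service: str, status: str) -> Optional[Incident]:
        """Process a status report; return the incident that changed, if any."""
        current = self.open_incidents.get(service)
        if status == "DOWN" and current is None:
            incident = Incident(service)
            self.open_incidents[service] = incident
            return incident
        if status == "OPERATIONAL" and current is not None:
            current.is_open = False
            self.closed_incidents.append(self.open_incidents.pop(service))
            return current
        return None  # repeated DOWN, DEGRADED, or nothing to close
```

Repeated DOWN reports for the same service return None, so a 30-second sync interval does not produce duplicate incidents in this model.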

Helm Charts

Chart           Purpose                              Namespace
itops           Core platform (API + UI + Landing)   itops
itops-agent     K8s operator (per cluster)           itops
test-database   Example database service             itops

Configuration

All backend settings are configurable via environment variables with the ITOPS_ prefix.

Variable                          Default      Description
ITOPS_SERVER_PORT                 8080         API server port
ITOPS_DATABASE_HOST               localhost    PostgreSQL host
ITOPS_DATABASE_NAME               itops        Database name
ITOPS_REDIS_HOST                  localhost    Redis host
ITOPS_JWT_SECRET                  (required)   JWT signing secret
ITOPS_SECURITY_OPERATOR_API_KEY   (required)   Agent API key
ITOPS_PLUGINS_SLA_ENABLED         true         Enable SLA plugin
ITOPS_PLUGINS_TICKETING_ENABLED   true         Enable Ticketing plugin
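
Before deploying, it can help to confirm that the required variables are set and to see which defaults will apply. The sketch below is a pre-flight check in Python, not the way Core itself loads configuration; it copies the defaults and required names from the table above, and the function name load_settings is hypothetical.

```python
import os
from typing import Mapping, Optional

DEFAULTS = {
    "ITOPS_SERVER_PORT": "8080",
    "ITOPS_DATABASE_HOST": "localhost",
    "ITOPS_DATABASE_NAME": "itops",
    "ITOPS_REDIS_HOST": "localhost",
    "ITOPS_PLUGINS_SLA_ENABLED": "true",
    "ITOPS_PLUGINS_TICKETING_ENABLED": "true",
}
REQUIRED = ("ITOPS_JWT_SECRET", "ITOPS_SECURITY_OPERATOR_API_KEY")


def load_settings(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Merge ITOPS_* variables over the documented defaults.

    Raises RuntimeError if a required variable is missing or empty.
    """
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required settings: " + ", ".join(missing))
    settings = dict(DEFAULTS)
    settings.update({k: v for k, v in env.items() if k.startswith("ITOPS_")})
    return settings
```

Calling load_settings() with no arguments reads the current process environment; passing a dict makes the check easy to run against a rendered values file or a CI variable set.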