Introduction

Managing monitoring at scale is challenging. As your infrastructure grows from a handful of services to dozens or hundreds, maintaining consistent, comprehensive monitoring becomes increasingly complex. Manually creating Datadog monitors through the UI doesn’t scale, leads to inconsistencies, and makes it difficult to apply organizational standards.

This guide walks through building a modular Terraform framework for managing Datadog monitors that scales from small teams to enterprise deployments. The framework emphasizes composability, allowing you to enable only the monitoring suites needed for each service while maintaining consistency across your entire infrastructure.

What you’ll learn:

  • Build a modular monitoring framework with Terraform
  • Implement the Four Golden Signals (latency, traffic, errors, saturation)
  • Configure tier-based alert routing
  • Create service-specific monitoring profiles
  • Manage monitors as code with version control
  • Scale monitoring across multiple services and environments

Why Terraform for Datadog?

  • Version Control: Track all monitoring changes in Git
  • Consistency: Apply organizational standards automatically
  • Reusability: Define monitoring patterns once, apply to many services
  • Collaboration: Review monitoring changes like code changes
  • Automation: Deploy monitoring alongside application code

Prerequisites

Before starting, ensure you have:

  • Datadog account with API and Application keys
  • Terraform installed (v1.0+)
  • Basic Terraform knowledge (providers, resources, modules)
  • Datadog permissions to create monitors and dashboards
  • Git for version control (recommended)

Recommended knowledge:

  • SRE principles and Golden Signals
  • Your application architecture and dependencies
  • Alert routing preferences (Slack, PagerDuty, email)

Architecture Overview

The Problem: Monitoring Sprawl

Without a framework:

Service A: 47 monitors (manually created, inconsistent thresholds)
Service B: 12 monitors (missing error rate monitoring)
Service C: 89 monitors (duplicates, alert fatigue)
Service D: 3 monitors (woefully inadequate)

With this framework:

Service A: golden-signals + infrastructure + database = 23 focused monitors
Service B: golden-signals + application = 15 comprehensive monitors
Service C: golden-signals + infrastructure = 18 optimized monitors
Service D: golden-signals = 8 essential monitors

Module Hierarchy

Root Configuration (main.tf)
    └── Complete Monitoring Module (per service)
            ├── Golden Signals Module
            ├── Infrastructure Monitors Module
            ├── Database Monitors Module
            ├── Application Monitors Module
            ├── Security Monitors Module
            └── Business Monitors Module

How it works:

  1. Services are defined in main.tf with configuration (runtime, database, tier, monitoring suites)
  2. Complete Monitoring orchestrates specialized modules based on each service’s monitoring_suites setting
  3. Specialized modules create targeted monitors for their domain
  4. Alert routing is automatic based on service tier and environment

Part 1: Project Setup

Step 1: Create Project Structure

mkdir -p ~/terraform/datadog-monitors
cd ~/terraform/datadog-monitors

# Create module directories
mkdir -p modules/{golden-signals,infrastructure,database,application,security,business,complete-monitoring}

# Initialize git
git init

Step 2: Configure .gitignore

cat > .gitignore << 'EOF'
# Terraform files
*.tfstate
*.tfstate.*
*.tfstate.backup
.terraform/
# Note: commit .terraform.lock.hcl so provider versions stay pinned across machines

# Sensitive files
terraform.tfvars
*.auto.tfvars
.env

# IDE
.idea/
.vscode/
*.swp
EOF

Step 3: Get Datadog API Credentials

In Datadog UI:

  1. Navigate to Organization Settings → API Keys
  2. Create API Key (or use existing)
  3. Navigate to Organization Settings → Application Keys
  4. Create Application Key with appropriate permissions

Store credentials securely:

# Option 1: Environment variables (recommended)
export TF_VAR_datadog_api_key="your-api-key"
export TF_VAR_datadog_app_key="your-app-key"
export TF_VAR_organization_name="YourCompany"

# Option 2: terraform.tfvars (never commit!)
cat > terraform.tfvars << 'EOF'
datadog_api_key = "your-api-key"
datadog_app_key = "your-app-key"
organization_name = "YourCompany"
EOF

Part 2: Build the Golden Signals Module

The Four Golden Signals from Google’s SRE book form the foundation of effective monitoring.

Create Module Structure

cd modules/golden-signals

variables.tf

variable "service_name" {
  description = "Name of the service being monitored"
  type        = string
}

variable "environment" {
  description = "Environment (production, staging, development)"
  type        = string
}

variable "service_tier" {
  description = "Service tier (critical, important, standard)"
  type        = string
}

variable "alert_channels" {
  description = "List of alert notification channels"
  type        = list(string)
}

variable "tags" {
  description = "Additional tags for monitors"
  type        = list(string)
  default     = []
}

# Latency thresholds
variable "latency_p95_critical_ms" {
  description = "P95 latency critical threshold in milliseconds"
  type        = number
  default     = 1000
}

variable "latency_p95_warning_ms" {
  description = "P95 latency warning threshold in milliseconds"
  type        = number
  default     = 500
}

# Error rate thresholds
variable "error_rate_critical_pct" {
  description = "Error rate critical threshold percentage"
  type        = number
  default     = 5
}

variable "error_rate_warning_pct" {
  description = "Error rate warning threshold percentage"
  type        = number
  default     = 2
}

# Saturation thresholds
variable "cpu_critical_pct" {
  description = "CPU usage critical threshold percentage"
  type        = number
  default     = 90
}

variable "memory_critical_pct" {
  description = "Memory usage critical threshold percentage"
  type        = number
  default     = 90
}
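Because service_tier drives monitor priority and alert routing, a typo such as "crtical" would silently fall through to the lowest priority. Datadog also typically rejects a monitor whose warning threshold is not below its critical threshold for a ">" comparison. The following is a minimal sketch of how the module could catch both problems at plan time. The allowed values come from the variable description above; the terraform_data guard resource and its name are assumptions for illustration and need Terraform 1.4 or later.

```hcl
# variables.tf: would replace the plain service_tier declaration above
variable "service_tier" {
  description = "Service tier (critical, important, standard)"
  type        = string

  validation {
    condition     = contains(["critical", "important", "standard"], var.service_tier)
    error_message = "The service_tier value must be one of: critical, important, standard."
  }
}

# main.tf: creates nothing in Datadog, but fails the plan when a warning
# threshold is not below its critical threshold (hypothetical guard resource)
resource "terraform_data" "threshold_guard" {
  lifecycle {
    precondition {
      condition     = var.latency_p95_warning_ms < var.latency_p95_critical_ms
      error_message = "The latency_p95_warning_ms value must be lower than latency_p95_critical_ms."
    }

    precondition {
      condition     = var.error_rate_warning_pct < var.error_rate_critical_pct
      error_message = "The error_rate_warning_pct value must be lower than error_rate_critical_pct."
    }
  }
}
```

This matters once services override only the critical threshold: an override of 100 ms combined with the default 500 ms warning would otherwise surface as an API error during apply rather than a readable message during plan.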

main.tf

# Latency Monitor - P95 Response Time
resource "datadog_monitor" "latency_p95" {
  name    = "[${var.environment}] ${var.service_name} - High P95 Latency"
  type    = "metric alert"
  message = <<-EOT
    **Service**: ${var.service_name}
    **Environment**: ${var.environment}
    **Severity**: {{#is_alert}}CRITICAL{{/is_alert}}{{#is_warning}}WARNING{{/is_warning}}

    P95 latency has exceeded acceptable thresholds.

    **Current Value**: {{value}}ms
    **Threshold**: Critical: ${var.latency_p95_critical_ms}ms | Warning: ${var.latency_p95_warning_ms}ms

    **Runbook**: https://wiki.company.com/runbooks/${var.service_name}/latency
    **Grafana**: https://grafana.company.com/d/${var.service_name}

    ${join(" ", var.alert_channels)}
  EOT

  query = "avg(last_5m):p95:trace.web.request.duration{service:${var.service_name},env:${var.environment}} > ${var.latency_p95_critical_ms}"

  monitor_thresholds {
    critical = var.latency_p95_critical_ms
    warning  = var.latency_p95_warning_ms
  }

  notify_no_data    = true
  no_data_timeframe = 10
  renotify_interval = 60

  priority = var.service_tier == "critical" ? 1 : (var.service_tier == "important" ? 2 : 3)

  tags = concat(
    [
      "service:${var.service_name}",
      "env:${var.environment}",
      "tier:${var.service_tier}",
      "signal:latency",
      "terraform:true"
    ],
    var.tags
  )
}

# Latency Monitor - P99 Response Time
resource "datadog_monitor" "latency_p99" {
  name    = "[${var.environment}] ${var.service_name} - High P99 Latency"
  type    = "metric alert"
  message = <<-EOT
    **Service**: ${var.service_name}
    **Tail latency** (P99) is degraded.

    **Current Value**: {{value}}ms
    **Impact**: Worst-case user experience

    ${join(" ", var.alert_channels)}
  EOT

  query = "avg(last_5m):p99:trace.web.request.duration{service:${var.service_name},env:${var.environment}} > ${var.latency_p95_critical_ms * 2}"

  monitor_thresholds {
    critical = var.latency_p95_critical_ms * 2
    warning  = var.latency_p95_critical_ms * 1.5
  }

  notify_no_data    = true
  no_data_timeframe = 10

  tags = concat(
    [
      "service:${var.service_name}",
      "env:${var.environment}",
      "tier:${var.service_tier}",
      "signal:latency",
      "terraform:true"
    ],
    var.tags
  )
}

# Traffic Monitor - Request Rate Drop
resource "datadog_monitor" "traffic_drop" {
  name    = "[${var.environment}] ${var.service_name} - Traffic Drop Detected"
  type    = "query alert"
  message = <<-EOT
    **Service**: ${var.service_name}
    **Alert**: Significant traffic drop detected

    Request rate has dropped by more than 50% compared to the previous hour.
    This could indicate:
    - Upstream service failure
    - Load balancer misconfiguration
    - DNS issues
    - Deployment problem

    **Action Required**: Investigate immediately

    ${join(" ", var.alert_channels)}
  EOT

  query = "pct_change(avg(last_5m),last_1h):sum:trace.web.request.hits{service:${var.service_name},env:${var.environment}}.as_count() < -50"

  monitor_thresholds {
    critical = -50
    warning  = -30
  }

  notify_no_data    = true
  no_data_timeframe = 10

  tags = concat(
    [
      "service:${var.service_name}",
      "env:${var.environment}",
      "tier:${var.service_tier}",
      "signal:traffic",
      "terraform:true"
    ],
    var.tags
  )
}

# Errors Monitor - Error Rate Percentage
resource "datadog_monitor" "error_rate" {
  name    = "[${var.environment}] ${var.service_name} - High Error Rate"
  type    = "query alert"
  message = <<-EOT
    **Service**: ${var.service_name}
    **Alert**: Error rate exceeds acceptable threshold

    **Current Error Rate**: {{value}}%
    **Threshold**: Critical: ${var.error_rate_critical_pct}% | Warning: ${var.error_rate_warning_pct}%

    **Common Causes**:
    - Database connection issues
    - Downstream service failures
    - Invalid input validation
    - Recent deployment issues

    **Runbook**: https://wiki.company.com/runbooks/${var.service_name}/errors

    ${join(" ", var.alert_channels)}
  EOT

  query = "avg(last_5m):(sum:trace.web.request.errors{service:${var.service_name},env:${var.environment}}.as_count() / sum:trace.web.request.hits{service:${var.service_name},env:${var.environment}}.as_count()) * 100 > ${var.error_rate_critical_pct}"

  monitor_thresholds {
    critical = var.error_rate_critical_pct
    warning  = var.error_rate_warning_pct
  }

  notify_no_data = false

  tags = concat(
    [
      "service:${var.service_name}",
      "env:${var.environment}",
      "tier:${var.service_tier}",
      "signal:errors",
      "terraform:true"
    ],
    var.tags
  )
}

# Saturation Monitor - CPU Usage
resource "datadog_monitor" "cpu_saturation" {
  name    = "[${var.environment}] ${var.service_name} - High CPU Usage"
  type    = "metric alert"
  message = <<-EOT
    **Service**: ${var.service_name}
    **Alert**: CPU saturation detected

    **Current CPU**: {{value}}%
    **Threshold**: ${var.cpu_critical_pct}%

    High CPU usage can lead to:
    - Increased latency
    - Request timeouts
    - Service instability

    **Actions**:
    1. Check for CPU-intensive operations
    2. Review recent deployments
    3. Consider horizontal scaling

    ${join(" ", var.alert_channels)}
  EOT

  query = "avg(last_10m):avg:system.cpu.user{service:${var.service_name},env:${var.environment}} by {host} > ${var.cpu_critical_pct}"

  monitor_thresholds {
    critical = var.cpu_critical_pct
    warning  = var.cpu_critical_pct - 10
  }

  notify_no_data    = true
  no_data_timeframe = 20

  tags = concat(
    [
      "service:${var.service_name}",
      "env:${var.environment}",
      "tier:${var.service_tier}",
      "signal:saturation",
      "terraform:true"
    ],
    var.tags
  )
}

# Saturation Monitor - Memory Usage
resource "datadog_monitor" "memory_saturation" {
  name    = "[${var.environment}] ${var.service_name} - High Memory Usage"
  type    = "metric alert"
  message = <<-EOT
    **Service**: ${var.service_name}
    **Alert**: Memory saturation detected

    **Usable Memory**: {{value}}%
    **Threshold**: usage above ${var.memory_critical_pct}% (usable memory below ${100 - var.memory_critical_pct}%)

    **Potential Issues**:
    - Memory leaks
    - Insufficient capacity
    - Caching issues

    **Actions**:
    1. Check for memory leaks
    2. Review heap dumps (JVM) / memory profiles
    3. Consider vertical scaling

    ${join(" ", var.alert_channels)}
  EOT

  # system.mem.pct_usable is reported as a fraction (0-1), so scale it to a
  # percentage before comparing against the percentage-based threshold
  query = "avg(last_10m):avg:system.mem.pct_usable{service:${var.service_name},env:${var.environment}} by {host} * 100 < ${100 - var.memory_critical_pct}"

  monitor_thresholds {
    critical = 100 - var.memory_critical_pct
    warning  = 100 - (var.memory_critical_pct - 10)
  }

  notify_no_data    = true
  no_data_timeframe = 20

  tags = concat(
    [
      "service:${var.service_name}",
      "env:${var.environment}",
      "tier:${var.service_tier}",
      "signal:saturation",
      "terraform:true"
    ],
    var.tags
  )
}
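All six monitors above repeat the same four tags and differ only in the signal tag. One way to cut that repetition, sketched here under the assumption that a local named base_tags (not part of the original module) is acceptable, is to build the shared list once:

```hcl
locals {
  # Tags shared by every monitor in this module
  base_tags = concat(
    [
      "service:${var.service_name}",
      "env:${var.environment}",
      "tier:${var.service_tier}",
      "terraform:true"
    ],
    var.tags
  )
}

# In each resource, only the signal tag then differs, for example:
#   tags = concat(local.base_tags, ["signal:latency"])
```

With this in place, adding or renaming an organization-wide tag becomes a single edit rather than six.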

outputs.tf

output "monitor_ids" {
  description = "Map of monitor names to IDs"
  value = {
    latency_p95       = datadog_monitor.latency_p95.id
    latency_p99       = datadog_monitor.latency_p99.id
    traffic_drop      = datadog_monitor.traffic_drop.id
    error_rate        = datadog_monitor.error_rate.id
    cpu_saturation    = datadog_monitor.cpu_saturation.id
    memory_saturation = datadog_monitor.memory_saturation.id
  }
}

output "monitor_urls" {
  description = "Datadog URLs for each monitor"
  value = {
    for name, id in {
      latency_p95       = datadog_monitor.latency_p95.id
      latency_p99       = datadog_monitor.latency_p99.id
      traffic_drop      = datadog_monitor.traffic_drop.id
      error_rate        = datadog_monitor.error_rate.id
      cpu_saturation    = datadog_monitor.cpu_saturation.id
      memory_saturation = datadog_monitor.memory_saturation.id
    } : name => "https://app.datadoghq.com/monitors/${id}"
  }
}

Part 3: Create the Root Configuration

Back to project root

cd ~/terraform/datadog-monitors

providers.tf

terraform {
  required_version = ">= 1.0"

  required_providers {
    datadog = {
      source  = "DataDog/datadog"
      version = "~> 3.0"
    }
  }
}

provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
  api_url = "https://api.datadoghq.com/"  # Use US5: https://api.us5.datadoghq.com/ if needed
}
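If your organization is not on the US1 site, the API endpoint has to change. The sketch below makes the endpoint a variable instead of a hard-coded string; the variable name datadog_api_url is an assumption for illustration, and the provider block would replace the one above. Note that the monitor URLs in the outputs assume app.datadoghq.com and would need the same treatment.

```hcl
variable "datadog_api_url" {
  description = "Datadog API endpoint for your site"
  type        = string
  # US1 by default; for example https://api.datadoghq.eu/ for EU
  # or https://api.us5.datadoghq.com/ for US5
  default = "https://api.datadoghq.com/"
}

provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
  api_url = var.datadog_api_url
}
```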

variables.tf

variable "datadog_api_key" {
  description = "Datadog API key"
  type        = string
  sensitive   = true
}

variable "datadog_app_key" {
  description = "Datadog Application key"
  type        = string
  sensitive   = true
}

variable "organization_name" {
  description = "Organization name for tagging"
  type        = string
}

main.tf

locals {
  # Service definitions
  services = {
    "web-app" = {
      environment       = "production"
      tier              = "critical"
      team              = "frontend"
      runtime           = "node"
      database_type     = "postgresql"
      monitoring_suites = ["golden-signals"]
      custom_tags       = ["customer-facing"]

      # Custom thresholds
      latency_p95_critical_ms = 800
      error_rate_critical_pct = 3
    }

    "api-service" = {
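      environment       = "production"
      tier              = "critical"
      team              = "backend"
      runtime           = "jvm"
      database_type     = "postgresql"
      monitoring_suites = ["golden-signals"]
      custom_tags       = ["api", "external"]

      # Warning thresholds are set explicitly because the defaults (500 ms, 2%)
      # would not be below these critical values
      latency_p95_critical_ms = 500
      latency_p95_warning_ms  = 300
      error_rate_critical_pct = 1
      error_rate_warning_pct  = 0.5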
    }

    "worker-service" = {
      environment       = "production"
      tier              = "important"
      team              = "backend"
      runtime           = "python"
      queue_type        = "sqs"
      monitoring_suites = ["golden-signals"]
      custom_tags       = ["async"]

      error_rate_critical_pct = 5
    }
  }

  # Alert routing configuration
  alert_configs = {
    critical = {
      production = ["@slack-oncall", "@pagerduty-critical"]
      staging    = ["@slack-dev-alerts"]
    }
    important = {
      production = ["@slack-alerts", "@pagerduty-high"]
      staging    = ["@slack-dev"]
    }
    standard = {
      production = ["@slack-alerts"]
      staging    = ["@slack-dev"]
    }
  }
}

# Deploy golden signals monitoring for each service
module "service_monitoring" {
  source   = "./modules/golden-signals"
  for_each = local.services

  service_name = each.key
  environment  = each.value.environment
  service_tier = each.value.tier

  # Alert routing based on tier and environment
  alert_channels = lookup(
    lookup(local.alert_configs, each.value.tier, local.alert_configs.standard),
    each.value.environment,
    ["@slack-dev"]
  )

  # Custom thresholds (use defaults if not specified)
  latency_p95_critical_ms = lookup(each.value, "latency_p95_critical_ms", 1000)
  latency_p95_warning_ms  = lookup(each.value, "latency_p95_warning_ms", 500)
  error_rate_critical_pct = lookup(each.value, "error_rate_critical_pct", 5)
  error_rate_warning_pct  = lookup(each.value, "error_rate_warning_pct", 2)

  tags = concat(
    [
      "team:${each.value.team}",
      "runtime:${lookup(each.value, "runtime", "unknown")}",
      "org:${var.organization_name}"
    ],
    lookup(each.value, "custom_tags", [])
  )
}
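The lookup defaults above are fixed numbers, so a service that lowers only its critical threshold can end up with a warning threshold at or above it. A possible alternative, sketched here, resolves thresholds once in a local and derives the warning default from the critical value. The local name service_thresholds and the 0.5 ratio are assumptions, not part of the original configuration.

```hcl
locals {
  # Resolve per-service thresholds once. try() returns the first expression
  # that evaluates without error, so a missing attribute falls through to
  # the next candidate.
  service_thresholds = {
    for name, svc in local.services : name => {
      latency_p95_critical_ms = try(svc.latency_p95_critical_ms, 1000)
      latency_p95_warning_ms  = try(svc.latency_p95_warning_ms, svc.latency_p95_critical_ms * 0.5, 500)
      error_rate_critical_pct = try(svc.error_rate_critical_pct, 5)
      error_rate_warning_pct  = try(svc.error_rate_warning_pct, svc.error_rate_critical_pct * 0.5, 2)
    }
  }
}
```

The module call would then read values such as local.service_thresholds[each.key].latency_p95_warning_ms in place of the individual lookup calls.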

outputs.tf

output "monitoring_summary" {
  description = "Summary of deployed monitoring"
  value = {
    services_monitored = keys(local.services)
    total_services     = length(local.services)
    monitor_links = {
      for service, config in local.services :
      service => "https://app.datadoghq.com/monitors/manage?q=service%3A${service}"
    }
  }
}

output "service_monitor_ids" {
  description = "Monitor IDs by service"
  value = {
    for service, monitors in module.service_monitoring :
    service => monitors.monitor_ids
  }
}

Part 4: Deploy Monitoring

Step 1: Initialize Terraform

terraform init

Expected output:

Initializing modules...
- service_monitoring in modules/golden-signals

Initializing provider plugins...
- Finding datadog/datadog versions matching "~> 3.0"...
- Installing datadog/datadog v3.37.0...

Terraform has been successfully initialized!

Step 2: Plan Deployment

terraform plan

Review the plan:

Terraform will perform the following actions:

  # module.service_monitoring["web-app"].datadog_monitor.latency_p95 will be created
  + resource "datadog_monitor" "latency_p95" {
      + name    = "[production] web-app - High P95 Latency"
      + type    = "metric alert"
      + query   = "avg(last_5m):p95:trace.web.request.duration{service:web-app,env:production} > 800"
      + priority = 1
      + tags    = [
          + "service:web-app",
          + "env:production",
          + "tier:critical",
          + "signal:latency",
          + "terraform:true",
          + "team:frontend",
          + "runtime:node",
          + "org:YourCompany",
          + "customer-facing"
        ]
    }

  # module.service_monitoring["web-app"].datadog_monitor.error_rate will be created
  ...

Plan: 18 to add, 0 to change, 0 to destroy.

Step 3: Apply Configuration

terraform apply

Type yes when prompted.

Expected output:

module.service_monitoring["web-app"].datadog_monitor.latency_p95: Creating...
module.service_monitoring["web-app"].datadog_monitor.latency_p99: Creating...
module.service_monitoring["api-service"].datadog_monitor.error_rate: Creating...
...
Apply complete! Resources: 18 added, 0 changed, 0 destroyed.

Outputs:

monitoring_summary = {
  "monitor_links" = {
    "api-service" = "https://app.datadoghq.com/monitors/manage?q=service%3Aapi-service"
    "web-app" = "https://app.datadoghq.com/monitors/manage?q=service%3Aweb-app"
    "worker-service" = "https://app.datadoghq.com/monitors/manage?q=service%3Aworker-service"
  }
  "services_monitored" = [
    "api-service",
    "web-app",
    "worker-service",
  ]
  "total_services" = 3
}

Step 4: Verify in Datadog

Check Datadog UI:

  1. Navigate to Monitors → Manage Monitors
  2. Filter by terraform:true tag
  3. Verify monitors are created and in correct state

Part 5: Managing Monitors

Add a New Service

Edit main.tf and add to local.services:

"cache-service" = {
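  environment       = "production"
  tier              = "important"
  team              = "infrastructure"
  runtime           = "redis"
  monitoring_suites = ["golden-signals"]
  custom_tags       = ["cache"]

  # Cache should be fast. Warning thresholds are set explicitly because the
  # defaults (500 ms, 2%) would not be below these critical values
  latency_p95_critical_ms = 100
  latency_p95_warning_ms  = 50
  error_rate_critical_pct = 1
  error_rate_warning_pct  = 0.5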
}

Apply changes:

terraform plan
terraform apply

Adjust Thresholds

Scenario: Error rate threshold too sensitive for worker-service

Edit threshold in main.tf:

"worker-service" = {
  # ... existing config ...
  error_rate_critical_pct = 10  # Changed from 5
  error_rate_warning_pct  = 7   # Added warning threshold
}

Apply:

terraform apply

Only the affected monitors will be updated.

Update Alert Routing

Scenario: Add Opsgenie for critical services

Edit alert_configs:

alert_configs = {
  critical = {
    production = [
      "@slack-oncall",
      "@pagerduty-critical",
      "@opsgenie-critical"  # Added
    ]
    staging = ["@slack-dev-alerts"]
  }
  # ... rest unchanged ...
}

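Apply the changes. Every monitor for a critical-tier production service is updated in place, because the notification handles are embedded in each monitor's message.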


Best Practices

1. Version Control Everything

# Commit monitoring changes like code
git add main.tf modules/
git commit -m "monitoring: add cache-service golden signals

- Add Redis cache service monitoring
- Set aggressive latency threshold (100ms)
- Route alerts to infrastructure team"

git push origin main

2. Use Consistent Tagging

Required tags:

  • service:name - Service identifier
  • env:environment - production/staging/development
  • tier:criticality - critical/important/standard
  • team:owner - Owning team
  • terraform:true - Managed by Terraform

Recommended tags:

  • runtime:technology - node/jvm/python/go
  • component:type - api/worker/cache
  • Custom business tags

3. Test in Staging First

# Deploy to staging first
# Note: each map key is passed to the module as service_name, so it must
# match the service tag that the staging metrics actually carry
"api-service-staging" = {
  environment = "staging"
  tier        = "important" # Lower tier for staging
  # ... same config as production ...
}

# After validation, promote to production
"api-service" = {
  environment = "production"
  tier        = "critical"
  # ... same thresholds ...
}
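Copying each service definition per environment invites drift, and using the map key as the service name means a key like "api-service-staging" ends up in the monitor queries. The sketch below shows one way to generate per-environment entries from a single base definition while keeping the real service tag. All names here (base_services, environment_tiers, service_instances, and the service_name attribute) are assumptions for illustration.

```hcl
locals {
  # Hypothetical base definitions, keyed by the service tag reported to Datadog
  base_services = {
    "api-service" = {
      team                    = "backend"
      latency_p95_critical_ms = 500
      latency_p95_warning_ms  = 300
    }
  }

  # Assumption: staging runs one tier lower than production
  environment_tiers = {
    staging    = "important"
    production = "critical"
  }

  # One entry per service/environment pair, for example "api-service-staging"
  service_instances = merge([
    for env, tier in local.environment_tiers : {
      for name, svc in local.base_services :
      "${name}-${env}" => merge(svc, {
        service_name = name
        environment  = env
        tier         = tier
      })
    }
  ]...)
}
```

The module call would then use for_each = local.service_instances and service_name = each.value.service_name, so staging and production monitors share thresholds but query the same service tag in different environments.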

4. Document Runbooks

Include runbook links in monitor messages:

message = <<-EOT
  **Runbook**: https://wiki.company.com/runbooks/${var.service_name}/latency
  **Grafana**: https://grafana.company.com/d/${var.service_name}
  **On-Call**: https://pagerduty.com/schedules/${var.service_name}
EOT

5. Use Remote State

# backend.tf
terraform {
  backend "s3" {
    bucket = "company-terraform-state"
    key    = "datadog-monitors/terraform.tfstate"
    region = "us-east-1"
  }
}

Or use Terraform Cloud for team collaboration.
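For the S3 backend shown above, encryption and state locking are worth enabling once CI and engineers' laptops can both run apply. A hedged sketch, reusing the bucket and key from the example; use_lockfile requires Terraform 1.10 or later, and older versions rely on a DynamoDB table via dynamodb_table instead:

```hcl
terraform {
  backend "s3" {
    bucket       = "company-terraform-state"
    key          = "datadog-monitors/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true # Server-side encryption for the state object
    use_lockfile = true # S3-native locking so concurrent applies cannot collide
  }
}
```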

6. Implement CI/CD

Example GitHub Actions:

name: Datadog Monitoring

on:
  pull_request:
    paths:
      - '**.tf'
  push:
    branches: [main]

jobs:
  terraform:
    runs-on: ubuntu-latest
    # Job-level env so init, plan, and apply all receive the variables
    env:
      TF_VAR_datadog_api_key: ${{ secrets.DATADOG_API_KEY }}
      TF_VAR_datadog_app_key: ${{ secrets.DATADOG_APP_KEY }}
      TF_VAR_organization_name: YourCompany
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Terraform Format
        run: terraform fmt -check -recursive

      - name: Terraform Init
        run: terraform init

      - name: Terraform Plan
        run: terraform plan -input=false
        if: github.event_name == 'pull_request'

      - name: Terraform Apply
        run: terraform apply -auto-approve -input=false
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'

Troubleshooting

Monitor Not Alerting

Check:

  1. Metric exists: Verify trace.web.request.duration is being sent to Datadog
  2. Tags match: Ensure service:web-app,env:production tags are on metrics
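  3. Threshold appropriate: Check whether values actually exceed the threshold, and that the metric's unit matches the threshold's unit (for example, seconds versus milliseconds)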
  4. No data handling: Set notify_no_data = true to catch missing metrics

Debug query:

# In Datadog Metrics Explorer
trace.web.request.duration{service:web-app,env:production}

Duplicate Monitors

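Cause: Monitors that already exist in Datadog (created in the UI, or orphaned when a state file was lost) are not tracked in Terraform state, so terraform apply creates second copies. Renaming a monitor updates it in place, while renaming a service key makes Terraform destroy and recreate that service's monitors.

Solution: Import existing monitors or destroy/recreate

# Import existing monitor (single quotes keep the shell from interpreting the brackets and double quotes)
terraform import 'module.service_monitoring["web-app"].datadog_monitor.latency_p95' 1234567

# Or destroy and recreate
terraform destroy -target='module.service_monitoring["web-app"]'
terraform apply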
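The same adoption and renaming can also be expressed declaratively, which keeps the change reviewable in a pull request. This sketch assumes Terraform 1.5 or later for import blocks and 1.1 or later for moved blocks (the guide's required_version would need raising); "batch-worker" is a hypothetical new key that would also have to replace "worker-service" in local.services.

```hcl
# Adopt a monitor that already exists in Datadog.
# 1234567 is the placeholder monitor ID used in the command above.
import {
  to = module.service_monitoring["web-app"].datadog_monitor.latency_p95
  id = "1234567"
}

# Rename a service key without destroying and recreating its monitors
moved {
  from = module.service_monitoring["worker-service"]
  to   = module.service_monitoring["batch-worker"]
}
```

Both blocks belong in the root module and can be deleted once the apply has succeeded.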

API Rate Limiting

Error:

Error: API rate limit exceeded

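Solution: Lower Terraform's concurrency with terraform apply -parallelism=2 (the default is 10), or add delays between resource creation. The time_sleep resource comes from the hashicorp/time provider and only delays resources that list it in their own depends_on:

resource "time_sleep" "wait_between_monitors" {
  create_duration = "2s"
  depends_on      = [datadog_monitor.latency_p95]
}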

Alert Fatigue

Symptoms: Too many alerts, team ignoring notifications

Solutions:

  1. Increase thresholds: Start with looser thresholds, then tighten them as you learn the service's normal range
  2. Add warning thresholds: Two-tier alerting (warning + critical)
  3. Tune evaluation windows: Use last_10m instead of last_5m for stability
  4. Use anomaly detection: For metrics with variable baselines

# Anomaly detection example
query = "avg(last_1h):anomalies(avg:trace.web.request.duration{service:${var.service_name}}, 'agile', 3) >= 1"

Next Steps

Now that you have golden signals monitoring:

  1. Add infrastructure module: Host-level monitoring (disk, network, processes)
  2. Add database module: PostgreSQL/MySQL specific monitors
  3. Add application module: Runtime-specific monitoring (JVM heap, Node.js event loop)
  4. Create dashboards: Terraform can manage dashboards too
  5. Implement SLOs: Define and track Service Level Objectives
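As a starting point for step 5, the error-rate metrics already used by the golden-signals module can back a metric-based SLO. This is a sketch rather than a recommended objective: the resource name, the 30-day window, and the 99.9% target are assumptions, and it is written as if it lived inside the golden-signals module so that it can reuse the module's variables.

```hcl
resource "datadog_service_level_objective" "availability" {
  name        = "[${var.environment}] ${var.service_name} - Availability"
  type        = "metric"
  description = "Share of requests to ${var.service_name} that complete without error."

  query {
    # Good events: all requests minus errored requests
    numerator   = "sum:trace.web.request.hits{service:${var.service_name},env:${var.environment}}.as_count() - sum:trace.web.request.errors{service:${var.service_name},env:${var.environment}}.as_count()"
    denominator = "sum:trace.web.request.hits{service:${var.service_name},env:${var.environment}}.as_count()"
  }

  thresholds {
    timeframe = "30d"
    target    = 99.9  # Hypothetical objective
    warning   = 99.95 # Early-warning level, must be above the target
  }

  tags = [
    "service:${var.service_name}",
    "env:${var.environment}",
    "terraform:true"
  ]
}
```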

Advanced: Complete Monitoring Orchestrator

For more complex setups, create a complete-monitoring module that conditionally enables monitoring suites:

# modules/complete-monitoring/main.tf
module "golden_signals" {
  source = "../golden-signals"
  count  = contains(var.monitoring_suites, "golden-signals") ? 1 : 0

  service_name = var.service_name
  # ... pass variables ...
}

module "database_monitoring" {
  source = "../database"
  count  = contains(var.monitoring_suites, "database") ? 1 : 0

  service_name  = var.service_name
  database_type = var.database_type
  # ... pass variables ...
}

# Repeat the same count pattern for the remaining suites
# (infrastructure, application, security, business)

Then in main.tf:

module "service_monitoring" {
  source   = "./modules/complete-monitoring"
  for_each = local.services

  monitoring_suites = each.value.monitoring_suites  # ["golden-signals", "database"]
  # ... rest of config ...
}
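The orchestrator also needs a declaration for monitoring_suites and a way to expose IDs from modules that may not be instantiated. A minimal sketch under these assumptions: the allowed suite names mirror the module directories created in Part 1, and the output name golden_signals_monitor_ids is hypothetical.

```hcl
# modules/complete-monitoring/variables.tf
variable "monitoring_suites" {
  description = "Monitoring suites to enable for this service"
  type        = list(string)
  default     = ["golden-signals"]

  validation {
    condition = alltrue([
      for suite in var.monitoring_suites :
      contains(["golden-signals", "infrastructure", "database", "application", "security", "business"], suite)
    ])
    error_message = "Each monitoring suite must be one of: golden-signals, infrastructure, database, application, security, business."
  }
}

# modules/complete-monitoring/outputs.tf
output "golden_signals_monitor_ids" {
  description = "Golden-signals monitor IDs, or null when the suite is disabled"
  # With count set to 0 or 1 the splat yields zero or one element; one()
  # unwraps it, returning null for the empty case
  value = one(module.golden_signals[*].monitor_ids)
}
```

Root-level outputs that currently read monitors.monitor_ids would need to reference this output instead once the module source is switched.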

Last updated: November 2025