How to manage Datadog monitors with Terraform at scale

Introduction

Managing monitoring at scale is challenging. As your infrastructure grows from a handful of services to dozens or hundreds, maintaining consistent, comprehensive monitoring becomes increasingly complex. Manually creating Datadog monitors through the UI doesn’t scale, leads to inconsistencies, and makes it difficult to apply organizational standards.

This guide walks through building a modular Terraform framework for managing Datadog monitors that scales from small teams to enterprise deployments. The framework emphasizes composability, allowing you to enable only the monitoring suites needed for each service while maintaining consistency across your entire infrastructure.

What you’ll learn:

Build a modular monitoring framework with Terraform
Implement the Four Golden Signals (latency, traffic, errors, saturation)
Configure tier-based alert routing
Create service-specific monitoring profiles
Manage monitors as code with version control
Scale monitoring across multiple services and environments

Why Terraform for Datadog?

Version Control: Track all monitoring changes in Git
Consistency: Apply organizational standards automatically
Reusability: Define monitoring patterns once, apply to many services
Collaboration: Review monitoring changes like code changes
Automation: Deploy monitoring alongside application code

Prerequisites

Before starting, ensure you have:

Datadog account with API and Application keys
Terraform installed (v1.0+)
Basic Terraform knowledge (providers, resources, modules)
Datadog permissions to create monitors and dashboards
Git for version control (recommended)

Recommended knowledge:

SRE principles and Golden Signals
Your application architecture and dependencies
Alert routing preferences (Slack, PagerDuty, email)

Architecture Overview

The Problem: Monitoring Sprawl

Without a framework:

Service A: 47 monitors (manually created, inconsistent thresholds)
Service B: 12 monitors (missing error rate monitoring)
Service C: 89 monitors (duplicates, alert fatigue)
Service D: 3 monitors (woefully inadequate)

With this framework:

Service A: golden-signals + infrastructure + database = 23 focused monitors
Service B: golden-signals + application = 15 comprehensive monitors
Service C: golden-signals + infrastructure = 18 optimized monitors
Service D: golden-signals = 8 essential monitors

Module Hierarchy

Root Configuration (main.tf)
    └── Complete Monitoring Module (per service)
            ├── Golden Signals Module
            ├── Infrastructure Monitors Module
            ├── Database Monitors Module
            ├── Application Monitors Module
            ├── Security Monitors Module
            └── Business Monitors Module

How it works:

Services are defined in main.tf with configuration (runtime, database, tier, monitoring suites)
Complete Monitoring orchestrates specialized modules based on each service’s monitoring_suites setting
Specialized modules create targeted monitors for their domain
Alert routing is automatic based on service tier and environment

Part 1: Project Setup

Step 1: Create Project Structure

mkdir -p ~/terraform/datadog-monitors
cd ~/terraform/datadog-monitors

# Create module directories
mkdir -p modules/{golden-signals,infrastructure,database,application,security,business,complete-monitoring}

# Initialize git
git init

Step 2: Configure .gitignore

cat > .gitignore << 'EOF'
# Terraform files
*.tfstate
*.tfstate.*
*.tfstate.backup
.terraform/
# Keep .terraform.lock.hcl under version control so provider versions stay pinned

# Sensitive files
terraform.tfvars
*.auto.tfvars
.env

# IDE
.idea/
.vscode/
*.swp
EOF

Step 3: Get Datadog API Credentials

In the Datadog UI:

Navigate to Organization Settings → API Keys
Create API Key (or use existing)
Navigate to Organization Settings → Application Keys
Create Application Key with appropriate permissions

Store credentials securely:

# Option 1: Environment variables (recommended)
export TF_VAR_datadog_api_key="your-api-key"
export TF_VAR_datadog_app_key="your-app-key"
export TF_VAR_organization_name="YourCompany"

# Option 2: terraform.tfvars (never commit!)
cat > terraform.tfvars << 'EOF'
datadog_api_key = "your-api-key"
datadog_app_key = "your-app-key"
organization_name = "YourCompany"
EOF

Part 2: Build the Golden Signals Module

The Four Golden Signals from Google’s SRE book form the foundation of effective monitoring.

Create Module Structure

cd modules/golden-signals

variables.tf

variable "service_name" {
  description = "Name of the service being monitored"
  type        = string
}

variable "environment" {
  description = "Environment (production, staging, development)"
  type        = string
}

variable "service_tier" {
  description = "Service tier (critical, important, standard)"
  type        = string
}

variable "alert_channels" {
  description = "List of alert notification channels"
  type        = list(string)
}

variable "tags" {
  description = "Additional tags for monitors"
  type        = list(string)
  default     = []
}

# Latency thresholds
variable "latency_p95_critical_ms" {
  description = "P95 latency critical threshold in milliseconds"
  type        = number
  default     = 1000
}

variable "latency_p95_warning_ms" {
  description = "P95 latency warning threshold in milliseconds"
  type        = number
  default     = 500
}

# Error rate thresholds
variable "error_rate_critical_pct" {
  description = "Error rate critical threshold percentage"
  type        = number
  default     = 5
}

variable "error_rate_warning_pct" {
  description = "Error rate warning threshold percentage"
  type        = number
  default     = 2
}

# Saturation thresholds
variable "cpu_critical_pct" {
  description = "CPU usage critical threshold percentage"
  type        = number
  default     = 90
}

variable "memory_critical_pct" {
  description = "Memory usage critical threshold percentage"
  type        = number
  default     = 90
}
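Because service_tier later drives both the monitor priority and the alert routing, a misspelled tier would quietly be treated as the lowest tier instead of failing. One way to catch that at plan time is a validation block. The sketch below is meant as a replacement for the service_tier definition above, not an addition alongside it, and the allowed values are taken from the variable's own description.

```hcl
variable "service_tier" {
  description = "Service tier (critical, important, standard)"
  type        = string

  validation {
    # Reject any value outside the three tiers the module knows about
    condition     = contains(["critical", "important", "standard"], var.service_tier)
    error_message = "The service_tier value must be one of: critical, important, standard."
  }
}
```

The same pattern could be applied to environment if your organization uses a fixed set of environment names.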

main.tf

# Latency Monitor - P95 Response Time
resource "datadog_monitor" "latency_p95" {
  name    = "[${var.environment}] ${var.service_name} - High P95 Latency"
  type    = "metric alert"
  message = <<-EOT
    **Service**: ${var.service_name}
    **Environment**: ${var.environment}
    **Severity**: {{#is_alert}}CRITICAL{{/is_alert}}{{#is_warning}}WARNING{{/is_warning}}

    P95 latency has exceeded acceptable thresholds.

    **Current Value**: {{value}}ms
    **Threshold**: Critical: ${var.latency_p95_critical_ms}ms | Warning: ${var.latency_p95_warning_ms}ms

    **Runbook**: https://wiki.company.com/runbooks/${var.service_name}/latency
    **Grafana**: https://grafana.company.com/d/${var.service_name}

    ${join(" ", var.alert_channels)}
  EOT

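  # Thresholds are expressed in milliseconds. Before relying on this query, confirm in
  # Metrics Explorer that the metric supports p95 aggregation and check its unit:
  # APM duration metrics are commonly reported in seconds, in which case convert the thresholds.
  query = "avg(last_5m):p95:trace.web.request.duration{service:${var.service_name},env:${var.environment}} > ${var.latency_p95_critical_ms}"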

  monitor_thresholds {
    critical = var.latency_p95_critical_ms
    warning  = var.latency_p95_warning_ms
  }

  notify_no_data    = true
  no_data_timeframe = 10
  renotify_interval = 60

  priority = var.service_tier == "critical" ? 1 : (var.service_tier == "important" ? 2 : 3)

  tags = concat(
    [
      "service:${var.service_name}",
      "env:${var.environment}",
      "tier:${var.service_tier}",
      "signal:latency",
      "terraform:true"
    ],
    var.tags
  )
}

# Latency Monitor - P99 Response Time
resource "datadog_monitor" "latency_p99" {
  name    = "[${var.environment}] ${var.service_name} - High P99 Latency"
  type    = "metric alert"
  message = <<-EOT
    **Service**: ${var.service_name}
    **Tail latency** (P99) is degraded.

    **Current Value**: {{value}}ms
    **Impact**: Worst-case user experience

    ${join(" ", var.alert_channels)}
  EOT

  query = "avg(last_5m):p99:trace.web.request.duration{service:${var.service_name},env:${var.environment}} > ${var.latency_p95_critical_ms * 2}"

  monitor_thresholds {
    critical = var.latency_p95_critical_ms * 2
    warning  = var.latency_p95_critical_ms * 1.5
  }

  notify_no_data    = true
  no_data_timeframe = 10

  tags = concat(
    [
      "service:${var.service_name}",
      "env:${var.environment}",
      "tier:${var.service_tier}",
      "signal:latency",
      "terraform:true"
    ],
    var.tags
  )
}

# Traffic Monitor - Request Rate Drop
resource "datadog_monitor" "traffic_drop" {
  name    = "[${var.environment}] ${var.service_name} - Traffic Drop Detected"
  type    = "query alert"
  message = <<-EOT
    **Service**: ${var.service_name}
    **Alert**: Significant traffic drop detected

    Request rate has dropped by more than 50% compared to the previous hour.
    This could indicate:
    - Upstream service failure
    - Load balancer misconfiguration
    - DNS issues
    - Deployment problem

    **Action Required**: Investigate immediately

    ${join(" ", var.alert_channels)}
  EOT

  query = "pct_change(avg(last_5m),last_1h):sum:trace.web.request.hits{service:${var.service_name},env:${var.environment}}.as_count() < -50"

  monitor_thresholds {
    critical = -50
    warning  = -30
  }

  notify_no_data    = true
  no_data_timeframe = 10

  tags = concat(
    [
      "service:${var.service_name}",
      "env:${var.environment}",
      "tier:${var.service_tier}",
      "signal:traffic",
      "terraform:true"
    ],
    var.tags
  )
}

# Errors Monitor - Error Rate Percentage
resource "datadog_monitor" "error_rate" {
  name    = "[${var.environment}] ${var.service_name} - High Error Rate"
  type    = "query alert"
  message = <<-EOT
    **Service**: ${var.service_name}
    **Alert**: Error rate exceeds acceptable threshold

    **Current Error Rate**: {{value}}%
    **Threshold**: Critical: ${var.error_rate_critical_pct}% | Warning: ${var.error_rate_warning_pct}%

    **Common Causes**:
    - Database connection issues
    - Downstream service failures
    - Invalid input validation
    - Recent deployment issues

    **Runbook**: https://wiki.company.com/runbooks/${var.service_name}/errors

    ${join(" ", var.alert_channels)}
  EOT

  query = "avg(last_5m):(sum:trace.web.request.errors{service:${var.service_name},env:${var.environment}}.as_count() / sum:trace.web.request.hits{service:${var.service_name},env:${var.environment}}.as_count()) * 100 > ${var.error_rate_critical_pct}"

  monitor_thresholds {
    critical = var.error_rate_critical_pct
    warning  = var.error_rate_warning_pct
  }

  notify_no_data    = false

  tags = concat(
    [
      "service:${var.service_name}",
      "env:${var.environment}",
      "tier:${var.service_tier}",
      "signal:errors",
      "terraform:true"
    ],
    var.tags
  )
}

# Saturation Monitor - CPU Usage
resource "datadog_monitor" "cpu_saturation" {
  name    = "[${var.environment}] ${var.service_name} - High CPU Usage"
  type    = "metric alert"
  message = <<-EOT
    **Service**: ${var.service_name}
    **Alert**: CPU saturation detected

    **Current CPU**: {{value}}%
    **Threshold**: ${var.cpu_critical_pct}%

    High CPU usage can lead to:
    - Increased latency
    - Request timeouts
    - Service instability

    **Actions**:
    1. Check for CPU-intensive operations
    2. Review recent deployments
    3. Consider horizontal scaling

    ${join(" ", var.alert_channels)}
  EOT

  query = "avg(last_10m):avg:system.cpu.user{service:${var.service_name},env:${var.environment}} by {host} > ${var.cpu_critical_pct}"

  monitor_thresholds {
    critical = var.cpu_critical_pct
    warning  = var.cpu_critical_pct - 10
  }

  notify_no_data    = true
  no_data_timeframe = 20

  tags = concat(
    [
      "service:${var.service_name}",
      "env:${var.environment}",
      "tier:${var.service_tier}",
      "signal:saturation",
      "terraform:true"
    ],
    var.tags
  )
}

# Saturation Monitor - Memory Usage
resource "datadog_monitor" "memory_saturation" {
  name    = "[${var.environment}] ${var.service_name} - High Memory Usage"
  type    = "metric alert"
  message = <<-EOT
    **Service**: ${var.service_name}
    **Alert**: Memory saturation detected

    **Usable Memory (fraction of total)**: {{value}}
    **Threshold**: Memory usage above ${var.memory_critical_pct}%

    **Potential Issues**:
    - Memory leaks
    - Insufficient capacity
    - Caching issues

    **Actions**:
    1. Check for memory leaks
    2. Review heap dumps (JVM) / memory profiles
    3. Consider vertical scaling

    ${join(" ", var.alert_channels)}
  EOT

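  # system.mem.pct_usable is reported as a fraction (0-1), so the percentage thresholds are divided by 100
  query = "avg(last_10m):avg:system.mem.pct_usable{service:${var.service_name},env:${var.environment}} by {host} < ${(100 - var.memory_critical_pct) / 100}"

  monitor_thresholds {
    critical = (100 - var.memory_critical_pct) / 100
    warning  = (100 - (var.memory_critical_pct - 10)) / 100
  }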

  notify_no_data    = true
  no_data_timeframe = 20

  tags = concat(
    [
      "service:${var.service_name}",
      "env:${var.environment}",
      "tier:${var.service_tier}",
      "signal:saturation",
      "terraform:true"
    ],
    var.tags
  )
}
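Each monitor above repeats the same four base tags, and only latency_p95 sets a priority. A locals block in the module could compute both once. This is a minimal refactoring sketch, and the names common_tags and priority are illustrative rather than part of the original module.

```hcl
locals {
  # Tags shared by every monitor in this module
  common_tags = concat(
    [
      "service:${var.service_name}",
      "env:${var.environment}",
      "tier:${var.service_tier}",
      "terraform:true",
    ],
    var.tags
  )

  # Tier-to-priority mapping, currently computed inline in latency_p95 only
  priority = var.service_tier == "critical" ? 1 : (var.service_tier == "important" ? 2 : 3)
}

# Example use inside a monitor resource:
#   priority = local.priority
#   tags     = concat(local.common_tags, ["signal:latency"])
```

With this in place, each resource only adds its own signal tag, and a change to the base tag set happens in one place.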

outputs.tf

output "monitor_ids" {
  description = "Map of monitor names to IDs"
  value = {
    latency_p95        = datadog_monitor.latency_p95.id
    latency_p99        = datadog_monitor.latency_p99.id
    traffic_drop       = datadog_monitor.traffic_drop.id
    error_rate         = datadog_monitor.error_rate.id
    cpu_saturation     = datadog_monitor.cpu_saturation.id
    memory_saturation  = datadog_monitor.memory_saturation.id
  }
}

output "monitor_urls" {
  description = "Datadog URLs for each monitor"
  value = {
    for name, id in {
      latency_p95       = datadog_monitor.latency_p95.id
      latency_p99       = datadog_monitor.latency_p99.id
      traffic_drop      = datadog_monitor.traffic_drop.id
      error_rate        = datadog_monitor.error_rate.id
      cpu_saturation    = datadog_monitor.cpu_saturation.id
      memory_saturation = datadog_monitor.memory_saturation.id
    } : name => "https://app.datadoghq.com/monitors/${id}"
  }
}

Part 3: Create the Root Configuration

Back to project root

cd ~/terraform/datadog-monitors

providers.tf

terraform {
  required_version = ">= 1.0"

  required_providers {
    datadog = {
      source  = "DataDog/datadog"
      version = "~> 3.0"
    }
  }
}

provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
  api_url = "https://api.datadoghq.com/"  # Use US5: https://api.us5.datadoghq.com/ if needed
}
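If your organization is hosted on a different Datadog site, the endpoint can be made configurable instead of editing the provider block. The sketch below assumes a new variable named datadog_api_url (not part of the original configuration) and would replace the provider block above. For example, the EU site uses https://api.datadoghq.eu/.

```hcl
variable "datadog_api_url" {
  description = "Datadog API endpoint for your site"
  type        = string
  default     = "https://api.datadoghq.com/"
}

provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
  api_url = var.datadog_api_url
}
```

Note that the monitor URLs in the outputs are hardcoded to app.datadoghq.com, so they would need the same treatment on other sites.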

variables.tf

variable "datadog_api_key" {
  description = "Datadog API key"
  type        = string
  sensitive   = true
}

variable "datadog_app_key" {
  description = "Datadog Application key"
  type        = string
  sensitive   = true
}

variable "organization_name" {
  description = "Organization name for tagging"
  type        = string
}

main.tf

locals {
  # Service definitions
  services = {
    "web-app" = {
      environment       = "production"
      tier             = "critical"
      team             = "frontend"
      runtime          = "node"
      database_type    = "postgresql"
      monitoring_suites = ["golden-signals"]
      custom_tags      = ["customer-facing"]

      # Custom thresholds
      latency_p95_critical_ms = 800
      error_rate_critical_pct = 3
    }

    "api-service" = {
      environment       = "production"
      tier             = "critical"
      team             = "backend"
      runtime          = "jvm"
      database_type    = "postgresql"
      monitoring_suites = ["golden-signals"]
      custom_tags      = ["api", "external"]

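      # Warning thresholds must stay below the critical ones, so the
      # defaults (500 ms, 2%) are overridden here with illustrative values
      latency_p95_critical_ms = 500
      latency_p95_warning_ms  = 300
      error_rate_critical_pct = 1
      error_rate_warning_pct  = 0.5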
    }

    "worker-service" = {
      environment       = "production"
      tier             = "important"
      team             = "backend"
      runtime          = "python"
      queue_type       = "sqs"
      monitoring_suites = ["golden-signals"]
      custom_tags      = ["async"]

      error_rate_critical_pct = 5
    }
  }

  # Alert routing configuration
  alert_configs = {
    critical = {
      production = ["@slack-oncall", "@pagerduty-critical"]
      staging    = ["@slack-dev-alerts"]
    }
    important = {
      production = ["@slack-alerts", "@pagerduty-high"]
      staging    = ["@slack-dev"]
    }
    standard = {
      production = ["@slack-alerts"]
      staging    = ["@slack-dev"]
    }
  }
}

# Deploy golden signals monitoring for each service
module "service_monitoring" {
  source   = "./modules/golden-signals"
  for_each = local.services

  service_name = each.key
  environment  = each.value.environment
  service_tier = each.value.tier

  # Alert routing based on tier and environment
  alert_channels = lookup(
    lookup(local.alert_configs, each.value.tier, local.alert_configs.standard),
    each.value.environment,
    ["@slack-dev"]
  )

  # Custom thresholds (use defaults if not specified)
  latency_p95_critical_ms = lookup(each.value, "latency_p95_critical_ms", 1000)
  latency_p95_warning_ms  = lookup(each.value, "latency_p95_warning_ms", 500)
  error_rate_critical_pct = lookup(each.value, "error_rate_critical_pct", 5)
  error_rate_warning_pct  = lookup(each.value, "error_rate_warning_pct", 2)

  tags = concat(
    [
      "team:${each.value.team}",
      "runtime:${lookup(each.value, "runtime", "unknown")}",
      "org:${var.organization_name}"
    ],
    lookup(each.value, "custom_tags", [])
  )
}
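The module call above falls back to fixed warning defaults (500 ms and 2%) even when a service overrides only its critical threshold, which can leave the warning at or above the critical value. One possible alternative, sketched below, derives a missing warning threshold from the critical one. The 0.5 and 0.4 ratios are assumptions chosen so that the results match the original defaults (500 ms and 2%) when nothing is overridden.

```hcl
  # Fragment: replaces the four threshold arguments in module "service_monitoring"
  latency_p95_critical_ms = lookup(each.value, "latency_p95_critical_ms", 1000)
  latency_p95_warning_ms = lookup(
    each.value,
    "latency_p95_warning_ms",
    lookup(each.value, "latency_p95_critical_ms", 1000) * 0.5 # assumed ratio
  )
  error_rate_critical_pct = lookup(each.value, "error_rate_critical_pct", 5)
  error_rate_warning_pct = lookup(
    each.value,
    "error_rate_warning_pct",
    lookup(each.value, "error_rate_critical_pct", 5) * 0.4 # assumed ratio
  )
```

An explicit warning value in a service definition still takes precedence, because the derived value is only the lookup default.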

outputs.tf

output "monitoring_summary" {
  description = "Summary of deployed monitoring"
  value = {
    services_monitored = keys(local.services)
    total_services     = length(local.services)
    monitor_links = {
      for service, config in local.services :
      service => "https://app.datadoghq.com/monitors/manage?q=service%3A${service}"
    }
  }
}

output "service_monitor_ids" {
  description = "Monitor IDs by service"
  value = {
    for service, monitors in module.service_monitoring :
    service => monitors.monitor_ids
  }
}

Part 4: Deploy Monitoring

Step 1: Initialize Terraform

terraform init

Expected output:

Initializing modules...
- service_monitoring in modules/golden-signals

Initializing provider plugins...
- Finding datadog/datadog versions matching "~> 3.0"...
- Installing datadog/datadog v3.37.0...

Terraform has been successfully initialized!

Step 2: Plan Deployment

terraform plan

Review the plan:

Terraform will perform the following actions:

  # module.service_monitoring["web-app"].datadog_monitor.latency_p95 will be created
  + resource "datadog_monitor" "latency_p95" {
      + name    = "[production] web-app - High P95 Latency"
      + type    = "metric alert"
      + query   = "avg(last_5m):p95:trace.web.request.duration{service:web-app,env:production} > 800"
      + priority = 1
      + tags    = [
          + "service:web-app",
          + "env:production",
          + "tier:critical",
          + "signal:latency",
          + "terraform:true",
          + "team:frontend",
          + "runtime:node",
          + "org:YourCompany",
          + "customer-facing"
        ]
    }

  # module.service_monitoring["web-app"].datadog_monitor.error_rate will be created
  ...

Plan: 18 to add, 0 to change, 0 to destroy.

Step 3: Apply Configuration

terraform apply

Type yes when prompted.

Expected output:

module.service_monitoring["web-app"].datadog_monitor.latency_p95: Creating...
module.service_monitoring["web-app"].datadog_monitor.latency_p99: Creating...
module.service_monitoring["api-service"].datadog_monitor.error_rate: Creating...
...
Apply complete! Resources: 18 added, 0 changed, 0 destroyed.

Outputs:

monitoring_summary = {
  "monitor_links" = {
    "api-service" = "https://app.datadoghq.com/monitors/manage?q=service%3Aapi-service"
    "web-app" = "https://app.datadoghq.com/monitors/manage?q=service%3Aweb-app"
    "worker-service" = "https://app.datadoghq.com/monitors/manage?q=service%3Aworker-service"
  }
  "services_monitored" = ["api-service", "web-app", "worker-service"]
  "total_services" = 3
}

Step 4: Verify in Datadog

Check Datadog UI:

Navigate to Monitors → Manage Monitors
Filter by terraform:true tag
Verify monitors are created and in correct state

Part 5: Managing Monitors

Add a New Service

Edit main.tf and add to local.services:

"cache-service" = {
  environment       = "production"
  tier             = "important"
  team             = "infrastructure"
  runtime          = "redis"
  monitoring_suites = ["golden-signals"]
  custom_tags      = ["cache"]

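  latency_p95_critical_ms = 100  # Cache should be fast
  latency_p95_warning_ms  = 50   # Must stay below the critical threshold
  error_rate_critical_pct = 1
  error_rate_warning_pct  = 0.5  # Must stay below the critical threshold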
}

Apply changes:

terraform plan
terraform apply
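As more services are added, a typed schema can catch a missing or misspelled key before it reaches Datadog. The sketch below assumes Terraform 1.3 or later (optional attributes with defaults are not available in 1.0) and assumes the service definitions move from local.services into a variable named services. The attribute names and default values mirror the ones used in this guide.

```hcl
# Requires Terraform 1.3 or later (optional attributes with defaults).
variable "services" {
  description = "Service definitions keyed by service name"
  type = map(object({
    environment             = string
    tier                    = string
    team                    = string
    runtime                 = optional(string, "unknown")
    database_type           = optional(string)
    queue_type              = optional(string)
    monitoring_suites       = optional(list(string), ["golden-signals"])
    custom_tags             = optional(list(string), [])
    latency_p95_critical_ms = optional(number, 1000)
    latency_p95_warning_ms  = optional(number, 500)
    error_rate_critical_pct = optional(number, 5)
    error_rate_warning_pct  = optional(number, 2)
  }))
}
```

With this type in place, the module call could use for_each = var.services and read each.value.latency_p95_critical_ms directly, without the lookup() fallbacks.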

Adjust Thresholds

Scenario: Error rate threshold too sensitive for worker-service

Edit threshold in main.tf:

"worker-service" = {
  # ... existing config ...
  error_rate_critical_pct = 10  # Changed from 5
  error_rate_warning_pct  = 7   # Added warning threshold
}

Apply:

terraform apply

Only the affected monitors will be updated.
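When thresholds are adjusted independently, it is easy to end up with a warning value at or above the critical one, which only surfaces when the API call fails during apply. A precondition can catch this at plan time. The fragment below is a sketch that assumes Terraform 1.2 or later and would sit inside the error_rate monitor resource in the golden-signals module.

```hcl
# Fragment: place inside resource "datadog_monitor" "error_rate" { ... }
# Requires Terraform 1.2 or later.
lifecycle {
  precondition {
    condition     = var.error_rate_warning_pct < var.error_rate_critical_pct
    error_message = "error_rate_warning_pct must be lower than error_rate_critical_pct."
  }
}
```

The same check applies to the latency thresholds in the latency_p95 resource.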

Update Alert Routing

Scenario: Add Opsgenie for critical services

Edit alert_configs:

alert_configs = {
  critical = {
    production = [
      "@slack-oncall",
      "@pagerduty-critical",
      "@opsgenie-critical"  # Added
    ]
    staging = ["@slack-dev-alerts"]
  }
  # ... rest unchanged ...
}

Apply the changes. Every monitor for a critical-tier production service is updated with the new channel.

Best Practices

1. Version Control Everything

# Commit monitoring changes like code
git add main.tf modules/
git commit -m "monitoring: add cache-service golden signals

- Add Redis cache service monitoring
- Set aggressive latency threshold (100ms)
- Route alerts to infrastructure team"

git push origin main

2. Use Consistent Tagging

Required tags:

service:name - Service identifier
env:environment - production/staging/development
tier:criticality - critical/important/standard
team:owner - Owning team
terraform:true - Managed by Terraform

Recommended tags:

runtime:technology - node/jvm/python/go
component:type - api/worker/cache
Custom business tags

3. Test in Staging First

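# Deploy to staging first
# Note: the module uses the map key as service_name, so this entry
# queries metrics tagged service:api-service-staging
"api-service-staging" = {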
  environment = "staging"
  tier       = "important"  # Lower tier for staging
  # ... same config as production ...
}

# After validation, promote to production
"api-service" = {
  environment = "production"
  tier       = "critical"
  # ... same thresholds ...
}
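Because the module call sets service_name = each.key, a staging entry keyed "api-service-staging" filters on service:api-service-staging, which may not match how the metrics are actually tagged. One possible workaround, sketched below, is an optional service_name attribute that overrides the key. The attribute is hypothetical and not part of the original configuration.

```hcl
# In local.services: "service_name" is a hypothetical, optional attribute
"api-service-staging" = {
  service_name = "api-service" # value of the service tag on the metrics
  environment  = "staging"
  tier         = "important"
  team         = "backend"
}

# In module "service_monitoring": fall back to the map key when it is absent
service_name = lookup(each.value, "service_name", each.key)
```

Monitor names stay distinct between environments because they already include the environment prefix.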

4. Document Runbooks

Include runbook links in monitor messages:

message = <<-EOT
  **Runbook**: https://wiki.company.com/runbooks/${var.service_name}/latency
  **Grafana**: https://grafana.company.com/d/${var.service_name}
  **On-Call**: https://pagerduty.com/schedules/${var.service_name}
EOT

5. Use Remote State

# backend.tf
terraform {
  backend "s3" {
    bucket = "company-terraform-state"
    key    = "datadog-monitors/terraform.tfstate"
    region = "us-east-1"
  }
}

Or use Terraform Cloud for team collaboration.
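For the S3 backend, teams commonly add encryption at rest and state locking so that two applies cannot run at once. The sketch below extends the backend block above under those assumptions. The DynamoDB table name is a placeholder, and the table needs a string partition key named LockID.

```hcl
# backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "datadog-monitors/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                    # server-side encryption of the state object
    dynamodb_table = "terraform-state-locks" # hypothetical table name, used for state locking
  }
}
```

After changing backend settings, run terraform init again so Terraform can reconfigure the backend.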

6. Implement CI/CD

Example GitHub Actions:

name: Datadog Monitoring

on:
  pull_request:
    paths:
      - '**.tf'
  push:
    branches: [main]

jobs:
  terraform:
    runs-on: ubuntu-latest
    # Every Terraform step needs the variables, so set them once at the job level
    env:
      TF_VAR_datadog_api_key: ${{ secrets.DATADOG_API_KEY }}
      TF_VAR_datadog_app_key: ${{ secrets.DATADOG_APP_KEY }}
      TF_VAR_organization_name: YourCompany
    steps:
      - uses: actions/checkout@v3

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2

      - name: Terraform Format
        run: terraform fmt -check -recursive

      - name: Terraform Init
        run: terraform init

      - name: Terraform Plan
        run: terraform plan -input=false
        if: github.event_name == 'pull_request'

      - name: Terraform Apply
        run: terraform apply -auto-approve -input=false
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'

Troubleshooting

Monitor Not Alerting

Check:

Metric exists: Verify trace.web.request.duration is being sent to Datadog
Tags match: Ensure service:web-app,env:production tags are on metrics
Threshold appropriate: Check if values actually exceed threshold
No data handling: Set notify_no_data = true to catch missing metrics

Debug query:

# In Datadog Metrics Explorer
trace.web.request.duration{service:web-app,env:production}

Duplicate Monitors

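Cause: A monitor already exists in Datadog but is not tracked in Terraform state (for example, it was created in the UI, or the state file was lost), so Terraform creates a second copy. Changing a service key in local.services does not create duplicates, but it does destroy and recreate that service's monitors under new resource addresses.

Solution: Import the existing monitor into state, or destroy and recreate the module instance

# Import existing monitor (quote the address so the shell leaves the brackets and quotes alone)
terraform import 'module.service_monitoring["web-app"].datadog_monitor.latency_p95' 1234567

# Or destroy and recreate all monitors for one service
terraform destroy -target='module.service_monitoring["web-app"]'
terraform apply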

API Rate Limiting

Error:

Error: API rate limit exceeded

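Solution: Lower Terraform's concurrency (for example, terraform apply -parallelism=2), or add delays between resource creation

# Requires the hashicorp/time provider. The delay only takes effect for
# resources that list this time_sleep in their own depends_on.
resource "time_sleep" "wait_between_monitors" {
  create_duration = "2s"
  depends_on      = [datadog_monitor.latency_p95]
}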

Alert Fatigue

Symptoms: Too many alerts, team ignoring notifications

Solutions:

Increase thresholds: Start conservative, tighten over time
Add warning thresholds: Two-tier alerting (warning + critical)
Tune evaluation windows: Use last_10m instead of last_5m for stability
Use anomaly detection: For metrics with variable baselines

# Anomaly detection example
query = "avg(last_1h):anomalies(avg:trace.web.request.duration{service:${var.service_name}}, 'agile', 3) >= 1"
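In a full resource, an anomaly monitor usually needs more than the query line: the anomalies() function takes extra options, and the provider expects a monitor_threshold_windows block. The sketch below shows one plausible shape for the golden-signals module. It swaps in the 'basic' algorithm (which needs no seasonality setting) in place of 'agile', and the resource name, window sizes, and interval are assumptions to check against the provider documentation for your version.

```hcl
resource "datadog_monitor" "latency_anomaly" {
  name    = "[${var.environment}] ${var.service_name} - Latency Anomaly"
  type    = "query alert"
  message = <<-EOT
    Latency for ${var.service_name} is deviating from its usual pattern.

    ${join(" ", var.alert_channels)}
  EOT

  # Alert when every point in the last 15 minutes is above the expected band
  query = "avg(last_4h):anomalies(avg:trace.web.request.duration{service:${var.service_name},env:${var.environment}}, 'basic', 3, direction='above', alert_window='last_15m', interval=60, count_default_zero='true') >= 1"

  monitor_thresholds {
    critical = 1
  }

  # Should match the alert_window used in the query
  monitor_threshold_windows {
    trigger_window  = "last_15m"
    recovery_window = "last_15m"
  }

  tags = concat(
    [
      "service:${var.service_name}",
      "env:${var.environment}",
      "tier:${var.service_tier}",
      "signal:latency",
      "terraform:true"
    ],
    var.tags
  )
}
```

The threshold here is the fraction of anomalous points in the alert window, so a value below 1 makes the monitor more sensitive.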

Next Steps

Now that you have golden signals monitoring:

Add infrastructure module: Host-level monitoring (disk, network, processes)
Add database module: PostgreSQL/MySQL specific monitors
Add application module: Runtime-specific monitoring (JVM heap, Node.js event loop)
Create dashboards: Terraform can manage dashboards too
Implement SLOs: Define and track Service Level Objectives

Advanced: Complete Monitoring Orchestrator

For more complex setups, create a complete-monitoring module that conditionally enables monitoring suites:

# modules/complete-monitoring/main.tf
module "golden_signals" {
  source = "../golden-signals"
  count  = contains(var.monitoring_suites, "golden-signals") ? 1 : 0

  service_name = var.service_name
  # ... pass variables ...
}

module "database_monitoring" {
  source = "../database"
  count  = contains(var.monitoring_suites, "database") ? 1 : 0

  service_name  = var.service_name
  database_type = var.database_type
  # ... pass variables ...
}

# Enable monitoring suites conditionally

Then in main.tf:

module "service_monitoring" {
  source   = "./modules/complete-monitoring"
  for_each = local.services

  monitoring_suites = each.value.monitoring_suites  # ["golden-signals", "database"]
  # ... rest of config ...
}
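The orchestrator also needs its own inputs and outputs. The sketch below shows one way the complete-monitoring module could validate monitoring_suites and still expose monitor_ids, which the root outputs.tf reads from each module instance. The suite names are taken from the module directories created in Part 1, and the rest is an assumption about how you might wire it.

```hcl
# modules/complete-monitoring/variables.tf (sketch)
variable "monitoring_suites" {
  description = "Monitoring suites to enable for this service"
  type        = list(string)
  default     = ["golden-signals"]

  validation {
    # Every requested suite must correspond to a module directory
    condition = alltrue([
      for suite in var.monitoring_suites :
      contains(["golden-signals", "infrastructure", "database", "application", "security", "business"], suite)
    ])
    error_message = "Each monitoring suite must be one of: golden-signals, infrastructure, database, application, security, business."
  }
}

# modules/complete-monitoring/outputs.tf (sketch)
output "monitor_ids" {
  description = "Golden signals monitor IDs, or an empty map when the suite is disabled"
  # count turns module.golden_signals into a list, so index 0 may not exist
  value = try(module.golden_signals[0].monitor_ids, {})
}
```

Using try() keeps the output valid for services that do not enable the golden-signals suite.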

Related Guides

How to Automate Proxmox with Terraform
How to Set Up Datadog APM for Node.js Applications
How to Implement SRE Principles in Your Team

Resources

Datadog Terraform Provider: https://registry.terraform.io/providers/DataDog/datadog/latest/docs
Google SRE Book: https://sre.google/sre-book/monitoring-distributed-systems/
Four Golden Signals: https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals
Datadog Monitor API: https://docs.datadoghq.com/api/latest/monitors/

Last updated: November 2025
