dracoblue.net

deploy kubernetes gitlab runner on azure with terraform

For one of my use cases I had been given an azure subscription and the task to ensure that gitlab runners are running within it. You might be left with the choice to bootstrap a virtual machine (or more) and install the gitlab runners on them manually. But today I wanted to use terraform to deploy the aks (azure kubernetes service) cluster and to deploy the gitlab runner within it.

As prerequisites you need the following cli tools installed:

  1. az (from azure cli at https://docs.microsoft.com/en-us/cli/azure/install-azure-cli-linux)
  2. terraform (from terraform cli at https://learn.hashicorp.com/tutorials/terraform/install-cli)

Now log in to the azure cli with your credentials and choose the right subscription:

$ az login
$ az account set --subscription aabbccdd-eeff-aaee-bbcc-aabbccddeeff
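You can double check which subscription is active (the --query parameter extracts just its name):

```shell
$ az account show --query name --output tsv
```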

Create a file called terraform.tfvars containing all the credentials (the registration token can be found on the ci/cd settings page of your gitlab group, project or server):

registration_token="fHJASAK_ss_DD"
gitlab_url="https://gitlab.com"
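Since both values are referenced as terraform variables later on, the main.tf also needs matching variable declarations (the names mirror the keys in terraform.tfvars):

```hcl
variable "registration_token" {
  type        = string
  description = "gitlab runner registration token"
}

variable "gitlab_url" {
  type        = string
  description = "base url of the gitlab server"
}
```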

Then create a main.tf with the following parts:

terraform {
  backend "http" {
  }

  required_version = ">=0.12"

  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
      version = "2.90.0"
    }
    helm = {
      source = "hashicorp/helm"
      version = "2.4.1"
    }
  }
}

Initialize terraform with terraform init. If you want to keep the terraform state in gitlab, use the backend config described at https://docs.gitlab.com/ee/user/infrastructure/iac/terraform_state.html#get-started-using-local-development.
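If you go for the gitlab hosted state, the init call looks roughly like this (project id, state name, username and access token are placeholders you need to replace; see the linked docs for the exact parameters):

```shell
$ PROJECT_ID="<your-project-id>"
$ TF_STATE_NAME="gitlab-runners"
$ terraform init \
    -backend-config="address=https://gitlab.com/api/v4/projects/${PROJECT_ID}/terraform/state/${TF_STATE_NAME}" \
    -backend-config="lock_address=https://gitlab.com/api/v4/projects/${PROJECT_ID}/terraform/state/${TF_STATE_NAME}/lock" \
    -backend-config="unlock_address=https://gitlab.com/api/v4/projects/${PROJECT_ID}/terraform/state/${TF_STATE_NAME}/lock" \
    -backend-config="username=<your-gitlab-username>" \
    -backend-config="password=<your-personal-access-token>" \
    -backend-config="lock_method=POST" \
    -backend-config="unlock_method=DELETE" \
    -backend-config="retry_wait_min=5"
```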

Add the azurerm provider to the main.tf:

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "gitlab_runners_group" {
  name     = "gitlab-runners-resources"
  location = "West Europe"

  lifecycle {
    ignore_changes = [
      tags,
    ]
  }
}

resource "azurerm_kubernetes_cluster" "gitlab_runners_cluster" {
  name                = "gitlab-runners-kubernetes-cluster"
  location            = azurerm_resource_group.gitlab_runners_group.location
  resource_group_name = azurerm_resource_group.gitlab_runners_group.name
  dns_prefix          = "gitlab-runners-kubernetes"
  kubernetes_version  = "1.21.7"

  // we don't use the system node pool - workarounds like
  // https://pumpingco.de/blog/modify-aks-default-node-pool-in-terraform-without-redeploying-the-cluster/
  // are quite complicated
  default_node_pool {
    name       = "default"
    node_count = 1
    vm_size    = "standard_a2_v2"
  }

  identity {
    type = "SystemAssigned"
  }

  tags = {
    Environment = "Production"
  }

  addon_profile {
    aci_connector_linux {
      enabled = false
    }

    azure_policy {
      enabled = false
    }

    http_application_routing {
      enabled = false
    }

    kube_dashboard {
      enabled = false
    }

    oms_agent {
      enabled = false
    }
  }

  lifecycle {
    ignore_changes = [
      tags,
    ]
  }

}

resource "azurerm_kubernetes_cluster_node_pool" "gitlab_runners_spot_node_pool" {
  name                  = "builds"
  availability_zones    = []
  enable_host_encryption = false
  enable_node_public_ip = false
  fips_enabled          = false
  kubernetes_cluster_id = azurerm_kubernetes_cluster.gitlab_runners_cluster.id
  vm_size               = "standard_f16s_v2"
  priority              = "Spot"
  eviction_policy       = "Delete"
  spot_max_price        = -1
  enable_auto_scaling   = true
  max_count             = 3
  min_count             = 1
  max_pods              = 110
  node_count            = 1

  tags = {
    Environment = "Production"
  }

  node_labels = {
    "kubernetes.azure.com/scalesetpriority" = "spot"
  }
  node_taints = [
    "kubernetes.azure.com/scalesetpriority=spot:NoSchedule"
  ]

  lifecycle {
    ignore_changes = [
      node_count,
      orchestrator_version,
      os_disk_size_gb,
      os_sku,
      kubernetes_cluster_id,
      kubelet_disk_type,
      availability_zones
    ]
  }
}

This makes it possible to have:

  1. the default node pool as a system pool, with a very slim standard_a2_v2 machine (we do this because we cannot get rid of it :()
  2. a second node pool of type standard_f16s_v2 containing spot instances (spot is not available on the default node pool :()
  3. at least 1 node in the spot node pool (see the end of this post, if you have a solution!)

Now execute:

$ terraform plan -out=cache.tfplan
$ terraform apply cache.tfplan

to make the cluster ready.
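Before the helm release can be deployed, the helm provider needs credentials for the new cluster. One way (a sketch, reading them straight from the aks resource's kube_config) is to add this provider block to the main.tf:

```hcl
provider "helm" {
  kubernetes {
    host                   = azurerm_kubernetes_cluster.gitlab_runners_cluster.kube_config.0.host
    client_certificate     = base64decode(azurerm_kubernetes_cluster.gitlab_runners_cluster.kube_config.0.client_certificate)
    client_key             = base64decode(azurerm_kubernetes_cluster.gitlab_runners_cluster.kube_config.0.client_key)
    cluster_ca_certificate = base64decode(azurerm_kubernetes_cluster.gitlab_runners_cluster.kube_config.0.cluster_ca_certificate)
  }
}
```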

Add the helm release for the gitlab runner to the main.tf:

resource "helm_release" "gitlab_runners_infrastructure" {
  name = "gitlab-runner"
  namespace = "gitlab-runner"
  repository = "https://charts.gitlab.io"
  chart      = "gitlab-runner"
  version    = "0.33.3"
  create_namespace = true
  set {
    name = "runnerRegistrationToken"
    value = var.registration_token
  }
  values = [
    <<-EOT
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.azure.com/cluster
              operator: Exists
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: kubernetes.azure.com/mode
              operator: In
              values:
                - system
runners:
  tags: "azure-runners"
  secret: "gitlab-runner"
  privileged: true
  runUntagged: true
  namespace: "gitlab-runner"
  config: |
    [[runners]]
      [runners.kubernetes]
        image = "ubuntu:20.04"
        namespace = "gitlab-runner"
        privileged = true
      [runners.kubernetes.node_selector]
        "kubernetes.azure.com/agentpool" = "builds"
      [runners.kubernetes.node_tolerations]
        "kubernetes.azure.com/scalesetpriority" = "NoSchedule"
EOT
  ]
  set {
    name = "gitlabUrl"
    value = var.gitlab_url
  }
  set {
    name = "resources.requests.cpu"
    value = "200m"
  }
  set {
    name = "resources.requests.memory"
    value = "126Mi"
  }
  set {
    name = "concurrent"
    value = "32"
  }
  set {
    name = "unregisterRunners"
    value = "true"
  }
}

This will:

  1. run the kubernetes executor in privileged mode
  2. request 200m cpu for each pod, to ensure that the node pool scales up if too many jobs are scheduled
  3. unregister the runner, if it gets uninstalled
  4. run 32 concurrent jobs on this runner
  5. run the executor itself on the default node pool
  6. run the executed ci/cd jobs on the builds node pool (thus, if no jobs are executed, the builds node pool can scale down)

Now execute:

$ terraform plan -out=cache.tfplan
$ terraform apply cache.tfplan

to make the gitlab runner ready.
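To check that the runner pod came up, you can fetch the cluster credentials with az (the resource names match the terraform resources from above) and inspect the namespace:

```shell
$ az aks get-credentials --resource-group gitlab-runners-resources --name gitlab-runners-kubernetes-cluster
$ kubectl get pods --namespace gitlab-runner
```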

One last hint: If you want to build docker images with this kubernetes executor, you might use a setup like this in your .gitlab-ci.yml:

stages:
  - build

image: docker:19.03.12

variables:
  TAG_ID: "1.0.${CI_PIPELINE_IID}"
  DOCKER_HOST: tcp://docker:2375
  DOCKER_TLS_CERTDIR: ""

services:
  - docker:19.03.12-dind

before_script:
  - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" $CI_REGISTRY

build and push on main:
  stage: build
  script:
    - docker build --pull -t "$CI_REGISTRY_IMAGE:${TAG_ID}" .
    - docker push "$CI_REGISTRY_IMAGE:${TAG_ID}"
  only:
    - main

Take care of the DOCKER_HOST and DOCKER_TLS_CERTDIR variables: they point the docker client to the dind service and disable tls on port 2375.

Conclusion

With the final main.tf file + terraform.tfvars file we can create a kubernetes cluster on azure with a preinstalled gitlab runner, which runs its workload on less expensive spot instances and scales down if there is little to no work for them. All of this is deployed by terraform.

What's missing?

Is there a way to scale this down to 0 nodes and scale it up only when builds happen? I know that this works in gcp, but I couldn't get the azure node autoscaler to do the same.

In azure, docker, gitlab, helm, kubernetes, terraform by
@ 30 Dec 2021, Comments at Reddit & Hackernews