For one of my use cases I had been given an azure subscription and the task to ensure that gitlab runners are running within it. One option would be to bootstrap one or more virtual machines and install the gitlab runners on them manually. But today I want to use terraform to deploy an aks (azure kubernetes service) cluster and the gitlab runner within it.
As prerequisites you need the following cli tools installed:
- az (from azure cli at https://docs.microsoft.com/en-us/cli/azure/install-azure-cli-linux)
- terraform (from terraform cli at https://learn.hashicorp.com/tutorials/terraform/install-cli)
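A quick check that both tools are on your path:
$ az --version
$ terraform version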
Now log in to the azure cli with your credentials and choose the right subscription:
$ az login
$ az account set --subscription aabbccdd-eeff-aaee-bbcc-aabbccddeeff
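You can double-check that the right subscription is active:
$ az account show --output table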
Create a file called terraform.tfvars containing all the credentials (the registration token can be found on the ci/cd settings page of your gitlab group, project or instance):
registration_token="fHJASAK_ss_DD"
gitlab_url="https://gitlab.com"
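Terraform only picks these values up if matching variables are declared somewhere in your configuration, e.g. in a variables.tf. A minimal sketch (the type, description and sensitive attributes are my additions):
variable "registration_token" {
  type        = string
  description = "gitlab runner registration token"
  sensitive   = true # requires terraform >= 0.14, drop on older versions
}

variable "gitlab_url" {
  type        = string
  description = "base url of the gitlab instance"
}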
Then create a main.tf with the following parts:
terraform {
  backend "http" {
  }

  required_version = ">=0.12"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "2.90.0"
    }
    helm = {
      source  = "hashicorp/helm"
      version = "2.4.1"
    }
  }
}
Initialize terraform with terraform init, either with local state or gitlab-hosted with the backend config taken from https://docs.gitlab.com/ee/user/infrastructure/iac/terraform_state.html#get-started-using-local-development.
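If you go the gitlab-hosted route, the init call looks roughly like this (project id, state name, username and access token are placeholders you need to fill in):
$ terraform init \
    -backend-config="address=https://gitlab.com/api/v4/projects/<PROJECT_ID>/terraform/state/gitlab-runners" \
    -backend-config="lock_address=https://gitlab.com/api/v4/projects/<PROJECT_ID>/terraform/state/gitlab-runners/lock" \
    -backend-config="unlock_address=https://gitlab.com/api/v4/projects/<PROJECT_ID>/terraform/state/gitlab-runners/lock" \
    -backend-config="username=<GITLAB_USERNAME>" \
    -backend-config="password=<GITLAB_ACCESS_TOKEN>" \
    -backend-config="lock_method=POST" \
    -backend-config="unlock_method=DELETE" \
    -backend-config="retry_wait_min=5"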
Add the azurerm provider to the main.tf:
provider "azurerm" {
features {}
}
resource "azurerm_resource_group" "gitlab_runners_group" {
name = "gitlab-runners-resources"
location = "West Europe"
lifecycle {
ignore_changes = [
tags,
]
}
}
resource "azurerm_kubernetes_cluster" "gitlab_runners_cluster" {
name = "gitlab-runners-kubernetes-cluster"
location = azurerm_resource_group.gitlab_runners_group.location
resource_group_name = azurerm_resource_group.gitlab_runners_group.name
dns_prefix = "gitlab-runners-kubernetes"
kubernetes_version = "1.21.7"
// we don't use the system node pool - workarounds like
// https://pumpingco.de/blog/modify-aks-default-node-pool-in-terraform-without-redeploying-the-cluster/
// are quite complicated
default_node_pool {
name = "default"
node_count = 1
vm_size = "standard_a2_v2"
}
identity {
type = "SystemAssigned"
}
tags = {
Environment = "Production"
}
addon_profile {
aci_connector_linux {
enabled = false
}
azure_policy {
enabled = false
}
http_application_routing {
enabled = false
}
kube_dashboard {
enabled = false
}
oms_agent {
enabled = false
}
}
lifecycle {
ignore_changes = [
tags,
]
}
}
resource "azurerm_kubernetes_cluster_node_pool" "gitlab_runners_spot_node_pool" {
name = "builds"
availability_zones = []
enable_host_encryption = false
enable_node_public_ip = false
fips_enabled = false
kubernetes_cluster_id = azurerm_kubernetes_cluster.gitlab_runners_cluster.id
vm_size = "standard_f16s_v2"
priority = "Spot"
eviction_policy = "Delete"
spot_max_price = -1
enable_auto_scaling = true
max_count = 3
min_count = 1
max_pods = 110
node_count = 1
tags = {
Environment = "Production"
}
node_labels = {
"kubernetes.azure.com/scalesetpriority" = "spot"
}
node_taints = [
"kubernetes.azure.com/scalesetpriority=spot:NoSchedule"
]
lifecycle {
ignore_changes = [
node_count,
orchestrator_version,
os_disk_size_gb,
os_sku,
kubernetes_cluster_id,
kubelet_disk_type,
availability_zones
]
}
}
This makes it possible to have:
- the default node pool as a system pool with a very slim standard_a2_v2 machine (we only keep it because we cannot get rid of it :()
- a second node pool of type standard_f16s_v2 containing spot instances (spot is not allowed on the default node pool :()
- at least 1 node in the spot node pool (see the end of this post, if you have a solution!)
Now execute:
$ terraform plan -out=cache.tfplan
$ terraform apply cache.tfplan
to make the cluster ready.
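If you want to peek into the cluster with kubectl, pull its credentials first (resource group and cluster name as defined above):
$ az aks get-credentials --resource-group gitlab-runners-resources --name gitlab-runners-kubernetes-cluster
$ kubectl get nodes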
Now add the helm configuration to the main.tf.
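Before the chart can be installed, the helm provider needs credentials for the freshly created cluster. A minimal sketch, taking them straight from the azurerm_kubernetes_cluster resource above:
provider "helm" {
  kubernetes {
    host                   = azurerm_kubernetes_cluster.gitlab_runners_cluster.kube_config.0.host
    client_certificate     = base64decode(azurerm_kubernetes_cluster.gitlab_runners_cluster.kube_config.0.client_certificate)
    client_key             = base64decode(azurerm_kubernetes_cluster.gitlab_runners_cluster.kube_config.0.client_key)
    cluster_ca_certificate = base64decode(azurerm_kubernetes_cluster.gitlab_runners_cluster.kube_config.0.cluster_ca_certificate)
  }
}
With the provider wired up, the gitlab runner chart itself: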
resource "helm_release" "gitlab_runners_infrastructure" {
name = "gitlab-runner"
namespace = "gitlab-runner"
repository = "https://charts.gitlab.io"
chart = "gitlab-runner"
version = "0.33.3"
create_namespace = true
set {
name = "runnerRegistrationToken"
value = var.gitlab_infrastructure_registration_token
}
values = [
<<-EOT
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.azure.com/cluster
operator: Exists
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: kubernetes.azure.com/mode
operator: In
values:
- system
runners:
tags: "azure-runners"
secret: "gitlab-runner"
privileged: true
runUntagged: true
namespace: "gitlab-runner"
config: |
[[runners]]
[runners.kubernetes]
image = "ubuntu:20.04"
namespace = "gitlab-runner"
privileged = true
[runners.kubernetes.node_selector]
"kubernetes.azure.com/agentpool" = "builds"
[runners.kubernetes.node_tolerations]
"kubernetes.azure.com/scalesetpriority" = "NoSchedule"
EOT
]
set {
name = "gitlabUrl"
value = var.gitlab_url
}
set {
name = "resources\\.requests\\.cpu"
value = "200m"
}
set {
name = "resources\\.requests\\.memory"
value = "126Mi"
}
set {
name = "concurrent"
value = "32"
}
set {
name = "unregisterRunners"
value = "true"
}
}
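A small hedge: a plain set block prints the registration token in the plan output. The helm provider also offers a set_sensitive block with the same shape that masks the value (it still ends up in the terraform state either way), so inside the resource you can write instead:
  set_sensitive {
    name  = "runnerRegistrationToken"
    value = var.registration_token
  }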
This will:
- run the kubernetes executor in privileged mode
- request 200m cpu for each pod, so the cluster autoscaler scales up when resources run short
- unregister the runner when the release is uninstalled
- run up to 32 concurrent jobs on this runner
- run the executor itself on the default node pool
- run the ci/cd jobs on the builds node pool (thus if no jobs are executed, the builds node pool can scale down)
Now execute:
$ terraform plan -out=cache.tfplan
$ terraform apply cache.tfplan
to make the gitlab runner ready.
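A quick check that the runner manager pod came up (namespace as set in the chart values):
$ kubectl get pods -n gitlab-runner
The runner should also show up as registered on the ci/cd page you took the token from.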
One last hint: if you want to build docker images with this kubernetes executor, you might use a setup like this in your .gitlab-ci.yml:
stages:
  - build

image: docker:19.03.12

variables:
  TAG_ID: "1.0.${CI_PIPELINE_IID}"
  DOCKER_HOST: tcp://docker:2375
  DOCKER_TLS_CERTDIR: ""

services:
  - docker:19.03.12-dind

before_script:
  - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" $CI_REGISTRY

build and push on main:
  stage: build
  script:
    - docker build --pull -t "$CI_REGISTRY_IMAGE:${TAG_ID}" .
    - docker push "$CI_REGISTRY_IMAGE:${TAG_ID}"
  only:
    - main
Take care of the DOCKER_HOST and DOCKER_TLS_CERTDIR variables: DOCKER_HOST points the docker cli at the dind service, and the empty DOCKER_TLS_CERTDIR disables TLS, which matches the daemon listening on plain port 2375.
Conclusion
With the final main.tf and terraform.tfvars files we can create a kubernetes cluster on azure with a preinstalled gitlab runner. The runner executes its workload on less expensive spot instances, and the spot node pool scales down when there is little to no work. All of this is deployed by terraform.
What's missing?
Is there a way to scale the spot node pool down to 0 nodes and scale it up only when builds happen? I know that this works on gcp, but I couldn't get the azure cluster autoscaler to do the same.