Improving compatibility of registry cache with GitLab Container Registry

feather icon

This article was originally published on 11th October 2023. Some of its contents may not be relevant anymore.

2023-10-11

#Tech

Overview

In the ever-advancing realm of Docker & container technology, recent updates might help us better manage the container cache. This article delves deep into the dynamic world of container registry cache, shedding light on the latest developments that are reshaping the way we manage GitLab registry, clean up rules, and garbage collection.

Registry Cache migration

With the recent move to buildkit as a default builder for docker, the registry cache became available. Take a sneak peek at how to use it:

docker build

1
2
3
4
5
6
7
8
docker buildx build \
  --target app \
  --cache-from=type=registry,ref=${CI_REGISTRY}/${CI_PROJECT_PATH}/app.cache:main \
  --cache-from=type=registry,ref=${CI_REGISTRY}/${CI_PROJECT_PATH}/app.cache:feature \
  --cache-to=type=registry,mode=max,ref=${CI_REGISTRY}/${CI_PROJECT_PATH}/app.cache:feature \
  --tag ${CI_REGISTRY}/${CI_PROJECT_PATH}/app:feature \
  --push \
  .

With this example, we are looking for two possible cache images main which is the latest cache from our default branch. The feature is our current branch slug, we can use it if it exists (we ran this command before). This command replaces a set of other commands that were required before to build images using cache

docker build

1
2
3
4
5
6
7
8
9
docker pull ${CI_REGISTRY}/${CI_PROJECT_PATH}/app:main
docker pull ${CI_REGISTRY}/${CI_PROJECT_PATH}/app:feature || true
docker build \
  --target app \
  --cache-from ${CI_REGISTRY}/${CI_PROJECT_PATH}/app:main \
  --cache-from ${CI_REGISTRY}/${CI_PROJECT_PATH}/app:feature \
  --tag ${CI_REGISTRY}/${CI_PROJECT_PATH}/app:feature \
  .
docker push ${CI_REGISTRY}/${CI_PROJECT_PATH}/app:feature

Note: This is a setup for the old build engine, to make it work with buildkit, you need to add --load or --push to save the built image somewhere. What is more, --build-arg BUILDKIT_INLINE_CACHE=1 needs to be added to the build command, because after the switch to buildkit the build command will not save the cache anywhere, except locally, which is not ideal in the CI environment.

And the discrepancy only becomes more obvious with multi-target Dockerfiles. With the old approach, you had to pull an image for every target in your Dockerfile (and push it to the registry as well), now, with the registry cache, it all resides within a single image, thanks to mode=max argument. Please check the description here.

Benefits of registry cache vs inline cache:

  • final images are smaller - we do not store any cache information inside the regular image anymore, which is why the final image might be smaller. However, the extent of this heavily depends on a few factors (like the number of layers or amount of files), so you need to check this for your use case.
  • less image clutter when using multi-staged builds - as mentioned above, we do not have to explicitly store our intermediate targets from our multi-staged build. We just store the final application image and use one cache image for the entire build process
  • faster build times - inline cache only supports min mode, and because of that with a multi-stage build you need more images; registry cache supports max mode, and because of that, we cache all of the layers in one image. Besides that, you do not need to pull the layers of the old image to build the new one. It only pulls cache manifests and makes decisions based on that.

The only drawback that I could find, was the compatibility with GitLab Container Registry. Cache images are shown as broken in the UI (and the API call to get the details of this tag returns a 404 error)

code-image.png

This makes the cleanup policies not work properly.

Navigating Kubernetes complexities? We've got the expertise

contact box image

How to fix the clean-up policies? The partial fix

A detailed breakdown of the issue can be found here.

TLDR: The main reason for the broken compatibility of cache images is the use of an image index instead of an image manifest.

Example (from the GitHub ticket): for example, instead of this index:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
{
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {
            "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
            "digest": "sha256:136482bf81d1fa351b424ebb8c7e34d15f2c5ed3fc0b66b544b8312bda3d52d9",
            "size": 2427917,
            "annotations": {
                "buildkit/createdat": "2021-07-02T17:08:09.095229615Z",
                "containerd.io/uncompressed": "sha256:3461705ddf3646694bec0ac1cc70d5d39631ca2f4317b451c01d0f0fd0580c90"
            }
        },
        ...
        {
            "mediaType": "application/vnd.buildkit.cacheconfig.v0",
            "digest": "sha256:7aa23086ec6b7e0603a4dc72773ee13acf22cdd760613076ea53b1028d7a22d8",
            "size": 1654
        }
    ]
}

We could have the following manifest:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
{
    "schemaVersion": 2,
    "config": {
        "mediaType": "application/vnd.buildkit.cacheconfig.v0",
        "size": 1654,
        "digest": "sha256:7aa23086ec6b7e0603a4dc72773ee13acf22cdd760613076ea53b1028d7a22d8"
      },
    "layers": [
        {
            "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
            "digest": "sha256:136482bf81d1fa351b424ebb8c7e34d15f2c5ed3fc0b66b544b8312bda3d52d9",
            "size": 2427917,
            "annotations": {
                "buildkit/createdat": "2021-07-02T17:08:09.095229615Z",
                "containerd.io/uncompressed": "sha256:3461705ddf3646694bec0ac1cc70d5d39631ca2f4317b451c01d0f0fd0580c90"
            }
        },
        ...
    ]
}

Thankfully, a flag was implemented, that allows switching between an index and a manifest.

image-manifest=<true|false>: whether to export cache manifest as an OCI-compatible image manifest rather than a manifest list/index (default: false, must be used with oci-mediatypes=true)

Example code:

docker build

1
2
3
4
5
6
7
8
docker buildx build \
  --target app \
  --cache-from=type=registry,ref=${CI_REGISTRY}/${CI_PROJECT_PATH}/app.cache:main \
  --cache-from=type=registry,ref=${CI_REGISTRY}/${CI_PROJECT_PATH}/app.cache:feature \
  --cache-to=type=registry,mode=max,image-manifest=true,ref=${CI_REGISTRY}/${CI_PROJECT_PATH}/app.cache:feature \
  --tag ${CI_REGISTRY}/${CI_PROJECT_PATH}/app:feature \
  --push \
  .

Note: image-manifest=true an export option has been officially available since buildkit v0.12.0, however, some v0.11.6 versions also include it. The best way, that I found, to check if you have it, is just to try to use it.

Results:

Now the compatibility is much better. In the UI we no longer see a warning symbol, we can see the size of the cache, but one issue remains, something that might not be obvious at first glance. If we look into the API response for this tag, we will notice that the created_at attribute is empty. Unfortunately, that means that these images will never be cleaned by cleanup policy, since it can remove only images that are 7 or more days old.

code-image2.png

api response

1
2
3
4
5
6
7
8
9
10
{
  "name": "cache",
  "path": "*******/docker-regsitry-test/cra-test:cache",
  "location": "registry.gitlab.com/*******/docker-regsitry-test/cra-test:cache",
  "revision": "b9129182721421bdbe20646a6bc3072d1d6d39f7e503b1c74cebb6fd047f7056",
  "short_revision": "b91291827",
  "digest": "sha256:d6f357fb4ca5de37862c5dcfc8e6e5236a71d86505cba077c6cfe4730050fdf0",
  "created_at": null,
  "total_size": 518727681
}

One more thing about provenance attestation

If you are reading this article because you are experiencing similar issues, you might have also noticed a problem with regular images in your GitLab registry. They might look to be broken in the same way as the cache images. This is most likely caused by SLSA Provenance that Docker Desktop as of v4.16 and docker buildx as of v0.10.0 default to using SLSA Provenance for multi-architecture builds. At the moment of writing, GitLab Container Registry does not support that, hence you can follow the issue here and here.

The current fix for that is to add --provenance=false flag to the build command, or set BUILDX_NO_DEFAULT_ATTESTATIONS=1 as an environment variable.

docker build

1
2
3
4
5
6
7
8
9
docker buildx build \
  --target app \
  --cache-from=type=registry,ref=${CI_REGISTRY}/${CI_PROJECT_PATH}/app.cache:main \
  --cache-from=type=registry,ref=${CI_REGISTRY}/${CI_PROJECT_PATH}/app.cache:feature \
  --cache-to=type=registry,mode=max,image-manifest=true,ref=${CI_REGISTRY}/${CI_PROJECT_PATH}/app.cache:feature \
  --tag ${CI_REGISTRY}/${CI_PROJECT_PATH}/app:feature \
  --provenance=off \
  --push \
  .

Final build command that is as compatible as it gets at the time of writing and what we achieved

What we achieved:

  • regular images work perfectly with the GitLab Container Registry
  • cache images work much better, they are only missing the created/published date
  • we are using a registry cache with all of its benefits

Custom garbage collect script

What is left to do:

  • we need to clean up the existing container registries of corrupted tags because the in-built cleanup policy will not do it for us
  • we need a way of cleaning the cache images once they become obsolete (the corresponding regular image is deleted by the gitlab cleanup policy)

To tackle this problem, I've created a Python script, that iterates over the repositories of container registries in projects it has access to. Each tag is processed with the following algorithm:

  • latest, master, main, and tags that match v.*\..*\..* are always left untouched

  • tags are deleted under the following conditions:

    • Option 1: The details of the tag cannot be fetched from the API.
    • Option 2: The details of the tag can be fetched, but the created_at is empty, the registry name ends with .cache, and there is no counterpart registry without the .cache suffix.
    • Option 3: Like option 2, however, the counterpart registry exists, however, the counterpart tag in it does not exist.

This script of course works on some assumptions, that is:

  • both the cache and regular image are tagged the same (for example the slug of the branch name) but are in different registries
  • if the registry for regular images is named app , the corresponding cache registry is named app.cache
  • latest, master, main, and tags that match v.*\..*\..* are considered stable and should never be removed, even if corrupted

Naturally, if your registry setup is a bit different, the script can be adjusted and I will explain how and where below.

First, let's bootstrap the main function, in which we will get the required parameters from the command line:

  • gitlab-url - the base URL of the gitlab instance
  • token - PAT that the script will use to authenticate.
  • dry-run - optional - can be used for testing/seeing what the script will do

Later we will write down our main loop logic, that loop will iterate over all projects the user has access to, if a project has container registry enabled it will iterate over all registries within that project and lastly process every tag within those registries.

main

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import argparse

def main():

    parser = argparse.ArgumentParser(description="Script to process GitLab registry tags.")
    parser.add_argument('--gitlab-url', required=True, help='GitLab URL.')
    parser.add_argument('--token', required=True, help='Authentication token.')
    parser.add_argument('--dry-run', action='store_true', help='If present, only print tags that should be deleted without actually deleting them.')
    
    args = parser.parse_args()

    global DRY_RUN, GITLAB_URL, HEADERS
    DRY_RUN = args.dry_run
    GITLAB_URL = args.gitlab_url
    HEADERS = {
        "Private-Token": args.token
    }

    projects = get_all_projects()
    for project in projects:
        if not project['container_registry_enabled']:
            continue

        registries = get_project_registries(project['id'])
        for registry in registries:
            tags = get_registry_tags(project['id'], registry['id'])
            for tag in tags:
                process_tag(project, registry, tag)

if __name__ == "__main__":
    main()

Now let's write those data-gathering functions:

data gathering

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import requests

def get_all_projects():
    return get_paginated_data("/api/v4/projects?order_by=id&sort=asc")

def get_project_registries(project_id):
    return get_paginated_data(f"/api/v4/projects/{project_id}/registry/repositories?")

def get_registry_tags(project_id, registry_id):
    return get_paginated_data(f"/api/v4/projects/{project_id}/registry/repositories/{registry_id}/tags?")

def get_tag_details(project_id, registry_id, tag_name):
    response = requests.get(f"{GITLAB_URL}/api/v4/projects/{project_id}/registry/repositories/{registry_id}/tags/{tag_name}", headers=HEADERS)
    return response

def get_paginated_data(endpoint):
    page = 1
    items = []
    while True:
        url = f"{GITLAB_URL}{endpoint}&per_page=100&page={page}"
        response = requests.get(url, headers=HEADERS)
        if response.status_code == 200:
            data = response.json()
            if not data:
                break
            items.extend(data)
            page += 1
        else:
            print(f"Failed to retrieve data from {url}. Status code: {response.status_code}. Payload: {response.json()}")
            break
    return items

Since we now have all the data that we need to process the tag, we can now implement the process_tag function. If you wish to make any changes to the cleanup logic, reason_and_delete_decision function is the best place to do so. I've added extra comments explaining in detail what is happening.

tag processing

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
import re

def process_tag(project, registry, tag):
    should_delete, reason = reason_and_delete_decision(project, registry, tag)
    if should_delete:
        if DRY_RUN:
            print(f"Tag '{tag['name']}' in Project: '{project['name']}', Registry: '{registry['name']}' should be deleted! - Reason: {reason}")
        else:
            print(f"Deleting tag '{tag['name']}' in Project: '{project['name']}', Registry: '{registry['name']}' - Reason: {reason}..", end=" ")
            delete_tag(project['id'], registry['id'], tag['name'])
    else:
        print(f"Tag '{tag['name']}' in Project: '{project['name']}', Registry: '{registry['name']}' - {reason}")

def reason_and_delete_decision(project, registry, tag):
    tag_name = tag['name']

	# if the tag name is something that we consider stable we should leave it be, you can change the condition here to suit your needs
    if tag_name in ["master", "main", "latest"] or re.match(r"v.*\..*\..*", tag_name):
        return False, f"Tag '{tag_name}' is a stable tag."

	# this is the first option in the above algorithm, it covers images with provenance attestation and cache images using manifest index
    response = get_tag_details(project['id'], registry['id'], tag_name)
    if response.status_code == 404:
        return True, f"Details of tag '{tag_name}' do not exist."

	# if we are dealing with the cache registry and the tag created_at is empty, than we are dealing with partially compatible cache image
    if not tag.get('created_at') and registry['name'].endswith('.cache'):
        counterpart_registry_name = registry['name'].removesuffix('.cache')
		# this looks for the couterpart registry without .cache suffix, there are only two possible results for this operation
		#  - empty array is the counterpart registry does not exist
		#  - an array with one element in it, the counterpart registry object
        counterpart_registries = [r for r in get_project_registries(project['id']) if r['name'] == counterpart_registry_name]
        
		# if the counterpart registry does not exists
        if not counterpart_registries:
            return True, f"Counterpart registry for '{registry['name']}' does not exist."

        counterpart_tags = get_registry_tags(project['id'], counterpart_registries[0]['id'])
		# if the counterpart image is missing, it probably got deleted by gitlab built-in cleanup so we can remove the corresponding cache image
        if not any(t['name'] == tag_name for t in counterpart_tags):
            return True, f"Counterpart tag for '{tag_name}' in registry '{counterpart_registry_name}' does not exist."

    return False, "LGTM!"

def delete_tag(project_id, registry_id, tag_name):
    response = requests.delete(f"{GITLAB_URL}/api/v4/projects/{project_id}/registry/repositories/{registry_id}/tags/{tag_name}", headers=HEADERS)
    if response.status_code == 200:
        print(f"Tag '{tag_name}' has been deleted!")
    else:
        print(f"Failed to delete tag '{tag_name}'. Status code: {response.status_code}. Payload: {response.json()}")

That's it, we can now put it all together, I will also add a statistical summary at the end of the script execution

registry-gc.py

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
import requests
import re
import argparse
import signal

def main():
    # Register signal handlers
    signal.signal(signal.SIGINT, print_counters)
    signal.signal(signal.SIGTERM, print_counters)

    parser = argparse.ArgumentParser(description="Script to process GitLab registry tags.")
    parser.add_argument('--gitlab-url', required=True, help='GitLab URL.')
    parser.add_argument('--token', required=True, help='Authentication token.')
    parser.add_argument('--dry-run', action='store_true', help='If present, only print tags that should be deleted without actually deleting them.')
    
    args = parser.parse_args()

    global DRY_RUN, GITLAB_URL, HEADERS
    DRY_RUN = args.dry_run
    GITLAB_URL = args.gitlab_url
    HEADERS = {
        "Private-Token": args.token
    }

    # Initialize counters
    global project_count, registry_count, tag_count, deleted_tag_count
    project_count = 0
    registry_count = 0
    tag_count = 0
    deleted_tag_count = 0

    projects = get_all_projects()
    for project in projects:
        if not project['container_registry_enabled']:
            continue

        project_count += 1
        registries = get_project_registries(project['id'])
        for registry in registries:
            registry_count += 1
            tags = get_registry_tags(project['id'], registry['id'])
            for tag in tags:
                process_tag(project, registry, tag)

    print_counters()

## Data gathering

def get_all_projects():
    return get_paginated_data("/api/v4/projects?order_by=id&sort=asc")

def get_project_registries(project_id):
    return get_paginated_data(f"/api/v4/projects/{project_id}/registry/repositories?")

def get_registry_tags(project_id, registry_id):
    return get_paginated_data(f"/api/v4/projects/{project_id}/registry/repositories/{registry_id}/tags?")

def get_tag_details(project_id, registry_id, tag_name):
    response = requests.get(f"{GITLAB_URL}/api/v4/projects/{project_id}/registry/repositories/{registry_id}/tags/{tag_name}", headers=HEADERS)
    return response

def get_paginated_data(endpoint):
    page = 1
    items = []
    while True:
        url = f"{GITLAB_URL}{endpoint}&per_page=100&page={page}"
        response = requests.get(url, headers=HEADERS)
        if response.status_code == 200:
            data = response.json()
            if not data:
                break
            items.extend(data)
            page += 1
        else:
            print(f"Failed to retrieve data from {url}. Status code: {response.status_code}. Payload: {response.json()}")
            break
    return items

# Tag processing

def process_tag(project, registry, tag):
    global tag_count, deleted_tag_count
    tag_count += 1

    should_delete, reason = reason_and_delete_decision(project, registry, tag)
    if should_delete:
        deleted_tag_count += 1
        if DRY_RUN:
            print(f"Tag '{tag['name']}' in Project: '{project['name']}', Registry: '{registry['name']}' should be deleted! - Reason: {reason}")
        else:
            print(f"Deleting tag '{tag['name']}' in Project: '{project['name']}', Registry: '{registry['name']}' - Reason: {reason}..", end=" ")
            delete_tag(project['id'], registry['id'], tag['name'])
    else:
        print(f"Tag '{tag['name']}' in Project: '{project['name']}', Registry: '{registry['name']}' - {reason}")

def reason_and_delete_decision(project, registry, tag):
    tag_name = tag['name']

    if tag_name in ["master", "main", "latest"] or re.match(r"v.*\..*\..*", tag_name):
        return False, f"Tag '{tag_name}' is a stable tag."

    response = get_tag_details(project['id'], registry['id'], tag_name)
    if response.status_code == 404:
        return True, f"Details of tag '{tag_name}' do not exist."

    if not tag.get('created_at') and registry['name'].endswith('.cache'):
        counterpart_registry_name = registry['name'].removesuffix('.cache')
        counterpart_registries = [r for r in get_project_registries(project['id']) if r['name'] == counterpart_registry_name]
        
        if not counterpart_registries:
            return True, f"Counterpart registry for '{registry['name']}' does not exist."

        counterpart_tags = get_registry_tags(project['id'], counterpart_registries[0]['id'])
        if not any(t['name'] == tag_name for t in counterpart_tags):
            return True, f"Counterpart tag for '{tag_name}' in registry '{counterpart_registry_name}' does not exist."

    return False, "LGTM!"

def delete_tag(project_id, registry_id, tag_name):
    response = requests.delete(f"{GITLAB_URL}/api/v4/projects/{project_id}/registry/repositories/{registry_id}/tags/{tag_name}", headers=HEADERS)
    if response.status_code == 200:
        print(f"Tag '{tag_name}' has been deleted!")
    else:
        print(f"Failed to delete tag '{tag_name}'. Status code: {response.status_code}. Payload: {response.json()}")

# Statistics

def print_counters(signum=None, frame=None):
    # Print the counters
    print("\nSummary:")
    print(f"Number of projects processed: {project_count}")
    print(f"Number of registries processed: {registry_count}")
    print(f"Number of tags processed: {tag_count}")    
    if DRY_RUN:
        print(f"Number of tags marked for deletion: {deleted_tag_count}")
    else:
        print(f"Number of tags deleted: {deleted_tag_count}")

    if signum:
        exit(1)

#####

if __name__ == "__main__":
    main()

Lastly, you might want to run this script on a schedule, to do so you can use this .gitlab-ci.yml configuration and GitLab Pipeline Schedules, make sure to pass API_TOKEN variable using GitLab Variables

.gitlab-ci.yml

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
stages:
  - run

.base garbage collect:
  stage: run
  image: python:3-alpine
  timeout: 12 hours
  interruptible: true
  before_script:
    - pip install requests
  artifacts:
    when: always
    paths:
      - gc.log
    expire_in: 30 days  

garbage collect:
  extends: .base garbage collect
  only:
    variables:
      - $CI_DEFAULT_BRANCH == $CI_COMMIT_REF_NAME
  script:   
    - python3 registry-gc.py --gitlab-url "$CI_SERVER_PROTOCOL://$CI_SERVER_HOST:$CI_SERVER_PORT" --token $API_TOKEN 2>&1 | tee gc.log

dry garbage collect:
  extends: .base garbage collect
  only:
    refs:
      - merge_requests
  script:   
    - python3 registry-gc.py --gitlab-url "$CI_SERVER_PROTOCOL://$CI_SERVER_HOST:$CI_SERVER_PORT" --token $API_TOKEN --dry-run 2>&1 | tee gc.log

Summary

With the above script, I managed to delete 85% of tags from a single namespace, containing a mid-sized microservice project. This amounted to a little bit over 4.5TB of data removed from the S3 storage backend, used for that instance of GitLab.

If you have any doubts, contact a trusted software development company.

Share this post

Want to light up your ideas with us?

Józefitów 8, 30-039 Cracow, Poland

hidevanddeliver.com

(+48) 789 188 353

NIP: 9452214307

REGON: 368739409