July 6, 2024

Storing Blobs on the GitHub Container Registry

Recently, I needed to access some virtual machine images in one of my GitHub Actions workflows. As they came in at more than 1 GB, I was hard-pressed to find a place to store them that was also easy to access from a workflow. If only I could upload a ZIP archive to GitHub Packages!

It turns out that it is not only possible but also much easier than I imagined it to be. Let’s learn how.

Container Images Are Nothing but Fancy Tarballs

The key is to realise that container images (those things that you feed to docker run) are nothing but tarballs and some metadata bundled together. There is no rule¹ that says “Thou shalt only store operating systems in container images.” So, there is nothing that stops us from basically storing anything in a container image. And those can be uploaded to the GitHub Container Registry (GHCR), which is part of the GitHub Packages offering, or any other container registry.
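To make the "tarball plus metadata" idea concrete, here is a minimal sketch that builds something very close to a layer by hand: a gzipped tarball, together with the SHA-256 digest and size that a manifest would record for it. The directory demo-data and the file names are made up for illustration, and the sketch assumes GNU coreutils (sha256sum) are available. It does not claim to reproduce what any particular tool does internally.

```shell
# Sketch: a "layer" is a tarball that is addressed by its SHA-256 digest.
# demo-data and layer.tar.gz are hypothetical names used only for this example.
set -eu

workdir=$(mktemp -d)
mkdir -p "$workdir/demo-data"
printf 'Hello, world!' > "$workdir/demo-data/hello.txt"

# Pack the directory into a gzipped tarball, a common layer format.
tar -C "$workdir" -czf "$workdir/layer.tar.gz" demo-data

# Digest and size are the two facts a manifest records about each layer.
digest=$(sha256sum "$workdir/layer.tar.gz" | cut -d' ' -f1)
size=$(wc -c < "$workdir/layer.tar.gz" | tr -d ' ')

echo "digest: sha256:$digest"
echo "size: $size"
```

The exact digest depends on timestamps and the tar implementation, so it will differ from run to run and from machine to machine.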

Now, we only need to find a tool that makes putting anything into a container image easy. Enter oras.

Upload Anything with ORAS

oras is a nifty command-line tool made by the ORAS project. ORAS stands for “OCI Registry As Storage”, and OCI is the Open Container Initiative. OCI is the standards body regulating the format of container images, registries, and so on. They ensure that any container image can be run by any runtime (Docker, Podman, containerd, …) and can be uploaded to any registry (GHCR, Docker Hub, …). Thanks to this standardisation, oras is compatible with a dozen different registries.

oras itself is available for all major operating systems and most common platforms. You can either download and run the binary from GitHub Releases or use one of the other installation methods.

I have prepared a directory called data that I want to store on GHCR. It looks like this:

$ tree data
data
├── 50m.bin
└── a-directory
    └── hello.txt

2 directories, 2 files

Uploading it to GHCR is as simple as running:

$ oras push ghcr.io/example/data:1 data

To download the files, create an empty directory and change into it. Then run:

$ oras pull ghcr.io/example/data:1

That’s it! There is now a directory called data in the current working directory that looks exactly like the one I uploaded above:

$ tree data
data
├── 50m.bin
└── a-directory
    └── hello.txt

2 directories, 2 files

While this is not exactly “uploading a ZIP archive”, it is actually much better: you do not have to create the ZIP archive yourself, checksum verification is built in, and in some circumstances, it is even possible to change the container image without having to re-upload everything.

Saving Space and Bandwidth with Layers

In the background, oras takes the directory and turns it into an OCI container image before uploading it. That is the same kind of image that Docker uses. If you have used Docker before, you have probably heard about “layers”. In a nutshell, a container image consists of one or more layers, and “layer” is just a fancy term for a tarball. Having multiple tarballs in an image instead of a single one helps with caching and reduces the amount of storage consumed by all those images in a registry. oras creates layers, too, as we can see when we look at the manifest² of the container image we uploaded:

$ oras manifest fetch ghcr.io/example/data:1 | jq
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "artifactType": "application/vnd.unknown.artifact.v1",
  "config": {
    "mediaType": "application/vnd.oci.empty.v1+json",
    "digest": "sha256:44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a",
    "size": 2,
    "data": "e30="
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:8c9d7307a4263817ad8dd2b845c4bac3a4a59621d98a063d3482df77763e7cee",
      "size": 52445175,
      "annotations": {
        "io.deis.oras.content.digest": "sha256:7520f1358115aa8ffd0ca65b22ba5bf9ef4555e9f9212032f65f8cf91e7ec93a",
        "io.deis.oras.content.unpack": "true",
        "org.opencontainers.image.title": "data"
      }
    }
  ],
  "annotations": {
    "org.opencontainers.image.created": "2024-07-04T15:18:29Z"
  }
}

There it is, a single layer with the SHA-256 checksum 8c9d7307a4263817ad8dd2b845c4bac3a4a59621d98a063d3482df77763e7cee. There is even an annotation called org.opencontainers.image.title with the folder’s name: data. The manifest of a “normal” Docker image does not look much different:

$ oras manifest fetch --platform linux/amd64 docker.io/library/postgres:16.3 | jq
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:f23dc7cd74bd7693fc164fd829b9a7fa1edf8eaaed488c117312aef2a48cafaa",
    "size": 10091
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:f11c1adaa26e078479ccdd45312ea3b88476441b91be0ec898a7e07bfd05badc",
      "size": 29126278
    },
    // Many more layers omitted.
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:95c2c2ef9f02d7666e80992c98c53c9ec7b5e8ccf244d00a5c85e46bbc2820ae",
      "size": 184
    }
  ],
  "annotations": {
    "com.docker.official-images.bashbrew.arch": "amd64",
    "org.opencontainers.image.base.digest": "sha256:39868a6f452462b70cf720a8daff250c63e7342970e749059c105bf7c1e8eeaf",
    "org.opencontainers.image.base.name": "debian:bookworm-slim",
    "org.opencontainers.image.created": "2024-05-09T18:58:11Z",
    "org.opencontainers.image.revision": "d08757ccb56ee047efd76c41dbc148e2e2c4f68f",
    "org.opencontainers.image.source": "https://github.com/docker-library/postgres.git#d08757ccb56ee047efd76c41dbc148e2e2c4f68f:16/bookworm",
    "org.opencontainers.image.url": "https://hub.docker.com/_/postgres",
    "org.opencontainers.image.version": "16.3"
  }
}
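One detail of the first manifest (the one for ghcr.io/example/data:1) is easy to check locally: its config entry describes nothing more than the two-byte JSON document {}. The following sketch recomputes the digest and the inline data field, assuming GNU coreutils (sha256sum, base64) are installed:

```shell
# The config blob of the oras-created image is just the JSON document "{}".
config='{}'

# Its SHA-256 digest should match the "digest" field of the config entry.
config_digest=$(printf '%s' "$config" | sha256sum | cut -d' ' -f1)

# Its Base64 encoding should match the inline "data" field.
config_data=$(printf '%s' "$config" | base64)

echo "sha256:$config_digest"
# prints sha256:44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a
echo "$config_data"
# prints e30=
```

This is also why the size of the config is 2: an image created by oras carries no runtime configuration at all, in contrast to the roughly 10 KB config of the postgres image.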

Back to those layers. As I mentioned before, it is possible to change the container image created by oras in some circumstances without re-uploading everything. Those circumstances have a lot to do with those layers. When you run oras push <name> <file> [...], oras creates a separate layer per argument. Let’s upload the same directory as before, but this time, specify every file as a separate argument:

$ oras push ghcr.io/example/data:1 data/50m.bin data/a-directory/hello.txt

While the result on disk is the same when we download the image again, the manifest looks different:

$ oras manifest fetch ghcr.io/example/data:1 | jq
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "artifactType": "application/vnd.unknown.artifact.v1",
  "config": {
    "mediaType": "application/vnd.oci.empty.v1+json",
    "digest": "sha256:44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a",
    "size": 2,
    "data": "e30="
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar",
      "digest": "sha256:1e4c2dd682422beba2fa33db0f926935afe1414f722ee54be7788c6a6c40ebca",
      "size": 52428800,
      "annotations": {
        "org.opencontainers.image.title": "data/50m.bin"
      }
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar",
      "digest": "sha256:03ba204e50d126e4674c005e04d82e84c21366780af1f43bd54a37816b6ab340",
      "size": 13,
      "annotations": {
        "org.opencontainers.image.title": "data/a-directory/hello.txt"
      }
    }
  ],
  "annotations": {
    "org.opencontainers.image.created": "2024-07-05T13:51:19Z"
  }
}

There is now one layer per file, two in total. That means you can now change the contents of the image in place by adding or removing layers. Let’s add another file, data/10m.bin:

$ oras push ghcr.io/example/data:1 data/50m.bin data/10m.bin data/a-directory/hello.txt

You will see that oras only uploads data/10m.bin because the layers for the other two files already exist in the registry. Omit data/50m.bin and oras only has to upload a new manifest that no longer references that file’s layer, leaving everything else in place:

$ oras push ghcr.io/example/data:1 data/10m.bin data/a-directory/hello.txt

So, if you expect to update parts of the container image frequently, or you want to save space³ when storing multiple images that share some contents, it might be beneficial to put every file, or at least every folder, in a separate layer as shown in the preceding examples. If you want to save on typing, find and xargs can help:

$ find data -type f -print0 | xargs -r -0 oras push ghcr.io/example/data:1
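Before pushing for real, it can help to preview the command line that xargs would build. The sketch below does that by putting echo in front of oras, so nothing is uploaded. The scratch directory and its files are stand-ins created only for this example, and sort -z (GNU coreutils) is added merely to make the order of the arguments predictable.

```shell
set -eu

# Recreate a small stand-in for the data directory in a scratch location.
scratch=$(mktemp -d)
mkdir -p "$scratch/data/a-directory"
printf 'Hello, world!' > "$scratch/data/a-directory/hello.txt"
: > "$scratch/data/50m.bin"   # empty placeholder instead of a real 50 MiB file

# "echo" makes xargs print the oras invocation instead of running it.
preview=$(
  cd "$scratch" &&
    find data -type f -print0 | LC_ALL=C sort -z |
    xargs -r -0 echo oras push ghcr.io/example/data:1
)

echo "$preview"
# prints oras push ghcr.io/example/data:1 data/50m.bin data/a-directory/hello.txt
```

Once the printed command looks right, dropping the echo runs the actual push. Note that the preview does not quote file names, so paths containing spaces look ambiguous in the output even though xargs passes them correctly.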

oras has more to offer, but what we have seen so far should suffice for everyday use.

ORAS in a GitHub Actions Workflow

This is a minimal GitHub Actions workflow to download our data folder onto the runner:

name: Build

on:
  push:

jobs:
  build:
    name: Build
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: read  # Required to access GHCR

    steps:
      - name: Install oras
        run: |
          sudo snap install oras --classic

      - name: Download data
        run: |
          oras login --username "${{ github.actor }}" --password "${{ secrets.GITHUB_TOKEN }}" ghcr.io
          oras pull ghcr.io/example/data:1

The highlights:

You need to declare the permission packages: read to access GHCR. If you use a different container registry, you can omit it.
I install oras using snap. Any other method is fine, too.
Log into GHCR with ${{ github.actor }} as username and ${{ secrets.GITHUB_TOKEN }} as password. This also saves you some money⁴.

Then, you can use oras as usual.
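As a variant of the login step, the token can be piped to oras via standard input with --password-stdin instead of being passed as a command-line argument. This is only a sketch: ghcr_login is a hypothetical helper, and GHCR_USER and GHCR_TOKEN are assumed to be provided through the environment, for example via the env: block of the workflow step.

```shell
# Sketch: log in to GHCR without putting the token on the command line.
# ghcr_login is a hypothetical helper; GHCR_USER and GHCR_TOKEN are assumed
# to be set in the environment by the caller.
ghcr_login() {
  printf '%s' "$GHCR_TOKEN" |
    oras login ghcr.io --username "$GHCR_USER" --password-stdin
}
```

Calling ghcr_login before oras pull has the same effect as the login line in the workflow above, but the token does not show up in the process list of the runner.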

If the container images you are accessing are private, and they are private by default, you also have to link the image with the repository that the GitHub Actions workflow is part of. Otherwise, you get permission errors. There are two ways to do this:

Connect a repository to a package using the GitHub UI.
Add the annotation org.opencontainers.image.source to the container image. Assuming the workflow that needs the image lives in https://github.com/example/my-repository, the command looks as follows:
```
$ oras push ghcr.io/example/data:1 \
    -a "org.opencontainers.image.source=https://github.com/example/my-repository" \
    data/50m.bin data/10m.bin data/a-directory/hello.txt
```

Docker Can Do This, Too

Before I stumbled upon ORAS, I tried my luck with a normal image builder. My preferred tool is Buildah, and it can actually do it. The key is to use the empty base image scratch. The equivalent to oras push ghcr.io/example/data:1 data looks as follows:

$ export newcontainer=$(buildah from scratch)
$ buildah unshare
$ buildah copy $newcontainer data /data
$ buildah unmount $newcontainer
$ buildah commit $newcontainer data
$ buildah rm $newcontainer
$ buildah push data:latest ghcr.io/example/data:1

If you think that this is kinda gross, that is because it absolutely is.

Docker does not fare better. First, we need a Dockerfile next to the data folder:

FROM scratch
COPY data /data

Then, we can build the image and push it to GHCR:

$ docker build -t ghcr.io/example/data:1 .
$ docker image push ghcr.io/example/data:1

Skopeo is probably the best tool (relatively speaking) to get the data folder back:

$ skopeo copy docker://ghcr.io/example/data:1 dir:output

This command will extract the image into the pre-existing folder output. Unfortunately, we are far from done. We still have to look into the manifest to figure out which file contains the filesystem layer and what compression algorithm was used. Then, we can extract it with tar to get our folder data back.
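For the record, the remaining manual steps can be sketched roughly as follows. The sketch assumes that the dir: layout stores manifest.json next to blobs named after their hex digest, that jq is installed, and that tar detects the compression on its own (GNU tar and bsdtar do). extract_first_layer is a hypothetical helper, and it only handles the first layer, which is enough for the single-layer image built from the Dockerfile above.

```shell
# Sketch: unpack the first layer of an image that skopeo copied to a dir: target.
# Assumptions: manifest.json sits next to blobs named by their hex digest,
# jq is installed, and tar auto-detects the compression of the layer.
extract_first_layer() {
  image_dir=$1   # directory written by "skopeo copy ... dir:<image_dir>"
  target_dir=$2  # where the layer contents should end up

  # Look up the digest of the first layer and strip the "sha256:" prefix.
  layer=$(jq -r '.layers[0].digest' "$image_dir/manifest.json" | cut -d: -f2)

  mkdir -p "$target_dir"
  tar -xf "$image_dir/$layer" -C "$target_dir"
}
```

With the image from above, extract_first_layer output extracted should leave the folder in extracted/data. Images with several layers would need a loop over all entries of .layers, applied in order.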

This is absolutely no fun, and nobody should do it. I only wanted to mention it. After all, you never know when this otherwise useless knowledge might come in handy.

¹ GitHub itself showcases that Homebrew stores at least half a petabyte of binaries on GHCR. If you are curious how Homebrew does it: Homebrew writes the OCI image itself and then uploads it using skopeo.
² A manifest is a piece of metadata that describes the contents of a container image.
³ Container registries usually store each layer only once, even if it is part of hundreds or thousands of images.
⁴ Data transfer is free of charge when GHCR is accessed with GITHUB_TOKEN.