Endless Orange Week
We recently completed Endless Orange Week, which was a week for all employees to explore a topic of their choosing and expand their skills. I chose to explore openQA. This wasn’t a totally new topic, as we’d previously had an openQA setup that we eventually shut down for various reasons. Still, there were several things that made me want to look into it again:
- I’m a relatively recent convert to the idea that you should have automated tests for everything you possibly can. Sometimes that means investing significant effort in constructing a proper test environment, but the payoff from being able to catch issues before you release to the world is worth it. A novel concept, I know.
- The infrastructure we set up before was difficult for us to maintain. We were using Fedora’s packages on an Equinix bare metal server, since openQA runs tests in VMs. In theory this is fine, but there were two problems. First, we’re a Debian shop (both for the OS and our infrastructure), and maintaining the equivalent tooling to support a Fedora host was cumbersome. Second, Fedora is not a supported OS on Equinix, so we were using their custom iPXE process. This involved manually walking through Anaconda over a remote serial console, which is… not fun. I’m sure there’s a better way to do all of this, but ultimately my life would be better if I could make this setup less snowflake and more cattle.
- Certain parts of our OS are currently only tested manually and would really benefit from automated testing. Of particular interest to me is the OS upgrade process using OSTree. I’ve invested significant time at Endless on both the client and server side of this process and have been responsible for releasing upgrade failures into the wild at least once. To put it mildly, I’d prefer that we not do that anymore.
- Even though I was involved in the previous openQA infrastructure, I didn’t really get involved in the tests themselves and wanted to better understand how they worked. As a corollary, I’d heard from the people who were involved that it was difficult to keep the tests up to date, so I was interested to try it myself and see where the friction was.
Success?
In a word, meh. If the goal was to get automated tests for our OS going again, then it was mostly a fail. I ended up spending almost the entire time working on the infrastructure. While this was personally satisfying since I was able to overcome some obstacles from the previous setup, I essentially got back to the same functionality we had before. Which is to say that I didn’t work on any upgrade tests that I was interested in having. I’ll detail a few of the things I did work on below.
Google Cloud Platform
As mentioned at the beginning, the previous setup was on a bare metal server that was difficult to provision. Since most of our infrastructure is in AWS, I’d normally use that, but they don’t support nested VMs, and the price for one of their provisionable bare metal EC2 instances was a bit much for what I was doing.
Enter Google Cloud Platform (GCP). Their VMs do support nested virtualization, and like AWS, GCP is well supported in most ops tools (for me that’s Terraform and Packer). I was able to reuse most of our tooling basically as-is to get a GCE VM going, the one exception being that the Debian GCE base image doesn’t include cloud-init.
Once it was running and I enabled nested virtualization, I was able to run a simple QEMU image with KVM. Yay.
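As a reference point, here’s a minimal sketch of the sort of sanity check I ran once the instance was up, assuming nested virtualization had already been enabled on the VM (in Terraform that’s a single flag on the instance, if I recall correctly); the disk image path is just a placeholder.

```sh
# Confirm the guest CPU exposes virtualization extensions (vmx on Intel, svm on AMD)
grep -c -w -E 'vmx|svm' /proc/cpuinfo

# KVM should be usable inside the GCE VM once nested virtualization is enabled
ls -l /dev/kvm

# Quick smoke test: boot a throwaway disk image with KVM acceleration
# (test.qcow2 is a placeholder path)
qemu-system-x86_64 -enable-kvm -m 2048 -nographic \
    -drive file=test.qcow2,format=qcow2
```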
Containers
Although openQA is in Debian unstable, I wanted to try using containers since they insulate the application from the host. The openQA worker is fairly simple, but my experience running webapps from Debian packages has not been great. Furthermore, while the workers need to run VMs, the webui could later be split out to a more appropriate container platform.
I chose to use the docker-compose method with openSUSE’s openQA containers. It mostly worked very nicely, although there’s always some fun in bootstrapping new containers and providing configuration to them. I sent a couple of fixes upstream that they were kind enough to merge.
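To give a rough idea of what bringing the stack up looked like, here’s a sketch; the repository layout and paths are from memory, so check upstream before copying anything.

```sh
# The compose files ship in the openQA source tree (paths from memory;
# check the upstream repo if they have moved)
git clone https://github.com/os-autoinst/openQA.git
cd openQA/container/webui

# Pull openSUSE's prebuilt images and start the webui stack in the background
docker-compose pull
docker-compose up -d
```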
Needles
In openQA you can do screen matching using “needles”. Endless OS is a desktop OS, so we do want to test that what shows up on the screen is actually what we expect. In our previous openQA configuration we had the tests and needles automatically pulled from our repo, but we didn’t have the configuration needed to create and push commits from the webui.
After figuring out how to get an SSH key into the container, and some subsequent hair pulling, I was able to update a needle from the webui. This had been a significant roadblock for our developers, who previously had to manually edit the needle and open a PR for it. Yay again.
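For illustration, the SSH key dance looked roughly like the following; the container name, user, and paths here are placeholders rather than the exact values I used.

```sh
# Copy a deploy key into the webui container so the needle editor can push
# ("openqa_webui", the root user, and the key paths are all placeholders)
docker cp ~/.ssh/openqa_needles_key openqa_webui:/root/.ssh/id_ed25519

# Lock down the key and give git an identity to commit with
docker exec openqa_webui sh -c '
    chmod 600 /root/.ssh/id_ed25519 &&
    git config --global user.name "openQA needle editor" &&
    git config --global user.email "openqa@example.com"'
```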
Authentication
Previously openQA required OpenID 2.0 for user authentication. I ended up spending a significant amount of time setting up an identity provider just for openQA. Not fun.
Since then, openQA has gained support for OAuth2, which is great. I didn’t get a chance to test it out, but I worked on supporting Google as the OAuth2 provider out of the box. Since we use Google Workspace, Google authentication is our preferred method and I’ve spent quite a bit of time in the weeds with OAuth2 and OIDC. I hope to find a little time to spin that up and test it so I can send it upstream.
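For context, the OAuth2 side of openqa.ini looks roughly like this, going from the upstream documentation at the time (GitHub was the documented provider); a Google provider would presumably slot in the same way once that support lands.

```sh
# Enable OAuth2 authentication in openqa.ini (values are placeholders)
cat >> /etc/openqa/openqa.ini <<'EOF'
[auth]
method = OAuth2

[oauth2]
provider = github
key = <client id>
secret = <client secret>
EOF
```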
Where to now?
I’m not sure. I still believe that we should have automated testing for Endless OS and that openQA can do that for us. I think the work I did here would provide a solid foundation for us to start from again. There would still need to be a significant investment in the tests themselves, though.