
From Google Test to Catch

If you've ever met me, you probably know that I'm a big believer in automated testing. Even for small projects, I tend to implement some testing early on, and for large projects I consider testing an absolute necessity. I could ramble on for quite a while about why tests are important and why you should be writing them, but that's not the topic for today. Instead, I'm going to cover why I moved all of my unit tests from Google Test -- the test framework I previously used -- to Catch, and shed some light on how I did it. Before we get to the present, let's take a look back at how I arrived at Google Test and why I wanted to change something in the first place.

A brief history

Many, many moons ago, this blog post got me interested in unit testing. Given that I had no experience whatsoever, and UnitTest++ looked as good as any other framework, I wrote my initial tests using that. This was sometime around 2008. By 2010, I was getting a bit frustrated with UnitTest++: development wasn't exactly going strong, and I was hoping for more test macros for things like string comparison, and so on. Long story short, I ended up porting all my tests to Google Test.

Back in the day, Google Test was developed on Google Code, and releases happened regularly but not too often. That was actually convenient, as bundling Google Test into a single file required running a separate tool (and it still does.) I ended up using Google Test for all of my tests -- roughly 3000 of them in total, with a bunch of fixtures. While developing, I run the unit tests on every build, so I also wrote a custom reporter to make my console output look like this:

SUCCESS (11 tests, 0 ms)
SUCCESS (1 tests, 0 ms)
SUCCESS (23 tests, 1 ms)

You might wonder why the time is logged there as well: Given the tests were run on every single compilation, they'd better run fast, so I always kept an eye on the test times, and if something started to get slow, I could move it into a separate test suite.

Over the years, this served me well, but I had a few gripes with Google Test. First of all, it was clear the project was developed by and for Google, so the direction it was going -- death tests, etc. -- was not exactly making my life simpler. At the same time, a new framework appeared on my radar: Catch.

Enter Catch

Why Catch, you may ask? For me, mostly for three reasons:

  • Simple setup -- it's always just a single header, no manual combining needed.
  • No fixtures!
  • More expressive matchers.

The first reason should be obvious, but let me elaborate on the second one. The way Catch solves the "fixture problem" is by having sections in your code which contain the test code; everything before a section is executed once per section. Here's a small appetizer:

TEST_CASE("DateTime", "[core]")
{
    const DateTime dt (1969, 7, 20, 20, 17, 40, 42, DateTimeReference::Utc);

    SECTION("GetYear")
    {
        CHECK (dt.GetYear () == 1969);
    }

    SECTION("GetMonth")
    {
        CHECK (dt.GetMonth () == 7);
    }

    // And so on
}

This, together with more expressive assertions -- no more ASSERT_EQ macros; instead you can write a normal comparison -- was enough to convince me of Catch. I still needed a couple of things, though:

  • Port a couple of thousand tests, with tens of thousands of test macros, from Google Test to Catch.
  • Implement a custom reporter for Catch.

Porting

As I'm a rather lazy person, and because the tests are super uniform in format, I decided to semi-automate the conversion from Google Test to Catch. It's probably possible to build a perfect automated tool, at least for the assertions, on top of Clang and rewriting the code, but I figured that if I get 80% or so done automatically, that's still fine. On top of that, I'm porting tests, so I can easily validate whether the conversion worked (the tests should still pass.) The script is not super interesting: it does a lot of regular expression matching on the macros and then hopes for the best. While it's probably going to explode when used in anger, it still converted the vast majority of the tests in my code. In total, it took me less than a day of typing to finish porting all my tests over.
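
The script itself isn't included in this post, but to give you an idea of the approach, here's a minimal sketch of what such a converter could look like. Everything in it -- the macro patterns, the argument handling -- is illustrative rather than the actual tool, and the naive comma handling will break on macros with nested commas:

import re
import sys

# Very rough Google Test -> Catch translation. Only the most common,
# uniformly formatted assertion macros are handled; everything else is
# left untouched for manual fixing.
RULES = [
    (re.compile (r'\bASSERT_TRUE\s*\((.+)\);'), r'REQUIRE (\1);'),
    (re.compile (r'\bEXPECT_TRUE\s*\((.+)\);'), r'CHECK (\1);'),
    (re.compile (r'\bASSERT_EQ\s*\(\s*([^,]+?)\s*,\s*(.+?)\s*\);'), r'REQUIRE (\2 == \1);'),
    (re.compile (r'\bEXPECT_EQ\s*\(\s*([^,]+?)\s*,\s*(.+?)\s*\);'), r'CHECK (\2 == \1);'),
    (re.compile (r'\bEXPECT_FLOAT_EQ\s*\(\s*([^,]+?)\s*,\s*(.+?)\s*\);'), r'CHECK (\2 == Approx (\1));'),
]

def Convert (line):
    for pattern, replacement in RULES:
        line = pattern.sub (replacement, line)
    return line

if __name__ == '__main__':
    for path in sys.argv[1:]:
        with open (path) as inputFile:
            lines = [Convert (line) for line in inputFile]
        with open (path, 'w') as outputFile:
            outputFile.writelines (lines)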

Before you ask why I didn't port to some other framework that's supposed to be faster than Catch: In my testing, Catch is fast enough that the test overhead doesn't matter. I can easily execute 20000 assertions in less than 10 milliseconds, so "faster" is not really an argument at this point.

What is interesting, though, is that moving to Catch brought a significant reduction in lines of code. Most of it came from the fact that fixtures were gone; more code now uses the SECTION macros, and I could merge common setup. Previously, I would often end up duplicating some small setup because it was still less typing than writing a fixture. With Catch, this is so simple that I ended up cleaning my tests voluntarily. To give you some idea, this is the commit for the core library: 114 files changed, 6717 insertions(+), 6885 deletions(-) (or -3%). For my geometry library, which has more setup code, the relative reduction was quite a bit higher: 36 files changed, 2342 insertions(+), 2478 deletions(-) -- 5%. A couple of percent here and there might not seem too significant, but they directly translate into improved readability due to less boilerplate.

There are a few corner cases where Catch just behaves differently from Google Test. Notably, an EXPECT_FLOAT_EQ against 0 needs to be translated into CHECK (a == Approx (0).margin (some_eps)), as Catch by default uses a relative epsilon, which becomes 0 when comparing against 0. The other one affects STREQ -- in Catch, you need to use a matcher for this, which turns the whole test into CHECK_THAT (str, Catch::Equals ("Expected str"));. The script will try to translate these properly, but be aware that those are the cases most likely to fail.

Terse reporter

The last missing bit is the terse reporter. The reporter interface changed again for Catch2 -- the current stable release -- so what follows is the updated version. The reporter lives in a catch-main.cpp which I compile into a static library, which then gets linked into each test executable. The terse reporter itself is straightforward:

namespace Catch {
class TerseReporter : public StreamingReporterBase<TerseReporter>
{
public:
    TerseReporter (ReporterConfig const& _config)
        : StreamingReporterBase (_config)
    {
    }

    static std::string getDescription ()
    {
        return "Terse output";
    }

    void assertionStarting (AssertionInfo const&) override {}
    bool assertionEnded (AssertionStats const& stats) override {
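        // Only failures produce output -- passing assertions stay silent to keep things terse.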
        if (!stats.assertionResult.succeeded ()) {
            const auto location = stats.assertionResult.getSourceInfo ();
            std::cout << location.file << "(" << location.line << ") error\n"
                << "\t";

            switch (stats.assertionResult.getResultType ()) {
            case ResultWas::DidntThrowException:
                std::cout << "Expected exception was not thrown";
                break;

            case ResultWas::ExpressionFailed:
                std::cout << "Expression is not true: " << stats.assertionResult.getExpandedExpression ();
                break;

            case ResultWas::Exception:
                std::cout << "Unexpected exception";
                break;

            default:
                std::cout << "Test failed";
                break;
            }

            std::cout << std::endl;
        }

        return true;
    }

    void sectionStarting (const SectionInfo& info) override
    {
        ++sectionNesting_;

        StreamingReporterBase::sectionStarting (info);
    }

    void sectionEnded (const SectionStats& stats) override
    {
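        // Accumulate durations only for top-level sections so nested sections aren't counted twice.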
        if (--sectionNesting_ == 0) {
            totalDuration_ += stats.durationInSeconds;
        }

        StreamingReporterBase::sectionEnded (stats);
    }

    void testRunEnded (const TestRunStats& stats) override
    {
        if (stats.totals.assertions.allPassed ()) {
            std::cout << "SUCCESS (" << stats.totals.testCases.total () << " tests, "
                << stats.totals.assertions.total () << " assertions, "
                << static_cast<int> (totalDuration_ * 1000) << " ms)";
        } else {
            std::cout << "FAILURE (" << stats.totals.assertions.failed << " out of "
                << stats.totals.assertions.total () << " failed, "
                << static_cast<int> (totalDuration_ * 1000) << " ms)";
        }

        std::cout << std::endl;

        StreamingReporterBase::testRunEnded (stats);
    }

private:
    int sectionNesting_ = 0;
    double totalDuration_ = 0;
};

CATCH_REGISTER_REPORTER ("terse", TerseReporter)
}

To select it, run the tests with -r terse, which will pick up the reporter. This will produce output like this:

SUCCESS (11 tests, 18 assertions, 0 ms)
SUCCESS (1 tests, 2 assertions, 0 ms)
SUCCESS (23 tests, 283 assertions, 1 ms)

As an added bonus, it also shows the number of test macros executed. This is mostly helpful for identifying tests that run through some long loops.

Conclusion

Was the porting worth it? Having spent some time with the new Catch tests, and after writing a bunch of new tests with it, I'm still convinced it was. Catch is really simple to integrate, the tests are terse and readable, and neither compile time nor runtime performance ended up being an issue for me. 10/10, would use again!

GraphQL in the GPU database

One thing I've been asked about is providing some kind of API access to the GPU database I'm running. I've been putting this off for most of the year, but over the last couple of days, I gave it yet another try. Previously, my goal was to provide a "classic" REST API, which would provide various endpoints like /card, /asic etc. where you could query a single object and get back some JSON describing it.

This is certainly no monumental task, but it never felt like the right thing to do -- mostly because I don't really know what people actually want to query, but also because it means I'd need to somehow version the API, provide tons of new routes, and then translate rather complex objects into JSON. Surely there must be a better way to query structured data in 2017, no?

GraphQL

Turns out, there is, and it's called GraphQL. GraphQL is a query language in which the client specifies the shape of the data it needs, and the server then builds up tailor-made JSON. On top of that, introspection is also well defined, so you can discover what fields an endpoint exposes. Finally, it provides a single endpoint for everything, making it really easy to extend.

I've implemented a basic GraphQL endpoint which you can use to query the database. It does not expose all information, but it provides access to what are hopefully the most frequently used bits of data. I'm not exposing everything, mostly due to the lack of pagination: with the allCards query you can practically join large parts of the database together, and I don't want to invite undue load on the server. As a small appetizer, here's a sample query executed locally through GraphiQL.

[Image: /images/2017/gpudb-graphql.png -- Using GraphiQL to query the GPU database programmatically.]
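
For readers who prefer text over screenshots, here's a minimal sketch of what querying the endpoint could look like from Python. The endpoint URL is a placeholder, and the exact query shape -- in particular whether allCards returns a plain list or a Relay-style connection -- depends on how the schema is set up, so treat this as illustration only:

import json
import urllib.request

# Placeholder URL -- substitute the actual GraphQL endpoint of the GPU database.
ENDPOINT = 'https://example.com/graphql'

# Ask only for the fields we care about; the server answers with matching JSON.
QUERY = '''
{
  allCards {
    name
    releaseDate
    computeUnitCount
  }
}
'''

request = urllib.request.Request (ENDPOINT,
    data = json.dumps ({'query': QUERY}).encode ('utf-8'),
    headers = {'Content-Type': 'application/json'})

with urllib.request.urlopen (request) as response:
    print (json.dumps (json.load (response), indent = 2))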

If you want to see more data exported, please drop me a line, either by sending me an email or by getting in touch through Twitter.

Background

What did I have to implement? Not that much, but at the same time, more than expected. The GPU database is built using Django, and fortunately there's a plugin for Django to expose GraphQL called graphene-django which in turn uses Graphene as the actual backend.

Unfortunately, Graphene -- and in particular graphene-django -- is not as well documented as I was hoping. There's quite a bit of magic happening where you just specify a model and it tries to map all of its fields, but those fields won't be documented. I ended up exposing things manually by restricting the fields I want using only_fields, and then writing at least a definition for each field, occasionally with a custom resolve function. For instance, here's a small excerpt from the Card class:

class CardType (DjangoObjectType):
    class Meta:
        model = Card
        name = "Card"
        description = 'A single card'
        interfaces = (graphene.Node, )
        only_fields = ['name', 'releaseDate'] # More fields omitted

    aluCount = graphene.Int (description = "Number of active ALU on this card.")
    computeUnitCount = graphene.Int (description = "Number of active compute units on this card.")

    powerConnectors = graphene.List (PowerConnectorType,
        description = "Power connectors")

    def resolve_powerConnectors(self, info, **kwargs):
        return [PowerConnectorType (c.count, c.connector.name, c.connector.power) for c in self.cardpowerconnector_set.all()]

    # more fields and accessors omitted

Here's another interesting bit. The connection between a card and its power or display connectors is a ManyToManyField, complete with custom data on it. Here's the underlying code for the link:

class CardDisplayConnector(models.Model):
    """Map from card to display connector.
    """
    connector = models.ForeignKey(DisplayConnector, on_delete=models.CASCADE)
    card = models.ForeignKey(Card, on_delete=models.CASCADE, db_index=True)
    count = models.IntegerField (default=1)

    def __str__(self):
        return '{}x {}'.format (self.count, self.connector)

In the Card class, there's a field like this:

displayConnectors = models.ManyToManyField(DisplayConnector, through='CardDisplayConnector')

Now the problem is how to pre-fetch the whole thing: otherwise, iterating over the cards will issue one query per card to fetch its display connectors, and then one more query per connector to get the data related to that connector… which led to a rather lengthy quest to figure out how to optimize this.

Optimizing many-to-many prefetching with Django

The end goal is to perform a single Card.objects.all() query which somehow pre-fetches the display connectors (the same applies to the power connectors, but I'll keep using the display connectors for the explanation.) We can't use select_related though, as that is only designed for foreign keys. The documentation hints at prefetch_related, but it's trickier than it seems. If we just use prefetch_related ('displayConnectors'), this will not prefetch what we want: what we want to prefetch is the actual relationship, and from there select_related the connector. Turns out, we can use a Prefetch object to achieve this. What we're going to do is prefetch the set storing the relationship (which is called carddisplayconnector_set), and provide the explicit query set to use, which in turn can specify the select_related data. Sounds complicated? Here's the actual query:

return Card.objects.select_related ().prefetch_related(
    Prefetch ('carddisplayconnector_set',
        queryset=CardDisplayConnector.objects.select_related ('connector__revision'))).all ()

What this does is force an extra query on the display connector table (with joins, as we asked for a foreign key relation there) and then cache that data in Python. Now, if we ask for the display connectors, we can look them up directly without an extra round trip to the database. How much does this help? It reduces the allCards query time from anywhere between 4-6 seconds, with 500-800 queries, down to 100 ms and 3 queries!
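
If you want to verify this kind of improvement yourself, Django can count the queries a block of code issues. Here's a minimal sketch, assuming the models are importable (the gpudb.models module path is made up for the example):

from django.db import connection
from django.db.models import Prefetch
from django.test.utils import CaptureQueriesContext

from gpudb.models import Card, CardDisplayConnector  # hypothetical module path

with CaptureQueriesContext (connection) as context:
    cards = Card.objects.select_related ().prefetch_related (
        Prefetch ('carddisplayconnector_set',
            queryset=CardDisplayConnector.objects.select_related ('connector__revision'))).all ()

    for card in cards:
        # Served from the prefetch cache -- no additional database round trip.
        list (card.carddisplayconnector_set.all ())

print ('{} queries'.format (len (context.captured_queries)))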

Wrapping it up

With GraphQL in place, and some Django query optimizations, I think I can tick off the "programmatic access to the GPU database" item from my todo list. Just in time for 2017 :) Thanks for reading, and don't hesitate to get in touch with me if you have any questions.

Version numbers

Version numbers are the unsung heroes of software development, and it still baffles me how often they get ignored, neglected, or not implemented properly. Selfish as I am, I wish everyone would get them right, and today I'm going to try to convince you why they are really important!

The smell of a release process

Having version numbers indicates some kind of release process -- assuming you don't assign a version to every single commit in your repository. It means you've reached a point where you think it's useful for your clients to update; otherwise there's no need to assign a new number yet. That's reason number one to have them: communication with your downstream clients. It might sound stupid, but just by assigning version numbers to your commits, a client can learn a lot about your project:

  • Size of each release -- seeing how many commits go into every single version gives an idea of how much churn there is.
  • Release frequency -- do you assign a new number once a week? Once a month? This gives a good idea of how quickly you're going to react to pull requests, issues, and more. It's also critical information for any system-level application, as an administrator may have to install the update; knowing the frequency and size of every release helps allocate the correct resources.
  • Bug fix check -- you fixed a bug; how does the client know it got fixed? Obviously, by presenting a version number to the user which can be queried.
  • Change logs -- assigning a version number is a good moment to sit back and think about what was added, writing up some documentation along the way.

You can encode even more information if you use semantic versioning, which in theory provides guarantees to clients about when it's safe to update, and more. While I like it in theory, I think semantic versioning is mostly useful for libraries, less so for large applications and frameworks, as you'll typically end up incrementing the major version a lot. The only really large project I'm aware of that follows semantic versioning is Qt -- and they do a quite impressive job with regards to API and ABI compatibility. It's nice to have if you can enforce it, and worth striving towards, but it's not the main value add.

But it's ... complicated!

I assume that most developers who don't use version numbers are aware of the reasons above and didn't just "forget" them, but have a hard time versioning for various reasons. Typically, these fall into two categories:

  • Continuous integration -- rapid releases, no formal release process.
  • Very branchy development process -- versions are branch-specific.

To point one, continuous integration: No matter how you write software, your releases happen over time. You typically don't expect your clients to update to every single release you're doing, so how about using the ISO date (year-month-day.release) as your version number? Turns out that will usually work just fine, and it still allows people to refer to things with a common naming scheme instead of referencing your code drops with a commit hash or some continuous integration build. In fact, I'd argue you're set up for success already, because the very same system you use for continuous integration can also assign version numbers.
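
To make that concrete, here's a minimal sketch of how a CI job could derive such a version number. The tag-counting scheme here is just one illustrative way to pick the release counter:

import datetime
import subprocess

# Date part of the version, e.g. 2017-12-27.
today = datetime.date.today ().strftime ('%Y-%m-%d')

# Release counter for the day: count existing tags starting with today's date.
# A real setup might derive this differently (build number, manual bump, ...).
existingTags = subprocess.check_output (
    ['git', 'tag', '--list', today + '.*']).decode ().split ()
version = '{}.{}'.format (today, len (existingTags) + 1)

print (version)  # e.g. 2017-12-27.1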

The other problem is a very branchy development process, where you have multiple lines of code in development concurrently. Let's say you have one branch for stable releases, one branch for future releases, and one maintenance branch, and there's no clean mapping between branches and versions. The trick here is to look at the problem from the client's end: for any given client, there's only a single branch they see. It's your duty to make sure the useful properties outlined above are present for all your clients, which may mean, for instance, that every branch gets versioned separately, or that you treat your branches as separate products. This is something I've noticed many people forget in software development -- we're not writing code for ourselves, we're writing code for our users, and if our process makes their life harder, we've failed, because (at least, that's the theory) there will be many more users than us, so their time is more precious.

Version all the things

I hope I could shed some light on the value of version numbers and make you hesitate the next time you're about to send an email which says "everyone should use commit #4237ac9b0f or later" :) Do yourself a favor, use that tag button in your revision control system, and make everyone's life simpler. Thanks!

JIRA & Confluence with systemd on CentOS

I used to run JIRA & Confluence "bare metal" on an Ubuntu machine, but recently I decided to put them into a VM. Backing up JIRA and Confluence is always a bit of a pain, especially if you want to move across servers. In my case, the biggest pain point was authentication: I'm using the JIRA user database for Confluence, and when restoring from backup, the application link from Confluence to JIRA doesn't get restored, so the first thing you do is lock yourself out of Confluence. Fixing that requires some fun directly in the database, but that's a story for another day (it's also well documented on the Atlassian support page.)

Setting the stage

While deciding which OS to use for the VM, I ended up with CentOS, as that's the only one on which the installer is officially supported. It turns out, though, that the JIRA (and Confluence) installers set up some scripts in /etc/init.d (for System V init), while CentOS is a systemd-based distribution. That on its own wouldn't bother me much, but occasionally Confluence would start before the PostgreSQL database was online and then exit, and JIRA would subsequently fail to see the application link (as Confluence was down). Long story short, there's a startup order which should be followed, and instead of mixing the two systems and playing around with runlevels until things work, I decided to move everything to systemd and solve it once and for all.

Getting rid of the init scripts

The first thing which really confused me was that systemctl would show a jira and a confluence service, but no such services were present in any of the systemd configuration directories. It turns out systemd automatically generates services for init scripts. With that knowledge, it's easy to see how we can fix this: a service with the same name takes precedence over a System V init script, so we could just set up our own services and rely on that. I wanted to get rid of the init scripts wholesale, though, to clean everything up.

Unfortunately, it turns out chkconfig doesn't know about jira and confluence. That is, running chkconfig --del jira will tell you:

service jira does not support chkconfig

This could be solved by changing the scripts to make them chkconfig compatible, but we might as well perform the steps chkconfig --del does manually. First, I stopped the services using service jira stop and service confluence stop. Then I got rid of the jira and confluence scripts in /etc/init.d, which leaves the symlinks in /etc/rc3.d (and so on -- one per runlevel.) I simply searched the six runlevel folders and removed all S95jira, K95jira, S95confluence and K95confluence entries there. After a reboot, nothing was started automatically any more -- time for systemd.

Moving to systemd

systemd requires a service configuration file which contains the commands to execute, as well as dependencies -- just what we need. According to the Red Hat documentation, services added by an administrator go into /etc/systemd/system. Let's create one file per service then!

For JIRA, I created /etc/systemd/system/jira.service with the following contents:

[Unit]
Description=Jira Issue & Project Tracking Software
Wants=nginx.service postgresql.service
After=network.target nginx.service postgresql.service

[Service]
Type=forking
User=jira
PIDFile=/opt/atlassian/jira/work/catalina.pid
ExecStart=/opt/atlassian/jira/bin/start-jira.sh
ExecStop=/opt/atlassian/jira/bin/stop-jira.sh

[Install]
WantedBy=multi-user.target

This assumes all the default paths. The only interesting lines are the Wants and After lines, which specify that JIRA has to come online after postgresql.service, and that starting JIRA should also pull in postgresql.service. For Confluence, the file looks virtually the same, except I make it dependent on JIRA as well -- otherwise users can't log in anyway. Here's the corresponding /etc/systemd/system/confluence.service:

[Unit]
Description=Confluence Team Collaboration Software
Wants=postgresql.service nginx.service jira.service
After=network.target jira.service postgresql.service nginx.service

[Service]
Type=forking
User=confluence
PIDFile=/opt/atlassian/confluence/work/catalina.pid
ExecStart=/opt/atlassian/confluence/bin/start-confluence.sh
ExecStop=/opt/atlassian/confluence/bin/stop-confluence.sh

[Install]
WantedBy=multi-user.target

I later found out that virtually identical service definitions exist elsewhere, but they're missing the Wants/After dependencies. All that is left is to actually enable and start the services. This is rather straightforward:

$> systemctl daemon-reload
$> systemctl enable jira
$> systemctl enable confluence
$> systemctl start confluence

As Confluence depends on JIRA, starting it will bring up JIRA automatically as well. Now we have both running through systemd, with their dependencies properly specified. Just run systemctl list-dependencies jira to see that it depends on PostgreSQL (and nginx.) With that, there are no more failures due to funky start ordering, and as a bonus, everything uses systemd instead of having to deal with compatibility modes.

Docker, KVM and iptables

A quick blog post, as I've recently been setting up a server with KVM and docker. If you're using a bridge interface to have your VMs talk to your network, you might notice that after a docker installation, your VMs suddenly have no connection. So what is going on?

It turns out, docker adds a bunch of iptables rules by default which prevent that communication. These interfere with an already existing bridge, and suddenly your VMs will report no network. There are two ways to solve this:

  • Add a new rule for your bridge.
  • Stop docker from adding iptables rules.

I'm assuming Ubuntu 17.04 for the commands below; they should be similar on any Debian based system.

Solution 1: Add a new rule

In my case, my bridge is called br0. docker changes the default policy of the FORWARD chain from ACCEPT to DROP and adds a few exceptions for itself. Adding a new forward rule for br0 will allow your bridge (and the devices behind it) to reach your network again instead of getting dropped:

iptables -A FORWARD -i br0 -o br0 -j ACCEPT

Unfortunately, this won't be persistent -- you'll need the iptables-persistent package to make it stick, plus some extra setup. It's good for a quick test, though!

Solution 2: Stop docker from adding iptables rules

In my case, the server is not on the public internet, and I've got no need for the extra security. It turns out that the docker service adds the rules on startup, unless --iptables=false is used. The flag can either be added to the default docker configuration, or -- slightly cleaner in my opinion -- the equivalent setting can go into the daemon.json configuration file (see the documentation for all options). Create a file /etc/docker/daemon.json with the following contents:

{
    "iptables" : false
}

After restarting the docker service, that'll stop docker from adding new rules, and everything will work as it did before.