Additive Testing: One Approach to ActivityPub-Centric Testing

by bumblefudge

Introduction

Recently, there has been a renewed interest in federation mechanics across Fediverse platforms, and in introducing new features (or better-standardizing UX across implementations of existing ones) across the Masto-verse, the Fediverse, or both. An entirely natural instinct in times like these is to search for funding and coordination, which are traditionally linked closely; usually, a central actor or group applies for a public-good grant and stakes their reputation on what will get done by a group, and vets contributions to a central codebase.

All of us founders of the socialweb coöperative come from wide professional and community experience across many corners of the vast overlapping fields of FOSS, decentralized development, and decentralized communities. The testing approach guiding our work so far has been that:

good protocols are defined explicitly as a series of behaviors and actors,
good tests correspond to specified behaviors and actor-models primarily and explicitly, but also annotate their own assumptions where these are known to be divergently interpreted or controversial,
good testing scripts or drivers should be composable and thus designed with remixing, subsetting, and extending or supersetting in mind,
good test artefacts and documentation clearly state their assumptions and goals,
human organization and collaboration on open software should be at least as additive as git: forks will be inevitable, so lean into upstream/downstream pull requests and maximize the chance that forks will share back and be the primary avenue of contribution.

Protocols, Platforms, Products

In the context of the Fediverse, it is crucial to keep clear not just the traditional layering model of scalable software design, but also the equivalent layering model of social and political systems around software. Simplifying a little, we could define these layers, from the ground up, as:

ActivityPub/Streams protocol for hooking up firehoses of heterogeneous social content and making it reasonably intelligible, navigable, and sometimes searchable across different form-factors and contexts.
Platforms of federation, i.e. corpuses of inter-federated servers and/or serverless clients using AP/S to share content. There are already multiple overlapping platforms within the server-powered Fediverse today, as well as a few "de-federated" parallel platforms (like the Gabiverse and debatably the Japanese Misskeyverse) and less server-centric networks overlapping these.
Products, or perhaps "products," given the non-commercial or anti-commercial ethos of many of the fediverse's codebases.

The protocol was itself designed to be a common ground between a few participating platforms, a superset to all of them and a north-star towards guiding a broader and more open horizon of interoperability. In that sense, the assumed and explicit goal of, say, the Social Web Incubator Community Group (SWICG) at W3C should be for federation to become as ubiquitous and neutral a protocol as email, bittorrent, or RSS.

Admittedly, though, designing a spanning protocol from a few scrappy, bootstrapped FOSS and #indie platforms was an ambitious project and, like the work of all open source coalitions, an ongoing one always struggling for focus, availability, documentation, and feedback! "Pure" ActivityPub implementations remain a somewhat academic or purist endeavor, still quite removed from the bulk of the fediverse's userbase; this latter operates on de facto platforms and a handful of strong brands that "filled in the gaps" between the low-level Activity* specs and user-experience expectations that they could build in a reasonable span of time, to meet users where they are today.

In 2023, neither pure Activity* conformance NOR "platform pragmatism" ("give me a recipe for interoperating with Mastodon, or Mastodon+2") is going to get us to a healthier version of today's ecosystem, much less to one that large players and healthy Web2 platforms can and want to federate with. We need "platform profiles" subsetting or extending the Activity* protocol, but we also need to reinforce the core protocol at the same time, extending it with new features and nuancing it with errata and implementer feedback to strengthen it as the core of interoperability.

A plan for multiple plans

Because most economic and social incentives support the platform profile layer of interoperability and testing, we thought we would play foil to that and do the perhaps harder, more long-term-ist work of complementing it with a more protocol-focused approach. We are working from the starting point that the protocol defines behaviors in the abstract, making as few assumptions as possible about realistic architectures, privacy/security tradeoffs, moderation scaling, etc. We feel that the best way to balance long-term potential against short-term viability is to work up from abstract behaviors to architectural solutions, showing our work step by step and documenting every fork in the road where an architectural assumption or a tradeoff snuck into our mental model.

It helps to conceive of a "testing stack". A more centralized endeavor could have the luxury of skipping steps and rushing through it, but for a collaborative and additive effort the extra complexity is justified, because it maximizes re-use and re-mixing by future contributors as well as current ones. This testing stack looks like:

1. Behaviors in a protocol
2. Architectural assumptions and choices (e.g. feature parity with Web2 social UX, server-based, local-first, self-hosted/microserver-friendly, etc.)
3. Individual tests in human-readable form (e.g. Gherkin tests) mapping 1 to 2
4. Machine-readable/automated tests interpreting 3
5. Collections of tests into scripts or drivers (e.g. CI scripts guarding main branch on a server or client)
6. Architecture-specific profiles of those behaviors (e.g. platform profiles across servers and clients in a defined interop network)

Starting from 6 and working back to 4 would be the fastest path if our top priority were, say, maximizing interoperability with today's Mastodon or Lemmy API as an end unto itself. Doing so would have a lot of side-effects, though, in that it amplifies short-term design choices and risks ossifying the status quo of an API into the end-all-be-all of an interoperability network. A less obvious side-effect of this approach is that it threatens and disincentivizes diversity of architectures and form-factors, narrowing the focus of the greater fediverse to a very specific sense of "server-centric social media" that marginalizes other forms of social web thinking and disincentivizes many forms of experimentation and evolution. In other words, it maps 6 onto 2 1:1 and freezes 2!

What we are trying to do, instead, is create a through-line from layer 1 to layer 6, annotating and connecting work at different layers to make it all maximally re-usable and re-contextualizable. Or, to put it another way, we want work at all 6 layers to be atomic and composable, so that all further work can be additive. That means:

tests that link to the behavior they test (these are quite easy to slap a UUID on and formalize, see below)
human-readable tests that link to a set of use-case and architecture assumptions (this is the hardest to formalize and might never be "stable")
sets of human-readable tests that are considered "complete" for a given use-case/profile
human-readable tests that link to one or more testing implements for different contexts (not just different languages, but different authorization models/trust boundaries, etc)
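
The links listed above can be sketched as a small data model. This is only a minimal illustration under our own assumptions, not a schema anyone has published: the class names (`Behavior`, `HumanReadableTest`, `Profile`), every field name, and the rule that a profile is "complete" once each required behavior has at least one test are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from uuid import UUID, uuid4


@dataclass(frozen=True)
class Behavior:
    """Layer 1: one behavior defined by the protocol, identified by a UUID."""
    id: UUID
    summary: str


@dataclass(frozen=True)
class HumanReadableTest:
    """Layer 3: a human-readable test that records what it tests and assumes."""
    name: str
    behavior_id: UUID                          # link to the behavior under test
    assumptions: frozenset[str] = frozenset()  # link to layer-2 assumptions
    implementations: tuple[str, ...] = ()      # links to testing implements


@dataclass
class Profile:
    """Layer 6: a set of tests considered complete for one use-case."""
    name: str
    required: set[UUID]
    tests: list[HumanReadableTest] = field(default_factory=list)

    def missing_behaviors(self) -> set[UUID]:
        # Required behaviors that no test in this profile links to.
        return self.required - {t.behavior_id for t in self.tests}

    def is_complete(self) -> bool:
        return not self.missing_behaviors()


# Placeholder data, purely illustrative.
behavior_a = Behavior(uuid4(), "placeholder behavior A")
behavior_b = Behavior(uuid4(), "placeholder behavior B")

test_a = HumanReadableTest(
    name="placeholder test for behavior A",
    behavior_id=behavior_a.id,
    assumptions=frozenset({"server-based"}),
    implementations=("hypothetical-python-driver",),
)

profile = Profile("hypothetical-profile", {behavior_a.id, behavior_b.id}, [test_a])
print(profile.is_complete())
print(profile.missing_behaviors() == {behavior_b.id})
```

Because every object only points at identifiers, a fork can add tests, swap implementations, or define a new profile without rewriting anything upstream, which is the additive property we are after.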

Side note: Order of Operations

Of all the layers, perhaps layer 2 is the hardest to think about, document, and be objective about.
It is a real challenge to achieve neutrality and fairness in documenting and understanding the tradeoffs in any architecture, particularly the ones you don't like for cultural, economic, or political reasons, or the ones you worked on for many years and swore off dramatically. But without documenting these tradeoffs, we can be trapped in them: today's architectures can harden into a worldview and take on a kind of inertia that can be hard to break out of!

Less esoterically, though, it can also be very hard to cultivate interoperability across different architectures. Assuming one set or profile, coming up with a complete set of tests for that profile, and only then trying to interoperate with a different architecture and seeing where those assumptions bump up against equally-valid but incompatible interpretations of the same behavioral expectations might just be the fastest way to arrive at multiple valid architectural profiles.

Iterating in this kind of system means swapping and replacing individual units over time, tho. I.e., you'll thank us later for insisting on doing it the hard way at the start of the journey.

A piece offering: halfway to a set of scripts

We've started the ball rolling with a simple effort: we read through the ActivityPub spec and made a machine-readable version of every MUST behavior in it. Or, to be more precise, we started by making a spreadsheet of all the normative statements in the specification, one per row, and annotating them a bit. If they are useful to you (or if they help you to point out flaws in our work so far!) you can download them in TSV or CSV form. Since running through the specification together and making these notes, we've further refined the mental model and layered more structure onto the UUID-identified behaviors described in the next section; we want those to be the definitive, interactive, and authoritative version of the behaviors! (Note: each UUID can be tacked onto the URL https://socialweb.coop/activitypub/behaviors/+{UUID} !) But these "working notes" along the way might inspire others (or just save them time) in working at other layers of the 1-6 stack defined above, so we're open-sourcing them to whoever finds them interesting.
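
For readers who would rather treat a one-statement-per-row spreadsheet as data, here is a minimal sketch of grouping rows by normative keyword. The column names (`section`, `keyword`, `statement`) and the sample rows are placeholders we made up for illustration; the real files may be laid out differently, so check their header row first.

```python
import csv
import io
from collections import defaultdict

# Hypothetical layout and placeholder rows; NOT copied from the real spreadsheet.
SAMPLE_TSV = (
    "section\tkeyword\tstatement\n"
    "x.1\tMUST\t(placeholder normative statement 1)\n"
    "x.2\tSHOULD\t(placeholder normative statement 2)\n"
    "x.3\tmust\t(placeholder normative statement 3)\n"
)


def group_by_keyword(text: str, delimiter: str = "\t") -> dict:
    """Group spreadsheet rows by their normative keyword (MUST, SHOULD, ...)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    grouped = defaultdict(list)
    for row in reader:
        # Normalize case so "must" and "MUST" land in the same group.
        grouped[row["keyword"].strip().upper()].append(row)
    return dict(grouped)


grouped = group_by_keyword(SAMPLE_TSV)
must_rows = grouped.get("MUST", [])
for row in must_rows:
    print(row["section"], row["statement"])
```

Passing `delimiter=","` would handle the CSV form in the same way.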

While the CSV form is more useful to import as a dataset and includes a column linking each MUST behavior to its JSON-LD object described below, the earlier TSV is simply presented as a passive document, meant to be imported into, say, LibreOffice or your spreadsheet program of choice as a note-taking or organizational tool. Seen in this way (as a spreadsheet), the additional columns come into view as a mental exercise in structuring and grouping. While reading and analyzing each behavior, I tried to imagine how each of these behaviors could be tested in a context where users had accounts on today's Mastodon-style administrative community servers. Thus, where user A sends something to user B via server A communicating with server B, I imagined a scenario where each behavior could be tested by "spoofing" the behavior or outputs of user A, and/or server A, and/or server B, etc. You could say that without documenting my assumptions, I was jumping ahead to breaking up the list of behaviors into a set of scripts where one would replace, spoof, impersonate, or take control of server A to test server B, another would take control of server B to test server A and client B, etc., until every behavior was tested.
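
To make the "spoofing" idea concrete, here is a minimal in-process sketch in which a test driver impersonates server A and checks one behavior of server B. The behavior chosen is inbox de-duplication by activity `id`, which, as we read the ActivityPub inbox section, is a MUST. Everything else is an assumption for illustration: `ToyServerB`, `NaiveServerB`, `SpoofedServerA`, and `check_inbox_deduplication` are hypothetical names, and a real driver would deliver over HTTP to a live inbox rather than call a method.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyServerB:
    """Stand-in for the server under test; it de-duplicates by activity id."""
    inbox: list[dict] = field(default_factory=list)
    _seen_ids: set[str] = field(default_factory=set)

    def receive(self, activity: dict) -> None:
        if activity["id"] in self._seen_ids:
            return  # drop activities already seen
        self._seen_ids.add(activity["id"])
        self.inbox.append(activity)


@dataclass
class NaiveServerB:
    """A deliberately non-conforming stand-in that stores every delivery."""
    inbox: list[dict] = field(default_factory=list)

    def receive(self, activity: dict) -> None:
        self.inbox.append(activity)


class SpoofedServerA:
    """Test driver impersonating server A, delivering on behalf of user A."""

    def __init__(self, target) -> None:
        self.target = target

    def deliver(self, activity: dict, times: int = 1) -> None:
        for _ in range(times):
            self.target.receive(activity)


def check_inbox_deduplication(server_b) -> bool:
    """Deliver the same activity twice; pass if the inbox holds it once."""
    activity = {
        "@context": "https://www.w3.org/ns/activitystreams",
        "id": "https://a.example/activities/1",
        "type": "Create",
        "actor": "https://a.example/users/a",
        "object": {"type": "Note", "content": "hello"},
    }
    SpoofedServerA(server_b).deliver(activity, times=2)
    matching = [a for a in server_b.inbox if a["id"] == activity["id"]]
    return len(matching) == 1


print(check_inbox_deduplication(ToyServerB()))
print(check_inbox_deduplication(NaiveServerB()))
```

The check only depends on the target exposing `receive` and `inbox`, so the same script could in principle be pointed at a different stand-in or adapter without changing the behavior it asserts.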

This might still be worth doing, urgently, today! I will gladly work with anyone doing this work as well. But we are focusing on the big-picture collectively and coming back to layer 2 later, since it seems more urgent to reduce redundant work and get people aligned where their testing work is easily alignable.

A git offering: halfway to a dataset

We've also started the ball rolling on some UUID-identified JSON-LD blobs that identify each behavior under test. So far these cover only the MUST behaviors in ActivityPub, with the rest of ActivityPub and ActivityStreams coming soon. You can click around on them live on this very website and you can fork the code or contribute to it on the gitiverse.
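
As a rough illustration of what a UUID-identified JSON-LD blob for a behavior might look like, here is a sketch that builds one and round-trips it through JSON. This is not the actual published shape: the `@vocab` URL, the `Behavior` type, and the property names `requirementLevel`, `specification`, and `description` are all hypothetical, and the description is a placeholder. Only the `urn:uuid:` identifier form and the ActivityPub specification URL are real.

```python
import json
import uuid

SPEC_URL = "https://www.w3.org/TR/activitypub/"


def make_behavior_blob(behavior_id: uuid.UUID, level: str, description: str) -> dict:
    """Build a JSON-LD-shaped dict for one behavior (hypothetical vocabulary)."""
    return {
        # Hypothetical vocabulary; a real dataset would publish its own context.
        "@context": {"@vocab": "https://example.org/hypothetical-testing-vocab#"},
        "@id": f"urn:uuid:{behavior_id}",
        "@type": "Behavior",
        "requirementLevel": level,
        "specification": SPEC_URL,
        "description": description,
    }


blob = make_behavior_blob(
    uuid.uuid4(), "MUST", "(placeholder description of one behavior)"
)
serialized = json.dumps(blob, indent=2, sort_keys=True)
restored = json.loads(serialized)
print(serialized)
```

A stable identifier per behavior is what lets tests, scripts, and profiles written by different people point at the same thing.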

Our hope is that this approach will get us more slowly but more carefully to the point where conformance can be deterministic, collaboration can be additive, and whatever tradeoffs and decisions have to be made can be made explicitly and in a well-documented way. Profiling will still be inevitably political and perhaps even divisive, but at least it won't be tendentious or subjective and hopefully it can be more fact-based and respectful.

A coordination offering: monthly meetings?

We have no interest in centralizing parallel and independent testing efforts, except insofar as those efforts could find common anchors in the specs themselves and share their work up- and down-stream as the forks proliferate. We do, however, have a lot of experience with testing efforts across open-source software and protocols, so if hosting SWICG "special topic calls" on the subject would be a welcome venue for collaboration, we would love to host such calls, take notes at them, publish the notes to the SWICG archive, and pick up action items between calls to keep the momentum up.