Deciding what to test in AI prototypes

Deciding what to test is the first, and most important, step in defining an AI prototype. This choice shapes every other decision in the prototype's design.

Defining the hypothesis under test is important because prototypes are messy. And messy experiments give muddled results, hiding the relevant amongst the incidental.

Prototypes are broad-brush approximations of the final product. The learnings from a prototype can be game-changing, intriguing, and wholly surprising. But to learn from a prototype with confidence, the effect or insight needs to be large.

It is very easy to take a finding from a prototype and generalise it, only to find later that the learning was tied directly to some imperfection in the prototype itself. Minor differences between the prototype and the end-product can and do affect the learnings. Details such as how fast an element loads, or whether the user is constrained to a few journeys, have very real effects on how the user responds.

With prototypes, we're looking for big effects. Things that are obvious once our attention is drawn to them. Not optimisations. Leave optimisations for later in the design process, where A/B or multi-variate testing on large user groups is better suited.

With many elements under test, the feedback will be noisy. It is difficult to untangle the causes and effects of what our users tell and show us. 

The types of things we might want to test include (a minimal harness sketch follows the list):

  • The technical details

    • The performance of the model. 

    • The speed of delivering the model results. 

    • The rate of feedback from a model and whether a user can visibly ‘teach’ the system.

  • The interface

    • How interactive the AI feature is.

    • Whether there are separate elements for the AI feature, and how these are delineated from the rest of the system.

  • The messaging

    • Explaining the AI algorithm: what it does and how it learns.

    • Teaching the user how to make the product learn. 

    • How numeric the model results are, and how numerate the user is expected to be.

    • Whether and how we communicate error messages. 

  • Error correction

    • How to put fail-safes in place in case of error. 

    • How to determine if the model has broken down. 

    • What we do when the model breaks down. 

    • How to recover from catastrophic error. 
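
As a concrete illustration of the technical and error-correction items above, the model can sit behind a thin test harness that pins down the properties being varied. The following is a minimal Python sketch under assumed conditions, not a prescription: model_fn, delay_seconds and the fallback text are hypothetical placeholders chosen for illustration.

    import time

    def prototype_response(model_fn, user_input, delay_seconds=0.0,
                           fallback="Sorry, that didn't work. Please try again."):
        """Wrap a model call for a prototype test session.

        delay_seconds simulates a chosen response speed, so the effect of
        delivery time can be tested deliberately rather than by accident.
        Any exception or empty result triggers the fallback, a simple
        fail-safe for when the model breaks down.
        """
        start = time.monotonic()
        try:
            result = model_fn(user_input)
        except Exception:
            result = None  # treat a crash as a model breakdown

        if not result:
            result = fallback  # fail-safe: never leave the user with nothing

        # Pad the response time up to the configured delay so that every
        # participant in this test condition experiences the same latency.
        elapsed = time.monotonic() - start
        if elapsed < delay_seconds:
            time.sleep(delay_seconds - elapsed)

        return result

Holding latency and failure behaviour constant like this keeps the noisy elements of the prototype from leaking into the finding under test.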

Separating these tests is important. When testing the user impact of technical details, it is best to have already arrived at a finalised design for the interface, messaging and error communication.

Messaging is closely tied to both the interface and error-handling, and often won't be tested alone. Instead, interface and messaging, or error-handling and messaging, will be tested as a pair.

The important thing to bear in mind is that we don't want to rapidly swap between these permutations in the hope of observing fine differences in user responses that point to the optimal combination. With small user groups the results will certainly not be statistically significant, nor usually generalisable or relevant.

Instead, choose a configuration with clearly defined upfront assumptions and observe whether the user behaves as expected, and if not, why not.
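
One way to make those upfront assumptions explicit is to write the configuration down before any participant sees the prototype. The sketch below is a hypothetical record of such a test, again in Python; every field name and value is illustrative rather than drawn from a real study.

    from dataclasses import dataclass, field

    @dataclass
    class PrototypeTest:
        """One prototype test: one hypothesis, one varied element,
        everything else held fixed."""
        hypothesis: str                 # the expectation, stated before testing
        element_under_test: str         # the single thing being varied
        fixed_elements: dict = field(default_factory=dict)  # everything held constant
        expected_behaviour: str = ""    # what the user should do if the hypothesis holds

    # Illustrative configuration only; the values are made up.
    explanation_test = PrototypeTest(
        hypothesis="A short explanation of the recommendation increases user trust",
        element_under_test="explanation message",
        fixed_elements={
            "interface": "card layout, single journey",
            "response delay": "one second",
            "error handling": "fallback text",
        },
        expected_behaviour="The user accepts the first recommendation without querying it",
    )

If the observed behaviour differs from expected_behaviour, the record makes it clear which single element could be responsible.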
