Algo Wheel of Fortune

Algo wheels are one of the most talked-about topics within the electronic trading community. Initially, the term “algo wheel” referred strictly to the process of randomly allocating orders across algorithms. Over time, however, it has come to include the entire strategy evaluation process, i.e., the wheel itself and the associated post-trade analysis. The conflation of these separate activities has unfortunately created confusion about where an algo wheel adds value over traditional approaches, and where it doesn’t.

In this note, we attempt to address this confusion by identifying the specific areas where the algo wheel can add value, as well as those key issues that remain even in the presence of a well-functioning algo wheel. Throughout our discussion, we use the term “algo wheel” to refer strictly to the random order allocation process itself, as this is what separates the algo wheel process from the more traditional ones.

Where algo wheels can add value

Algo wheels help to create similar samples across algorithms

The primary benefit of the algo wheel is its ability to create subsamples of data that allow for apples-to-apples comparisons. By randomly allocating orders to algorithms, the algo wheel eliminates the biases that often resulted from trader-based routing. And by standardizing the algorithms within a given strategy grouping, the algo wheel makes it possible to collect sufficient data for each algorithm, without having to worry about controlling for all the various permutations within a given algorithm (e.g., different volume limits, etc.).

Algo wheels reduce complexity on the buyside trading desk

Not only can algo wheels create unbiased samples across algorithms, they do so in a workflow-friendly way. Prior to the advent of the (automated) algo wheel, trading desks hoping to standardize the routing process had to do so by turning the trader into a sort of human algo wheel. Such a process could be operationally tedious as well as error-prone. The algo wheel, by contrast, requires that traders simply select a specific strategy type, and from there, the algo wheel does all the work.
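To make the workflow concrete, here is a minimal sketch of the random allocation step, assuming the trader has already chosen a strategy type. The strategy groups, algorithm names, and weights are all hypothetical placeholders, not a description of any particular vendor's wheel.

```python
import random

# Hypothetical wheel configuration: strategy type -> {algorithm name: allocation weight}.
WHEEL = {
    "vwap": {"BrokerA_VWAP": 0.5, "BrokerB_VWAP": 0.5},
    "arrival": {"BrokerA_IS": 0.4, "BrokerB_IS": 0.3, "BrokerC_IS": 0.3},
}


def spin(strategy_type, rng=random):
    """Randomly pick one algorithm from the group for the chosen strategy type."""
    group = WHEEL[strategy_type]
    algos = list(group)
    weights = list(group.values())
    # random.choices draws with replacement, proportionally to the weights.
    return rng.choices(algos, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only so the illustration is repeatable
    for _ in range(5):
        print(spin("arrival", rng))
```

The trader's only decision in this sketch is the strategy type; the choice of algorithm within the group is left entirely to the random draw.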

The key issues that still remain

Algo wheels do not guarantee that subsamples are sufficiently similar

While algo wheels can eliminate biases and create comparable subsamples of data, this is by no means assured. For example, a sequential randomization process simply cannot guarantee that the resulting subsamples will actually be comparable for any finite sample. Early on in the allocation process, the composition of each subsample will change, sometimes dramatically, as orders are allocated to that sample. Any cross-algorithm comparisons in the early stages would likely be apples-to-oranges. Over enough orders, however, the characteristics of each subsample will eventually stabilize. At that point, the subsamples would resemble the aggregate flow and consequently would resemble one another. But how many orders are needed for this convergence to occur? How long will it take before the oranges eventually turn into apples?

In essence, the wheel’s power relies on the “law of large numbers”, where eventually, if the wheel is truly random and the flow consistent, the subsample characteristics will converge, and the cross-algorithm comparisons will be truly apples-to-apples. But with so many factors influencing performance, it may take a lot of orders – and a lot of time – before each subsample is sufficiently similar. Consequently, even when using an algo wheel to allocate orders, the analyst may still need to employ the same types of standardization techniques that are traditionally used to eliminate any residual biases.
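A small simulation can give a feel for how slowly this convergence may happen. The sketch below randomly allocates orders to two algorithms one at a time and measures the gap in average order size between the resulting subsamples. The lognormal order-size distribution and all function names are assumptions chosen purely for illustration.

```python
import random
import statistics


def mean_gap(n_orders, n_algos=2, rng=None):
    """Sequentially allocate orders at random and return the largest gap in
    mean order size between any two non-empty subsamples."""
    rng = rng or random.Random()
    buckets = [[] for _ in range(n_algos)]
    for _ in range(n_orders):
        # Hypothetical order size (e.g., % of ADV), drawn from a skewed distribution.
        size = rng.lognormvariate(0.0, 1.0)
        buckets[rng.randrange(n_algos)].append(size)
    means = [statistics.fmean(b) for b in buckets if b]
    return max(means) - min(means)


def average_gap(n_orders, trials=200, seed=7):
    """Average the subsample gap over many simulated wheels of the same length."""
    rng = random.Random(seed)
    return statistics.fmean(mean_gap(n_orders, rng=rng) for _ in range(trials))


if __name__ == "__main__":
    for n in (20, 200, 2000):
        print(n, "orders -> average gap in mean size:", round(average_gap(n), 3))
```

With independent draws, the gap should shrink roughly in proportion to one over the square root of the number of orders. This example tracks only a single order characteristic, whereas real flow differs along many dimensions at once.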

One potential solution to this issue is to apply the allocation process across the full sample of orders jointly, instead of sequentially. For example, one buyside firm we are familiar with would run a single optimization that split that day’s portfolio rebalance trade into multiple baskets. Each basket was then routed to a different algorithm. The objective of the optimization was to create baskets that were as similar as possible with respect to liquidity, order size, etc. It also aimed to create similar risk exposure, e.g., similar net dollar (beta) exposure, to reduce the effect of market-wide moves on realized performance.[1] While such a process is not feasible for all trading desks, some trading desks, like those of buyside quant managers, could see vast improvements relative to the allocations of sequential algo wheels.
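As a rough illustration of joint allocation, the sketch below splits a full list of orders into baskets with similar total size using a simple "snake draft" heuristic. This is a stand-in for the optimization described above, not the method that firm used: it balances a single dimension (order size), whereas a real implementation would also consider liquidity, risk exposure, and so on. The function name is hypothetical.

```python
def snake_allocate(order_sizes, n_baskets):
    """Assign order indices to baskets so that basket size totals are similar.

    Orders are ranked from largest to smallest and dealt out in a
    back-and-forth ("snake") pattern: 0, 1, ..., k-1, k-1, ..., 1, 0, ...
    """
    ranked = sorted(range(len(order_sizes)),
                    key=lambda i: order_sizes[i], reverse=True)
    baskets = [[] for _ in range(n_baskets)]
    for position, order_index in enumerate(ranked):
        round_number, slot = divmod(position, n_baskets)
        # Reverse the dealing direction on every other round.
        basket = slot if round_number % 2 == 0 else n_baskets - 1 - slot
        baskets[basket].append(order_index)
    return baskets


if __name__ == "__main__":
    sizes = [10, 9, 8, 7, 6, 5, 4, 3]  # hypothetical order sizes
    for basket in snake_allocate(sizes, 2):
        print(basket, "total size:", sum(sizes[i] for i in basket))
```

Because the whole day's trade list is known up front, the baskets can be balanced by construction rather than by waiting for randomness to even things out.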

Algo wheels do not eliminate the need for sufficiently large sample sizes

Institutional trading data generally have a relatively low signal-to-noise ratio. As a result, relatively large samples are often required to glean statistically meaningful insights. The potential standardization of the subsamples by algo wheels does not eliminate this need. In fact, even if the algo wheel worked perfectly and the subsamples submitted to each strategy were identical, each subsample would still need to be sufficiently large to yield sufficiently precise statistics.

So, how much data is enough for statistical comparisons? This depends on several factors: for example, the size of the differences that can be reasonably expected across strategies (the bigger the difference, the less data needed), the standard deviation of each order’s performance measure (the higher the standard deviation, the more data needed), the number of strategies tested (more strategies require more orders), and how the orders are allocated across strategies (unequal allocations require more orders).

To get a sense of scale and of how these factors come into play, consider a simple thought experiment. Suppose the arrival price performance of only two strategies is being compared using a simple difference-of-means t-test.[2] Assume the standard deviation of each observation is 50 bps, the orders are completely independent, and each observation is given equal weight (i.e., the averages are order-weighted, not dollar-weighted). With this information in hand, the analyst could ask “If I uncovered a difference of 5 bps between the two strategies, how large would the sample have to be for that difference to be statistically significant?”

Given the inputs above, the analyst would need roughly 768 orders per strategy (768.32 before rounding), or about 1,537 orders in total.[3] If, on the other hand, the difference were much smaller, say 1 bp, the total sample size needed would jump to 38,416. And this is just for a comparison of two strategies; the aggregate sample would need to be even larger if more than two strategies were compared.[4]
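The arithmetic behind these figures can be checked with a few lines of code. The sketch below implements the formula given in footnote [3] under the same assumptions (two strategies, equal sample sizes, a common standard deviation, independent orders). The function name is ours.

```python
import math


def orders_per_strategy(stdev_bps, diff_bps, t_crit=1.96):
    """Orders needed in EACH of two equal-sized subsamples for a difference
    of `diff_bps` to reach the critical t-value: n = [sqrt(2) * t * stdev / x]^2."""
    return (math.sqrt(2) * t_crit * stdev_bps / diff_bps) ** 2


if __name__ == "__main__":
    for diff in (5, 1):
        n = orders_per_strategy(stdev_bps=50, diff_bps=diff)
        print(f"{diff} bps difference: {n:.2f} per strategy, {2 * n:.2f} in total")
```

Because the difference enters the formula squared, shrinking the detectable difference from 5 bps to 1 bp multiplies the required sample by 25.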

Of course, if the data are less noisy, the differences between strategies are expected to be quite large, and/or the data are part of a portfolio trade where the buy and sell orders “hedge” one another, then the sample sizes needed would be smaller than those presented above. But the broader point is that the presence of an algo wheel does not fully address the need for sufficient sample size. In fact, the analyst has even greater need to understand the requisite sample sizes, as this information is key in determining how many algorithms to compare contemporaneously, how long it will take before the results are statistically meaningful, etc.
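For planning purposes, the same two-strategy test can be generalized to unequal allocations. The variance of the difference in means is proportional to 1/n1 + 1/n2, so if a fraction w of the orders goes to the first strategy, the total required is N = (t * stdev / x)^2 / (w * (1 - w)). The sketch below applies this and converts the result into trading days. The function names and the daily order count are hypothetical.

```python
import math


def total_orders_needed(stdev_bps, diff_bps, share=0.5, t_crit=1.96):
    """Total orders across two strategies when a fraction `share` of the
    flow is routed to the first strategy and the rest to the second."""
    if not 0.0 < share < 1.0:
        raise ValueError("share must be strictly between 0 and 1")
    return (t_crit * stdev_bps / diff_bps) ** 2 / (share * (1.0 - share))


def trading_days_needed(total_orders, orders_per_day):
    """Whole trading days required to accumulate the given number of orders."""
    return math.ceil(total_orders / orders_per_day)


if __name__ == "__main__":
    for share in (0.5, 0.8):
        total = total_orders_needed(stdev_bps=50, diff_bps=5, share=share)
        days = trading_days_needed(total, orders_per_day=100)  # assumed flow
        print(f"share={share}: {total:.2f} orders, about {days} trading days")
```

An equal split minimizes the total. In this example, routing 80% of the flow to one strategy raises the requirement from about 1,537 orders to about 2,401.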

Algo wheels do not solve outlier-driven issues

Another common issue in performance analysis is the presence of outlier trades. By definition, outliers occur in small numbers relative to the aggregate sample size. Consequently, algo wheels generally cannot allocate away the issue. Put another way, even if the algo wheel were to assign outliers across samples in the best possible way, it would not eliminate the need to deal with the potential outsized influence these orders would have on the analysis.
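To illustrate how much a single order can move an average, the sketch below compares a raw mean with a winsorized mean, in which extreme values are clamped to chosen percentiles. Winsorizing is only one of several common ways to limit outlier influence. It is our choice for this example, and the data and percentile cutoffs are hypothetical.

```python
import statistics


def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Clamp each value to the range spanned by the lower and upper
    percentile observations of the sample."""
    ordered = sorted(values)
    last = len(ordered) - 1
    low = ordered[int(lower_pct * last)]
    high = ordered[int(upper_pct * last)]
    return [min(max(v, low), high) for v in values]


if __name__ == "__main__":
    # 19 ordinary orders with costs from -10 to 8 bps, plus one 900 bps outlier.
    costs_bps = list(range(-10, 9)) + [900]
    print("raw mean:", statistics.fmean(costs_bps))
    print("winsorized mean:", statistics.fmean(winsorize(costs_bps)))
```

Whichever subsample happens to receive the 900 bps order would look far worse on a raw average, regardless of how well the wheel randomized the rest of the flow.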

Algo wheels cannot control for post-allocation changes

While the initial allocation of an algo wheel is random, traders may still have some flexibility to update order parameters, e.g., price limits. Traders and PMs may also cancel orders, perhaps because of bad performance, low fill rate, etc. As such, the analyst still needs to investigate these potential issues after the fact, both to properly measure performance and to ensure that these actions don’t create unintended biases across algorithms.[5]
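One simple first check is to compare how often orders were amended or cancelled under each algorithm. The sketch below does this for a list of order records. The record layout (keys "algo", "amended", "cancelled") is a hypothetical schema, and a large gap in rates is only a prompt for further investigation, not proof of bias.

```python
from collections import defaultdict


def intervention_rates(orders):
    """Return amend and cancel rates per algorithm.

    `orders` is an iterable of dicts with the (hypothetical) keys
    'algo', 'amended', and 'cancelled'.
    """
    counts = defaultdict(lambda: {"orders": 0, "amended": 0, "cancelled": 0})
    for order in orders:
        c = counts[order["algo"]]
        c["orders"] += 1
        c["amended"] += bool(order["amended"])
        c["cancelled"] += bool(order["cancelled"])
    return {
        algo: {
            "amend_rate": c["amended"] / c["orders"],
            "cancel_rate": c["cancelled"] / c["orders"],
        }
        for algo, c in counts.items()
    }


if __name__ == "__main__":
    sample = [
        {"algo": "A", "amended": True, "cancelled": False},
        {"algo": "A", "amended": False, "cancelled": False},
        {"algo": "B", "amended": False, "cancelled": True},
        {"algo": "B", "amended": False, "cancelled": True},
    ]
    for algo, rates in intervention_rates(sample).items():
        print(algo, rates)
```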

Conclusion

Algo wheels have contributed greatly to helping buyside firms evaluate strategies and improve execution quality. Not only do they formalize the strategy evaluation process, they also can eliminate trader-driven allocation biases and help standardize the order flow routed to competing algorithms. But simply employing an algo wheel is nowhere near sufficient when it comes to proper execution strategy analysis. An analyst cannot assume that the data generated by an algo wheel is pristine enough to simply compute each algorithm’s aggregate performance and make cross-algorithm comparisons. Rigorous post-trade analysis is still required to determine whether the algo wheel has successfully created comparable samples (and to make appropriate adjustments if needed), as well as to address the issues discussed above. Otherwise, the resulting performance metrics may be biased and/or too noisy to draw meaningful inferences.

Put more bluntly, without putting in the extra effort to address the issues noted above, the conclusions drawn from algo wheel data may be no better than having chosen the best algorithm by spinning a wheel of fortune.[6]

The author is the Founder and President of The Bacidore Group, LLC. The Bacidore Group provides research and consulting services to buyside and sellside clients, including assistance with broker algo customizations, "algo wheel" design and analysis, custom TCA frameworks, custom algo development, and bespoke quantitative analysis.

[1] Of course, over time, the market noise would average out. But this would require more data, and more time, before any performance results would become statistically meaningful.

[2] This analysis could be applied to all performance measures, including VWAP. While VWAP performance generally has a smaller standard deviation than implementation shortfall on a given sample, the expected differences in VWAP performance across algorithms are also typically much smaller. Consequently, VWAP analysis also tends to require a relatively large sample.

[3] To compute this number, we rewrote the t-statistic equation for a difference of means to express the common sample size of the two subsamples (n) as a function of the test statistic (t). The resulting formula is n = [sqrt(2) * t * stdev / x]^2, where n is the sample size of each subsample (i.e., the total number of orders is 2n), stdev is the standard deviation of each observation, and x is the difference in mean performance. We then plugged in a value of 1.96 for t, corresponding to a two-tailed test at the 5% significance level, along with the given standard deviation of 50 bps and a difference in means of 5 bps, to get the number of observations needed in each subsample (n).

[4] Note that we assumed equal-weighting here, though most analysts use dollar-weighting. Nevertheless, for a given weighting scheme and the appropriate assumptions, an analyst can generally do a similar analysis to get a rough estimate of the scale of the data needed.

[5] For example, if cancelations occur more frequently in one subsample relative to others, measures like implementation shortfall may not be fully comparable across samples, even after adjustments are made. Even if an opportunity cost were applied to unexecuted shares by “marking them to market”, the total cost would not include the market impact costs the algorithm avoided by not completing the order, thereby giving that algorithm an unfair advantage over algorithms with fewer cancelations.

[6] I know it took me a while to get to the “wheel of fortune” reference, and it is admittedly a bit strained. But the title was simply too good to pass up.
