Case study: How We Lost $7,500 on Mobile App A/B Tests But Learned How to Do Them
The importance of A/B testing has been covered in hundreds of articles, books, and webinars. Today, sustained growth is hardly possible without properly run experiments. Yet many marketers skip practical research, usually arguing that A/B tests are costly and time-consuming, especially for young projects. Let’s figure out whether it’s worth investing resources in split tests if your company is much smaller than Google or eBay.
How Classic A/B Tests Work, and What’s Wrong With Them
Conventional A/B testing is the go-to method for evaluating changes to a product. It’s quite straightforward: we divide users into groups and show each group one of the options. Then we define the metric by which we assess the results and, based on the final values, choose the best option.
There can be only one winning option. But during the experiment, users are exposed to deliberately weaker options for longer than necessary, and the project loses money as a result. Yes, you will eventually roll out the better option and optimize costs going forward, but the testing period itself will cost you something.
Can you have it both ways: find the best-performing alternative and curb losses at the same time? We tried to find out in this case.
Part 1. A/B Test
One of our customers had a hookup dating service named Sweet.
The client acquired paid traffic; conversion to purchase was about 4%. Revenue barely covered costs, so it was hard to ramp up the advertising budget and attract more users.
Media buyers sourced traffic from the US, promoting the app on Snapchat. Other sources were also tested, but Snapchat performed best.
To balance the unit economics and justify further investment in traffic acquisition, the developers decided to optimize conversion to the first purchase, so that acquired traffic would pay off within the first months. Everyone on the team realized that various approaches needed to be test-driven. By that time, we already had the first private beta of A/B testing on the Appbooster platform, and we decided to integrate it into the project.
We launched a bundle of small experiments inside the application, but they did not bring any significant improvements. After that, we decided to test the subscription screen and designed three new paywalls. At the start of the test, all alternatives received equal shares of traffic, so we chose conversion rate as the key metric.
The experiment lasted one week. Throughout, our product support specialist Evgeny tracked conversions in the analytics tools and calculated confidence intervals in a separate tool. A confidence interval reflects the possible fluctuation of an estimated value at a preset probability: in our case, it tells us the range the paywall conversion rate will fall into in 95% of cases. Confidence intervals are used in classic A/B tests to determine which option is statistically better. Based on these indicators, Evgeny changed the traffic distribution between alternatives every day, giving more traffic to the best-converting paywall.
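For reference, a 95% confidence interval for a conversion rate can be computed with the normal approximation to the binomial. The numbers below are illustrative, not figures from the actual experiment:

```python
import math

def conversion_ci(conversions, visitors, z=1.96):
    """95% confidence interval for a conversion rate
    (normal approximation to the binomial)."""
    p = conversions / visitors
    margin = z * math.sqrt(p * (1 - p) / visitors)
    return max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical day of traffic: 48 purchases out of 1,000 paywall views
low, high = conversion_ci(48, 1000)
print(f"observed 4.8%, true rate likely in {low:.2%} – {high:.2%}")
```

When the intervals of two paywalls overlap heavily, the data does not yet tell them apart, which is exactly why daily monitoring was needed before reallocating traffic.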
This method differs from a conventional A/B test. While the traditional procedure implies that nothing changes during the experiment, such ad hoc reallocation helps a mobile project pick the best alternative faster and cut test-related costs.
In the end, we kept the top-performing paywall, the one with the highest conversion rate, and rolled it out to 100% of users. During the test, we generated 745 purchases, with a 6.33% conversion rate on the best alternative. The first-month ARPU thus totalled $1.06. With a CPI of around $1 for Snapchat traffic, this allowed buying traffic at a profit in the first month. And thanks to a high rate of recurring payments, the application secured good prospects for CPI growth and reliable, stable unit economics.
As a result, the customer managed to balance the sheets and we started advancing our A/B testing service. A total win-win.
Evgeny’s method is, in essence, the cornerstone of the Multi-Armed Bandit. Remember the slot machines in a casino? Every time you pull the lever, you either win some money or you don’t. You have a limited number of attempts, and by the laws of probability some machines pay out more than others. You want to win more, right? If you knew in advance which machines yield bigger winnings, you would probably hit a jackpot.
The same is true for the mobile industry. We have several paywall alternatives to use in the app, but we don’t know which one will deliver more purchases given limited traffic. Here, the paywalls are the slot machine arms, and our mission is to pull the lever that pays the most: i.e., to find the most efficient paywall.
The Multi-Armed Bandit algorithm was invented to solve exactly this task: the same bandit, just with more levers. We tested four paywalls, so our bandit has four arms. When a user reaches the subscription purchase step, the algorithm chooses which paywall to show. If the algorithm sees that one of the options brings more conversions, it starts showing this potential top performer more frequently than the other alternatives. The bandit effectively hints at which lever to pull to squeeze out the maximum conversions under uncertainty and limited traffic.
In our case, Evgeny worked for the algorithm: he redistributed traffic manually. What are the drawbacks of such an approach?
● A product manager, analyst, or marketer has to spend time on test monitoring and traffic redistribution
● Redistribution is performed manually, so the result may not be optimal every time. Evgeny sleeps eight hours a day and cannot redistribute traffic during that period.
Possible remedies? Automation!
Part 2. Multi-Armed Bandit Comes Into Play
While developing the A/B testing service, we complemented it with the Thompson Sampling algorithm, which solves the multi-armed bandit problem. Such a bandit is a flawless version of Evgeny: it updates the distribution more frequently and takes more factors into account. We decided to verify its performance on real historical data from the Sweet app, so we exported the data and fed it to the algorithm to see how it would redistribute traffic between the alternatives.
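A minimal sketch of Beta-Bernoulli Thompson Sampling, the textbook variant for conversion-style metrics (the paywall rates below are made up for the demo, and our production implementation differs in detail):

```python
import random

def thompson_pick(stats):
    """Choose an arm via Beta-Bernoulli Thompson Sampling:
    draw a plausible conversion rate for each arm from its Beta
    posterior, then play the arm with the highest draw."""
    samples = [random.betavariate(1 + wins, 1 + views - wins)
               for wins, views in stats]
    return max(range(len(samples)), key=samples.__getitem__)

random.seed(7)  # reproducible demo

# Hypothetical true conversion rates of four paywalls (hidden from the algorithm)
true_rates = [0.030, 0.045, 0.063, 0.040]
stats = [[0, 0] for _ in true_rates]  # [purchases, impressions] per arm

for _ in range(20_000):  # each simulated user reaching the paywall
    arm = thompson_pick(stats)
    stats[arm][1] += 1
    if random.random() < true_rates[arm]:
        stats[arm][0] += 1

print([views for _, views in stats])  # most traffic flows to the 6.3% arm
```

Note how exploration is built in: an arm with little data has a wide posterior, so it still gets sampled occasionally, while a clearly weaker arm is quickly starved of traffic.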
Here’s what we got:
● Perfect: 1,070 purchases (100%). We could get these purchases if we picked the most efficient paywall at the beginning and applied it without any testing.
● Historic: 745 purchases (69%). We got these purchases thanks to Evgeny’s work.
● Thompson Sampling: 935 purchases (87%). This is the number of purchases the bandit could have delivered.
It turned out the algorithm would have delivered better results while converging on the same alternative Evgeny picked. CAC would also have been lower, since the customer would have got more purchases at the same cost. In terms of numbers, the project missed out on about $7,500 over 5 days.
The efficiency of such algorithms is evaluated with the regret metric, which measures the deviation from the perfect scenario. In this case, Thompson Sampling has the minimum regret, losing just 135 purchases to the perfect option, while manual distribution lost 325.
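The regret arithmetic behind those figures is simple enough to sketch:

```python
def regret(perfect, achieved):
    """Absolute and relative regret versus the perfect (oracle) scenario."""
    return perfect - achieved, 1 - achieved / perfect

# Numbers from the experiment above (perfect scenario: 1,070 purchases)
for name, purchases in [("Manual (Evgeny)", 745), ("Thompson Sampling", 935)]:
    lost, share = regret(1070, purchases)
    print(f"{name}: lost {lost} purchases ({share:.0%} regret)")
```

A bandit algorithm is considered good precisely because its cumulative regret grows slowly as more users pass through the test.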
And here’s the illustration of Thompson Sampling’s advantage over conventional A/B tests:
Answering the question of whether to run tests when your project is only getting started: yes, you should, since testing can reduce costs and streamline the whole process. This is why we implemented the Thompson Sampling algorithm on our platform and now develop projects powered by it.
More about Proba.ai
So far, the algorithm works with AppsFlyer (Enterprise or Business plan) and Amplitude; support for more trackers is coming soon.
Who will benefit from our service:
● Those who want to start A/B tests for their app.
● Those who carry out many experiments and want to automate them. The algorithm does everything autonomously, relieving the marketer/analyst/project owner from this routine.
● Those who want to trim test-related costs.
Even if you use a different plan or tracker, we will come up with an integration solution. We plan to engineer algorithms that handle other tasks, optimize other metrics (ARPU, ARPPU, LTV, etc.), and further decrease regret. If you are interested in mobile A/B testing, give Proba a try!