Decoding Your Recommendations Performance

Product recommendations, also known as “recs,” are a cornerstone of an effective ecommerce merchandising strategy. When fully optimized, recs typically increase retailer revenues by up to 5%. That’s a substantial contribution when extrapolated across annual sales in the hundreds of millions or even billions of dollars. However, one thing I’ve noticed while navigating the world of ecommerce personalization is that the word “performance” is often misused in the context of recs. The reality is that the word has multiple meanings, and for each meaning there is a distinct method for assessing its relationship with your shoppers and your revenue. So, let’s decode your recs performance.

At RichRelevance, we group performance into three analytical categories: value, engagement, and cohort analyses. While each type is useful, they tell us very different things. Misapplying a type, like treating an engagement analysis as a value analysis, can lead to disastrous business decisions. I’m here to protect you from making such mistakes.

First, let me summarize the three types of analysis:

  1. Value Analysis: this indicates the incremental revenue impact of recs, and is usually the report of most interest to retailers. The key metrics are typically revenue per session (RPS), conversion, and average order value (AOV); the lift contribution is determined by running an A/B or multivariate test that compares key performance indicators (KPIs) across populations based on their exposure to recs (a minimal calculation sketch follows this list).
  2. Engagement Analysis: this quantifies shopper utilization of recs, and can be a signal for relevance. The key metrics are recs sales and clickthrough rate (CTR). Recs sales represents the sale of items clicked on in recommendations.
  3. Cohort Analysis: this profiles the spend patterns of shoppers that choose to use recommendations vs. those that do not—comparing the RPS, conversion, and AOV of shoppers based on their engagement with recs.
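To make the value analysis concrete, here is a minimal sketch of how those KPIs are compared across an A/B split. The session data, column names, and numbers are all hypothetical; the point is simply which quantities a value analysis puts side by side.

```python
import pandas as pd

# Hypothetical session-level data: one row per session, with the A/B group the
# shopper was bucketed into ("recs" vs. "no_recs") and the session's revenue.
sessions = pd.DataFrame({
    "group":   ["recs", "recs", "recs", "no_recs", "no_recs", "no_recs"],
    "revenue": [120.0, 0.0, 80.0, 90.0, 0.0, 0.0],
})

def kpis(df):
    """Return the three value-analysis KPIs for one test population."""
    orders = df.loc[df["revenue"] > 0, "revenue"]
    return {
        "RPS": df["revenue"].mean(),          # revenue per session
        "conversion": len(orders) / len(df),  # share of sessions with an order
        "AOV": orders.mean(),                 # average order value
    }

recs_kpis = kpis(sessions[sessions["group"] == "recs"])
control_kpis = kpis(sessions[sessions["group"] == "no_recs"])

rps_lift = recs_kpis["RPS"] / control_kpis["RPS"] - 1
print(recs_kpis, control_kpis, sep="\n")
print(f"RPS lift attributable to recs: {rps_lift:.1%}")
```

A real value analysis would also need a significance test, plus the outlier handling discussed in the A/B testing post below, before the lift is trusted; the sketch only shows which numbers get compared.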

Most retailers stumble by using engagement or cohort analyses as stand-ins for a value analysis. I get it; these reports are readily available, whereas conducting a true value analysis requires an A/B test, which can take weeks and expose the retailer to undue opportunity costs. However, the reality is that these reports don’t convey the incremental benefit of recommendations, and here’s why:

Disassociating Engagement and Value Analyses
An engagement analysis tells us how much shoppers use recs and, to some extent, indicates relevance. If the recommendations are random, shoppers won’t click on them or buy through them. That said, striving to maximize recs sales or CTR is tantamount to saying that recommendations are the most important piece of content on your site, and we all know that they’re not. What’s most important is getting shoppers to convert with the highest level of spend. Recs merely support that objective.
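For contrast, here is what an engagement analysis actually measures, sketched with made-up counts for a single placement; note that neither metric says anything about incrementality.

```python
# Hypothetical counts for a single recommendation placement.
rec_impressions = 100_000       # sessions in which the placement was viewed
rec_clicks = 4_200              # clicks on recommended items
recs_sales = 18_500.00          # revenue from items bought after a rec click

ctr = rec_clicks / rec_impressions
print(f"Recs CTR: {ctr:.2%}")             # 4.20%
print(f"Recs sales: ${recs_sales:,.2f}")  # utilization, not incremental revenue
```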

In fact, certain kinds of recs engagement can have a neutral or even negative impact on your business. It’s the job of the recommendation technology, and of how it’s configured, to mitigate these situations. To shed more light on the matter, here are three effects you should be aware of that explain why value is not a function of recs engagement or sales:

  • Overlap: Sale of items that would have happened anyway, even in the absence of recommendations. Frankly, the vast majority of sales that happen through recommendations aren’t incremental, so indiscriminately increasing recs sales does not guarantee more cash in your coffers.
  • Cannibalization: Reduction in AOV due to the presence of recs. If recommendations cause a shopper to buy a cheaper SKU than they otherwise would have purchased, that takes money out of your pocket as a retailer.
  • Shopper Distraction: Reduction in conversion due to recs. If recommendations are optimized to attract as many clicks as possible, they can bait shoppers to click in perpetuity until they become fatigued and leave your site without converting. No one wants this.

And of course, we have copious amounts of data to demonstrate that maximizing recs engagement can have a detrimental impact on revenue. We’ve plotted recs CTR and recs sales against RPS lift for a multitude of RichRelevance A/B tests, and the resulting scatterplot shows no positive correlation between engagement and incremental revenue. In fact, the data suggests that at extreme levels of recs engagement, RPS lift can be severely compromised. For more details, view this interesting TED Talk on the “paradox of choice”, which implies that oversaturating a shopping experience with product options and baiting shoppers into excessive exploration can result in non-conversion.
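If you want to run the same sanity check against your own test history, a simple correlation between engagement and lift makes the point quantitatively. The sketch below uses invented per-test numbers purely for illustration, not RichRelevance data.

```python
import numpy as np

# Hypothetical per-test results: recs CTR and measured RPS lift for six A/B tests.
ctr      = np.array([0.02, 0.03, 0.05, 0.08, 0.12, 0.15])
rps_lift = np.array([0.04, 0.06, 0.05, 0.03, 0.01, -0.02])

# Pearson correlation between engagement and incremental revenue.
r = np.corrcoef(ctr, rps_lift)[0, 1]
print(f"Correlation between recs CTR and RPS lift: {r:+.2f}")
```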

Quite simply, it’s not about maximizing recs engagement; it’s about driving the right level of engagement that minimizes the aforementioned events and maximizes your per session revenue.

Disassociating Cohort and Value Analyses
A cohort analysis tells us the RPS, conversion and AOV of shoppers that choose to use recs versus those that do not. Most often the shoppers that choose to use recommendations have substantially higher key performance indicators (KPIs), which can lead one to presume an extreme level of benefit from recs. That said, please consider that although recommendations do generate substantial revenue lift when properly optimized, a cohort analysis does very little to quantify that impact.

In the business-to-consumer world, shoppers that utilize recommendations tend to have higher spend propensities, so it’s not uncommon for recs users to have 2x the RPS of non-recs users. Intuitively, this makes sense. When dealing with household consumers, willingness to engage with your catalog is a strong signal of purchase intent. In the business-to-business (B2B) world, the inverse is true. B2B buyers have predictable requisition lists that they rarely deviate from. Consider an office manager who’s restocking his company’s office supplies or an IT manager who’s purchasing specific laptop models for her business. These professionals know exactly what they want, order in bulk, and do not use recommendations.

So, these non-recs users have high RPS, conversion, and AOV. Conversely, the recs users represent the shoppers that are less committed: those that may or may not know what they want, and if they do convert, it’s with much less spend. For sites like Staples, Office Depot, or Dell, these are usually your household consumers, and a cohort analysis would report much lower KPIs all around. Therefore, due to this self-selection bias, whether a cohort analysis shows higher or lower KPIs for recs users is independent of the value the recs technology actually delivers. Rather, it’s indicative of the types of shoppers who decide to use recommendations.
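For completeness, here is a minimal sketch of a cohort report built from hypothetical session data; the split is trivially easy to produce, which is exactly why it is so tempting to misread it as a measure of value.

```python
import pandas as pd

# Hypothetical sessions: revenue plus a flag for whether the shopper clicked any rec.
sessions = pd.DataFrame({
    "clicked_rec": [True, True, False, False, False, True],
    "revenue":     [150.0, 0.0, 300.0, 0.0, 250.0, 90.0],
})

cohorts = sessions.groupby("clicked_rec")["revenue"].agg(
    RPS="mean",                              # revenue per session
    conversion=lambda r: (r > 0).mean(),     # share of sessions with an order
    AOV=lambda r: r[r > 0].mean(),           # average order value
)
print(cohorts)
# Whatever gap appears between the cohorts reflects who chooses to click recs,
# not the incremental value the recommendations deliver.
```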

Still, a cohort analysis can answer useful questions: Are recs impacting your best shoppers? Do recs help monetize the individuals least likely to convert, or do they have broad resonance across all levels of spend in your shopper base?

So, there you have it: your performance analyses decoded, to help you learn more about your recs and answer three questions. Who is using recs (Cohort Analysis)? How much are they using them (Engagement Analysis)? And what incremental value does that utilization provide (Value Analysis)?

Retail Gazette – LFW: project technology

“Gone are the days where stores are the forefront of the marketing stage and mobile devices are used predominantly for text messages and playing snake”, says Matthieu Chouard at RichRelevance, an omnichannel personalisation specialist. This couldn’t be more true than at this season’s ‘Fashion Month’, where technology dominated the catwalk.

Read more

Information Age – What the retail sector can learn from London Fashion Week's tech innovation

This season’s London Fashion Week showcased some flamboyant fashion, as well as equally dazzling examples of omnichannel, personalised digital marketing strategies.
Every season, the illustrious London Fashion Week gets more high tech as retailers seek to make the show an interactive, omnichannel brand experience.

Read more

Entrepreneur – Data Driven: What Amazon's Jeff Bezos Taught Me About Running a Company

My experience with Jeff Bezos changed me forever.

In 2003, Amazon hired me directly out of Stanford. I initially turned down six separate offers until management coaxed me into running the company’s Customer Behavior Research group focused on data-mining research and development. Upon arrival in Seattle, I was bounced around from one manager to another, including working directly with Bezos himself.

Read more

Journal du Net – The Tipping Point for Outliers in A/B Testing

Malcolm Gladwell recently popularized the term “outlier” by using it to describe high-performing people. In the context of data, however, outliers are data points that sit far away from the other data points, i.e., atypical values… Read more

The Tipping Point for Outliers in A/B Testing

Malcolm Gladwell recently popularized the term ‘outlier’ when referring to successful individuals. In data terms, however, outliers are data points that are far removed from other data points, or flukes. Though they will make up a small portion of your total data population, ignoring their presence can jeopardize the validity of your findings. So, what exactly are outliers, how do you define them, and why are they important?

A common A/B test we like to perform here at RichRelevance compares a client’s site without our recommendations against the same site with our recommendations, to determine their value. A handful of observations (or even a single observation) in this type of experiment can skew the outcome of the entire test. For example, if the recommendation side of an A/B test has historically been winning by $500/day on average, a single additional $500 order on the No Recommendation side will single-handedly nullify the apparent lift of the recommendations for that day.

This $500 purchase is considered an outlier. Outliers are defined as data points that strongly deviate from the rest of the observations in an experiment. The threshold for “strongly deviating” is open to interpretation, but is typically three standard deviations from the mean, which (for normally distributed data) corresponds to roughly the most extreme 0.3% of observations.

Variation is to be expected in any experiment, but outliers deviate so far from expectations, and happen so infrequently, that they are not considered indicative of the behavior of the population. For this reason, we built our A/B/MVT reports to automatically remove outliers (using the three-standard-deviations-from-the-mean method) before calculating results, which prevents the client panic or anger that skewed results can cause.
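As a rough sketch of that pre-processing step, here is what the three-standard-deviation rule looks like on hypothetical order data (the production reports are more involved, but the principle is the same).

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical order values for one side of a test: most orders cluster
# around $90, with a single extreme $1,000 order mixed in.
orders = np.append(rng.normal(loc=90, scale=20, size=500), 1000.0)

mean, std = orders.mean(), orders.std()

# Keep only observations within three standard deviations of the mean;
# for normally distributed data this trims roughly the extreme 0.3%.
kept = orders[np.abs(orders - mean) <= 3 * std]

print(f"Removed {len(orders) - len(kept)} outlier(s)")
print(f"Mean order value: ${orders.mean():.2f} raw vs. ${kept.mean():.2f} trimmed")
```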
At first glance, it may seem odd to proactively remove the most extreme 0.3% of observations in a test. Our product is designed to upsell, cross-sell, and generally increase basket size as much as possible. So, in an A/B test like the one above, if recommendations drive an order from $100 to $200, that’s great news for the recommendation side of the test; but if the recommendations are so effective that they drive an order from $100 to $1,000, that’s bad news, because the $1,000 order has become an outlier and now gets thrown out.

In order for a test to be statistically valid, all rules of the testing game should be determined before the test begins. Otherwise, we potentially expose ourselves to a whirlpool of subjectivity mid-test. Should a $500 order only count if it was directly driven by attributable recommendations? Should all $500+ orders count if there are an equal number on both sides? What if a side is still losing after including its $500+ orders? Can they be included then?

By defining outlier thresholds prior to the test (for RichRelevance tests, three standard deviations from the mean) and establishing a methodology that removes them, both the random noise and the subjectivity of A/B test interpretation are significantly reduced. This is key to minimizing headaches while managing A/B tests.

Of course, understanding outliers is useful outside of A/B tests as well. If a commute typically takes 45 minutes, a 60-minute commute (i.e. a 15-minute-late employee) can be chalked up to variance. However, a three-hour commute would certainly be an outlier. While we’re not suggesting that you use hypothesis testing as grounds to discipline late employees, differentiating between statistical noise and behavior not representative of the population can aid in understanding when things are business as usual or when conditions have changed.
