

Warehouse-Native Performance Optimization: Query-Time Sampling

Priyendra Deshwal
Apr 06, 2023

The NetSpring Performance Pack now allows Enterprise users to run explorations over sampled data. As one might expect, such explorations run orders of magnitude faster than their unsampled variants. NetSpring’s sampling feature is implemented to ensure that the results produced by sampled explorations are statistically reliable and can be used to make critical business decisions.

Background

Warehouse-native product analytics provides a compelling trio of benefits to customers:

  • Access to all business data: Customers no longer need to decide what subset of data to send to first-generation product analytics tools. All the data in the warehouse is available for analysis, all the time. 
  • Security & governance: No data ever leaves the customer’s warehouse.
  • Cost & scale: The pricing models of first-generation vendors scale poorly with high event volumes. Warehouse-native product analytics is significantly cheaper because of the elastic pay-for-what-you-use pricing models of modern data warehouses.

Beneath these strong, immediately apparent benefits, there lurks an underlying question that’s on everyone’s mind.

Will my warehouse have enough juice to support these queries?

And the answer is yes. NetSpring brings together a collection of techniques to make interactive warehouse-native product analytics possible. Query-time sampling is one such technique.

How Sampling Works

The statistical ideas behind sampling are well understood. At a high level, you look at a sample of the data (say 20%), calculate the result, and then upscale it by 5x to compensate for the 20% sample rate. Pollsters do this all the time when they run opinion polls over a sample of the population and predict election outcomes. Just as elections are about counting people who vote for their preferred candidates, product analytics is largely about counting users who belong to certain cohorts, reach certain product milestones, and so on.
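
Here is a minimal sketch of that upscaling step. It is purely illustrative and not NetSpring's implementation; the 20% sample rate and the counts are assumptions made up for the example.

    # Illustrative only: estimate a full-population count from a sampled count
    # by scaling it up by 1 / sample_rate (5x for a 20% sample).
    SAMPLE_RATE = 0.20  # hypothetical sample rate

    def estimate_total(sampled_count: int, sample_rate: float = SAMPLE_RATE) -> float:
        """Upscale a count computed over the sample to estimate the true count."""
        return sampled_count / sample_rate

    # e.g. 12,400 sampled users reached a milestone -> ~62,000 estimated overall
    print(estimate_total(12_400))  # 62000.0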

There are two key rules to remember while sampling:

  1. The sample must be unbiased. In the election analogy, every voter must have an equal chance of being sampled. Aside: This is where most of the inaccuracy in election predictions comes from, but that is a topic for another day.
  2. A user must either be fully included in or fully excluded from the sample. It should not be the case that a user’s event history is only partially included in the sample.

NetSpring’s sampling implementation respects both of these requirements. There are several underlying technical details, concerning data layout, selection of sample size, and so on, that are out of the scope of this discussion. However, all those details come together into a beautiful end-user experience that is fast and intuitive.
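
To make the two rules concrete, one common way to satisfy them is deterministic sampling keyed on the user ID, sketched below. This is illustrative only and not a description of NetSpring's internal data layout; the sample rate and field names are assumptions.

    import hashlib

    SAMPLE_RATE = 0.20  # hypothetical sample rate

    def user_in_sample(user_id: str, sample_rate: float = SAMPLE_RATE) -> bool:
        """Decide deterministically whether a user belongs to the sample.

        Hashing the user ID (rather than individual events) means a user's
        entire event history is either fully included or fully excluded,
        and a well-mixed hash gives every user the same chance of being
        selected, which keeps the sample unbiased.
        """
        digest = hashlib.sha256(user_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
        return bucket < sample_rate

    # Events are filtered by user, never event by event.
    events = [
        {"user_id": "u1", "event": "signup"},
        {"user_id": "u1", "event": "purchase"},
        {"user_id": "u2", "event": "signup"},
    ]
    sampled_events = [e for e in events if user_in_sample(e["user_id"])]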

The user only has to select from three options:

  1. Enabled – Faster Response: This setting favors fast query execution over result accuracy. This is great while iterating over a query.
  2. Enabled – Higher Precision: Queries with this setting are still significantly faster than unsampled queries, but they favor result accuracy over speed of query execution.
  3. Disabled: This disables sampling altogether and queries run in normal unsampled fashion.

For most explorations, this means there is a high degree of confidence (say 99%) that the true results are very close (say within 0.5%) to the sampled results. This kind of strong statistical guarantee allows customers to make critical business decisions based on sampled explorations.
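
To see where a guarantee of this shape comes from, consider the standard normal approximation for a proportion estimated from a sample. The sketch below is illustrative and does not describe NetSpring's exact error bounds; the conversion rate and sample size are assumptions.

    import math

    def margin_of_error(p_hat: float, n: int, z: float = 2.576) -> float:
        """Normal-approximation margin of error for a proportion estimated
        from n sampled users; z = 2.576 corresponds to 99% confidence."""
        return z * math.sqrt(p_hat * (1 - p_hat) / n)

    # e.g. a conversion rate of 8% measured over 1,000,000 sampled users
    # carries a margin of roughly +/-0.0007 (0.07 percentage points).
    print(margin_of_error(0.08, 1_000_000))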

An important detail here is that NetSpring does not sample events at ingestion time. This is markedly different from other vendors, where customers are forced to sample events during ingestion to avoid high event-ingestion costs. An event that is dropped by ingestion-time sampling is lost forever.

With NetSpring, your sampling strategy does not dictate your data strategy.

In our warehouse-native model, we recommend that customers store every last event in the warehouse and dynamically adjust the sampling ratio to suit the needs of their use case.

NetSpring is architected on top of our proprietary analytics scripting language called NetScript. NetSpring users interact with point-and-click explorations. Those explorations produce NetScript, which is compiled into the SQL that is sent to the warehouse. NetScript has been a key enabler of our sampling approach. The current version of sampling works for the majority of exploration templates, including funnel, path analysis, retention, and event segmentation. The next version of this feature will be able to apply sampling over a much broader class of queries and bring even greater cost savings to our customers.

Witness the power of warehouse-native product analytics for yourself. Request a demo today.
