Operational intelligence (OI) lets businesses use streaming event data that happened in the past seconds, minutes, or hours to drive insights and make decisions. OI has become critical as businesses want to understand what’s happening now and why, including analyzing complex event patterns and flows. This new type of analytics lets you quickly react to problems and opportunities in the moment — but from a technical perspective, it’s very hard to get right.
The Googles and Facebooks of the world have built sprawling home-grown systems to power their businesses with OI. These systems, manned by software teams with hundreds of engineers, are out of reach for most enterprises.
We built NetSpring so any business can stand up a world-class operational intelligence stack. All they need is a Cloud store like AWS S3, a stream processing platform like Apache Kafka, and the NetSpring Operational Intelligence Platform. I’m Priyendra Deshwal, CTO and co-founder of NetSpring. Building this platform has been the adventure of a lifetime, and today I want to give you a behind-the-scenes look at the secret sauce that makes NetSpring work.
The technical challenges with operational intelligence
Operational intelligence is very different from traditional analytics and demands a unique set of capabilities. Let’s take a look at these requirements, and how they create challenges that we have been hard at work solving at NetSpring.
- Data ingestion: Operational intelligence is about knowing what’s happening now vs. in the past. Therefore, the platform must support the constant ingestion of event data at high throughput and low latency. NetSpring is designed to ingest millions of events per second with sub-second ingestion latency.
- Sub-second alerts: The consumption model of an operational intelligence platform is very different from traditional analytics. Traditional analytics relies on users pulling data from the system, while operational use cases (like monitoring) require a push model. An alert must fire within seconds of an adverse event happening. Live dashboards must always display the latest state of the system. NetSpring natively supports continuous, incremental, streaming SQL queries to address these use cases.
- Interactive SQL queries: While the primary interaction with an operational intelligence system is push-based, users will still need to interact with the system using the pull model. For example, when an alert fires, it’s essential that users can follow up with ad hoc diagnostic explorations of the data. To provide true visibility into the business, the platform must make it easy to join static reference data with streaming event data. NetSpring has a state-of-the-art query execution engine that allows interactive SQL queries over large volumes of data, cutting across data-in-motion and data-at-rest.
- Time series optimizations: The vast majority of operational data consists of timestamped event streams, which requires special handling of time semantics — to consider data completeness, the late arrival of data, windowed aggregations and joins, and more. NetSpring is built to handle this complexity and support efficient storage and retrieval of time series datasets.
- Advanced event flow analytics: True operational intelligence empowers users to develop a deep understanding of event sequences and paths of event states. Analytics on event streams (funnels, Sankey charts, etc.) often involves studying sequences of events and paths taken by event states. Yet these analyses are hard to express using SQL or to compute using traditional systems. NetSpring has native support for advanced event flow analytics.
- Petabyte scale datasets: As businesses have embraced telemetry across a wide range of industries, petabyte scale datasets have become commonplace and any operational intelligence platform must scale to those sizes. NetSpring is designed to scale to the needs of the biggest enterprises in the world.
There are traditional BI platforms, monitoring tools, product analytics systems, and a host of other specialized options for addressing one, or occasionally several, of these requirements. However, NetSpring is the first platform with all of the capabilities needed for OI in one place.
NetSpring’s key technical innovations
The NetSpring Operational Intelligence Platform is offered as a vertically integrated cloud service to address these requirements. It’s built on top of three key innovations:
- Converged analytical processing (CAP): Operational intelligence platforms must support a wide variety of analytical workloads. Our CAP data platform brings about a true convergence of streaming and batch query processing, with integrated, optimized storage. It’s designed for streaming ingestion, efficient incremental computations, event flow computations, time travel, and continuous SQL queries — at extreme scale.
- Relational event streams: With relational event streams, NetSpring enables complex event flow analytics on top of the familiar and powerful relational model. You don’t have to use specialized product analytic tools for certain questions and relational databases for others.
- NetScript: We wanted to make it easier to ask complex analytic questions that join data-at-rest and data-in-motion. So we developed a new analytics language called NetScript that compiles into SQL but provides a higher level of abstraction, composability, and reusability. For business users, no code is needed — we offer a library of generic and use-case specific templates for building operational intelligence applications.
You can think of CAP as the compute layer, relational event streams as the business entity modeling layer, and NetScript as how you easily express complex analytical questions. We’ll next talk about each of these innovations in more detail and explain how they help us solve key problems with implementing operational intelligence today.
Converged analytical processing
Converged analytical processing, or CAP, is a converged batch and streaming engine that runs on top of a highly optimized, integrated storage layer. It allows you to query all of your data in one place with full join support across data-in-motion and data-at-rest. Powerful and flexible, CAP provides a low-latency, high-throughput backbone for doing everything from real-time monitoring to advanced event flow analytics.
CAP was designed to fill the need for handling batch and streaming compute and storage in a single product. Data warehouses have traditionally done batch compute and storage, but not streaming (continuous and incremental) compute. Platforms like Apache Spark and Apache Flink have done batch and streaming compute (but not storage). NetSpring does all three.
The problem it solves
If you look inside a data-savvy enterprise today, you’ll find a tangle of data warehouses, data lakes, message buses, stream processors, time series databases, real-time OLAP engines, low-latency in-memory databases, event flow analytics, and more.
There are two problems with this architecture:
- High cost of ownership: For every system you need a highly skilled team to manage it, and the complexity grows exponentially as more systems are added and the relationships between them need to be handled as well. You may be paying for a vendor, or paying with the engineering hours of implementing an open source solution, but either way, each additional system is an additional cost.
- Siloed data: In a complex data stack with many analytical systems, all datasets are obviously not visible to all systems, and the ability to join datasets tends to be weak. This creates data silos and hinders the discovery of cross-dataset insights. For example, if the data about clicks on your site lives in an event analytics tool, and data about your customer relationships lives in a data warehouse, then you can’t analyze product usage by total customer spend. In other words, this architecture doesn’t actually let a product manager answer a question as basic as “How are my high-value customers using my product?”
How it works
The core pieces of the CAP engine and storage layer are:
- Query optimization: Queries are parsed and optimized into a physical query plan. The physical plan consists of a graph of operators that are scheduled to execute on nodes across a NetSpring cluster. These operators are compiled into highly optimized machine code fragments, allowing queries to scan hundreds of millions of rows per second per core.
- Unified batch and streaming: The query execution engine has a streaming-first architecture where Apache Arrow encoded data packets flow through an operator graph. These operators release data to downstream operators upon receiving special control packets containing event-time watermark updates. The execution layer can support both batch and streaming computations simply by controlling the frequency of watermark updates.
- Custom storage layer: We use a three-tier columnar storage architecture where newly arrived data is written directly to memory, to allow ingestion at a throughput of millions of events per second with sub-second ingestion latency. This is a very good use of scarce memory resources, because the majority of operational workloads favor fresh data and greatly benefit from that data already being in memory. Older, less recently used data is delegated to local NVMe and cloud blob/object storage.
- Declarative cost-performance tradeoffs: We’ve built “knobs” into the system that allow users to control the relative cache priorities of different datasets, build pre-aggregate indices over datasets, maintain additional copies of datasets using different partitioning and clustering keys, and more. Our vision is that users should be able to achieve any desirable point on the cost-performance tradeoff curve.
Relational event streams
By modeling business data as what we call relational event streams, NetSpring allows complex event flow analytics over data stored in its original relational form. You no longer have to choose between investigating funnels, event segmentation, or cohorts vs. running “normal” slice-and-dice SQL queries on your event streams.
The problem it solves
It’s very difficult for traditional analytics systems to represent the event data that is so core to operational intelligence. The existing approaches fall into two categories:
- Represent event data in a flattened form. Specialized systems for doing event flow analytics, like product analytics tools, require you to coerce your business data into rigid, pre-defined models. As a result, you lose the richness of the data. These tools also lack full SQL support, so they only address a very narrow slice of analytical use cases.
- Represent event data in its original relational form. In this form, most systems struggle to do event flow analytics. Computations on event streams are not easy to express in SQL.
With relational event streams, NetSpring bridges these two approaches to achieve the best of both worlds. Event data is represented in its original relational form, but the underlying CAP architecture is able to support advanced event flow analytics directly on this relational form.
How it works
Relational event streams are powered by several key features:
- Storage optimizations: Event data is columnar-compressed and approximately-sorted by time. This enables efficient support for advanced event flow analytics because events are required to be processed in time-stamp sorted order.
- SQL extensions for event flow analytics: MATCH_RECOGNIZE was added to the SQL standard to allow complex event flow analytics within SQL. NetSpring’s MATCH_RECOGNIZE implementation uses JIT code generation and leverages its knowledge of the underlying storage layout to process hundreds of millions of events per second per core.
- Enhanced expressibility: Computations for event flow analytics are very cumbersome to express in SQL. NetSpring’s application layer (UI templates + NetScript) make it possible to easily express these analyses. All the complexity of framing the right SQL query, technicalities associated with the correct use of MATCH_RECOGNIZE, etc. are handled by the application layer.
NetScript is a programming language that allows users to express complex analytical computations in an easy and natural fashion. NetScript programs don’t manipulate data — rather, they manipulate SQL queries, providing affordances to apply common analytical operations over these queries: applying filters, aggregating by some dimension, joining intermediate results on some shared dimension, etc.
Every interaction in the NetSpring UI is powered by NetScript under the hood. While we’ve found it to be intuitive and powerful, we’d like to note that NetScript is completely optional for users — you can query the platform with SQL, or draw from our large library of no-code templates.
The problem it solves
We invented NetScript because SQL has three major drawbacks for representing analytical computations:
- SQL is verbose: If you have any amount of real-world exposure to analytical SQL, you’re acutely aware that the SQL queries for these computations can become quite verbose, often running multiple pages in length.
- SQL is repetitive: In a typical enterprise setting, a large number of queries use a very small number of join relationships. These join conditions end up getting repeated in every query.
- SQL is not composable: What is the most natural way to define the total revenue metric in SQL? A reasonable answer is: select sum(revenue) from Txns. This relatively trivial example highlights an important deficiency of SQL. Ideally, we would like to define this metric in some central place so that all our reports can use it without duplicating its logic. However, it’s not clear how to reuse this metric. Very rarely would we want a report with total revenue across all products or all geographies. Actual reports would filter this calculation by various criteria and slice the final number by various dimensional attributes (possibly inducing joins with other dimensional tables). There is no straightforward way to derive this final SQL query from the original SQL query that served as the definition of the total revenue metric. SQL views do provide some primitive composability — but they fall far short of enabling the kind of use cases that NetScript is designed for.
How it works
A NetScript program consists of query expressions. The NetScript compiler takes the query expression as input and uses the underlying schema (tables, columns, and relationships) to convert the expression into a SQL query that’s sent to the compute layer.
A NetScript operation, such as a filter, does not merely filter the data coming out of an operator. Rather, it understands the semantics of the input and does deep surgery to the input SQL query to support the semantics of the filter being applied. This surgery may push the filter through aggregates, introduce joins to bring in columns referenced by the filter condition, and so on.
We are not making any claim that NetScript can replace SQL — after all, it compiles into SQL! But we believe the NetScript model is closer to how users (and analytic applications) represent computations, allowing you to write programs at a higher level of abstraction.
The light at the end of the analytics tunnel
A solution for operational intelligence has been many decades in the making. Now, the timing is finally right. From a technology standpoint, two factors are enabling the rise of operational intelligence today:
- Stream processing: Stream processing platforms like Kafka are now standard in most enterprises. There is also an increasing understanding of the nuances of stream processing such as exactly-once semantics, out of order and late arrival of data, watermarks, transactional guarantees, etc. As a result, more and more data is being consumed in a streaming fashion, opening up the potential for streaming analytics on that data.
- Cloud data lakes: The ability to instrument and collect event data in real time must be accompanied by efficient ways to store and manage that data. This is where modern data lakes come in. Cloud blob/object stores such as AWS S3 are providing efficient and cost-effective ways of storing, securing, and managing an ever-increasing, massive volume of event data. As more and more enterprises move to the cloud and build these modern data lakes, all raw data at the lowest granularity is becoming available for analytics.
The third missing piece is a platform that makes it possible, and actually easy, to ask complex analytical questions across a convergence of your streaming and batch data.
That’s where NetSpring comes in to complete what we call the modern data intelligence stack. With simplified analytics, businesses will have actionable insights that cost less and drive more value. As a result, everyone benefits — from product managers, to business analysts, to data engineers, customer support, and ultimately customers themselves. I hope you’ll give NetSpring a try and see what it can do for your business. Contact us at [email protected] or contact us to try for free.