

Trends in Data and Analytics

Vijay Ganesan
Nov 15, 2021

The last decade has seen significant changes in data and analytics. Big Data initiatives led enterprises to collect far more data than ever before. Hadoop failed to deliver on its promises and is on its way out. The movement from on-premises to the Cloud has taken solid ground. The next generation of Cloud data lakes has emerged. Transactional data, the source of data for analytics, is scattered across an ever-increasing number of SaaS applications. There are massive new streams of data from sources such as product instrumentation, phones, wearables, and IoT sensors. Stream processing platforms such as Kafka have become mainstream. Cloud data warehouses have displaced legacy on-premises data warehouses. And machine learning of some kind is now commonly used in every enterprise.

What are the key trends for the next five years? Here is what we believe.


Analytics Moves to the Cloud

The center of gravity of transactional data has clearly moved to the Cloud, and the same is rapidly happening to analytics data. While a lot of data still lives on-premises, within the next few years the majority of data for analytics will likely be in the Cloud.



Streaming Data Integration

A big change in data integration is the shift from batch-oriented jobs to streaming jobs. Streaming data integration offers a better programming model, with loose coupling between producers and the various stages of consumers. It reduces data staleness and the lost opportunity of acting on recent data. The vast majority of data will be ingested in streaming fashion, accelerating the rate at which data is made available for analytics. This will present new opportunities for next-generation analytic tools that can extract insights from this data faster.
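To illustrate the loose coupling this programming model provides, here is a toy sketch (not a real Kafka client; the queue stands in for a topic, and all names are illustrative): the producer publishes events as they occur, and the consumer processes each one as soon as it arrives rather than waiting for a completed batch.

```python
from queue import Queue
from threading import Thread

# A shared queue stands in for a streaming topic, decoupling the
# producer from the downstream consumer stage.
topic = Queue()

def producer(events):
    for e in events:
        topic.put(e)          # publish each event as it occurs
    topic.put(None)           # sentinel: end of stream

def consumer(results):
    while True:
        event = topic.get()   # consume as soon as data arrives
        if event is None:
            break
        results.append(event.upper())  # a simple per-event transformation

results = []
t = Thread(target=consumer, args=(results,))
t.start()
producer(["signup", "login", "purchase"])
t.join()
print(results)  # ['SIGNUP', 'LOGIN', 'PURCHASE']
```

Because neither side calls the other directly, new consumer stages can be added without touching the producer, which is the decoupling property streaming integration relies on.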


Cloud Data Lakes

A core tenet of data analytics is that all data must be brought into a central repository for scalable, meaningful analysis. Cloud object/blob stores such as Amazon S3 will be the dominant such central repositories, and the foundation of next-generation data lakes, fully replacing Hadoop-based data lakes. These stores hold massive volumes of diverse data at very low cost, behind simple APIs, and offer enterprise-class reliability, availability, durability, archiving, and security. We will see a rich ecosystem of data processing systems evolve around these next-generation data lakes, with open metadata and data format standards. These new systems, based on the emerging Lakehouse architecture, will replace traditional Data Warehouse architectures.


From ETL to ELT

The traditional Extract, Transform, Load (ETL) model will increasingly be replaced by the Extract, Load, Transform (ELT) model. It is now easier and more efficient to extract and load raw data into an object/blob store with little to no transformation. Once all the data is in the object/blob store, it can be discovered easily using simple query tools, and discovery makes transformation easier. Transformations can then be done efficiently in the Cloud, leveraging the elasticity of compute resources that scale independently of storage. A new breed of tools will emerge around ELT on Cloud object/blob stores.
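The ELT pattern can be sketched in miniature with an in-memory SQLite database standing in for the Cloud store (table and field names are made up for the example): raw JSON payloads are loaded untouched, and the transformation happens afterward, inside the engine, with SQL.

```python
import sqlite3

# Hedged sketch of ELT: load raw, schemaless payloads first ('E' and 'L'),
# then transform inside the store ('T').
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (payload TEXT)")  # raw landing table

raw = [
    '{"user": "a", "amount": 10}',
    '{"user": "b", "amount": 25}',
    '{"user": "a", "amount": 5}',
]
conn.executemany("INSERT INTO raw_events VALUES (?)", [(r,) for r in raw])

# The 'T' step runs after load, using the engine's own compute:
conn.execute("""
    CREATE TABLE user_totals AS
    SELECT json_extract(payload, '$.user')         AS user,
           SUM(json_extract(payload, '$.amount'))  AS total
    FROM raw_events
    GROUP BY user
""")
totals = dict(conn.execute("SELECT user, total FROM user_totals ORDER BY user"))
print(totals)  # {'a': 15, 'b': 25}
```

Note how the raw table keeps every payload as-is, so new transformations can be derived later without re-extracting from the source, which is the key operational advantage of ELT over ETL.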

Operational Intelligence

As a result of the above trends, a massive volume of rich, fine-grained transactional data, originating in many disparate systems, will become easily available for analytics. This event data will be available as it is created, or immediately after. A new breed of Operational Intelligence data and application platforms will emerge to help enterprises continuously derive insights from recent, actionable event data combined with curated, historical data. The result will be the infusion of analytics into every day-to-day operational decision, providing unprecedented data-driven operational business agility. These next-generation analytics platforms, built on top of Cloud data lakes, will bring about a true convergence of streaming and batch processing at scale, with integrated storage. They will serve a variety of analytic workloads (real-time, batch, reporting, OLAP, alerting, ML, etc.) in a single converged system, eliminating multiple siloed systems and massively reducing the TCO of analytics infrastructure.
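The core idea of a converged query, answering one question from curated history plus just-arrived events together, can be sketched in a few lines (all data and names below are invented for illustration):

```python
# Illustrative sketch of a converged operational query: batch-curated
# history and freshly streamed events are answered as one view.
historical_daily_orders = {"2021-11-13": 120, "2021-11-14": 135}  # curated batch data
recent_events = [                                                 # streamed, seconds old
    {"ts": "2021-11-15T09:00:01", "type": "order"},
    {"ts": "2021-11-15T09:00:05", "type": "order"},
]

def orders_to_date():
    # One query path over both tiers: no separate "real-time" silo
    live = sum(1 for e in recent_events if e["type"] == "order")
    return sum(historical_daily_orders.values()) + live

print(orders_to_date())  # 257
```

In a real converged platform the two tiers would share storage and a query engine rather than two Python objects, but the contract is the same: the caller asks one question and never needs to know which system holds which slice of the data.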

Getting started is easy.

Connect to Snowflake, BigQuery, Redshift or Databricks. Be up and running in an hour.

Try for Free