In simple terms, data streaming is the continuous transfer of large amounts of data from one or more sources to one or more targets at a steady, high rate.
Streaming data refers to data that has no discrete beginning or end. It can be generated by thousands of sources that run continuously. Server log files are an example: once a server has started, it keeps running and keeps producing log entries.
Data streaming is the process of sending data records continuously rather than in batches. It is useful for data sources of all kinds, whether they send data in small pieces (a few kilobytes) or in large ones. In general, data streaming transfers data as a continuous flow, as the data is generated.
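To make the difference between batch and streaming concrete, here is a minimal Python sketch. The names `record_source`, `process_batch`, and `process_stream` are hypothetical and only illustrate the idea; a real pipeline would read from an actual source such as a log file or a message queue.

```python
from typing import Iterable, Iterator, List


def record_source(n: int) -> Iterator[dict]:
    """Hypothetical source: yields one small record at a time, as it is 'generated'."""
    for i in range(n):
        yield {"id": i, "value": i * 10}


def process_batch(records: List[dict]) -> List[int]:
    """Batch style: wait for the whole collection, then process it in one go."""
    return [r["value"] for r in records]


def process_stream(records: Iterable[dict]) -> Iterator[int]:
    """Streaming style: handle each record as soon as it arrives."""
    for r in records:
        yield r["value"]


# Batch: nothing is processed until all records have been collected into a list.
batch_result = process_batch(list(record_source(3)))

# Stream: each record is processed as it is produced by the source.
stream_result = []
for value in process_stream(record_source(3)):
    stream_result.append(value)
```

Both approaches produce the same values here. The difference is when the work happens: the streaming version never needs the full data set in memory and can start producing results after the first record.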
Some example data sources are listed below. The list is not exhaustive; real-time data can come from any number of sources.
- Log files generated by customers using your mobile or web applications
- Player activity in a real-time game
- User tracking data from e-commerce websites
- Telemetry/Sensor data from connected sources
- Information from social networks, financial trading floors, and/or geospatial services
Need for Data Streaming
Traditionally, data is moved in batches: large volumes of data are processed at a scheduled time, for example every 2 hours. Because so much data is processed at once, batch processing has long latency. While this can be an efficient way to handle large volumes of data, it does not work for data that must be processed with low latency, because that data can be stale by the time it is processed. For example, a sensor monitoring a hardware device needs to send anomaly data to a processing unit immediately, so that the issue can be reported and corrective action taken.
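The sensor example can be sketched in a few lines of Python. This is only an illustration: the threshold `MAX_TEMPERATURE_C`, the function `detect_anomalies`, and the sample readings are all assumptions made up for this sketch, not part of any particular product.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical threshold; a real device would define its own limits.
MAX_TEMPERATURE_C = 80.0


def detect_anomalies(
    readings: Iterable[Tuple[int, float]],
) -> Iterator[Tuple[int, float]]:
    """Yield each (timestamp, temperature) reading that exceeds the threshold.

    Each reading is checked as soon as it is seen, instead of waiting
    for a scheduled batch run.
    """
    for timestamp, temperature in readings:
        if temperature > MAX_TEMPERATURE_C:
            yield timestamp, temperature


# Made-up sample data standing in for a live sensor feed.
sample_readings = [(1, 71.5), (2, 79.9), (3, 85.2), (4, 78.0), (5, 91.0)]
alerts = list(detect_anomalies(sample_readings))
```

With a batch job that runs every 2 hours, the reading at timestamp 3 might be noticed hours later. In the streaming version, it is flagged the moment it arrives.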
Streaming allows you to analyze data in real time and gives you insight into a wide range of activities.
Many use cases require real-time data processing, such as:
- Sentiment analysis
- Fraud detection
- Log monitoring
- Instrument monitoring
Data streaming is a powerful tool. However, a few challenges are common when working with different sources:
- Fault Tolerance
- Guaranteed data processing
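One common way to approach both challenges is to retry a failed record a few times and, if it still fails, set it aside rather than drop it silently. The sketch below shows that idea in plain Python. The function `process_with_retries` and the `flaky_handler` used to simulate faults are hypothetical, and real streaming tools implement this far more robustly.

```python
from typing import Callable, Iterable, List, Tuple


def process_with_retries(
    records: Iterable[str],
    handler: Callable[[str], None],
    max_attempts: int = 3,
) -> Tuple[List[str], List[str]]:
    """Try each record up to max_attempts times.

    Returns (processed, dead_letters): records that eventually succeeded,
    and records that kept failing and were set aside for later inspection.
    """
    processed: List[str] = []
    dead_letters: List[str] = []
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(record)
                processed.append(record)
                break
            except Exception:
                # Give up on this record only after the last attempt.
                if attempt == max_attempts:
                    dead_letters.append(record)
    return processed, dead_letters


# Hypothetical handler: fails once for "b" (a transient fault)
# and every time for "bad" (an unprocessable record).
failures_seen = set()


def flaky_handler(record: str) -> None:
    if record == "bad":
        raise ValueError("unprocessable record")
    if record == "b" and record not in failures_seen:
        failures_seen.add(record)
        raise RuntimeError("transient failure")


processed, dead_letters = process_with_retries(["a", "b", "bad", "c"], flaky_handler)
```

Note that with retries, the handler may run more than once for the same record, so it should be safe to repeat. This style is usually called at-least-once processing.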
Data Streaming Benefits
Streaming data from source to target has become essential data infrastructure for many organizations. This infrastructure enables us:
- to deal with never-ending streams of events
- to do real-time or near-real-time processing
- to detect patterns in time series data
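As a small illustration of working on a never-ending series, the sketch below computes a moving average over the most recent values. Only a fixed-size window is kept in memory, so it works no matter how long the stream runs. The function name `moving_average` and the sample values are assumptions made for this example.

```python
from collections import deque
from typing import Iterable, Iterator


def moving_average(values: Iterable[float], window_size: int) -> Iterator[float]:
    """Yield the average of the last window_size values once the window is full."""
    # A deque with maxlen drops the oldest value automatically when a new one arrives.
    window = deque(maxlen=window_size)
    for value in values:
        window.append(value)
        if len(window) == window_size:
            yield sum(window) / window_size


# Made-up sample series; in practice this would be a live feed.
averages = list(moving_average([10, 20, 30, 40, 50], window_size=3))
```

A sudden jump in the moving average is one simple pattern that can be watched for in time series data.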
Data Streaming Tools
The following list shows a few popular tools:
- Apache Spark – a unified analytics engine for large-scale data processing.
- Amazon Kinesis Firehose – a managed, scalable, cloud-based service for capturing large data streams and delivering them to data stores and analytics tools in near real time.
- Apache Kafka – a distributed publish-subscribe messaging system that integrates applications and data streams.
- Apache Flink – a streaming data flow engine that provides facilities for distributed computation over data streams.
- Apache Storm – a distributed real-time computation system. Storm is used for distributed machine learning, real-time analytics, and numerous other cases, especially those with high data velocity.
- Hortonworks Streaming Analytics Manager – an open source tool used to design, develop, deploy, and manage streaming analytics applications using a drag-and-drop visual paradigm. Users can build streaming analytics applications that do event correlation, context enrichment, complex pattern matching, and analytical aggregations, and that create alerts/notifications when insights are discovered, all without writing code.
- Apache NiFi – an easy-to-use, powerful, and reliable distributed system to transform and distribute data.
I hope this article was informative and that you enjoyed reading it. Please let us know your thoughts in the comments. Thank you.