AWS Kinesis data streams in real-time

INTRODUCTION


This article describes how to collect and process large data streams in real time using AWS. It introduces Amazon Kinesis Data Streams, a service that handles this workload with fast, continuous processing, high performance, and durability.


WHAT IS KINESIS DATA STREAMS?

Definition

Kinesis Data Streams is a fully managed, serverless data streaming service that ingests and stores streaming data in real time at any scale.

Use cases



INFRASTRUCTURE DIAGRAM

The following are some common architectures recommended by AWS.

Producers → Kinesis Data Streams → Consumers

Users → ALB → Producer (Fargate) → Kinesis Data Streams → Consumers (Fargate)


LIMITATIONS


- Put records: a single record can be up to 1 MB (PutRecord API). A PutRecords request can carry multiple records totaling up to 5 MB, with each record still limited to 1 MB.

- Pull records: a single GetRecords call can retrieve up to 10 MB of data from one shard, and up to 10,000 records.
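
The put limits above can be respected on the producer side by grouping records into batches before calling PutRecords. The following is a minimal sketch, not a definitive implementation. The function names `record_size` and `batch_records` are hypothetical. The 500-records-per-request cap comes from the AWS API reference rather than from this article, and counting the partition key against the per-record limit is a deliberately conservative assumption.

```python
from typing import Iterable, Iterator, List, Tuple

MAX_RECORD_BYTES = 1 * 1024 * 1024   # 1 MB per record
MAX_BATCH_BYTES = 5 * 1024 * 1024    # 5 MB per PutRecords request
MAX_BATCH_RECORDS = 500              # assumption: per-request cap from the AWS API reference

Record = Tuple[str, bytes]           # (partition_key, data)


def record_size(partition_key: str, data: bytes) -> int:
    """Size counted against the limits: partition key bytes plus payload bytes."""
    return len(partition_key.encode("utf-8")) + len(data)


def batch_records(records: Iterable[Record]) -> Iterator[List[Record]]:
    """Yield lists of records that each fit in one PutRecords request."""
    batch: List[Record] = []
    batch_bytes = 0
    for key, data in records:
        size = record_size(key, data)
        if size > MAX_RECORD_BYTES:
            raise ValueError(f"record of {size} bytes exceeds the 1 MB limit")
        # Start a new batch when adding this record would break either limit.
        too_many = len(batch) >= MAX_BATCH_RECORDS
        too_big = batch_bytes + size > MAX_BATCH_BYTES
        if batch and (too_many or too_big):
            yield batch
            batch, batch_bytes = [], 0
        batch.append((key, data))
        batch_bytes += size
    if batch:
        yield batch
```

Each yielded batch could then be sent with the boto3 Kinesis client's `put_records` call, passing a `StreamName` and a `Records` list of `{"Data": ..., "PartitionKey": ...}` entries. The response should be checked for partially failed records, since PutRecords does not guarantee that every record in a request succeeds.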



PERFORMANCE


Kinesis Data Streams relies on shards, which are the units of throughput and of parallelism. Each shard provides an ingest throughput of 1 MB/second or 1,000 records/second, and an outbound throughput of 2 MB/second. As you ingest more data, more shards can be added to the stream. Customers often run thousands of shards in a single stream.
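
The per-shard figures give a simple way to estimate how many shards a workload needs: take whichever of the three limits demands the most shards. The sketch below is only a back-of-the-envelope illustration. The function name `estimate_shards` and its parameters are hypothetical, and it assumes traffic is spread evenly across shards, which real partition keys do not guarantee.

```python
import math

INGEST_MB_PER_SHARD = 1.0        # 1 MB/second write throughput per shard
INGEST_RECORDS_PER_SHARD = 1000  # 1,000 records/second write throughput per shard
EGRESS_MB_PER_SHARD = 2.0        # 2 MB/second read throughput per shard


def estimate_shards(ingest_mb_per_sec: float,
                    records_per_sec: float,
                    egress_mb_per_sec: float) -> int:
    """Return the smallest shard count that satisfies all three per-shard limits."""
    by_ingest_bytes = math.ceil(ingest_mb_per_sec / INGEST_MB_PER_SHARD)
    by_ingest_records = math.ceil(records_per_sec / INGEST_RECORDS_PER_SHARD)
    by_egress_bytes = math.ceil(egress_mb_per_sec / EGRESS_MB_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_ingest_bytes, by_ingest_records, by_egress_bytes)


# Example: 3.5 MB/s in, 2,500 records/s in, 9 MB/s out
print(estimate_shards(3.5, 2500, 9))  # prints 5 (read throughput is the bottleneck)
```

In this example the write side alone would need 4 shards, but the read side needs 5, so the read limit determines the stream size.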

When a consumer uses enhanced fan-out, it gets its own 2 MB/second of read throughput per shard, so multiple consumers can read from the same stream in parallel without contending with one another for read throughput.
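
The effect of shared versus dedicated read throughput can be illustrated with a small calculation. This is a simplified sketch under stated assumptions: the function name `read_mb_per_sec_per_consumer` is hypothetical, and it assumes standard consumers split each shard's 2 MB/second evenly, which is an idealization of real polling behavior.

```python
SHARD_READ_MB_PER_SEC = 2.0  # read throughput of one shard


def read_mb_per_sec_per_consumer(shards: int,
                                 consumers: int,
                                 enhanced_fan_out: bool) -> float:
    """Approximate read throughput available to each consumer of a stream."""
    if shards < 1 or consumers < 1:
        raise ValueError("shards and consumers must both be at least 1")
    total = shards * SHARD_READ_MB_PER_SEC
    if enhanced_fan_out:
        # Each registered consumer gets its own dedicated allotment per shard.
        return total
    # Standard consumers share the same per-shard allotment.
    return total / consumers


# Example: 4 shards read by 4 consumers
print(read_mb_per_sec_per_consumer(4, 4, enhanced_fan_out=False))  # prints 2.0
print(read_mb_per_sec_per_consumer(4, 4, enhanced_fan_out=True))   # prints 8.0
```

With standard consumers, adding a consumer reduces the share available to every other consumer. With enhanced fan-out, the per-consumer figure stays constant as consumers are added.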

Comparison between consumers without enhanced fan-out and consumers with enhanced fan-out


PERFORMANCE MEASUREMENT


Assume the requirements are as follows.


As the chart above shows, each enhanced fan-out function processed the 4,000 records in under 2 seconds, while each standard function took just over 2.5 seconds. When processing millions of records in real time, this latency gap between standard and enhanced fan-out consumers becomes much more significant.



CONCLUSION


In conclusion, Kinesis Data Streams is one of the best choices for collecting and analyzing massive amounts of data. It delivers its best performance with enhanced fan-out enabled, stores data durably, scales by adding shards to handle large streams, and is cost-effective. It is therefore an easy way to stream large volumes of data in real time in an era where performance matters.

Let’s get started: try implementing this service when you have the chance!