Streaming data
1. Streaming data
In this video you will learn about streaming data.2. Streaming data fundamentals
Streaming applications process records continuously as they arrive. Producers write to a stream while consumers read independently and in parallel, absorbing traffic spikes without waiting for scheduled batch jobs.3. What makes up streaming data?
A streaming system has four parts: Producers, applications that emit records. Streams, a durable, ordered buffers that decouples producers from consumers. Consumers, applications that read and process records. Storage and retention define how long records remain available for replay.4. Shards
Shards are units of throughput and parallel processing. Records are distributed across shards using partition keys. Ordering is guaranteed only within an individual shard. Consumers use check-pointing to track processing progress and recover from failures.5. AWS managed services for streaming data
Kinesis Data Streams and Kinesis Data Firehose are the core building blocks for building streaming data applications.6. Kinesis Data Streams
Amazon Kinesis Data Streams provides low-latency streaming (typically ~200ms) with configurable retention from 24 hours (default) up to 365 days. It supports durable, replayable streams that multiple consumers can read from in parallel. Developers manage stream scaling using shards.7. When to use Kinesis Data Streams
Choose it when you need real-time latency, replay, multiple independent consumers, or custom transformation logic.8. Writing records
Producers write records to Kinesis using a partition key, which controls shard placement and ordering. PutRecords supports up to 500 records or 5 MB per request. Failures are reported individually for selective retry. Use the `FailedRecordCount` parameter to help you identify any failed records.9. Consuming records: classic
Consumers poll on demand against Kinesis API. It is worth knowing the read and write limits, which are per shard. Writes, 1 MB/sec or 1,000 records/sec. Reads, 2 MB/sec or 5 `GetRecords` calls/sec.10. Consuming records: stream start
Consumers must specify where in the stream to begin reading from. `TRIM_HORIZON` starts from the oldest available record in the shard. `LATEST` starts from the next new record. `AT_SEQUENCE_NUMBER` `AFTER_SEQUENCE_NUMBER` starts reading from or after a specific sequence number. `AT_TIMESTAMP` starts reading records at or after a specified timestamp.11. Consuming records: enhanced fan-out
Enhanced Fan-Out provides each registered consumer a dedicated 2 MB/sec per shard using HTTP/2 push delivery. Use when multiple independent consumers need low-latency, high-throughput access to the same stream.12. Lambda event source mapping
Lambda can poll Kinesis and invoke with batches. Key tunables: `BatchSize` and `MaximumBatchingWindow`. `ParallelizationFactor`: concurrent invocations per shard while preserving per-key order. `BisectBatchOnFunctionError`: splits a failing batch to isolate poison pills. `ReportBatchItemFailures`: allows successfully processed records to avoid unnecessary retries. `OnFailure` destination: SQS or SNS DLQ for batches that cannot be processed successfully. Shard count caps Lambda concurrency unless `ParallelizationFactor` is increased.13. Data Firehose
Data Firehose is a managed service that delivers streaming data to destinations like S3, with optional buffering, format conversion, and Lambda-based transformation. It supports higher latency (buffered for seconds to minutes before delivery) but doesn't support replay or long-term retention. Choose when the goal is "get this data into a destination" without building a consumer.14. Pattern: hot-cold
Hot-cold pattern combines Data Streams to enable real-time processing for Producers, while a Firehose delivery stream attached to the same source, archives records into S3.15. Kinesis vs SQS
Use Kinesis for ordered, replayable streams with multiple consumers; use SQS for task distribution. Kinesis is better for time-series and ordered data; SQS is better for task distribution.16. Scaling streaming data
Kinesis Data Streams are divided into shards. Partition keys route records to a shard. The same key sends data to the same shard, preserving ordering within that shard. Hot shards occur when partition keys are uneven. Mitigate with high-cardinality keys or resharding (split to scale out, merge to scale in).17. Capacity modes: provisioned
Two capacity modes exist. Provisioned mode uses a fixed shard count managed manually. It provides predictable cost and capacity but requires monitoring to avoid throttling or over-provisioning.18. Capacity modes: on-demand
On-Demand mode automatically scales based on traffic without manual shard management. It simplifies operations for unpredictable workloads but may cost more under sustained heavy traffic.19. Kinesis Producer Library (KPL)
AWS provides developer libraries that simplify building scalable Kinesis producers and consumers. Kinesis Producer Library (KPL) batches, aggregates, compresses, retries, and buffers asynchronously to maximize producer throughput.20. Kinesis Consumer Library (KCL)
Kinesis Consumer Library handles shard coordination, worker leasing, and check-pointing. State is stored in a DynamoDB table, the first place to look when consumers fail to resume after a redeploy.21. Failure handling: idempotency
Kinesis delivery is at-least-once so duplicates are possible. Design consumers for idempotency using the Kinesis sequence number or a business key stored in DynamoDB. Kinesis consumers can reread retained records from a previous sequence number or timestamp.22. Failure handling: throughput errors
`ProvisionedThroughputExceededException` errors are resolved by increasing shards, improving key distribution, retrying with backoff, or switching capacity mode.23. Logging and monitoring
Key CloudWatch metrics to watch `IncomingBytes`, `IncomingRecords`, `WriteProvisionedThroughputExceeded`. Consumer lag occurs when consumers cannot process records as quickly as they arrive. High iterator age `GetRecords.IteratorAgeMilliseconds` means consumers are lagging.24. Security
Server-side data is encrypted at rest with KMS. TLS encrypts data in transit. IAM policies control producer and consumer permissions.25. Let's practice!
Now let's test your knowledge of streaming data.Create Your Free Account
or
By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.