AWS Kinesis Streams - Getting Started

In December 2013, AWS released Kinesis, a managed, dynamically scalable service for the processing of streaming big data in real-time. Since that time, Amazon has been steadily expanding the regions in which Kinesis is available, and as of this writing, it is possible to integrate Amazon’s Kinesis producer and client libraries into a variety of custom applications to enable real-time processing of streaming data from a variety of sources.

Kinesis acts as a highly available conduit to stream messages between data producers and data consumers. Data producers can be almost any source of data: system or web log data, social network data, financial trading information, geospatial data, mobile app data, or telemetry from connected IoT devices. Data consumers are typically data processing and storage applications such as Apache Hadoop, Apache Storm, Amazon Simple Storage Service (S3), and Elasticsearch.

Understanding Key Concepts in Kinesis

It is helpful to understand some key concepts when working with Kinesis Streams.

Kinesis Stream Shards

The basic unit of scale when working with streams is a shard. A single shard can ingest up to 1MB per second or 1,000 PUTs per second of streaming data, whichever limit is reached first, and can emit data at a rate of 2MB per second.

Shards scale linearly, so each shard added to a stream adds 1MB per second of ingress capacity and 2MB per second of egress capacity. Ten shards will scale a stream to handle 10MB (10,000 PUTs) of ingress and 20MB of egress per second. You choose the number of shards when creating a stream, and it is not possible to change this via the AWS Console once you’ve created the stream.
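The linear scaling described above can be turned into a rough sizing estimate. The following is a minimal sketch that assumes the per-shard limits quoted in this article; the function name and its parameters are illustrative and are not part of any AWS library.

```python
import math

# Per-shard limits as quoted in this article.
INGRESS_MB_PER_SHARD = 1.0   # MB written per second
RECORDS_PER_SHARD = 1000     # PUTs per second
EGRESS_MB_PER_SHARD = 2.0    # MB read per second


def shards_needed(ingress_mb_per_s, records_per_s, egress_mb_per_s):
    """Return the smallest shard count that satisfies all three limits."""
    return max(
        1,  # a stream always has at least one shard
        math.ceil(ingress_mb_per_s / INGRESS_MB_PER_SHARD),
        math.ceil(records_per_s / RECORDS_PER_SHARD),
        math.ceil(egress_mb_per_s / EGRESS_MB_PER_SHARD),
    )


# The ten-shard example from the text: 10MB in, 10,000 PUTs, 20MB out.
print(shards_needed(10, 10_000, 20))  # prints 10
```

Whichever limit is the tightest determines the result, so a stream of many small records can need more shards than its raw byte volume suggests.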

It is possible to dynamically add or remove shards from a stream using the Kinesis Streams API. This is called resharding. Resharding cannot be done via the AWS Console, and is considered an advanced strategy when working with Kinesis. A solid understanding of the subject is required prior to attempting these operations.

Adding shards essentially splits shards in order to scale the stream, and removing shards merges them. Data is not discarded when adding (splitting) or removing (merging) shards. It is not possible to split a single shard into more than two, nor to merge more than two shards into a single shard at a time.
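To make the pairwise nature of splitting and merging concrete, here is a small sketch that models each shard as an inclusive range of 128-bit hash keys, which is how the Kinesis API describes shards. The helper names are hypothetical, and the code only does the arithmetic; it does not call AWS. In a real split, the start of the upper child range is the kind of value you would supply as the new starting hash key.

```python
MAX_HASH_KEY = 2**128 - 1  # Kinesis hash keys are 128-bit integers


def split_range(start, end):
    """Split one inclusive hash key range into two child ranges at the midpoint."""
    if end <= start:
        raise ValueError("range is too small to split")
    new_start = start + (end - start + 1) // 2
    return (start, new_start - 1), (new_start, end)


def merge_ranges(left, right):
    """Merge two adjacent inclusive ranges back into a single range."""
    if left[1] + 1 != right[0]:
        raise ValueError("only adjacent ranges can be merged")
    return (left[0], right[1])


parent = (0, MAX_HASH_KEY)
low, high = split_range(*parent)

# Splitting and then merging the two children restores the parent range.
assert merge_ranges(low, high) == parent
print(low, high)
```

Because each operation works on exactly one or two ranges, going from one shard to four takes three splits rather than a single call.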

Adding and removing shards will increase or decrease the cost of your stream accordingly. Per the Amazon Kinesis Streams FAQ, there is a default limit of 10 shards per region. This limit can be increased by contacting Amazon Support and requesting a limit increase. Once the default has been raised, there is no upper limit to the number of shards or streams in an account.

Records, Partition Keys, and Sequence Numbers

Records are units of data stored in a stream and are made up of a sequence number, a partition key, and a data blob. The data blob is the payload contained within a record. The maximum size of a data blob before Base64-encoding is 1MB, which is the upper limit of data that can be placed into a stream in a single record. Larger data blobs must be broken into smaller chunks before putting them into a Kinesis stream.
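Breaking a large payload into record-sized pieces can be sketched as follows. This assumes the 1MB blob limit quoted above; the function name is illustrative, and reassembling the chunks on the consumer side (for example by tagging each chunk with an index) is left out.

```python
MAX_BLOB_BYTES = 1024 * 1024  # 1MB data blob limit quoted above


def chunk_blob(blob, chunk_size=MAX_BLOB_BYTES):
    """Yield consecutive slices of `blob`, each at most `chunk_size` bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(blob), chunk_size):
        yield blob[offset:offset + chunk_size]


# A payload slightly larger than 2MB becomes three records.
payload = b"x" * (2 * MAX_BLOB_BYTES + 10)
chunks = list(chunk_blob(payload))
print([len(c) for c in chunks])  # prints [1048576, 1048576, 10]
```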

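Partition keys determine which shard in a stream a record is written to, and allow a data producer to distribute data across shards. Kinesis hashes each partition key with MD5 and assigns the record to the shard whose hash key range contains the resulting value.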
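The way a partition key selects a shard can be illustrated locally. Kinesis maps partition keys to 128-bit integers with an MD5 hash, and each shard owns a range of those integers. The sketch below assumes the shards divide the hash key space evenly, which is not guaranteed for a stream that has been resharded; the function names are hypothetical.

```python
import hashlib

MAX_HASH_KEY = 2**128 - 1


def hash_key(partition_key):
    """Map a partition key to a 128-bit integer using MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_ranges(shard_count):
    """Divide the hash key space into `shard_count` contiguous inclusive ranges."""
    size = (MAX_HASH_KEY + 1) // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last range absorbs any remainder so the whole space is covered.
        end = MAX_HASH_KEY if i == shard_count - 1 else start + size - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key, ranges):
    """Return the index of the range that contains the key's hash."""
    key = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= key <= end:
            return index
    raise ValueError("ranges do not cover the hash key space")


ranges = even_ranges(4)
for key in ("device-1", "device-2", "device-3"):
    print(key, "-> shard index", shard_for(key, ranges))
```

Records that share a partition key always land in the same shard, so a producer that sends everything under one key will overload that shard no matter how many shards the stream has.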

Sequence numbers are unique identifiers for records inserted into a shard. They increase monotonically, and are specific to individual shards.

Amazon Kinesis Offerings

Amazon Kinesis is currently broken into three separate service offerings.

Kinesis Streams is capable of capturing large amounts of data (terabytes per hour) from data producers, and streaming it into custom applications for data processing and analysis. Streaming data is replicated by Kinesis across three separate availability zones within AWS to ensure reliability and availability of your data.

Kinesis Streams is capable of scaling from a single megabyte up to terabytes per hour of streaming data. You must manually provision the appropriate number of shards for your stream to handle the volume of data you expect to process. Amazon provides a shard calculator when creating a stream to help determine this number. Once the stream is created, it is possible to dynamically scale the number of shards up or down to meet demand, but only with the Kinesis Streams API at this time.

It is possible to load data into Streams using a number of methods, including HTTPS, the Kinesis Producer Library, and the Kinesis Agent. The Kinesis Client Library is used on the other side of the stream, to build applications that consume data from it.
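As an illustration of the HTTPS route, the sketch below writes a single record through an SDK client's put_record call, using the StreamName, Data, and PartitionKey parameters as they appear on the boto3 Kinesis client. The function name, the "device_id" field, and the stream name in the usage note are assumptions made for this example. The client is passed in as an argument so the function can be exercised without AWS credentials.

```python
import json


def send_event(client, stream_name, event, key_field="device_id"):
    """Serialize `event` as JSON and write it to the stream as one record."""
    response = client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        # Events with the same key value are routed to the same shard.
        PartitionKey=str(event[key_field]),
    )
    # The response reports which shard stored the record and its sequence number.
    return response["ShardId"], response["SequenceNumber"]
```

With boto3 installed and credentials configured, this could be called as send_event(boto3.client("kinesis"), "example-stream", {"device_id": 7, "temp": 21}), where "example-stream" is a hypothetical stream that already exists.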

By default, data is available in a stream for 24 hours, but can be made available for up to 168 hours (7 days) for an additional charge.

Monitoring is available through Amazon CloudWatch. If you want to add more verbose visualizations of that data, you can use Sumo Logic’s open source Kinesis Connector to fetch data from the Kinesis stream and send it to the Sumo Logic service. Kinesis Connector is a Java connector that acts as a pipeline between an Amazon Kinesis stream and a Sumo Logic Collection. Data is fetched from the Kinesis stream, transformed into a POJO, and then sent to the Sumo Logic Collection as JSON.

Kinesis Firehose is Amazon’s data-ingestion product offering for Kinesis. It is used to capture and load streaming data into other Amazon services such as S3 and Redshift. From there, you can load the data into processing and analysis tools like Elastic MapReduce and Amazon Elasticsearch Service. It is also possible to load the same data into S3 and Redshift at the same time using Firehose.

Firehose can scale to gigabytes of streaming data per second, and allows for batching, encrypting and compressing of data. It should be noted that Firehose will automatically scale to meet demand, which is in contrast to Kinesis Streams, for which you must manually provision enough capacity to meet anticipated needs.

As with Kinesis Streams, it is possible to load data into Firehose using a number of methods, including HTTPS, the Kinesis Producer Library, and the Kinesis Agent. Currently, it is only possible to stream data via Firehose to S3 and Redshift, but once stored in one of these services, the data can be copied to other services for further processing and analysis.

Monitoring is available through Amazon CloudWatch.

Kinesis Analytics is Amazon’s forthcoming product offering that will allow running of standard SQL queries against data streams, and send that data to analytics tools for monitoring and alerting.

This product has not yet been released, and Amazon has not published details of the service as of this date.

Kinesis Pricing

Here’s a pricing guide for the various Kinesis offerings.

Kinesis Streams

There are no setup or minimum costs associated with using Amazon Kinesis Streams. Pricing is based on two factors, shard hours and PUT Payload Units, and varies by region. US East (Virginia) and US West (Oregon) are the least expensive, while regions outside the US can be significantly more expensive.

At present, shard hours in the US East (Virginia) region are billed at $0.015 per hour, per shard. If you have 10 shards, you would be billed at a rate of $0.15 per hour.

PUT Payload Units are counted in 25KB chunks. If a record is 50KB, then you would be billed for two units. If a record is 15KB, you will be billed for a single unit. Billing per 1 million units in the US East (Virginia) region is $0.014.

Extended Data Retention for up to 7 days in the US East (Virginia) region is billed at $0.020 per shard hour. By default, Amazon Kinesis stores your data for 24 hours. You must enable Extended Data Retention via the Amazon API.
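The pricing rules above can be combined into a rough monthly estimate. This is a sketch that uses the US East (Virginia) rates quoted in this article, which will change over time. It assumes a 720-hour month and assumes that the Extended Data Retention rate is charged in addition to the base shard-hour rate; the function names are illustrative.

```python
import math

# US East (Virginia) rates as quoted in this article.
SHARD_HOUR_USD = 0.015
PUT_UNITS_PER_MILLION_USD = 0.014
EXTENDED_RETENTION_SHARD_HOUR_USD = 0.020  # assumed to be on top of the base rate
PUT_UNIT_KB = 25


def put_payload_units(record_kb):
    """Number of 25KB PUT Payload Units billed for a single record."""
    return max(1, math.ceil(record_kb / PUT_UNIT_KB))


def monthly_cost(shards, records_per_second, record_kb,
                 hours=720, extended_retention=False):
    """Estimate the monthly Kinesis Streams cost in US dollars."""
    hourly_rate = SHARD_HOUR_USD
    if extended_retention:
        hourly_rate += EXTENDED_RETENTION_SHARD_HOUR_USD
    shard_cost = shards * hours * hourly_rate

    units = records_per_second * 3600 * hours * put_payload_units(record_kb)
    put_cost = units / 1_000_000 * PUT_UNITS_PER_MILLION_USD
    return round(shard_cost + put_cost, 2)


# The two record sizes from the text: 50KB is two units, 15KB is one.
print(put_payload_units(50), put_payload_units(15))  # prints 2 1
print(monthly_cost(shards=10, records_per_second=1000, record_kb=15))
```

Because units are counted per record, many small records cost more in PUT Payload Units than the same volume of data sent in records close to 25KB.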

Kinesis Streams is not available in the AWS Free Tier. For more information and pricing examples, see Amazon Kinesis Streams Pricing.

Kinesis Firehose

There are also no setup or minimum costs associated with using Amazon Kinesis Firehose. Pricing is based on a single factor — data ingested per GB. Data ingested by Firehose in the US East (Virginia) region is billed at $0.035 per GB.

You will also be charged separately for data ingested by Firehose and stored in S3 or Redshift. Kinesis Firehose is not available in the AWS Free Tier. For more information and pricing examples, see Amazon Kinesis Firehose Pricing.

Kinesis vs SQS

Amazon Kinesis is differentiated from Amazon’s Simple Queue Service (SQS) in that Kinesis is used to enable real-time processing of streaming big data. SQS, on the other hand, is used as a message queue to store messages transmitted between distributed application components.

Kinesis provides routing of records using a given key, ordering of records, the ability for multiple clients to read messages from the same stream concurrently, replay of messages from up to seven days in the past, and the ability for a client to consume records at a later time. Kinesis Streams will not dynamically scale in response to increased demand, so you must provision enough shards ahead of time to meet the anticipated demand of both your data producers and data consumers.

SQS provides for messaging semantics so that your application can track the successful completion of work items in a queue, and you can schedule a delay in messages of up to 15 minutes. Unlike Kinesis Streams, SQS will scale automatically to meet application demand. SQS has lower limits to the number of messages that can be read or written at one time compared to Kinesis, so applications using Kinesis can work with messages in larger batches than when using SQS.

Competitors to Kinesis

There are a number of products and services that provide similar feature sets to Kinesis. Three well-known options are summarized below.

Apache Kafka is a high performance message broker originally developed by LinkedIn, and now a part of the Apache Software Foundation. It is downloadable software written in Scala. There are quite a few opinions as to whether one should choose Kafka or Kinesis, but there are some simple use cases to help make that decision.

If you are looking for an on-prem solution, consider Kafka since you can set up and manage it yourself. Kafka is generally considered more feature-rich and higher-performance than Kinesis, and offers the flexibility that comes with maintaining your own software. On the other hand, you must set up and maintain your own Kafka cluster(s), and this can require expertise that you may not have available in-house.

It is possible to set up Kafka on EC2 instances, but again, that will require someone with Kafka expertise to configure and maintain. If your use case requires a turnkey service that is easy to set up and maintain, or integrate with other AWS services such as S3 or Redshift, then you should consider Kinesis instead.

There are a number of comparisons on the web that go into more detail about features, performance, and limitations if you’re inclined to look further.

Microsoft Azure Event Hubs is Microsoft’s entry in the streaming messaging space. Event Hubs is a managed service offering similar to Kinesis. It supports AMQP 1.0 in addition to HTTPS for reading and writing of messages. Currently, Kinesis only supports HTTPS and not AMQP 1.0. (There is an excellent comparison of Azure Event Hubs vs Amazon Kinesis if you are looking to see a side-by-side comparison of the two services.)

Google Cloud Pub/Sub is Google’s offering in this space. Pub/Sub supports HTTP access as well as gRPC (alpha) for reading and writing streaming data.

At the moment, adequate comparisons of this service to Amazon Kinesis (or Azure Event Hubs) are somewhat lacking on the web. This is expected; Google only released version 1 of this product in June of 2015. Expect to see more sometime in the near future.

Google provides excellent documentation on using the service in their Getting Started guide.

Beginner Resources for Kinesis

Amazon has published an excellent tutorial on getting started with Kinesis in their blog post Building a Near Real-Time Discovery Platform with AWS. It is recommended that you give this a try first to see how Kinesis can integrate with other AWS services, especially S3, Lambda, Elasticsearch, and Kibana.

Once you’ve taken Kinesis for a test spin, you might consider integrating with an external service such as Sumo Logic to analyze log files from your EC2 instances using their Amazon Kinesis Connector. (The code has been published in the Sumo Logic GitHub repository.)

Getting Started with AWS Kinesis Streams is published by the Sumo Logic DevOps Community. If you’d like to learn more or contribute, visit devops.sumologic.com. Also, be sure to check out Sumo Logic Developers for free tools and code that will enable you to monitor and troubleshoot applications from code to production.

Steve Tidwell
