• How quickly do you need analytic results: in real time, in seconds, or is an hour a more
appropriate time frame?
• How much value will these analytics provide your organization and what budget
constraints exist?
• How large is the data and what is its growth rate?
• How is the data structured?
• What integration capabilities do the producers and consumers have?
• How much latency is acceptable between the producers and consumers?
• What is the cost of downtime or how available and durable does the solution need to be?
• Is the analytic workload consistent or elastic?
Each one of these questions helps guide you to the right tool. In some cases, you can
simply map your big data analytics workload into one of the services based on a set of
requirements. However, in most real-world, big data analytic workloads, there are many
different, and sometimes conflicting, characteristics and requirements on the same data
set.
For example, some result sets may have real-time requirements as a user interacts with a
system, while other analytics could be batched and run on a daily basis. These different
requirements over the same data set should be decoupled and solved by using more than
one tool. If you try to solve both of these examples using the same toolset, you end up
either over-provisioning, and therefore overpaying, for unnecessary response time, or with a
solution that does not respond fast enough to your users in real time. Matching the
best-suited tool to each analytical problem results in the most cost-effective use of your
compute and storage resources.
Big data doesn’t need to mean “big costs”. So, when designing your applications, it’s important
to make sure that your design is cost efficient. If it’s not, relative to the alternatives, then it’s
probably not the right design. Another common misconception is that using multiple tool
sets to solve a big data problem is more expensive or harder to manage than using one big
tool. If you take the same example of two different requirements on the same data set, the
real-time request may be low on CPU but high on I/O, while the slower processing request
may be very compute intensive.
Decoupling can end up being much less expensive and easier to manage, because you can
build each tool to exact specifications and not overprovision. With the AWS pay-as-you-go
model, this equates to a much better value because you could run the batch analytics in just
one hour and therefore only pay for the compute resources for that hour. Also, you may find
this approach easier to manage rather than leveraging a single system that tries to meet all
of the requirements. Solving for different requirements with one tool results in attempting
to fit a square peg (real-time requests) into a round hole (a large data warehouse).
The AWS platform makes it easy to decouple your architecture by having different tools
analyze the same data set. AWS services have built-in integration so that moving a subset of
data from one tool to another can be done very easily and quickly using parallelization.
Following are some real-world big data analytics problem scenarios, and an AWS architectural
solution for each.
Example 1: Queries against an Amazon S3 data lake
1. An AWS Glue crawler connects to a data store, progresses through a prioritized list of
classifiers to extract the schema of your data and other statistics, and then populates the
AWS Glue Data Catalog with this metadata. Crawlers can run periodically to detect the
availability of new data as well as changes to existing data, including table definition
changes. Crawlers automatically add new tables, new partitions to existing tables, and new
versions of table definitions. You can customize AWS Glue crawlers to classify your own file
types. A minimal sketch of driving a crawler and querying the resulting tables
programmatically follows this list.
2. The AWS Glue Data Catalog is a central repository to store structural and operational
metadata for all your data assets. For a given data set, you can store its table definition
and physical location, add business-relevant attributes, and track how the data has
changed over time. The AWS Glue Data Catalog is Apache Hive Metastore compatible and is
a drop-in replacement for the Apache Hive Metastore for big data applications running on
Amazon EMR. For more information on setting up your EMR cluster to use the AWS Glue Data
Catalog as an Apache Hive Metastore, see the AWS Glue documentation.
3. The AWS Glue Data Catalog also provides out-of-the-box integration with Amazon Athena,
Amazon EMR, and Amazon Redshift Spectrum. After you add your table definitions to the
AWS Glue Data Catalog, they are available for ETL and also readily available for querying
in Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum so that you can have a
common view of your data between these services.
4. Using a BI tool like Amazon QuickSight enables you to easily build visualizations, perform
ad hoc analysis, and quickly get business insights from your data. Amazon QuickSight
supports data sources such as Amazon Athena, Amazon Redshift Spectrum, Amazon S3
and many others. See Supported Data Sources.
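As a rough illustration of steps 1 through 3, the following sketch uses the AWS SDK for Python (boto3) to create and start a Glue crawler over an S3 prefix and then run an Athena query against the resulting Data Catalog table. The bucket paths, IAM role, database, table, and query are illustrative placeholders, not resources defined in this whitepaper.

```python
import time
import boto3

# Placeholder names -- substitute your own bucket, role, database, and table.
DATA_PATH = "s3://example-datalake-bucket/sales/"
RESULTS_PATH = "s3://example-athena-results-bucket/"
CRAWLER_ROLE = "arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole"

glue = boto3.client("glue")
athena = boto3.client("athena")

# 1. Create and start a crawler that catalogs the data set under DATA_PATH.
glue.create_crawler(
    Name="example-datalake-crawler",
    Role=CRAWLER_ROLE,
    DatabaseName="example_datalake_db",
    Targets={"S3Targets": [{"Path": DATA_PATH}]},
)
glue.start_crawler(Name="example-datalake-crawler")

# 2. After the crawler has populated the Data Catalog, query the table with Athena.
query = athena.start_query_execution(
    QueryString="SELECT region, SUM(amount) AS total FROM sales GROUP BY region",
    QueryExecutionContext={"Database": "example_datalake_db"},
    ResultConfiguration={"OutputLocation": RESULTS_PATH},
)
query_id = query["QueryExecutionId"]

# 3. Poll until the query finishes, then read the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```

Amazon QuickSight can then use the same Athena table as a data source for the visualizations described in step 4.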
Example 2: Capturing and analyzing sensor
data
An international air conditioner manufacturer has many large air conditioners that it sells
to various commercial and industrial companies. Not only do they sell the air conditioner
units but, to better position themselves against their competitors, they also offer add-on
services where you can see real-time dashboards in a mobile app or a web browser. Each
unit sends its sensor information for
processing and analysis. This data is used by the manufacturer and its customers. With this
capability, the manufacturer can visualize the dataset and spot trends.
Currently, they have a few thousand pre-purchased air conditioning (A/C) units with this
capability. They expect to deliver these to customers in the next couple of months and are
hoping that, in time, thousands of units throughout the world will use this platform. If
successful, they would like to expand this offering to their consumer line as well, with a
much larger volume and a greater market share. The solution needs to be able to handle
massive amounts of data and scale as they grow their business without interruption. How
should you design such a system?
First, break it up into two work streams, both originating from the same data:
• A/C unit’s current information with near-real-time requirements and a large number of
customers consuming this information
• All historical information on the A/C units to run trending and analytics for internal use
The data-flow architecture in the following figure shows how to solve this problem.
1. The process begins with each A/C unit providing a constant data stream to Amazon
Kinesis Data Streams. This provides an elastic and durable interface that the units can
talk to and that can be scaled seamlessly as more and more A/C units are sold and brought
online.
2. Using the tools provided with Amazon Kinesis Data Streams, such as the Kinesis Client
Library or the AWS SDK, a simple application is built on Amazon EC2 to read data as it comes
into Amazon Kinesis Data Streams, analyze it, and determine if the data warrants an update
to the real-time dashboard. It looks for changes in system operation, temperature
fluctuations, and any errors that the units encounter. A minimal producer and consumer
sketch follows this list.
3. This data flow needs to occur in near real time so that customers and maintenance teams
can be alerted quickly if there is an issue with the unit. The data in the dashboard does
include some aggregated trend information, but it is mainly the current state as well as any
system errors.
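The following is a minimal sketch, using boto3, of both sides of the stream described in steps 1 and 2: a put_record call that an A/C unit (or a gateway in front of it) might make, and a single-shard consumer loop that flags readings warranting a dashboard update. The stream name, sensor fields, and temperature threshold are assumptions made for illustration; a production consumer would typically use the Kinesis Client Library to handle multiple shards, checkpointing, and failover.

```python
import json
import time
import boto3

STREAM_NAME = "example-ac-telemetry"  # placeholder stream name
kinesis = boto3.client("kinesis")

# Producer side: an A/C unit (or gateway) publishes a sensor reading.
reading = {"unit_id": "ac-unit-042", "temp_c": 27.5, "status": "OK", "ts": int(time.time())}
kinesis.put_record(
    StreamName=STREAM_NAME,
    Data=json.dumps(reading).encode("utf-8"),
    PartitionKey=reading["unit_id"],  # keeps each unit's records ordered within a shard
)

# Consumer side: read records from one shard and decide whether the
# real-time dashboard needs an update (single-shard sketch only).
shard_id = kinesis.describe_stream(StreamName=STREAM_NAME)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM_NAME, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

while True:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        data = json.loads(record["Data"])
        # Flag temperature excursions or error states for the dashboard.
        if data["temp_c"] > 30.0 or data["status"] != "OK":
            print("dashboard update needed:", data)
    iterator = batch["NextShardIterator"]
    time.sleep(1)
```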
Example 3: Sentiment analysis of social media
Capturing the data from various social networks is relatively easy but the challenge is building
the intelligence programmatically. After the data is ingested, the company wants to be able
to analyze and classify the data in a cost-effective and programmatic way. To do this, they
can use the architecture in the following figure.
1. First, deploy an Amazon EC2 instance in an Amazon VPC that ingests tweets from Twitter.
2. Next, create an Amazon Kinesis Data Firehose delivery stream that loads the streaming
tweets into the raw prefix in the solution's S3 bucket.
3. Amazon S3 invokes an AWS Lambda function to analyze the raw tweets, using Amazon
Translate to translate non-English tweets into English and Amazon Comprehend to perform
entity extraction and sentiment analysis with natural language processing (NLP). A minimal
sketch of such a function follows this list.
4. A second Kinesis Data Firehose delivery stream loads the translated tweets and
sentiment values into the sentiment prefix in the S3 bucket. A third delivery stream loads
entities in the entities prefix in the S3 bucket.
5. This architecture also deploys a data lake that includes AWS Glue for data
transformation, Amazon Athena for data analysis, and Amazon QuickSight for data
visualization. AWS Glue Data Catalog contains a logical database used to organize the
tables for the data in S3. Athena uses these table definitions to query the data stored in
S3 and return the information to an Amazon QuickSight dashboard.
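The following is a minimal sketch of the Lambda function described in step 3, assuming the raw tweets arrive as newline-delimited JSON objects under the raw prefix and that Amazon Comprehend is also used to detect the tweet language before translation. The delivery stream names and tweet fields are illustrative placeholders, not the actual resource names created by the solution.

```python
import json
import boto3

s3 = boto3.client("s3")
translate = boto3.client("translate")
comprehend = boto3.client("comprehend")
firehose = boto3.client("firehose")

# Placeholder delivery stream names for the sentiment and entities prefixes.
SENTIMENT_STREAM = "example-sentiment-delivery-stream"
ENTITIES_STREAM = "example-entities-delivery-stream"

def handler(event, context):
    """Triggered by Amazon S3 when a new object lands under the raw/ prefix."""
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        key = rec["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        for line in filter(None, body.splitlines()):
            tweet = json.loads(line)

            # Detect the dominant language; translate non-English tweets to English.
            lang = comprehend.detect_dominant_language(Text=tweet["text"])["Languages"][0]["LanguageCode"]
            english = tweet["text"]
            if lang != "en":
                english = translate.translate_text(
                    Text=tweet["text"], SourceLanguageCode=lang, TargetLanguageCode="en"
                )["TranslatedText"]

            # Sentiment analysis and entity extraction with Amazon Comprehend.
            sentiment = comprehend.detect_sentiment(Text=english, LanguageCode="en")
            entities = comprehend.detect_entities(Text=english, LanguageCode="en")["Entities"]

            # Deliver enriched records to the sentiment and entities prefixes via Firehose.
            firehose.put_record(
                DeliveryStreamName=SENTIMENT_STREAM,
                Record={"Data": (json.dumps({
                    "id": tweet.get("id_str"),
                    "text": english,
                    "sentiment": sentiment["Sentiment"],
                    "scores": sentiment["SentimentScore"],
                }) + "\n").encode("utf-8")},
            )
            for entity in entities:
                firehose.put_record(
                    DeliveryStreamName=ENTITIES_STREAM,
                    Record={"Data": (json.dumps({"id": tweet.get("id_str"), **entity}) + "\n").encode("utf-8")},
                )
```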
By using ML and BI services from AWS including Amazon Translate, Amazon Comprehend,
Amazon Kinesis, Amazon Athena, and Amazon QuickSight, you can build meaningful, low-
cost social media dashboards to analyze customer sentiment, which can lead to better
opportunities for acquiring leads, improved website traffic, stronger customer relationships,
and better customer service.
This example solution automatically provisions and configures the AWS services necessary
to capture multi-language tweets in near-real-time, translate them, and display them on a
dashboard powered by Amazon QuickSight. You can also capture both the raw and
enriched datasets and durably store them in the solution's data lake. This enables data
analysts to quickly and easily perform new types of analytics and ML on this data. For
more information, see the AI-Driven Social Media Dashboard solution.
Conclusion
As more and more data is generated and collected, data analysis requires scalable, flexible, and
high-performing tools to provide insights in a timely fashion. However, organizations are
facing a growing big data environment, where new tools emerge and become outdated very
quickly. Therefore, it can be very difficult to keep pace and choose the right tools.
This whitepaper offers a first step to help you solve this challenge. With a broad set of
managed services to collect, process, and analyze big data, AWS makes it easier to build,
deploy, and scale big data applications. This enables you to focus on business problems
instead of updating and managing these tools.
AWS provides many solutions to address your big data analytic requirements. Most big data
architecture solutions use multiple AWS tools to build a complete solution. This approach
helps meet stringent business requirements in the most cost-optimized, performant, and
resilient way possible. The result is a flexible big data architecture that is able to scale along
with your business.