Lecture 2
Types of attributes
[Figure: specialized services for domain A and for domain B; streams; varying quality]
Big data solutions typically involve one or more of the following types of
workload:
Data sources: All big data solutions start with one or more data sources.
Examples include:
Batch processing: Big data solutions often process data files using long-running batch
jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually these jobs
involve reading source files, processing them, and writing the output to new files. Options
include running U-SQL jobs in Azure Data Lake Analytics, or using Hive, Pig, or custom
Map/Reduce jobs in an HDInsight Hadoop cluster.
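The read, process, and write pattern described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the column names `region` and `amount`, the `min_amount` threshold, and the file names are hypothetical, and a real job would run on a cluster engine rather than on a single machine.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def run_batch_job(source_dir: Path, output_file: Path, min_amount: float) -> dict:
    """Read every CSV file in source_dir, filter and aggregate rows, write a new file."""
    totals = defaultdict(float)
    for path in sorted(source_dir.glob("*.csv")):
        with path.open(newline="") as f:
            for row in csv.DictReader(f):
                amount = float(row["amount"])
                if amount >= min_amount:             # filter step
                    totals[row["region"]] += amount  # aggregate step
    # Write the prepared data to a new output file.
    with output_file.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["region", "total_amount"])
        for region in sorted(totals):
            writer.writerow([region, totals[region]])
    return dict(totals)


def demo() -> dict:
    """Create two small hypothetical source files and run the job over them."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "source"
        src.mkdir()
        (src / "day1.csv").write_text("region,amount\nnorth,10\nsouth,2\n")
        (src / "day2.csv").write_text("region,amount\nnorth,5\nsouth,20\n")
        return run_batch_job(src, Path(tmp) / "summary.csv", min_amount=5)


if __name__ == "__main__":
    print(demo())
```

The job never modifies its source files; it only produces a new output file, which is what makes a batch run safe to repeat.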
Real-time message ingestion: If the solution includes real-time sources, the architecture
must include a way to capture and store real-time messages for stream processing. This
might be a simple data store, where incoming messages are dropped into a folder for
processing. However, many solutions need a message ingestion store to act as a buffer for
messages, and to support scale-out processing, reliable delivery, and other message
queuing semantics. Options include Azure Event Hubs, Azure IoT Hub, and Kafka.
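The buffering role of a message ingestion store can be illustrated with an in-process bounded queue. This is a stand-in only: it does not use the Event Hubs, IoT Hub, or Kafka APIs, and the message fields `sensor` and `reading` are hypothetical.

```python
import queue
import threading


def ingest(messages, buffer: queue.Queue) -> None:
    """Producer: put each incoming message on the buffer, blocking while it is full."""
    for msg in messages:
        buffer.put(msg)
    buffer.put(None)  # sentinel: no more messages


def consume(buffer: queue.Queue, sink: list) -> None:
    """Consumer: take messages off the buffer in arrival order until the sentinel."""
    while True:
        msg = buffer.get()
        if msg is None:
            break
        sink.append(msg)


def run_demo() -> list:
    buffer = queue.Queue(maxsize=4)  # small capacity so the producer must wait
    received = []
    consumer = threading.Thread(target=consume, args=(buffer, received))
    consumer.start()
    ingest(({"sensor": "s1", "reading": i} for i in range(10)), buffer)
    consumer.join()
    return received


if __name__ == "__main__":
    print(run_demo())
```

The bounded queue decouples the rate at which messages arrive from the rate at which they are processed, which is the main reason for placing a buffer between sources and stream processing.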
BDA system architecture
Stream processing: After capturing real-time messages, the solution must process them by filtering,
aggregating, and otherwise preparing the data for analysis. The processed stream data is then written
to an output sink. Azure Stream Analytics provides a managed stream processing service based on
perpetually running SQL queries that operate on unbounded streams. You can also use open source
Apache streaming technologies like Storm and Spark Streaming in an HDInsight cluster.
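One common form of aggregation over an unbounded stream is a tumbling (fixed, non-overlapping) time window. The sketch below is a plain-Python illustration, not the Azure Stream Analytics query language or the Spark API; it assumes events arrive in timestamp order, and the `(timestamp, key)` event shape is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, key)


def tumbling_window_counts(
    events: Iterable[Event], window_seconds: float
) -> Iterator[Tuple[float, Dict[str, int]]]:
    """Yield (window_start, counts per key) each time a window closes.

    Assumes events arrive in timestamp order; windows with no events are skipped.
    """
    current_start = None
    counts: Counter = Counter()
    for ts, key in events:
        start = (ts // window_seconds) * window_seconds
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The event belongs to a later window, so emit the finished one.
            yield current_start, dict(counts)
            current_start, counts = start, Counter()
        counts[key] += 1
    if current_start is not None:
        yield current_start, dict(counts)  # flush the last open window


if __name__ == "__main__":
    stream = [(0.5, "a"), (1.2, "b"), (4.9, "a"), (5.0, "a"), (9.9, "b"), (12.0, "a")]
    for window_start, per_key in tumbling_window_counts(stream, window_seconds=5):
        print(window_start, per_key)
```

Because the function is a generator, it emits each result as soon as a window closes and never needs the whole stream in memory.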
Analytical data store: Many big data solutions prepare data for analysis and then serve the processed
data in a structured format that can be queried using analytical tools. The analytical data store used
to serve these queries can be a Kimball-style relational data warehouse, as seen in most traditional
business intelligence (BI) solutions. Alternatively, the data could be presented through a low-latency
NoSQL technology such as HBase, or an interactive Hive database that provides a metadata abstraction
over data files in the distributed data store. Azure Synapse Analytics provides a managed service for
large-scale, cloud-based data warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL.
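A Kimball-style warehouse organizes data into fact tables joined to dimension tables. The sketch below uses Python's built-in `sqlite3` module as a small local stand-in for a real analytical store; the table names `fact_sales` and `dim_product` and all rows are hypothetical.

```python
import sqlite3


def build_store() -> sqlite3.Connection:
    """Create an in-memory store with one dimension table and one fact table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            amount REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?)",
        [(1, "books"), (2, "games"), (3, "books")],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?)",
        [(1, 10.0), (2, 30.0), (3, 5.0), (1, 2.5)],
    )
    conn.commit()
    return conn


def sales_by_category(conn: sqlite3.Connection) -> list:
    """Typical analytical query: join facts to a dimension and aggregate."""
    query = """
        SELECT d.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS d ON d.product_id = f.product_id
        GROUP BY d.category
        ORDER BY d.category
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(sales_by_category(build_store()))
```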
Orchestration: Most big data solutions consist of repeated data processing operations,
encapsulated in workflows, that transform source data, move data between multiple sources and
sinks, load the processed data into an analytical data store, or push the results straight to a report
or dashboard. To automate these workflows, you can use an orchestration technology such as
Azure Data Factory or Apache Oozie and Sqoop.
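At its core, an orchestrator runs tasks in an order that respects their dependencies. The sketch below shows that idea with the standard-library `graphlib` module (Python 3.9 or later); it is not the Azure Data Factory or Oozie API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks: dict, dependencies: dict) -> list:
    """Run each task only after the tasks it depends on have finished.

    `dependencies` maps a task name to the set of task names it depends on.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


def demo() -> list:
    log = []
    names = ("extract", "transform", "load", "report")
    # Each hypothetical task just records that it ran.
    tasks = {name: (lambda n=name: log.append(n)) for name in names}
    dependencies = {
        "transform": {"extract"},
        "load": {"transform"},
        "report": {"load"},
    }
    run_workflow(tasks, dependencies)
    return log


if __name__ == "__main__":
    print(demo())
```

A production orchestrator adds scheduling, retries, and monitoring on top of this ordering step.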
When to use this architecture
Massive Parallelism
Huge Data Volumes Storage
Data Distribution
High-Speed Networks
High-Performance Computing
Task and Thread Management
Data Mining and Analytics
Data Retrieval
Machine Learning
Data Visualization
Types of data
A key form of data documentation is metadata, or in other words “data about
data”. Metadata are characteristics that describe the data and facilitate
cataloguing and discovery of the data. When you deposit your data in a trusted
data repository, the repository generates machine-readable metadata.
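As a rough illustration of what machine-readable metadata can look like, the sketch below builds a small JSON record for a dataset. The field names (`name`, `size_bytes`, `sha256`, `columns`) are assumptions chosen for this example and do not follow any particular repository's schema.

```python
import hashlib
import json


def describe_dataset(name: str, content: bytes, columns: list) -> str:
    """Return a JSON metadata record describing a dataset (hypothetical schema)."""
    metadata = {
        "name": name,
        "size_bytes": len(content),
        "sha256": hashlib.sha256(content).hexdigest(),  # fingerprint for integrity checks
        "columns": columns,
    }
    return json.dumps(metadata, indent=2, sort_keys=True)


if __name__ == "__main__":
    data = b"id,temp_c\n1,21.5\n"
    print(describe_dataset("room_temperatures.csv", data, ["id", "temp_c"]))
```

Because the record is plain JSON, a catalogue can index and search it without opening the dataset itself.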
Example :
Qualitative Attributes
Binary Attributes: Binary data has only two values or states, for example yes or
no, affected or unaffected, true or false.
An interval-scaled attribute has values whose differences are interpretable, but the
scale has no true reference point, which we can call a zero point. Values on an interval
scale can be added and subtracted but cannot be meaningfully multiplied or divided.
Consider temperature in degrees Centigrade: if one day's temperature is twice that of
another day, we cannot say that the first day is twice as hot.
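The temperature example can be checked with a little arithmetic. The sketch below compares two hypothetical readings; the conversion to Kelvin is added here as an illustration, since the Kelvin scale has a true zero and its ratios are therefore meaningful.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert degrees Centigrade to Kelvin, a scale with a true zero point."""
    return celsius + 273.15


day_a_c, day_b_c = 10.0, 20.0  # hypothetical readings in degrees Centigrade

# Differences on an interval scale are meaningful: day B is 10 degrees warmer.
difference = day_b_c - day_a_c

# The ratio of the Centigrade values is 2.0, but it only reflects where the
# arbitrary zero of the scale happens to sit.
naive_ratio = day_b_c / day_a_c

# On the Kelvin scale the same two days differ by only a few percent.
kelvin_ratio = celsius_to_kelvin(day_b_c) / celsius_to_kelvin(day_a_c)

if __name__ == "__main__":
    print(difference, naive_ratio, round(kelvin_ratio, 4))
```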
Discrete: Discrete data can be numerical or categorical. These attributes have a
finite or countably infinite set of values.
Example:
Quantitative Attributes
Example :
Business Use Cases
Product Recommendation
Customer Segmentation
Sentiment Analysis
Fraud Detection
Predictive Maintenance