5 V's of Big Data
1. Volume: The sheer amount of data generated.
2. Velocity: The speed at which data is generated and processed.
3. Variety: The diverse types of data, including structured, semi-structured, and unstructured.
4. Veracity: The quality and accuracy of the data.
5. Value: The potential insights and value that can be derived from the data.

Two Applications of Big Data
1. Healthcare: Analyzing large datasets of patient records to identify trends, predict diseases, and improve treatment plans.
2. Financial Services: Detecting fraud, assessing risk, and personalizing financial products.

Convergence of Key Trends in Big Data
● IoT: The increasing number of connected devices generating vast amounts of data.
● Cloud Computing: Enabling scalable and cost-effective storage and processing of big data.
● AI and Machine Learning: Leveraging advanced algorithms to extract insights from complex datasets.
● Data Science and Analytics: Applying statistical and computational techniques to uncover patterns and trends.

How Big Data Works in Credit Cards
● Fraud Detection: Analyzing transaction patterns to identify anomalies and potential fraudulent activity.
● Customer Segmentation: Grouping customers based on their spending habits and preferences to offer personalized services.
● Risk Assessment: Evaluating creditworthiness and predicting default risk.

Different Types of Data and Examples
● Structured Data: Organized data with a predefined format (e.g., databases, spreadsheets).
● Semi-Structured Data: Data with some structure but not strictly adhering to a predefined schema (e.g., XML, JSON).
● Unstructured Data: Data without a predefined structure (e.g., text, images, audio, video).

Firewall Analytics Big Data
Analyzing firewall logs to identify security threats, detect intrusions, and optimize security policies.

NoSQL
A database model that does not rely on the traditional tabular relational structure. It offers flexibility and scalability for handling large and diverse datasets.
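The fraud-detection idea above — flagging transactions whose pattern deviates from the norm — can be sketched with a simple z-score test. This is a minimal illustration, not a production fraud model; the threshold and sample amounts are assumptions chosen for the example.

```python
# Minimal sketch of transaction anomaly detection using z-scores.
# The threshold (2 standard deviations) and the sample data are
# illustrative assumptions, not a real fraud-scoring model.
from statistics import mean, stdev

def flag_anomalies(amounts, threshold=2.0):
    """Return indices of transactions whose amount deviates more than
    `threshold` standard deviations from the mean."""
    mu = mean(amounts)
    sigma = stdev(amounts)
    return [i for i, a in enumerate(amounts)
            if sigma > 0 and abs(a - mu) / sigma > threshold]

# Typical card spend with one large outlier at index 7.
transactions = [25.0, 40.0, 32.0, 28.0, 35.0, 30.0, 27.0, 5000.0]
flagged = flag_anomalies(transactions)
```

Real systems use far richer features (merchant, location, time of day) and learned models, but the core principle is the same: score each transaction against the customer's historical pattern.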
Aggregate Data Models
Data models that summarize and combine data from multiple sources to provide a higher-level view.

Shredding
A technique for breaking down large data files into smaller, more manageable chunks.

Schema-less Database
A database that does not require a predefined schema, allowing for flexible data storage and retrieval.

Master-Slave Replication
A replication technique in which writes go to a single master database, which propagates changes to multiple slave databases.

Peer-to-Peer Replication
A replication technique in which multiple databases replicate data with each other, with no single master.

JSON Files
Text-based files that store data in a hierarchical key-value structure.

MongoDB
A popular NoSQL database that uses a flexible, JSON-like document model.

Hadoop Streaming and Pipes
Interfaces for running MapReduce jobs with custom mapper and reducer code: Streaming works with any language that reads stdin and writes stdout (e.g., Python), while Pipes provides a C++ API.

HDFS (Hadoop Distributed File System)
A distributed file system designed to store and process large datasets across multiple nodes.

HDFS Concepts
● NameNode: Manages the file system namespace and block locations.
● DataNode: Stores data blocks.
● Block: A fixed-size chunk of data (128 MB by default in Hadoop 2+).
● Replication: Storing multiple copies of each data block for redundancy.

Data Integrity, Compression, and Serialization
Ensuring data accuracy (e.g., with checksums), reducing data size, and converting data into a format suitable for storage and transmission.

Avro, Map, Reduce Phase
● Avro: A data serialization system for efficient, schema-based data exchange.
● MapReduce: A programming model for processing large datasets in parallel.
● Map Phase: Processes input data and generates intermediate key-value pairs.
● Reduce Phase: Combines key-value pairs with the same key and performs aggregations.

Job Scheduling
The process of managing and executing data processing jobs in a distributed environment.

HBase, Hive, Cassandra Data Model
● HBase: A NoSQL database built on top of HDFS, designed for real-time, random access to large datasets.
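The map and reduce phases described above can be sketched in plain Python. Hadoop runs these same steps in parallel across many nodes, with a shuffle phase between them that groups intermediate pairs by key; here everything runs in one process purely for illustration.

```python
# Minimal single-process sketch of the MapReduce model:
# map phase -> shuffle (group by key) -> reduce phase.
from collections import defaultdict

def map_phase(lines):
    # Emit an intermediate (key, value) pair for every word.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Group values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate the values for each key (here: word counts).
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data big insights",
                                         "data pipelines"])))
```

With Hadoop Streaming, the same mapper and reducer logic would read lines from stdin and write tab-separated key-value pairs to stdout.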
● Hive: A data warehouse infrastructure built on top of Hadoop, enabling SQL-like queries on large datasets.
● Cassandra: A distributed NoSQL database designed for high availability and scalability.

Additional Topics
● Spark: A fast and general-purpose cluster computing system.
● Kafka: A distributed streaming platform for real-time data processing.
● YARN: A resource management system for Hadoop clusters.
● ZooKeeper: A distributed coordination service for managing large-scale distributed systems.

Course repository: https://github.com/prakashumbc/603_BigData
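Hive's core idea — expressing an aggregation over raw records as a SQL query rather than hand-written MapReduce code — can be illustrated with Python's built-in sqlite3 standing in for HiveQL-on-HDFS. This is only an analogy sketch; the table and data are made up for the example, and real Hive compiles such queries into distributed jobs.

```python
# Sketch of the idea behind Hive: a declarative SQL aggregation over
# raw records. sqlite3 stands in for HiveQL here; the transactions
# table and its rows are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("alice", 25.0), ("bob", 40.0), ("alice", 35.0)],
)

# A HiveQL-style aggregation: total spend per customer.
totals = dict(conn.execute(
    "SELECT customer, SUM(amount) FROM transactions GROUP BY customer"
))
```

The point of Hive is that an analyst writes exactly this kind of query, and the engine handles translating it into parallel work over files in HDFS.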