Big Data Intro
Characteristics of Data:
1. Composition: The composition of data deals with the structure of data
• the sources of data
• the granularity
• the types
• the nature of the data, i.e., whether it is static or streaming in real time.
2. Condition: The condition of data deals with the state of data
• exploding volume
• the speed at which the data is being generated
• the speed at which it needs to be processed
• the variety of data (internal or external, behavioral or social) that is being generated.
Volume:
• We have seen it grow from bits to bytes to petabytes and exabytes.
Sources of data
1. Internal data sources: Data residing within an organization's firewall.
• Data storage: File systems, SQL (RDBMSs such as Oracle, MS SQL Server, DB2, MySQL, PostgreSQL, etc.), NoSQL (MongoDB, Cassandra, etc.), and so on.
• Archives: Archives of scanned documents, paper archives, customer
correspondence records, patients' health records, students' admission records,
students' assessment records, and so on.
2. External data sources: Data residing outside an organization's firewall.
• Public Web: Wikipedia, weather, regulatory, compliance, census, etc.
3. Both (internal + external data sources)
• Sensor data: Car sensors, smart electric meters, office buildings, air conditioning
units, refrigerators, and so on.
• Machine log data: Event logs, application logs, business process logs, audit logs, clickstream data.
• Social media: Twitter, blogs, Facebook, LinkedIn, YouTube, Instagram, etc.
Velocity:
• We have moved from the days of batch processing to real-time processing.
• Batch → Periodic → Near real-time → Real-time processing
Variety:
Variety deals with a wide range of data types and sources of data.
Types of Data
Digital data is classified into the following categories:
• Structured data
• Semi-structured data
• Unstructured data
(Figure: approximate percentage distribution of digital data.)
Structured Data
• This is the data which is in an organized form (e.g., in rows and columns) and can be
easily used by a computer program.
• Relationships exist between entities of data, such as classes and their objects.
• Data stored in databases is an example of structured data.
(Figure: structured data — stored in databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.; works readily with input/update/delete operations, security, scalability, OLTP systems, and transaction processing.)
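To make the rows-and-columns idea concrete, here is a minimal sketch using Python's built-in sqlite3 module; the customers table and its columns are invented for illustration.

```python
import sqlite3

# Structured data: a fixed schema of typed rows and columns.
conn = sqlite3.connect(":memory:")  # throwaway in-memory relational database
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ravi", "Delhi")],
)

# Because the schema is known up front, a program can query the data directly.
for (name,) in conn.execute("SELECT name FROM customers WHERE city = ?", ("Pune",)):
    print(name)  # -> Asha
conn.close()
```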
Semi-structured Data
• This is the data which does not conform to a data model but has some structure.
• However, it is not in a form which can be used easily by a computer program.
(Figure: semi-structured data — self-describing (label/value pairs); schema information is often blended with the data values.)
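A short sketch of what "self-describing" and "schema blended with data values" look like in practice, using JSON, a common semi-structured format; the records are invented.

```python
import json

# Each record carries its own field names (label/value pairs), and records
# in the same collection need not share the same fields.
records = [
    '{"name": "Asha", "city": "Pune", "phones": ["98xxxx", "99xxxx"]}',
    '{"name": "Ravi", "email": "ravi@example.com"}',
]

for raw in records:
    doc = json.loads(raw)
    # Without a fixed schema, a program must probe for fields at run time.
    print(doc["name"], doc.get("email", "no email on record"))
```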
Unstructured Data
• This is the data which does not conform to a data model or is not in a form which can be
used easily by a computer program.
• About 80–90% of an organization's data is in this format.
(Figure: examples of unstructured data — images, free-form text, audio, video, bodies of emails, text messages, chats, social media data, Word documents.)
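To see why unstructured data cannot be used easily by a program, consider a sketch that imposes a first crude structure (word counts) on free-form text; a real pipeline would use proper NLP tooling.

```python
from collections import Counter

# Free-form text has no schema; the program must impose structure on it.
email_body = (
    "Hi team, the quarterly numbers look strong. "
    "Numbers for the Pune office are attached."
)
tokens = [word.strip(".,").lower() for word in email_body.split()]
print(Counter(tokens).most_common(3))  # -> [('the', 2), ('numbers', 2), ('hi', 1)]
```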
Data Mining
• By mining behavioral and social data, marketers can communicate with consumers when they are emotionally engaged, regardless of the channel.
Fraud and Big Data
• Fraud is intentional deception made for personal gain or to damage another individual.
• One of the most common forms of fraudulent activity is credit card fraud.
• Even though fraud detection is improving, the rate of incidents is rising.
• This means banks need more proactive approaches to prevent fraud.
• Social media and mobile phones are forming the new frontiers for fraud.
• To prevent fraud, credit card transactions are monitored and checked in near real time.
• If the checks identify pattern inconsistencies and suspicious activity, the transaction is
identified for review and escalation.
• The Capgemini Financial Services team believes that due to the nature of data streams
and processing required, Big Data technologies provide an optimal technology solution
based on the following three Vs:
1. High volume. Years of customer records and transactions (150 billion records per year)
2. High velocity. Dynamic transactions and social media information
3. High variety. Social media plus other unstructured data such as customer emails, call center
conversations, as well as transactional structured data
• Capgemini's new fraud Big Data initiative focuses on flagging suspicious credit card transactions to prevent fraud in near real time via multi-attribute monitoring.
• Real-time inputs involving transaction data and customer records are monitored via validity checks and detection rules.
• Pattern recognition is performed against the data to score and weight individual
transactions across each of the rules and scoring dimensions.
• A cumulative score is then calculated for each transaction record and compared against
thresholds to decide if the transaction is potentially suspicious or not.
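A minimal sketch of the scoring flow described above: each detection rule contributes a weighted score, the weights are accumulated per transaction, and the cumulative score is compared against a threshold. The rule names, weights, and threshold are hypothetical, not Capgemini's actual values.

```python
# Hypothetical detection rules: (name, weight, predicate over a transaction).
RULES = [
    ("high_amount",     2.0, lambda t: t["amount"] > 50_000),
    ("foreign_country", 1.5, lambda t: t["country"] != t["home_country"]),
    ("odd_hour",        1.0, lambda t: t["hour"] < 5),
]
THRESHOLD = 3.0  # cumulative score at or above this flags the transaction

def score_transaction(txn: dict) -> tuple[float, bool]:
    """Accumulate weighted rule scores and compare against the threshold."""
    total = sum(weight for _, weight, matches in RULES if matches(txn))
    return total, total >= THRESHOLD

txn = {"amount": 80_000, "country": "FR", "home_country": "IN", "hour": 3}
print(score_transaction(txn))  # -> (4.5, True): escalate for review
```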
• Elasticsearch: a distributed, free/open-source search server.
• Using this tool, large historical data sets can be used in conjunction with real-
time data to identify deviations from typical payment patterns.
• This Big Data component allows overall historical patterns to be compared and
contrasted, and allows the number of attributes and characteristics about
consumer behavior to be very wide, with little impact on overall performance.
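As a sketch of how such a historical comparison might look, assuming the official elasticsearch-py client (8.x style keyword arguments) and a hypothetical transactions index: a standard avg aggregation pulls one customer's historical payment level so an incoming amount can be compared against it.

```python
from elasticsearch import Elasticsearch  # official Python client

es = Elasticsearch("http://localhost:9200")

# Average historical payment amount for one customer over the past year.
resp = es.search(
    index="transactions",  # hypothetical index of historical payments
    size=0,                # we only want the aggregation, not the hits
    query={
        "bool": {
            "filter": [
                {"term": {"customer_id": "C-1042"}},
                {"range": {"timestamp": {"gte": "now-365d"}}},
            ]
        }
    },
    aggs={"avg_amount": {"avg": {"field": "amount"}}},
)
historical_avg = resp["aggregations"]["avg_amount"]["value"]
print(historical_avg)  # compare an incoming amount against this baseline
```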
• Percolator: a system for incrementally processing updates to large data sets.
• The percolator query can handle both structured and unstructured data.
• This provides scalability to the event-processing framework and allows specific suspicious transactions to be enriched with additional unstructured information: phone location/geospatial records, customer travel schedules, and so on.
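If the percolator query referred to here is Elasticsearch's percolator feature, the flow is a reverse search: detection rules are stored as queries, and each incoming transaction document is matched against them. A sketch, with invented index and field names:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Store detection rules as queries in a percolator-typed field.
es.indices.create(
    index="fraud-rules",
    mappings={
        "properties": {
            "query":   {"type": "percolator"},
            "amount":  {"type": "double"},
            "country": {"type": "keyword"},
        }
    },
)
es.index(
    index="fraud-rules",
    id="high-amount-abroad",
    document={
        "query": {
            "bool": {
                "must": [
                    {"range": {"amount": {"gte": 50000}}},
                    {"term": {"country": "FR"}},
                ]
            }
        }
    },
)
es.indices.refresh(index="fraud-rules")

# Reverse search: which stored rules match this incoming transaction?
resp = es.search(
    index="fraud-rules",
    query={
        "percolate": {
            "field": "query",
            "document": {"amount": 80000, "country": "FR"},
        }
    },
)
print([hit["_id"] for hit in resp["hits"]["hits"]])  # -> ['high-amount-abroad']
```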
Risk and Big Data
• Credit risk analytics focus on understanding the likelihood that a borrower will default on a loan.
• For example, "Is this person likely to default on their Rs.300,000 mortgage?"
• Market risk analytics focus on understanding the likelihood that the value of a portfolio will decrease due to changes in stock prices, interest rates, foreign exchange rates, and commodity prices.
• For example, “Should we sell this holding if the price drops another 10 percent?”
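As a toy illustration of that question, a stop-loss check: flag a holding for sale once its price falls a further 10 percent below today's level. All numbers are invented.

```python
current_price = 450.0             # hypothetical price today
stop_loss = current_price * 0.90  # trigger: 10% below the current level

def should_sell(observed_price: float) -> bool:
    """True once the observed price breaches the stop-loss trigger."""
    return observed_price <= stop_loss

print(stop_loss)           # -> 405.0
print(should_sell(400.0))  # -> True: price dropped more than 10%
print(should_sell(430.0))  # -> False
```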