IoT-notes
MODULE 4
By
Divya K S
Dept. of CSE
Module 4 IoT(15CS81)
Structured data means that the data follows a model or schema that defines how the data
is represented or organized, meaning it fits well with a traditional relational database
management system (RDBMS).
In many cases we will find structured data in a simple tabular form—for example, a
spreadsheet where data occupies a specific cell and can be explicitly defined and
referenced.
IoT sensor data often uses structured values, such as temperature, pressure, humidity, and
so on, which are all sent in a known format.
Structured data is easily formatted, stored, queried, and processed.
Examples of commercial software for working with structured data include Microsoft Excel and Tableau.
Unstructured data lacks a logical schema for understanding and decoding the data
through traditional programming means.
Examples of this data type include text, speech, images, and video.
Any data that does not fit neatly into a predefined data model is classified as unstructured
data.
According to some estimates, around 80% of a business’s data is unstructured.
Data analytics methods that can be applied to unstructured data, such as cognitive
computing and machine learning, are deservedly garnering a lot of attention.
With machine learning applications, such as natural language processing (NLP), we can
decode speech.
With image/facial recognition applications, we can extract critical information from still
images and video.
Smart objects in IoT networks generate both structured and unstructured data.
Data in IoT networks is either in transit (“data in motion”) or being held or stored (“data at
rest”).
Examples of data in motion include traditional client/server exchanges, such as web
browsing and file transfers, and email.
Data saved to a hard drive, storage array, or USB drive is data at rest.
From an IoT perspective, the data from smart objects is considered data in motion as it
passes through the network en route to its final destination.
This is often processed at the edge, using fog computing.
When data is processed at the edge, it may be filtered and deleted or forwarded on for
further processing and possible storage at a fog node or in the data center.
Data does not come to rest at the edge.
When data arrives at the data center, it is possible to process it in real-time, just like at the
edge, while it is still in motion.
Tools with this sort of capability, such as Spark, Storm, and Flink, are relatively nascent
compared to the tools for analyzing stored data.
Data at rest in IoT networks can be typically found in IoT brokers or in some sort of storage
array at the data center.
The best-known tool is Hadoop, which helps not only with data processing but also with data storage.
1. Descriptive:
Descriptive data analysis tells us what is happening, either now or in the past.
For example, a thermometer in a truck engine reports temperature values every second.
From a descriptive analysis perspective, we can pull this data at any moment to gain insight
into the current operating condition of the truck engine.
If the temperature value is too high, then there may be a cooling problem or the engine may
be experiencing too much load.
2. Diagnostic:
When we are interested in the “why,” diagnostic data analysis can provide the answer.
Returning to the example of the temperature sensor in the truck engine, we might wonder why
the truck engine failed.
Diagnostic analysis might show that the temperature of the engine was too high, and the
engine overheated.
Applying diagnostic analysis across the data generated by a wide range of smart objects can
provide a clear picture of why a problem or an event occurred.
3. Predictive:
Predictive analysis aims to foretell problems or issues before they occur.
For example, with historical values of temperatures for the truck engine, predictive
analysis could provide an estimate on the remaining life of certain components in the
engine.
These components could then be proactively replaced before failure occurs. Or perhaps if
temperature values of the truck engine start to rise slowly over time, this could indicate
the need for an oil change or some other sort of engine cooling maintenance.
4. Prescriptive:
Prescriptive analysis goes a step beyond predictive and recommends solutions for
upcoming problems.
A prescriptive analysis of the temperature data from a truck engine might calculate various
alternatives to cost-effectively maintain our truck.
These calculations could range from the cost necessary for more frequent oil changes and
cooling maintenance to installing new cooling equipment on the engine or upgrading to a
lease on a model with a more powerful engine.
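The descriptive and predictive steps above can be sketched in a few lines of code. The 110 C overheating limit and the sample readings are illustrative assumptions, not values from any real engine; the predictive step uses a simple least-squares trend line, which is one of many possible techniques.

```python
TEMP_LIMIT_C = 110.0  # assumed overheating threshold (illustrative)

def describe(readings):
    """Descriptive: report the current operating condition."""
    current = readings[-1]
    return {"current": current, "overheating": current > TEMP_LIMIT_C}

def predict_minutes_to_limit(readings):
    """Predictive: fit a linear trend (least squares) to one-per-minute
    readings and estimate minutes remaining before the limit is crossed."""
    n = len(readings)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(readings) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, readings))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None  # temperature is stable or falling: no failure predicted
    return (TEMP_LIMIT_C - readings[-1]) / slope

readings = [90.0, 92.0, 94.0, 96.0, 98.0]  # one reading per minute
print(describe(readings))                  # current state (descriptive)
print(predict_minutes_to_limit(readings))  # minutes until limit (predictive)
```

A prescriptive layer would then compare the cost of acting on this prediction (an oil change, new cooling equipment) against the cost of failure.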
2. Volatility of data:
With relational databases, it is critical that the schema be designed correctly from the
beginning. Changing it later can slow or stop the database from operating.
Due to the lack of flexibility, revisions to the schema must be kept at a minimum.
IoT data, however, is volatile: the data model is likely to change and evolve over time.
A dynamic schema is often required so that data model changes can be made daily or
even hourly.
4. Another challenge that IoT brings to analytics is in the area of network data, which is
referred to as network analytics.
5. With the large numbers of smart objects in IoT networks that are communicating and
streaming data, it can be challenging to ensure that these data flows are effectively
managed, monitored, and secure.
Network analytics tools such as Flexible NetFlow and IPFIX provide the capability to detect
irregular patterns or other problems in the flow of IoT data through a network
4. Machine Learning:
Machine learning, deep learning, neural networks, and convolutional networks are related
to big data and IoT.
ML is central to IoT.
Data collected by smart objects needs to be analyzed, and intelligent actions need to be
taken based on these analyses.
Performing this kind of operation manually is almost impossible (or very, very slow
and inefficient).
Machines are needed to process information fast and react instantly when thresholds
are met.
For example, every time a new advance is made in the field of self-driving vehicles,
abnormal pattern recognition in a crowd, or any other automated intelligent and machine-
assisted decision system, ML is named as the tool that made the advance possible.
Machine learning is part of a larger set of technologies commonly grouped under the
term artificial intelligence (AI).
AI includes any technology that allows a computing system to mimic human intelligence
using any technique, from very advanced logic to basic “if-then-else” decision loops.
Any computer that uses rules to make decisions belongs to this realm.
A simple example is an app that can help us find our parked car.
A GPS reading of our position at regular intervals calculates our speed.
A basic threshold system determines whether we are driving (for example, “if speed >
20 mph or 30 km/h, then we are driving”).
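The parked-car logic can be sketched as follows: compute speed from two successive GPS fixes and apply the 20 mph threshold from the text. The haversine distance formula is standard; the coordinates and fix interval are made-up illustrations.

```python
from math import radians, sin, cos, asin, sqrt

SPEED_THRESHOLD_MPH = 20.0   # threshold from the text
EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS fixes, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def is_driving(fix_a, fix_b, seconds_between):
    """Threshold rule: 'if speed > 20 mph, then we are driving'."""
    miles = haversine_miles(*fix_a, *fix_b)
    mph = miles / (seconds_between / 3600.0)
    return mph > SPEED_THRESHOLD_MPH

# Two fixes about 0.7 miles apart, 60 seconds apart: clearly driving.
print(is_driving((37.7749, -122.4194), (37.7849, -122.4194), 60))
```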
A typical example is a dictation program that runs on a computer.
The program is configured to recognize the audio pattern of each word in a dictionary, but it
does not know the specifics of our voice: accent, tone, speed, and so on.
Unsupervised Learning:
For example, we may decide to group the engines by the sound they make at a given
temperature.
A standard algorithm for this kind of grouping, K-means clustering, finds the mean values
for a group of engines (for example, mean value for temperature, mean frequency for
sound).
Grouping the engines this way can quickly reveal several types of engines that all belong
to the same category (for example, small engine of chainsaw type, medium engine of
lawnmower type).
All engines of the same type produce sounds and temperatures in the same range as
the other members of the same group.
There will occasionally be an engine in the group that displays unusual characteristics
(slightly out of expected temperature or sound range).
This is the engine that you send for manual evaluation.
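The engine-grouping example can be sketched with a minimal K-means implementation over (temperature, sound frequency) pairs, flagging the engine farthest from any cluster mean for manual evaluation. The data values are made up, and seeding the centers with the first k points is a simplification of real K-means initialization.

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    """Component-wise mean of a list of feature vectors."""
    return tuple(sum(vals) / len(pts) for vals in zip(*pts))

def kmeans(points, k, iterations=20):
    centers = list(points[:k])  # simplification: seed with the first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest center
            clusters[min(range(k), key=lambda i: dist2(p, centers[i]))].append(p)
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# (temperature C, dominant sound frequency Hz) -- illustrative values
engines = [(60, 900), (62, 910), (61, 905),   # chainsaw-type engines
           (80, 400), (82, 410), (81, 405),   # lawnmower-type engines
           (95, 430)]                         # a mower running unusually hot

centers, clusters = kmeans(engines, k=2)
# Flag the engine farthest from every cluster mean for manual evaluation.
worst = max(engines, key=lambda p: min(dist2(p, c) for c in centers))
print(worst)  # (95, 430)
```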
5. Neural Networks:
Neural networks are ML methods that mimic the way the human brain works.
When we look at a human figure, multiple zones of our brain are activated to recognize
colors, movements, facial expressions, and so on.
Our brain combines these elements to conclude that the shape we are seeing is human.
Neural networks mimic the same logic.
The information goes through different algorithms (called units), each of which is in
charge of processing an aspect of the information.
The resulting value of one unit computation can be used directly or fed into another unit
for further processing to occur.
In this case, the neural network is said to have several layers.
For example, a neural network processing human image recognition may have two units
in a first layer that determine whether the image has straight lines and sharp angles,
because vehicles commonly have straight lines and sharp angles, and human figures do
not.
If the image passes the first layer successfully (because there are no or only a small
percentage of sharp angles and straight lines), a second layer may look for different
features (presence of face, arms, and so on), and then a third layer might compare the image
to images of various animals and conclude that the shape is a human (or not).
The great efficiency of neural networks is that each unit processes a simple test, and
therefore computation is quite fast.
When the result of one layer is fed into another layer, the process is called deep learning.
One advantage of deep learning is that having more layers allows for richer intermediate
processing and representation of the data.
At each layer, the data can be formatted to be better utilized by the next layer.
This process increases the efficiency of the overall result.
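The layered image-recognition example above can be illustrated with a toy sketch in which each "unit" is a small hand-written function scoring one aspect of an image and one layer feeds the next. Real neural networks learn these functions from data; the field names and thresholds here are assumptions made purely to show the layering idea.

```python
def layer1_sharp_angles(image):
    """Layer 1 unit: fraction of edges that are straight lines / sharp angles."""
    return image["sharp_edge_fraction"]

def layer2_features(image):
    """Layer 2 units: look for face- and limb-like features."""
    return (image["has_face"], image["has_arms"])

def classify(image):
    # Layer 1: mostly straight lines and sharp angles -> likely a vehicle.
    if layer1_sharp_angles(image) > 0.5:
        return "vehicle"
    # Layer 2: the surviving image is fed into the next layer of units.
    has_face, has_arms = layer2_features(image)
    return "human" if has_face and has_arms else "unknown"

print(classify({"sharp_edge_fraction": 0.1, "has_face": True, "has_arms": True}))
print(classify({"sharp_edge_fraction": 0.8, "has_face": False, "has_arms": False}))
```

Each unit performs a simple, fast test, which is the efficiency property the text describes.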
1. Monitoring:
Smart objects monitor the environment where they operate.
Data is processed to better understand the conditions of operations.
These conditions can refer to external factors, such as air temperature, humidity, or
presence of carbon dioxide in a mine, or to operational internal factors, such as the pressure
of a pump, the viscosity of oil flowing in a pipe, and so on.
ML can be used with monitoring to detect early failure conditions or to better evaluate the
environment (such as shape recognition for a robot automatically sorting material or
picking goods in a warehouse or a supply chain).
2. Behavior control:
3.Operations optimization:
4. Self-healing, self-optimizing:
1. Velocity
2. Variety
3. Volume
1. Velocity:
Velocity refers to how quickly data is being collected and analyzed.
The Hadoop Distributed File System (HDFS) is designed to ingest and process data very quickly.
Smart objects can generate machine and sensor data at a very fast rate and require database
or file systems capable of equally fast ingest functions.
2. Variety:
Variety refers to the different types of data, which are categorized as structured, semi-
structured, or unstructured.
Different database technologies may only be capable of accepting one of these types.
Hadoop is able to collect and store all three types.
3. Volume:
Volume refers to the scale of the data.
This is measured from gigabytes on the very low end to petabytes or even exabytes of data
on the other extreme.
Big data implementations scale beyond what is available on locally attached storage disks
on a single node.
It is common to see clusters of servers that consist of dozens, hundreds, or even thousands
of nodes for some large deployments.
Massively parallel processing (MPP) databases were built on the concept of the relational
data warehouse but are designed to be much faster and more efficient and to support
reduced query times.
To accomplish this, MPP databases take advantage of multiple nodes (computers)
designed in a scale out architecture such that both data and processing are distributed
across multiple systems.
MPPs are sometimes referred to as analytic databases because they are designed to allow
for fast query processing and often have built-in analytic functions.
These database types process massive data sets in parallel across many processors and
nodes.
An MPP architecture typically contains a single master node that is responsible for the
coordination of all the data storage and processing across the cluster.
It operates in a “shared-nothing” fashion, with each node containing local processing,
memory, and storage and operating independently
Data storage is optimized across the nodes in a structured SQL-like format that allows data
analysts to work with the data using common SQL tools and applications.
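The shared-nothing MPP idea can be sketched as follows: data is partitioned across worker "nodes", each node aggregates its own partition independently, and a master combines the partial results. Here the nodes are threads in one process purely for illustration; a real MPP database distributes data and processing across separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

def local_aggregate(partition):
    """Each node computes (sum, count) over only its local data."""
    return sum(partition), len(partition)

def mpp_average(rows, nodes=4):
    # Master node: partition the rows across the nodes (round-robin here).
    partitions = [rows[i::nodes] for i in range(nodes)]
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partials = list(pool.map(local_aggregate, partitions))
    # Master node: combine the partial results into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

print(mpp_average(list(range(1, 101))))  # 50.5
```

The key design point is that no node ever reads another node's partition: only the small (sum, count) partials travel back to the master.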
NoSQL Databases:
NoSQL (“not only SQL”) is a class of databases that support semi-structured and
unstructured data, in addition to the structured data handled by data warehouses and
MPPs.
NoSQL is not a specific database technology; rather, it encompasses several different types
of databases, including the following:
1. Document stores
2. Key-value stores
3. Wide-column stores
4. Graph stores
1. Document stores:
This type of database stores semi-structured data, such as XML or JSON.
Document stores generally have query engines and indexing features that allow for many
optimized queries.
2. Key-value stores:
This type of database stores associative arrays where a key is paired with an associated
value.
These databases are easy to build and easy to scale.
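A bare-bones sketch can show why the key-value model is so easy to scale: every operation touches exactly one key, so keys can be hash-partitioned across independent nodes. The class and key names below are illustrative, and the "nodes" are plain dictionaries standing in for separate servers.

```python
class KeyValueStore:
    def __init__(self, node_count=3):
        self.nodes = [{} for _ in range(node_count)]  # one dict per "node"

    def _node_for(self, key):
        # Hash-partitioning: each key always maps to the same node.
        return self.nodes[hash(key) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)

store = KeyValueStore()
store.put("sensor:42:temp", 21.5)
print(store.get("sensor:42:temp"))  # 21.5
```

Adding capacity means adding nodes and re-partitioning keys, with no cross-key joins or schema to migrate.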
3. Wide-column stores:
This type of database is similar to a key-value store, but the formatting of the values
can vary from row to row, even in the same table.
4. Graph stores:
This type of database is organized based on the relationships between elements.
Graph stores are commonly used for social media and natural language processing.
NoSQL was developed to support the high-velocity, urgent data requirements of modern
web applications that typically do not require much repeated use.
NoSQL is built to scale horizontally, allowing the database to span multiple hosts.
Hadoop:
NameNodes:
These are a critical piece in data adds, moves, deletes, and reads on HDFS.
They coordinate where the data is stored, and maintain a map of where each block of
data is stored and where it is replicated.
All interaction with HDFS is coordinated through the primary (active) NameNode, with
a secondary (standby) NameNode notified of the changes in the event of a failure of the
primary.
The NameNode takes write requests from clients and distributes those files across the
available nodes in configurable block sizes, usually 64 MB or 128 MB blocks.
The NameNode is also responsible for instructing the DataNodes where replication
should occur.
DataNodes:
These are the servers where the data is stored at the direction of the NameNode.
It is common to have many DataNodes in a Hadoop cluster to store the data.
Data blocks are distributed across several nodes and often are replicated three, four, or
more times across nodes for redundancy.
Once data is written to one of the DataNodes, the DataNode selects two (or more)
additional nodes, based on replication policies, to ensure data redundancy across the
cluster.
Disk redundancy techniques such as Redundant Array of Independent Disks (RAID) are
generally not used for HDFS because the NameNodes and DataNodes coordinate block-
level redundancy with this replication technique.
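The NameNode/DataNode interaction above can be sketched as a block-placement plan: the NameNode splits a file into fixed-size blocks and records which DataNodes hold each block and its replicas. This is a simplified round-robin model, not the real HDFS placement policy; the node names are made up.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB blocks, one of the usual sizes

def plan_blocks(file_size, datanodes, replication=3):
    """Return a NameNode-style map: block index -> list of DataNodes."""
    block_count = -(-file_size // BLOCK_SIZE)  # ceiling division
    block_map = {}
    for b in range(block_count):
        # Place each replica on a different node (round-robin here).
        block_map[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return block_map

nodes = ["dn1", "dn2", "dn3", "dn4"]
plan = plan_blocks(file_size=300 * 1024 * 1024, datanodes=nodes)
print(len(plan))  # a 300 MB file needs 3 blocks of 128 MB
print(plan[0])    # the 3 DataNodes holding replicas of block 0
```

Because replicas land on distinct nodes, block-level redundancy is achieved without RAID, as the text notes.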
3. Time sensitivity:
When timely response to data is required, passing data to the cloud for future processing
results in unacceptable latency.
Streaming analytics at the edge can be broken down into three simple stages:
1. Input streams: This is the raw data coming from the sensors into the analytics
processing unit.
2. Analytics processing unit (APU): The APU filters and combines data streams (or
separates the streams, as necessary), organizes them by time windows, and performs
various analytical functions. It is at this point that the results may be acted on by
microservices running in the APU.
3. Output streams: The data that is output is organized into insightful streams, is used
to influence the behavior of smart objects, and is passed on for storage and further
processing in the cloud. Communication with the cloud often happens through a standard
publisher/subscriber messaging protocol, such as MQTT.
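The publisher/subscriber pattern used between the edge and the cloud can be sketched in-process with a toy broker. A real deployment would use an MQTT broker (for example, Mosquitto) and an MQTT client library; this sketch only shows the decoupling: publishers and subscribers know topic names, not each other. The topic and payload below are made up.

```python
class Broker:
    """Toy in-process stand-in for an MQTT broker."""
    def __init__(self):
        self.subscribers = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, payload):
        # Deliver the payload to every subscriber of this topic.
        for callback in self.subscribers.get(topic, []):
            callback(payload)

broker = Broker()
received = []
broker.subscribe("oilrig/pressure", received.append)  # cloud-side consumer
broker.publish("oilrig/pressure", {"psi": 870})       # edge-side APU output
print(received)  # [{'psi': 870}]
```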
1. Filter:
The streaming data generated by IoT endpoints is likely to be very large, and most of it
is irrelevant. For example, a sensor may simply poll on a regular basis to confirm that it
is still reachable.
The filtering function identifies the information that is considered important.
2. Transform:
In the data warehousing world, Extract, Transform, and Load (ETL) operations are used
to manipulate the data structure into a form that can be used for other purposes.
Analogous to data warehouse ETL operations, in streaming analytics, once the data is
filtered, it needs to be formatted for processing.
3. Time:
As the real-time streaming data flows, a timing context needs to be established. This
could be used to correlate average temperature readings from sensors on a minute-by-
minute basis.
For example, consider an APU that takes input data from multiple sensors reporting
temperature fluctuations. In this case, the APU is programmed to report the average
temperature every minute from the sensors, based on an average of the past two minutes.
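The filter, transform, and time stages can be sketched as a small pipeline over a stream of sensor records. The keepalive filtering and the minute-by-minute reports over a two-minute window follow the text; the record field names (`type`, `t`, `f`) and sample values are assumptions.

```python
def filter_stage(stream):
    """Filter: drop keepalive polls, keep only real measurements."""
    return (r for r in stream if r["type"] == "measurement")

def transform_stage(stream):
    """Transform: normalize records to (timestamp_s, celsius) tuples."""
    return ((r["t"], (r["f"] - 32) * 5 / 9) for r in stream)

def windowed_average(readings, window_s=120, report_every_s=60):
    """Time: every minute, average the readings of the past two minutes."""
    reports = []
    horizon = max(t for t, _ in readings)
    end = report_every_s
    while end <= horizon:
        window = [v for t, v in readings if end - window_s < t <= end]
        if window:
            reports.append((end, sum(window) / len(window)))
        end += report_every_s
    return reports

raw = [
    {"type": "keepalive", "t": 10, "f": 0},       # filtered out
    {"type": "measurement", "t": 30, "f": 212},   # 100 C
    {"type": "measurement", "t": 90, "f": 32},    # 0 C
    {"type": "measurement", "t": 150, "f": 122},  # 50 C
]
readings = list(transform_stage(filter_stage(raw)))
print(windowed_average(readings))  # [(60, 100.0), (120, 50.0)]
```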
4. Correlate:
5. Match patterns:
Once the data streams are properly cleaned, transformed, and correlated with other live
streams as well as historical data sets, pattern matching operations are used to gain
deeper insights into the data.
For example, say that the APU has been collecting the patient’s vitals for some time and
has gained an understanding of the expected patterns for each variable being monitored.
If an unexpected event arises, such as a sudden change in heart rate or respiration, the
pattern matching operator recognizes this as out of the ordinary and can take certain
actions, such as generating an alarm to the nursing staff.
The patterns can be simple relationships, or they may be complex, based on the criteria
defined by the application.
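The pattern-matching step in the vitals example can be sketched with a simple statistical rule: build a baseline from the vitals already collected, then flag any new reading whose z-score falls far outside the expected range. The z-score threshold and heart-rate values are illustrative assumptions; real systems use far richer pattern criteria.

```python
from statistics import mean, stdev

def is_out_of_pattern(history, new_value, z_threshold=3.0):
    """Flag a reading more than z_threshold standard deviations
    from the baseline established by the collected history."""
    mu, sigma = mean(history), stdev(history)
    return abs(new_value - mu) > z_threshold * sigma

heart_rate_history = [72, 74, 71, 73, 75, 72, 74, 73]  # beats per minute
print(is_out_of_pattern(heart_rate_history, 74))   # within the expected pattern
print(is_out_of_pattern(heart_rate_history, 120))  # out of pattern: raise an alarm
```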
Streaming analytics may be performed directly at the edge, in the fog, or in the cloud
data center.
Fog analytics allows us to see beyond one device, giving visibility into an aggregation of
edge nodes and allowing us to correlate data from a wider set.
The figure shows an example of an oil drilling company that is measuring both pressure
and temperature on an oil rig.
Sensors communicate via MQTT through a message broker to the fog analytics node,
allowing a broader data set.
The fog node is located on the same oil rig and performs streaming analytics from
several edge devices, giving it better insights due to the expanded data set.
Network Analytics:
Network analytics has the power to analyze details of communications patterns made by
protocols and correlate this across the network.
It quickly identifies anomalies that suggest network problems due to suboptimal paths,
intrusive malware, or excessive congestion.
Network analytics offer capabilities to cope with capacity planning for scalable IoT
deployment as well as security monitoring in order to detect abnormal traffic volume and
patterns (such as an unusual traffic spike for a normally quiet protocol) for both
centralized and distributed architectures, such as fog computing.
Flow collection from the network layer provides global and distributed near-real-time
monitoring capabilities.
IPv4 and IPv6 network wide traffic volume and pattern analysis helps administrators
proactively detect problems and quickly troubleshoot and resolve problems when they
occur.
Monitoring and profiling can be used to gain a detailed time-based view of IoT access
services, such as the application-layer protocols, including MQTT, CoAP, and DNP3, as
well as the associated applications that are being used over the network.
3. Capacity planning:
Flow analytics can be used to track and anticipate IoT traffic growth and help in the
planning of upgrades when deploying new locations or services by analyzing captured
data over a long period of time.
This analysis affords the opportunity to track and anticipate IoT network growth on a
continual basis.
4. Security analysis:
Because most IoT devices typically generate a low volume of traffic and always send
their data to the same server(s), any change in network traffic behavior may indicate a
cyber security event, such as a denial of service (DoS) attack.
Security can be enforced by ensuring that no traffic is sent outside the scope of the IoT
domain.
For example, with a LoRaWAN gateway, there should be no reason to see traffic sent or
received outside the LoRaWAN network server and network management system.
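The LoRaWAN example above amounts to a destination whitelist check over flow records: any flow whose destination falls outside the known network server and management system is flagged. The hostnames, record fields, and flow data below are made-up illustrations.

```python
# Assumed whitelist: the LoRaWAN network server and the network
# management system are the only legitimate destinations.
ALLOWED_DESTINATIONS = {"lorawan-ns.example.net", "nms.example.net"}

def suspicious_flows(flow_records):
    """Return flows whose destination is outside the IoT domain."""
    return [f for f in flow_records if f["dst"] not in ALLOWED_DESTINATIONS]

flows = [
    {"src": "gw-7", "dst": "lorawan-ns.example.net", "bytes": 1200},
    {"src": "gw-7", "dst": "203.0.113.99", "bytes": 500_000},  # unexpected
]
print(suspicious_flows(flows))  # flags only the flow to 203.0.113.99
```

A real deployment would derive the flow records from NetFlow/IPFIX exports rather than hand-built dictionaries.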
5. Accounting:
In field area networks, routers or gateways are often physically isolated and leverage
public cellular services and VPNs for backhaul.
Deployments may have thousands of gateways connecting the last-mile IoT
infrastructure over a cellular network.
Flow monitoring can thus be leveraged to analyze and optimize the billing, in
complement with other dedicated applications, such as Cisco Jasper, with a broader
scope than just monitoring data flow.
Flow data (or derived information) can be warehoused for later retrieval and analysis in
support of proactive analysis of multiservice IoT infrastructures and applications.
Two of the major challenges in securing industrial environments have been initial design
and ongoing maintenance.
The initial design challenges arose from the concept that networks were safe due to
physical separation from the enterprise with minimal or no connectivity to the outside
world, and from the assumption that attackers lacked sufficient knowledge to carry out
security attacks.
From a security design perspective, it is better to know that communication paths are
insecure than to not know the actual communication paths.
This kind of organic growth has led to miscalculations of expanding networks and the
introduction of wireless communication in a standalone fashion, without consideration of
the impact to the original security design.
These uncontrolled or poorly controlled OT network evolutions have, in many cases, over
time led to weak or inadequate network and systems security.
In many industries, the control systems consist of packages, skids, or components that are
self-contained and may be integrated as semi-autonomous portions of the network.
These packages may not be as fully or tightly integrated into the overall control system,
network management tools, or security applications, resulting in potential risk.
Due to the static nature and long lifecycles of equipment in industrial environments, many
operational systems may be deemed legacy systems.
For example, in a power utility environment, it is not uncommon to have racks of old
mechanical equipment still operating alongside modern intelligent electronic devices
(IEDs).
From a security perspective, this is potentially dangerous as many devices may have
historical vulnerabilities or weaknesses that have not been patched and updated, or it may
be that patches are not even available due to the age of the equipment.
Communication methods and protocols may be generations old and must be interoperable
with the oldest operating entity in the communications path.
This includes switches, routers, firewalls, wireless access points, servers, remote access
systems, patch management, and network management tools.
All of these may have exploitable vulnerabilities and must be protected.
Three examples of this are a frequent lack of authentication between communication endpoints,
no means of securing and protecting data at rest or in motion, and insufficient granularity of
control to properly specify recipients or avoid default broadcast approaches.
Modbus:
This could open up the potential for protocol abuse in the system.
ICCP (Inter-Control Center Communications Protocol) is a common control protocol in
utilities across North America that is frequently used to communicate between utilities.
Given that it must traverse the boundaries between different networks, it holds an extra
level of exposure and risk that could expose a utility to cyber attack.
Initial versions of ICCP had several significant gaps in the area of security.
One key vulnerability is that the system did not require authentication for
communication.
Second, encryption across the protocol was not enabled as a default condition, thus
exposing connections to man-in-the-middle (MITM) and replay attacks.