Data Serialization in Big Data
Data serialization is the process of converting an object into a stream of bytes so that
it can be saved or transmitted more easily.
The reverse process—constructing a data structure or object from a series of bytes—
is deserialization. The deserialization process recreates the object, thus making the
data easier to read and modify as a native structure in a programming language.
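The round trip described above can be sketched in a few lines of Python. This is only an illustration: the `Point` class and the `serialize`/`deserialize` helpers are hypothetical names, and JSON encoded as UTF-8 is just one of many possible byte formats.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Point:
    """Hypothetical example object, used only for illustration."""
    x: int
    y: int


def serialize(point: Point) -> bytes:
    # Object -> stream of bytes (here: JSON text encoded as UTF-8).
    return json.dumps(asdict(point)).encode("utf-8")


def deserialize(data: bytes) -> Point:
    # Stream of bytes -> a new native object with the same state.
    return Point(**json.loads(data.decode("utf-8")))


original = Point(x=3, y=4)
payload = serialize(original)
restored = deserialize(payload)
```

Here `restored` is a separate object that compares equal to `original`, which is the sense in which deserialization "recreates" the object.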
Serialization enables us to save the state of an object and recreate the object in a new
location. Serialization encompasses both the storage of the object and exchange of
data. Since objects are composed of several components, saving or delivering all the
parts typically requires significant coding effort, so serialization is a standard way to
capture the object into a sharable format. With serialization, we can transfer objects:
In some distributed systems, data and its replicas are stored in different partitions on
multiple cluster members. If data is not present on the local member, the system will
retrieve that data from another member. This requires serialization for use cases such
as:
Adding key/value objects to a map
Putting items into a queue, set, or list
Sending a lambda function to another server
Processing an entry within a map
Locking an object
Sending a message to a topic
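To make the first use case in this list concrete, here is a toy sketch of a partitioned map. It assumes a simplified model with no real networking: `Member` and `PartitionedMap` are hypothetical classes, JSON stands in for whatever serializer a real system uses, and a CRC32 hash of the serialized key is an assumed way to pick the owning member.

```python
import json
import zlib


class Member:
    """Hypothetical cluster member that holds entries only as bytes."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store = {}  # serialized key -> serialized value


class PartitionedMap:
    """Toy stand-in for a distributed map; everything runs in one process."""

    def __init__(self, members: list) -> None:
        self.members = members

    def _owner(self, key_bytes: bytes) -> Member:
        # Choose the owning member from a stable hash of the serialized key.
        return self.members[zlib.crc32(key_bytes) % len(self.members)]

    def put(self, key, value) -> None:
        # Both key and value are serialized before they leave the caller.
        key_bytes = json.dumps(key).encode("utf-8")
        value_bytes = json.dumps(value).encode("utf-8")
        self._owner(key_bytes).store[key_bytes] = value_bytes

    def get(self, key):
        key_bytes = json.dumps(key).encode("utf-8")
        value_bytes = self._owner(key_bytes).store.get(key_bytes)
        # Deserialize on the way back; None means the key is absent.
        return None if value_bytes is None else json.loads(value_bytes)


cluster = PartitionedMap([Member("a"), Member("b")])
cluster.put("user:1", {"name": "Ada"})
user = cluster.get("user:1")
```

The point of the sketch is that members never see native objects, only bytes, so any entry can be moved or fetched between members without sharing memory.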
Data formats such as JSON and XML are often used as the format for storing
serialized data. Custom binary formats are also used, which tend to be more
space-efficient due to less markup/tagging in the serialization.
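The size difference can be illustrated with a small comparison. In this sketch the `reading` record and its fixed binary layout are assumptions made up for the example; the layout is packed with Python's standard `struct` module.

```python
import json
import struct

# Hypothetical sensor reading, used only for illustration.
reading = {"sensor_id": 17, "temperature": 21.5, "ok": True}

# Text format: the field names travel with every record.
as_json = json.dumps(reading).encode("utf-8")

# Custom binary format: a fixed layout that both sides must agree on.
# "<" = little-endian without padding, "I" = unsigned 32-bit integer,
# "d" = 64-bit float, "?" = boolean stored in one byte.
LAYOUT = struct.Struct("<Id?")
as_binary = LAYOUT.pack(reading["sensor_id"], reading["temperature"], reading["ok"])

# Decoding needs the same layout, because the bytes carry no field names.
sensor_id, temperature, ok = LAYOUT.unpack(as_binary)

print(len(as_json), len(as_binary))
```

The binary record is smaller because it carries no field names or punctuation, but the trade-off is that it cannot be read without knowing the layout in advance.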
Big data systems often include technologies or data that are described as “schemaless.”
This means that the data managed by these systems is not structured in a strict
format defined by a schema. Serialization provides several benefits in this type of
environment:
Serialization in Java
Serialization in Hadoop
Interprocess Communication
RPC uses internal serialization to convert the message into a binary format before
sending it to the remote node over the network. At the other end, the remote system
deserializes the binary stream into the original message.
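The following sketch shows the general shape of that exchange. It is not how any particular RPC framework encodes its messages: the 4-byte length prefix, the JSON body, and the `encode_call`/`decode_call` helpers are all assumptions chosen for illustration, and an in-memory `BytesIO` stream stands in for the network connection.

```python
import io
import json
import struct

# Assumed framing: a 4-byte big-endian length, followed by the message body.
HEADER = struct.Struct(">I")


def encode_call(method: str, params: list) -> bytes:
    """Serialize a call into one length-prefixed binary frame."""
    body = json.dumps({"method": method, "params": params}).encode("utf-8")
    return HEADER.pack(len(body)) + body


def decode_call(stream) -> dict:
    """Read one frame from a byte stream and rebuild the original message."""
    (length,) = HEADER.unpack(stream.read(HEADER.size))
    return json.loads(stream.read(length).decode("utf-8"))


# Hypothetical procedures offered by the "remote" node.
HANDLERS = {"add": lambda a, b: a + b}

# Sender side: build the frame. BytesIO stands in for the network.
wire = io.BytesIO(encode_call("add", [2, 3]))

# Receiver side: deserialize the frame and dispatch to the named procedure.
message = decode_call(wire)
result = HANDLERS[message["method"]](*message["params"])
```

The length prefix lets the receiver know where one message ends and the next begins when several frames share the same connection.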
Compact − To make the best use of network bandwidth, which is the scarcest
resource in a data center.
Fast − Since the communication between the nodes is crucial in distributed systems,
the serialization and deserialization process should be quick, producing less overhead.
Interoperable − The message format should support nodes that are written in
different languages.
Persistent Storage
Persistent storage is a digital storage facility that does not lose its data when the
power supply is lost. Files, folders, and databases are examples of persistent storage.
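As a minimal sketch of serialization for persistent storage, the snippet below writes an object to a file and reads it back. The `save` and `load` helpers, the `state` record, and the file name are hypothetical, and a temporary directory is used only so the example cleans up after itself.

```python
import json
import os
import tempfile


def save(obj, path: str) -> None:
    # Serialize the object and write it to a file on disk.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f)


def load(path: str):
    # Read the file back and deserialize it into a new object.
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Hypothetical job state, used only for illustration.
state = {"job": "wordcount", "processed": 1200}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "state.json")
    save(state, path)
    restored = load(path)
```

In a real deployment the file would live on durable storage rather than in a temporary directory, so the state could be reloaded after a restart or power loss.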