FDS qb cie-2
Graph databases are useful for data science because they can handle complex and dynamic data that
involve many-to-many relationships, such as social networks, recommendation systems, fraud
detection, knowledge graphs, etc. Graph databases allow data scientists to perform efficient and flexible
graph analytics, such as finding shortest paths, clustering, centrality, community detection, etc. Graph
databases also enable natural and intuitive data modeling, as the data is stored in the same way as it is
conceptualized.
Some examples of graph databases are Neo4j, Amazon Neptune, Microsoft Azure Cosmos DB, and
ArangoDB.
2 Briefly explain types of graph databases
A There are two main types of graph databases based on their data model: RDF graphs and property graphs.
• RDF graphs use the concept of a triple, which is a statement composed of three elements: subject-
predicate-object. For example, “Alice-knows-Bob” is a triple that represents a relationship between
two entities. RDF graphs are useful for data integration, as they can link data from different sources
using common vocabularies and standards.
• Property graphs use the concept of a node and an edge, where nodes represent entities and edges
represent relationships between them. Both nodes and edges can have properties, which are key-
value pairs that store additional information. For example, a node representing a person can have
properties such as name, age, gender, etc. Property graphs are useful for queries and analytics, as
they can perform complex operations on the graph structure and properties.
Some examples of RDF graph databases are Apache Jena, Stardog, and GraphDB. Some examples of property graph databases are Neo4j, TigerGraph, and Amazon Neptune.
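As a rough sketch, a property-graph-style structure (nodes and edges that each carry key-value properties) can be modeled in R with the igraph package; the entities, relationships, and property values below are invented for illustration:
```
# Nodes (entities) with a property, and edges (relationships) with a property
library(igraph)

nodes <- data.frame(name = c("Alice", "Bob"), age = c(30, 25))
edges <- data.frame(from = "Alice", to = "Bob", relation = "knows")

# Extra columns in the data frames become node and edge properties
g <- graph_from_data_frame(edges, directed = TRUE, vertices = nodes)
V(g)$age        # node (vertex) properties
E(g)$relation   # edge properties
```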
OLAP is implemented in data mining by integrating online analytical processing with data mining
functionalities, such as classification, clustering, association, etc. This integration is called Online
Analytical Mining (OLAM) and it allows users to perform data mining on different subsets of data and at
different levels of abstraction. OLAM can achieve this by drilling, pivoting, filtering, dicing, and slicing on
a data cube and intermediate data mining outcomes.
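As a rough illustration of slicing, dicing, and rolling up, a small data cube can be modeled as a 3-D array in base R; the dimensions and values below are hypothetical:
```
# A 2 x 3 x 4 sales cube: product x region x month (values are made up)
cube <- array(1:24, dim = c(2, 3, 4),
              dimnames = list(product = c("P1", "P2"),
                              region  = c("R1", "R2", "R3"),
                              month   = paste0("M", 1:4)))

cube[, , "M1"]                  # slice: fix the month dimension
cube["P1", c("R1", "R2"), 1:2]  # dice: select sub-ranges of each dimension
apply(cube, c(1, 2), sum)       # roll-up: aggregate sales over all months
```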
2. Hierarchical databases: These are databases that store data in a tree-like structure, where each
record has one parent record and zero or more child records. They are useful for representing data
that has a natural hierarchy, such as organizational charts, file systems, etc. They are fast and
efficient for retrieving data, but they are rigid and difficult to modify. Some examples of hierarchical
databases are IMS, Windows Registry, and LDAP.
3. Network databases: These are databases that store data in a network-like structure, where each
record can have multiple parent and child records. They are useful for representing data that has
complex and many-to-many relationships, such as social networks, transportation networks, etc.
They are more flexible and powerful than hierarchical databases, but they are also more complex and
harder to maintain. Some examples of network databases are IDMS, RDM, and CODASYL.
4. Object-oriented databases: These are databases that store data as objects, which have attributes
and methods. They are useful for representing data that has complex structures and behaviors, such
as multimedia, computer-aided design, etc. They are compatible with object-oriented programming
languages, such as Java, C++, etc. They are more expressive and reusable than relational databases,
but they are also less standardized and less efficient for simple queries. Some examples of object-
oriented databases are ObjectDB, db4o, and Versant.
A transactional database is a type of database that supports online transaction processing (OLTP),
which is the real-time processing of online transactions, such as e-commerce sales, banking, insurance,
etc. A transactional database ensures data accuracy and reliability by following the ACID properties
(Atomicity, Consistency, Isolation, Durability), which guarantee that each transaction is complete, valid,
independent, and persistent. A transactional database can be a relational database or a NoSQL
database, depending on the data model and the application needs.
UNIT - 4
Short Answer Questions
1 Write the values of Jaccard's Index value, Cosine similarity.
A Jaccard's Index value and Cosine similarity are two measures of similarity between two sets or vectors of
data. They are calculated as follows:
Jaccard's Index value: J(x, y) = |x ∩ y| / |x ∪ y|
Cosine similarity: cos(x, y) = (x ⋅ y) / (|x| |y|)
These measures can be used to compare different data objects, such as documents, images, or clusters.
For example, Jaccard’s Index value can be used to measure the overlap between two sets of keywords or
tags, while Cosine similarity can be used to measure the angle between two vectors of word frequencies
or pixel values.
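A quick sketch of both measures in base R; the sets and vectors below are made-up examples:
```
# Jaccard's Index for two sets of tags
x <- c("data", "mining", "graphs")
y <- c("data", "graphs", "clusters", "statistics")
jaccard <- length(intersect(x, y)) / length(union(x, y))
jaccard  # 0.4

# Cosine similarity for two numeric vectors (e.g., word frequencies)
a <- c(1, 3, 0, 2)
b <- c(2, 1, 1, 2)
cosine <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine   # about 0.76
```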
2 Define Data Visualization. How is data visualization implemented?
A Data visualization is the graphical representation of information and data using visual elements like
charts, graphs, maps, and other tools. It helps us to see and understand trends, patterns, and outliers in
data, and to communicate data insights effectively to others.
Data visualization can be implemented using various software tools, such as Tableau, Power BI, Google
Charts, D3.js, and more. These tools allow us to create different types of data visualizations, such as bar
charts, pie charts, scatter plots, line charts, heat maps, histograms, etc. Depending on the data and the
purpose, we can choose the most suitable type of visualization to display our data.
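As a tiny base-R illustration (the sales figures are invented):
```
# A simple bar chart of hypothetical monthly sales
sales <- c(Jan = 120, Feb = 150, Mar = 90)
barplot(sales, main = "Monthly Sales", xlab = "Month", ylab = "Units sold")
```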
• Extract: The first step of ETL is to extract raw data from different sources, such as databases, APIs,
files, or web pages.
For example, a data warehouse for an e-commerce company might extract data from the online
store, the inventory system, the payment gateway, and the customer feedback platform.
• Transform: The second step of ETL is to transform the extracted data according to specific
requirements, such as filtering, aggregating, joining, or converting.
For example, the e-commerce data warehouse might transform the data by removing duplicates,
calculating sales metrics, merging product and customer information, and converting currencies and
dates.
• Load: The final step of ETL is to load the transformed data into a target system, such as a data
warehouse, a data lake, or a database.
For example, the e-commerce data warehouse might load the data into a relational database with a
star schema, where each table represents a dimension or a fact.
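A minimal sketch of these three steps in R; the file names and column names are hypothetical:
```
# Extract: read raw order data exported from the online store
orders <- read.csv("orders.csv")

# Transform: remove duplicate rows and aggregate sales per product
orders <- unique(orders)
sales  <- aggregate(amount ~ product_id, data = orders, FUN = sum)

# Load: write the transformed table to the warehouse staging area
write.csv(sales, "fact_sales.csv", row.names = FALSE)
```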
4 How do you measure the fitness of data set? What are data objects?
A Fitness of data set: The fitness of a data set is the degree to which it meets the requirements and
expectations of the data user for a specific purpose. It can be measured by using data quality metrics,
such as accuracy, completeness, consistency, timeliness, validity, duplication, and uniqueness.
For example, a data set that is accurate, complete, consistent, timely, valid, non-duplicated, and unique
can be considered fit for use for most data analysis tasks.
Data objects: Data objects are collections of one or more data points that create meaning as a whole.
They are the units of data that can be manipulated, stored, or exchanged by data systems. Data objects
can have different types, such as tables, arrays, pointers, records, files, sets, and scalar types.
For example, a data table is a data object that consists of rows and columns of data points, and it can be
queried, updated, or exported by a database system.
Data Cleaning | Data Integration
It lowers errors and raises the caliber of the data | It enables analysts to perform comprehensive analysis and derive insights from combined data sources
It can be accomplished using a variety of data mining approaches, such as clustering, outlier detection, or data quality mining | It can be implemented using various techniques, such as schema matching, entity resolution, or data fusion
2 List the applications of Data Transformation.
A Data Transformation is a technique used to transform raw data into a more appropriate format that
enables efficient data mining and model building.
Here are brief explanations and formulas for three common data similarity coefficients:
1. Euclidean similarity coefficient is based on the Euclidean distance between two data objects, which
is the length of the straight line connecting them. It is calculated as the square root of the sum of the
squared differences between the corresponding attributes of the two objects. The smaller the
Euclidean distance, the higher the similarity. The formula is:
sim(x, y) = 1 / (1 + d(x, y)) = 1 / (1 + √(Σᵢ₌₁ⁿ (xᵢ − yᵢ)²))
Where x and y are two data objects with n attributes, and d(x, y) is the Euclidean distance between them.
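A direct translation of this formula into R (the example vectors are arbitrary):
```
# Euclidean similarity: 1 / (1 + Euclidean distance)
euclid_sim <- function(x, y) {
  1 / (1 + sqrt(sum((x - y)^2)))
}
euclid_sim(c(1, 2, 3), c(2, 4, 3))  # about 0.31
```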
2. Jaccard’s index is a similarity coefficient that measures the overlap between two data objects, which
are usually represented as sets. It is calculated as the ratio of the size of the intersection of the two
sets to the size of their union. The larger the intersection, the higher the similarity. The formula is:
sim(x, y) = |x ∩ y| / |x ∪ y|
Where x and y are two data sets, and |x| denotes the size of the set x.
3. Cosine similarity coefficient is based on the angle between two data objects, which are usually represented as vectors. It is calculated as the dot product of the two vectors divided by the product of their magnitudes. The smaller the angle, the higher the similarity. The formula is:
sim(x, y) = (x ⋅ y) / (|x| |y|)
Where x ⋅ y is the dot product of the two vectors, and |x| denotes the magnitude of the vector x.
Hierarchy generation is a process of creating a concept hierarchy for a given attribute or data set. A
concept hierarchy is a sequence of mappings from a set of low-level concepts to a set of high-level
concepts, based on some criteria of importance or abstraction.
For example, a concept hierarchy for the attribute “city” can be generated by mapping it to higher-level
concepts such as “state”, “country”, and “continent”.
Data Discretization and Hierarchy generation are often used together to provide a hierarchical or multi-resolution partitioning of the data values, which can enable mining at different levels of abstraction.
2. Binning: This is a technique that groups the data values into a number of bins and smooths them using, for example, the bin mean, median, or boundary values.
3. Cluster analysis: This is a technique that partitions the data values into clusters, based on some
distance measure, such as Euclidean, Manhattan, or Jaccard.
4. Decision tree analysis: This is a technique that splits the data values into disjoint intervals, based on some splitting criterion, such as entropy, information gain, or Gini index.
5. Correlation analysis: This is a technique that merges the data values into overlapping intervals,
based on some correlation measure, such as linear regression, Pearson, or Spearman.
5 Explain some important data discretization techniques.
A Data Discretization is a technique used to transform continuous or numerical data into discrete or
categorical data. It can help reduce the complexity and dimensionality of the data, as well as enhance
the features and performance of the data analysis and machine learning models.
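For instance, binning with base R's cut() function; the ages and bin edges below are arbitrary:
```
# Discretize a numeric attribute into labeled intervals
ages <- c(5, 17, 23, 34, 41, 58, 62, 79)
age_groups <- cut(ages, breaks = c(0, 18, 40, 65, 100),
                  labels = c("child", "young adult", "middle-aged", "senior"))
table(age_groups)  # counts per interval
```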
UNIT - 5
Short Answer Questions
1 How are R variables and R data types declared in RStudio?
A In R, variables are declared using the assignment operator <-; the data type is inferred from the assigned value rather than declared explicitly.
You can check the data type of an R object using functions such as typeof(), mode(), storage.mode(), class(), and str().
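For example:
```
x <- 42.5        # numeric (double)
n <- 10L         # integer (note the L suffix)
s <- "hello"     # character
flag <- TRUE     # logical

typeof(x)   # "double"
class(n)    # "integer"
mode(s)     # "character"
str(flag)   # logi TRUE
```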
2 Write a short note on vectors. Write the syntax for implementation of vectors in R.
A Vectors are a fundamental data structure in R. A vector is an ordered collection of values that are of the
same data type. R has five main data types: numeric, integer, complex, character, and logical.
To create a vector, you can use the c() function and separate the items by a comma.
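For example (the values are arbitrary):
```
fruits <- c("apple", "banana", "cherry")   # character vector
scores <- c(90, 85.5, 78)                  # numeric vector
scores[2]                                  # access the second element: 85.5
```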
To create a vector of numerical values, you can use the : operator to create a sequence of numbers:
numbers <- 1:10
• Download the latest version of RStudio from their official website using the following command:
wget https://download1.rstudio.org/desktop/bionic/amd64/rstudio-1.4.1717-amd64.deb
• Install the downloaded package by running the following command:
sudo dpkg -i rstudio-1.4.1717-amd64.deb
• If there are any missing dependencies, install them by running the following command:
sudo apt-get -f install
2 How can we implement different Matrix operations in R? Explain with examples.
A Matrices are a fundamental data structure in R. You can perform various operations on matrices, such as addition, subtraction, and multiplication, as well as computing the power, rank, determinant, diagonal, eigenvalues and eigenvectors, and transpose, and decomposing the matrix by different methods.
1. Addition: The `+` operator is used to add two matrices. For example, to add two matrices `A` and
`B`, you would write:
```
A <- matrix(c(10, 8, 5, 12), ncol = 2, byrow = TRUE)
B <- matrix(c(5, 3, 15, 6), ncol = 2, byrow = TRUE)
A+B
```
2. Multiplication: The `%*%` operator is used to multiply two matrices. For example, to multiply two
matrices `A` and `B`, you would write: A %*% B
3. Transpose: The `t()` function is used to find the transpose of a matrix. For example, to find the
transpose of a matrix `A`, you would write: t(A)
4. Determinant: The `det()` function is used to find the determinant of a matrix. For example, to find the
determinant of a matrix `A`, you would write: det(A)
5. Rank: The `qr()` function is used to find the rank of a matrix. For example, to find the rank of a matrix
`A`, you would write: qr(A)$rank
3 Explain the logical operators in R with examples.
A R provides the logical operators & (AND), | (OR), ! (NOT), and xor() (exclusive OR). The examples below assume the following values:
a <- 5
b <- 10
c <- TRUE
d <- FALSE
# AND operator
if (a > 3 & b < 15) {
print("Both conditions are true")
}
#output: "Both conditions are true"
# OR operator
if (a > 3 | b < 5) {
print("At least one condition is true")
}
#output: "At least one condition is true"
# NOT operator
if (!d) {
print("d is false")
}
#output: "d is false"
# XOR operator
if (c & !d | !c & d) {
print("One condition is true and the other is false")
}
#output: "One condition is true and the other is false"
4 Elaborate the concept of data frames in R using an example.
A A data frame is a two-dimensional data structure in R that stores data in tabular format. It is similar to a
matrix, but unlike a matrix, it can store different data types in each column. A data frame has rows and
columns, and each column can be a different vector. You can think of a data frame as a spreadsheet or a
SQL table.
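A minimal example that would produce the output shown below (the values are taken from that output):
```
# A data frame with a character, a numeric, and a logical column
df <- data.frame(
  name    = c("Riyan", "Bilal", "faisal"),
  age     = c(21, 19, 20),
  married = c(TRUE, FALSE, TRUE)
)
df
```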
output:
name age married
1 Riyan 21 TRUE
2 Bilal 19 FALSE
3 faisal 20 TRUE
5 Show the implementation of functions in R. Explain with example
A • Functions are a fundamental concept in R programming.
• A function is a block of code that performs a specific task. It takes input, performs some operations
on the input, and returns output.
• Functions are used to break down complex problems into smaller, more manageable parts.
To write a function in R, we use the "function" keyword followed by a pair of parentheses, inside which we
specify the input arguments. The body of the function is contained within a pair of curly braces, and
within this body, we can perform any operations we want on the input arguments and return the desired
output.
Example of a simple function that takes a numeric vector as input and returns the sum of its
elements:
# Define the function
sum_vector <- function(x) {
# Calculate the sum of the vector elements
sum_x <- sum(x)
# Return the sum
return(sum_x)
}
The function first calculates the sum of the elements in the input vector using the "sum()" function and
then returns this value using the "return()" function. When we call the function with a sample vector, it
will output the sum of the vector elements.
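For example, a sample call:
```
sum_vector(c(1, 2, 3, 4))
# [1] 10
```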