Big Data Analysis
What is Pandas?
Before Pandas, Python was capable of data preparation, but it provided only limited support for
data analysis. So, Pandas came into the picture and enhanced the capabilities of data analysis. It
can perform the five significant steps required for processing and analyzing data, irrespective of
the origin of the data, i.e., load, manipulate, prepare, model, and analyze.
Example
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
What is NumPy?
The NumPy package was created by Travis Oliphant in 2005 by combining the functionality of
the ancestor module Numeric with that of the Numarray module. It is also capable of handling
vast amounts of data and is well suited to matrix multiplication and data reshaping.
Both Pandas and NumPy can be seen as essential libraries for any scientific computation,
including machine learning, due to their intuitive syntax and high-performance matrix
computation capabilities. These two libraries are also well suited for data science applications.
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
Output:
[1 2 3 4 5]
<class 'numpy.ndarray'>
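Since matrix multiplication and data reshaping are mentioned above, here is a small sketch of both (the arrays are chosen only for illustration):

import numpy as np

# Reshape a flat range into a 2x3 matrix and a 3x2 matrix,
# then multiply them to get a 2x2 matrix product.
a = np.arange(6).reshape(2, 3)
b = np.arange(6).reshape(3, 2)
print(a @ b)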
What is SciPy?
SciPy builds on NumPy and provides additional utility functions for optimization, statistics, and
signal processing.
If SciPy uses NumPy underneath, why can we not just use NumPy?
SciPy provides optimized and additional functions that are frequently used alongside NumPy in
data science.
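As a small illustration of the kind of utilities SciPy adds on top of NumPy, the sketch below minimizes a simple quadratic function with scipy.optimize (the function and starting point are chosen only for illustration):

from scipy import optimize

# Minimize f(x) = (x - 3)^2 + 2; the minimum is at x = 3 with value 2.
result = optimize.minimize(lambda x: (x[0] - 3) ** 2 + 2, x0=[0.0])
print(result.x)    # approximately [3.]
print(result.fun)  # approximately 2.0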
Experiment No 02: Implementation of Plotting, Filtering and Cleaning of CSV File Data
Using NumPy & Pandas.
The Python NumPy and Pandas modules provide some methods for data cleaning in Python. Data
cleaning is a process where all of the data that needs to be passed into a database or used for data
analysis is cleaned by either updating or removing missing, inaccurate, incorrectly formatted,
duplicated, or irrelevant information. Data cleansing should be practiced regularly in order to
avoid piling up uncleaned data over the years.
If data is not cleaned properly, it can result in significant losses, including a reduction in
marketing effectiveness. Hence, cleaning the data becomes really important to avoid inaccuracies
in major results.
Efficient data cleaning implies fewer errors which results in happier customers and fewer
frustrated employees. It also leads to an increase in productivity and better decisions.
1. Data Loading
Now let's perform data cleaning on a CSV file that I have downloaded from the internet.
The name of the dataset is 'San Francisco Building Permits'. Before any processing of the data,
it is first loaded from the file. The code for data loading is shown below:
import numpy as np
import pandas as pd
data = pd.read_csv('Building_Permits.csv', low_memory=False)
First, all the required modules are imported and then the CSV file is loaded. I have added an
additional parameter named low_memory and set it to False so that pandas reads the whole file in
one pass and infers consistent column types, avoiding the mixed-type (DtypeWarning) issues this
large dataset would otherwise trigger.
The dataset contains 198900 permit details and 43 columns.
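A quick way to confirm this (assuming the DataFrame loaded above) is to check its shape attribute:

# shape returns (number of rows, number of columns)
print(data.shape)  # (198900, 43)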
Looking at the dataset, we can see that it has a large number of columns, some of which we can
skip during processing.
For now let’s drop some random columns namely TIDF Compliance, Fire Only Permit, Unit
Suffix, Block, and Lot.
columns_to_drop = ['TIDF Compliance', 'Fire Only Permit', 'Unit Suffix', 'Block', 'Lot']
data_dropcol = data.drop(columns_to_drop, axis=1)
We first create a list storing all the column names to drop from the dataset.
In the next line, we use the drop function and pass this list into it. We also pass the axis
parameter, whose value can be either 0 (row-wise drop) or 1 (column-wise drop).
After the code executes, the new DataFrame contains only 38 columns instead of 43.
Before moving directly to removing the rows with missing values, let's first analyze how many
missing values there are in the dataset. For this purpose, we use the code mentioned below.
no_missing = data_dropcol.isnull().sum()
total_missing = no_missing.sum()
On executing the code, we find that there are 1,670,031 missing values in the dataset. Since there
are so many missing values, instead of dropping the rows with missing data, we drop the columns
with the most missing values instead. The code for this is shown below.
drop_miss_value = data_dropcol.dropna(axis=1)
The code results in most of the columns being dropped; only 10 columns remain in the resulting
dataset. Yes, most of the information is dropped from the dataset, but at least now the dataset is
adequately cleaned.
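Dropping columns is not the only option. As a sketch of an alternative (the fill values and column names below are purely illustrative), missing entries can instead be imputed so that less information is lost:

# Fill every remaining missing value with a placeholder instead of dropping columns
data_filled = data_dropcol.fillna('missing')
# Or fill only selected columns with column-specific defaults
data_filled_cols = data_dropcol.fillna({'Street Number Suffix': 'N/A', 'Unit': 0})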
Summary
Data analysis is a resource-intensive operation. So it makes sense to clean the raw data before the
analysis to save time and effort. Data cleaning also makes sure that our analysis is more accurate.
Python pandas and NumPy modules are best suited for CSV data cleaning.
Experiment no 03: Linear Regression using WEKA.
WEKA is an open-source software package that provides tools for data preprocessing,
implementations of several machine learning algorithms, and visualization tools, so that you can
develop machine learning techniques and apply them to real-world data mining problems.
Classify Tab
The Classify tab provides several machine learning algorithms for the classification of your
data. To list a few, you may apply algorithms such as Linear Regression, Logistic Regression,
Support Vector Machines, Decision Trees, RandomTree, RandomForest, NaiveBayes, and so on.
The list is extensive and covers both supervised and unsupervised machine learning
algorithms.
Linear Regression
Linear regression works by estimating coefficients for a line or hyperplane that best fits the
training data. It is a very simple regression algorithm, fast to train, and can have great
performance if the output variable for your data is a linear combination of your inputs.
It is a good idea to evaluate linear regression on your problem before moving on to more complex
algorithms, in case it performs well.
Click the “Choose” button and select “LinearRegression” under the “functions” group.
The performance of linear regression can be reduced if your training data has input attributes that
are highly correlated. Weka can detect and remove highly correlated input attributes
automatically by setting eliminateColinearAttributes to True, which is the default.
Additionally, attributes that are unrelated to the output variable can also negatively impact
performance. Weka can automatically perform feature selection to only select those relevant
attributes by setting the attributeSelectionMethod. This is enabled by default and can be disabled.
Finally, the Weka implementation uses a ridge regularization technique in order to reduce the
complexity of the learned model. It does this by penalizing the sum of the squares of the learned
coefficients, which will prevent any specific coefficient from becoming too large (a sign of
complexity in regression models).
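As a sketch of the idea (illustrative code, not Weka's exact internals), the quantity being minimized adds an L2 penalty on the coefficients to the squared prediction error:

import numpy as np

# Ridge-style objective: squared prediction error plus a penalty on the
# squared coefficients, scaled by the ridge parameter.
def ridge_objective(X, y, coef, ridge=1e-8):
    residuals = y - X @ coef
    return np.sum(residuals ** 2) + ridge * np.sum(coef ** 2)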
You can see that with the default configuration, linear regression achieves an RMSE of 4.9.
k-Nearest Neighbors
The k-nearest neighbors algorithm supports both classification and regression. It is also called
kNN for short. It works by storing the entire training dataset and querying it to locate the k most
similar training patterns when making a prediction.
As such, there is no model other than the raw training dataset and the only computation
performed is the querying of the training dataset when a prediction is requested.
It is a simple algorithm, but one that does not assume very much about the problem other than
that the distance between data instances is meaningful in making predictions. As such, it often
achieves very good performance.
When making predictions on regression problems, KNN will take the mean of the k most similar
instances in the training dataset. Choose the KNN algorithm:
Click the “Choose” button and select “IBk” under the “lazy” group.
The size of the neighborhood is controlled by the k parameter. For example, if set to 1, then
predictions are made using the single most similar training instance to a given new pattern for
which a prediction is requested. Common values for k are 3, 7, 11 and 21, larger for larger
dataset sizes. Weka can automatically discover a good value for k using cross validation inside
the algorithm by setting the crossValidate parameter to True.
Another important parameter is the distance measure used. This is configured in the
nearestNeighbourSearchAlgorithm which controls the way in which the training data is stored
and searched. The default is a LinearNNSearch. Clicking the name of this search algorithm will
provide another configuration window where you can choose a distanceFunction parameter. By
default, Euclidean distance is used to calculate the distance between instances, which is good for
numerical data with the same scale. Manhattan distance is good to use if your attributes differ in
measures or type.
It is a good idea to try a suite of different k values and distance measures on your problem and
see what works best.
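As a minimal sketch of what a kNN regressor does (illustrative NumPy code, not Weka's IBk implementation):

import numpy as np

# Predict by averaging the targets of the k nearest training points.
# Euclidean distance is used here; Manhattan distance would be
# np.abs(X_train - x_new).sum(axis=1) instead.
def knn_predict(X_train, y_train, x_new, k=3):
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()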
Click the “Start” button to run the algorithm on the Boston house price dataset.
You can see that with the default configuration, the KNN algorithm achieves an RMSE of 4.6.
Decision Tree
Decision trees are more recently referred to as Classification And Regression Trees, or CART.
They work by creating a tree to evaluate an instance of data, starting at the root of the tree and
moving down to the leaves (the root is at the top because the tree is drawn with an inverted
perspective) until a prediction can be made. The process of creating a decision tree works by
greedily selecting the best split point in order to make predictions, and repeating the process
until the tree reaches a fixed depth.
After the tree is constructed, it is pruned in order to improve the model's ability to generalize to
new data.
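As a sketch of the greedy split selection described above (illustrative NumPy code, not Weka's REPTree implementation), the best split point on a single feature is the threshold that minimizes the total squared error of the two groups it creates:

import numpy as np

# Try each unique value of the feature as a threshold and keep the one
# with the lowest summed squared error of the left and right groups.
def best_split(x, y):
    best_threshold, best_sse = None, np.inf
    for threshold in np.unique(x):
        left, right = y[x <= threshold], y[x > threshold]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold, best_sse = threshold, sse
    return best_threshold, best_sse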
Click the “Choose” button and select “REPTree” under the “trees” group.
The depth of the tree is defined automatically, but you can specify a maximum depth in the maxDepth attribute.
You can also choose to turn off pruning by setting the noPruning parameter to True, although
this may result in worse performance.
The minNum parameter defines the minimum number of instances supported by the tree in a leaf
node when constructing the tree from the training data.
Click the “Start” button to run the algorithm on the Boston house price dataset.
You can see that with the default configuration, the decision tree algorithm achieves an RMSE of
4.8.
Experiment no 04: Implement multidimensional visualization by adding variables such as
color, size, shape, and label by using Tableau.
Control color, size, shape, detail, text, and tooltips for marks in the view using the Marks card.
Drag fields to buttons on the Marks card to encode the mark data. Click the buttons on the Marks
card to open Mark properties. For related information on marks, see Change the Type of Mark in
the View and Marks card.
Note: The order of dimension fields on the Marks card is hierarchical from top to bottom, and
affects sorting in the view. Tableau first considers the topmost dimension field when ordering
marks in the view, and then considers the dimensions beneath it on the Marks card.
On the Marks card, click Color, and then select a color from the menu.
This updates all marks in the view to the color you choose. All marks have a default
color, even when there are no fields on Color on the Marks card. For most marks, blue is
the default color; for text, black is the default color.
From the Data pane, drag a field to Color on the Marks card.
Tableau applies different colors to marks based on the field’s values and members. For
example, if you drop a discrete field (a blue field), such as Category, on Color, the marks
in the view are broken out by category, and each category is assigned a color.
If you drop a continuous field, such as SUM(sales), on Color, each mark in the view is
colored based on its sales value.
Edit colors
To change the color palette or customize how color is applied to your marks:
On the Marks card, click Size, and then move the slider to the left or right.
The Size slider affects different marks in different ways. For example, for Pie marks it makes
the overall size of the pie bigger or smaller.
The size of your data view is not modified when you change marks using the Size slider.
However, if you change the view size, the mark size might change to accommodate the
new formatting. For example, if you make the table bigger, the marks might become
bigger as well.
From the Data pane, drag a field to Size on the Marks card.
When you place a discrete field on Size on the Marks card, Tableau separates the marks
according to the members in the dimension, and assigns a unique size to each member. Because
size has an inherent order (small to big), categorical sizes work best for ordered data like years or
quarters.
Size-encoding data with a discrete field separates the marks in the same way as
the Detail property does, and then provides additional information (a size) for each mark. For
more information, see Separate marks in the view by dimension members. When you add
categorical size encoding to a view, Tableau displays a legend showing the sizes assigned to each
member in the field on the Size target. You can modify how these sizes are distributed using the
Edit Sizes dialog box.
When you place a continuous field on Size on the Marks card, Tableau draws each mark with a
different size using a continuous range. The smallest value is assigned the smallest sized mark
and the largest value is represented by the largest mark.
When you add quantitative size encoding to the view, Tableau displays a legend showing the
range of values over which sizes are assigned. You can modify how these sizes are distributed
using the Edit Sizes dialog box.
To edit the size of marks, or change how size is being applied to marks in the view:
1. On the Size legend card (which appears when you add a field to Size on the Marks card),
click the drop-down arrow in the right-hand corner and select Edit Sizes.
2. In the Edit Sizes dialog box that appears, make your changes and then click OK.
The options available depend on whether the field being applied to Size is a continuous
or discrete field.
o For Sizes vary, click the drop-down box and select one of the following:
Automatically - Selects the mapping that best fits your data. If the data is
numeric and does not cross zero (all positive or all negative), the From
zero mapping is used. Otherwise, the By range mapping is used.
By range - Uses the minimum and maximum values in the data to
determine the distribution of sizes. For example, if a field has values from
14 to 25, the sizes are distributed across this range.
From zero - Sizes are interpolated from zero, assigning the maximum
mark size to the absolute value of the data value that is farthest from zero.
o Use the range slider to adjust the distribution of sizes. When the From zero
mapping is selected from the Sizes vary drop-down menu, the lower slider is
disabled because it is always set to zero.
o Select Reversed to assign the largest mark to the smallest value and the smallest
mark to the largest value. This option is not available if you are mapping sizes
from zero because the smallest mark is always assigned to zero.
o To modify the distribution of sizes, select the Start value in legend and End
value for range check boxes and type beginning and end values for the range.
For views where the mark type is Bar and there are continuous (green) fields on
both Rows and Columns, Tableau supports additional options and defaults for sizing the bar
marks on the axis where the bars are anchored.
The bar marks in histograms are continuous by default (with no spaces between the
marks), and are sized to match the size of the bins. See Build a Histogram for an
example.
When there is a field on Size, you can determine the width of the bar marks on the axis
where the bars are anchored by using the field on Size. To do this, click the Size card and
select Fixed.
When there is no field on Size, you can specify the width of the bar marks on the axis
where the bars are anchored in axis units. To do this, click the Size card, choose Fixed,
and then type a number in the Width in axis units field.
When there is a continuous date field on the axis where the bars are anchored, the width
of the marks is set to match the level of the date field. For example, if the level of the
continuous date field is MONTH, the bars are exactly one month wide—that is, slightly
wider for 31-day months than for 30-day months. You can configure the width of the bars
by clicking the Size card, choosing Fixed, and then typing a number in the Width in
days field, but the resulting bar widths don't take into account the varying lengths of time
units such as months and years.
From the Data pane, drag a field to Label or Text on the Marks card.
When working with a text table, the Label shelf is replaced with Text, which allows you to view
the numbers associated with a data view. The effect of text-encoding your data view depends on
whether you use a dimension or a measure.
Dimension – When you place a dimension on Label or Text on the Marks card, Tableau
separates the marks according to the members in the dimension. The text labels are
driven by the dimension member names.
Measure – When you place a measure on Label or Text on the Marks card, the text labels
are driven by the measure values. The measure can be either aggregated or disaggregated.
However, disaggregating the measure is generally not useful because it often results in
overlapping text.
Text is the default mark type for a text table, which is also referred to as a cross-tab or a
PivotTable.
From the Data pane, drag a dimension to Detail on the Marks card.
When you drop a dimension on Detail on the Marks card, the marks in a data view are separated
according to the members of that dimension. Unlike dropping a dimension on
the Rows or Columns shelf, dropping it on Detail on the Marks card is a way to show more data
without changing the table structure.
From the Data pane, drag a field to Shape on the Marks card.
When you place a dimension on Shape on the Marks card, Tableau separates the marks
according to the members in the dimension, and assigns a unique shape to each member. Tableau
also displays a shape legend, which shows each member name and its associated shape. When
you place a measure on Shape on the Marks card, the measure is converted to a discrete
measure.
Shape-encoding data separates the marks in the same way as the Detail property does, and then
provides additional information (a shape) for each mark. Shape is the default mark type when
measures are the innermost fields for both the Rows shelf and the Columns shelf.
In the view below, the marks are separated into different shapes according to the members of
the Customer Segment dimension. Each shape reflects the customer segment’s contribution to
profit and sales.
Edit shapes
By default, ten unique shapes are used to encode dimensions. If you have more than 10
members, the shapes repeat. In addition to the default palette, you can choose from a variety of
shape palettes, including filled shapes, arrows, and even weather symbols.
1. Click Shape on the Marks card, or select Edit Shape on the legend’s card menu.
2. In the Edit Shape dialog box, select a member on the left and then select the new shape in
the palette on the right. You can also click Assign Palette to quickly assign the shapes to
the members of the field.
Select a different shape palette using the drop-down menu in the upper right.
Note: Shape encodings are shared across multiple worksheets that use the same data source. For
example, if you define Furniture products to be represented by a square, they will automatically
be squares in all other views in the workbook. To set the default shape encodings for a field,
right-click (control-click on Mac) the field in the Data pane and select Default
Properties > Shape.
You can add custom shapes to a workbook by copying shape image files to the Shapes folder in
your Tableau Repository, which is located in your Documents folder. When you use custom
shapes, they are saved with the workbook. That way the workbook can be shared with others.
1. Create your shape image files. Each shape should be saved as its own file and can be in
any of several image formats including bitmap (.bmp), portable network graphic (.png),
.jpg, and graphics interchange format (.gif).
2. Copy the shape files to a new folder in the My Tableau Repository\Shapes folder in your
Documents folder. The name of the folder will be used as the name of the palette in
Tableau. In the example below, two new palettes are created: Maps and My Custom
Shapes.
3. In Tableau, click the drop-down arrow on the shape legend, and select Edit Shape.
4. Select the new custom palette in the drop-down list. If you modified the shapes while
Tableau was running, you may need to click Reload Shapes.
5. You can either assign members shapes one at a time, or click Assign Palette to
automatically assign the shapes to the members.
Note: You can return to the default palette by clicking the Reset button. If you open a workbook
that uses custom shapes that you don’t have, the workbook will show the custom shapes because
the shapes are saved as part of the workbook. However, you can click Reload Shapes in the Edit
Shapes dialog box to use the ones in your repository instead.
Below are some examples of views that use both the default and custom shape palettes.
Tips for creating custom shapes
When you create custom shapes there are a few things that you can do to improve how your
shapes look and function in the view. If you are creating your own shapes, we recommend
following general guidelines for making icons or clip art.
Suggested size - Unless you plan on using Size to make the shapes really large, you
should try to make your original shape size close to 32 pixels by 32 pixels. However, the
original size depends on the range of sizes you want available in Tableau. You can resize
the shapes in Tableau by clicking Size on the Marks card, or by using the cell size
options on the Format menu.
Adding color encoding - If you plan to also use Color to encode shapes, you should use
a transparent background. Otherwise, the entire square of the image will be colored rather
than just the symbol. GIF and PNG file formats both support transparency. GIF files
support transparency for a single color that is 100% transparent, while PNG files support
alpha channels with a range of transparency levels available on every pixel in the image.
When Tableau color encodes a symbol, the amount of transparency for each pixel won't
be modified, so you can maintain smooth edges.
Note: Avoid including too much transparency around an image. Make the size of the
custom shape as close to the size of the image as possible. Extra transparent pixels around
the edges of the image can negatively affect the hover or click behavior near the image,
especially when custom shapes overlap each other. When the actual shape area is bigger
than what is visible, it can make hovering and clicking the shape more difficult and less
predictable for users.
File formats - Tableau doesn't support symbols that are in the Enhanced Meta File
format (.emf). The shape image files can be in one of the following formats: .png, .gif,
.jpg, .bmp, and .tiff.
Experiment no 05 Simple MongoDB and its CRUD Operations
Create Operations
For MongoDB CRUD, if the specified collection doesn’t exist, the create operation will create
the collection when it’s executed. Create operations in MongoDB target a single collection, not
multiple collections. Insert operations in MongoDB are atomic on a single document level.
MongoDB provides two different create operations that you can use to insert documents into a
collection:
db.collection.insertOne()
db.collection.insertMany()
insertOne()
As the name suggests, insertOne() allows you to insert one document into the collection. For this
example, we're going to work with a collection called RecordsDB. We can insert a single entry
into our collection by calling the insertOne() method on RecordsDB. We then provide the
information we want to insert in the form of key-value pairs, which establishes the structure of
the document.
db.RecordsDB.insertOne({
name: "Marsh",
age: "6 years",
species: "Dog",
ownerAddress: "380 W. Fir Ave",
chipped: true
})
If the create operation is successful, a new document is created. The function will return an
object where "acknowledged" is "true" and "insertedId" is the newly created "ObjectId."
> db.RecordsDB.insertOne({
... name: "Marsh",
... age: "6 years",
... species: "Dog",
... ownerAddress: "380 W. Fir Ave",
... chipped: true
... })
{
"acknowledged" : true,
"insertedId" : ObjectId("5fd989674e6b9ceb8665c57d")
}
insertMany()
It's possible to insert multiple items at one time by calling the insertMany() method on the
desired collection. In this case, we pass multiple items into our chosen collection (RecordsDB)
and separate them by commas. Within the parentheses, we use square brackets to indicate that we
are passing in an array of multiple entries, i.e. the documents are nested inside a list.
db.RecordsDB.insertMany([
    {
        name: "Marsh",
        age: "6 years",
        species: "Dog",
        ownerAddress: "380 W. Fir Ave",
        chipped: true
    },
    {
        name: "Kitana",
        age: "4 years",
        species: "Cat",
        ownerAddress: "521 E. Cortland",
        chipped: true
    }
])
Read Operations
The read operations allow you to supply special query filters and criteria that let you specify
which documents you want. The MongoDB documentation contains more information on the
available query filters. Query modifiers may also be used to change how many results are
returned.
db.collection.find()
db.collection.findOne()
find()
In order to get all the documents from a collection, we can simply use the find() method on our
chosen collection. Executing just the find() method with no arguments will return all records
currently in the collection.
db.RecordsDB.find()
{ "_id" : ObjectId("5fd98ea9ce6e8850d88270b5"), "name" : "Kitana", "age" : "4 years",
"species" : "Cat", "ownerAddress" : "521 E. Cortland", "chipped" : true }
{ "_id" : ObjectId("5fd993a2ce6e8850d88270b7"), "name" : "Marsh", "age" : "6 years",
"species" : "Dog", "ownerAddress" : "380 W. Fir Ave", "chipped" : true }
{ "_id" : ObjectId("5fd993f3ce6e8850d88270b8"), "name" : "Loo", "age" : "3 years", "species" :
"Dog", "ownerAddress" : "380 W. Fir Ave", "chipped" : true }
{ "_id" : ObjectId("5fd994efce6e8850d88270ba"), "name" : "Kevin", "age" : "8 years", "species"
: "Dog", "ownerAddress" : "900 W. Wood Way", "chipped" : true }
Here we can see that every record has an assigned “ObjectId” mapped to the “_id” key.
If you want to get more specific with a read operation and find a desired subsection of the
records, you can use the previously mentioned filtering criteria to choose what results should be
returned. One of the most common ways of filtering the results is to search by value.
db.RecordsDB.find({"species":"Cat"})
{ "_id" : ObjectId("5fd98ea9ce6e8850d88270b5"), "name" : "Kitana", "age" : "4 years",
"species" : "Cat", "ownerAddress" : "521 E. Cortland", "chipped" : true }
findOne()
In order to get one document that satisfies the search criteria, we can simply use
the findOne() method on our chosen collection. If multiple documents satisfy the query, this
method returns the first document according to the natural order which reflects the order of
documents on the disk. If no documents satisfy the search criteria, the function returns null. The
function takes the following form of syntax.
db.{collection}.findOne({query}, {projection})
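For instance, here is a sketch against the same RecordsDB collection used above (the query value is illustrative):

db.RecordsDB.findOne({"species": "Dog"})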
Notice that even though multiple documents may meet the search criteria, only the first document
that matches the search condition is returned.
Update Operations
Like create operations, update operations operate on a single collection, and they are atomic at a
single document level. An update operation takes filters and criteria to select the documents you
want to update.
You should be careful when updating documents, as updates are permanent and can’t be rolled
back. This applies to delete operations as well.
For MongoDB CRUD, there are three different methods of updating documents:
db.collection.updateOne()
db.collection.updateMany()
db.collection.replaceOne()
updateOne()
We can update a currently existing record and change a single document with an update
operation. To do this, we use the updateOne() method on a chosen collection, which here is
“RecordsDB.” To update a document, we provide the method with two arguments: an update
filter and an update action.
The update filter defines which items we want to update, and the update action defines how to
update those items. We first pass in the update filter. Then, we use the “$set” key and provide the
fields we want to update as a value. This method will update the first record that matches the
provided filter.
db.RecordsDB.updateOne({name: "Marsh"}, {$set:{ownerAddress: "451 W. Coffee St.
A204"}})
{ "acknowledged" : true, "matchedCount" : 1, "modifiedCount" : 1 }
{ "_id" : ObjectId("5fd993a2ce6e8850d88270b7"), "name" : "Marsh", "age" : "6 years",
"species" : "Dog", "ownerAddress" : "451 W. Coffee St. A204", "chipped" : true }
updateMany()
updateMany() allows us to update multiple documents at once: every document that matches the
update filter is modified. This update operation uses the same syntax as updating a single
document.
db.RecordsDB.updateMany({species:"Dog"}, {$set: {age: "5"}})
{ "acknowledged" : true, "matchedCount" : 3, "modifiedCount" : 3 }
> db.RecordsDB.find()
{ "_id" : ObjectId("5fd98ea9ce6e8850d88270b5"), "name" : "Kitana", "age" : "4 years",
"species" : "Cat", "ownerAddress" : "521 E. Cortland", "chipped" : true }
{ "_id" : ObjectId("5fd993a2ce6e8850d88270b7"), "name" : "Marsh", "age" : "5", "species" :
"Dog", "ownerAddress" : "451 W. Coffee St. A204", "chipped" : true }
{ "_id" : ObjectId("5fd993f3ce6e8850d88270b8"), "name" : "Loo", "age" : "5", "species" :
"Dog", "ownerAddress" : "380 W. Fir Ave", "chipped" : true }
{ "_id" : ObjectId("5fd994efce6e8850d88270ba"), "name" : "Kevin", "age" : "5", "species" :
"Dog", "ownerAddress" : "900 W. Wood Way", "chipped" : true }
replaceOne()
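replaceOne() replaces the first document that matches the provided filter with an entirely new document; unlike updateOne() with $set, the existing fields are overwritten rather than individually modified (only the _id is preserved). A minimal sketch using the same RecordsDB collection (the replacement document below is purely illustrative):

db.RecordsDB.replaceOne({name: "Kevin"},
{
    name: "Kevin",
    age: "9 years",
    species: "Dog",
    ownerAddress: "900 W. Wood Way",
    chipped: true
})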
Delete Operations
Delete operations operate on a single collection, like update and create operations. Delete
operations are also atomic for a single document. You can provide delete operations with filters
and criteria in order to specify which documents you would like to delete from a collection. The
filter options rely on the same syntax that read operations utilize.
db.collection.deleteOne()
db.collection.deleteMany()
deleteOne()
deleteOne() is used to remove a document from a specified collection on the MongoDB server.
Filter criteria are used to specify the item to delete. It deletes the first record that matches the
provided filter.
db.RecordsDB.deleteOne({name:"Maki"})
{ "acknowledged" : true, "deletedCount" : 1 }
> db.RecordsDB.find()
{ "_id" : ObjectId("5fd98ea9ce6e8850d88270b5"), "name" : "Kitana", "age" : "4 years",
"species" : "Cat", "ownerAddress" : "521 E. Cortland", "chipped" : true }
{ "_id" : ObjectId("5fd993a2ce6e8850d88270b7"), "name" : "Marsh", "age" : "5", "species" :
"Dog", "ownerAddress" : "451 W. Coffee St. A204", "chipped" : true }
{ "_id" : ObjectId("5fd993f3ce6e8850d88270b8"), "name" : "Loo", "age" : "5", "species" :
"Dog", "ownerAddress" : "380 W. Fir Ave", "chipped" : true }
deleteMany()
deleteMany() is a method used to delete multiple documents from a desired collection with a
single delete operation. A filter document is passed into the method, and every document
matching the filter criteria is deleted, using the same syntax as deleteOne().
db.RecordsDB.deleteMany({species:"Dog"})
{ "acknowledged" : true, "deletedCount" : 2 }
> db.RecordsDB.find()
{ "_id" : ObjectId("5fd98ea9ce6e8850d88270b5"), "name" : "Kitana", "age" : "4 years", "specie
Experiment no 06 : Performing import, export and aggregation in MongoDB.
In MongoDB, aggregation operations process data records/documents and return computed
results. They collect values from various documents, group them together, and then perform
different types of operations on the grouped data, such as sum, average, minimum, and maximum,
to return a computed result. This is similar to the aggregate functions of SQL.
MongoDB provides three ways to perform aggregation:
Aggregation pipeline
Map-reduce function
Single-purpose aggregation
Aggregation pipeline
In MongoDB, the aggregation pipeline consists of stages, and each stage transforms the
documents. In other words, the aggregation pipeline is a multi-stage pipeline: in each stage,
documents are taken as input and a resulting set of documents is produced; in the next stage (if
available), those resulting documents are taken as input and produce new output, and this process
continues until the last stage. The basic pipeline stages provide filters that work like queries,
document transformations that modify the resulting documents, and other pipeline stages that
provide tools for grouping and sorting documents. You can also use the aggregation pipeline on a
sharded collection.
Let us discuss the aggregation pipeline with the help of an example. Consider a collection of
train fares. In the first stage, the $match stage filters the documents by the value of the class
field, i.e. class: "first-class", and passes the matching documents to the second stage. In the
second stage, the $group stage groups the documents by the id field to calculate the sum of the
fare for each unique id.
Here, the aggregate() function is used to perform the aggregation; it can work with three kinds of
operators: stages, expressions, and accumulators.
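A minimal sketch of this two-stage pipeline in the mongo shell (the collection name trainFares and the field names class, id, and fare are assumed for illustration):

db.trainFares.aggregate([
    { $match: { class: "first-class" } },
    { $group: { _id: "$id", totalFare: { $sum: "$fare" } } }
])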