Big Data Analysis
What is Pandas?
Before Pandas, Python was capable of data preparation, but it provided only limited support for
data analysis. So, Pandas came into the picture and enhanced the capabilities of data analysis. It
can perform the five significant steps required for processing and analyzing data, irrespective of
the origin of the data, i.e., load, manipulate, prepare, model, and analyze.
Example
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
What is NumPy?
The NumPy package was created by Travis Oliphant in 2005 by combining the functionality of
the ancestor module Numeric with that of the Numarray module. It is also capable of handling
vast amounts of data and is well suited to matrix multiplication and data reshaping.
Both Pandas and NumPy can be seen as essential libraries for any scientific computation,
including machine learning, due to their intuitive syntax and high-performance matrix
computation capabilities. These two libraries are also well suited for data science applications.
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
Output:
[1 2 3 4 5]
<class 'numpy.ndarray'>
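Since matrix multiplication and data reshaping are mentioned above, here is a small sketch of both (the arrays are chosen only for illustration):

import numpy as np

# Reshape a flat range into a 2x3 matrix and a 3x2 matrix,
# then multiply them to get a 2x2 matrix product.
a = np.arange(6).reshape(2, 3)
b = np.arange(6).reshape(3, 2)
print(a @ b)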
What is SciPy?
SciPy builds on NumPy and provides additional utility functions for optimization, statistics, and
signal processing.
If SciPy uses NumPy underneath, why can we not just use NumPy?
SciPy provides optimized and additional functions that are frequently used alongside NumPy in
data science.
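As a small illustration of the kind of utilities SciPy adds on top of NumPy, the sketch below minimizes a simple quadratic function with scipy.optimize (the function and starting point are chosen only for illustration):

from scipy import optimize

# Minimize f(x) = (x - 3)^2 + 2; the minimum is at x = 3 with value 2.
result = optimize.minimize(lambda x: (x[0] - 3) ** 2 + 2, x0=[0.0])
print(result.x)    # approximately [3.]
print(result.fun)  # approximately 2.0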
Experiment No 02: Implementation of Plotting, Filtering and Cleaning of CSV File Data
Using NumPy & Pandas.
The Python NumPy and Pandas modules provide some methods for data cleaning in Python. Data
cleaning is a process where all of the data that needs to be passed into a database or used for data
analysis is cleaned by either updating or removing missing, inaccurate, incorrectly formatted,
duplicated, or irrelevant information. Data cleansing should be practiced regularly in order to
avoid piling up uncleaned data over the years.
If data is not cleaned properly, it can result in significant losses, including a reduction in
marketing effectiveness. Hence, cleaning the data becomes really important to avoid inaccuracies
in major results.
Efficient data cleaning implies fewer errors which results in happier customers and fewer
frustrated employees. It also leads to an increase in productivity and better decisions.
1. Data Loading
Now let's perform data cleaning on a CSV file that I have downloaded from the internet.
The name of the dataset is 'San Francisco Building Permits'. Before any processing of the data,
it is first loaded from the file. The code for data loading is shown below:
import numpy as np
import pandas as pd
data = pd.read_csv('Building_Permits.csv', low_memory=False)
First, all the required modules are imported and then the CSV file is loaded. I have added an
additional parameter named low_memory and set it to False so that pandas reads the whole file in
one pass and infers consistent column types, avoiding the mixed-type (DtypeWarning) issues this
large dataset would otherwise trigger.
The dataset contains 198900 permit details and 43 columns.
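A quick way to confirm this (assuming the DataFrame loaded above) is to check its shape attribute:

# shape returns (number of rows, number of columns)
print(data.shape)  # (198900, 43)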
Looking at the dataset, we can see that it has a large number of columns, some of which we can
skip during processing.
For now let’s drop some random columns namely TIDF Compliance, Fire Only Permit, Unit
Suffix, Block, and Lot.
columns_to_drop = ['TIDF Compliance', 'Fire Only Permit', 'Unit Suffix', 'Block', 'Lot']
data_dropcol = data.drop(columns_to_drop, axis=1)
We first create a list storing all the column names to drop from the dataset.
In the next line, we use the drop function and pass this list into it. We also pass the axis
parameter, whose value can be either 0 (row-wise drop) or 1 (column-wise drop).
After the code executes, the new DataFrame contains only 38 columns instead of 43.
Before moving directly to removing the rows with missing values, let's first analyze how many
missing values there are in the dataset. For this purpose, we use the code mentioned below.
no_missing = data_dropcol.isnull().sum()
total_missing = no_missing.sum()
On executing the code, we find that there are 1,670,031 missing values in the dataset. Since there
are so many missing values, instead of dropping the rows with missing data, we drop the columns
with the most missing values instead. The code for this is shown below.
drop_miss_value = data_dropcol.dropna(axis=1)
The code results in most of the columns being dropped; only 10 columns remain in the resulting
dataset. Yes, most of the information is dropped from the dataset, but at least now the dataset is
adequately cleaned.
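Dropping columns is not the only option. As a sketch of an alternative (the fill values and column names below are purely illustrative), missing entries can instead be imputed so that less information is lost:

# Fill every remaining missing value with a placeholder instead of dropping columns
data_filled = data_dropcol.fillna('missing')
# Or fill only selected columns with column-specific defaults
data_filled_cols = data_dropcol.fillna({'Street Number Suffix': 'N/A', 'Unit': 0})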
Summary
Data analysis is a resource-intensive operation. So it makes sense to clean the raw data before the
analysis to save time and effort. Data cleaning also makes sure that our analysis is more accurate.
Python pandas and NumPy modules are best suited for CSV data cleaning.
Experiment no 03: Linear Regression using WEKA.
WEKA is an open-source software package that provides tools for data preprocessing,
implementations of several machine learning algorithms, and visualization tools, so that you can
develop machine learning techniques and apply them to real-world data mining problems.
Classify Tab
The Classify tab provides several machine learning algorithms for the classification of your
data. To list a few, you may apply algorithms such as Linear Regression, Logistic Regression,
Support Vector Machines, Decision Trees, RandomTree, RandomForest, NaiveBayes, and so on.
The list is extensive and covers both supervised and unsupervised machine learning
algorithms.
Linear Regression
Linear regression works by estimating coefficients for a line or hyperplane that best fits the
training data. It is a very simple regression algorithm, fast to train, and can have great
performance if the output variable for your data is a linear combination of your inputs.
It is a good idea to evaluate linear regression on your problem before moving on to more complex
algorithms, in case it performs well.
Click the “Choose” button and select “LinearRegression” under the “functions” group.
The performance of linear regression can be reduced if your training data has input attributes that
are highly correlated. Weka can detect and remove highly correlated input attributes
automatically by setting eliminateColinearAttributes to True, which is the default.
Additionally, attributes that are unrelated to the output variable can also negatively impact
performance. Weka can automatically perform feature selection to only select those relevant
attributes by setting the attributeSelectionMethod. This is enabled by default and can be disabled.
Finally, the Weka implementation uses a ridge regularization technique in order to reduce the
complexity of the learned model. It does this by penalizing the sum of the squares of the learned
coefficients, which will prevent any specific coefficient from becoming too large (a sign of
complexity in regression models).
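As a sketch of the idea (illustrative code, not Weka's exact internals), the quantity being minimized adds an L2 penalty on the coefficients to the squared prediction error:

import numpy as np

# Ridge-style objective: squared prediction error plus a penalty on the
# squared coefficients, scaled by the ridge parameter.
def ridge_objective(X, y, coef, ridge=1e-8):
    residuals = y - X @ coef
    return np.sum(residuals ** 2) + ridge * np.sum(coef ** 2)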
You can see that with the default configuration, linear regression achieves an RMSE of 4.9.
k-Nearest Neighbors
The k-nearest neighbors algorithm supports both classification and regression. It is also called
kNN for short. It works by storing the entire training dataset and querying it to locate the k most
similar training patterns when making a prediction.
As such, there is no model other than the raw training dataset and the only computation
performed is the querying of the training dataset when a prediction is requested.
It is a simple algorithm, but one that does not assume very much about the problem other than
that the distance between data instances is meaningful in making predictions. As such, it often
achieves very good performance.
When making predictions on regression problems, KNN will take the mean of the k most similar
instances in the training dataset. Choose the KNN algorithm:
Click the “Choose” button and select “IBk” under the “lazy” group.
The size of the neighborhood is controlled by the k parameter. For example, if set to 1, then
predictions are made using the single most similar training instance to a given new pattern for
which a prediction is requested. Common values for k are 3, 7, 11 and 21, larger for larger
dataset sizes. Weka can automatically discover a good value for k using cross validation inside
the algorithm by setting the crossValidate parameter to True.
Another important parameter is the distance measure used. This is configured in the
nearestNeighbourSearchAlgorithm which controls the way in which the training data is stored
and searched. The default is a LinearNNSearch. Clicking the name of this search algorithm will
provide another configuration window where you can choose a distanceFunction parameter. By
default, Euclidean distance is used to calculate the distance between instances, which is good for
numerical data with the same scale. Manhattan distance is good to use if your attributes differ in
measures or type.
It is a good idea to try a suite of different k values and distance measures on your problem and
see what works best.
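As a minimal sketch of what a kNN regressor does (illustrative NumPy code, not Weka's IBk implementation):

import numpy as np

# Predict by averaging the targets of the k nearest training points.
# Euclidean distance is used here; Manhattan distance would be
# np.abs(X_train - x_new).sum(axis=1) instead.
def knn_predict(X_train, y_train, x_new, k=3):
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()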
Click the “Start” button to run the algorithm on the Boston house price dataset.
You can see that with the default configuration, the KNN algorithm achieves an RMSE of 4.6.
Decision Tree
Decision trees are more recently referred to as Classification And Regression Trees, or CART.
They work by creating a tree to evaluate an instance of data, starting at the root of the tree and
moving down to the leaves (the root is at the top because the tree is drawn with an inverted
perspective) until a prediction can be made. The process of creating a decision tree works by
greedily selecting the best split point in order to make predictions, and repeating the process
until the tree reaches a fixed depth.
After the tree is constructed, it is pruned in order to improve the model's ability to generalize to
new data.
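As a sketch of the greedy split selection described above (illustrative NumPy code, not Weka's REPTree implementation), the best split point on a single feature is the threshold that minimizes the total squared error of the two groups it creates:

import numpy as np

# Try each unique value of the feature as a threshold and keep the one
# with the lowest summed squared error of the left and right groups.
def best_split(x, y):
    best_threshold, best_sse = None, np.inf
    for threshold in np.unique(x):
        left, right = y[x <= threshold], y[x > threshold]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold, best_sse = threshold, sse
    return best_threshold, best_sse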
Click the “Choose” button and select “REPTree” under the “trees” group.
The depth of the tree is defined automatically, but you can specify a maximum depth in the maxDepth attribute.
You can also choose to turn off pruning by setting the noPruning parameter to True, although
this may result in worse performance.
The minNum parameter defines the minimum number of instances supported by the tree in a leaf
node when constructing the tree from the training data.
Click the “Start” button to run the algorithm on the Boston house price dataset.
You can see that with the default configuration, the decision tree algorithm achieves an RMSE of
4.8.
Experiment no 04: Implement multidimensional visualization by adding variables such as
color, size, shape, and label by using Tableau.
Control color, size, shape, detail, text, and tooltips for marks in the view using the Marks card.
Drag fields to buttons on the Marks card to encode the mark data. Click the buttons on the Marks
card to open Mark properties. For related information on marks, see Change the Type of Mark in
the View and Marks card.
Note: The order of dimension fields on the Marks card is hierarchical from top to bottom, and
affects sorting in the view. Tableau first considers the topmost dimension field when ordering
marks in the view, and then considers the dimensions beneath it on the Marks card.
On the Marks card, click Color, and then select a color from the menu.
This updates all marks in the view to the color you choose. All marks have a default
color, even when there are no fields on Color on the Marks card. For most marks, blue is
the default color; for text, black is the default color.
From the Data pane, drag a field to Color on the Marks card.
Tableau applies different colors to marks based on the field’s values and members. For
example, if you drop a discrete field (a blue field), such as Category, on Color, the marks
in the view are broken out by category, and each category is assigned a color.
If you drop a continuous field, such as SUM(sales), on Color, each mark in the view is
colored based on its sales value.
Edit colors
To change the color palette or customize how color is applied to your marks:
On the Marks card, click Size, and then move the slider to the left or right.
The Size slider affects different marks in different ways. For example, for Pie marks it makes
the overall size of the pie bigger or smaller.
The size of your data view is not modified when you change marks using the Size slider.
However, if you change the view size, the mark size might change to accommodate the
new formatting. For example, if you make the table bigger, the marks might become
bigger as well.
From the Data pane, drag a field to Size on the Marks card.
When you place a discrete field on Size on the Marks card, Tableau separates the marks
according to the members in the dimension, and assigns a unique size to each member. Because
size has an inherent order (small to big), categorical sizes work best for ordered data like years or
quarters.
Size-encoding data with a discrete field separates the marks in the same way as
the Detail property does, and then provides additional information (a size) for each mark. For
more information, see Separate marks in the view by dimension members. When you add
categorical size encoding to a view, Tableau displays a legend showing the sizes assigned to each
member in the field on the Size target. You can modify how these sizes are distributed using the
Edit Sizes dialog box.
When you place a continuous field on Size on the Marks card, Tableau draws each mark with a
different size using a continuous range. The smallest value is assigned the smallest sized mark
and the largest value is represented by the largest mark.
When you add quantitative size encoding to the view, Tableau displays a legend showing the
range of values over which sizes are assigned. You can modify how these sizes are distributed
using the Edit Sizes dialog box.
To edit the size of marks, or change how size is being applied to marks in the view:
1. On the Size legend card (which appears when you add a field to Size on the Marks card),
click the drop-down arrow in the right-hand corner and select Edit Sizes.
2. In the Edit Sizes dialog box that appears, make your changes and then click OK.
The options available depend on whether the field being applied to Size is a continuous
or discrete field.
o For Sizes vary, click the drop-down box and select one of the following:
Automatically - Selects the mapping that best fits your data. If the data is
numeric and does not cross zero (all positive or all negative), the From
zero mapping is used. Otherwise, the By range mapping is used.
By range - Uses the minimum and maximum values in the data to
determine the distribution of sizes. For example, if a field has values from
14 to 25, the sizes are distributed across this range.
From zero - Sizes are interpolated from zero, assigning the maximum
mark size to the absolute value of the data value that is farthest from zero.
o Use the range slider to adjust the distribution of sizes. When the From zero
mapping is selected from the Sizes vary drop-down menu, the lower slider is
disabled because it is always set to zero.
o Select Reversed to assign the largest mark to the smallest value and the smallest
mark to the largest value. This option is not available if you are mapping sizes
from zero because the smallest mark is always assigned to zero.
o To modify the distribution of sizes, select the Start value in legend and End
value for range check boxes and type beginning and end values for the range.
For views where the mark type is Bar and there are continuous (green) fields on
both Rows and Columns, Tableau supports additional options and defaults for sizing the bar
marks on the axis where the bars are anchored.
The bar marks in histograms are continuous by default (with no spaces between the
marks), and are sized to match the size of the bins. See Build a Histogram for an
example.
When there is a field on Size, you can determine the width of the bar marks on the axis
where the bars are anchored by using the field on Size. To do this, click the Size card and
select Fixed.
When there is no field on Size, you can specify the width of the bar marks on the axis
where the bars are anchored in axis units. To do this, click the Size card, choose Fixed,
and then type a number in the Width in axis units field.
When there is a continuous date field on the axis where the bars are anchored, the width
of the marks is set to match the level of the date field. For example, if the level of the
continuous date field is MONTH, the bars are exactly one month wide—that is, slightly
wider for 31-day months than for 30-day months. You can configure the width of the bars
by clicking the Size card, choosing Fixed, and then typing a number in the Width in
days field, but the resulting bar widths don't take into account the varying lengths of time
units such as months and years.
From the Data pane, drag a field to Label or Text on the Marks card.
When working with a text table, the Label shelf is replaced with Text, which allows you to view
the numbers associated with a data view. The effect of text-encoding your data view depends on
whether you use a dimension or a measure.
Dimension – When you place a dimension on Label or Text on the Marks card, Tableau
separates the marks according to the members in the dimension. The text labels are
driven by the dimension member names.
Measure – When you place a measure on Label or Text on the Marks card, the text labels
are driven by the measure values. The measure can be either aggregated or disaggregated.
However, disaggregating the measure is generally not useful because it often results in
overlapping text.
Text is the default mark type for a text table, which is also referred to as a cross-tab or a
PivotTable.
From the Data pane, drag a dimension to Detail on the Marks card.
When you drop a dimension on Detail on the Marks card, the marks in a data view are separated
according to the members of that dimension. Unlike dropping a dimension on
the Rows or Columns shelf, dropping it on Detail on the Marks card is a way to show more data
without changing the table structure.
From the Data pane, drag a field to Shape on the Marks card.
When you place a dimension on Shape on the Marks card, Tableau separates the marks
according to the members in the dimension, and assigns a unique shape to each member. Tableau
also displays a shape legend, which shows each member name and its associated shape. When
you place a measure on Shape on the Marks card, the measure is converted to a discrete
measure.
Shape-encoding data separates the marks in the same way as the Detail property does, and then
provides additional information (a shape) for each mark. Shape is the default mark type when
measures are the innermost fields for both the Rows shelf and the Columns shelf.
In the view below, the marks are separated into different shapes according to the members of
the Customer Segment dimension. Each shape reflects the customer segment’s contribution to
profit and sales.
Edit shapes
By default, ten unique shapes are used to encode dimensions. If you have more than 10
members, the shapes repeat. In addition to the default palette, you can choose from a variety of
shape palettes, including filled shapes, arrows, and even weather symbols.
1. Click Shape on the Marks card, or select Edit Shape on the legend’s card menu.
2. In the Edit Shape dialog box, select a member on the left and then select the new shape in
the palette on the right. You can also click Assign Palette to quickly assign the shapes to
the members of the field.
Select a different shape palette using the drop-down menu in the upper right.
Note: Shape encodings are shared across multiple worksheets that use the same data source. For
example, if you define Furniture products to be represented by a square, they will automatically
be squares in all other views in the workbook. To set the default shape encodings for a field,
right-click (control-click on Mac) the field in the Data pane and select Default
Properties > Shape.
You can add custom shapes to a workbook by copying shape image files to the Shapes folder in
your Tableau Repository, which is located in your Documents folder. When you use custom
shapes, they are saved with the workbook. That way the workbook can be shared with others.
1. Create your shape image files. Each shape should be saved as its own file and can be in
any of several image formats including bitmap (.bmp), portable network graphic (.png),
.jpg, and graphics interchange format (.gif).
2. Copy the shape files to a new folder in the My Tableau Repository\Shapes folder in your
Documents folder. The name of the folder will be used as the name of the palette in
Tableau. In the example below, two new palettes are created: Maps and My Custom
Shapes.
3. In Tableau, click the drop-down arrow on the shape legend, and select Edit Shape.
4. Select the new custom palette in the drop-down list. If you modified the shapes while
Tableau was running, you may need to click Reload Shapes.
5. You can either assign members shapes one at a time, or click Assign Palette to
automatically assign the shapes to the members.
Note: You can return to the default palette by clicking the Reset button. If you open a workbook
that uses custom shapes that you don’t have, the workbook will show the custom shapes because
the shapes are saved as part of the workbook. However, you can click Reload Shapes in the Edit
Shapes dialog box to use the ones in your repository instead.
Below are some examples of views that use both the default and custom shape palettes.
Tips for creating custom shapes
When you create custom shapes there are a few things that you can do to improve how your
shapes look and function in the view. If you are creating your own shapes, we recommend
following general guidelines for making icons or clip art.
Suggested size - Unless you plan on using Size to make the shapes really large, you
should try to make your original shape size close to 32 pixels by 32 pixels. However, the
original size depends on the range of sizes you want available in Tableau. You can resize
the shapes in Tableau by clicking Size on the Marks card, or by using the cell size
options on the Format menu.
Adding color encoding - If you plan to also use Color to encode shapes, you should use
a transparent background. Otherwise, the entire square of the image will be colored rather
than just the symbol. GIF and PNG file formats both support transparency. GIF files
support transparency for a single color that is 100% transparent, while PNG files support
alpha channels with a range of transparency levels available on every pixel in the image.
When Tableau color encodes a symbol, the amount of transparency for each pixel won't
be modified, so you can maintain smooth edges.
Note: Avoid including too much transparency around an image. Make the size of the
custom shape as close to the size of the image as possible. Extra transparent pixels around
the edges of the image can negatively affect the hover or click behavior near the image,
especially when custom shapes overlap each other. When the actual shape area is bigger
than what is visible, it can make hovering and clicking the shape more difficult and less
predictable for users.
File formats - Tableau doesn't support symbols that are in the Enhanced Meta File
format (.emf). The shape image files can be in one of the following formats: .png, .gif,
.jpg, .bmp, and .tiff.
Experiment no 05 Simple MongoDB and its CRUD Operations
Create Operations
For MongoDB CRUD, if the specified collection doesn’t exist, the create operation will create
the collection when it’s executed. Create operations in MongoDB target a single collection, not
multiple collections. Insert operations in MongoDB are atomic on a single document level.
MongoDB provides two different create operations that you can use to insert documents into a
collection:
db.collection.insertOne()
db.collection.insertMany()
insertOne()
As the name suggests, insertOne() allows you to insert one document into the collection. For this
example, we're going to work with a collection called RecordsDB. We can insert a single entry
into our collection by calling the insertOne() method on RecordsDB. We then provide the
information we want to insert in the form of key-value pairs, which establishes the structure of
the document.
db.RecordsDB.insertOne({
name: "Marsh",
age: "6 years",
species: "Dog",
ownerAddress: "380 W. Fir Ave",
chipped: true
})
If the create operation is successful, a new document is created. The function will return an
object where "acknowledged" is "true" and "insertedId" is the newly created "ObjectId."
> db.RecordsDB.insertOne({
... name: "Marsh",
... age: "6 years",
... species: "Dog",
... ownerAddress: "380 W. Fir Ave",
... chipped: true
... })
{
"acknowledged" : true,
"insertedId" : ObjectId("5fd989674e6b9ceb8665c57d")
}
insertMany()
It's possible to insert multiple items at one time by calling the insertMany() method on the
desired collection. In this case, we pass multiple items into our chosen collection (RecordsDB)
and separate them by commas. Within the parentheses, we use square brackets to indicate that we
are passing in an array of multiple entries, i.e. the documents are nested inside a list.
db.RecordsDB.insertMany([
    {
        name: "Marsh",
        age: "6 years",
        species: "Dog",
        ownerAddress: "380 W. Fir Ave",
        chipped: true
    },
    {
        name: "Kitana",
        age: "4 years",
        species: "Cat",
        ownerAddress: "521 E. Cortland",
        chipped: true
    }
])
Read Operations
The read operations allow you to supply special query filters and criteria that let you specify
which documents you want. The MongoDB documentation contains more information on the
available query filters. Query modifiers may also be used to change how many results are
returned.
db.collection.find()
db.collection.findOne()
find()
In order to get all the documents from a collection, we can simply use the find() method on our
chosen collection. Executing just the find() method with no arguments will return all records
currently in the collection.
db.RecordsDB.find()
{ "_id" : ObjectId("5fd98ea9ce6e8850d88270b5"), "name" : "Kitana", "age" : "4 years",
"species" : "Cat", "ownerAddress" : "521 E. Cortland", "chipped" : true }
{ "_id" : ObjectId("5fd993a2ce6e8850d88270b7"), "name" : "Marsh", "age" : "6 years",
"species" : "Dog", "ownerAddress" : "380 W. Fir Ave", "chipped" : true }
{ "_id" : ObjectId("5fd993f3ce6e8850d88270b8"), "name" : "Loo", "age" : "3 years", "species" :
"Dog", "ownerAddress" : "380 W. Fir Ave", "chipped" : true }
{ "_id" : ObjectId("5fd994efce6e8850d88270ba"), "name" : "Kevin", "age" : "8 years", "species"
: "Dog", "ownerAddress" : "900 W. Wood Way", "chipped" : true }
Here we can see that every record has an assigned “ObjectId” mapped to the “_id” key.
If you want to get more specific with a read operation and find a desired subsection of the
records, you can use the previously mentioned filtering criteria to choose what results should be
returned. One of the most common ways of filtering the results is to search by value.
db.RecordsDB.find({"species":"Cat"})
{ "_id" : ObjectId("5fd98ea9ce6e8850d88270b5"), "name" : "Kitana", "age" : "4 years",
"species" : "Cat", "ownerAddress" : "521 E. Cortland", "chipped" : true }
findOne()
In order to get one document that satisfies the search criteria, we can simply use
the findOne() method on our chosen collection. If multiple documents satisfy the query, this
method returns the first document according to the natural order which reflects the order of
documents on the disk. If no documents satisfy the search criteria, the function returns null. The
function takes the following form of syntax.
db.{collection}.findOne({query}, {projection})
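For instance, here is a sketch against the same RecordsDB collection used above (the query value is illustrative):

db.RecordsDB.findOne({"species": "Dog"})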
Notice that even though multiple documents may meet the search criteria, only the first document
that matches the search condition is returned.
Update Operations
Like create operations, update operations operate on a single collection, and they are atomic at a
single document level. An update operation takes filters and criteria to select the documents you
want to update.
You should be careful when updating documents, as updates are permanent and can’t be rolled
back. This applies to delete operations as well.
For MongoDB CRUD, there are three different methods of updating documents:
db.collection.updateOne()
db.collection.updateMany()
db.collection.replaceOne()
updateOne()
We can update a currently existing record and change a single document with an update
operation. To do this, we use the updateOne() method on a chosen collection, which here is
“RecordsDB.” To update a document, we provide the method with two arguments: an update
filter and an update action.
The update filter defines which items we want to update, and the update action defines how to
update those items. We first pass in the update filter. Then, we use the “$set” key and provide the
fields we want to update as a value. This method will update the first record that matches the
provided filter.
db.RecordsDB.updateOne({name: "Marsh"}, {$set:{ownerAddress: "451 W. Coffee St.
A204"}})
{ "acknowledged" : true, "matchedCount" : 1, "modifiedCount" : 1 }
{ "_id" : ObjectId("5fd993a2ce6e8850d88270b7"), "name" : "Marsh", "age" : "6 years",
"species" : "Dog", "ownerAddress" : "451 W. Coffee St. A204", "chipped" : true }
updateMany()
updateMany() allows us to update multiple documents at once: every document that matches the
update filter is modified. This update operation uses the same syntax as updating a single
document.
db.RecordsDB.updateMany({species:"Dog"}, {$set: {age: "5"}})
{ "acknowledged" : true, "matchedCount" : 3, "modifiedCount" : 3 }
> db.RecordsDB.find()
{ "_id" : ObjectId("5fd98ea9ce6e8850d88270b5"), "name" : "Kitana", "age" : "4 years",
"species" : "Cat", "ownerAddress" : "521 E. Cortland", "chipped" : true }
{ "_id" : ObjectId("5fd993a2ce6e8850d88270b7"), "name" : "Marsh", "age" : "5", "species" :
"Dog", "ownerAddress" : "451 W. Coffee St. A204", "chipped" : true }
{ "_id" : ObjectId("5fd993f3ce6e8850d88270b8"), "name" : "Loo", "age" : "5", "species" :
"Dog", "ownerAddress" : "380 W. Fir Ave", "chipped" : true }
{ "_id" : ObjectId("5fd994efce6e8850d88270ba"), "name" : "Kevin", "age" : "5", "species" :
"Dog", "ownerAddress" : "900 W. Wood Way", "chipped" : true }
replaceOne()
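replaceOne() replaces the first document that matches the provided filter with an entirely new document; unlike updateOne() with $set, the existing fields are overwritten rather than individually modified (only the _id is preserved). A minimal sketch using the same RecordsDB collection (the replacement document below is purely illustrative):

db.RecordsDB.replaceOne({name: "Kevin"},
{
    name: "Kevin",
    age: "9 years",
    species: "Dog",
    ownerAddress: "900 W. Wood Way",
    chipped: true
})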
Delete Operations
Delete operations operate on a single collection, like update and create operations. Delete
operations are also atomic for a single document. You can provide delete operations with filters
and criteria in order to specify which documents you would like to delete from a collection. The
filter options rely on the same syntax that read operations utilize.
db.collection.deleteOne()
db.collection.deleteMany()
deleteOne()
deleteOne() is used to remove a document from a specified collection on the MongoDB server.
Filter criteria are used to specify the item to delete. It deletes the first record that matches the
provided filter.
db.RecordsDB.deleteOne({name:"Maki"})
{ "acknowledged" : true, "deletedCount" : 1 }
> db.RecordsDB.find()
{ "_id" : ObjectId("5fd98ea9ce6e8850d88270b5"), "name" : "Kitana", "age" : "4 years",
"species" : "Cat", "ownerAddress" : "521 E. Cortland", "chipped" : true }
{ "_id" : ObjectId("5fd993a2ce6e8850d88270b7"), "name" : "Marsh", "age" : "5", "species" :
"Dog", "ownerAddress" : "451 W. Coffee St. A204", "chipped" : true }
{ "_id" : ObjectId("5fd993f3ce6e8850d88270b8"), "name" : "Loo", "age" : "5", "species" :
"Dog", "ownerAddress" : "380 W. Fir Ave", "chipped" : true }
deleteMany()
deleteMany() is a method used to delete multiple documents from a desired collection with a
single delete operation. A filter document is passed into the method, and every document
matching the filter criteria is deleted, using the same syntax as deleteOne().
db.RecordsDB.deleteMany({species:"Dog"})
{ "acknowledged" : true, "deletedCount" : 2 }
> db.RecordsDB.find()
{ "_id" : ObjectId("5fd98ea9ce6e8850d88270b5"), "name" : "Kitana", "age" : "4 years", "specie
Experiment no 06 : Performing import, export and aggregation in MongoDB.
In MongoDB, aggregation operations process data records/documents and return computed
results. They collect values from various documents, group them together, and then perform
different types of operations on the grouped data, such as sum, average, minimum, and maximum,
to return a computed result. This is similar to the aggregate functions of SQL.
MongoDB provides three ways to perform aggregation:
Aggregation pipeline
Map-reduce function
Single-purpose aggregation
Aggregation pipeline
In MongoDB, the aggregation pipeline consists of stages, and each stage transforms the
documents. In other words, the aggregation pipeline is a multi-stage pipeline: in each stage,
documents are taken as input and a resulting set of documents is produced; in the next stage (if
available), those resulting documents are taken as input and produce new output, and this process
continues until the last stage. The basic pipeline stages provide filters that work like queries,
document transformations that modify the resulting documents, and other pipeline stages that
provide tools for grouping and sorting documents. You can also use the aggregation pipeline on a
sharded collection.
Let us discuss the aggregation pipeline with the help of an example. Consider a collection of
train fares. In the first stage, the $match stage filters the documents by the value of the class
field, i.e. class: "first-class", and passes the matching documents to the second stage. In the
second stage, the $group stage groups the documents by the id field to calculate the sum of the
fare for each unique id.
Here, the aggregate() function is used to perform the aggregation; it can work with three kinds of
operators: stages, expressions, and accumulators.
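A minimal sketch of this two-stage pipeline in the mongo shell (the collection name trainFares and the field names class, id, and fare are assumed for illustration):

db.trainFares.aggregate([
    { $match: { class: "first-class" } },
    { $group: { _id: "$id", totalFare: { $sum: "$fare" } } }
])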