
BIG DATA ANALYTICS

Table of Contents
Introduction
Aims and objectives
Critical understanding of big data management
Big data manipulation
Modeling methods
Tools and techniques
Section 1: Big Data Analytics (Python)
    Task 1: Problem Domain, Data Description, and Research Question
        1.1 Problem domain
        1.2 Data Description
        1.3 Research Question
        1.4 Statistical methods
    Task 2: Solution Exploration
        2.1 Approaches and technologies for developing big data applications
        2.2 Discussion of chosen methodological approach with justification
        2.3 Solutions and techniques for related problems
    Task 3: Solution Development
        3.1 Data pre-processing
        3.2 Descriptive Statistics
        3.3 Data Visualization
        3.4 Statistical significance
        3.5 Solution of Research Questions
    Task 4: Evaluation and Future Development
        4.1 Conclusion
        4.2 Evaluation
        4.3 Limitation
        4.4 Future Work
Section 2 – Business Intelligence (Tableau)
    Introduction
    Dataset Description
    Task 1
    Task 2
    Task 3
    Task 4
    Task 5
    Conclusion
Reference List

Introduction
“Big data analysis” is the process of extracting valuable insights and information from large
and complex datasets. Massive amounts of structured and unstructured data are referred to as
"big data," and they are produced by a variety of sources, including social media, IoT
devices, sensors, and more. As data generation across numerous industries has increased
exponentially, big data analysis has attracted a lot of attention in recent years. Analyzing big
data can provide organizations with valuable insights that can help improve decision-making,
identify new opportunities, and optimize business processes. Big data analysis involves
several steps, including data collection, cleaning, processing, and analysis. Advanced
technologies such as “machine learning”, “artificial intelligence”, and “natural language
processing” are often used to make sense of massive amounts of data. Some popular
applications of big data analysis include fraud detection, risk management, customer behavior
analysis, predictive maintenance, supply chain optimization, and more. Numerous industries,
including healthcare, finance, retail, and manufacturing, stand to benefit from big data
analysis. Big data analysis is a quickly expanding field that offers businesses useful
information and insights that can aid in decision-making and give them a competitive edge.
The researcher uses two datasets, “Billionaires” and “Electronic Sales”, to perform big
data analysis. The research study includes two different sections covering big data analysis,
data preprocessing, and data visualization with the help of the Python and Tableau platforms.
Aims and objectives
The aim of this project is to use the Tableau Platform and Python programming languages to
perform big data analytics on a precise dataset.
The objectives are:
● To perform big data visualization for answering all the research questions in Python
● To use different parameter functions to display a chart based on the corresponding
table on the Tableau platform
Critical understanding of big data management
Due to its simplicity of use and abundance of libraries and tools, Python is a well-liked
programming language for managing large amounts of data. Python's large data management
demands a thorough understanding of a number of crucial components. One of the most
critical considerations is data handling. As per the view of Kharel et al. (2020), libraries such
as Pandas, NumPy, and Dask are used to read, manipulate, and transform data in various
formats. Scalability is another important issue, and solutions for distributed computing can be
utilized to process big datasets over many nodes. Data cleansing and preparation are equally
key phases in big data analysis. Scikit-learn is one of the libraries offered by Python for data
preprocessing tasks like resolving missing values, eliminating duplicates, and feature scaling.
In order to find patterns and make predictions, big data analysis usually uses machine
learning algorithms. Python includes numerous machine learning frameworks such as
TensorFlow and Scikit-learn that allow developers to design and deploy machine learning
models easily. Ultimately, big data analysis depends on visualization. Python has a number of
libraries, such as Matplotlib and Seaborn that can be used to produce high-quality
visualizations that can assist users in drawing conclusions from sizable datasets. In
conclusion, big data management in Python is a challenging process that calls for a variety of
tools, strategies, and expertise to efficiently handle and analyze enormous datasets.
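The data-handling workflow described above can be sketched with Pandas as follows. The in-memory CSV contents and column names below are invented stand-ins for a real big-data source, since no specific file is given here.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for a large data source
raw = io.StringIO(
    "name,rank,wealth\n"
    "A,1,120.0\n"
    "B,2,95.5\n"
    "C,3,80.1\n"
)

df = pd.read_csv(raw)                    # read data in CSV format
df["wealth_usd"] = df["wealth"] * 1e9    # transform: derive a new column
top = df.sort_values("rank").head(2)     # manipulate: sort and slice
print(top["name"].tolist())
```

For genuinely large files, the same operations scale out by swapping `pandas` for `dask.dataframe`, whose API mirrors the calls shown here.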
Big data manipulation
In order to work with and analyze massive datasets, “big data manipulation” in Python uses
a range of tools and packages. As per the view of Bhatia et al. (2020), data management is
one of the most crucial components of huge data manipulation in Python. Pandas, NumPy,
and Dask are just a few of the libraries that Python provides that let users read, analyze, and
modify data in a number of formats. These libraries offer strong resources for handling large
datasets and can be used to carry out a number of tasks, including merging datasets, grouping
data, and pivoting tables. Data must first be cleansed and made ready for analysis before
analysis can start; data cleansing and preparation are equally key phases in big data analysis.
For data preprocessing, including handling missing values,
eliminating duplicates, and feature scaling, Python provides a number of packages, such as
Pandas. Data filtering is a crucial component of large data manipulation because it enables
users to extract subsets of data in accordance with predetermined standards. Pandas and
NumPy are just two of the packages available in Python for data filtering. Large data
manipulation comprises the conversion of data between different formats. Python has a
number of libraries for data manipulation, including Pandas and NumPy, as well as
visualization libraries such as Matplotlib and Seaborn that produce high-quality visuals to
help users derive insights from huge datasets. In summary, processing huge data in Python
requires a combination of tools and handling strategies.
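The merging, grouping, and filtering operations mentioned above can be illustrated with a minimal Pandas sketch; the toy tables and column names are hypothetical.

```python
import pandas as pd

# Hypothetical tables standing in for two large datasets
sales = pd.DataFrame({"city": ["NY", "NY", "LA"], "amount": [10, 20, 5]})
regions = pd.DataFrame({"city": ["NY", "LA"], "region": ["East", "West"]})

merged = sales.merge(regions, on="city")           # merge datasets on a key
totals = merged.groupby("region")["amount"].sum()  # group and aggregate
big = merged[merged["amount"] > 8]                 # filter rows by a criterion
print(totals.to_dict())
```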
Modeling methods
Many modeling strategies and procedures are used while performing large data analysis in
Python with Tableau. As per the view of Peng et al. (2021), ML methods are frequently used
in Python to find patterns and forecast data from massive datasets. TensorFlow and Scikit-
learn are only two of the packages available in Python for creating and deploying machine
learning models. These libraries offer a range of methods, such as deep learning, clustering,
regression, and classification. Data preprocessing is one of the most crucial components of large data
analysis in Python. Data must first be cleansed and made ready for analysis before analysis
can start. For data preprocessing, including handling missing values, eliminating duplicates,
and feature scaling, Python provides a number of packages, such as Pandas. Data can be used
to train machine learning models once it has undergone preprocessing. Data modeling in
Tableau is frequently done through the drag-and-drop interface. Users may quickly develop
models and visualizations because of this without having to write any code. The modeling
methods available in Tableau include grouping, regression, and forecasting. Large datasets
can be analyzed using these methods to spot patterns and trends. Also crucial to large data
research in Python is data visualization. Executing large data analyses in Python and
Tableau requires combining modeling approaches and techniques, such as machine learning
algorithms, data preprocessing, and data visualization. Using these methods and tools, one
can examine enormous datasets and glean valuable insights from them.
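A minimal sketch of the regression modeling described above, using scikit-learn on synthetic data (the values and the model choice are purely illustrative, not the project's actual model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 2x + 1, invented for demonstration
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1

model = LinearRegression().fit(X, y)   # train the regression model
pred = model.predict([[12.0]])         # predict for an unseen value
print(round(pred[0], 2))
```

The same `fit`/`predict` interface carries over to scikit-learn's clustering and classification estimators mentioned in this section.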
Tools and techniques
It takes a combination of tools and strategies to create big data analytics with Python and
Tableau. As per the view of Joe et al. (2021), data manipulation, statistical analysis, ML, data
visualization, data mining, and business intelligence tools and methodologies are all
combined when creating big data analytics in Python and Tableau. Users can acquire
important insights from huge datasets and make data-driven decisions by utilizing these tools
and strategies. Some of the most popular tools and methods for creating big data analytics in
Python and Tableau are listed below:
Python libraries: Pandas, NumPy, and Scikit-learn are just a few of the large data analytics
libraries available in Python. These libraries offer strong capabilities for machine learning,
statistical analysis, and data manipulation.
ML Algorithms: To find patterns and generate predictions from enormous datasets, machine
learning algorithms are frequently employed in big data analytics. TensorFlow and PyTorch
are only two of the libraries available in Python for creating and deploying machine learning
models.
Data preprocessing: Data needs to be cleaned up and made ready for analysis before analysis
can start. For data preprocessing, including handling missing values, eliminating duplicates,
and feature scaling, Python provides a number of packages, such as Pandas.
Data Visualization: Users may explore and comprehend data in a number of ways thanks to
data visualization, which is essential in big data analytics.
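As a sketch of the visualization tooling listed above, the following draws a simple bar chart with Matplotlib from hypothetical category counts and saves it to a file instead of displaying it (the Agg backend renders without a screen, e.g. on a server).

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Hypothetical category counts
categories = ["Tech", "Retail", "Finance"]
counts = [12, 7, 9]

fig, ax = plt.subplots()
ax.bar(categories, counts)            # simple bar chart
ax.set_ylabel("Number of companies")
ax.set_title("Companies per sector (illustrative data)")
fig.savefig("sectors.png")            # write the chart to a PNG file
```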

Section 1: Big Data Analytics (Python)
Task 1: Problem Domain, Data Description, and Research Question
1.1 Problem domain
This section is responsible for outlining the many types of Python-based big data analytics
difficulties. A challenging area for big data analytics with Python is working with enormous
datasets that cannot be processed and analyzed using normal data analysis methods. As the
digital age has progressed, massive amounts of data are being generated every day from a
variety of sources, including social media, IoT devices, sensors, and more. Companies can
use the trends and insights shown by this data to enhance their processes and gain a
competitive edge.
1.2 Data Description

Figure 1.2.1: Dataset


(Source: Kaggle)
The image shows the “billionaires” dataset, which has 2615 rows and 21 columns. The columns
include “name”, “rank”, “year”, “company.founded”, “company.name”,
“company.relationship”, “company.type”, “location.region” and so on. The dataset holds all
the information about each billionaire and their company. Once the dataset has been imported
into the Jupyter platform, the researcher completes all the required tasks of this research study.
1.3 Research Question

There are four research questions that arise in this particular section. The researcher has to
answer all the questions with the help of the Python language in the Jupyter platform. The
research questions are discussed below:
● What are the top 10 countries with the highest number of billionaires?
● What industries/sectors are most successful?
● What are the main industries with the highest number of women billionaires?
● What age range represents the highest and lowest number of billionaires?
1.4 Statistical methods
A large variety of statistical techniques are available in Python for data analysis. These
techniques can be used to model relationships between variables, test hypotheses, and
evaluate and summarize data. Among the often-employed statistical techniques in Python are:
Hypothesis testing: Python includes tools for performing hypothesis tests, including the t-test and
ANOVA. As per the view of Wang (2022), these tests can assist in determining the
statistical significance of a difference between groups.
Regression analysis: Modeling the relationship between variables using regression analysis
is a powerful statistical technique. As per the view of Al Ghivary et al. (2023), regression
analysis is supported by a number of Python modules, including statsmodels and scikit-learn.
Time series analysis: It is a statistical technique used to examine time series data. Pandas and
Statsmodels are only two of the time series analysis libraries that Python offers.
Clustering: It is a statistical technique for grouping related data points. As per the view
of Guerrero-Prado et al. (2020), many clustering methods, including K-means and hierarchical
clustering, are available in Python.
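For example, the two-sample t-test mentioned under hypothesis testing can be run with SciPy; the two sample groups below are invented for illustration.

```python
from scipy import stats

# Two hypothetical groups of measurements
group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [6.0, 6.2, 5.9, 6.1, 6.3]

# Independent two-sample t-test: is the difference in means significant?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(p_value < 0.05)  # a small p-value suggests a statistically significant difference
```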

Task 2: Solution Exploration
2.1 Approaches and technologies for developing big data applications
Python provides a wide range of tools and techniques for building big data analytics
applications. As per the view of Musazade (2022), by integrating these technologies and
methodologies, developers may quickly process, analyze, and extract insights from massive
amounts of data, resulting in more informed business decisions. A few of the
powerful Python modules that can be used to efficiently handle and manage massive amounts
of structured and unstructured data include Pandas, NumPy, Dask, and Apache Spark. Scikit-
learn, TensorFlow, and Keras are just a handful of the numerous ML libraries that are
available in Python that may be used to build prediction models and derive conclusions from
enormous volumes of data. Prediction charts can grow far more complex in models that are
more complicated, such as those employed in time-series analysis or machine learning. A
prediction plot for a neural network, for instance, can show the actual observed values
alongside the expected values of the model while also including extra details like error bars
or training/validation set performance data. Data visualization is a crucial tool for data
analysis since it makes it possible to convey complex information in an understandable and
direct way. The researcher can make a variety of visualizations using Python packages for
data visualization to help them comprehend the data and share insights with others.
Python includes a number of visualization tools that can be used to create interesting and
instructive huge data visualizations, such as Matplotlib, Seaborn, Plotly, and Bokeh.
2.2 Discussion of chosen methodological approach with justification
Using Python programming languages, the researcher performed the preprocessing and
visualization for this work. Visualization is the process of representing data graphically or
visually in order to clearly convey concepts or information. Visualization is an essential
component of big data analytics because it makes complex data patterns, relationships, and
trends easier to understand for users. In big data analytics, visualization techniques like
scatter plots, line charts, histograms, box plots, heat maps, and geographical maps are
frequently employed. Anomalies and outliers can be found via visualization, as can trends,
different datasets can be compared, and results can be presented. Preprocessing is the
cleaning and translation of raw data into a format that can be easily studied using machine
learning algorithms or other statistical techniques, and it is a foundational step in big data
analytics.
Data Cleaning

This entails addressing outliers, completing any data gaps, and correcting any inaccuracies.
Data cleaning is a process that looks for and corrects errors or discrepancies in the data to
improve its quality and correctness.
Data Transformation
Data transformation is the process of converting data from one format to another in order to
get it ready for analysis or modeling. A dataset must be transformed into one that is suitable
for the desired modeling or analytic task.
2.3 Solutions and techniques for related problems
The issues of big data analysis include those related to data processing, analysis, and
visualization. Fortunately, there are a variety of approaches and methods available to get over
these obstacles and boost the effectiveness and efficiency of big data analysis.
Data storage is one of the main difficulties in big data analysis; big data sets may therefore
necessitate specialized storage solutions. These systems offer scalable and distributed storage
options that make it possible to process big data sets effectively.

Task 3: Solution Development
3.1 Data pre-processing
As per the view of Musazade (2022), a well-liked environment for handling and analyzing
data in Python is Jupyter. Jupyter is an interactive environment that enables
users to create and share documents that contain live code, graphics, and narrative text.
Python is a flexible programming language that offers a wide variety of tools and frameworks
for data analysis. Data analysts can carry out a variety of data processing and analysis tasks
using Python Jupyter notebooks. Data input and export, data cleaning, data transformation,
and data visualization are supported by notebooks. Jupyter notebooks streamline the data
analysis process by combining the code, visuals, and narrative text into a single document.

3.1.1 Load Data


(Source: Created in Jupyter)
Here, the researcher loads the dataset into a dataframe named “data”.
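Since the notebook cell is only available as an image, the loading step can be sketched as below; the column names and rows are hypothetical stand-ins, and a real run would call `pd.read_csv` on the actual CSV file instead of the in-memory text.

```python
import io
import pandas as pd

# Stand-in for reading the billionaires CSV file; rows are invented
csv_text = (
    "name,rank,year,demographics.age,location.citizenship\n"
    "Person A,1,2014,58,United States\n"
    "Person B,2,2014,74,Mexico\n"
    "Person C,3,2014,77,Spain\n"
)
data = pd.read_csv(io.StringIO(csv_text))  # load into a dataframe named "data"
print(data.head())
```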

3.1.2 Data preprocessing
(Source: Created in Jupyter)
The researcher performs data preprocessing by calculating the shape of the dataset.
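The shape check shown in the figure amounts to the following; a small stand-in dataframe is used here, while the real dataset reports 2615 rows and 21 columns.

```python
import pandas as pd

# Small stand-in dataframe; the real dataset has 2615 rows and 21 columns
data = pd.DataFrame({"name": ["A", "B"], "rank": [1, 2]})
print(data.shape)  # (rows, columns) tuple
```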

12
3.1.3 Data preprocessing
(Source: Created in Jupyter)
The researcher performs data preprocessing by removing and dropping null values from the
imported dataset.
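The null-handling step can be sketched as follows, using an invented stand-in dataframe with one missing value.

```python
import numpy as np
import pandas as pd

# Stand-in dataframe with a missing value
data = pd.DataFrame({"name": ["A", "B", "C"], "age": [50, np.nan, 70]})
print(data.isnull().sum().to_dict())  # count null values per column
cleaned = data.dropna()               # drop rows containing null values
```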

13
3.1.4 Data preprocessing
(Source: Created in Jupyter)
The researcher performs data preprocessing by finding correlations between all variables and
dropping unnecessary values from the imported dataset.
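The correlation and column-dropping step can be sketched like this; the numeric values and the choice of which column to drop are hypothetical.

```python
import pandas as pd

# Stand-in numeric dataframe
data = pd.DataFrame({
    "rank": [1, 2, 3, 4],
    "wealth": [80, 60, 40, 20],
    "year": [2014, 2014, 2014, 2014],
})
corr = data.corr()                  # pairwise correlations between numeric columns
data = data.drop(columns=["year"])  # drop a column judged unnecessary
print(corr.loc["rank", "wealth"])
```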

14
3.1.5 Data preprocessing
(Source: Created in Jupyter)
The researcher performs data preprocessing by statistical analysis.
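The statistical-analysis step presumably uses Pandas' `describe`, which can be sketched on a stand-in column:

```python
import pandas as pd

# Stand-in dataframe with invented ages
data = pd.DataFrame({"age": [50, 60, 70, 80]})
summary = data.describe()  # count, mean, std, min, quartiles, max per column
print(summary.loc["mean", "age"])
```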

15
3.1.6 Data preprocessing
(Source: Created in Jupyter)
The researcher performs data preprocessing by finding unique values present in the imported
dataset.
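Finding unique values can be sketched as below; the column name and values are invented stand-ins.

```python
import pandas as pd

# Stand-in column with repeated values
data = pd.DataFrame({"location.citizenship": ["US", "US", "Mexico", "Spain"]})
uniques = data["location.citizenship"].unique()  # distinct values in a column
print(sorted(uniques))
```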
3.2 Descriptive Statistics
Descriptive statistics are techniques used to summarize and describe a data set's fundamental
characteristics. NumPy, Pandas, and SciPy are just a few of the modules that Python offers
for descriptive statistics. One of the most commonly used
descriptive statistics in Python is the mean, which represents the average value of a data set.
The mean can be calculated using NumPy's "mean" function or Pandas' "mean" method.
Another frequently used statistic is the standard deviation, which measures the spread of the
data around the mean. As per the view of Guerrero-Prado et al. (2020), the standard deviation can
be calculated using NumPy's "std" function or Pandas' "std" method. Here, the researcher
performs a descriptive analysis process in an efficient manner.
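The mean and standard deviation computations described above can be sketched on an invented data column. One caveat worth noting: Pandas' `std` method defaults to the sample standard deviation (ddof=1), while NumPy's `std` function defaults to the population standard deviation (ddof=0), so the two can disagree on the same data.

```python
import numpy as np
import pandas as pd

values = pd.Series([10.0, 20.0, 30.0, 40.0])  # hypothetical data column

mean_pd = values.mean()               # Pandas' "mean" method
mean_np = np.mean(values.to_numpy())  # NumPy's "mean" function
std_pd = values.std()                 # Pandas: sample std (ddof=1) by default
std_np = np.std(values.to_numpy())    # NumPy: population std (ddof=0) by default
print(mean_pd, round(std_pd, 3), round(std_np, 3))
```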
3.3 Data Visualization
Data visualization is the process of finding correlations as well as trends between different
types of data that are present in the imported dataset. Various types of library functions are
used for visualizing data through different visual elements such as graphs, charts,
and maps. As per the view of Kharel et al. (2020), it is mainly used as a graphical
representation of data and information. The ability to better understand and share insights
from the data is made possible by data visualization, which is a crucial step in the data
analysis process. The researcher can spot patterns, trends, and relationships in the data by
producing visualizations that might not be visible from the data's raw statistics or tables.

3.3.1 Data Visualization


(Source: Created in Jupyter)
The researcher performs data visualization by plotting the box plot to represent the column
“demographics.age”.
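Since the notebook cell is an image, the box plot can be roughly reproduced as follows; the ages are invented stand-ins for the "demographics.age" column, and the Agg backend is used so the figure renders without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ages standing in for the "demographics.age" column
ages = pd.Series([45, 52, 60, 63, 67, 71, 75, 80, 88])

fig, ax = plt.subplots()
ax.boxplot(ages)               # box plot: median, quartiles, whiskers, outliers
ax.set_ylabel("demographics.age")
fig.savefig("age_boxplot.png")
```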

3.3.2 Data Visualization


(Source: Created in Jupyter)
The researcher performs data visualization by plotting the line plot to represent the column
“location.gdp”.

3.3.3 Data Visualization
(Source: Created in Jupyter)
The researcher performs data visualization by plotting the pair plot to represent all the
columns of the dataset.

3.3.4 Data Visualization


(Source: Created in Jupyter)
The researcher performs data visualization by plotting the pie chart to represent the column
“year”.

3.3.5 Data Visualization


(Source: Created in Jupyter)
The researcher performs data visualization by plotting the bar plot to represent the column
“location.gdp”.

3.3.6 Data Visualization


(Source: Created in Jupyter)
The researcher performs data visualization by plotting the histogram plot to represent the
column “demographics.age”.

3.3.7 Data Visualization


(Source: Created in Jupyter)
The researcher performs data visualization by plotting the line plot to represent the column
“demographics.age”.

3.3.8 Data Visualization
(Source: Created in Jupyter)
The researcher performs data visualization by plotting the box plot to represent the column
“wealth.worth in billions”.

3.4 Statistical significance


3.5 Solution of Research Questions

3.5.1 Question 1
(Source: Created in Jupyter)
Here, the researcher performs data visualization by plotting the bar plot to calculate the top
10 countries with the highest number of billionaires.
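A plausible sketch of the question-1 computation is shown below; the column name "location.citizenship" and the mini-dataset are assumptions, since the original notebook code appears only as an image.

```python
import pandas as pd

# Invented values standing in for the full "location.citizenship" column
citizenship = ["United States"] * 4 + ["China"] * 3 + ["Germany"] * 2 + ["India"]
data = pd.DataFrame({"location.citizenship": citizenship})

# Count billionaires per country and keep the ten largest counts
top10 = data["location.citizenship"].value_counts().head(10)
print(top10.to_dict())
# top10.plot(kind="bar") would draw a bar chart like the one in the figure
```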

3.5.2 Question 2
(Source: Created in Jupyter)
Here, the researcher performs data visualization by plotting the bar plot to find out the most
successful sectors or industries.

3.5.3 Question 3
(Source: Created in Jupyter)
Here, the researcher performs data visualization by plotting the pie chart to find out the main
industries with the highest number of women billionaires.

3.5.4 Question 4
(Source: Created in Jupyter)
Here, the researcher performs data visualization by plotting a histogram to find the age
range with the highest and lowest numbers of billionaires.
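The question-4 computation can be sketched by bucketing ages into decade ranges with `pd.cut`; the ages below are invented and the column name is assumed.

```python
import pandas as pd

# Hypothetical ages standing in for "demographics.age"
ages = pd.Series([35, 42, 48, 55, 58, 61, 66, 72, 79, 85])

# Bucket ages into decades and count billionaires per range
bins = [30, 40, 50, 60, 70, 80, 90]
counts = pd.cut(ages, bins=bins).value_counts().sort_index()
print(counts.to_dict())
# counts.idxmax() / counts.idxmin() give the most and least populated age ranges
```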

Task 4: Evaluation and Future Development
4.1 Conclusion
The researcher was able to perform preprocessing and a data type check after reviewing the
entirety of section one's work. In this project, the researcher performs big data analytics using
Python, while the other part is based on prediction and calculation using various
parameter functions in the Tableau platform. Python programming, along with related tools,
is used in big data analytics to examine massive, intricate data sets. Large amounts of data
must be processed, stored, and analyzed in order to uncover insightful trends and patterns that
can guide company strategy and decisions. The ability to create perceptive visualizations to
better understand the data is a critical part of big data analysis and Python provides a range of
visualization libraries like Matplotlib and Seaborn. These libraries offer a range of visuals,
from simple line charts and scatter plots to more complex heat maps and 3D visualizations. In
addition to providing modules for data manipulation and visualization, Python also provides
modules for distributed computing, which is crucial for processing large volumes of data.
The researcher also obtained numerous visualization plots after thoroughly examining the
data. Big data analytics
built on Python offers a powerful as well as flexible toolkit for managing vast and complex
datasets. It is a popular choice for big data analytics due to its popularity as a programming
language and the availability of strong data analysis and machine learning modules such as
NumPy, Pandas, as well as Scikit-learn. Big data analytics built on the Python language
offers a variety of chances for data-driven insights and decision-making in a variety of fields
and applications.
4.2 Evaluation
This section is in charge of describing how to evaluate big data analytics using Python. As
part of the evaluation of big data analytics, the efficacy and usefulness of data-driven insights
and decision-making obtained by analyzing enormous datasets using a variety of analytical
approaches and tools are evaluated. The process of turning data into insightful knowledge
that can be applied to business choices is known as business intelligence. With the help of
Tableau, users can build interactive dashboards and visualizations to analyze data. The
quality of the analytical models and algorithms used in big data analytics determines how
accurate predictions and judgments may be made. The evaluation should rate the model's
predicted outcomes in addition to identifying trends and patterns.
4.3 Limitation
Big data analysis calls for expertise in distributed computing and parallel processing in
addition to programming skills in Python. Although Python includes distributed computing
frameworks like Apache Spark, using them successfully necessitates a certain amount of
experience and comprehension of the underlying concepts. Overall, Python is a flexible
language for big data analysis, but it's vital to take into account its constraints and difficulties
when working with extraordinarily massive datasets. Due to the enormous volumes of data
generated by many sources, including social media, sensors, and other digital platforms, big
data analysis has become more and more popular in recent years. Yet, there are several
restrictions related to big data analysis that should be taken into account. The issue of data
quality is one of the main restrictions. Big data analysis makes the premise that the data it
uses are correct and trustworthy, but in practice, the data may be lacking, inconsistent, or
inaccurate.
4.4 Future Work
Big data analytics depends on the freshness of data for decision-making and real-time
analysis. Big data analytics has been developing quickly in recent years, and its prospects are
bright. Big data analytics has the potential to revolutionize a wide range of industries and
sectors due to the ongoing development of enormous and diversified data sets. The constantly
evolving and improving Python libraries like NumPy, Pandas, and Scikit-learn make it easier
for analysts and data scientists to work with massive data. There are already several new
libraries emerging that offer distinct features for big data analytics. Big data analytics are in
high demand as more and more businesses see the benefits of data-driven decision-making.
Python is a great option for big data analytics projects because of how well-liked it is as a
programming language and how well it can handle massive data.

Section 2 – Business Intelligence (Tableau)
Introduction
Tableau is an effective tool for data visualization and big data analysis. In order to analyze
massive amounts of data and get insightful knowledge, one can connect Tableau to a number
of data sources. Here, the researcher performs the role of a data analyst for answering all the
business questions with the help of data visualization in an efficient manner. The researcher
used a dataset for data visualization on the Tableau software platform.
Dataset Description

Figure 1: Chosen dataset


(Source: Kaggle)
The graphic displays the "Electronic Sales" dataset, which has 12037 rows and 6 columns.
Product name, order id, price, order date, quantity ordered, and purchase address are all listed
in the columns. The dataset holds all the sales information through the columns. Once the
dataset has been imported into the Tableau platform, the researcher completes all the required
tasks of this research study.
Task 1

Figure 2: Top 10 highest-selling products
(Source: Tableau)
Here the researcher visualized the top 10 highest-selling products of Walmart.

Figure 3: Top 10 lowest-selling products


(Source: Tableau)
Here the researcher visualized the top 10 lowest-selling products of Walmart.

Task 2

Figure 4: Sum and Average sales per city
(Source: Tableau)
Here, the researcher performs the visualization to display the sum and average sales per city.
Task 3

Figure 5: Weekly sales


(Source: Tableau)
Here, the researcher performs the visualization to display the weekly sales per city.
Task 4

Figure 6: Month of Order Date
(Source: Tableau)
Here, the researcher calculates the month of order date to represent a 6-month warranty from
the date of purchase.

Figure 7: 6-month date function
(Source: Tableau)
Here, the researcher performs the visualization to display the 6-month warranty from the date
of purchase.
Task 5

Figure 8: Dashboard
(Source: Tableau)

Here, the researcher creates a dashboard that represents all the task sheets.
Conclusion
Tableau is used for data analysis, data preprocessing, collaboration, and sharing big data
insights. The researcher used data functions and parameters to create various plots,
charts, tables, and maps. Here the researcher answered all the questions through data
visualization and created an interactive dashboard to display all the task sheets.

Reference List
Al Ghivary, R., Mawar, M., Wulandari, N. and Srikandi, N., 2023. PERAN VISUALISASI
DATA UNTUK MENUNJANG ANALISA DATA KEPENDUDUKAN DI INDONESIA.
PENTAHELIX, 1(1), pp.57-62.
Bhatia, K., Chhabra, B. and Kumar, M., 2020, November. Data analysis of various terrorism
activities using big data approaches on global terrorism database. In 2020 Sixth International
Conference on Parallel, Distributed and Grid Computing (PDGC) (pp. 137-140). IEEE.
Guerrero-Prado, J.S., Alfonso-Morales, W., Caicedo-Bravo, E., Zayas-Pérez, B. and
Espinosa-Reza, A., 2020. The power of big data and data analytics for AMI data: A case
study. Sensors, 20(11), p.3289.
Ide, N., Serout, A., Rankel, T. and Dengler, T., 2020. Leveraging big data analysis to enhance
the validation of EGT-Systems. In 20. Internationales Stuttgarter Symposium: Automobil-
und Motorentechnik (pp. 297-313). Springer Fachmedien Wiesbaden.
Joe, V., Raj, J.S. and Smys, S., 2021. Towards Efficient Big Data Storage With MapReduce
Deduplication System. International Journal of Information Technology and Web
Engineering (IJITWE), 16(2), pp.45-57.
Kharel, T.P., Ashworth, A.J., Owens, P.R. and Buser, M., 2020. Spatially and temporally
disparate data in systems agriculture: Issues and prospective solutions. Agronomy Journal,
112(5), pp.4498-4510.
Musazade, N., 2022. Understanding the relevant skills for data analytics-related positions: An
empirical study of job advertisements.
Peng, J., Wu, W., Lockhart, B., Bian, S., Yan, J.N., Xu, L., Chi, Z., Rzeszotarski, J.M. and
Wang, J., 2021, June. Dataprep. eda: task-centric exploratory data analysis for statistical
modeling in python. In Proceedings of the 2021 International Conference on Management of
Data (pp. 2271-2280).
Sousa, B.C., Valente, R., Krueger, A., Schmid, E., Cote, D.L. and Neamtu, R., 2022,
February. Investigating the Suitability of Tableau Dashboards and Decision Trees for
Particulate Materials Science and Engineering Data Analysis. In TMS 2022 151st Annual
Meeting & Exhibition Supplemental Proceedings (pp. 691-701). Cham: Springer
International Publishing.
Wang, Y., 2022, October. Construction and Application of Precision Marketing System of E-
commerce Platform under the Background of Big Data. In Proceedings of the International
Conference on Information Economy, Data Modeling and Cloud Computing, ICIDC 2022,
17-19 June 2022, Qingdao, China.
