
King Faisal University
Faculty of Computer Science and Information Technology
Computer Science Department

Assignment-1
Subject: CS 412 - Data Science
Submission Date: 9th Jan 2023
Total Marks: 15

Name: Batool Al-Sowaiq        ID: 219015035

Learning Outcome

Describe the structure of a data science project and develop a data science team

1. Consider the problem of forecasting oil production in the Kingdom of Saudi Arabia for
sale to the outside world. In view of forecasting the oil production, explain in detail
the life cycle of a data science project [5 marks]

1. Data Discovery: This is the first phase of the data science life cycle, in which we gather
data from various resources such as surveys, social media, etc. Applied to the oil production
forecasting problem, we would first identify the features or factors that affect oil production
in the Kingdom of Saudi Arabia, such as the location, quality, value, amount, price, and time
of production. After identifying the important features of the problem, we would start
collecting the required data from all possible resources.

2. Data Preparation: This phase contains two sub-steps: data cleaning and data integration.
After collecting the required data for forecasting oil production, we convert it into a common
format so we can work with it properly. Next, we take a clean subset of the data and handle
missing values with one of the data cleaning methods, such as imputation through modeling.
After that, we may merge two or more features.

3. Model planning and building: This stage is about choosing the model that best fits the oil
production data we have. Ours is a forecasting problem, so we need to try different
forecasting models on our data, such as Simple Exponential Smoothing, the Holt model, the
ETS (Error, Trend, Seasonal) model, the Naive Forecast/Moving Average, and ARIMA models,
to see which one gives the best and most reasonable forecast of oil production (a brief code
sketch of this step follows the list below).

4. Communication of Results: In this stage, as data scientists, we must share our results with
the key stakeholders and decision-makers in the oil production organization to help them take
effective actions.
5. Operationalize: If the project or service for forecasting oil production is completed
successfully, it is time to publish the work and deliver it to the end users.
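As an illustration of the data preparation and model building steps above, here is a minimal
sketch in Python. The file name (oil_production.csv), the column names (month, barrels), the
12-month hold-out, and the model settings are all assumptions made only for this example, not
part of the assignment.

```python
# Minimal sketch of data preparation and model comparison for a
# forecasting problem (hypothetical file and column names).
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.arima.model import ARIMA

# Data preparation: load, parse dates, resample to a monthly series,
# and impute any missing months by interpolation.
df = pd.read_csv("oil_production.csv", parse_dates=["month"])
series = (df.set_index("month").sort_index()["barrels"]
            .resample("MS").mean()
            .interpolate())

# Hold out the last 12 months to compare forecasts against actuals.
train, test = series[:-12], series[-12:]

# Model planning and building: fit two candidate forecasting models.
holt = ExponentialSmoothing(train, trend="add").fit()
arima = ARIMA(train, order=(1, 1, 1)).fit()

# Compare mean absolute error on the held-out period.
for name, model in [("Holt", holt), ("ARIMA(1,1,1)", arima)]:
    forecast = model.forecast(12)
    mae = (forecast - test).abs().mean()
    print(f"{name}: MAE = {mae:.1f}")
```

The model with the lowest error on the held-out months would then be the candidate carried
forward to the communication and operationalization steps.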

2. Explain the usage of statistics and probability in Data Science [2.5 marks]

Statistics and probability are at the core of data science. Probability theory underlies
prediction, and estimates and predictions form an important part of data science. With the
help of statistical methods we make estimates for further analysis; these statistical methods
in turn depend on the theory of probability, and both probability and statistics depend on data.

Statistics and probability contribute to data science in several ways, such as:


1- Bayes' theorem for implementing binary classification.
2- Variance to measure the spread (uncertainty) of the data.
3- Mean to measure the central tendency of the data.
4- Probability distributions to describe how the features of the data are distributed.
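As a small illustration of these ideas, the sketch below computes a mean and a variance and
applies Bayes' theorem to a binary classification question. The sample values, prior, and
likelihoods are hypothetical numbers chosen only for the example.

```python
# Illustration of mean, variance, and Bayes' theorem (hypothetical values).
import numpy as np

# Mean and variance of a small sample of made-up daily production figures.
sample = np.array([9.8, 10.1, 10.4, 9.9, 10.2])
print("mean:", sample.mean())           # central tendency
print("variance:", sample.var(ddof=1))  # spread / uncertainty

# Bayes' theorem for a binary classification:
# P(class | evidence) = P(evidence | class) * P(class) / P(evidence)
p_class = 0.30            # prior P(positive class), assumed
p_ev_given_class = 0.80   # likelihood P(evidence | positive), assumed
p_ev_given_not = 0.10     # likelihood P(evidence | negative), assumed

p_evidence = p_ev_given_class * p_class + p_ev_given_not * (1 - p_class)
posterior = p_ev_given_class * p_class / p_evidence
print("P(positive | evidence):", round(posterior, 3))
```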

3. Explain the usage of Machine learning and Big Data in Data Science [2.5 marks]

Big Data focuses on providing tools and techniques for managing and processing large and diverse
quantities of data. Data science continues from there, using advanced statistical techniques to
analyze big data and interpret the results in a domain-specific context. This is where machine
learning comes in: it uses the cleaned data to find hidden patterns and generates insights that
help organizations solve their problems.
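To illustrate how machine learning finds hidden patterns in cleaned data, here is a minimal
sketch using scikit-learn's k-means clustering on synthetic data; the generated data, the
number of clusters, and the feature meanings are assumptions made only for the example.

```python
# Minimal sketch: using machine learning (k-means) to find hidden
# groupings in a cleaned dataset (synthetic data for illustration).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two hypothetical features, e.g. well depth and daily output (made up).
data = np.vstack([
    rng.normal(loc=[2.0, 5.0], scale=0.4, size=(50, 2)),
    rng.normal(loc=[6.0, 1.0], scale=0.4, size=(50, 2)),
])

# Fit k-means with two clusters and inspect the discovered pattern.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print("cluster centres:\n", model.cluster_centers_)
print("first ten labels:", model.labels_[:10])
```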

4. Differentiate between Data Science and Databases [2.5 marks]

1. Data Science works with raw data, which can be unstructured or weakly structured, while a
   database holds strongly structured data.
2. In terms of priorities, Data Science focuses on speed, availability, and query richness, while
   a database focuses on consistency, error recovery, and auditability.
3. A database holds modest amounts of precious data, while Data Science deals with massive
   amounts of cheap data.
4. In terms of properties, Data Science relies on eventual consistency, while a database
   provides transactions and ACID guarantees.

5. Explain in brief about Apache Hadoop and Spark tool [2.5 marks]

1- Apache Hadoop is open-source software that provides reliable, scalable, distributed computing
for large datasets across clusters of computers using simple programming models. It is also
designed to detect and handle failures at the application layer.
2- Apache Spark is an open-source parallel processing framework that supports in-memory
processing to boost the performance of applications that analyze big data. Big data solutions
are designed to handle data that is too large or complex for traditional databases. Spark
processes large amounts of data in memory, which is much faster than disk-based alternatives.
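As a short illustration of how Spark's in-memory processing is used from code, here is a
minimal PySpark sketch that reads a dataset and aggregates it; the file name and column names
(oil_production.csv, field, barrels) are hypothetical and chosen only for the example.

```python
# Minimal PySpark sketch: read a CSV and aggregate it in memory
# (hypothetical file and column names).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("oil-production-demo").getOrCreate()

# Read a hypothetical production dataset with a header row.
df = spark.read.csv("oil_production.csv", header=True, inferSchema=True)

# Cache the DataFrame so repeated queries run from memory.
df.cache()

# Aggregate total production per field and show the top results.
(df.groupBy("field")
   .agg(F.sum("barrels").alias("total_barrels"))
   .orderBy(F.desc("total_barrels"))
   .show(10))

spark.stop()
```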
