DS Ass 1
Learning Outcome
Describe the structure of a data science project and develop a data science team
1. Consider the problem of forecasting oil production in the Kingdom of Saudi
Arabia for sale to the outside world. In view of forecasting the oil production,
explain in detail the life cycle of data science [5 marks]
1. Data Discovery: This is the first phase of the data science life cycle, in which we gather
data from various sources such as surveys, social media, etc. Applied to the oil production
forecasting problem, we would first understand which features or factors affect oil production
in the Kingdom of Saudi Arabia, such as the location, quality, value, amount, price,
and time of production. After identifying the important features of the problem, we would
start collecting the required data from all possible sources.
2. Data Preparation: This phase contains two sub-steps, data cleaning and data
integration. After we collect the required data for forecasting oil production, it is time to
convert the data into a common format so we can work with it properly. Next, we take a
clean subset of the data and handle missing values with a data cleaning method such as
imputation or modeling. After that, we may merge two or more features or data sources.
3. Model planning and building: This stage is about choosing the model that best fits the data
we have about oil production. Ours is a forecasting problem, so we need to search for and
try different forecasting models on our data, such as Simple Exponential Smoothing, Holt's
model, the ETS (Error, Trend, Seasonal) model, a Naive Forecast/Moving Average, and ARIMA
models, to see which one gives the most reasonable forecast of oil production. A short sketch
covering this step together with the preparation step follows the list below.
4. Communicate Results: In this stage, as data scientists, we must share our results with the key
stakeholders and decision-makers in the oil production organization to help them take
effective actions.
5. Operationalize: If the oil production forecasting project or service is completed
successfully, it is time to release the work to the end users.
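To make the preparation and modelling phases concrete, here is a minimal Python sketch. It assumes a hypothetical file oil_production.csv with "month" and "production" columns and uses a Holt-Winters (ETS-style) model from statsmodels; the file name, column names, and model choice are illustrative assumptions, not the only valid ones.

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Data preparation: load the raw file, put it in a common time-indexed format,
# and impute missing months (hypothetical file and column names).
df = pd.read_csv("oil_production.csv", parse_dates=["month"])
series = (df.sort_values("month")
            .set_index("month")["production"]
            .asfreq("MS")                  # monthly frequency
            .interpolate(method="time"))   # fill missing values

# Model building: fit one candidate forecasting model (Holt-Winters / ETS) and
# produce a 12-month forecast; other candidates (ARIMA, naive forecast, moving
# average) would be tried and compared in the same way.
model = ExponentialSmoothing(series, trend="add", seasonal="add",
                             seasonal_periods=12).fit()
print(model.forecast(12))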
2. Explain the usage of statistics and probability in Data Science [2.5 marks]
Statistics and probability lie at the core of data science. Estimates and predictions form an
important part of data science, and probability theory is what makes prediction possible. With
the help of statistical methods, we make estimates for further analysis, and these statistical
methods in turn depend largely on the theory of probability. Finally, both probability and
statistics depend on data.
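As a small illustration (with made-up monthly figures, not real data), statistics gives us point estimates and confidence intervals, while probability lets us quantify the chance of future outcomes:

import numpy as np
from scipy import stats

# Hypothetical monthly production figures (million barrels/day) -- synthetic
# numbers used only to illustrate estimation and probability.
production = np.array([9.8, 10.1, 10.3, 9.9, 10.4, 10.2,
                       10.0, 10.5, 10.1, 10.3, 9.7, 10.2])

# Statistics: point estimate and 95% confidence interval for the mean.
mean = production.mean()
ci = stats.t.interval(0.95, len(production) - 1,
                      loc=mean, scale=stats.sem(production))

# Probability: chance next month exceeds 10.4 under a simple normal model.
p_exceed = 1 - stats.norm.cdf(10.4, loc=mean, scale=production.std(ddof=1))

print(f"mean={mean:.2f}, 95% CI={ci}, P(next month > 10.4)={p_exceed:.2f}")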
3. Explain the usage of Machine learning and Big Data in Data Science [2.5 marks]
Big Data focuses on providing tools and techniques for managing and processing large and diverse
quantities of data. Data Science continues from there, using advanced statistical techniques to
analyze Big Data and interpret the results in a domain-specific context. This is where machine
learning comes in: it uses the cleaned data to find hidden patterns and generate insights that
help organizations solve their problems.
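As a brief illustration of how machine learning finds hidden patterns, the following sketch clusters production sites into similar groups; it assumes a hypothetical prepared table wells.csv with numeric columns, which is purely illustrative.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical cleaned table of production sites (e.g. depth, age, output).
wells = pd.read_csv("wells.csv")
X = StandardScaler().fit_transform(wells.select_dtypes("number"))

# Unsupervised learning: group similar wells to reveal hidden structure.
wells["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(wells.groupby("cluster").mean(numeric_only=True))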
4. Differentiate between Data Science and Databases [2.5 marks]
1. Data Science uses raw data, which can be unstructured or weakly structured, while a
database holds strongly structured data.
2. In terms of priorities, Data Science focuses on speed, availability, and query richness, while
databases focus on consistency, error recovery, and auditability.
3. Databases hold modest amounts of precious data, while Data Science works with massive
amounts of cheap data.
4. In terms of properties, Data Science relies on eventual consistency, while databases provide
transactions and ACID guarantees.
5. Explain in brief the Apache Hadoop and Spark tools [2.5 marks]
1- Apache Hadoop is open-source software that provides reliable, scalable, distributed computing
for large datasets across clusters of computers using simple programming models. It is also
designed to detect and handle failures at the application layer.
2- Apache Spark is an open-source parallel processing framework that supports in-memory
processing to boost the performance of applications that analyze big data. Big data solutions
are designed to handle data that is too large or complex for traditional databases. Spark
processes large amounts of data in memory, which is much faster than disk-based alternatives.
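As a minimal illustration (assuming PySpark is installed and a hypothetical oil_production.csv file with "field" and "production" columns is available locally or on HDFS), Spark can cache the data in memory and aggregate it across a cluster:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("oil-production").getOrCreate()

# Read the (hypothetical) dataset and keep it in memory for repeated queries.
df = spark.read.csv("oil_production.csv", header=True, inferSchema=True)
df.cache()

# Aggregate total production per field in parallel across the cluster.
totals = df.groupBy("field").agg(F.sum("production").alias("total_production"))
totals.orderBy(F.desc("total_production")).show()

spark.stop()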