Data Science
1. Installation of the Python/R language and the Visual Studio Code editor can be demonstrated, along with Kaggle dataset usage.
2. Write programs in Python/R and execute them in either Visual Studio Code, PyCharm Community Edition, or any other suitable environment.
3. A study was conducted to understand the effect of the number of hours students spent studying on their performance in the final exams. Write code to plot a line chart with the number of hours spent studying on the x-axis and the score in the final exam on the y-axis. Use a red '*' as the point character, label the axes and give the plot a title. (03.09.2022)

   Number of hrs spent studying (x):       10  9   2   15  10  16  11  16
   Score in the final exam (0 – 100) (y):  95  80  10  50  45  98  38  93

4. For the given dataset mtcars.csv (www.kaggle.com/ruiromanini/mtcars), plot a histogram to check the frequency distribution of the variable 'mpg' (Miles per gallon).
ChatGPT
Sure, let's break down each of your requirements and address them step by step.
To install Python, you can download it from the official Python website
(https://www.python.org/downloads/). For R, you can download it from the Comprehensive R
Archive Network (CRAN) website (https://cran.r-project.org/).
For Visual Studio Code, you can download and install it from the official Visual Studio Code
website (https://code.visualstudio.com/).
Once installed, you can set up Python and R extensions in Visual Studio Code to enable coding in
these languages.
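The extensions can also be installed from the command line; a possible sketch (the extension IDs below are assumptions, so verify them in the VS Code Marketplace):
bash
code --install-extension ms-python.python   # Python extension
code --install-extension REditorSupport.r   # R extension (ID assumed)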
For using Kaggle datasets, you can utilize the Kaggle API. First, install the Kaggle package using
pip:
bash
pip install kaggle
Then, you can download datasets directly from Kaggle using commands like:
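bash
# Example: fetch the mtcars dataset referenced in question 4
# (assumes a Kaggle API token is configured in ~/.kaggle/kaggle.json)
kaggle datasets download -d ruiromanini/mtcars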
After setting up Python and R extensions in Visual Studio Code, you can create new
Python (.py) or R (.R) files, write your code, and execute them directly within the
editor.
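For questions 3 and 4, a minimal sketch of the plotting scripts could look like this (it assumes matplotlib and pandas are installed and that mtcars.csv has been downloaded into the working directory):
python
import matplotlib.pyplot as plt
import pandas as pd

# Question 3: line chart of study hours (x) vs. final exam score (y)
hours = [10, 9, 2, 15, 10, 16, 11, 16]
scores = [95, 80, 10, 50, 45, 98, 38, 93]

plt.plot(hours, scores, 'r*-')  # red '*' point character joined by a red line
plt.xlabel('Number of hours spent studying')
plt.ylabel('Score in the final exam (0 - 100)')
plt.title('Effect of study hours on final exam performance')
plt.show()

# Question 4: histogram of the 'mpg' variable from mtcars.csv
mtcars = pd.read_csv('mtcars.csv')
plt.hist(mtcars['mpg'], edgecolor='black')
plt.xlabel('Miles per gallon (mpg)')
plt.ylabel('Frequency')
plt.title('Frequency distribution of mpg')
plt.show()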
You can execute these scripts in your preferred environment, such as Visual Studio Code or
PyCharm Community Edition, by running them as Python scripts. Make sure you have the
necessary packages installed (matplotlib, pandas) before running the scripts.
2. Consider the books dataset BL-Flickr-Images-Book.csv from Kaggle
(https://www.kaggle.com/adeyoyintemidayo/publication-of-books), which contains information
about books. Write a program to demonstrate the following:
   - Import the data into a DataFrame.
   - Find and drop the columns which are irrelevant for the book information.
   - Change the index of the DataFrame.
   - Tidy up fields in the data, such as the date of publication, with the help of a simple regular expression.
   - Combine str methods with NumPy to clean columns.
Sure, here's a Python program demonstrating those steps using the Pandas library. The column
names dropped below are assumed from the dataset's listing and may need adjusting to match
your copy of the file:
python
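import pandas as pd
import numpy as np

# Import the data into a DataFrame
df = pd.read_csv('BL-Flickr-Images-Book.csv')

# Drop columns that are irrelevant for the book information
# (column names assumed from the dataset listing; adjust if your copy differs)
to_drop = ['Edition Statement', 'Corporate Author', 'Corporate Contributors',
           'Former owner', 'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
df.drop(columns=to_drop, inplace=True, errors='ignore')

# Change the index of the DataFrame to the unique 'Identifier' column
df.set_index('Identifier', inplace=True)

# Tidy up 'Date of Publication' with a simple regular expression:
# keep the first four-digit year and convert it to a numeric value
extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
df['Date of Publication'] = pd.to_numeric(extr)

# Combine str methods with NumPy to clean the 'Place of Publication' column
pub = df['Place of Publication']
df['Place of Publication'] = np.where(pub.str.contains('London', na=False), 'London',
                             np.where(pub.str.contains('Oxford', na=False), 'Oxford',
                                      pub.str.replace('-', ' ')))

# Display the cleaned DataFrame
print(df.head())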
This program first imports the data into a DataFrame, then drops irrelevant
columns. It changes the index of the DataFrame to the 'Identifier' column. It
then tidies up the 'Date of Publication' field using a regular expression to
extract the year, and cleans up the 'Place of Publication' field using a
combination of string methods and NumPy. Finally, it displays the cleaned
DataFrame.
Here's how you can train a regularized logistic regression classifier on the iris dataset using
scikit-learn and report the classification accuracy:
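A minimal sketch matching that description (the 70/30 train/test split and the random seed are assumptions):
python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# Load the iris dataset and split it into training and testing sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Pipeline: standardize the features, then fit a logistic regression with C=1e4
model = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))
model.fit(X_train, y_train)

# Predict on the test set and report the classification accuracy
y_pred = model.predict(X_test)
print('Classification accuracy:', accuracy_score(y_test, y_pred))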
This code first loads the iris dataset, splits it into training and testing sets, and then creates a
pipeline with a StandardScaler and a LogisticRegression model with a regularization
parameter (C) set to 1e4. It then trains the model on the training set, makes predictions on the
test set, and calculates the classification accuracy using the accuracy_score function from
scikit-learn. Finally, it prints the classification accuracy.
3b Train an SVM classifier on the iris dataset using sklearn. Try different kernels and the associated
hyperparameters. Train the model with the following set of hyperparameters: RBF kernel, gamma=0.5,
one-vs-rest classifier, no feature normalization. Also try C = 0.01, 1, 10. For the above set of
hyperparameters, find the best classification accuracy along with the total number of support vectors
on the test data.
Here's how you can train an SVM classifier on the iris dataset using scikit-learn with different
kernels and hyperparameters, and then find the best classification accuracy along with the
total number of support vectors on the test data:
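A minimal sketch under those settings (the 70/30 train/test split and random seed are assumptions; one-vs-rest is realized here by wrapping SVC in OneVsRestClassifier):
python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the iris dataset; no feature normalization, as required
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Hyperparameter grid: RBF kernel, gamma=0.5, C in {0.01, 1, 10}
kernels = ['rbf']
gammas = [0.5]
Cs = [0.01, 1, 10]

best_acc, best_params, best_n_sv = 0.0, None, None
for kernel in kernels:
    for gamma in gammas:
        for C in Cs:
            # One-vs-rest SVM for the given kernel, gamma, and C
            clf = OneVsRestClassifier(SVC(kernel=kernel, gamma=gamma, C=C))
            clf.fit(X_train, y_train)
            acc = accuracy_score(y_test, clf.predict(X_test))
            # Total support vectors, summed over the binary one-vs-rest estimators
            n_sv = sum(est.n_support_.sum() for est in clf.estimators_)
            if acc > best_acc:
                best_acc, best_params, best_n_sv = acc, (kernel, gamma, C), n_sv

print('Best classification accuracy:', best_acc)
print('Best (kernel, gamma, C):', best_params)
print('Total support vectors for the best model:', best_n_sv)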
This code iterates over different combinations of kernels, gammas, and Cs, creates an SVM
classifier for each combination, trains it on the training set, predicts on the test set, calculates
accuracy, and counts the number of support vectors. Finally, it prints the best classification
accuracy along with the total number of support vectors.