DA Practical Answers
Create a 'sales' dataset (random 500 entries). Build a linear regression model by identifying the
independent and target variables. Split the variables into training and testing sets in a 7:3
ratio and print them. Build a simple linear regression model.
import pandas as pd
import numpy as np
def generate_sales_data(num_entries=500):
np.random.seed(0)
data = {
return pd.DataFrame(data)
sales_data = generate_sales_data()
y = sales_data['sales']
# Split the dataset into training and testing sets (70% training, 30% testing)
print("Training set:")
print("X_train shape:", X_train.shape)
print("\nTesting set:")
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
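The fragment above omits the generated columns, the feature selection and the train/test split call. A minimal runnable sketch is given below; the 'advertising' feature column and the random value ranges are assumptions made for illustration, while the 'sales' target and the 7:3 split follow the question.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

def generate_sales_data(num_entries=500):
    np.random.seed(0)
    advertising = np.random.uniform(10, 100, num_entries)                 # hypothetical feature
    sales = 50 + 5 * advertising + np.random.normal(0, 25, num_entries)   # target with noise
    return pd.DataFrame({'advertising': advertising, 'sales': sales})

sales_data = generate_sales_data()
X = sales_data[['advertising']]   # independent variable
y = sales_data['sales']           # target variable

# Split the dataset into training and testing sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print("Training set:")
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("\nTesting set:")
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

# Simple linear regression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)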
import numpy as np
def generate_realestate_data(num_entries=500):
np.random.seed(0)
data = {
return pd.DataFrame(data)
realestate_data = generate_realestate_data()
# Identify independent (X) and target (y) variables
X = realestate_data[['flat', 'houses']]
y = realestate_data['purchase']
# Split the dataset into training and testing sets (70% training, 30% testing)
print("Training set:")
print("\nTesting set:")
model = LinearRegression()
model.fit(X_train, y_train)
print("\nCoefficients:", model.coef_)
print("Intercept:", model.intercept_)
Create a 'user' dataset having 5 columns, namely: User ID, Gender, Age, Estimated Salary and
Purchased. Build a logistic regression model that can predict, on the given parameters, whether a
person will buy a car or not.
import pandas as pd
import numpy as np
def generate_user_data(num_entries=500):
np.random.seed(0)
data = {
return pd.DataFrame(data)
user_data = generate_user_data()
y = user_data['Purchased']
# Split the dataset into training and testing sets (70% training, 30% testing)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
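A runnable sketch of this practical with the missing imports, data columns and accuracy computation filled in. Only 'Purchased' appears in the fragment; the other column names, the value ranges and the purchase rule are assumptions, and Gender is left out of the features to avoid extra encoding.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def generate_user_data(num_entries=500):
    np.random.seed(0)
    age = np.random.randint(18, 60, num_entries)
    salary = np.random.randint(15000, 150000, num_entries)
    data = {
        'User ID': np.arange(1, num_entries + 1),
        'Gender': np.random.choice(['Male', 'Female'], num_entries),
        'Age': age,
        'EstimatedSalary': salary,
        # Assumed rule: older or higher-paid users are more likely to purchase
        'Purchased': ((age > 40) | (salary > 100000)).astype(int)
    }
    return pd.DataFrame(data)

user_data = generate_user_data()
X = user_data[['Age', 'EstimatedSalary']]
y = user_data['Purchased']

# Split the dataset into training and testing sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))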
Build a simple linear regression model for fish species weight prediction.
import pandas as pd
# 3. Split data
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
predictions = model.predict(new_data)
print(predictions)
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Weight')
plt.ylabel('Predicted Weight')
plt.show()
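The fragment never shows how the data is loaded or which columns are used. A sketch assuming the Kaggle Fish Market dataset saved as 'Fish.csv', with 'Length1' as the single predictor (a hypothetical choice) and 'Weight' as the target.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# 1. Load data (file name and columns assumed from the Fish Market dataset)
fish = pd.read_csv('Fish.csv')
X = fish[['Length1']]
y = fish['Weight']

# 2-3. Split data (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 4. Fit a simple linear regression model and evaluate it
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print("R^2 score:", r2)

# 5. Predict weights for new, unseen lengths (illustrative values)
new_data = pd.DataFrame({'Length1': [20.0, 30.0]})
predictions = model.predict(new_data)
print(predictions)

# 6. Plot actual vs predicted weight
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Weight')
plt.ylabel('Predicted Weight')
plt.show()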
Use the Iris dataset. Write a Python program to view some basic statistical details like
percentile, mean, std etc. of the species 'Iris-setosa', 'Iris-versicolor' and 'Iris-virginica'.
Apply logistic regression on the dataset to identify the different species (setosa, versicolor,
virginica) of Iris flowers given just 4 features: sepal and petal lengths and widths. Find the
accuracy of the model.
import pandas as pd
iris = load_iris()
iris_df['Species'] = iris.target
print("Species:", species)
print(species_data.describe())
print("\n")
X = iris_df.iloc[:, :-1] # Features: sepal length, sepal width, petal length, petal width
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Calculate accuracy
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
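A complete sketch of this practical using scikit-learn's built-in copy of the Iris dataset: describe() is printed per species, then logistic regression is fit on the four features and evaluated.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['Species'] = iris.target

# Basic statistical details (percentiles, mean, std, ...) per species
for label, species in enumerate(iris.target_names):
    species_data = iris_df[iris_df['Species'] == label]
    print("Species:", species)
    print(species_data.describe())
    print("\n")

# Logistic regression on the 4 features
X = iris_df.iloc[:, :-1]   # sepal length, sepal width, petal length, petal width
y = iris_df['Species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))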
Create the following dataset in Python: (tid=1, items=bread, milk), (tid=2, items=bread, diaper,
beer, eggs), (tid=3, items=milk, diaper, beer, coke), (tid=4, items=bread, milk, diaper, beer),
(tid=5, items=bread, milk, diaper, coke). Convert the categorical values into numeric format.
Apply the Apriori algorithm on the above dataset to generate the frequent itemsets and
association rules. Repeat the process with different minimum support values.
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd
dataset = [
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
print(frequent_itemsets)
print("\n")
# Generate association rules
print(rules)
print("\n")
Create your own transactions dataset and apply the Apriori algorithm on it (same as above).
Download the market basket dataset. Write a Python program to read the dataset and display its
information. Preprocess the data (drop null values etc.), convert the categorical values into
numeric format, and apply the Apriori algorithm on the above dataset to generate the frequent
itemsets and association rules.
import pandas as pd
print("Dataset information:")
print(data.info())
data.dropna(inplace=True)
print(data.info())
te = TransactionEncoder()
data_encoded = te.fit_transform(data.values)
df = pd.DataFrame(data_encoded, columns=te.columns_)
# Apply the Apriori algorithm
print("\nFrequent Itemsets:")
print(frequent_itemsets)
print("\nAssociation Rules:")
print(rules)
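The fragment never shows how the market basket file is read or how its rows become transactions. A sketch assuming a file named 'Market_Basket_Optimisation.csv' (a common version of this dataset) in which every row is one transaction and every cell is one item; the support and confidence thresholds are illustrative.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Read the dataset (file name assumed; each row is a transaction, each cell an item)
data = pd.read_csv('Market_Basket_Optimisation.csv', header=None)
print("Dataset information:")
print(data.info())

# Preprocess: turn each row into a list of items, dropping NaN cells
transactions = [
    [str(item) for item in row if pd.notna(item)]
    for row in data.values
]

# Convert the categorical values into numeric (boolean) format
te = TransactionEncoder()
data_encoded = te.fit(transactions).transform(transactions)
df = pd.DataFrame(data_encoded, columns=te.columns_)

# Apply the Apriori algorithm
frequent_itemsets = apriori(df, min_support=0.02, use_colnames=True)
print("\nFrequent Itemsets:")
print(frequent_itemsets)

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.3)
print("\nAssociation Rules:")
print(rules)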
Download the groceries dataset. Write a Python program to read the dataset and display its
information. Preprocess the data (drop null values etc.), convert the categorical values into
numeric format, and apply the Apriori algorithm on the above dataset to generate the frequent
itemsets and association rules.
import pandas as pd
print("Dataset information:")
print(data.info())
data.dropna(inplace=True)
print(data.info())
data_encoded = te.fit_transform(data.values)
df = pd.DataFrame(data_encoded, columns=te.columns_)
print("\nFrequent Itemsets:")
print(frequent_itemsets)
print("\nAssociation Rules:")
print(rules)
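The same pipeline works for the groceries dataset; only the way transactions are read tends to differ. A short sketch assuming 'groceries.csv' stores one comma-separated transaction per line (the file name and format are assumptions).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Read one comma-separated transaction per line (file name/format assumed)
with open('groceries.csv') as f:
    transactions = [line.strip().split(',') for line in f if line.strip()]

te = TransactionEncoder()
data_encoded = te.fit(transactions).transform(transactions)
df = pd.DataFrame(data_encoded, columns=te.columns_)

frequent_itemsets = apriori(df, min_support=0.02, use_colnames=True)
print("\nFrequent Itemsets:")
print(frequent_itemsets)

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.3)
print("\nAssociation Rules:")
print(rules)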
Write Python code to implement the Apriori algorithm. Test the code on any standard dataset.
from collections import defaultdict
class Apriori:
self.min_support = min_support
self.min_confidence = min_confidence
self.itemsets = None
self.transactions = None
itemsets = defaultdict(int)
itemsets[frozenset([item])] += 1
return itemsets
def _get_frequent_itemsets(self, itemsets, num_transactions):
frequent_itemsets = {}
frequent_itemsets[item] = support
return frequent_itemsets
candidates = set()
union_set = itemset1.union(itemset2)
if len(union_set) == len(itemset1) + 1:
candidates.add(union_set)
return candidates
rules = []
if len(itemset) >= 2:
antecedent = frozenset(combination)
support = frequent_itemsets[itemset]
return rules
def fit(self, transactions):
self.transactions = transactions
num_transactions = len(transactions)
itemsets = self._get_itemsets(transactions)
self.itemsets = frequent_itemsets
def generate_association_rules(self):
return association_rules
# Example usage
if __name__ == "__main__":
transactions = [
apriori.fit(transactions)
frequent_itemsets = apriori.itemsets
print("Frequent Itemsets:")
print("\nAssociation Rules:")
import nltk
def preprocess_text(text):
return processed_text
processed_text = preprocess_text(text)
sentences = sent_tokenize(processed_text)
# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [[word for word in tokens if word not in stop_words] for tokens in word_tokens]
vectorizer = CountVectorizer()
similarity_matrix = cosine_similarity(X[1:], X)
importance_scores = similarity_matrix.sum(axis=1)
sorted_indices = importance_scores.argsort()[::-1]
return summary
# Example text
text = """
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial
intelligence concerned with the interactions between computers and human language, in
particular how to program computers to process and analyze large amounts of natural language
data. The goal is a computer capable of "understanding" the contents of documents, including the
contextual nuances of the language within them. The technology can then accurately extract
information and insights contained in the documents as well as categorize and organize the
documents themselves. This technology is very useful in a variety of applications such as machine
translation, text summarization, sentiment analysis, and more.
"""
# Generate summary
summary = generate_summary(text)
print("Summary:")
print(summary)
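The summarizer above is missing its imports and most of generate_summary. A minimal extractive-summarization sketch along the same lines (sentence tokenization, stopword removal, bag-of-words vectors, cosine-similarity scores, top-N sentences); the num_sentences parameter is an assumption. It can be called exactly as in the example above.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

def generate_summary(text, num_sentences=2):
    sentences = sent_tokenize(text)
    stop_words = set(stopwords.words('english'))

    # Remove stopwords and punctuation from each sentence before vectorizing
    cleaned = [
        ' '.join(w for w in word_tokenize(s.lower()) if w.isalnum() and w not in stop_words)
        for s in sentences
    ]

    # Bag-of-words vectors and pairwise cosine similarity
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(cleaned)
    similarity_matrix = cosine_similarity(X, X)

    # A sentence similar to many other sentences is treated as important
    importance_scores = similarity_matrix.sum(axis=1)
    sorted_indices = importance_scores.argsort()[::-1][:num_sentences]

    # Keep the chosen sentences in their original order
    summary = ' '.join(sentences[i] for i in sorted(sorted_indices))
    return summary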
Consider any text paragraph. Remove the stopwords, then tokenize the paragraph to extract words
and sentences. Calculate the word frequency distribution and plot the frequencies. Plot the
wordcloud of the text.
import re
def remove_stopwords(text):
stop_words = set(stopwords.words('english'))
words = word_tokenize(text)
def tokenize_text(text):
sentences = sent_tokenize(text)
word_freq = Counter(words)
return word_freq
def plot_word_frequency(word_freq):
plt.figure(figsize=(10, 6))
plt.bar(words, freqs)
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()
def plot_wordcloud(text):
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud')
plt.show()
text = """
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial
intelligence concerned with the interactions between computers and human language, in
particular how to program computers to process and analyze large amounts of natural language
data. The goal is a computer capable of "understanding" the contents of documents, including the
contextual nuances of the language within them. The technology can then accurately extract
information and insights contained in the documents as well as categorize and organize the
documents themselves. This technology is very useful in a variety of applications such as machine
translation, text summarization, sentiment analysis, and more.
"""
# Step 1: Remove stopwords
text_without_stopwords = remove_stopwords(text)
plot_word_frequency(word_freq)
plot_wordcloud(text_without_stopwords)
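The fragment above is missing its imports, the frequency-counting step and the word-cloud generation. A self-contained sketch; the top_n limit for the bar chart is an assumption, and the final block uses the 'text' paragraph defined in the exercise above.
import nltk
import matplotlib.pyplot as plt
from collections import Counter
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from wordcloud import WordCloud

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)
    return ' '.join(w for w in words if w.isalnum() and w.lower() not in stop_words)

def word_frequency(text):
    words = word_tokenize(text)
    return Counter(words)

def plot_word_frequency(word_freq, top_n=20):
    words, freqs = zip(*word_freq.most_common(top_n))
    plt.figure(figsize=(10, 6))
    plt.bar(words, freqs)
    plt.xlabel('Words')
    plt.ylabel('Frequency')
    plt.xticks(rotation=45)
    plt.show()

def plot_wordcloud(text):
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
    plt.figure(figsize=(10, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Word Cloud')
    plt.show()

# Usage: 'text' is the paragraph defined in the exercise above
sentences = sent_tokenize(text)
text_without_stopwords = remove_stopwords(text)
word_freq = word_frequency(text_without_stopwords)
plot_word_frequency(word_freq)
plot_wordcloud(text_without_stopwords)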
Consider the following review messages and perform sentiment analysis on them. 1. I purchased
headphones online. I am very happy with the product. 2. I saw the movie yesterday. The animation
was really good but the script was ok. 3. I enjoy listening to music. 4. I take a walk in the
park every day.
from textblob import TextBlob
def analyze_sentiment(text):
blob = TextBlob(text)
sentiment = blob.sentiment.polarity
if sentiment > 0:
return 'Positive'
elif sentiment < 0:
return 'Negative'
else:
return 'Neutral'
# Review messages
messages = [
"i purchased headphones online. i am very happy with the product.",
"i saw the movie yesterday. the animation was really good but the script was ok.",
"i enjoy listening to music",
"i take a walk in the park everyday"
]
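All that is missing above is a loop that applies analyze_sentiment to each message and prints the label:
for message in messages:
    print(message, "->", analyze_sentiment(message))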
Write a Python script for the following: 1. First export the WhatsApp chat of any group and read
the exported ".txt" file using the open() and read() functions. 2. Tokenize the read data into
sentences and print it. 3. Remove the stopwords from the data and perform lemmatization. 4. Plot
the wordcloud for the given data.
import re
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
return lemmatized_words
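A sketch of the full script. The file name 'whatsapp_chat.txt' is an assumption, and the regular expression that strips the export's timestamp/sender prefix is only illustrative, since the exact export format varies by phone and locale.
import re
import nltk
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

# 1. Read the exported chat (file name assumed)
with open('whatsapp_chat.txt', encoding='utf-8') as f:
    data = f.read()

# Strip the "date, time - sender:" prefixes (pattern is illustrative)
data = re.sub(r'\d{1,2}/\d{1,2}/\d{2,4},? \d{1,2}:\d{2}\s?(AM|PM|am|pm)? - [^:]+: ', '', data)

# 2. Tokenize into sentences and print them
sentences = sent_tokenize(data)
print(sentences)

# 3. Remove stopwords and perform lemmatization
stop_words = set(stopwords.words('english'))
words = word_tokenize(data)
filtered_words = [w for w in words if w.isalnum() and w.lower() not in stop_words]
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(w) for w in filtered_words]

# 4. Plot the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(lemmatized_words))
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()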
# i. Read the dataset and find the top 5 Instagram influencers from India.
def top_influencers(df):
top_5_influencers = df[df['Country'] == 'India'].nlargest(5, 'Followers')
return top_5_influencers
# ii. Find the Instagram account having the least number of followers.
def least_followers(df):
least_follower_account = df[df['Followers'] == df['Followers'].min()]
return least_follower_account
# iii. Read the column "Category", remove stopwords, and plot the wordcloud.
def plot_wordcloud(df):
category_words = ' '.join(df['Category'])
stop_words = set(stopwords.words('english'))
words = [word for word in word_tokenize(category_words) if word.lower() not in stop_words]
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(words))
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for Category')
plt.show()
# iv. Group the Instagram accounts category-wise.
def group_by_category(df):
grouped = df.groupby('Category').size().reset_index(name='Count')
return grouped
# v. Visualize the dataset and plot the relationship between Followers and Authentic engagement columns.
def visualize_relationship(df):
plt.figure(figsize=(10, 6))
plt.scatter(df['Followers'], df['Authentic engagement'], alpha=0.5)
plt.title('Relationship between Followers and Authentic Engagement')
plt.xlabel('Followers')
plt.ylabel('Authentic Engagement')
plt.show()
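The helper functions above never get called and their imports are not shown. A short driver is sketched below; the CSV file name is an assumption, while the column names ('Country', 'Followers', 'Category', 'Authentic engagement') are the ones the functions already use.
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud

# File name assumed; columns must match those used by the functions above
df = pd.read_csv('instagram_influencers.csv')

print("Top 5 Instagram influencers from India:")
print(top_influencers(df))

print("\nAccount with the least followers:")
print(least_followers(df))

plot_wordcloud(df)                 # word cloud of the Category column

print("\nAccounts grouped category-wise:")
print(group_by_category(df))

visualize_relationship(df)         # Followers vs Authentic engagement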
iii. Perform sentiment analysis and find the percentage of positive, negative and neutral
comments.
import pandas as pd
import re
def clean_data(df):
df.dropna(inplace=True)
return df
def tokenize_comments(df):
return df
# iii. Perform sentiment analysis and find the percentage of positive, negative, and neutral comments
def analyze_sentiment(comment):
blob = TextBlob(comment)
sentiment = blob.sentiment.polarity
if sentiment > 0:
return 'Positive'
elif sentiment < 0:
return 'Negative'
else:
return 'Neutral'
def sentiment_analysis(df):
df['Sentiment'] = df['Comment'].apply(analyze_sentiment)
return sentiment_counts
df_cleaned = clean_data(df)
df_tokenized = tokenize_comments(df_cleaned)
# iii. Perform sentiment analysis and find the percentage of positive, negative, and neutral comments
sentiment_percentage = sentiment_analysis(df_tokenized)
print(sentiment_percentage)
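A sketch of the missing pieces: reading the comments, tokenizing them, and turning the per-comment labels into percentages. The 'comments.csv' file name is an assumption; the 'Comment' column name is the one the fragment already uses.
import nltk
import pandas as pd
from textblob import TextBlob
from nltk.tokenize import word_tokenize

nltk.download('punkt', quiet=True)

# Read the comments (file name assumed; must contain a 'Comment' column)
df = pd.read_csv('comments.csv')

# i. Clean the data
df_cleaned = df.dropna().copy()

# ii. Tokenize the comments
df_cleaned['Tokens'] = df_cleaned['Comment'].astype(str).apply(word_tokenize)

# iii. Sentiment per comment, then percentage of each label
def analyze_sentiment(comment):
    polarity = TextBlob(str(comment)).sentiment.polarity
    if polarity > 0:
        return 'Positive'
    elif polarity < 0:
        return 'Negative'
    else:
        return 'Neutral'

df_cleaned['Sentiment'] = df_cleaned['Comment'].apply(analyze_sentiment)
sentiment_percentage = df_cleaned['Sentiment'].value_counts(normalize=True) * 100
print(sentiment_percentage)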
Write a Python script for the following:
ii. Find the total views, total likes, total dislikes and comment count.
iv. Perform year-wise statistics for views and plot the analyzed data.
import pandas as pd
def clean_data(df):
df.dropna(inplace=True)
return df
# ii. Find the total views, total likes, total dislikes, and comment count.
def get_statistics(df):
total_views = df['Views'].sum()
total_likes = df['Likes'].sum()
total_dislikes = df['Dislikes'].sum()
total_comments = df['Comments'].sum()
def get_top_least_videos(df):
# iv. Perform year-wise statistics for views and plot the analyzed data.
def year_wise_statistics(df):
df['Year'] = pd.to_datetime(df['Published_Date']).dt.year
year_wise_views = df.groupby('Year')['Views'].sum()
return year_wise_views
def plot_reactions(df):
plt.title('Reactions on Videos')
plt.xlabel('Reaction Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
# ii. Find the total views, total likes, total dislikes, and comment count.
print()
# iii. Find the least and topmost liked and commented videos.
print(top_liked_video)
print()
print(least_liked_video)
print()
print(top_commented_video)
print()
print(least_commented_video)
print()
# iv. Perform year-wise statistics for views and plot the analyzed data.
year_wise_views = year_wise_statistics(df_cleaned)
print(year_wise_views)
print()
# Plot year-wise statistics
plt.title('Year-wise Views')
plt.xlabel('Year')
plt.ylabel('Views')
plt.xticks(rotation=45)
plt.show()
plot_reactions(df_cleaned)
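A consolidated sketch of this script. The 'videos.csv' file name and the 'Title' column are assumptions; the Views, Likes, Dislikes, Comments and Published_Date columns are the ones the fragment already references.
import pandas as pd
import matplotlib.pyplot as plt

# Read and clean the data (file name assumed)
df = pd.read_csv('videos.csv')
df_cleaned = df.dropna().copy()

# ii. Totals
print("Total views:", df_cleaned['Views'].sum())
print("Total likes:", df_cleaned['Likes'].sum())
print("Total dislikes:", df_cleaned['Dislikes'].sum())
print("Total comments:", df_cleaned['Comments'].sum())

# iii. Most/least liked and commented videos ('Title' column assumed)
print("Most liked video:", df_cleaned.loc[df_cleaned['Likes'].idxmax(), 'Title'])
print("Least liked video:", df_cleaned.loc[df_cleaned['Likes'].idxmin(), 'Title'])
print("Most commented video:", df_cleaned.loc[df_cleaned['Comments'].idxmax(), 'Title'])
print("Least commented video:", df_cleaned.loc[df_cleaned['Comments'].idxmin(), 'Title'])

# iv. Year-wise views and plot
df_cleaned['Year'] = pd.to_datetime(df_cleaned['Published_Date']).dt.year
year_wise_views = df_cleaned.groupby('Year')['Views'].sum()
print(year_wise_views)

year_wise_views.plot(kind='bar')
plt.title('Year-wise Views')
plt.xlabel('Year')
plt.ylabel('Views')
plt.xticks(rotation=45)
plt.show()

# Reactions bar chart (likes, dislikes, comments)
df_cleaned[['Likes', 'Dislikes', 'Comments']].sum().plot(kind='bar')
plt.title('Reactions on Videos')
plt.xlabel('Reaction Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()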
Write a Python script to read the Tweets using the Twitter API and the tweepy library to perform
the following tasks:
v. Visualize the tweets and plot the time series for likes and retweets along with the dates on
which the tweets are published.
import tweepy
import pandas as pd
auth = tweepy.AppAuthHandler(bearer_token)
tweets = []
tweets.append(tweet)
return tweets
# iii. Find the total number of likes and retweets on each tweet.
def get_likes_retweets(tweets):
data['Tweet'].append(tweet.full_text)
data['Likes'].append(tweet.favorite_count)
data['Retweets'].append(tweet.retweet_count)
return pd.DataFrame(data)
# iv. Find the most liked tweet and print its text
def most_liked_tweet(tweets_df):
most_liked_index = tweets_df['Likes'].idxmax()
most_liked_text = tweets_df.iloc[most_liked_index]['Tweet']
return most_liked_text
# v. Visualize the tweets and plot the time series for likes and retweets along with dates on which tweets are published.
def plot_time_series(tweets_df):
tweets_df['Date'] = pd.to_datetime(tweets_df['Date'])
tweets_df.set_index('Date', inplace=True)
plt.xlabel('Date')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
# Perform tasks
tweets_df = get_likes_retweets(tweets)
most_liked = most_liked_tweet(tweets_df)
print(most_liked)
print()
plot_time_series(tweets_df)
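The plotting function above never actually draws anything. A sketch of the missing body is given below; it assumes the tweets DataFrame also carries a 'Date' column built from tweet.created_at (the fragment already converts 'Date' with pd.to_datetime) alongside 'Likes' and 'Retweets'.
import pandas as pd
import matplotlib.pyplot as plt

def plot_time_series(tweets_df):
    # Assumes 'Date', 'Likes' and 'Retweets' columns collected by get_likes_retweets
    tweets_df['Date'] = pd.to_datetime(tweets_df['Date'])
    tweets_df = tweets_df.sort_values('Date').set_index('Date')

    plt.figure(figsize=(10, 6))
    plt.plot(tweets_df.index, tweets_df['Likes'], marker='o', label='Likes')
    plt.plot(tweets_df.index, tweets_df['Retweets'], marker='o', label='Retweets')
    plt.title('Likes and Retweets over Time')
    plt.xlabel('Date')
    plt.ylabel('Count')
    plt.legend()
    plt.xticks(rotation=45)
    plt.show()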