1. Types of Data Represented as Strings
Text data can be categorized into different types based on its source and application:

- Categorical Data: strings that represent categories, like product types ("Electronics", "Clothing").
- Free-form Text: open-ended responses, descriptions, reviews, etc.
- Structured Text: logs, dates, or strings that follow a specific format (e.g., phone numbers, emails).
- Natural Language Text: sentences and paragraphs (e.g., emails, articles).

Example: A dataset of movie reviews contains free-form text (the review content) and categorical data (like genre).

2. Example Application: Sentiment Analysis of Movie Reviews

Sentiment Analysis is the process of determining the emotional tone behind a body of text. For movie reviews, sentiment analysis helps gauge whether a review is positive, negative, or neutral.

Example Steps:
1. Data Collection: collect movie reviews (e.g., from IMDb) along with user ratings.
2. Preprocessing: tokenize the reviews, remove stop words, and normalize the text (lowercasing, stemming).
3. Feature Engineering: convert the text into a numeric format, such as Bag of Words or TF-IDF (explained below).
4. Modeling: use classification algorithms (e.g., logistic regression, Naïve Bayes) to classify reviews by sentiment.
5. Evaluation: calculate metrics like accuracy and F1 score to evaluate the model's effectiveness.

Example: If a movie review dataset has 2000 positive and 1800 negative reviews, a sentiment analysis model can help classify the overall mood and trends among the reviews.

3. Representing Text Data as a Bag of Words

The Bag of Words (BoW) approach transforms text into a vector of word counts or frequencies. Each unique word in the corpus becomes a feature, and its count in each document forms the vector.

Example: For a corpus of reviews like:
Review 1: "The movie was great and fantastic."
Review 2: "The movie was bad."
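A minimal plain-Python sketch of building count vectors for these two reviews (in practice a library vectorizer such as scikit-learn's CountVectorizer does this job; the simple regex tokenizer here is an assumption for illustration):

```python
import re
from collections import Counter

reviews = [
    "The movie was great and fantastic.",  # Review 1
    "The movie was bad.",                  # Review 2
]

def tokenize(text):
    # Lowercase and keep runs of letters, dropping punctuation
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary: every unique word across the corpus, in sorted order
vocab = sorted({word for review in reviews for word in tokenize(review)})

# One count vector per review, with one position per vocabulary word
vectors = [[Counter(tokenize(review))[word] for word in vocab]
           for review in reviews]

print(vocab)    # ['and', 'bad', 'fantastic', 'great', 'movie', 'the', 'was']
print(vectors)  # [[1, 0, 1, 1, 1, 1, 1], [0, 1, 0, 0, 1, 1, 1]]
```

Each review becomes a fixed-length numeric vector regardless of its original length, which is what lets standard classifiers consume text.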
The Bag of Words model for these two reviews could look like:

Word        Review 1    Review 2
the             1           1
movie           1           1
was             1           1
great           1           0
and             1           0
fantastic       1           0
bad             0           1

Limitation: BoW doesn't capture semantic meaning or word order, but it is simple and effective for many applications.

4. Stop Words

Stop words are common words (like "the", "is", and "and") that don't add significant meaning in many cases and are usually removed to reduce the feature space in text analysis.

Example: In a review like "The movie was absolutely amazing," words like "the" and "was" may be removed, leaving "movie", "absolutely", and "amazing" for analysis.

Another example: Removing stop words from "This is a fantastic movie and it was very engaging" leaves "fantastic", "movie", and "engaging" as the key terms.

5. Rescaling the Data with tf-idf

TF-IDF (Term Frequency-Inverse Document Frequency) adjusts word importance by considering both how often a word appears in a document (term frequency) and how rare it is across the dataset (inverse document frequency). This downweights common words that are frequent but not meaningful.

Formula:

    tf-idf(t, d) = tf(t, d) × log(N / df(t))

where:
- tf(t, d): term frequency of term t in document d
- N: total number of documents
- df(t): document frequency of term t (the number of documents in which t appears)

Example: In a dataset of 1000 reviews, the word "amazing" appears in 100 reviews and "movie" appears in 900. TF-IDF gives "amazing" the higher score, since log(1000/100) ≈ 2.30 while log(1000/900) ≈ 0.11 (natural log): "amazing" is more distinctive.

Another example: If "movie" appears frequently across reviews, TF-IDF downscales its importance compared to less common words like "thrilling" or "mind-blowing."

6. Investigating Model Coefficients

Understanding the coefficients in a text classification model can reveal how different words contribute to predictions.
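A minimal from-scratch sketch of this idea (the tiny corpus, labels, learning rate, and epoch count are illustrative assumptions; in practice one would fit scikit-learn's LogisticRegression on real data and inspect its coef_ attribute):

```python
import math
import re
from collections import Counter

# Tiny invented corpus: 1 = positive sentiment, 0 = negative
reviews = ["great fantastic movie", "awesome movie love it",
           "terrible movie", "disappointing and terrible"]
labels = [1, 1, 0, 0]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

vocab = sorted({w for r in reviews for w in tokenize(r)})
X = [[Counter(tokenize(r))[w] for w in vocab] for r in reviews]

# Logistic regression trained with plain stochastic gradient descent
weights = [0.0] * len(vocab)
for _ in range(500):
    for x, y in zip(X, labels):
        p = 1 / (1 + math.exp(-sum(wi * xi for wi, xi in zip(weights, x))))
        weights = [wi + 0.1 * (y - p) * xi for wi, xi in zip(weights, x)]

# One coefficient per word: positive pushes toward positive sentiment
for word, w in sorted(zip(vocab, weights), key=lambda t: -t[1]):
    print(f"{word:15s} {w:+.2f}")
```

Words that occur only in positive reviews (like "great") end up with positive weights, and words that occur only in negative reviews (like "terrible") end up with negative weights, which is exactly the pattern described below.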
In linear models like logistic regression, positive coefficients for certain words indicate positive sentiment, while negative coefficients indicate negative sentiment.

Example: A sentiment model for movie reviews may show that words like "awesome" and "love" have positive coefficients, while "terrible" and "disappointing" have negative coefficients, indicating the direction of sentiment.

7. Approaching a Machine Learning Problem

Approaching a text-based machine learning problem generally involves the following steps:
1. Define the Problem: determine the goal (e.g., sentiment classification).
2. Data Preprocessing: clean and prepare the text (tokenization, stop-word removal).
3. Feature Extraction: represent the text numerically (Bag of Words, TF-IDF).
4. Model Selection: choose a suitable algorithm (logistic regression, Naïve Bayes).
5. Training and Validation: split the data, train the model, and validate it with metrics.
6. Deployment: test the model on real-world data.

Example: For a spam detection task, you would preprocess the emails, extract features using TF-IDF, train a model (e.g., Naïve Bayes), and evaluate it on accuracy and recall.

8. Testing Production Systems

Testing ensures that a text-based ML model works as expected after deployment. It includes:
- A/B Testing: comparing different model versions in real time.
- Monitoring: tracking model performance on live data.
- Error Analysis: analyzing misclassifications to improve the model.

Example: In a recommendation system, A/B testing can measure user engagement with recommendations made by different models.

9. Ranking

Ranking arranges documents or items by relevance to a user query. In recommendation systems, for instance, ranking algorithms use similarity scores to order recommendations.

Example: For a movie recommendation engine, ranking arranges movies by predicted relevance based on the user's past viewing preferences.

10. Recommender Systems and Other Kinds of Learning

Recommender systems use collaborative filtering, content-based filtering, or hybrid approaches to suggest items:
- Collaborative Filtering: recommends based on user-item interactions (e.g., users who liked similar movies).
- Content-Based Filtering: recommends based on item features (e.g., genre, director).
- Hybrid Methods: combine collaborative and content-based filtering.
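A minimal content-based filtering sketch (the movie names and one-hot genre vectors are invented for illustration): candidate items are ranked by cosine similarity between their feature vectors and the features of an item the user liked.

```python
import math

# Invented item features: one-hot genre vectors [action, comedy, thriller]
movies = {
    "Movie A": [1, 0, 1],
    "Movie B": [1, 0, 1],
    "Movie C": [0, 1, 0],
}

def cosine(u, v):
    # Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

liked = movies["Movie A"]  # features of a movie the user liked

# Score every other movie against the liked one, then rank by similarity
scores = {name: cosine(liked, feats)
          for name, feats in movies.items() if name != "Movie A"}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['Movie B', 'Movie C']
```

Collaborative filtering works the same way structurally, but the vectors come from user-item interaction history instead of item features.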