Format Synopsis DP
ON
“PROJECT TITLE”
SUBMITTED BY
STUDENT NAME
GUIDED BY
GUIDE NAME
FEBRUARY 2024
INDEX
ABSTRACT
1. INTRODUCTION
1.1 BACKGROUND
1.2 AIM & OBJECTIVE
1.3 SCOPE OF PROBLEM
2. LITERATURE SURVEY
3. PROPOSED SYSTEM
4. PLAN OF RESEARCH WORK
5. REFERENCES
ABSTRACT:
Measuring the similarity between documents is an important operation in the text
processing field. A new similarity measure is proposed. To compute the similarity
between two documents with respect to a feature, the technique takes the following
three cases into account: a) the feature appears in both documents, b) the feature
appears in only one document, and c) the feature appears in neither document.
Existing similarity measures do not consider the third case. The proposed measure
is also extended to gauge the similarity between two sets of documents. The
effectiveness of the technique will be evaluated on several real-world data sets
for text clustering problems.
1. INTRODUCTION:
1.1 BACKGROUND:
The figure below shows a high-level layered design for a web page clustering
infrastructure. Web page data is first collected as input to the data
mining model. Data pre-processing is then used to handle the complexity of
web pages: only a small portion of the Web's pages contains truly relevant or
useful information, and the rest is uninteresting data that serves only
to swamp the desired results. Pre-processing removes stop words and applies stemming.
Feature extraction is implemented by representing each document as a vector
in which each component indicates the value of the corresponding feature in
the document. The feature value can be term frequency, relative term
frequency, or a combination of term frequency and inverse document
frequency. Measuring the similarity between documents is an important
operation in the text processing field, and similarity measures have been
extensively used in text clustering algorithms. A similarity measure is
proposed that takes the following three cases into account: a) the feature
appears in both documents, b) the feature appears in only one document, and
c) the feature appears in neither document. The effectiveness of the
measure will be evaluated on several real-world data sets for text clustering
problems.
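The pre-processing and feature extraction steps described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the stop-word list and the function names (tokenize, term_frequency, relative_term_frequency, tf_idf) are assumptions made for the example, stemming is omitted for brevity, and the TF-IDF weighting shown (raw count times log(N / document frequency)) is one common variant among several.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def term_frequency(tokens):
    """Raw count of each term in one document."""
    return Counter(tokens)


def relative_term_frequency(tokens):
    """Count of each term divided by the document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def tf_idf(docs_tokens):
    """One TF-IDF vector (as a dict) per document in the collection."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        vectors.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors
```

Each document becomes a mapping from feature (term) to value; a term that occurs in every document gets a TF-IDF weight of zero under this variant.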
Fig: Layered design for web page clustering: Web Pages as Data Input -> Data Pre-processing -> Feature Extraction -> Implementation of Similarity Measure for Text Processing -> Output as clusters
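To make the three cases of the similarity measure concrete, the sketch below scores two feature vectors case by case. The synopsis does not state the formula, so everything here is an illustrative assumption: the function name three_case_similarity, the parameters lam and sigma, the Gaussian-style reward when a feature is present in both documents, the fixed penalty when it is present in only one, and the choice to let a feature absent from both documents contribute nothing.

```python
import math


def three_case_similarity(d1, d2, lam=1.0, sigma=1.0):
    """Similarity in [0, 1] between two equal-length feature vectors.

    Assumed scoring (not taken from the synopsis):
      a) feature in both documents: reward in (0.5, 1], higher when values are close
      b) feature in only one document: fixed penalty of -lam
      c) feature in neither document: ignored entirely
    """
    score = 0.0
    considered = 0  # number of features present in at least one document
    for a, b in zip(d1, d2):
        if a > 0 and b > 0:        # case a
            score += 0.5 * (1.0 + math.exp(-((a - b) / sigma) ** 2))
            considered += 1
        elif a > 0 or b > 0:       # case b
            score -= lam
            considered += 1
        # case c: absent from both, no contribution
    if considered == 0:
        return 0.0                 # assumption: two empty documents score 0
    average = score / considered   # lies in [-lam, 1]
    return (average + lam) / (1.0 + lam)  # rescale to [0, 1]


if __name__ == "__main__":
    print(three_case_similarity([1, 0, 2], [1, 0, 2]))
    print(three_case_similarity([1, 0, 0], [0, 1, 0]))
```

Under these assumptions, identical documents score 1, documents with no shared features score 0, and appending features that are absent from both documents leaves the score unchanged.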
4. PLAN OF RESEARCH WORK:
1 Literature survey.
4 Implementation.
5. REFERENCES:
[6] Yanhong Zhai and Bing Liu, “Web Data Extraction Based on
Partial Tree Alignment”, International World Wide Web Conference
Committee (IW3C2), ACM 1-59593-046, 9/05/2005.
Guide
Name of Guide-
Assistant Professor
Department of Information Technology
G H R C E, Nagpur