Format Synopsis DP

Uploaded by

ritikgajjalwar


PROJECT SYNOPSIS

“PROJECT TITLE”

This synopsis is submitted to

G.H. Raisoni College of Engineering
In partial fulfilment of the requirement for the award of Degree of
Bachelor of Technology in Information Technology

SUBMITTED BY
STUDENT NAME

GUIDED BY
GUIDE NAME

G H Raisoni College of Engineering, Nagpur

(An Empowered Autonomous Institute affiliated to R.T.M. Nagpur University,
Nagpur)
NAAC Accredited with “A++” Grade (3rd Cycle)
Ranked in the Band of 151-200 in Engineering Category by NIRF Ranking 2023

FEBRUARY 2024
INDEX

ABSTRACT

1. INTRODUCTION

1.1 BACKGROUND
1.2 AIM & OBJECTIVE
1.3 SCOPE OF PROBLEM

2. LITERATURE SURVEY

3. PROPOSED SYSTEM

3.1 PROPOSED APPROACH

3.2 PROPOSED ARCHITECTURE

4. PLAN OF RESEARCH WORK

5. REFERENCES
ABSTRACT:
Measuring the similarity between documents is an important operation in the text
processing field. A new similarity measure is proposed. To compute the similarity
between two documents with respect to a feature, the technique takes the following
three cases into account: a) the feature appears in both documents, b) the feature
appears in only one document, and c) the feature appears in neither document.
Existing similarity measures do not consider the third case. The proposed measure
is extended to gauge the similarity between two sets of documents. The
effectiveness of the technique will be evaluated on several real-world data sets
for text clustering problems.
1. INTRODUCTION:
1.1 BACKGROUND:

Text processing plays an important role in information retrieval, data
mining, and web search. Text mining attempts to discover new,
previously unknown information by applying techniques from data
mining. Clustering, one of the traditional data mining techniques, is an
unsupervised learning paradigm in which clustering methods try to identify
inherent groupings of the text documents, so that a set of clusters is
produced in which clusters exhibit high intracluster similarity and low
intercluster similarity. Generally, text document clustering methods
attempt to segregate the documents into groups where each group
represents a topic that is different from the topics represented by the
other groups.

A document is usually represented as a vector in which each component
indicates the value of the corresponding feature in the document.
Selecting a similarity measure, an important operation in text
processing, can be a severe challenge. Many measures have been
proposed for computing the similarity between two vectors. This work
proposes a measure for computing the similarity between documents that
embeds several characteristics: it is a symmetric measure; the difference
between presence and absence of a feature is considered more essential
than the difference between the values associated with a present feature;
the similarity increases as the difference between the two values
associated with a present feature decreases; and the similarity decreases
when the number of presence-absence features (features present in one
document but absent from the other) increases.

1.2 AIM AND OBJECTIVE:

• Data mining is used to uncover hidden or unknown information that is
not apparent but potentially useful.

• The study of similarity measures for clustering was initially motivated
by research on automated text categorization.

• The aim of organizing data in such a way is to improve data
availability and to speed up data access, so that Web information
retrieval and content delivery on the Web are improved.

• The objective is to propose a measure for computing the similarity
between documents (Web pages).
1.3 SCOPE OF PROBLEM:

Meaningful information lives in the form of text extracted from web
pages. Therefore, a specific pre-processing method is required in order
to extract useful patterns from web pages. Pre-processing of a web
document consists of steps that take a plain text document as input and
output a set of tokens. These steps typically consist of filtering,
tokenization, stemming, stop word removal, and pruning. The similarity
or dissimilarity measures calculated on the documents are then used by
the clustering algorithms to group similar documents.
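
The pre-processing steps listed above can be sketched as follows. This is a minimal illustration only, not the pipeline the project will use: the stop-word list, the suffix-stripping rule in `naive_stem`, the `min_length` pruning rule, and all function names are assumptions introduced for the example. A real system would more likely rely on an established stemmer and a full stop-word list.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "for", "on"}


def strip_markup(html: str) -> str:
    """Filtering: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", html)


def tokenize(text: str) -> list[str]:
    """Tokenization: lowercase the text and keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token: str) -> str:
    """Stemming: crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(html: str, min_length: int = 3) -> list[str]:
    """Apply the steps in the order listed in the text."""
    tokens = tokenize(strip_markup(html))
    tokens = [naive_stem(t) for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Pruning: here, simply discard very short tokens.
    return [t for t in tokens if len(t) >= min_length]


if __name__ == "__main__":
    print(preprocess("<p>The clusters are grouping similar documents</p>"))
```

Each stage is kept as a separate function so that any one of them can be replaced without touching the others.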

Given the pre-processing results for the web pages, measuring the
similarity between documents is an important operation in the text
processing field. Documents are least similar to each other if none of the
features have non-zero values in both documents. Besides, it is desirable
to consider the value distribution of a feature for its contribution to the
similarity.

In this work, a similarity measure is proposed to compute the similarity
between documents with respect to a feature. To improve efficiency, the
complexity involved in the computation needs to be reduced. By
optimizing the similarity measure, better clusters can be formed and thus
performance is improved.
2. LITERATURE SURVEY:
Yung-Shen Lin, Jung-Yi Jiang, and Shie-Jue Lee [1] propose a new measure
for computing the similarity between two documents. The difference between
presence and absence of a feature is considered more essential than the
difference between the values associated with a present feature. The
similarity increases as the difference between the two values associated with
a present feature decreases. Furthermore, the contribution of the difference is
normally scaled. The similarity decreases when the number of presence-
absence features increases. An absent feature has no contribution to the
similarity.
Gaddam Saidi Reddy and Dr. R. V. Krishnaiah [2] propose multi-view based
similarity as an approach to finding the similarity between documents or
objects while performing clustering. It makes use of more than one point of
reference, as opposed to existing algorithms used for clustering text documents.
Shady Shehata, Fakhri Karray and Mohamed S. Kamel [3] note that extracting
the relations between verbs and their arguments in the same sentence has the
potential for analyzing terms within a sentence. The information about who is
doing what to whom clarifies the contribution of each term in a sentence to
the meaning of the main topic of that sentence.
Anna Huang [4] found that, in experiments with web page documents, the
cosine similarity, the Jaccard coefficient and the Pearson correlation
coefficient perform very similarly, and significantly better than the
Euclidean distance measure.
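
For reference, the four measures compared in [4] can be written for two equal-length feature vectors as in the sketch below. The function names are illustrative, and the Jaccard variant shown is the extended (Tanimoto) form commonly used for weighted vectors, which is an assumption about the variant meant here. Returning 0.0 for a zero denominator is also a convention chosen for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def extended_jaccard(a, b):
    """Extended (Tanimoto) Jaccard coefficient for weighted vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0


def pearson_correlation(a, b):
    """Pearson correlation coefficient between two vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom else 0.0


def euclidean_distance(a, b):
    """Euclidean distance (a dissimilarity: larger means less alike)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    a, b = [1, 0, 2], [2, 0, 4]
    for measure in (cosine_similarity, extended_jaccard,
                    pearson_correlation, euclidean_distance):
        print(measure.__name__, measure(a, b))
```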
Hung Chim and Xiaotie Deng [5] found that the concepts of the suffix tree and
document similarity are quite simple, but the implementation is complicated.
Further investigation is required to improve the performance of the document
similarity.
Yanhong Zhai and Bing Liu [6] proposed an approach to extract structured data
from Web pages. Although the problem has been studied by several
researchers, existing techniques are either inaccurate or make many strong
assumptions.
Jacob Kogan, Marc Teboulle and Charles Nicholas [7] argue that the choice
of a particular similarity measure may improve clustering of a specific
dataset. They called this choice the “data driven similarity measure”. They
found that the overall complexity of large data sets motivates the application
of a sequence of algorithms for clustering a single data set.
Syed Masum Emran and Nong Ye [9] state that a distance metric value is used
to find the similarity or dissimilarity of the current observation from the
already established normal profile. To find the distance between the normal
profile and the current observation value, one can use many distance metrics.
Alexander Strehl, Joydeep Ghosh, and Raymond Mooney [10] observed that, if
clusters are to be meaningful, the similarity measure should be invariant to
transformations natural to the problem domain. The features have to be
chosen carefully.
S. Kullback and R. A. Leibler [11] considered, in terms relevant to similarity
measures for information retrieval, how difficult it is to discriminate between
populations. The criterion of sufficiency introduced by R. A. Fisher requires
that the statistic chosen should summarize the whole of the relevant
information supplied by the sample.
Many measures have been proposed for computing the similarity between
two vectors. A new measure for computing the similarity between two
documents is suggested. Several characteristics are embedded in this
technique. It is a symmetric measure. The difference between presence and
absence of a feature is considered more essential than the difference between
the values associated with a present feature.
3. PROPOSED SYSTEM:

3.1 PROPOSED APPROACH:

Similarity measures have been extensively used in text classification and
clustering algorithms. The proposed approach is a measure for computing the
similarity between documents that embeds several characteristics: it is a
symmetric measure; the difference between presence and absence of a feature
is considered more essential than the difference between the values associated
with a present feature; the similarity increases as the difference between the
two values associated with a present feature decreases; and the similarity
decreases when the number of presence-absence features increases.
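
To make the characteristics above concrete, the following is a toy similarity function that satisfies them. It is a sketch under stated assumptions, not the measure this project proposes: the exact formula is not given in this synopsis. The scoring of a shared feature, the penalty `lam` for a presence-absence feature, the scale `sigma` for value differences, and the return value for two empty documents are all illustrative choices. Feature values are assumed to be non-negative.

```python
import math


def three_case_similarity(d1, d2, lam=1.0, sigma=1.0):
    """Toy similarity in [0, 1] over equal-length, non-negative vectors.

    lam and sigma are illustrative assumptions, not values from the synopsis.
    """
    if len(d1) != len(d2):
        raise ValueError("vectors must have the same length")
    score = 0.0
    counted = 0  # features present in at least one of the two documents
    for x, y in zip(d1, d2):
        if x > 0 and y > 0:
            # Present in both: contributes between 0.5 and 1, with closer
            # values contributing more.
            score += 0.5 * (1.0 + math.exp(-((x - y) / sigma) ** 2))
            counted += 1
        elif x > 0 or y > 0:
            # Present in only one document: apply a fixed penalty.
            score -= lam
            counted += 1
        # Absent from both documents: no contribution.
    if counted == 0:
        return 0.0  # assumption: two empty documents score 0
    # The average lies in [-lam, 1]; rescale it to [0, 1].
    return (score / counted + lam) / (1.0 + lam)


if __name__ == "__main__":
    print(three_case_similarity([1, 2, 0], [1, 2, 0]))
    print(three_case_similarity([1, 2, 0], [1, 0, 3]))
```

Because a shared feature always contributes at least 0.5 while a presence-absence feature always contributes `-lam`, the presence or absence of a feature outweighs any difference between present values.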

3.2 PROPOSED ARCHITECTURE:

The figure below details the high-level layered design for a web page
clustering infrastructure. Web page data collection is necessary to provide
input to the data mining model. Data pre-processing is used to handle the
complexity of web pages. Only a small portion of the Web's pages contain truly
relevant or useful information; the rest is uninteresting data that serves only
to swamp the desired results, so it is reduced using stop word removal and
stemming. Feature extraction is implemented by representing each document as a
vector in which each component indicates the value of the corresponding feature
in the document. The feature value can be term frequency, relative term
frequency, or a combination of term frequency and inverse document
frequency. The proposed similarity measure is then applied to these vectors. It
takes the following three cases into account: a) the feature appears in both
documents, b) the feature appears in only one document, and c) the feature
appears in neither document. Similarity measures have been extensively used in
text clustering algorithms, and the effectiveness of the proposed measure will
be evaluated on several real-world data sets for text clustering problems.
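
The three feature values mentioned above can be computed as in the sketch below. The function names are illustrative, and the weighting `idf = log(N / df)` is one common TF-IDF variant assumed for the example, since the synopsis does not fix a particular formula. Each document is taken to be a list of tokens produced by pre-processing.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """Raw count of each term in one document."""
    return Counter(tokens)


def relative_term_frequency(tokens):
    """Term count divided by the document length."""
    total = len(tokens)
    if total == 0:
        return {}
    return {term: count / total for term, count in Counter(tokens).items()}


def tf_idf(corpus):
    """Return (vocabulary, vectors) with weights tf * log(N / df)."""
    n_docs = len(corpus)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for tokens in corpus for term in set(tokens))
    vocabulary = sorted(doc_freq)
    vectors = []
    for tokens in corpus:
        counts = Counter(tokens)
        vectors.append(
            [counts[t] * math.log(n_docs / doc_freq[t]) for t in vocabulary]
        )
    return vocabulary, vectors


if __name__ == "__main__":
    corpus = [["web", "page", "web"], ["page", "cluster"]]
    vocabulary, vectors = tf_idf(corpus)
    print(vocabulary)
    print(vectors)
```

With this weighting, a term that occurs in every document receives a weight of zero, so it does not count as present in any document vector.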
Web Pages as Data Input
        ↓
Data Pre-processing
        ↓
Feature Extraction
        ↓
Implementation of Similarity Measure for Text Processing
        ↓
Clustering Algorithm is used to make clusters
        ↓
Output as clusters

Fig: High-level layered design for the web page clustering infrastructure
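
The last two stages of the design, applying a similarity measure and forming clusters, can be sketched as follows. The synopsis does not name a clustering algorithm, so the single-pass leader algorithm is used here purely as an assumption because it is short. Cosine similarity stands in for the proposed measure, and the `threshold` value is illustrative.

```python
import math


def cosine(a, b):
    """Stand-in similarity; any function of two vectors can be passed instead."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


def leader_clustering(vectors, similarity=cosine, threshold=0.5):
    """Single-pass clustering; returns clusters as lists of document indices."""
    leaders = []   # index of the first document placed in each cluster
    clusters = []
    for i, vec in enumerate(vectors):
        for leader, members in zip(leaders, clusters):
            if similarity(vectors[leader], vec) >= threshold:
                members.append(i)
                break
        else:
            # No existing leader is similar enough: start a new cluster.
            leaders.append(i)
            clusters.append([i])
    return clusters


if __name__ == "__main__":
    docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 1], [0, 0.8, 1.2]]
    print(leader_clustering(docs))
```

Passing the similarity function as a parameter lets different measures be compared on the same data without changing the clustering code.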
4. PLAN OF RESEARCH WORK:

Sr No Month Activity Planned

1 Literature survey.

2 Data collection and paper publication on literature review.

3 Design phase – Data flow analysis.

4 Implementation.

5 Comparative study, result analysis, and thesis writing.

6 Thesis writing and paper publication based on implementation.
5. REFERENCES:

[1] Yung-Shen Lin, Jung-Yi Jiang and Shie-Jue Lee, “A Similarity Measure
for Text Classification and Clustering”, IEEE Transactions on Knowledge
and Data Engineering, 2013.

[2] Gaddam Saidi Reddy and Dr. R. V. Krishnaiah, “Clustering Algorithm
with a Novel Similarity Measure”, IOSR Journal of Computer Engineering
(IOSRJCE), Vol. 4, No. 6, pp. 37-42, Sep-Oct. 2012.

[3] Shady Shehata, Fakhri Karray and Mohamed S. Kamel, “An Efficient
Concept-Based Mining Model for Enhancing Text Clustering”, IEEE
Transactions on Knowledge and Data Engineering, Vol. 22, No. 10,
October 2010.

[4] Anna Huang, “Similarity Measures for Text Document Clustering”,
Department of Computer Science, The University of Waikato, Hamilton,
New Zealand, New Zealand Computer Science Research Student Conference
(NZCSRSC), Christchurch, New Zealand, April 2008.

[5] H. Chim and X. Deng, “Efficient phrase-based document similarity for
clustering”, IEEE Transactions on Knowledge and Data Engineering,
Vol. 20, No. 9, pp. 1217–1229, 2008.

[6] Yanhong Zhai and Bing Liu, “Web Data Extraction Based on Partial Tree
Alignment”, International World Wide Web Conference Committee (IW3C2),
ACM 1-59593-046, 9/05/2005.

[7] J. Kogan, M. Teboulle and C. K. Nicholas, “Data driven similarity
measures for k-means like clustering algorithms”, Information Retrieval,
Vol. 8, No. 2, pp. 331–349, 2005.

[8] I. S. Dhillon, J. Kogan and C. Nicholas, “Feature Selection and
Document Clustering”, in M. W. Berry (Ed.), A Comprehensive Survey of
Text Mining, 2003.

[9] Syed Masum Emran and Nong Ye, “Robustness of Canberra Metric in
Computer Intrusion Detection”, IEEE Workshop on Information Assurance
and Security, United States Military Academy, West Point, NY,
5-6 June, 2001.

[10] Alexander Strehl, Joydeep Ghosh and Raymond Mooney, “Impact of
Similarity Measures on Web-page Clustering”, Workshop of Artificial
Intelligence for Web Search, July 2000.

[11] S. Kullback and R. A. Leibler, “On information and sufficiency”,
Annals of Mathematical Statistics, Vol. 22, No. 1, pp. 79–86,
March 1951.

[12] Mei-Ling Shyu, Shu-Ching Chen, Min Chen and Stuart H. Rubin,
“Affinity-Based Similarity Measure for Web Document Clustering”,
Distributed Multimedia Information System Laboratory, School of Computer
Science, Florida International University, Miami, FL 33199, USA.

Guide

Name of Guide-
Assistant Professor
Department of Information Technology
G H R C E, Nagpur