
International Journal of Computer Applications (0975 – 8887)
Volume 116 – No. 23, April 2015

Plagiarism Detection by using Karp-Rabin and String Matching Algorithm Together

Sonawane Kiran Shivaji
Master of Engineering, Computer Engineering, Ahmednagar, Maharashtra

Prabhudeva S
Asst. Professor, VACOE, Ahmednagar, Maharashtra

ABSTRACT
In today's world, copying something from other sources and claiming it as one's own contribution is a crime. We have also seen that it is a major problem in academia, where students at UG, PG or even PhD level copy parts of original documents and publish them under their own names without taking proper permission from the author or developer.

Many software tools exist to find out and assist with the monotonous and time-consuming task of tracing plagiarism, because identifying the owner of a whole text is practically difficult, and impossible for markers. In our work we have focused on practical assignments (projects) as well as on written documents that are submitted by students to a college or university.

Because of this crucial task and the day-by-day increase of research in different fields, industry and academia are demanding software that detects whether submitted articles, books, and national or international papers are genuine or not. In this paper, our algorithm divides submitted articles into small pieces and scans them to compare against databases connected to the server over the internet. Some existing work compares submitted articles only with previously submitted articles, i.e. with an existing database.

Keywords
Document retrieval; plagiarism; Karp-Rabin algorithm; plagiarism detection; string matching.

1. INTRODUCTION
For the last few decades it has been a challenge to find the similarity between two documents, and the tendency of copying somebody's work into one's own paper is rising. It is a serious problem in all areas. This challenge encourages us to make an effort to provide a practical approach for detecting plagiarism between two sequences.

A special issue on plagiarism was published by IEEE Transactions. It is confirmed in the Guest Editorial on Plagiarism [1] that "Plagiarism is a deplorable and increasing threat to educational organizations and it is a risk for the functioning of academia. This threat is especially true in a world where Information Technology has made copying information easier. Plagiarism is an act of fraud. It involves both stealing someone else's work and lying about it afterward."

We have also observed that the margin between plagiarism and research is often unclear.

Plagiarism Problem:
There are different kinds of plagiarism, such as:

Replica: Copying another work exactly and presenting it as one's own work.
Fusion: Copying text from multiple sources and creating a new fusion of it without citation.
Borrowing text, then finding similar words and replacing them.
Aggregator: Papers that have citations but contain no original work.
Paraphrasing: Borrowed text in which some words are changed, but not the whole statement.
Copying Idea: The concept is taken without copying any text.

There are many reasons for plagiarism among students, such as laziness, fear of failure, high expectations, poor time management, etc. It has been found that there is very little awareness about plagiarism and the corresponding actions against it.

From this we can divide plagiarism into mainly four categories, i.e. Singular, Paired, Multidimensional and Carpal. It is again split into two parts: one is external and the other is internal plagiarism. External detection uses a list of reference documents to detect plagiarism in a suspicious document. Internal detection does not refer to any reference documents for plagiarism detection.

Plagiarism in Software Assignments: Most students copy content without prior permission or acknowledgment. In academic courses there are different programming languages, and a student performs many assignments on them during their academic years. Often students do not pay much attention and, instead of performing the assignments themselves, they simply copy and paste programs. We have therefore focused on a tool to discourage students' tendency of copying the contents of others; if they do so, the teacher should be able to detect it and take action against the students.

Many academic courses have programming languages as a subject, or assignments in many subjects have to be written using different programming languages. The same or a similar problem is assigned to the entire class. Students do not pay much attention to their practical assignments; they copy and paste programs. This tendency of students must be changed, and any tool which can detect such copying would support the teacher in punishing such cases.

Whether the similarity between two sequences is plagiarism, or whether the similarity results from an analogous working method based on the same theoretical knowledge, is difficult to decide, because the cause of the similarity is hard to understand. We consider this precisely.

Hence, we use the Karp-Rabin algorithm along with a string matching algorithm. Karp-Rabin is used in many existing tools like JPlag, MOSS (Measure Of Software Similarity), CPD (Copy/Paste Detector), etc.

2. RELATED WORK
Many program plagiarism detection tools have been developed that are based on programming-language keywords or on the logical statements of a program [3-6]. To hide the original code, a plagiarist adds unwanted program lines or changes the positions of statements.


These changes can be detected by structure metric systems, but the exact plagiarism percentage cannot be measured. PK2 is a structure metric tool developed by the Technical University of Madrid, Spain. The only situation that PK2 cannot detect is an assignment compounded from several very small fragments of source code. These tools may give false results in the case of shuffling of statements or the addition of unnecessary statements. A detailed comparison of tools based on these two concepts is given in paper [10].

The data dependency matrix method [8, 9] developed by us is a new concept based on the data assignment statements of a program. Both of these methods are elaborated and compared in the following sections.

The PK2 tool has been developed in the Computer Architecture Department of the Technical University of Madrid, Spain [1]. Students are asked to develop small projects using system programming in C, assembly language programming, input/output systems and microprogramming. Students copy or plagiarize programs in the same way. Each programming language has its own keywords (reserved words), and these keywords can be used to catch the cheater. Plagiarists are people who do not know enough to do the assignment on their own. They usually make aesthetic changes without significantly altering the underlying program structure, randomly changing identifier names, comments, punctuation, and indentation [1].

The PK2 tool is based on a structure metric system. While comparing a programming language such as Java, only its reserved words and the most used library function names are considered. PK2 processes each given program file, transforming it into an internal representation: the occurrence of each keyword is translated into a corresponding internal symbol, which generates a signature string. The tool compares only the underlying program structure. Since it is based on a structure metric system, it uses four similarity criteria:

1. Length of the longest common substring.
2. Cumulative value of the lengths of common sequences of reserved words.
3. Normalized value of the cumulative value.
4. Percentage of reserved words common to both files.

The PK2 tool gives teachers hints about which pairs to inspect for plagiarism, but the final decision in a case of plagiarism is very difficult to make. The tool has proved to be flexible and has been successfully used to detect partial and total copies in very different environments.
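As a rough illustration of how such a structure metric comparison can work (this is our own Python sketch, not PK2's implementation; the symbol table and the naive whitespace tokenization are simplifying assumptions), the following code maps two source files to signature strings of reserved-word symbols and measures the length of their longest common substring:

```python
import difflib

# Hypothetical symbol table: each reserved word of the target language
# is mapped to a one-character internal symbol.
JAVA_KEYWORDS = {"class": "C", "public": "P", "static": "S", "void": "V",
                 "int": "I", "for": "F", "while": "W", "if": "?", "return": "R"}

def signature(source_code: str) -> str:
    """Translate every keyword occurrence into its internal symbol,
    producing a signature string that reflects only program structure.
    (Naive whitespace tokenization; a real tool would use a lexer.)"""
    return "".join(JAVA_KEYWORDS[tok]
                   for tok in source_code.split()
                   if tok in JAVA_KEYWORDS)

def structural_similarity(file_a: str, file_b: str) -> float:
    """Length of the longest common substring of the two signatures,
    normalized by the length of the shorter signature (criterion 1)."""
    sig_a, sig_b = signature(file_a), signature(file_b)
    if not sig_a or not sig_b:
        return 0.0
    match = difflib.SequenceMatcher(None, sig_a, sig_b) \
                   .find_longest_match(0, len(sig_a), 0, len(sig_b))
    return match.size / min(len(sig_a), len(sig_b))
```

Because identifier names, comments and formatting never reach the signature string, purely cosmetic edits do not change the score.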

James A. McCart and Jay Jarman, who were working on Microsoft Access projects, proposed and developed a tool called Cheater Cheater Pumpkin Eater (CCPE) in 2008. CCPE was written using Visual Basic for Applications within a Microsoft Access database. To determine whether Microsoft Access projects were duplicates, properties such as the read-only creation date of the database and of its objects (tables, queries, forms, reports, etc.) were compared. When a database or an object within a database is created, a document object (DO) is created which stores the properties of the newly created database or object. Each DO contains standard properties such as the creation date, the last updated date, and the name. Built-in summary properties of the database, such as the database title, are stored in a separate DO, and the last updated date is associated with changes to these built-in summary properties. If a database is copied, it is an exact duplicate of the original, and the creation date, last updated date and name properties of all objects within the copied database are the same as in the original. CCPE is a very effective technological tool for detecting plagiarism in Access database projects, and it has given positive results by reducing the percentage of plagiarized projects [3].

Tommy W. S. Chow and M. K. M. Rahman developed the approach of "Multilayer SOM With Tree-Structured Data for Efficient Document Retrieval and Plagiarism Detection" in 2009 [2]. They proposed document retrieval (DR) and plagiarism detection (PD) using a tree-structured document representation and a multilayer self-organizing map (MLSOM). The method evaluates a full input document or program as a query for data retrieval and PD. The tree-structured representation of documents increases the accuracy of DR and PD by including local as well as the traditional global characteristics. The hierarchical representation of the global and local features of documents enables the MLSOM to be used for PD. Chow and Rahman proposed two methods of PD: the first is an extension of the DR method with additional local sorting, and the second is document association on the bottom-layer SOM. The computational cost of DR and PD is high because a huge number of documents is scanned at once, which makes the approach useful for large databases [2].

The SID tool was developed by Xin Chen, Brent Francia, Ming Li, Brian McKinnon and Amit Seker at the University of California [11]. They define the information distance between two sequences as roughly the minimum amount of energy needed to convert one sequence into the other, and vice versa. SID is based on a compression algorithm; the authors created an improved Lempel-Ziv algorithm to obtain a proper compression technique. The steps of SID are as follows:

1. It breaks the program string into small segments, i.e. tokens.
2. The Lempel-Ziv algorithm compresses the tokens.
3. It finds the percentage of plagiarism by using Kolmogorov complexity formulas.

SID has been widely examined and used, and users have applied it successfully to catch plagiarism cases. It has checked UCSB programming assignments, Java assignments and many other projects. The SID system uses a special compression program to heuristically approximate Kolmogorov complexity. It also shows the similar parts of two programs, which reduces the teacher's work in searching for the copied part, and it detects subtler similarities.
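A minimal sketch of this compression-based idea is given below, assuming a stock zlib compressor instead of the authors' improved Lempel-Ziv coder, so the resulting scores are only indicative and the function names are our own:

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Size of the zlib-compressed data, a rough stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def compression_similarity(program_a: str, program_b: str) -> float:
    """Normalized compression distance turned into a similarity score in [0, 1]:
    if two programs share information, compressing them together is much
    cheaper than compressing them separately."""
    a, b = program_a.encode(), program_b.encode()
    ca, cb, cab = compressed_size(a), compressed_size(b), compressed_size(a + b)
    ncd = (cab - min(ca, cb)) / max(ca, cb)
    return max(0.0, 1.0 - ncd)
```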


3. SYSTEM ALGORITHM
Karp-Rabin Algorithm:
We take the help of the Karp-Rabin algorithm, which uses fingerprints to find occurrences of one string within another string. The Karp-Rabin algorithm reduces the time needed to compare two sequences by assigning a hash value to each string and word. Without hash values, comparison takes too much time: if there is a word W and an input string S, the word is compared with every string and substring in the program, which consumes more time. Karp-Rabin introduced the concept of hash values to avoid this O(m²) time complexity. It computes a hash value for both the word and the file's strings/substrings, so only when the hash of a substring of S matches the hash value of W is an exact comparison performed.

In the comparison process there are four categories [4]:

1. Right to left
2. Left to right
3. In a specific order
4. In any order

The Karp-Rabin algorithm prefers the left-to-right category of comparison. The hash function must be able to compute hash values efficiently. When a name is hashed for the first time, the resulting value is saved and indexed, so that later data can be compared against the indexed value.

The Karp-Rabin algorithm can deal with multiple pattern matching, which is why people prefer it; other algorithms only perform basic pattern matching. It has O(nm) complexity, where n is the length of the text and m is the length of the pattern. It is a little slow because every single character of the text has to be checked, but we can overcome this by using a hash function which is efficient as well as easy to implement.

Suppose a k-gram c1…ck is considered as a k-digit number in base b; then its hash value H(c1…ck) will be:

H(c1…ck) = c1·b^(k-1) + c2·b^(k-2) + … + c(k-1)·b + ck
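The following Python sketch shows how this hash can be maintained as a rolling hash while a window slides over the text, which is the core of the Karp-Rabin search. The base and modulus values are illustrative choices, not values prescribed by the paper.

```python
def karp_rabin_search(pattern: str, text: str, base: int = 256, mod: int = 1_000_003):
    """Return the start indices of all occurrences of `pattern` in `text`.
    Hash matches are confirmed by a direct character comparison."""
    k, n = len(pattern), len(text)
    if k == 0 or k > n:
        return []
    high = pow(base, k - 1, mod)           # base^(k-1), used to drop the leading char
    h_pat = h_win = 0
    for i in range(k):                     # H(c1..ck) = c1*b^(k-1) + ... + ck (mod m)
        h_pat = (h_pat * base + ord(pattern[i])) % mod
        h_win = (h_win * base + ord(text[i])) % mod
    hits = []
    for i in range(n - k + 1):
        if h_pat == h_win and text[i:i + k] == pattern:   # verify to rule out collisions
            hits.append(i)
        if i < n - k:                      # roll the window one character to the right
            h_win = ((h_win - ord(text[i]) * high) * base + ord(text[i + k])) % mod
    return hits
```

For word-level comparison the same scheme applies, with tokens taking the place of characters.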

We use the dependency matrix for the comparison of same-size matrices [9]. It is assumed that a plagiarist will change the text, the positions or the names of variables, but that the total number of variables in a function will remain the same. Such plagiarized code can be detected using the algorithm. The expression list algorithm compares all lists of functions with another function, which is advantageous in various ways. In this case the matrix method gives false detections, as it compares only same-size matrices; this is a drawback of the matrix system.

String Matching Algorithm:
It is used to compute similar strings. It performs character-by-character matching.

4. SYSTEM ARCHITECTURE
The block diagram shown in Fig 1 gives an outline of plagiarism detection using the hash function and string matching algorithms. It gives an overall idea of the processing of string matching as well as of the creation of the similarity matrix used to find the percentage of plagiarism.

Fig 1: System Architecture (input files are parsed into token sequences, compressed, and compared to build an n×n similarity matrix)

Fig 2: System Architecture of Plagiarism Detection (input document recognition/parsing → keyword extraction → internal DB and web search → comparison of the two extracted strings using the Karp-Rabin and string matching algorithms → discovery of similarity → plagiarized text)

The system architecture is shown in Figure 2. The input document is given to the parsing process. Parsing is also known as syntactic analysis; it analyses the given input, which can be a natural language or a machine-understandable language. It checks heuristic rules which are predefined in the system and also confirms grammar rules while matching strings. It breaks sentences into tokens, which is known as segmentation.

In keyword extraction, keywords are found and extracted from the whole input document; stop words are removed and the remaining words are stored as stemmed words in a keyword list.

The keywords are then given as input to the search engine of the system. The web search engine contains an internal dataset provided by internal users/candidates from colleges/universities, etc. We can also map an external dataset onto the existing system dataset.
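A simplified sketch of this parsing and keyword extraction stage is shown below; the stop-word set and the crude suffix-stripping stemmer are illustrative placeholders for whatever resources the real system uses.

```python
import re

# Illustrative subset of a stop-word list.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and", "or", "for"}

def tokenize(document: str) -> list[str]:
    """Segmentation: break the input document into lower-case word tokens."""
    return re.findall(r"[a-z0-9]+", document.lower())

def stem(word: str) -> str:
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_keywords(document: str) -> list[str]:
    """Tokenize, drop stop words, and store the remaining stemmed words as the keyword list."""
    keywords = []
    for token in tokenize(document):
        if token not in STOP_WORDS:
            stemmed = stem(token)
            if stemmed not in keywords:      # keep each keyword once, preserving order
                keywords.append(stemmed)
    return keywords
```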


The search engine forwards two sets of strings for matching to the Comparison Engine module, in which the input dataset and the internal dataset query are present.

Here we use the Karp-Rabin algorithm along with the string matching algorithm for detecting suspicious material in the given documents. Karp-Rabin is a string searching and comparison algorithm that uses a hash function. Karp-Rabin speeds up the processing of string comparison by matching the given pattern against the strings/substrings of the different input documents using hash values. It uses the hash function to assign a hash value to every string/substring in the text. If the hash value of the given pattern matches that of a substring, the two strings are considered similar.

E.g. WordString[Plagiarism] = 6 [hash value]

If the search engine finds a similar hash value in the internal dataset, then both strings are treated as matching strings.

There is a problem with the Karp-Rabin algorithm: to keep the hash values of words small, it may assign the same hash value to different strings in the documents. This creates confusion, because the hash values are the same even though the strings are not similar, and sometimes similarity cannot be detected correctly.

To remove this drawback and improve the efficiency of the system, we use the string matching algorithm along with the Karp-Rabin algorithm whenever a similar hash value is detected in the system. It keeps the strings in arrays and checks them character by character; only if the characters match is the pair declared similar. This improves the accuracy of plagiarism detection beyond what existing tools achieve.

The result of the similarity check is then generated and the plagiarized text is highlighted. The source of the plagiarized text in the document is also shown.
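The sketch below illustrates this hash-then-verify idea on word windows (our own illustration; the window size k = 8 and the hash parameters are assumptions, not values taken from the paper): a matching fingerprint only nominates a candidate, and the exact comparison makes the final decision.

```python
def window_hash(words, base=257, mod=1_000_003):
    """Hash a sequence of words (a coarse Karp-Rabin style fingerprint)."""
    h = 0
    for w in words:
        h = (h * base + hash(w)) % mod
    return h

def plagiarized_windows(suspect_words, source_words, k=8):
    """Return (suspect_index, source_index) pairs of k-word windows that match.
    The hash only nominates candidates; the final decision is an
    element-by-element comparison of the two windows."""
    index = {}
    for j in range(len(source_words) - k + 1):
        index.setdefault(window_hash(source_words[j:j + k]), []).append(j)
    matches = []
    for i in range(len(suspect_words) - k + 1):
        window = suspect_words[i:i + k]
        for j in index.get(window_hash(window), []):
            if window == source_words[j:j + k]:   # exact verification step
                matches.append((i, j))
    return matches
```

A production version would additionally roll the window hash instead of recomputing it at every position.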
In the analysis we have compared the performance of different tools with the performance of our proposed system, which we report here.

We have computed plagiarism detection (PD) accuracy using precision and recall values as follows [13]:

Precision (P) = Number of correct documents retrieved for PD / Number of total documents retrieved for PD

Recall (R) = Number of correct documents retrieved for PD / Number of total relevant documents for PD
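For illustration only (the numbers below are hypothetical and are not experimental results), the two measures can be computed directly from the retrieval counts:

```python
def precision_recall(correct_retrieved: int, total_retrieved: int, total_relevant: int):
    """P = correct retrieved / total retrieved, R = correct retrieved / total relevant."""
    return correct_retrieved / total_retrieved, correct_retrieved / total_relevant

# e.g. 16 correctly flagged documents out of 20 flagged, with 20 truly plagiarized documents
p, r = precision_recall(16, 20, 20)   # -> (0.8, 0.8)
```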
Table 1: Performance Analysis

Sr. No | Approach        | Precision Val % | Recall Val %
1      | MLSOM           | 64%             | 60%
2      | LSI             | 63%             | 66%
3      | Proposed System | 80% and above   | 80% and above

5. CONCLUSION
This paper proposes a new plagiarism detection technique using the Karp-Rabin algorithm and a string matching algorithm together. The data dependency expression list, the extracted keywords and the dual-algorithm approach overcome the problems of the matrix method, of identical hash values and of plain string matching, and the system detects plagiarized programs or documents by using the hash function.

Experiments have verified its efficiency over existing tools and its applicability in practice. Our proposed system gives precision values of 85% and above, together with comparable recall values, and it is able to keep the failed-detection percentage to around 10%.

6. ACKNOWLEDGMENT
I am very thankful to the people who have provided me with continuous encouragement and support through all the stages of this work and helped me visualize its ideas. I am very grateful to the entire BVDU group for giving me all the facilities and the work environment which enabled me to complete my task. I express my sincere thanks to M. K. Shirsagar Sir and Prabhudeva Sir, Head of the Computer Department, VB College of Engineering, Ahmednagar, who gave me their valuable and rich guidance and helped in the presentation of this research paper.

7. REFERENCES
[1] M. Cebrián, M. Alfonseca, and A. Ortega, "Towards the validation of plagiarism detection tools by means of grammar evolution," IEEE Transactions on Evolutionary Computation, vol. 13, no. 3, June 2009.
[2] T. W. S. Chow and M. K. M. Rahman, "Multilayer SOM with tree-structured data for efficient document retrieval and plagiarism detection," IEEE Transactions on Neural Networks, vol. 20, no. 9, September 2009.
[3] F. Rosales, A. García, S. Rodríguez, J. L. Pedraza, R. Méndez, and M. M. Nieto, "Detection of plagiarism in programming assignments," IEEE Transactions on Education, vol. 51, no. 2, May 2008, pp. 174-183.
[4] Arliadinda D., Yuliuskhris Bintoro, and R. Denny Prasetyadi Utomo, "Pencocokan String dengan Menggunakan Algoritma Karp-Rabin dan Algoritma Shift Or" (String matching using the Karp-Rabin and Shift-Or algorithms), Jurnal STT Telkom.
[5] M. Cebrián, M. Alfonseca, and A. Ortega, "Towards the validation of plagiarism detection tools by means of grammar evolution," IEEE Transactions on Evolutionary Computation, vol. 13, no. 3, June 2009.
[6] X. Chen, B. Francia, M. Li, B. McKinnon, and A. Seker, "Shared information and program plagiarism detection," IEEE Transactions on Information Theory, vol. 50, no. 7, pp. 1545–1551, July 2004.
[7] A. Parker and J. O. Hamblen, "Computer algorithms for plagiarism detection," IEEE Transactions on Education, vol. 32, no. 2, pp. 94–99, May 1989.
[8] S. Schleimer, D. Wilkerson, and A. Aiken, "Winnowing: Local algorithms for document fingerprinting," in Proc. 22nd ACM SIGMOD International Conference on Management of Data, San Diego, CA, June 2003, pp. 76–85.


[9] S. Kolkur and M. Naik (Samant), "Program plagiarism detection using data dependency matrix method," in Proc. International Conference on Computer Applications 2010, Pondicherry, India, December 24-27, 2010, pp. 215-220.
[10] S. Kolkur and M. Naik (Samant), "Comparative study of two different aspects of program plagiarism detection," in Proc. International Conference on Sunrise Technologies 2011, Dhule, India, January 13-15, 2011.
[11] X. Chen, B. Francia, M. Li, B. McKinnon, and A. Seker, "Shared information and program plagiarism detection," IEEE Transactions on Information Theory, vol. 50, no. 7, July 2004.
[12] C. Liu, C. Chen, J. Han, and P. S. Yu, "GPLAG: Detection of software plagiarism by program dependence graph analysis," in Proc. KDD'06, Philadelphia, PA, USA, August 20–23, 2006.
[13] T. W. S. Chow and M. K. M. Rahman, "Multilayer SOM with tree-structured data for efficient document retrieval and plagiarism detection," IEEE Transactions on Neural Networks, vol. 20, no. 9, September 2009.
[14] K. L. Verco and M. J. Wise, "Software for detecting suspected plagiarism: Comparing structure and attribute-counting systems," in Proc. 1st SIGCSE Australasian Conference on Computer Science Education, J. Rosenberg, Ed., New York, July 1996, pp. 81–88.
[15] M. Joy and M. Luck, "Plagiarism in programming assignments," IEEE Transactions on Education, vol. 42, no. 2, pp. 129–133, May 1999.
[16] X. Chen, B. Francia, M. Li, B. McKinnon, and A. Seker, "Shared information and program plagiarism detection," IEEE Transactions on Information Theory, vol. 50, no. 7, pp. 1545–1551, July 2004.
[17] C. Liu, C. Chen, J. Han, and P. S. Yu, "GPlag: Detection of software plagiarism by program dependence graph analysis," in Proc. 12th International Conference on Knowledge Discovery and Data Mining (SIGKDD), New York, 2006, pp. 872–881.
[18] A. Apostolico, "String editing and longest common subsequences," in Handbook of Formal Languages, Volume 2: Linear Modeling: Background and Application. Berlin, Germany: Springer-Verlag, 1997, pp. 361–398.
[19] L. Bergroth, H. Hakonen, and T. Raita, "A survey of longest common subsequence algorithms," in Proc. 7th International Symposium on String Processing and Information Retrieval, Los Alamitos, CA, 2000, pp. 39–48.
[20] V. Levenshtein, "Binary codes capable of correcting deletions, insertions and reversals," Soviet Physics—Doklady, vol. 10, no. 8, pp. 707–710, 1966.
[21] C. Daly and J. Horgan, "Patterns of plagiarism," in Proc. 36th SIGCSE Technical Symposium on Computer Science Education, New York, 2005, pp. 383–387.
[22] "MC88110: Second Generation RISC Microprocessor User's Manual," Motorola Inc., Schaumburg, IL, 1991.
[23] A. S. Tanenbaum, Modern Operating Systems, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 2001.
[24] M. Córdoba and M. Nieto, July 2007, Technical University of Madrid, Spain. http://www.datsi.fi.upm.es/docencia/Estructura/U_Control
[25] C. Bell and A. Newell, Computer Structures: Readings and Examples. New York: McGraw-Hill, 1971.
[26] F. Culwin and T. Lancaster, 2001, Plagiarism, Prevention, Deterrence and Detection [Online]. Available: http://www.ilt.ac.uk/resources/Culwin-Lancaster.htm

