International Journal of Computer Applications (0975 – 8887)
Volume 116 – No. 23, April 2015
These changes can be detected by structure metric systems, but the exact plagiarism percentage cannot be measured. PK2 is a structure metric tool developed by the Technical University of Madrid, Spain. The only situation that PK2 cannot detect is an assignment compounded from several very small fragments of source code. These tools may give false results in the case of shuffling of statements or addition of unnecessary statements. A detailed comparison of tools based on these two concepts is given in paper [10].

The data dependency matrix method [8, 9] developed by us is a new concept based on the data assignment statements of a program. Both of these methods are elaborated and compared in the following chapters.

The PK2 tool has been developed in the Computer Architecture Department of the Technical University of Madrid, Spain [1]. Students are asked to develop small projects using system programming in C, assembly language programming, input/output systems and microprogramming. Students copied or plagiarized programs by the same methods. Each programming language has its own keywords (reserved words), and these keywords can be used to catch the cheater. Plagiarists are people who do not know enough to do the assignment on their own. They usually make aesthetic changes without significantly altering the underlying program structure, randomly changing identifier names, comments, punctuation, and indentation [1].

The PK2 tool is based on a structure metric system. While comparing programs in a language such as Java, only its reserved words and the most frequently used library function names are considered. PK2 processes each given program file, transforming it into an internal representation. This process translates each occurrence of a keyword into a corresponding internal symbol and generates a signature string. The tool compares only the underlying program structure. As this tool is based on a structure metric system, it uses similarity criteria such as the following:

1. Length of the longest common substring.
2. Cumulative value of the lengths of common sequences.

James A. McCart and Jay Jarman, who were both working on Microsoft Access projects, proposed and developed a tool called Cheater Cheater Pumpkin Eater (CCPE) in 2008. CCPE was written using Visual Basic for Applications within a Microsoft Access database. To determine whether Microsoft Access projects were duplicates, properties such as the read-only creation date of the database and of its objects (tables, queries, forms, reports, etc.) were compared. When a database or an object within a database is created, a document object (DO) is created which stores the properties of the newly created database or object.

Each DO contains standard properties such as the creation date, last updated date, and name. The built-in summary properties of the database, such as the database title, are stored in a separate DO. The last updated date property is associated with changes to the built-in summary properties. If a database is copied, then it is an exact duplicate of the original, and all of the creation date, last updated date, and name properties for all of the objects within the copied database are the same as in the original.

The CCPE is a very effective technological tool implemented to detect plagiarism in Microsoft Access database projects. This tool has given positive results by reducing the percentage of plagiarized projects [3].

Tommy W. S. Chow and M. K. M. Rahman developed the approach of "Multilayer SOM With Tree-Structured Data for Efficient Document Retrieval and Plagiarism Detection" in 2009 [2]. They proposed document retrieval (DR) and plagiarism detection (PD) using a tree-structured document representation and a multilayer self-organizing map (MLSOM).

Their system evaluates a full input document or program as a query for performing data retrieval and PD. The tree-structured representation of documents increases the accuracy of DR and PD by combining local with traditional global characteristics. This hierarchical representation of global and local document features enables the MLSOM to be used for PD. Chow and Rahman proposed two methods of PD: the first is an extension of the DR method with additional local sorting, and the second is document association on the bottom-layer SOM. The computational cost of DR and PD is high because a huge number of documents must be scanned at once, since the method is intended for large databases [2].

The SID tool was developed by Xin Chen, Brent Francia, Ming Li, Brian McKinnon, and Amit Seker at the University of California [11]. They defined the information distance between two sequences to be, roughly, the minimum amount of energy needed to convert one sequence into the other, and vice versa.

SID has been widely examined and then used, and users have applied it successfully to catch plagiarism cases. It has checked UCSB programming assignments, Java assignments and many other projects. The SID system uses a special compression program to heuristically approximate Kolmogorov complexity. It also shows the similar parts in two programs, which reduces the teacher's work in searching for the copied parts, and it detects subtler similarities.

3. SYSTEM ALGORITHM
Karp-Rabin Algorithm:
We take the help of the Karp-Rabin algorithm, which uses fingerprints to find occurrences of one string within another. The Karp-Rabin algorithm reduces the time needed to compare two sequences by assigning a hash value to each string and word.
Without hash values, comparison takes too much time: if there is a word W and an input string S, the word must be compared with every string and substring in the program, which consumes far more time. Karp-Rabin introduced the concept of hash values to avoid this quadratic time complexity. It assigns a computed hash value to both the word and each string/substring, and only when the hash of a substring of S matches the hash value of W is an exact comparison performed.

In the comparison process there are four categories [4]:

1. Right to left
2. Left to right
3. In a specific order
4. In any order

The Karp-Rabin algorithm prefers the left-to-right comparison category. The hash function must be able to compute hash values efficiently: when a string is hashed for the first time, its hash value is saved, and this saved value is then compared against the hash value computed at each index of the text. For a string of character codes c1, c2, ..., ck over a base b, the hash value is

h = c1*b^(k-1) + c2*b^(k-2) + ... + c(k-1)*b + ck

Fig 1: System Architecture (input files are parsed into token sequences, passed through a compressor, and compared to produce an n*n similarity matrix)
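The rolling-hash search described above can be sketched as follows. This is a minimal illustration of the Karp-Rabin technique, not the system's actual implementation; the base and modulus values are arbitrary choices.

```python
def rabin_karp(text, pattern, base=256, mod=1_000_003):
    """Find all start indices of `pattern` in `text` using rolling hashes."""
    n, k = len(text), len(pattern)
    if k == 0 or k > n:
        return []
    # h = base^(k-1) mod mod, used to remove the leading character of a window.
    h = pow(base, k - 1, mod)
    # Hash of the pattern and of the first window of the text:
    # c1*b^(k-1) + c2*b^(k-2) + ... + ck  (mod mod)
    p_hash = t_hash = 0
    for i in range(k):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - k + 1):
        # Only on a hash match do we fall back to exact character comparison.
        if p_hash == t_hash and text[i:i + k] == pattern:
            matches.append(i)
        if i < n - k:  # slide the window one character to the right
            t_hash = ((t_hash - ord(text[i]) * h) * base
                      + ord(text[i + k])) % mod
    return matches
```

Because the hash of each window is updated in constant time from the previous window, expensive character-by-character comparison is performed only on the rare hash collisions; for example, `rabin_karp("abracadabra", "abra")` returns `[0, 7]`.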
We use the dependency matrix for the comparison of same-size matrices [9]. It is assumed that a plagiarist will change the text, position or names of variables, but that the total number of variables in a function will remain the same. Such plagiarized code can be detected using the algorithm. The expression list algorithm compares all the expression lists of one function with those of another, which is advantageous in various ways. In this case, however, the matrix method can give false detections, as it compares only same-size matrices; this is the drawback of the matrix system.

String matching Algorithm:
It is used to compute similar strings. It performs character-by-character matching.

4. SYSTEM ARCHITECTURE
The block diagram shown in Fig 1 gives an outline of plagiarism detection using the hash function and string matching algorithms. It gives an overall idea of the processing of string matching as well as of creating the similarity matrix for finding out the percentage of plagiarism.

Fig 2: System Architecture of Plagiarism Detection (input document, document recognition (parsing), keyword extraction, string matching algorithm, discovery of similarity, plagiarized text)

The system architecture is shown in Figure 2. The input document is first given to the parsing process. Parsing, also known as syntactic analysis, analyzes the given input, which can be in a natural language or a machine-understandable language. It checks heuristic rules which are predefined in the system, confirms grammar rules while matching strings, and breaks each sentence into tokens, a step known as segmentation.

In keyword extraction, the system finds and extracts keywords from the whole input document, removes stop words, and stores the results as stemmed words in a keyword list.

The keywords are then given as input to the search engine of the system. The web search engine contains an internal dataset provided by internal users/candidates from colleges, universities, etc. We can also map an external dataset onto the existing system dataset.
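A minimal sketch of the keyword-extraction step described above. The stop-word list and the crude suffix-stripping stemmer here are illustrative assumptions, not the actual components used in the system.

```python
# Illustrative stop-word list; a real system would use a much fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "or", "in", "to", "for"}

def stem(word):
    """Very crude suffix-stripping stemmer (illustrative only)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_keywords(document):
    """Tokenize, drop stop words, and return the stemmed keyword list."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in document.split()]
    return [stem(t) for t in tokens if t and t not in STOP_WORDS]
```

For example, `extract_keywords("The students are copying programs")` yields `["student", "copy", "program"]`; the resulting keyword list is what would be handed to the search engine for matching.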
The search engine forwards two sets of strings for matching to the comparison engine module, in which the input dataset and the internal dataset query are present.

Here we use the Karp-Rabin algorithm along with the string matching algorithm for detecting suspicious material in the given documents. Karp-Rabin is a string searching and comparison algorithm based on a hash function. It speeds up the processing of string comparison by matching the given pattern against the strings/substrings of the different input documents using hash values, assigning a hash value to every string/substring in the text. If the hash values of the given pattern and a substring match, the two strings are similar.

We have computed plagiarism detection (PD) accuracy by using the precision value and recall value, as follows [13]:

Precision Value (P) = Number of correct documents recovered for PD / Number of total documents recovered for PD

Recall Value (R) = Number of correct documents recovered for PD / Number of total relevant documents for PD

For example, if 8 of 10 recovered documents are correct and 16 relevant documents exist in total, then P = 8/10 = 80% and R = 8/16 = 50%.

Table 1: Performance Analysis

Sr. No | Approach        | Precision Val % | Recall Val %
1      | MLSOM           | 64%             | 60%
2      | LSI             | 63%             | 66%
3      | Proposed System | 80% and above   | 80% and above

5. CONCLUSION
This paper proposes new plagiarism detection techniques using the Karp-Rabin algorithm and the string matching algorithm. The data dependency expression list, the extracted keywords and the dual-algorithm approach together overcome the problems of the matrix method, of similar hash values and of plain string matching, detecting plagiarized programs or documents by using the hash function. Experiments have verified its efficiency over existing tools and its applicability in practice. Our proposed system gives precision values of up to 85% and comparable recall values, and it is also able to keep the failed-detection percentage around 10%.

6. REFERENCES
[3] Francisco Rosales, Antonio García, Santiago Rodríguez, José L. Pedraza, Rafael Méndez, and Manuel M. Nieto, "Detection of Plagiarism in Programming Assignments", IEEE Transactions on Education, vol. 51, no. 2, May 2008, pp. 174-183.

[4] Arliadinda D., Yuliuskhris Bintoro, and R. Denny Prasetyadi Utomo, "Pencocokan String dengan Menggunakan Algoritma Karp-Rabin dan Algoritma Shift Or" (String matching using the Karp-Rabin and Shift-Or algorithms), Jurnal STT Telkom.

[5] Manuel Cebrián, Manuel Alfonseca, and Alfonso Ortega, "Towards the Validation of Plagiarism Detection Tools by Means of Grammar Evolution", IEEE Transactions on Evolutionary Computation, vol. 13, no. 3, June 2009.

[6] X. Chen, B. Francia, M. Li, B. McKinnon, and A. Seker, "Shared information and program plagiarism detection," IEEE Trans. Inf. Theory, vol. 50, no. 7, pp. 1545–1551, Jul. 2004.

[7] A. Parker and J. O. Hamblen, "Computer algorithm for plagiarism detection," IEEE Trans. Educ., vol. 32, no. 2, pp. 94–99, May 1989.

[8] S. Schleimer, D. Wilkerson, and A. Aiken, "Winnowing: Local algorithms for document fingerprinting," in Proc. 22nd Association for Computing Machinery Special
Interest Group Management of Data Int. Conf., San Diego, CA, Jun. 2003, pp. 76–85.

[9] Seema Kolkur and Madhavi Naik (Samant), "Program plagiarism detection using data dependency matrix method", in Proceedings of the International Conference on Computer Applications 2010, Pondicherry, India, December 24-27, 2010, pp. 215-220.

[10] Seema Kolkur and Madhavi Naik (Samant), "Comparative study of two different aspects of program plagiarism detection", presented and published in Proceedings of the International Conference on Sunrise Technologies 2011, Dhule, India, January 13-15, 2011.

[11] Xin Chen, Brent Francia, Ming Li, Brian McKinnon, and Amit Seker, "Shared Information and Program Plagiarism Detection", IEEE Transactions on Information Theory, vol. 50, no. 7, July 2004.

[12] Chao Liu, Chen Chen, Jiawei Han, and Philip S. Yu, "GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis", KDD'06, Philadelphia, Pennsylvania, USA, August 20–23, 2006.

[13] Tommy W. S. Chow and M. K. M. Rahman, "Multilayer SOM With Tree-Structured Data for Efficient Document Retrieval and Plagiarism Detection", IEEE Transactions on Neural Networks, vol. 20, no. 9, September 2009.

[14] K. L. Verco and M. J. Wise, "Software for detecting suspected plagiarism: Comparing structure and attribute-counting systems," in Proc. 1st SIGCSE Australasian Conf. Computer Science Education, J. Rosenberg, Ed., New York, Jul. 1996, pp. 81–88.

[15] M. Joy and M. Luck, "Plagiarism in programming assignments," IEEE Trans. Educ., vol. 42, no. 2, pp. 129–133, May 1999.

[16] X. Chen, B. Francia, M. Li, B. McKinnon, and A. Seker, "Shared information and program plagiarism detection," IEEE Trans. Inf. Theory, vol. 50, no. 7, pp. 1545–1551, Jul. 2004.

[17] C. Liu, C. Chen, J. Han, and P. S. Yu, "GPlag: Detection of software plagiarism by program dependence graph analysis," in Proc. 12th Int. Conf. Special Interest Group Knowledge Discovery and Data Mining, New York, 2006, pp. 872–881.

[18] A. Apostolico, "String editing and longest common subsequences," in Handbook of Formal Languages, Volume 2: Linear Modeling: Background and Application. Berlin, Germany: Springer-Verlag, 1997, pp. 361–398.

[19] L. Bergroth, H. Hakonen, and T. Raita, "A survey of longest common subsequence algorithms," in Proc. 7th Int. Symp. String Processing Information Retrieval, Los Alamitos, CA, 2000, pp. 39–48.

[20] V. Levenshtein, "Binary codes capable of correcting deletions, insertions and reversals," Sov. Phys. Dokl., vol. 10, no. 8, pp. 707–710, 1966.

[21] C. Daly and J. Horgan, "Patterns of plagiarism," in Proc. 36th SIGCSE Tech. Symp. Computer Science Education, New York, 2005, pp. 383–387.

[22] "MC88110: Second Generation RISC Microprocessor User's Manual," Motorola Inc., Schaumburg, IL, 1991.

[23] A. S. Tanenbaum, Modern Operating Systems, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 2001.

[24] M. Córdoba and M. Nieto, Jul. 2007, Technical Univ. Madrid, Spain. http://www.datsi.fi.upm.es/docencia/Estructura/U_Control

[25] C. Bell and A. Newell, Computer Structures: Readings and Examples. New York: McGraw-Hill, 1971.

[26] F. Culwin and T. Lancaster, 2001, Plagiarism, Prevention, Deterrence and Detection [Online]. Available: http://www.ilt.ac.uk/resources/Culwin-Lancaster.htm