
Similarity Detection in Large Volume Data using Machine Learning Techniques

Plagiarism is the unauthorized copying or theft of another person's intellectual property. Two main approaches are used to counter it: external plagiarism detection and intrinsic plagiarism detection. External algorithms compare a suspicious file against numerous sources, whereas intrinsic algorithms inspect only the suspicious file itself to predict plagiarism. This work focuses on plagiarism in programs, that is, in source code files. In this setting, plagiarism means copying the entire source code, or the logic used in a particular program, without permission or copyright. Many ways exist to detect plagiarism in source code files, but checking a large dataset is computationally expensive and time-consuming. To make similarity detection in source code files computationally efficient, the Hadoop framework is used, which allows parallel computation over large datasets. However, the raw data is not in a form that existing plagiarism-checking tools can work with, because it is too large and has the characteristics of big data. A qualifying model is therefore required so that the dataset can be fed into Hadoop and processed efficiently to check for plagiarism in source code. Machine learning is used to generate such a model, combining big data processing with machine learning.
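The abstract does not say which similarity measure or Hadoop job layout the authors use, so the following is only a minimal sketch of external similarity detection between source files. It assumes token 3-gram fingerprints compared with Jaccard similarity, and it uses plain Python functions as a stand-in for Hadoop map and reduce steps. All names here (`tokenize`, `shingles`, `map_phase`, `reduce_phase`, the `threshold` value) are hypothetical and not taken from the paper.

```python
import re
from itertools import combinations


def tokenize(source: str) -> list:
    """Split source code into identifiers, numbers, and single symbols.

    Whitespace is discarded, so reformatting alone does not change the tokens.
    """
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def shingles(tokens: list, k: int = 3) -> set:
    """Return the set of all k consecutive-token windows (the fingerprint)."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared fingerprints divided by all fingerprints."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def map_phase(files: dict, k: int = 3) -> dict:
    """'Map' step (assumed layout): fingerprint each file independently.

    Each file is processed on its own, which is what would make this step
    easy to run in parallel.
    """
    return {name: shingles(tokenize(src), k) for name, src in files.items()}


def reduce_phase(fingerprints: dict, threshold: float = 0.5) -> list:
    """'Reduce' step (assumed layout): score every pair and keep likely copies."""
    flagged = []
    for n1, n2 in combinations(sorted(fingerprints), 2):
        score = jaccard(fingerprints[n1], fingerprints[n2])
        if score >= threshold:
            flagged.append((n1, n2, score))
    return flagged


if __name__ == "__main__":
    # Hypothetical inputs: b.py is a.py with different whitespace only.
    files = {
        "a.py": "def add(x, y):\n    return x + y\n",
        "b.py": "def add(x,y): return x + y",
        "c.py": "print('hello world')",
    }
    print(reduce_phase(map_phase(files)))
```

Comparing every pair of files grows quadratically with the number of files, which is one way to see why the abstract describes large-dataset checking as computationally expensive and turns to parallel computation.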

Related Results

Selection of Injectable Drug Product Composition using Machine Learning Models (Preprint)
BACKGROUND As of July 2020, a Web of Science search of “machine learning (ML)” nested within the search of “pharmacokinetics or pharmacodynamics” yielded over 100...
CREATING LEARNING MEDIA IN TEACHING ENGLISH AT SMP MUHAMMADIYAH 2 PAGELARAN ACADEMIC YEAR 2020/2021
The pandemic Covid-19 currently demands teachers to be able to use technology in teaching and learning process. But in reality there are still many teachers who have not been able ...
Similarity Search with Data Missing
Similarity search is a fundamental research problem with broad applications in various research fields, including data mining, information retrieval, and machine learning. The core...
Advanced frameworks for fraud detection leveraging quantum machine learning and data science in fintech ecosystems
The rapid expansion of the fintech sector has brought with it an increasing demand for robust and sophisticated fraud detection systems capable of managing large volumes of financi...
A comprehensive review of machine learning's role in enhancing network security and threat detection
As network security threats continue to evolve in complexity and sophistication, there is a growing need for advanced solutions to enhance network security and threat detection cap...
Machine Learning for Enhancing Mortgage Origination Processes: Streamlining and Improving Efficiency
The mortgage industry, historically characterized by manual processes, paperwork, and complex decision-making, is on the brink of a digital revolution driven by machine learning (M...
Technology Focus: Data Analytics (October 2021)
With a moderate- to low-oil-price environment being the new normal, improving process efficiency, thereby leading to hydrocarbon recovery at reduced costs, is becoming the need of ...
Inaugural Editorial of the Inspire Health First Issue Publication
Recent advances in molecular science, AI, and health informatics are transforming how complex diseases are understood, predicted, and managed. For accurate diagnosis and prognosis,...
