Automatic Identification of Similar Pull-Requests in GitHub’s Repositories Using Machine Learning
Context: On social coding platforms such as GitHub, contributors use the pull-request mechanism to submit code changes to the reviewers of a repository. These changes typically add a new feature or fix an existing bug. Because the mechanism is distributed, different contributors can unintentionally submit similar pull-requests that perform similar development activities. Such pull-requests may be reviewed in parallel by different reviewers, which duplicates reviewing time and effort and complicates the collaboration process.
Objective: It is therefore useful to assign similar pull-requests to the same reviewer, who can then decide which pull-request to choose with less time and effort. In this article, we propose grouping similar pull-requests into clusters so that each cluster is assigned to the same reviewer or the same reviewing team, which saves reviewing time and effort.
Method: We first extract descriptive textual information from the content of pull-requests. We then use the extracted information to measure the similarity among pull-requests. Finally, we apply machine learning algorithms (K-Means clustering and agglomerative hierarchical clustering) to group similar pull-requests together.
Results: To validate our proposal, we applied it to twenty popular repositories from a public dataset. The experimental results show that the proposed approach achieves promising results in terms of precision and recall, the well-known metrics in this area. Furthermore, it helps save reviewers' time and effort.
Conclusion: According to the obtained results, the K-Means algorithm achieves average precision and recall of 94% and 91%, respectively, over all considered repositories, while agglomerative hierarchical clustering achieves 93% and 98%, respectively. Moreover, the proposed approach saves between 67% and 91% of reviewing time and effort on average with K-Means, and between 67% and 83% with agglomerative hierarchical clustering.
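The three steps named under Method (extract text, measure similarity, cluster) can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say which textual features, similarity measure, or linkage they use, so TF-IDF weighting, cosine similarity, average linkage, and the 0.3 stopping threshold are all assumptions chosen for illustration, and the pull-request titles are invented. Only the agglomerative variant is shown, using the Python standard library alone.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalised sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed inverse document frequency (an assumed weighting scheme).
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def cluster_pull_requests(texts, threshold=0.3):
    """Average-linkage agglomerative clustering over TF-IDF cosine similarity.

    Repeatedly merges the two most similar clusters and stops once the best
    remaining pair falls below `threshold` (a hypothetical cut-off).
    """
    vectors = tfidf_vectors(texts)
    clusters = [[i] for i in range(len(texts))]

    def avg_sim(c1, c2):
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        sim, a, b = max(
            ((avg_sim(clusters[a], clusters[b]), a, b)
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[0],
        )
        if sim < threshold:
            break
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


# Invented pull-request titles, used only to exercise the sketch.
pull_requests = [
    "Fix null pointer crash in login handler",
    "Fix crash in login handler when user is null",
    "Add dark mode theme to settings page",
    "Add dark theme option to settings page",
    "Update README installation instructions",
]

clusters = cluster_pull_requests(pull_requests)
for cluster in clusters:
    print([pull_requests[i] for i in cluster])
```

Each resulting cluster would then go to one reviewer or reviewing team. A threshold-based stop is used here because the number of clusters is not known in advance; K-Means, by contrast, needs the number of clusters as an input. In practice a library such as scikit-learn provides both algorithms.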

