Automatic Identification of Similar Pull-Requests in GitHub’s Repositories Using Machine Learning

Context: On social coding platforms such as GitHub, contributors frequently use the pull-request mechanism to submit code changes to the reviewers of a repository. In general, these changes either add a new feature or fix an existing bug. However, because the mechanism is distributed, different contributors can unintentionally submit similar pull-requests that perform similar development activities. Similar pull-requests may then be reviewed in parallel by different reviewers, which duplicates reviewing time and effort and complicates collaboration.

Objective: It is therefore useful to assign similar pull-requests to the same reviewer, who can then decide which pull-request to accept with less time and effort. In this article, we propose grouping similar pull-requests into clusters so that each cluster is assigned to the same reviewer or the same reviewing team.

Method: We first extract descriptive textual information from the content of pull-requests. We then use the extracted information to measure similarity among pull-requests. Finally, machine learning algorithms (K-Means clustering and agglomerative hierarchical clustering) group similar pull-requests together.

Results: To validate our proposal, we applied it to twenty popular repositories from a public dataset. The experimental results show that the proposed approach achieves promising results in terms of precision and recall, the well-known metrics in this area. Furthermore, it helps save reviewers' time and effort.

Conclusion: K-Means achieves average precision and recall of 94% and 91%, respectively, over all considered repositories, while agglomerative hierarchical clustering achieves 93% and 98%, respectively. Moreover, the proposed approach saves on average between 67% and 91% of reviewing time and effort with K-Means and between 67% and 83% with agglomerative hierarchical clustering.
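The abstract does not spell out the pipeline, so the following is only a minimal sketch of the general idea, under stated assumptions: pull-request titles are turned into raw token counts, similarity is cosine similarity, and grouping uses a plain average-linkage agglomerative procedure written with the standard library. K-Means is not shown. The example titles, the function names, and the pairwise definition of precision and recall are all hypothetical illustrations, not the authors' implementation or evaluation protocol.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Hypothetical pull-request titles; items 0-1 and 2-3 are meant to be similar.
PULL_REQUESTS = [
    "Fix null pointer crash in login handler",
    "Fix crash caused by null pointer in login handler",
    "Add dark mode toggle to settings page",
    "Add dark mode toggle option in settings page",
]
TRUE_GROUPS = [0, 0, 1, 1]  # assumed reference grouping for the toy data


def tokenize(text):
    """Lowercase a text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerative(texts, n_clusters):
    """Average-linkage agglomerative clustering; returns one label per text."""
    vectors = [tokenize(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]

    def linkage(c1, c2):
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the two clusters with the highest average similarity.
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]
        del clusters[b]  # safe: combinations guarantees a < b

    labels = [0] * len(texts)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def same_cluster_pairs(labels):
    """Set of index pairs that share a cluster label."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}


def pairwise_precision_recall(predicted, truth):
    """Pairwise precision and recall of a clustering against reference groups."""
    pred_pairs = same_cluster_pairs(predicted)
    true_pairs = same_cluster_pairs(truth)
    hits = len(pred_pairs & true_pairs)
    precision = hits / len(pred_pairs) if pred_pairs else 1.0
    recall = hits / len(true_pairs) if true_pairs else 1.0
    return precision, recall


if __name__ == "__main__":
    labels = agglomerative(PULL_REQUESTS, n_clusters=2)
    print(labels, pairwise_precision_recall(labels, TRUE_GROUPS))
```

In this sketch, each resulting cluster would be handed to a single reviewer or reviewing team. The number of clusters is passed in by hand here; how it is chosen in practice is not stated in the abstract.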

MDPI AG

Hamzeh Eyal Salman, Zakarea Alshara, Abdelhak-Djamel Seriai

Information

2022
