Research

Abstract: The cost of a skilled and competent software developer is high, and it is desirable to minimize dependency on such costly human resources. Fortunately, a large volume of experiential latent knowledge is embedded in the various artifacts present in Open Source Software (OSS) repositories. However, such knowledge exists in the form of unstructured data associated with items such as source files, code commit logs, defect reports, and comments. In recent times, we have also witnessed unprecedented advances in ML technologies and hardware capabilities. We aim to leverage the knowledge-bearing data associated with various OSS repositories, together with the latest advances in Artificial Intelligence (AI) and hardware, to create a knowledge warehouse and an expert system for the software development domain. Such an automated system would be able to reduce, if not eliminate, the involvement of human programmers in carrying out common SDLC activities.

(Read more) (Open-access link)

Keywords: Automated Software Engineering, Software Maintenance, Data Mining, Supervised Learning, Defect Prediction, Effort Estimation.

Abstract: An important issue faced during software development is to determine whether a given source file contains defects and, if so, to identify the properties of those defects. Determining the defectiveness of source code is significant because of its implications for software development and maintenance cost. We present a novel system that estimates the presence of defects in source code and detects attributes of the possible defects, such as their severity. The salient elements of our system are: i) a dataset of newly introduced source code metrics, called PROgramming CONstruct (PROCON) metrics, and ii) a novel Machine-Learning (ML) based system, called Defect Estimator for Source Code (DESCo), that makes use of the PROCON dataset for predicting defectiveness in a given scenario. The dataset was created by processing 30400+ source files written in four popular programming languages, viz. C, C++, Java, and Python.
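To illustrate the flavor of construct-count metrics like PROCON, the sketch below extracts a few hypothetical construct counts (loops, branches, exception handlers) from Python source text using the standard `ast` module. The paper's actual metric set and its multi-language extraction pipeline are not reproduced here; these three counts are assumptions for illustration only.

```python
import ast

# Hypothetical PROCON-style metrics: counts of selected programming
# constructs in a Python source file. The real PROCON metric set
# (defined in the paper) is broader; these three are illustrative.
CONSTRUCTS = {
    "loops": (ast.For, ast.While),
    "branches": (ast.If,),
    "exception_handlers": (ast.ExceptHandler,),
}

def procon_style_metrics(source: str) -> dict:
    """Count occurrences of each construct in the given source text."""
    tree = ast.parse(source)
    counts = {name: 0 for name in CONSTRUCTS}
    for node in ast.walk(tree):
        for name, node_types in CONSTRUCTS.items():
            if isinstance(node, node_types):
                counts[name] += 1
    return counts

sample = """
for i in range(10):
    if i % 2:
        try:
            print(i)
        except ValueError:
            pass
"""
print(procon_style_metrics(sample))
# → {'loops': 1, 'branches': 1, 'exception_handlers': 1}
```

Per-file metric vectors of this kind can then serve as feature vectors for a defect-prediction classifier.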

The results of our experiments show that the DESCo system outperforms one of the state-of-the-art methods with an improvement of 44.9%. To verify the correctness of our system, we compared the performance of 12 different ML algorithms with 50+ different combinations of their key parameters. Our system achieves its best results with the SVM technique, with a mean accuracy of 80.8%.

(Read more) (Open-access link)

Keywords: Maintaining software, Source code mining, Software defect prediction, Software metrics, Software faults and failures, Automated software engineering, AI in software engineering.

Code reviews are an effective method for estimating defectiveness in source code. However, manually performed code reviews tend to be slow and are limited by the skill of the reviewer. This work proposes a code review system that is effective, accurate, and efficient.

The central idea of the approach is to estimate the defectiveness of an input source code by using the defectiveness scores of similar code fragments present in various StackOverflow (SO) posts. We leverage the source code present in various GitHub repositories to train models (M) using the Paragraph Vectors algorithm (PVA). The vector representations of SO code fragments, generated using PVA, are used to perform source code similarity detection.

The significant contributions of this work are:

i) SOpostsDB: a dataset containing the PVA vectors and the SO posts information,

ii) CRUSO: a code review assisting system based on PVA models trained on SOpostsDB.

For a given input source code, CRUSO labels it as {Likely to be defective, Unlikely to be defective, Unpredictable}. To develop CRUSO, we processed >3 million SO posts and 188200+ GitHub source files. CRUSO is designed to work with source code written in the popular programming languages {C, C#, Java, JavaScript, and Python}.
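The labeling step can be sketched as follows, assuming PVA vectors have already been computed for the SO code fragments (here: toy 3-dimensional vectors) and that each fragment carries a defectiveness score. The thresholds and vectors are hypothetical stand-ins, not CRUSO's actual data or cutoffs.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy stand-ins for PVA vectors of SO code fragments, each paired with a
# defectiveness score in [0, 1] (hypothetical values, not CRUSO's data).
so_fragments = [
    ((0.9, 0.1, 0.0), 0.8),   # fragment known to be defect-prone
    ((0.8, 0.2, 0.1), 0.7),
    ((0.0, 0.9, 0.4), 0.1),   # fragment known to be clean
]

SIM_THRESHOLD = 0.95   # minimum similarity to trust a match (assumed)
DEFECT_CUTOFF = 0.5    # defectiveness score above which we flag (assumed)

def label(input_vec):
    """Label input code by the defectiveness of its most similar fragment."""
    best_sim, best_score = max(
        (cosine(input_vec, v), score) for v, score in so_fragments
    )
    if best_sim < SIM_THRESHOLD:
        return "Unpredictable"      # no sufficiently similar fragment
    return ("Likely to be defective" if best_score > DEFECT_CUTOFF
            else "Unlikely to be defective")

print(label((0.85, 0.15, 0.05)))  # close to the defect-prone fragments
print(label((0.1, 0.8, 0.5)))     # close to the clean fragment
print(label((0.5, 0.5, 0.5)))     # not close enough to anything
```

The "Unpredictable" label captures the case where no stored fragment is similar enough to support a judgment.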

CRUSO outperforms an existing code review approach with an improvement of 97.82% in response time and a storage reduction of 99.15%. CRUSO achieves the highest mean accuracy score of 99.6% when tested with the C programming language, thus achieving an improvement of 5.6% over the existing method.

(Read more) (Open-access link)

Keywords: Automated code review, StackOverflow, Paragraph Vector, Code quality, Software maintenance.

Third-party libraries (TPLs) provide ready-made implementations of various software functionalities and are frequently used in software development. However, as software development progresses through various iterations, there often remains an unused set of TPLs referenced in the application’s distributable. These unused TPLs become a prominent source of software bloat and are responsible for excessive consumption of resources, such as CPU cycles, memory, and mobile devices’ battery. Thus, the identification of such bloat-TPLs is essential. We present a rapid, storage-efficient, obfuscation-resilient method to detect bloat-TPLs. The novel aspects of our approach are: i) computing a vector representation of a .class file using a model that we call Jar2Vec, which is trained using the Paragraph Vector Algorithm; ii) converting a .class file to a normalized form via semantics-preserving transformations before using it to train the Jar2Vec models; and iii) a Bloat-Library Detector (BloatLibD) developed and tested with 27 different Jar2Vec models. These models were trained using different parameters and >30000 .class files taken from >100 different Java libraries available at MavenCentral.com. BloatLibD achieves an accuracy of 99% with an F1 score of 0.968 and outperforms the existing tools, viz., LibScout, LiteRadar, and LibD, with accuracy improvements of 74.5%, 30.33%, and 14.1%, respectively. Compared with LibD, BloatLibD achieves a response time improvement of 61.37% and a storage reduction of 87.93%.
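The library-identification step can be sketched as follows, assuming Jar2Vec vectors have already been computed for the .class files of known libraries (here: toy 3-dimensional per-library vectors). The library names, vectors, and threshold are illustrative assumptions, not BloatLibD's actual data.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy stand-ins for Jar2Vec vectors of known libraries (hypothetical
# values; real vectors come from PVA models trained on normalized
# .class files).
library_vectors = {
    "commons-lang": (0.9, 0.2, 0.1),
    "guava": (0.1, 0.9, 0.3),
    "gson": (0.2, 0.1, 0.9),
}

MATCH_THRESHOLD = 0.9  # assumed minimum similarity for a confident match

def identify_library(class_vec):
    """Return the known library most similar to a .class vector, or None."""
    name, sim = max(
        ((lib, cosine(class_vec, v)) for lib, v in library_vectors.items()),
        key=lambda pair: pair[1],
    )
    return name if sim >= MATCH_THRESHOLD else None

print(identify_library((0.85, 0.25, 0.15)))  # matches a known library
print(identify_library((0.5, 0.5, 0.5)))     # no confident match
```

Once the libraries present in a distributable are identified this way, those never exercised by the application can be flagged as bloat candidates.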

(Read more) (Open-access link)

Keywords: Third-party library detection, code similarity, Paragraph Vectors, Software Bloat, Obfuscation.

OSS effort estimation using software features similarity and developer activity-based metrics

Software development effort estimation (SDEE) generally involves leveraging information about the effort spent in developing similar software in the past. Most organizations do not have access to sufficient and reliable data of this kind from past projects. As a result, existing SDEE methods suffer from low usage and accuracy.

We propose an efficient SDEE method for open source software, which provides accurate and fast effort estimates. The significant contributions of our paper are: i) novel SDEE software metrics derived from developer activity information of various software repositories, ii) an SDEE dataset comprising the SDEE metrics' values derived from approx. 13,000 GitHub repositories from 150 different software categories, and iii) an effort estimation tool based on the SDEE metrics and a software description similarity model trained using the Paragraph Vectors algorithm (PVA) on the software product descriptions of GitHub repositories. Given the description of a newly envisioned software, our tool yields an effort estimate for developing it, along with information about the existing functionally similar software.
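The estimation step can be sketched as follows, assuming a PVA model has already mapped software descriptions to vectors (here: toy 3-dimensional vectors paired with hypothetical effort values in person-months). The similarity-weighted averaging shown is an illustrative aggregation over the most similar projects, not necessarily the paper's exact procedure.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy stand-ins: (name, description vector, effort in person-months)
# for existing repositories; all values are hypothetical.
known_projects = [
    ("text-editor-a", (0.9, 0.1, 0.1), 14.0),
    ("text-editor-b", (0.8, 0.2, 0.0), 18.0),
    ("game-engine", (0.1, 0.9, 0.3), 60.0),
]

def estimate_effort(desc_vec, top_k=2):
    """Similarity-weighted mean effort of the top_k most similar projects."""
    scored = sorted(
        ((cosine(desc_vec, v), name, effort)
         for name, v, effort in known_projects),
        reverse=True,
    )[:top_k]
    total_w = sum(sim for sim, _, _ in scored)
    estimate = sum(sim * effort for sim, _, effort in scored) / total_w
    return estimate, [name for _, name, _ in scored]

# Description vector of a newly envisioned editor-like software.
effort, similar = estimate_effort((0.85, 0.15, 0.05))
print(round(effort, 1), similar)
# → 16.0 ['text-editor-a', 'text-editor-b']
```

Returning the similar projects alongside the estimate mirrors the tool's behavior of reporting existing functionally similar software.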

Our method achieves the highest Standard Accuracy (SA) score of 87.26% when compared with the existing methods, and achieves an SA of 42.7% with the Automatic Transformed Linear Baseline model. (Complete read coming soon.)

Keywords: Effort estimation, software development effort, developer activity, software maintenance, software planning.