Effect of Identifier Tokenization on Automatic Source Code Documentation

Author : Dr Sawan Rai

Year : 2022

Publisher : Springer Science and Business Media Deutschland GmbH

Source Title : Arabian Journal for Science and Engineering

Abstract

In software development, source code documentation plays an essential role in program comprehension and software maintenance. Natural language descriptions and identifier names are its main parts. Generating this documentation automatically saves developers working hours, and automatic source code documentation is currently a rapidly growing research area. Researchers have proposed various template-based, information retrieval (IR) based, and learning-based techniques for it. However, little work has examined preprocessing and its effect on automatic source code documentation. Tokenization is one of the essential preprocessing steps. We found important flaws in the basic tokenization steps that could affect the performance of automatic source code documentation. We therefore propose an updated tokenization approach that removes these flaws. To analyze its effect, we performed method name prediction and comment generation studies. We found that the updated tokenization improved the performance of automatic source code documentation: name prediction and comment generation improved by more than 2.5% and 3.5%, respectively, in terms of F1 score.
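The abstract does not describe the tokenization rules themselves, so the following is only a generic illustration of identifier tokenization and of an F1 score computed over subtokens, not the authors' approach. The function names `split_identifier` and `subtoken_f1`, the regular expression, and the choice to score at the subtoken level are all assumptions made for this sketch.

```python
import re
from collections import Counter

# Assumed splitting rules (not taken from the paper):
#   1. an uppercase run not followed by a lowercase letter (acronyms such as "HTTP"),
#   2. an optional capital followed by lowercase letters ("Response", "get"),
#   3. a run of digits.
# Underscores and other separators match none of these, so they are dropped.
_SUBTOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(name: str) -> list[str]:
    """Split a camelCase or snake_case identifier into lowercase subtokens."""
    return [m.group(0).lower() for m in _SUBTOKEN.finditer(name)]


def subtoken_f1(predicted: str, reference: str) -> float:
    """F1 between the subtoken multisets of a predicted and a reference name."""
    pred = Counter(split_identifier(predicted))
    ref = Counter(split_identifier(reference))
    overlap = sum((pred & ref).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


print(split_identifier("getHTTPResponseCode"))  # ['get', 'http', 'response', 'code']
print(split_identifier("max_buffer_size2"))     # ['max', 'buffer', 'size', '2']
print(round(subtoken_f1("getUserName", "get_user_id"), 3))  # 0.667
```

A naive splitter that breaks before every capital letter would turn `HTTP` into four single-letter tokens; handling acronyms and digit runs is one example of the kind of detail a tokenizer has to get right before identifiers are passed to a name prediction or comment generation model.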