Naturalness of natural language artifacts in software
Abstract
We present a study of the naturalness of natural language artifacts in software. Here, naturalness essentially means repetitiveness or predictability. By natural language artifacts, we mean source code comments, revision history messages, bug reports and so on. We measure naturalness using a standard metric, the cross-entropy (or perplexity) of the widely used N-Gram language models. Previously, Hindle et al. demonstrated empirically that source code is more repetitive or regular (i.e., more natural) than traditional English text. A question that logically follows from their work is how natural the other artifacts associated with software are. We present our findings on source code comments, commit logs, bug reports, string messages and content from the popular question and answer forum, StackOverflow. Each of the artifacts that we examine is a natural language artifact associated with software. However, they do not all exhibit the same amount of regularity (naturalness). Commit logs were the most regular, followed by string literal messages and source code comments. Content from StackOverflow (viz., titles, questions and answers) behaved similarly to traditional English text, i.e., it showed comparatively less regularity. Bug reports from industrial projects exhibited more regularity than bug reports from open source projects, whose naturalness resembled that of typical English text. Our findings have implications for the feasibility of building tools such as comment and bug report completion engines. We describe a next-word prediction tool that we built using the N-Gram language model. This tool achieved an accuracy ranging from 70 to 90% on commit messages in different projects, and from 56 to 78% on source code comments. We also present a part-of-speech-based analysis of the words that are easy to predict and those that are difficult to predict.
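To make the measurement concrete, the following is a minimal sketch of how cross-entropy and next-word prediction can be computed with an N-Gram model. It is not the tool described above: the choice of a bigram model (N = 2), add-one smoothing, whitespace tokenization and the toy commit messages are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # sentence boundary markers (illustrative choice)


def train_bigram(sentences):
    """Collect bigram and context counts from tokenized sentences."""
    context_counts = Counter()
    bigram_counts = defaultdict(Counter)
    vocab = set()
    for tokens in sentences:
        padded = [START] + tokens + [END]
        vocab.update(padded)
        for prev, word in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[prev][word] += 1
    return context_counts, bigram_counts, vocab


def prob(model, prev, word):
    """Add-one smoothed P(word | prev); one extra slot is reserved for unseen words."""
    context_counts, bigram_counts, vocab = model
    seen = bigram_counts.get(prev, Counter())[word]
    return (seen + 1) / (context_counts[prev] + len(vocab) + 1)


def cross_entropy(model, sentences):
    """Average negative log2 probability per token; lower means more predictable."""
    log_sum, n = 0.0, 0
    for tokens in sentences:
        padded = [START] + tokens + [END]
        for prev, word in zip(padded, padded[1:]):
            log_sum += math.log2(prob(model, prev, word))
            n += 1
    return -log_sum / n


def predict_next(model, prev):
    """Return the most frequent follower of `prev` in training, or None if unseen."""
    followers = model[1].get(prev)
    return followers.most_common(1)[0][0] if followers else None


# Hypothetical toy corpus standing in for commit messages.
train = [s.split() for s in ["fix typo in readme", "fix typo in docs", "fix bug in parser"]]
model = train_bigram(train)

familiar = cross_entropy(model, ["fix typo in readme".split()])
scrambled = cross_entropy(model, ["readme in typo fix".split()])
perplexity = 2 ** familiar  # perplexity is 2 raised to the cross-entropy in bits
```

In this sketch the familiar message receives a lower cross-entropy than the scrambled one, which is the sense in which a more regular corpus is "more natural". Prediction accuracy could then be estimated as the fraction of held-out tokens for which `predict_next` returns the actual next word.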