Innovations on Automatic Test Data Generation

PDF files in automatic test data generation

By: Morteza Zakeri, Ph.D. Student

April 2020


Fuzz testing (fuzzing) is a dynamic software testing technique. By repeatedly generating malformed test data and injecting it into the software under test (SUT), it searches for possible faults and vulnerabilities. To this end, fuzz testing requires a wide variety of test data. The most critical challenge is handling the complexity of file structures given as program input. Surveys have revealed that, in these cases, many of the generated test data exercise only a restricted number of superficial paths, because they are rejected by the parser of the SUT in the early stages of parsing. Using the grammatical structure of input files to generate test data increases code coverage. However, grammar extraction is often performed manually, which is a time-consuming, costly, and error-prone task.

Recently, we proposed an automated method for hybrid test data generation. We applied neural language models (NLMs) built from recurrent neural networks (RNNs). Using deep learning techniques, the proposed models learn the statistical structure of complex files and then generate new test data: textual data based on the learned grammar, and binary data based on mutations. Fuzzing with the generated data is done by two newly introduced algorithms, called neural fuzz algorithms, that use these models. We used our proposed method to generate test data for, and then fuzz-test, MuPDF, a complex piece of software that takes portable document format (PDF) files as input. To train our generative models, we gathered a large corpus of PDF files. Our experiments demonstrate that the data generated by this method increases code coverage by more than 7.0% compared with state-of-the-art file format fuzzers such as american fuzzy lop (AFL). Experiments also indicate better learning accuracy for simpler NLMs than for the more complicated encoder-decoder model, and confirm that our proposed models can outperform the encoder-decoder model in code coverage when fuzzing the SUT.

Our paper presents a solution for complex test data generation with the help of deep neural networks. The word “complex” in this context means that the test data consist of various data types gathered together according to a specific format or grammar. This is what happens in most real-world applications that accept a file as their main input. For example, PDF reader software must handle PDF files as input, and PDF is one of the most complex input formats. A PDF file contains both textual and non-textual (binary) data, plus many human-defined rules that place these data fields beside each other to form a file. To handle a complex file, an application processes the file in several stages. These stages usually begin with parsing the input file, continue with semantic analysis, and finally terminate by executing the file content. Generating test data (here, a complex input file) that can reach high code coverage and find the bugs that probably exist requires the test data to pass all of these stages. To the best of our knowledge, the methods used in fuzzing, one of the most effective software testing techniques, show that randomly generating such test data leads to very low code coverage and hence cannot guarantee the absence of bugs or the reliability of the software. On the other hand, generating test data from a grammar requires extracting the grammar or model of the file manually, which is expensive and time-consuming.

Challenges and Solutions

The problem of generating complex test data that successfully pass the different stages of file processing is addressed by using machine learning techniques to learn the structure of a given input file and then generate new test data based on the learned model. A file can be seen as a sequence of bytes generated by the grammar of that file format. Hence, we can use a language model to learn this grammar automatically from a corpus containing various samples of the given file format. Neural language models are effective at learning natural language properties and have been successfully applied to complex natural language processing (NLP) tasks such as machine translation and image captioning. We apply a deep neural language model to learn the grammar of the file. The learned model is then sampled to generate new files as test data.
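
Our models are RNN-based neural language models; purely as an illustration of the underlying idea, that the statistics of a file format can be learned from a corpus of samples and then sampled to emit new, format-like sequences, here is a minimal character-level bigram model. It is a stand-in sketch, not the paper's architecture, and the tiny PDF-flavored corpus is invented for the example:

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count next-character frequencies for each preceding character."""
    counts = defaultdict(Counter)
    for sample in corpus:
        for prev, nxt in zip(sample, sample[1:]):
            counts[prev][nxt] += 1
    # Normalize counts into conditional probabilities P(next | prev).
    model = {}
    for prev, ctr in counts.items():
        total = sum(ctr.values())
        model[prev] = {ch: n / total for ch, n in ctr.items()}
    return model

def generate(model, seed, length):
    """Greedily emit the most likely next character at each step."""
    out = seed
    for _ in range(length):
        probs = model.get(out[-1])
        if not probs:
            break
        out += max(probs, key=probs.get)
    return out

# A toy "corpus" of PDF-like object fragments.
corpus = ["obj<<>>endobj", "obj<<>>endobj", "obj<< /Type /Page >>endobj"]
model = train_bigram_model(corpus)
sample = generate(model, "o", 5)
```

A real NLM replaces the single-character context with a recurrent hidden state over a long prefix, but the interface is the same: train on samples, then sample the model to produce new format-shaped byte sequences.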

The first problem we encountered was finding a mechanism to distinguish between data and meta-data. To do so, we applied a reasonable trick: since meta-data is repeated in almost every sample of a file format, the learned model predicts meta-data with higher probability than data. By putting a threshold on the model output, data and meta-data become distinguishable.
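
The thresholding trick can be sketched as follows. The helper name, the token granularity, the threshold value, and the per-token probabilities are all hypothetical choices for illustration; the idea is only that high-confidence predictions mark recurring meta-data and low-confidence predictions mark variable data:

```python
def label_tokens(tokens, probs, threshold=0.8):
    """Tokens the model predicts with high probability recur across the
    corpus and are treated as meta-data; low-probability tokens are
    treated as (variable) data."""
    return [(tok, "meta" if p >= threshold else "data")
            for tok, p in zip(tokens, probs)]

# Hypothetical per-token probabilities, as a trained model might assign them.
tokens = ["obj", "<<", "/Length", "42", ">>", "endobj"]
probs = [0.97, 0.95, 0.90, 0.12, 0.96, 0.98]
labels = label_tokens(tokens, probs)
```

Here the structural keywords come out as meta-data, while the stream length `42`, which varies from file to file, comes out as data.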

The second problem, given our aim of finding bugs through fuzzing, was determining which bytes should be fuzzed to reveal failures in the software under test (SUT). This can be done by targeting the different input-processing stages that the SUT uses. For example, to fuzz the parser, we should fuzz meta-data, because the parser mostly deals with meta-data to validate the format of the input file and to extract the data. On the other hand, to fuzz the execution stage, it may be better to fuzz the data. The learned model's ability to distinguish data from meta-data is used to determine the place of the fuzz in the file.

In addition to determining the place of the fuzz, we must determine the value to put there, which is the third problem we addressed. Since the goal of fuzzing is to create malformed input, the most inappropriate byte should be placed at the fuzz location. Again, the most inappropriate byte can be determined by the learned neural language model: it suffices to select the byte with the lowest likelihood instead of the highest likelihood used in ordinary generation.
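
In other words, where plain generation takes an argmax over the model's next-byte distribution, fuzz-value selection takes the argmin. A minimal sketch, with a hypothetical next-byte distribution at the chosen fuzz position:

```python
def least_likely_byte(prob_dist):
    """Ordinary generation takes the argmax; to malform the input we take
    the argmin, i.e. the byte the model considers least appropriate here."""
    return min(prob_dist, key=prob_dist.get)

# Hypothetical model output: after "e" the model strongly expects "ndobj"...
dist = {"e": 0.70, "n": 0.25, " ": 0.04, "\x00": 0.01}
fuzz_value = least_likely_byte(dist)  # ...so a NUL byte is most "wrong" here
```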

The fourth problem that arose was training a neural language model on the ASCII character set rather than on all bytes. To deal with non-ASCII bytes, which make up the non-textual parts of the input, we replaced these parts with a specific token, called a binary token, and asked the model to learn that token. At generation time, whenever the model predicts the binary token, we replace it with a real binary section previously removed from a file. This is a simple but effective way to achieve a hybrid test data generation scheme.

Two specific fuzzing algorithms, MetadataNeuralFuzz and DataNeuralFuzz, are proposed based on the learned generative model. The former targets the parsing stage, and the latter focuses on the rendering or execution stage of input file processing. We believe both algorithms are required to achieve complete fuzz testing with high code coverage and, most likely, a high number of discovered bugs. We investigate the effectiveness of various language models with different configurations and several sampling strategies in the context of complex test data generation. We also study the various parameters required when generating and fuzzing test data with deep learning techniques.
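
The binary-token substitution described above can be sketched as a pair of helpers: one strips non-ASCII runs out before training and keeps them aside, the other splices a stored binary section back in wherever the token appears in generated text. The token spelling, the function names, and the "non-ASCII run" heuristic are illustrative assumptions, not the paper's exact implementation:

```python
import re

BINARY_TOKEN = "<BINARY>"  # hypothetical spelling of the binary token

def extract_binary(data):
    """Replace runs of non-ASCII bytes with the placeholder token and
    keep the removed sections for later re-insertion."""
    sections = re.findall(rb"[\x80-\xff]+", data)
    text = re.sub(rb"[\x80-\xff]+", BINARY_TOKEN.encode(), data)
    return text.decode("ascii"), sections

def restore_binary(text, sections):
    """Splice a stored binary section back wherever the model emitted
    the binary token in the generated text."""
    out = b""
    parts = text.split(BINARY_TOKEN)
    for i, part in enumerate(parts):
        out += part.encode("ascii")
        if i < len(parts) - 1 and sections:
            out += sections[i % len(sections)]
    return out

data = b"stream\xff\xd8\xffendstream"
text, sections = extract_binary(data)   # model trains on `text` only
restored = restore_binary(text, sections)
```

Training then sees only ASCII text plus the token, while generated files still contain real binary payloads.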

Tools and Publications

To turn our theory into a practical tool, we designed and implemented IUST-DeepFuzz as a modular file format fuzzer. The main module of IUST-DeepFuzz is a test data generator that implements our neural fuzz algorithms. The fuzzer injects test data into the SUT and checks for unexpected results such as memory crashes. IUST-DeepFuzz uses Microsoft Application Verifier, a free runtime monitoring tool, as a monitoring module to catch any memory corruption. It also uses VSPerfMon, another tool from Microsoft, to measure code coverage. The modules are connected by modest Python and batch scripts. IUST-DeepFuzz can automatically find both the place and the value of the fuzzed symbol while generating the input. Other file formats such as HTML, CSS, XML, and JSON, and all kinds of source code, can be generated in the same manner, which makes the approach suitable for fuzzing and quick quality assurance of any software system.

For more information about both the theoretical and practical aspects of IUST-DeepFuzz, refer to the relevant IUST-DeepFuzz publications: