CITL Fuzzer Early Data
CITL’s primary research goal has been focused around if we could formalize a technique security practitioners use for identifying potentially vulnerable code. We wanted to know if we could automatically detect patterns that are used for prioritization in a standard security audit? We started with simple questions based on common indicators. For example:
strcpy() more often mean that software will crash more?”.
One one level, we want to answer questions like the one above because bug hunters use heuristics like use of unsafe function in their process of identifying likely soft targets for attack, and we want to determine if such heuristics bear out. On a deeper level, though, we want to understand how to really measure software safety and quality, and we want to understand how various software traits impact program behavior. Any measurements that really speak to the safety of software should have some measurable security impact, and we are looking for ways to measure and demonstrate that impact.
The problem is open ended and even within our team there is still a lot of active debate if it is even possible to discover relationships that have any statistical significance. Unfortunately we have yet to discover anything strongly correlated with the high level indicators we will explore later on. However, we wanted to shared our early data, methods, and experiments with the larger community.
In order to try and tackle research questions like one one above, we have been analyzing and fuzzing large quantities of software. This was done, not to find specific or high-impact defects, but to generate a large sample dataset of software execution and defects.
In this post, we would like to share our merged dataset (static analysis features and fuzzing results) as well as some fun charts / analysis we have tried so far.
The Fuzzer Design
To tackle the data collection needed we choose to write a new fuzzer. Many existing fuzzers are designed to be manually harnessed for a given target or interface. In order to collect the scale of data we wanted, we needed to avoid writing thousands of harnesses. So we choose to design a fuzzer that was built for fuzzing at breadth instead of depth. We also record the execution path, selective API calls, and signal paths to aid in our post-fuzzing analysis.
The way the fuzzer works at a high level is we ahead-of-time (AOT) rewrite the target binary and all libraries with dyninst basic block logging instrumentation and inject our fuzzing runtime. This injected library provides callbacks methods so the Unit-Under-Test (the target process, inclusive of binary, linked libraries, and subprocesses) can log each basic block it executes. It also provides weaksym overwrites for libc methods that perform IO. This way the run time will log when ever a file is attempted to be opened or a environment variable is queried. Then the primary fuzzer loop is able to pick up those missing inputs and provide them to the target on the next pass. Effectively automatically detecting and starting new input mutation paths as the fuzzer runs. The fuzzer currently has support for detecting / mutating: argv, envp, files, and stdin. During evaluation of the testcase, the engine also saves a complete basic block level trace. Following evaluation, the trace log is used to calculate the hitmap, a bitwise Bloom filter, which incorporates not only the basic blocks in the root module but all the libraries and sub-processes. When logging these basic block level traces the engine is thread id and process id aware so it can easily handle multi-process or multi-threaded targets.
Before diving into the sample analysis, we need to take the time to explain the crash counts metric. Unlike the AFL style crash counts, we needed the count to reflect the number of unique / repeatable defects. This is done with two stages of ‘crash qualification’ to get a more accurate crash count:
Following the fuzzer completing its efforts on a target, a crash qualification system is run. This consists of running each testcase against the target in both instrumented and non-instrumented modes, monitoring the program exit signal / state to ensure they reproduce the defect correctly. This is done 10 times for both modes. Then a crash is only considered ‘valid’ if all 20 iterations crash and have the same crash state (signal and location). This method does have its share of false negatives due to bugs that are affected by allocator state, race conditions, or esoteric kernel interfaces but deeper analysis was impractical for a first pass.
For the purposes of our analysis, we remove duplicate crashes based on the uniqueness of the tuple of: [signal number, module id (unique ID for a given .exe or .so), offset (virtual offset from the LOAD base, ASLR normalized)]. This allows us to not only deduplicate a good amount of crashes but also identified the specific library or executable the crash occurred in. It should be noted that this method has both false positives and negatives. For example a bug that ‘moves’ its crashing location in the code segment due to some global state (allocator, system state, inputs, etc) would have an artificially high crash count. There is also an example of the inverse case, C++ exceptions terminate in a ABORT signal, raised from the location within libc++.
This method favors correctly dismissing false positives in favor of a higher false negative count, which was more fit for our subsequent analysis. And while searching for potential relationships we preferred our error in the data to be over-cautious.
Input target set: Ubuntu 18.04 APT (x86-64), every elf executable, skipping targets that used X11, or were unable to link/load correctly. A module is a library or binary (basically any elf file that would be loaded and linked into a process space).
Total number of modules: 69707 Number of modules with coverage: 22623 Number of crashing modules: 4795 Number of crash locations: 13105
Distribution of Crashing Signals
Below is a distribution of crashes, bucketed by signal number. SIGSEGV obviously is the most common in the data, followed by SIGABRT.
SIGABRT is a complex crash state because it can contain c++ exceptions, traditional abort()/assert calls, and even heap corruptions. The GNU malloc implementation uses abort after detecting corrupted heap metadata. Stack canaries also abort after a failed canary check. Luckily we are able to hook and detect most
stack_chk_fail() calls hence the SIGABRT_STACK_CHK bucket.
Using a simple logical AND we can verify the presence of certain hardening features with crashes. Putting these values into a correlation matrix we can quickly spot check the ‘how common or uncommon does X hardening feature appear along side a crash in a module’ question.
(Following values all in percentages)
|has_crash||ASLR||Full RELRO||Stack Guards||No Exec Stack||Full Fortification|
|No Exec Stack||21.18||88.26||5.40||87.20||100.00|
Initially we were excited to see the 0.68% and 1.51% relationship between fortification / RELRO with has_crash but unfortunately this was due to those two features being highly uncommon in the dataset.
Text (code segment) Size
Our first comparison we looked at was to determine if the size of the TEXT section of the module had any relationship with crash density:
While there is no simple correlation, there is at least an interesting shape. Removing some outliers gives us a slightly better picture:
text_size < 20000000 & buckets < 40
There is a largey density compared with crash counts, density being defined as no grouping of high crash density binaries around the smallest text sizes but interestingly enough also in the 0.75 1e7 - 1.5 1e7 range. This could be due to something like statically linked libraries or some other commonly sized codebase.
We can also look for effects of fortify source on the binaries crash density. Below is a Violin category plot of fortification level and crash density.
Fortify Source is a complex hardening feature and it is most common to see a module with MIXED fortification levels because the compiler is unable to safely fortify some method call sites. Full fortification is, therefore, seen less often. There are some outstanding questions here that bear further investigation. First, mixed fortification has a much longer tail than the other categories, and not as flat a shape? Is the compiler’s inability to fortify an instance of a historically unsafe function an indication of higher crash likelihood? Are those specific instances in the crashing paths? Second, the “Unknown? category’s shape has a larger proportion of 0 crash results. Is this an indicator that complete avoidance of these historically unsafe functions correlates with fewer crashes, or does it just have to do with the catch-all nature of the “Unknown” category?
Fortification Levels in all binaries in APT:
Using the function call count data, it is possible to explore the relationship of certain function density with crash density. For example answering the question does high density of
strcpy() predict a higher density of crashes?
strcpy() calls compared with crash counts:
We can also messure the
strcpy() density by normalizing its call frequency by the text size (
strcpy() calls per byte):
Using the fuzzer generated basic block traces and combining them with all the ELF files within the targets process tree, we can symbolize the traces.
Effectively this allows us to see all the external function names called, in order, in every unique execution path. For our current ubuntu apt fuzzing test dataset, this symbolized data is about 21TB of .csv files. So we filtered the data down to just the last 20 symbols, leading up to the crash site in a trace.
The raw data looks like this:
Unlike a stack trace from a crash site, the symbolized trace shows all functions called leading up to a crashsite. Instead of just the stack of functions called to a crash site.
Grouping symbols by signal
Most common symbol (by crashing signal) in the last 20 symbols leading up to a crash.
Please note that the Y axis is different for each plot.
Our initial reaction to these plots is that it generally shows what we would expect, SIGABRT / SIGSEGV having a strong correlation with malloc and memory releted functions. And the SIGABRT_STACK_CHK being related with __stack_chk_fail is consistent. There are some oddities though, we were not expecting to see __dynamic_cast have such a relationship with SIGFPE. With more analysis of the patterns and chains of functions called there could be even more interesting relationships to discover in this data.
We hope to release the symbolized trace data at a later point, if it can be released safely.
These are very rich data sets, with the potential to teach us a lot about program behaviors. We encourage others to download the data and explore it further.
The following .csv has the complete dataset of merged data:
The data represents all modules (libraries or executables) within ubuntu 18.04 apt. In order to get only modules that had some amount of discovery during fuzzing, query for any row where crash_count is not NaN (or NULL).
Description of the columns
|path||path to target module within rootfs|
|modid||unique module id|
|crash_count||count of semi-unique / qualified crashes found within module|
|architecture||cpu architecture of module|
|bitness||bitness of module (eg: 32/64)|
|os_version||distribution of linux: ex: ‘ubuntu’|
|source||distribution version, ex: ‘bionic’|
|package||apt package that provide the module|
|package_version||apt package version|
|text_size||size of executable sections of the module|
|data_size||size of the data sections of the module|
|function_cnt||count of functions within text segment, found via cfg recovery|
|block_cnt||count of basic blocks within text segment, found via cfg recovery|
|is_driver||is the module a kernel driver|
|is_library||is the module a library (.so) or executable|
|aslr||Does the module have ASLR (is dso but is executable). NaN if the module is a library, 0 if its missing, 1 if its present|
|no_exec_stack||Does the module mark the GNU_STACK segment executable (effects other segments if true, similar to DEP)|
|stack_guards||Does the module have stack guards. NaN if not applicable|
|fortify||What level of fortify_src does the module have. NaN: not applicable, 0: Unknown, 1: not fortified, 2: mixed fortification, 3: full fortification|
|relro||Does the module have RELRO: nan: not applicable, 0: No RELRO, 1: Partial RELRO, 2: Full RELRO|
|extern_funcs_calls||Total number of call sites (in text sections) to external functions linked in from libraries|
|libs||json array of libraries linked to module|
|function_calls||json dict of function call counts. Not a complete external function list, but functions with some level of security impact or interest|
|extra||json dict containing a number of extra features about the module|
|extra.fort_dict||json dict containing all the fortified function and their call counts|
|extra.unfort_dict||json dict containing all the unfortified functions and their call counts|
|extra.ret_dists||json array. return distances of each function, bytes between function head and return instructions within function|
|extra.stack_adjs||json array. size in bytes of every function stack adjust in function prolog|
|extra.branch_dict||json dict containing every branch instruction mnemonic and their count from the text section|
|extra.stack_guard_calls||json array. stack guard call locations (virtual address)|