Is satisfiability of quantified bit-vector formulas stable under bit-width changes?
Table of Contents
This is an accompanying page for the experimental paper Is satisfiability of quantified bit-vector formulas stable under bit-width changes?, which was submitted to LPAR 2018. The page contains all scripts that were used for generation of the experimental data and their statistical evaluation.
1 Reducing benchmarks
1.1 Building FormulaReducer
First, we need to build a tool that performs reductions of bit-widths.
git clone https://gitlab.fi.muni.cz/xjonas/FormulaReducer.git cd FormulaReducer mkdir build cd build cmake .. make -j8 cd ../..
1.2 Reducing all BV benchmarks
We obtain reduced benchmarks using the following commands. We first make a copy of the repository with original benchmarks.
git clone https://clc-gitlab.cs.uiowa.edu:2443/SMT-LIB-benchmarks/BV.git
mv BV ReducedBV
cd ReducedBV
We remove all 4 bit and 32 bit benchmarks from 2018-Preiner-cav18 family, because they will be produced from the corresponding 64 bit ones.
rm 2018-Preiner-cav18/*_4bit.smt2 rm 2018-Preiner-cav18/*_32bit.smt2
Then we run FormulaReducer on all benchmarks and save the output to the external log file.
find . -type f -name "*.smt2" -exec ../FormulaReducer/build/formulaReducer "{}" \; > reducer.csv
The resulting CSV file can be found here. An archive with all the resulting formulas can be found here.
1.3 Statistics of reduced benchmarks
First, we need to load all R libraries.
library(dplyr) library(purrr) library(ggplot2) library(scales) library(colorspace) library(RColorBrewer)
We need to split the filename into the benchmark family and benchmark name.
sed -i 's/^.\///' reducer.csv sed -i 's/\//,/' reducer.csv
We also add header row to the CSV file.
sed -i '1s/^/family,benchmark,result\n/' reducer.csv
Load the resulting CSV file.
reducer <- read.csv("reducer.csv", header=TRUE, stringsAsFactors=FALSE)
Compute numbers of benchmarks that were excluded due to the large bit-widths, unsupported instructions, and the number of resulting benchmarks.
reducer %>% summarize(total = n(), tooLargeBW = sum(result == "tooLargeBW"), unsupported = sum(result == "unsupported"), ok = sum(result == "ok"))
total | tooLargeBW | unsupported | ok |
---|---|---|---|
5351 | 21 | 111 | 5219 |
And the same numbers grouped by the benchmark family:
reducer %>% group_by(family) %>% summarize(total = n(), tooLargeBW = sum(result == "tooLargeBW"), unsupported = sum(result == "unsupported"), ok = sum(result == "ok"))
family | total | tooLargeBW | unsupported | ok |
---|---|---|---|---|
2017-Preiner-keymaera | 4035 | 0 | 0 | 4035 |
2017-Preiner-psyco | 194 | 0 | 0 | 194 |
2017-Preiner-scholl-smt08 | 374 | 0 | 0 | 374 |
2017-Preiner-tptp | 73 | 0 | 0 | 73 |
2017-Preiner-UltimateAutomizer | 153 | 0 | 0 | 153 |
20170501-Heizmann-UltimateAutomizer | 131 | 0 | 4 | 127 |
2018-Preiner-cav18 | 200 | 0 | 40 | 160 |
wintersteiger | 191 | 21 | 67 | 103 |
2 Benchmark evaluation
The resulting benchmarks were evaluated by benchexec. The following XML definition file for benchexec was used:
<?xml version="1.0"?> <!DOCTYPE benchmark PUBLIC "+//IDN sosy-lab.org//DTD BenchExec benchmark 1.4//EN" "http://www.sosy-lab.org/benchexec/benchmark-1.4.dtd"> <benchmark tool="tools.cvc4" timelimit="1 min" hardtimelimit="1 min" memlimit="8 GB" cpuCores="1"> <rundefinition name="cvc4" /> <tasks name="PreinerCAV"> <include>/var/benchexec/benchmarks/ReducedBV/2018-Preiner-cav18/*.smt2</include> </tasks> <tasks name="Wintersteiger"> <include>/var/benchexec/benchmarks/ReducedBV/wintersteiger/fmsd13/fixpoint/*.smt2</include> <include>/var/benchexec/benchmarks/ReducedBV/wintersteiger/fmsd13/ranking/*.smt2</include> </tasks> <tasks name="Preiner"> <include>/var/benchexec/benchmarks/ReducedBV/2017-Preiner-keymaera/*.smt2</include> <include>/var/benchexec/benchmarks/ReducedBV/2017-Preiner-psyco/*.smt2</include> <include>/var/benchexec/benchmarks/ReducedBV/2017-Preiner-scholl-smt08/*/*.smt2</include> <include>/var/benchexec/benchmarks/ReducedBV/2017-Preiner-tptp/*.smt2</include> <include>/var/benchexec/benchmarks/ReducedBV/2017-Preiner-UltimateAutomizer/*.smt2</include> </tasks> <tasks name="Heizmann"> <include>/var/benchexec/benchmarks/ReducedBV/20170501-Heizmann-UltimateAutomizer/*.smt2</include> </tasks> </benchmark>
And the following command was used for running Benchexec.
benchexec benchmark-cvc4-reducedBW.xml --numOfThreads 12
Calling table-generator
on the resulting XML file results in the CSV file results.csv.
3 CSV file preprocessing
3.1 Convert to a correct CSV
In the benchexec CSV files, we need to replace tabs by commas:
sed -i 's/\t/,/g' results.csv
And then we remove the standard BenchExec header rows and replace them by more readable ones.
cat results_header.csv > results_out.csv tail -n +4 results.csv >> results_out.csv
3.2 Separate benchmark name, family, and bit-width
We separate benchmark name into the original benchmark name, family, and the bit-width.
mv results_out.csv data/results.csv sed -r -i 's/_([0-9]{1,2})bits\.smt2/\.smt2,\1/g' data/results.csv sed -i 's/\//,/' data/results.csv
3.3 Load CSV file
data <- read.csv("data/results.csv", header=TRUE, stringsAsFactors=FALSE)
4 Evaluation
4.1 Numbers of benchmarks
Original number of benchmarks:
data %>% group_by(benchmark,family) %>% summarize(numResults = n_distinct(status)) %>% nrow()
5219
Divided by the original bit-width:
data %>% group_by(benchmark,family) %>% summarize(numResults = n_distinct(status), originalbw = max(bitwidth)) %>% group_by(originalbw) %>% summarize(totalFormulas = n())
originalbw | totalFormulas |
---|---|
1 | 10 |
8 | 10 |
20 | 20 |
32 | 4964 |
33 | 2 |
64 | 176 |
65 | 25 |
66 | 1 |
67 | 6 |
68 | 1 |
69 | 4 |
Divided by the benchmark family:
data %>% group_by(benchmark,family) %>% summarize(numResults = n_distinct(status)) %>% group_by(family) %>% summarize(n())
2017-Preiner-keymaera | 4035 |
2017-Preiner-psyco | 194 |
2017-Preiner-scholl-smt08 | 374 |
2017-Preiner-tptp | 73 |
2017-Preiner-UltimateAutomizer | 153 |
20170501-Heizmann-UltimateAutomizer | 127 |
2018-Preiner-cav18 | 160 |
wintersteiger | 103 |
4.2 Decided benchmarks
Benchmarks with all bit-widths decided:
decided <- data %>% group_by(benchmark,family) %>% filter(all(status != 'timeout')) undecided <- data %>% group_by(benchmark,family) %>% filter(any(status == 'timeout'))
Number of decided:
decided %>% summarize(numResults = n_distinct(status)) %>% nrow()
4905
Divided by the benchmark family:
decided %>% summarize(numResults = n_distinct(status)) %>% group_by(family) %>% summarize(n())
2017-Preiner-keymaera | 3970 |
2017-Preiner-psyco | 189 |
2017-Preiner-scholl-smt08 | 221 |
2017-Preiner-tptp | 73 |
2017-Preiner-UltimateAutomizer | 151 |
20170501-Heizmann-UltimateAutomizer | 120 |
2018-Preiner-cav18 | 130 |
wintersteiger | 51 |
Divided by bit-width:
decidedByBW <- decided %>% summarize(numResults = n_distinct(status), originalbw = max(bitwidth)) %>% group_by(originalbw) %>% summarize(totalFormulas = n())
originalbw | totalFormulas |
---|---|
1 | 10 |
8 | 10 |
20 | 20 |
32 | 4727 |
33 | 1 |
64 | 134 |
65 | 3 |
Divided by the bit-width and the benchmark family:
decided %>% summarize(numResults = n_distinct(status), originalbw = max(bitwidth)) %>% group_by(originalbw, family) %>% summarize(totalFormulas = n())
originalbw | family | totalFormulas |
---|---|---|
1 | wintersteiger | 10 |
8 | wintersteiger | 10 |
20 | wintersteiger | 20 |
32 | 2017-Preiner-keymaera | 3970 |
32 | 2017-Preiner-psyco | 189 |
32 | 2017-Preiner-scholl-smt08 | 221 |
32 | 2017-Preiner-tptp | 73 |
32 | 2017-Preiner-UltimateAutomizer | 151 |
32 | 20170501-Heizmann-UltimateAutomizer | 118 |
32 | wintersteiger | 5 |
33 | wintersteiger | 1 |
64 | 20170501-Heizmann-UltimateAutomizer | 2 |
64 | 2018-Preiner-cav18 | 130 |
64 | wintersteiger | 2 |
65 | wintersteiger | 3 |
4.3 Undecided benchmarks
Number of undecided:
undecided %>% summarize(numResults = n_distinct(status)) %>% nrow()
314
Divided by the benchmark family:
undecided %>% summarize(numResults = n_distinct(status)) %>% group_by(family) %>% summarize(n())
2017-Preiner-keymaera | 65 |
2017-Preiner-psyco | 5 |
2017-Preiner-scholl-smt08 | 153 |
2017-Preiner-UltimateAutomizer | 2 |
20170501-Heizmann-UltimateAutomizer | 7 |
2018-Preiner-cav18 | 30 |
wintersteiger | 52 |
4.4 Which bit-widths resulted in the different result
Evaluate how many benchmarks changed satisfiability after reduction to a given number of bits or more.
4.4.1 Benchmarks with any wrong result
decided %>% summarize(numResults = n_distinct(status)) %>% group_by() %>% summarize(stable = sum(numResults > 1))
216
decided %>% summarize(numResults = n_distinct(status)) %>% group_by(family) %>% summarize(stable = sum(numResults > 1))
2017-Preiner-keymaera | 64 |
2017-Preiner-psyco | 49 |
2017-Preiner-scholl-smt08 | 1 |
2017-Preiner-tptp | 20 |
2017-Preiner-UltimateAutomizer | 41 |
20170501-Heizmann-UltimateAutomizer | 29 |
2018-Preiner-cav18 | 4 |
wintersteiger | 8 |
4.4.2 With wrong >=2bit result
decided %>% filter(bitwidth > 1) %>% summarize(numResults = n_distinct(status)) %>% group_by() %>% summarize(stable = sum(numResults > 1))
95
decided %>% filter(bitwidth > 1) %>% summarize(numResults = n_distinct(status)) %>% group_by(family) %>% summarize(stable = sum(numResults > 1))
2017-Preiner-keymaera | 19 |
2017-Preiner-psyco | 5 |
2017-Preiner-scholl-smt08 | 0 |
2017-Preiner-tptp | 10 |
2017-Preiner-UltimateAutomizer | 32 |
20170501-Heizmann-UltimateAutomizer | 20 |
2018-Preiner-cav18 | 3 |
wintersteiger | 6 |
4.4.3 With wrong >=4bit result
decided %>% filter(bitwidth > 3) %>% summarize(numResults = n_distinct(status)) %>% group_by() %>% summarize(stable = sum(numResults > 1))
32
decided %>% filter(bitwidth > 3) %>% summarize(numResults = n_distinct(status)) %>% group_by(family) %>% summarize(stable = sum(numResults > 1))
2017-Preiner-keymaera | 12 |
2017-Preiner-psyco | 0 |
2017-Preiner-scholl-smt08 | 0 |
2017-Preiner-tptp | 3 |
2017-Preiner-UltimateAutomizer | 3 |
20170501-Heizmann-UltimateAutomizer | 6 |
2018-Preiner-cav18 | 3 |
wintersteiger | 5 |
4.4.4 With wrong >=8bit result
decided %>% filter(bitwidth > 7) %>% summarize(numResults = n_distinct(status)) %>% group_by() %>% summarize(stable = sum(numResults > 1))
14
decided %>% filter(bitwidth > 7) %>% summarize(numResults = n_distinct(status)) %>% group_by(family) %>% summarize(stable = sum(numResults > 1))
2017-Preiner-keymaera | 4 |
2017-Preiner-psyco | 0 |
2017-Preiner-scholl-smt08 | 0 |
2017-Preiner-tptp | 0 |
2017-Preiner-UltimateAutomizer | 0 |
20170501-Heizmann-UltimateAutomizer | 5 |
2018-Preiner-cav18 | 3 |
wintersteiger | 2 |
4.4.5 With wrong >=16bit result
decided %>% filter(bitwidth > 15) %>% summarize(numResults = n_distinct(status)) %>% group_by() %>% summarize(stable = sum(numResults > 1))
13
decided %>% filter(bitwidth > 15) %>% summarize(numResults = n_distinct(status)) %>% group_by(family) %>% summarize(stable = sum(numResults > 1))
2017-Preiner-keymaera | 4 |
2017-Preiner-psyco | 0 |
2017-Preiner-scholl-smt08 | 0 |
2017-Preiner-tptp | 0 |
2017-Preiner-UltimateAutomizer | 0 |
20170501-Heizmann-UltimateAutomizer | 5 |
2018-Preiner-cav18 | 3 |
wintersteiger | 1 |
4.5 Stabilizing bit-width
Evaluate the number of benchmarks whose satisfiability does not change after reduction to a given bit-width or more for all bit-widths. First, we need to get the maximal bit-width in the benchmark set.
maxBW <- decided %>% ungroup() %>% summarize(max(bitwidth)) %>% as.numeric()
65
Now, we really count numbers of such benchmarks.
countUnstable <- function(minBW) { return (decided %>% mutate(cropbw = minBW, originalbw = max(bitwidth)) %>% group_by(benchmark, originalbw, cropbw) %>% filter(bitwidth >= minBW) %>% summarize(numResults = n_distinct(status)) %>% group_by(originalbw, cropbw) %>% summarize(unstable = sum(numResults > 1))) } unstableCounts <- seq(1, maxBW) %>% map_dfr(countUnstable)
Finally, plot the percentage of such benchmarks divided by the original bit-width of the benchmark.
unstableCounts %>% merge(decidedByBW) %>% ggplot(aes(x = cropbw, y = unstable/totalFormulas, color=factor(originalbw))) + geom_line() + facet_wrap(originalbw~totalFormulas, ncol=2) + xlab("Bit-width")
unstableCounts %>% merge(decidedByBW) %>% filter(originalbw == 32 | originalbw == 64) %>% ggplot(aes(x = cropbw, y = unstable/totalFormulas, color=factor(originalbw))) + scale_x_continuous(breaks = seq(0, 65, by = 2)) + scale_y_continuous(breaks = seq(0, 0.05, by = 0.005), labels=percent) + xlab("Bit-width") + ylab("Percentage of formulas") + labs(color = "Original bit-width") + geom_line() #+ facet_wrap(originalbw~totalFormulas, ncol=2)
4.6 Times
originalBitwidths <- data %>% group_by(benchmark) %>% summarize(originalBitwidth = max(bitwidth))
decided %>% merge(originalBitwidths) %>% ggplot(aes(x = bitwidth, y = cputime)) + geom_bin2d(binwidth=c(1, 0.1)) + facet_grid(originalBitwidth ~ .)
decided %>% merge(originalBitwidths) %>% filter(originalBitwidth == 32) %>% mutate(originalBitwidth = "32 bits") %>% ggplot(aes(x = bitwidth, y = cputime)) + scale_shape_identity() + geom_bin2d(fill="black", binwidth=c(1, 0.05)) + facet_grid(cols = vars(originalBitwidth)) + xlab("Reduced bit-width") + ylab("CPU time [s]")
decided %>% merge(originalBitwidths) %>% filter(originalBitwidth == 64) %>% mutate(originalBitwidth = "64 bits") %>% ggplot(aes(x = bitwidth, y = cputime)) + geom_bin2d(fill="black", binwidth=c(1, 0.1)) + facet_grid(cols = vars(originalBitwidth)) + xlab("Reduced bit-width") + ylab("CPU time [s]")
data %>% filter(status == "timeout") %>% merge(originalBitwidths) %>% group_by(bitwidth, originalBitwidth) %>% summarize(timeouts = n()) %>% ggplot(aes(x = bitwidth, y = timeouts)) + geom_line() + facet_grid(originalBitwidth ~ .)
4.7 Changing benchmarks
Get benchmarks whose satisfiability changed after reduction to 4 bits or more.
changingNames <- decided %>% filter(bitwidth > 3) %>% summarize(numResults = n_distinct(status)) %>% filter(numResults > 1) %>% select(benchmark) changing <- merge(decided, changingNames)
Plot all their satisfiability statuses.
elemCount <- changing %>% group_by(benchmark) %>% nrow() changing %>% merge(originalBitwidths) %>% ggplot(aes(x = reorder(benchmark, originalBitwidth), y = bitwidth, color = status)) + geom_point() + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5))
changing %>% merge(originalBitwidths) %>% ggplot(aes(x = reorder(benchmark, originalBitwidth), y = bitwidth, color = status)) + geom_point() + scale_x_discrete(labels = seq(1, elemCount)) + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5)) + xlab("Original benchmark") + ylab("Bit-width") + labs(color = "Satisfiability status")
And plot all satisfiability statuses of all undecided benchmarks
undecided %>% ggplot(aes(x = benchmark, y = bitwidth, color = status)) + geom_point() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + facet_grid(family~., scales="free")