Is satisfiability of quantified bit-vector formulas stable under bit-width changes?

1. Reducing benchmarks
2. Benchmark evaluation
3. CSV file preprocessing
4. Evaluation

This is an accompanying page for the experimental paper Is satisfiability of quantified bit-vector formulas stable under bit-width changes?, which was submitted to LPAR 2018. The page contains all scripts that were used for generation of the experimental data and their statistical evaluation.

1 Reducing benchmarks

1.1 Building FormulaReducer

First, we need to build a tool that performs reductions of bit-widths.

git clone https://gitlab.fi.muni.cz/xjonas/FormulaReducer.git
cd FormulaReducer
mkdir build
cd build
cmake ..
make -j8
cd ../..

1.2 Reducing all BV benchmarks

We obtain reduced benchmarks using the following commands. We first make a copy of the repository with original benchmarks.

git clone https://clc-gitlab.cs.uiowa.edu:2443/SMT-LIB-benchmarks/BV.git
mv BV ReducedBV
cd ReducedBV

We remove all 4 bit and 32 bit benchmarks from 2018-Preiner-cav18 family, because they will be produced from the corresponding 64 bit ones.

rm 2018-Preiner-cav18/*_4bit.smt2
rm 2018-Preiner-cav18/*_32bit.smt2

Then we run FormulaReducer on all benchmarks and save the output to the external log file.

find . -type f -name "*.smt2" -exec ../FormulaReducer/build/formulaReducer "{}" \; > reducer.csv

The resulting CSV file can be found here. An archive with all the resulting formulas can be found here.

1.3 Statistics of reduced benchmarks

First, we need to load all R libraries.

library(dplyr)
library(purrr)
library(ggplot2)
library(scales)
library(colorspace)
library(RColorBrewer)

We need to split the filename into the benchmark family and benchmark name.

sed -i 's/^.\///' reducer.csv
sed -i 's/\//,/' reducer.csv

We also add header row to the CSV file.

sed -i '1s/^/family,benchmark,result\n/' reducer.csv

Load the resulting CSV file.

reducer <- read.csv("reducer.csv", header=TRUE, stringsAsFactors=FALSE)

Compute numbers of benchmarks that were excluded due to the large bit-widths, unsupported instructions, and the number of resulting benchmarks.

reducer %>%
    summarize(total = n(),
              tooLargeBW = sum(result == "tooLargeBW"),
              unsupported = sum(result == "unsupported"),
              ok = sum(result == "ok"))

total	tooLargeBW	unsupported	ok
5351	21	111	5219

And the same numbers grouped by the benchmark family:

reducer %>%
    group_by(family) %>%
    summarize(total = n(),
              tooLargeBW = sum(result == "tooLargeBW"),
              unsupported = sum(result == "unsupported"),
              ok = sum(result == "ok"))

family	total	tooLargeBW	unsupported	ok
2017-Preiner-keymaera	4035	0	0	4035
2017-Preiner-psyco	194	0	0	194
2017-Preiner-scholl-smt08	374	0	0	374
2017-Preiner-tptp	73	0	0	73
2017-Preiner-UltimateAutomizer	153	0	0	153
20170501-Heizmann-UltimateAutomizer	131	0	4	127
2018-Preiner-cav18	200	0	40	160
wintersteiger	191	21	67	103

2 Benchmark evaluation

The resulting benchmarks were evaluated by benchexec. The following XML definition file for benchexec was used:

<?xml version="1.0"?>
<!DOCTYPE benchmark PUBLIC "+//IDN sosy-lab.org//DTD BenchExec benchmark 1.4//EN" "http://www.sosy-lab.org/benchexec/benchmark-1.4.dtd">
<benchmark tool="tools.cvc4"
           timelimit="1 min"
           hardtimelimit="1 min"
           memlimit="8 GB"
           cpuCores="1">

  <rundefinition name="cvc4" />

  <tasks name="PreinerCAV">
    <include>/var/benchexec/benchmarks/ReducedBV/2018-Preiner-cav18/*.smt2</include>
  </tasks>


  <tasks name="Wintersteiger">
    <include>/var/benchexec/benchmarks/ReducedBV/wintersteiger/fmsd13/fixpoint/*.smt2</include>
    <include>/var/benchexec/benchmarks/ReducedBV/wintersteiger/fmsd13/ranking/*.smt2</include>
  </tasks>

  <tasks name="Preiner">
    <include>/var/benchexec/benchmarks/ReducedBV/2017-Preiner-keymaera/*.smt2</include>
    <include>/var/benchexec/benchmarks/ReducedBV/2017-Preiner-psyco/*.smt2</include>
    <include>/var/benchexec/benchmarks/ReducedBV/2017-Preiner-scholl-smt08/*/*.smt2</include>
    <include>/var/benchexec/benchmarks/ReducedBV/2017-Preiner-tptp/*.smt2</include>
    <include>/var/benchexec/benchmarks/ReducedBV/2017-Preiner-UltimateAutomizer/*.smt2</include>
  </tasks>

  <tasks name="Heizmann">
    <include>/var/benchexec/benchmarks/ReducedBV/20170501-Heizmann-UltimateAutomizer/*.smt2</include>
  </tasks>
</benchmark>

And the following command was used for running Benchexec.

benchexec benchmark-cvc4-reducedBW.xml --numOfThreads 12

Calling table-generator on the resulting XML file results in the CSV file results.csv.

3 CSV file preprocessing

3.1 Convert to a correct CSV

In the benchexec CSV files, we need to replace tabs by commas:

sed -i 's/\t/,/g' results.csv

And then we remove the standard BenchExec header rows and replace them by more readable ones.

cat results_header.csv > results_out.csv
tail -n +4 results.csv >> results_out.csv

3.2 Separate benchmark name, family, and bit-width

We separate benchmark name into the original benchmark name, family, and the bit-width.

mv results_out.csv data/results.csv
sed -r -i 's/_([0-9]{1,2})bits\.smt2/\.smt2,\1/g' data/results.csv
sed -i 's/\//,/' data/results.csv

3.3 Load CSV file

data <- read.csv("data/results.csv", header=TRUE, stringsAsFactors=FALSE)

4 Evaluation

4.1 Numbers of benchmarks

Original number of benchmarks:

data %>%
    group_by(benchmark,family) %>%
    summarize(numResults = n_distinct(status)) %>%
    nrow()

Divided by the original bit-width:

data %>%
    group_by(benchmark,family) %>%
    summarize(numResults = n_distinct(status), originalbw = max(bitwidth)) %>%
    group_by(originalbw) %>%
    summarize(totalFormulas = n())

originalbw	totalFormulas
1	10
8	10
20	20
32	4964
33	2
64	176
65	25
66	1
67	6
68	1
69	4

Divided by the benchmark family:

data %>%
    group_by(benchmark,family) %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by(family) %>%
    summarize(n())

2017-Preiner-keymaera	4035
2017-Preiner-psyco	194
2017-Preiner-scholl-smt08	374
2017-Preiner-tptp	73
2017-Preiner-UltimateAutomizer	153
20170501-Heizmann-UltimateAutomizer	127
2018-Preiner-cav18	160
wintersteiger	103

4.2 Decided benchmarks

Benchmarks with all bit-widths decided:

decided <- data %>%
               group_by(benchmark,family) %>%
               filter(all(status != 'timeout'))

undecided <- data %>%
               group_by(benchmark,family) %>%
               filter(any(status == 'timeout'))

Number of decided:

decided %>%
    summarize(numResults = n_distinct(status)) %>%
    nrow()

Divided by the benchmark family:

decided %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by(family) %>%
    summarize(n())

2017-Preiner-keymaera	3970
2017-Preiner-psyco	189
2017-Preiner-scholl-smt08	221
2017-Preiner-tptp	73
2017-Preiner-UltimateAutomizer	151
20170501-Heizmann-UltimateAutomizer	120
2018-Preiner-cav18	130
wintersteiger	51

Divided by bit-width:

decidedByBW <- decided %>%
    summarize(numResults = n_distinct(status), originalbw = max(bitwidth)) %>%
    group_by(originalbw) %>%
    summarize(totalFormulas = n())

originalbw	totalFormulas
1	10
8	10
20	20
32	4727
33	1
64	134
65	3

Divided by the bit-width and the benchmark family:

decided %>%
    summarize(numResults = n_distinct(status), originalbw = max(bitwidth)) %>%
    group_by(originalbw, family) %>%
    summarize(totalFormulas = n())

originalbw	family	totalFormulas
1	wintersteiger	10
8	wintersteiger	10
20	wintersteiger	20
32	2017-Preiner-keymaera	3970
32	2017-Preiner-psyco	189
32	2017-Preiner-scholl-smt08	221
32	2017-Preiner-tptp	73
32	2017-Preiner-UltimateAutomizer	151
32	20170501-Heizmann-UltimateAutomizer	118
32	wintersteiger	5
33	wintersteiger	1
64	20170501-Heizmann-UltimateAutomizer	2
64	2018-Preiner-cav18	130
64	wintersteiger	2
65	wintersteiger	3

4.3 Undecided benchmarks

Number of undecided:

undecided %>%
    summarize(numResults = n_distinct(status)) %>%
    nrow()

Divided by the benchmark family:

undecided %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by(family) %>%
    summarize(n())

2017-Preiner-keymaera	65
2017-Preiner-psyco	5
2017-Preiner-scholl-smt08	153
2017-Preiner-UltimateAutomizer	2
20170501-Heizmann-UltimateAutomizer	7
2018-Preiner-cav18	30
wintersteiger	52

4.4 Which bit-widths resulted in the different result

Evaluate how many benchmarks changed satisfiability after reduction to a given number of bits or more.

4.4.1 Benchmarks with any wrong result

decided %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by() %>%
    summarize(stable = sum(numResults > 1))

decided %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by(family) %>%
    summarize(stable = sum(numResults > 1))

2017-Preiner-keymaera	64
2017-Preiner-psyco	49
2017-Preiner-scholl-smt08	1
2017-Preiner-tptp	20
2017-Preiner-UltimateAutomizer	41
20170501-Heizmann-UltimateAutomizer	29
2018-Preiner-cav18	4
wintersteiger	8

4.4.2 With wrong >=2bit result

decided %>%
    filter(bitwidth > 1) %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by() %>%
    summarize(stable = sum(numResults > 1))

decided %>%
    filter(bitwidth > 1) %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by(family) %>%
    summarize(stable = sum(numResults > 1))

2017-Preiner-keymaera	19
2017-Preiner-psyco	5
2017-Preiner-scholl-smt08	0
2017-Preiner-tptp	10
2017-Preiner-UltimateAutomizer	32
20170501-Heizmann-UltimateAutomizer	20
2018-Preiner-cav18	3
wintersteiger	6

4.4.3 With wrong >=4bit result

decided %>%
    filter(bitwidth > 3) %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by() %>%
    summarize(stable = sum(numResults > 1))

decided %>%
    filter(bitwidth > 3) %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by(family) %>%
    summarize(stable = sum(numResults > 1))

2017-Preiner-keymaera	12
2017-Preiner-psyco	0
2017-Preiner-scholl-smt08	0
2017-Preiner-tptp	3
2017-Preiner-UltimateAutomizer	3
20170501-Heizmann-UltimateAutomizer	6
2018-Preiner-cav18	3
wintersteiger	5

4.4.4 With wrong >=8bit result

decided %>%
    filter(bitwidth > 7) %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by() %>%
    summarize(stable = sum(numResults > 1))

decided %>%
    filter(bitwidth > 7) %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by(family) %>%
    summarize(stable = sum(numResults > 1))

2017-Preiner-keymaera	4
2017-Preiner-psyco	0
2017-Preiner-scholl-smt08	0
2017-Preiner-tptp	0
2017-Preiner-UltimateAutomizer	0
20170501-Heizmann-UltimateAutomizer	5
2018-Preiner-cav18	3
wintersteiger	2

4.4.5 With wrong >=16bit result

decided %>%
    filter(bitwidth > 15) %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by() %>%
    summarize(stable = sum(numResults > 1))

decided %>%
    filter(bitwidth > 15) %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by(family) %>%
    summarize(stable = sum(numResults > 1))

2017-Preiner-keymaera	4
2017-Preiner-psyco	0
2017-Preiner-scholl-smt08	0
2017-Preiner-tptp	0
2017-Preiner-UltimateAutomizer	0
20170501-Heizmann-UltimateAutomizer	5
2018-Preiner-cav18	3
wintersteiger	1

4.5 Stabilizing bit-width

Evaluate the number of benchmarks whose satisfiability does not change after reduction to a given bit-width or more for all bit-widths. First, we need to get the maximal bit-width in the benchmark set.

maxBW <- decided %>% ungroup() %>% summarize(max(bitwidth)) %>% as.numeric()

Now, we really count numbers of such benchmarks.

countUnstable <- function(minBW)
{
    return (decided %>%
       mutate(cropbw = minBW, originalbw = max(bitwidth)) %>%
       group_by(benchmark, originalbw, cropbw) %>%
       filter(bitwidth >= minBW) %>%
       summarize(numResults = n_distinct(status)) %>%
       group_by(originalbw, cropbw) %>%
       summarize(unstable = sum(numResults > 1)))
}

unstableCounts <- seq(1, maxBW) %>% map_dfr(countUnstable)

Finally, plot the percentage of such benchmarks divided by the original bit-width of the benchmark.

unstableCounts %>% merge(decidedByBW) %>%
    ggplot(aes(x = cropbw, y =  unstable/totalFormulas, color=factor(originalbw))) +
    geom_line() + facet_wrap(originalbw~totalFormulas, ncol=2) +
    xlab("Bit-width")

unstableCounts %>% merge(decidedByBW) %>%
    filter(originalbw == 32 | originalbw == 64) %>%
    ggplot(aes(x = cropbw, y =  unstable/totalFormulas, color=factor(originalbw))) +
    scale_x_continuous(breaks = seq(0, 65, by = 2)) +
    scale_y_continuous(breaks = seq(0, 0.05, by = 0.005), labels=percent) +
    xlab("Bit-width") +
    ylab("Percentage of formulas") +
    labs(color = "Original bit-width") +
    geom_line() #+ facet_wrap(originalbw~totalFormulas, ncol=2)

unstablePlot.pdf

4.6 Times

originalBitwidths <- data %>%
    group_by(benchmark) %>%
    summarize(originalBitwidth = max(bitwidth))

decided %>% merge(originalBitwidths) %>%
    ggplot(aes(x = bitwidth, y = cputime)) +
    geom_bin2d(binwidth=c(1, 0.1)) + facet_grid(originalBitwidth ~ .)

decided %>% merge(originalBitwidths) %>%
    filter(originalBitwidth == 32) %>%
    mutate(originalBitwidth = "32 bits") %>%
    ggplot(aes(x = bitwidth, y = cputime)) +
    scale_shape_identity() +
    geom_bin2d(fill="black", binwidth=c(1, 0.05)) + facet_grid(cols = vars(originalBitwidth)) +
    xlab("Reduced bit-width") + ylab("CPU time [s]")

times32.pdf

decided %>% merge(originalBitwidths) %>%
    filter(originalBitwidth == 64) %>%
    mutate(originalBitwidth = "64 bits") %>%
    ggplot(aes(x = bitwidth, y = cputime)) +
    geom_bin2d(fill="black", binwidth=c(1, 0.1)) + facet_grid(cols = vars(originalBitwidth)) +
    xlab("Reduced bit-width") + ylab("CPU time [s]")

times64.pdf

data %>%
   filter(status == "timeout") %>%
   merge(originalBitwidths) %>%
   group_by(bitwidth, originalBitwidth) %>%
   summarize(timeouts = n()) %>%
   ggplot(aes(x = bitwidth, y = timeouts)) +
   geom_line() + facet_grid(originalBitwidth ~ .)

4.7 Changing benchmarks

Get benchmarks whose satisfiability changed after reduction to 4 bits or more.

changingNames <- decided %>%
    filter(bitwidth > 3) %>%
    summarize(numResults = n_distinct(status)) %>%
    filter(numResults > 1) %>%
    select(benchmark)

changing <- merge(decided, changingNames)

Plot all their satisfiability statuses.

elemCount <- changing %>% group_by(benchmark) %>% nrow()

changing %>% merge(originalBitwidths) %>%
ggplot(aes(x = reorder(benchmark, originalBitwidth), y = bitwidth, color = status)) +
    geom_point() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5))

changing %>% merge(originalBitwidths) %>%
ggplot(aes(x = reorder(benchmark, originalBitwidth), y = bitwidth, color = status)) +
    geom_point() +
    scale_x_discrete(labels = seq(1, elemCount)) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5)) +
    xlab("Original benchmark") +
    ylab("Bit-width") +
    labs(color = "Satisfiability status")

changing.pdf

And plot all satisfiability statuses of all undecided benchmarks

undecided %>% ggplot(aes(x = benchmark, y = bitwidth, color = status)) +
    geom_point() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    facet_grid(family~., scales="free")

Is satisfiability of quantified bit-vector formulas stable under bit-width changes?

Table of Contents

1 Reducing benchmarks

1.1 Building FormulaReducer

1.2 Reducing all BV benchmarks

1.3 Statistics of reduced benchmarks

2 Benchmark evaluation

3 CSV file preprocessing

3.1 Convert to a correct CSV

3.2 Separate benchmark name, family, and bit-width

3.3 Load CSV file

4 Evaluation

4.1 Numbers of benchmarks

4.2 Decided benchmarks

4.3 Undecided benchmarks

4.4 Which bit-widths resulted in the different result

4.4.1 Benchmarks with any wrong result

4.4.2 With wrong >=2bit result

4.4.3 With wrong >=4bit result

4.4.4 With wrong >=8bit result

4.4.5 With wrong >=16bit result

4.5 Stabilizing bit-width

4.6 Times

4.7 Changing benchmarks