Is satisfiability of quantified bit-vector formulas stable under bit-width changes?

Table of Contents

This is an accompanying page for the experimental paper Is satisfiability of quantified bit-vector formulas stable under bit-width changes?, which was submitted to LPAR 2018. The page contains all scripts that were used for generation of the experimental data and their statistical evaluation.

1 Reducing benchmarks

1.1 Building FormulaReducer

First, we need to build a tool that performs reductions of bit-widths.

git clone https://gitlab.fi.muni.cz/xjonas/FormulaReducer.git
cd FormulaReducer
mkdir build
cd build
cmake ..
make -j8
cd ../..

1.2 Reducing all BV benchmarks

We obtain reduced benchmarks using the following commands. We first make a copy of the repository with original benchmarks.

git clone https://clc-gitlab.cs.uiowa.edu:2443/SMT-LIB-benchmarks/BV.git
mv BV ReducedBV
cd ReducedBV

We remove all 4 bit and 32 bit benchmarks from 2018-Preiner-cav18 family, because they will be produced from the corresponding 64 bit ones.

rm 2018-Preiner-cav18/*_4bit.smt2
rm 2018-Preiner-cav18/*_32bit.smt2

Then we run FormulaReducer on all benchmarks and save the output to the external log file.

find . -type f -name "*.smt2" -exec ../FormulaReducer/build/formulaReducer "{}" \; > reducer.csv

The resulting CSV file can be found here. An archive with all the resulting formulas can be found here.

1.3 Statistics of reduced benchmarks

First, we need to load all R libraries.

library(dplyr)
library(purrr)
library(ggplot2)
library(scales)
library(colorspace)
library(RColorBrewer)

We need to split the filename into the benchmark family and benchmark name.

sed -i 's/^.\///' reducer.csv
sed -i 's/\//,/' reducer.csv

We also add header row to the CSV file.

sed -i '1s/^/family,benchmark,result\n/' reducer.csv

Load the resulting CSV file.

reducer <- read.csv("reducer.csv", header=TRUE, stringsAsFactors=FALSE)

Compute numbers of benchmarks that were excluded due to the large bit-widths, unsupported instructions, and the number of resulting benchmarks.

reducer %>%
    summarize(total = n(),
              tooLargeBW = sum(result == "tooLargeBW"),
              unsupported = sum(result == "unsupported"),
              ok = sum(result == "ok"))
total tooLargeBW unsupported ok
5351 21 111 5219

And the same numbers grouped by the benchmark family:

reducer %>%
    group_by(family) %>%
    summarize(total = n(),
              tooLargeBW = sum(result == "tooLargeBW"),
              unsupported = sum(result == "unsupported"),
              ok = sum(result == "ok"))
family total tooLargeBW unsupported ok
2017-Preiner-keymaera 4035 0 0 4035
2017-Preiner-psyco 194 0 0 194
2017-Preiner-scholl-smt08 374 0 0 374
2017-Preiner-tptp 73 0 0 73
2017-Preiner-UltimateAutomizer 153 0 0 153
20170501-Heizmann-UltimateAutomizer 131 0 4 127
2018-Preiner-cav18 200 0 40 160
wintersteiger 191 21 67 103

2 Benchmark evaluation

The resulting benchmarks were evaluated by benchexec. The following XML definition file for benchexec was used:

<?xml version="1.0"?>
<!DOCTYPE benchmark PUBLIC "+//IDN sosy-lab.org//DTD BenchExec benchmark 1.4//EN" "http://www.sosy-lab.org/benchexec/benchmark-1.4.dtd">
<benchmark tool="tools.cvc4"
           timelimit="1 min"
           hardtimelimit="1 min"
           memlimit="8 GB"
           cpuCores="1">

  <rundefinition name="cvc4" />

  <tasks name="PreinerCAV">
    <include>/var/benchexec/benchmarks/ReducedBV/2018-Preiner-cav18/*.smt2</include>
  </tasks>


  <tasks name="Wintersteiger">
    <include>/var/benchexec/benchmarks/ReducedBV/wintersteiger/fmsd13/fixpoint/*.smt2</include>
    <include>/var/benchexec/benchmarks/ReducedBV/wintersteiger/fmsd13/ranking/*.smt2</include>
  </tasks>

  <tasks name="Preiner">
    <include>/var/benchexec/benchmarks/ReducedBV/2017-Preiner-keymaera/*.smt2</include>
    <include>/var/benchexec/benchmarks/ReducedBV/2017-Preiner-psyco/*.smt2</include>
    <include>/var/benchexec/benchmarks/ReducedBV/2017-Preiner-scholl-smt08/*/*.smt2</include>
    <include>/var/benchexec/benchmarks/ReducedBV/2017-Preiner-tptp/*.smt2</include>
    <include>/var/benchexec/benchmarks/ReducedBV/2017-Preiner-UltimateAutomizer/*.smt2</include>
  </tasks>

  <tasks name="Heizmann">
    <include>/var/benchexec/benchmarks/ReducedBV/20170501-Heizmann-UltimateAutomizer/*.smt2</include>
  </tasks>
</benchmark>

And the following command was used for running Benchexec.

benchexec benchmark-cvc4-reducedBW.xml --numOfThreads 12

Calling table-generator on the resulting XML file results in the CSV file results.csv.

3 CSV file preprocessing

3.1 Convert to a correct CSV

In the benchexec CSV files, we need to replace tabs by commas:

sed -i 's/\t/,/g' results.csv

And then we remove the standard BenchExec header rows and replace them by more readable ones.

cat results_header.csv > results_out.csv
tail -n +4 results.csv >> results_out.csv

3.2 Separate benchmark name, family, and bit-width

We separate benchmark name into the original benchmark name, family, and the bit-width.

mv results_out.csv data/results.csv
sed -r -i 's/_([0-9]{1,2})bits\.smt2/\.smt2,\1/g' data/results.csv
sed -i 's/\//,/' data/results.csv

3.3 Load CSV file

data <- read.csv("data/results.csv", header=TRUE, stringsAsFactors=FALSE)

4 Evaluation

4.1 Numbers of benchmarks

Original number of benchmarks:

data %>%
    group_by(benchmark,family) %>%
    summarize(numResults = n_distinct(status)) %>%
    nrow()
5219

Divided by the original bit-width:

data %>%
    group_by(benchmark,family) %>%
    summarize(numResults = n_distinct(status), originalbw = max(bitwidth)) %>%
    group_by(originalbw) %>%
    summarize(totalFormulas = n())
originalbw totalFormulas
1 10
8 10
20 20
32 4964
33 2
64 176
65 25
66 1
67 6
68 1
69 4

Divided by the benchmark family:

data %>%
    group_by(benchmark,family) %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by(family) %>%
    summarize(n())
2017-Preiner-keymaera 4035
2017-Preiner-psyco 194
2017-Preiner-scholl-smt08 374
2017-Preiner-tptp 73
2017-Preiner-UltimateAutomizer 153
20170501-Heizmann-UltimateAutomizer 127
2018-Preiner-cav18 160
wintersteiger 103

4.2 Decided benchmarks

Benchmarks with all bit-widths decided:

decided <- data %>%
               group_by(benchmark,family) %>%
               filter(all(status != 'timeout'))

undecided <- data %>%
               group_by(benchmark,family) %>%
               filter(any(status == 'timeout'))

Number of decided:

decided %>%
    summarize(numResults = n_distinct(status)) %>%
    nrow()
4905

Divided by the benchmark family:

decided %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by(family) %>%
    summarize(n())
2017-Preiner-keymaera 3970
2017-Preiner-psyco 189
2017-Preiner-scholl-smt08 221
2017-Preiner-tptp 73
2017-Preiner-UltimateAutomizer 151
20170501-Heizmann-UltimateAutomizer 120
2018-Preiner-cav18 130
wintersteiger 51

Divided by bit-width:

decidedByBW <- decided %>%
    summarize(numResults = n_distinct(status), originalbw = max(bitwidth)) %>%
    group_by(originalbw) %>%
    summarize(totalFormulas = n())
originalbw totalFormulas
1 10
8 10
20 20
32 4727
33 1
64 134
65 3

Divided by the bit-width and the benchmark family:

decided %>%
    summarize(numResults = n_distinct(status), originalbw = max(bitwidth)) %>%
    group_by(originalbw, family) %>%
    summarize(totalFormulas = n())
originalbw family totalFormulas
1 wintersteiger 10
8 wintersteiger 10
20 wintersteiger 20
32 2017-Preiner-keymaera 3970
32 2017-Preiner-psyco 189
32 2017-Preiner-scholl-smt08 221
32 2017-Preiner-tptp 73
32 2017-Preiner-UltimateAutomizer 151
32 20170501-Heizmann-UltimateAutomizer 118
32 wintersteiger 5
33 wintersteiger 1
64 20170501-Heizmann-UltimateAutomizer 2
64 2018-Preiner-cav18 130
64 wintersteiger 2
65 wintersteiger 3

4.3 Undecided benchmarks

Number of undecided:

undecided %>%
    summarize(numResults = n_distinct(status)) %>%
    nrow()
314

Divided by the benchmark family:

undecided %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by(family) %>%
    summarize(n())
2017-Preiner-keymaera 65
2017-Preiner-psyco 5
2017-Preiner-scholl-smt08 153
2017-Preiner-UltimateAutomizer 2
20170501-Heizmann-UltimateAutomizer 7
2018-Preiner-cav18 30
wintersteiger 52

4.4 Which bit-widths resulted in the different result

Evaluate how many benchmarks changed satisfiability after reduction to a given number of bits or more.

4.4.1 Benchmarks with any wrong result

decided %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by() %>%
    summarize(stable = sum(numResults > 1))
216

decided %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by(family) %>%
    summarize(stable = sum(numResults > 1))
2017-Preiner-keymaera 64
2017-Preiner-psyco 49
2017-Preiner-scholl-smt08 1
2017-Preiner-tptp 20
2017-Preiner-UltimateAutomizer 41
20170501-Heizmann-UltimateAutomizer 29
2018-Preiner-cav18 4
wintersteiger 8

4.4.2 With wrong >=2bit result

decided %>%
    filter(bitwidth > 1) %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by() %>%
    summarize(stable = sum(numResults > 1))
95

decided %>%
    filter(bitwidth > 1) %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by(family) %>%
    summarize(stable = sum(numResults > 1))
2017-Preiner-keymaera 19
2017-Preiner-psyco 5
2017-Preiner-scholl-smt08 0
2017-Preiner-tptp 10
2017-Preiner-UltimateAutomizer 32
20170501-Heizmann-UltimateAutomizer 20
2018-Preiner-cav18 3
wintersteiger 6

4.4.3 With wrong >=4bit result

decided %>%
    filter(bitwidth > 3) %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by() %>%
    summarize(stable = sum(numResults > 1))
32

decided %>%
    filter(bitwidth > 3) %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by(family) %>%
    summarize(stable = sum(numResults > 1))
2017-Preiner-keymaera 12
2017-Preiner-psyco 0
2017-Preiner-scholl-smt08 0
2017-Preiner-tptp 3
2017-Preiner-UltimateAutomizer 3
20170501-Heizmann-UltimateAutomizer 6
2018-Preiner-cav18 3
wintersteiger 5

4.4.4 With wrong >=8bit result

decided %>%
    filter(bitwidth > 7) %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by() %>%
    summarize(stable = sum(numResults > 1))
14

decided %>%
    filter(bitwidth > 7) %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by(family) %>%
    summarize(stable = sum(numResults > 1))
2017-Preiner-keymaera 4
2017-Preiner-psyco 0
2017-Preiner-scholl-smt08 0
2017-Preiner-tptp 0
2017-Preiner-UltimateAutomizer 0
20170501-Heizmann-UltimateAutomizer 5
2018-Preiner-cav18 3
wintersteiger 2

4.4.5 With wrong >=16bit result

decided %>%
    filter(bitwidth > 15) %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by() %>%
    summarize(stable = sum(numResults > 1))
13

decided %>%
    filter(bitwidth > 15) %>%
    summarize(numResults = n_distinct(status)) %>%
    group_by(family) %>%
    summarize(stable = sum(numResults > 1))
2017-Preiner-keymaera 4
2017-Preiner-psyco 0
2017-Preiner-scholl-smt08 0
2017-Preiner-tptp 0
2017-Preiner-UltimateAutomizer 0
20170501-Heizmann-UltimateAutomizer 5
2018-Preiner-cav18 3
wintersteiger 1

4.5 Stabilizing bit-width

Evaluate the number of benchmarks whose satisfiability does not change after reduction to a given bit-width or more for all bit-widths. First, we need to get the maximal bit-width in the benchmark set.

maxBW <- decided %>% ungroup() %>% summarize(max(bitwidth)) %>% as.numeric()
65

Now, we really count numbers of such benchmarks.

countUnstable <- function(minBW)
{
    return (decided %>%
       mutate(cropbw = minBW, originalbw = max(bitwidth)) %>%
       group_by(benchmark, originalbw, cropbw) %>%
       filter(bitwidth >= minBW) %>%
       summarize(numResults = n_distinct(status)) %>%
       group_by(originalbw, cropbw) %>%
       summarize(unstable = sum(numResults > 1)))
}

unstableCounts <- seq(1, maxBW) %>% map_dfr(countUnstable)

Finally, plot the percentage of such benchmarks divided by the original bit-width of the benchmark.

unstableCounts %>% merge(decidedByBW) %>%
    ggplot(aes(x = cropbw, y =  unstable/totalFormulas, color=factor(originalbw))) +
    geom_line() + facet_wrap(originalbw~totalFormulas, ncol=2) +
    xlab("Bit-width")

unstablePlot.png

unstableCounts %>% merge(decidedByBW) %>%
    filter(originalbw == 32 | originalbw == 64) %>%
    ggplot(aes(x = cropbw, y =  unstable/totalFormulas, color=factor(originalbw))) +
    scale_x_continuous(breaks = seq(0, 65, by = 2)) +
    scale_y_continuous(breaks = seq(0, 0.05, by = 0.005), labels=percent) +
    xlab("Bit-width") +
    ylab("Percentage of formulas") +
    labs(color = "Original bit-width") +
    geom_line() #+ facet_wrap(originalbw~totalFormulas, ncol=2)

unstablePlot.pdf

4.6 Times

originalBitwidths <- data %>%
    group_by(benchmark) %>%
    summarize(originalBitwidth = max(bitwidth))
decided %>% merge(originalBitwidths) %>%
    ggplot(aes(x = bitwidth, y = cputime)) +
    geom_bin2d(binwidth=c(1, 0.1)) + facet_grid(originalBitwidth ~ .)

timesBoxplot.png

decided %>% merge(originalBitwidths) %>%
    filter(originalBitwidth == 32) %>%
    mutate(originalBitwidth = "32 bits") %>%
    ggplot(aes(x = bitwidth, y = cputime)) +
    scale_shape_identity() +
    geom_bin2d(fill="black", binwidth=c(1, 0.05)) + facet_grid(cols = vars(originalBitwidth)) +
    xlab("Reduced bit-width") + ylab("CPU time [s]")

times32.pdf

decided %>% merge(originalBitwidths) %>%
    filter(originalBitwidth == 64) %>%
    mutate(originalBitwidth = "64 bits") %>%
    ggplot(aes(x = bitwidth, y = cputime)) +
    geom_bin2d(fill="black", binwidth=c(1, 0.1)) + facet_grid(cols = vars(originalBitwidth)) +
    xlab("Reduced bit-width") + ylab("CPU time [s]")

times64.pdf

data %>%
   filter(status == "timeout") %>%
   merge(originalBitwidths) %>%
   group_by(bitwidth, originalBitwidth) %>%
   summarize(timeouts = n()) %>%
   ggplot(aes(x = bitwidth, y = timeouts)) +
   geom_line() + facet_grid(originalBitwidth ~ .)

timeouts.png

4.7 Changing benchmarks

Get benchmarks whose satisfiability changed after reduction to 4 bits or more.

changingNames <- decided %>%
    filter(bitwidth > 3) %>%
    summarize(numResults = n_distinct(status)) %>%
    filter(numResults > 1) %>%
    select(benchmark)

changing <- merge(decided, changingNames)

Plot all their satisfiability statuses.

elemCount <- changing %>% group_by(benchmark) %>% nrow()

changing %>% merge(originalBitwidths) %>%
ggplot(aes(x = reorder(benchmark, originalBitwidth), y = bitwidth, color = status)) +
    geom_point() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5))

changing.png

changing %>% merge(originalBitwidths) %>%
ggplot(aes(x = reorder(benchmark, originalBitwidth), y = bitwidth, color = status)) +
    geom_point() +
    scale_x_discrete(labels = seq(1, elemCount)) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5)) +
    xlab("Original benchmark") +
    ylab("Bit-width") +
    labs(color = "Satisfiability status")

changing.pdf

And plot all satisfiability statuses of all undecided benchmarks

undecided %>% ggplot(aes(x = benchmark, y = bitwidth, color = status)) +
    geom_point() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    facet_grid(family~., scales="free")

undecided.png

Author: Martin Jonáš

Created: 2018-09-13 Thu 11:18