A reference implementation Propagational Proxy Voting.

API and model

python

naive implementation

we skip the most naive implementation to iterate through the matrix to calculate the dot product. we use numpy instead.

The following would be the most straight forward implementation, given the matrix \(V\), we will do something like..

while True:
    a = a @ v
    result_seq = a[SOURCES:, :N_DELEGATE].sum(axis=1)
    if np.sum(np.abs(result_seq - p_result)) < EPSILON:
        break
    p_result = result_seq
    i += 1

EPSILON is a constant to define ’convergence’.

Command Mean [ms] Min [ms] Max [ms]
delegates: 10 91.2 ± 11.2 73.7 115.0
delegates: 60 100.7 ± 12.0 81.1 136.6
delegates: 110 140.3 ± 55.3 99.9 333.3
delegates: 160 174.6 ± 24.2 140.6 242.8
delegates: 210 274.7 ± 41.1 209.1 362.5
delegates: 260 380.6 ± 36.7 318.2 433.9
delegates: 310 524.7 ± 65.8 477.4 694.1
delegates: 360 719.8 ± 37.7 654.6 791.3
delegates: 410 1063.9 ± 88.2 985.8 1297.1
delegates: 460 1380.5 ± 104.2 1226.6 1525.1
delegates: 1000 161740 ± 258 15843 16719

so just by having 1000 delegates can be quite slow…

We want to handle at least 10,522 voters. This will never going to converge.

breaking up the matrix

One thing we observed is that when we have more policies, this converges faster…

Command Mean [ms] Min [ms] Max [ms]
100 0 10 129.5 ± 16.0 106.3 185.2
100 0 60 105.7 ± 11.0 89.2 132.6
100 0 110 122.2 ± 13.0 98.9 153.8
100 0 160 133.8 ± 14.7 111.5 162.3
100 0 210 150.0 ± 15.2 123.6 182.8
100 0 260 165.1 ± 61.0 129.9 406.9
100 0 310 177.9 ± 25.8 147.8 246.0
100 0 360 185.3 ± 10.6 170.2 204.8
100 0 410 211.9 ± 16.1 191.2 245.4
100 0 460 222.2 ± 18.8 196.3 269.2

we see that policies with only 10 is slower than having 60-110 nodes…. Even the fact that it contributes to having a large voting matrix overall.

This give a hint that the balance between the sources and policies do have an effect on the computational load.

At the end it is this matrix, and we are only interested in \(V_1\) and \(V_2\).

\begin{aligned} V^{t+1} &= V^t \cdot A \\ &= \left[ \begin{array}{c|c} V_{1}^{t} & \mathbf{0} \\ \hline V_{2}^{t} & I \\ \end{array} \right] \cdot \left[ \begin{array}{c|c} A_1 & \mathbf{0} \\ \hline A_2 & I \end{array} \right] \\ V_1^{t+1} &= V_1^{t} \cdot A_1 \\ V_2^{t+1} &= V_1^{t} \cdot A_2 + A_2 \\ \end{aligned}

This helps when there are lots of policies relative to sources(delegates and intermediaries)

sm_1 = v[:SOURCES, :SOURCES]
sm_2 = v[SOURCES:, :SOURCES].T

while True:

    sma_2 = sma_1 @ sm_2 + sma_2
    sma_1 = sma_1 @ sm_1

    result_sm = sma_2[:N_DELEGATE, :].sum(axis=0)

    if np.sum(np.abs(result_sm - p_result)) < EPSILON:
        break

    p_result = result_sm

    i += 1
Command Mean [ms] Min [ms] Max [ms]
100 0 10 132.0 ± 14.7 106.6 158.1
100 0 60 102.3 ± 10.7 85.0 132.3
100 0 110 115.0 ± 23.0 86.3 204.4
100 0 160 119.6 ± 20.9 89.5 166.9
100 0 210 114.7 ± 19.4 93.9 164.9
100 0 260 120.8 ± 15.4 84.0 153.1
100 0 310 118.9 ± 15.0 99.1 150.5
100 0 360 129.6 ± 20.1 105.1 194.6
100 0 410 130.0 ± 20.3 101.8 181.5
100 0 460 140.3 ± 38.6 106.6 303.3

We see the same trend here. 60-360 is faster than 10. It’s nice to see a small speed bump though.

but the differences get negligible when the number of sources gets big….

Benchmark 1: python calculate_ppv.py 500 0 50
  Time (mean ± σ):      1.706 s ±  0.091 s    [User: 14.242 s, System: 1.692 s]

Benchmark 1: python calculate_ppv.py 500 0 50
  Time (mean ± σ):      1.891 s ±  0.089 s    [User: 15.547 s, System: 1.837 s]

This case below is the worst case to the sequential version.

# sequential
  Time (mean ± σ):     24.685 s ±  0.887 s    [User: 217.134 s, System: 16.602 s]
  Range (min … max):   23.928 s … 27.032 s    10 runs
# sub matrix
  Time (mean ± σ):     27.390 s ±  2.447 s    [User: 237.191 s, System: 16.035 s]
  Range (min … max):   22.516 s … 30.215 s    10 runs

but as soon as we have more policies, breaking this up to sub matrix has some speed increase.

# sequential
  Time (mean ± σ):     12.831 s ±  1.350 s    [User: 112.784 s, System: 9.302 s]
  Range (min … max):   12.039 s … 16.446 s    10 runs
# sub matrix
  Time (mean ± σ):     12.467 s ±  0.241 s    [User: 110.049 s, System: 9.258 s]
  Range (min … max):   12.097 s … 12.904 s    10 runs

floating point arithmatic

but, it gets interesting when we see the results. Mathmatically this is the same, yet we see some discrepancies.

sequential
    iteration:  7677
    took:  21.688924542000223

sub matrix
    iteration:  7524
    took:  21.49630766600012

result delta:  0.13598852449656817

we see smaller delta when we have more policies.

nodes seq iteration sub iteration delta
500 0 02 7677 7524 0.13598852449656817
500 0 04 4034 3984 0.0996262102025618
500 0 50 380 364 0.006377236222159261
500 0 100 207 194 0.003150795373400985

so the more the iteration, we have more differences. This comes fome multipling two values that are order of magnitudes different to each other. We are trying to mitigate this by making sure we use f64, but we will see this.

n to the power of 2

we know that we can speed things up by multipling \(V\) together, meaning we can \(V^{t+1} = V \cdot V\). Lets do this with both the sequential pattern and sub matrix.

Command Mean [ms] Min [ms] Max [ms]
10 81.9 ± 8.6 69.8 106.6
60 84.4 ± 8.4 72.6 99.8
110 92.0 ± 13.1 75.2 140.5
160 99.1 ± 9.2 81.4 115.6
210 108.8 ± 8.7 89.8 130.9
260 117.8 ± 7.4 105.1 134.8
310 136.6 ± 13.0 118.5 163.6
360 150.6 ± 6.2 140.5 166.9
410 187.6 ± 36.2 162.5 321.0
460 198.2 ± 12.8 170.7 216.2
1000 625.3 ± 24.5 595.3 662.5
5000 20862 ± 441 19915 21301

wow, that is a x271 speed bump for larger matrices. We can still wait for 20 seconds.

sub matrix version

Command Mean [ms] Min [ms] Max [ms]
10 98.1 ± 12.9 83.6 145.1
60 102.9 ± 7.8 92.2 121.4
110 110.4 ± 10.1 91.7 131.8
160 122.2 ± 12.3 101.5 146.6
210 141.5 ± 18.2 119.2 178.2
260 148.0 ± 9.5 125.4 162.9
310 177.5 ± 43.8 154.4 350.1
360 180.4 ± 14.6 162.4 218.9
410 222.0 ± 17.0 198.8 252.7
460 233.9 ± 19.9 207.9 283.7
1000 714.2 ± 13.7 681.0 732.6
5000 26.627 ± 803 25886 28594

here we see a slight speed down compared to the normal multiplitication.

zig

lets use zig, see if this will help. numpy uses BLAS under the hood, so we should also use this.

zig test src/main.zig -lc -L/System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/libBLAS.dynlib -lBLAS -OReleaseFast
Command Mean Min Max
./main 10 747.7µs ± 155.1 503.6µs 1555.9µs
./main 60 1173.5µs ± 143.3 936.6µs 1961.8µs
./main 110 1959.6µs ± 157.7 1701.4µs 2811.6µs
./main 160 3253.8µs ± 156.4 2915.4µs 4063.2µs
./main 210 5047.6µs ± 149.9 4683.0µs 5937.6µs
./main 260 6630.8µs ± 179.3 6199.3µs 7487.5µs
./main 310 8609.8µs ± 170.4 8167.1µs 9566.0µs
./main 360 11320.0µs ± 194.3 10877.6µs 12363.8µs
./main 410 14868.5µs ± 185.5 14407.6µs 15442.5µs
./main 460 18.5258ms ± .2908 18.0263ms 20759.3µs

we dipped into sub order!

Command Mean Min Max
./main 1000 105.7 ms ± 1.5 ms 103.2 ms 109.2 ms
./main 5000 8.316 s ± 0.033 s 8.281 s 8.379 s
./main 10000 64.337 s ± 0.162 s 64.122 s 64.657 s

thats is an 1530x speed increase than the original python numpy implmentation with 1000 nodes! The other thing is that the calculation doensn’t hog the cpu!

To increase the number than this, I think we need to go deeper into GPUs.

Date: 2024-05-20 Mon 17:20