Rust --release build为什么比Go慢?

jhiyze9q  于 2023-08-01  发布在  Go
关注(0)|答案(3)|浏览(139)

我试图了解Rust的并发性和并行计算,并将其放在一起,编写了一个小脚本,它可以像图像的像素一样迭代向量的向量。因为一开始我想看看它比iter快多少,所以我扔了一个基本的计时器--这可能不是非常准确的。但是,我得到了疯狂的高数字。所以,我想我会在Go上编写一段类似的代码,它允许轻松的并发,并且 * 性能快了~585%*!

Rust使用--release进行了测试

我也试过使用原生线程池,但结果是一样的。看看有多少线程,我正在使用,有一点我是乱搞,以及,无济于事。
我做错了什么?(不要介意创建一个随机值填充向量的向量的绝对不高性能的方式)
rust eclipse 代码(~ 140 ms)

use rand::Rng;
use std::time::Instant;
use rayon::prelude::*;

fn normalise(value: u16, min: u16, max: u16) -> f32 {
    (value - min) as f32 / (max - min) as f32
}

fn main() {
    let pixel_size = 9_000_000;
    let fake_image: Vec<Vec<u16>> = (0..pixel_size).map(|_| {
        (0..4).map(|_| {
            rand::thread_rng().gen_range(0..=u16::MAX)
        }).collect()
    }).collect();

    // Time starts now.
    let now = Instant::now();

    let chunk_size = 300_000;

    let _normalised_image: Vec<Vec<Vec<f32>>> = fake_image.par_chunks(chunk_size).map(|chunk| {
        let normalised_chunk: Vec<Vec<f32>> = chunk.iter().map(|i| {
            let r = normalise(i[0], 0, u16::MAX);
            let g = normalise(i[1], 0, u16::MAX);
            let b = normalise(i[2], 0, u16::MAX);
            let a = normalise(i[3], 0, u16::MAX);
            
            vec![r, g, b, a]
        }).collect();

        normalised_chunk
    }).collect();

    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed: {:.2?}", elapsed);
}

字符串
Go代码(~ 24 ms)

package main

import (
    "fmt"
    "math/rand"
    "sync"
    "time"
)

func normalise(value uint16, min uint16, max uint16) float32 {
    return float32(value-min) / float32(max-min)
}

func main() {
    const pixelSize = 9000000
    var fakeImage [][]uint16

    // Create a new random number generator
    src := rand.NewSource(time.Now().UnixNano())
    rng := rand.New(src)

    for i := 0; i < pixelSize; i++ {
        var pixel []uint16
        for j := 0; j < 4; j++ {
            pixel = append(pixel, uint16(rng.Intn(1<<16)))
        }
        fakeImage = append(fakeImage, pixel)
    }

    normalised_image := make([][4]float32, pixelSize)
    var wg sync.WaitGroup

    // Time starts now
    now := time.Now()
    chunkSize := 300_000
    numChunks := pixelSize / chunkSize
    if pixelSize%chunkSize != 0 {
        numChunks++
    }

    for i := 0; i < numChunks; i++ {
        wg.Add(1)

        go func(i int) {
            // Loop through the pixels in the chunk
            for j := i * chunkSize; j < (i+1)*chunkSize && j < pixelSize; j++ {
                // Normalise the pixel values
                _r := normalise(fakeImage[j][0], 0, ^uint16(0))
                _g := normalise(fakeImage[j][1], 0, ^uint16(0))
                _b := normalise(fakeImage[j][2], 0, ^uint16(0))
                _a := normalise(fakeImage[j][3], 0, ^uint16(0))

                // Set the pixel values
                normalised_image[j][0] = _r
                normalised_image[j][1] = _g
                normalised_image[j][2] = _b
                normalised_image[j][3] = _a
            }

            wg.Done()
        }(i)
    }

    wg.Wait()

    elapsed := time.Since(now)
    fmt.Println("Time taken:", elapsed)
}

yzuktlbb

yzuktlbb1#

加速Rust代码的最重要的初始更改是使用正确的类型。在Go中,你使用[4]float32来表示RBGA四元组,而在Rust中,你使用Vec<f32>。为了提高性能,正确的类型是[f32; 4],这是一个已知正好包含4个浮点数的数组。已知大小的数组不需要进行堆分配,而Vec总是进行堆分配。这大大提高了你的性能-在我的机器上,这是一个8倍的差异。
原始片段:

let fake_image: Vec<Vec<u16>> = (0..pixel_size).map(|_| {
        (0..4).map(|_| {
            rand::thread_rng().gen_range(0..=u16::MAX)
        }).collect()
    }).collect();

... 

    let _normalised_image: Vec<Vec<Vec<f32>>> = fake_image.par_chunks(chunk_size).map(|chunk| {
        let normalised_chunk: Vec<Vec<f32>> = chunk.iter().map(|i| {
            let r = normalise(i[0], 0, u16::MAX);
            let g = normalise(i[1], 0, u16::MAX);
            let b = normalise(i[2], 0, u16::MAX);
            let a = normalise(i[3], 0, u16::MAX);
            
            vec![r, g, b, a]
        }).collect();

        normalised_chunk
    }).collect();

字符串
新代码段:

let fake_image: Vec<[u16; 4]> = (0..pixel_size).map(|_| {
    let mut result: [u16; 4] = Default::default();
    result.fill_with(|| rand::thread_rng().gen_range(0..=u16::MAX));
    result
    }).collect();

...

    let _normalised_image: Vec<Vec<[f32; 4]>> = fake_image.par_chunks(chunk_size).map(|chunk| {
        let normalised_chunk: Vec<[f32; 4]> = chunk.iter().map(|i| {
            let r = normalise(i[0], 0, u16::MAX);
            let g = normalise(i[1], 0, u16::MAX);
            let b = normalise(i[2], 0, u16::MAX);
            let a = normalise(i[3], 0, u16::MAX);
            
            [r, g, b, a]
        }).collect();

        normalised_chunk
    }).collect();


在我的机器上,这导致了大约7.7倍的加速,使Rust和Go大致持平。为每一个四元组进行堆分配的开销大大降低了Rust的速度,淹没了其他所有东西;消除这一点使Rust和Go的基础更加平衡。
其次,你的Go代码中有一个小错误。在Rust代码中,你计算了一个标准化的rgba,而在Go代码中,你只计算了_r_g_b。我的机器上没有安装Go,但我想这会给Go带来一点不公平的优势,因为你做的工作更少。
第三,你在Rust和Go中仍然没有做同样的事情。在Rust中,您将原始图像分割成块,并为每个块生成Vec<[f32; 4]>。这意味着你仍然有一堆块坐在内存中,你以后必须合并成一个单一的最终图像。在Go语言中,你可以拆分原始的块,并将每个块写入一个公共数组。我们可以进一步重写Rust代码,以完美地模仿Go代码。这是Rust中的样子:

let _normalized_image: Vec<[f32; 4]> = {
    let mut destination = vec![[0 as f32; 4]; pixel_size];
    
    fake_image
        .par_chunks(chunk_size)
        // The "zip" function allows us to iterate over a chunk of the input 
        // array together with a chunk of the destination array.
        .zip(destination.par_chunks_mut(chunk_size))
        .for_each(|(i_chunk, d_chunk)| {
        // Sanity check: the chunks should be of equal length.
        assert!(i_chunk.len() == d_chunk.len());
        for (i, d) in i_chunk.iter().zip(d_chunk) {
            let r = normalise(i[0], 0, u16::MAX);
            let g = normalise(i[1], 0, u16::MAX);
            let b = normalise(i[2], 0, u16::MAX);
            let a = normalise(i[3], 0, u16::MAX);
            
            *d = [r, g, b, a];

            // Alternately, we could do the following loop:
            // for j in 0..4 {
            //  d[j] = normalise(i[j], 0, u16::MAX);
            // }
        }
    });
    destination
};


现在你的Rust代码和Go代码真正做的是同样的事情。我怀疑你会发现Rust代码稍微快一点。
最后,如果你在真实的生活中这样做,你应该尝试的第一件事是使用map,如下所示:

let _normalized_image = fake_image.par_iter().map(|&[r, b, g, a]| {
    [ normalise(r, 0, u16::MAX),
      normalise(b, 0, u16::MAX),
      normalise(g, 0, u16::MAX),
      normalise(a, 0, u16::MAX),
      ]
    }).collect::<Vec<_>>();


这和在我的机器上手动分块一样快。

j2qf4p5b

j2qf4p5b2#

use rand::Rng;
use std::time::Instant;
use rayon::prelude::*;

fn normalise(value: u16, min: u16, max: u16) -> f32 {
    (value - min) as f32 / (max - min) as f32
}

type PixelU16 = (u16, u16, u16, u16);
type PixelF32 = (f32, f32, f32, f32);

fn main() {
    let pixel_size = 9_000_000;
    let fake_image: Vec<PixelU16> = (0..pixel_size).map(|_| {
        let mut rng =
            rand::thread_rng();
        (rng.gen_range(0..=u16::MAX), rng.gen_range(0..=u16::MAX), rng.gen_range(0..=u16::MAX), rng.gen_range(0..=u16::MAX))
    }).collect();

    // Time starts now.
    let now = Instant::now();

    let chunk_size = 300_000;

    let _normalised_image: Vec<Vec<PixelF32>> = fake_image.par_chunks(chunk_size).map(|chunk| {
        let normalised_chunk: Vec<PixelF32> = chunk.iter().map(|i| {
            let r = normalise(i.0, 0, u16::MAX);
            let g = normalise(i.1, 0, u16::MAX);
            let b = normalise(i.2, 0, u16::MAX);
            let a = normalise(i.3, 0, u16::MAX);

            (r, g, b, a)
        }).collect::<Vec<_>>();

        normalised_chunk
    }).collect();

    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed: {:.2?}", elapsed);
}

字符串
我已经将使用数组切换到使用元组,并且该解决方案已经比您在我的机器上提供的解决方案快10倍。通过减少堆分配的数量,甚至可以通过削减Vec并使用Arc<Mutex<Vec<Pixel>>>或一些mpsc通道来提高速度。

xxb16uws

xxb16uws3#

通常不推荐使用Vec<Vec<T>>,因为它对缓存不太友好,因为使用Vec<Vec<Vec<T>>>,情况更糟。
内存分配的过程也耗费了大量的时间。
一个小小的改进是将类型更改为Vec<Vec<[T; N]>>,因为最里面的Vec<T>应该是固定大小的4 u16或f32。这将我的PC上的处理时间从~ 110 ms降到了11 ms。

fn rev1() {
    let pixel_size = 9_000_000;
    let chunk_size = 300_000;

    let fake_image: Vec<[u16; 4]> = (0..pixel_size)
        .map(|_| {
            core::array::from_fn(|_| rand::thread_rng().gen_range(0..=u16::MAX))
        })
        .collect();

    // Time starts now.
    let now = Instant::now();

    let _normalized_image: Vec<Vec<[f32; 4]>> = fake_image
        .par_chunks(chunk_size)
        .map(|chunk| {
            chunk
                .iter()
                .map(|rgba: &[u16; 4]| rgba.map(|v| normalise(v, 0, u16::MAX)))
                .collect()
        })
        .collect();

    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed (r1): {:.2?}", elapsed);
}

字符串
但是,这仍然需要大量的分配和拷贝。如果不需要新的载体,原位突变甚至可以更快。约5 ms

pub fn rev2() {
    let pixel_size = 9_000_000;
    let chunk_size = 300_000;
    let mut fake_image: Vec<Vec<[f32; 4]>> = (0..pixel_size / chunk_size)
        .map(|_| {
            (0..chunk_size)
                .map(|_| {
                    core::array::from_fn(|_| {
                        rand::thread_rng().gen_range(0..=u16::MAX) as f32
                    })
                })
                .collect()
        })
        .collect();

    // Time starts now.
    let now = Instant::now();

    fake_image.par_iter_mut().for_each(|chunk| {
        chunk.iter_mut().for_each(|rgba: &mut [f32; 4]| {
            rgba.iter_mut().for_each(|v: &mut _| {
                *v = normalise_f32(*v, 0f32, u16::MAX as f32)
            })
        })
    });

    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed (r2): {:.2?}", elapsed);
}


这里的Vec<Vec<T>>仍然不是理想的,尽管在这种特定情况下将其展平并不能显著提高性能。访问此嵌套数组结构中的元素将比访问平面数组慢。

/// Create a new flat Vec from fake_image
pub fn rev3() {
    let pixel_size = 9_000_000;
    let _chunk_size = 300_000;

    let fake_image: Vec<[u16; 4]> = (0..pixel_size)
        .map(|_| {
            core::array::from_fn(|_| rand::thread_rng().gen_range(0..=u16::MAX))
        })
        .collect();

    // Time starts now.
    let now = Instant::now();

    let _normalized_image: Vec<[f32; 4]> = fake_image
        .par_iter()
        .map(|rgba: &[u16; 4]| rgba.map(|v| normalise(v, 0, u16::MAX)))
        .collect();

    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed (r3): {:.2?}", elapsed);
}

/// In place mutation of a flat Vec
pub fn rev4() {
    let pixel_size = 9_000_000;
    let _chunk_size = 300_000;

    let mut fake_image: Vec<[f32; 4]> = (0..pixel_size)
        .map(|_| {
            core::array::from_fn(|_| {
                rand::thread_rng().gen_range(0..=u16::MAX) as f32
            })
        })
        .collect();

    // Time starts now.
    let now = Instant::now();

    fake_image.par_iter_mut().for_each(|rgba: &mut [f32; 4]| {
        rgba.iter_mut()
            .for_each(|v: &mut _| *v = normalise_f32(*v, 0f32, u16::MAX as f32))
    });

    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed (r4): {:.2?}", elapsed);
}

相关问题