Using WeightedRandomSampler in PyTorch

Posted by rjee0c15 on 2024-01-09 in Other

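I need to implement a multi-label image classification model in PyTorch. My data is imbalanced, so I built a custom data loader using PyTorch's WeightedRandomSampler. However, when I iterate over that data loader I get the error: IndexError: list index out of range
I implemented the following code based on this link: https://discuss.pytorch.org/t/balanced-sampling-between-classes-with-torchvision-dataloader/2703/3?u=surajsubramanian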

def make_weights_for_balanced_classes(images, nclasses):                        
    count = [0] * nclasses                                                      
    for item in images:                                                         
        count[item[1]] += 1                                                     
    weight_per_class = [0.] * nclasses                                      
    N = float(sum(count))                                                   
    for i in range(nclasses):                                                   
        weight_per_class[i] = N/float(count[i])                                 
    weight = [0] * len(images)                                              
    for idx, val in enumerate(images):                                          
        weight[idx] = weight_per_class[val[1]]                                  
    return weight

Based on the answer at https://stackoverflow.com/a/60813495/10077354, here is my updated code. But when I create a data loader with loader = DataLoader(full_dataset, batch_size=4, sampler=sampler), len(loader) returns 1.

class_counts = [1691, 743, 2278, 1271]
num_samples = np.sum(class_counts)
labels = [tag for _,tag in full_dataset.imgs] 

class_weights = [num_samples/class_counts[i] for i in range(len(class_counts))]
weights = [class_weights[labels[i]] for i in range(num_samples)]
sampler = WeightedRandomSampler(torch.DoubleTensor(weights), num_samples)


Thanks a lot in advance!
Following the accepted answer below, I added a utility function:

def sampler_(dataset):
    dataset_counts = imageCount(dataset)
    num_samples = sum(dataset_counts)
    labels = [tag for _,tag in dataset]

    class_weights = [num_samples/dataset_counts[i] for i in range(n_classes)]
    weights = [class_weights[labels[i]] for i in range(num_samples)]
    sampler = WeightedRandomSampler(torch.DoubleTensor(weights), int(num_samples))
    return sampler


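The imageCount function counts the number of images in each class of the dataset. Each item in the dataset holds an image and its class, so we look at the second element of the tuple.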

def imageCount(dataset):
    image_count = [0]*(n_classes)
    for img in dataset:
        image_count[img[1]] += 1
    return image_count
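
Iterating over the dataset, as imageCount and sampler_ do, loads every image just to read its label. If the labels are already available as a plain list, the counts and the per-sample weights can be computed without touching the images. The sketch below assumes a hypothetical `targets` list of integer class indices (torchvision's ImageFolder exposes one under that name, but any list works); `weights_from_targets` is an illustrative helper, not part of PyTorch.

```python
from collections import Counter

def weights_from_targets(targets):
    """Return one sampling weight per sample, inversely proportional to class frequency."""
    counts = Counter(targets)   # class index -> number of samples in that class
    total = len(targets)
    return [total / counts[t] for t in targets]

# Hypothetical labels: three samples of class 0, one sample of class 1
targets = [0, 0, 1, 0]
weights = weights_from_targets(targets)
print(weights)
```

The resulting list can then be passed on as WeightedRandomSampler(weights, len(weights)).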

rks48beu1#

This code looks a bit complicated. You can try the following:

import torch
from torch.utils.data import WeightedRandomSampler

# Let there be 9 samples in class 0 and 1 sample in class 1
class_counts = [9.0, 1.0]
num_samples = sum(class_counts)
labels = [0] * 9 + [1]  # corresponding labels of the samples

# Weight each class inversely to its frequency, then look up one weight per sample
class_weights = [num_samples/class_counts[i] for i in range(len(class_counts))]
weights = [class_weights[labels[i]] for i in range(int(num_samples))]
sampler = WeightedRandomSampler(torch.DoubleTensor(weights), int(num_samples))

njthzxwz2#

Here is an alternative solution:

import numpy as np
from torch.utils.data.sampler import WeightedRandomSampler

counts = np.bincount(y)
labels_weights = 1. / counts
weights = labels_weights[y]
WeightedRandomSampler(weights, len(weights))

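where y is the list of labels corresponding to each sample, has shape (n_samples,), and is encoded as integers in [0, ..., n_classes - 1].
The weights do not need to sum to 1; according to the official documentation, that is fine.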
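
To illustrate the point about unnormalized weights, here is a small sketch on made-up data. The label array `y`, the class sizes, and the fixed seed are all assumptions chosen for the demonstration; the weights sum to 2 rather than 1, yet the sampler should still draw the two classes about equally often.

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical imbalanced labels: 900 samples of class 0, 100 of class 1
y = np.array([0] * 900 + [1] * 100)

counts = np.bincount(y)
weights = (1.0 / counts)[y]   # per-sample weights; they sum to about 2.0, not 1.0

generator = torch.Generator().manual_seed(0)   # fixed seed only for repeatability
sampler = WeightedRandomSampler(torch.as_tensor(weights), num_samples=len(weights),
                                replacement=True, generator=generator)

indices = list(sampler)       # sample indices, drawn with replacement
drawn = y[indices]            # labels of the drawn samples
print("weight sum:", weights.sum())
print("fraction of class 1 drawn:", drawn.mean())
```

With these weights each class has the same total probability mass, so the printed fraction should land close to 0.5 even though class 1 makes up only 10% of the data.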

j0pj023g3#

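The previous answers cover single-label classification. For multi-label classification you have to do it differently.
Suppose you have 10000 samples and 10 classes, and you want to use WeightedRandomSampler. The weights you pass to WeightedRandomSampler are the weights for each of those 10000 samples, not for the classes. So you have to compute a weight for each sample by aggregating the class weights of that sample.
Here is one way to do it. This is for one-hot encoded labels: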

import numpy as np
from torch.utils.data import BatchSampler, DataLoader, WeightedRandomSampler

# Assuming you have already created your train_dataset object, which has all the labels stored.

def calc_sample_weights(labels, class_weights):
    # Aggregate weights by `sum`. You may use another aggregation.
    return sum(labels * class_weights)

# Specify class weights. You can use any of the methods in the other answers to calculate class_weights.
class_weights = np.array([...])

# Create sample weights, i.e. weights for each of the 10000 samples.
sample_weights = [calc_sample_weights(label, class_weights)
                  for label in train_dataset.labels]

# Create WeightedRandomSampler.
weighted_sampler = WeightedRandomSampler(sample_weights, len(train_dataset))

# Create Batch Sampler for retrieving batches of samples
batch_size = 32
batch_sampler = BatchSampler(weighted_sampler, batch_size, drop_last=False)

# Create train dataloader
train_loader = DataLoader(train_dataset, batch_sampler=batch_sampler)

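In the code above, we compute each sample's weight by multiplying class_weights element-wise with that sample's class labels and then aggregating the products with a sum. So if the class weights are [1.0, 0.5, 0] and a sample's labels are one-hot encoded as [1, 0, 1], the total weight of that sample is 1.0. For labels that are not one-hot encoded you can do something similar: index class_weights with the sample's class label indices and then aggregate the selected weights.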
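
The index-based variant for labels that are not one-hot encoded could look like the sketch below. The helper name `calc_sample_weight_from_indices` and the tiny `labels` list are made up for illustration; the class weights are the ones from the worked example.

```python
import numpy as np

def calc_sample_weight_from_indices(label_indices, class_weights):
    """Sum the class weights of every class index attached to one sample."""
    selected = class_weights[np.asarray(label_indices, dtype=int)]
    return float(selected.sum())

# Class weights taken from the worked example in the text
class_weights = np.array([1.0, 0.5, 0.0])

# Hypothetical multi-label data: each sample lists the indices of its classes
labels = [[0, 2], [1], [0, 1, 2]]

sample_weights = [calc_sample_weight_from_indices(label, class_weights)
                  for label in labels]
print(sample_weights)
```

The first sample carries classes 0 and 2, which matches the one-hot vector [1, 0, 1], and it again gets a weight of 1.0.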
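Note that the answer's code also creates a BatchSampler. This is because, if you are sampling in batches, you should not use weighted_sampler directly; you should use a BatchSampler.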
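
For comparison, here is a minimal sketch of the two ways to get batches out of a weighted sampler, on a made-up 10-sample dataset. The dataset, the weights, and the batch size of 4 are assumptions for the demonstration. As far as the DataLoader documentation describes it, passing `sampler` together with `batch_size` makes DataLoader do the batching around that sampler itself, so both loaders should yield batches of the same sizes.

```python
import torch
from torch.utils.data import (BatchSampler, DataLoader, TensorDataset,
                              WeightedRandomSampler)

# Hypothetical toy dataset: 10 samples with 3 features each
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
targets = torch.tensor([0] * 9 + [1])
dataset = TensorDataset(features, targets)

# Per-sample weights: the single class-1 sample is as likely as all class-0 samples together
sample_weights = torch.tensor([1.0 / 9] * 9 + [1.0], dtype=torch.double)
weighted_sampler = WeightedRandomSampler(sample_weights, num_samples=len(dataset))

# Option 1: explicit BatchSampler, as in the answer
batch_sampler = BatchSampler(weighted_sampler, batch_size=4, drop_last=False)
loader_a = DataLoader(dataset, batch_sampler=batch_sampler)

# Option 2: let DataLoader batch the indices produced by the same sampler
loader_b = DataLoader(dataset, sampler=weighted_sampler, batch_size=4)

sizes_a = [len(x) for x, _ in loader_a]
sizes_b = [len(x) for x, _ in loader_b]
print(sizes_a, sizes_b)
```

Which samples end up in each batch is random and will differ between the two loaders; only the batch structure is compared here.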