Hadoop Streaming: how to multiply a matrix by a vector when the matrix is stored across many files

Posted by yks3o0rb on 2021-05-29 in Hadoop

I have a matrix like this:

1,1,2
2,3,4
6,4,6
1,2,4
3,6,3
4,6,2
4,5,8
3,4,4

and a vector:

1,3
4,5
5,4
6,2

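They are stored in two separate files. I need to multiply them column by column. Each matrix record has the form (i,j,v), where i is the row number, j is the column number, and v is the value. Each vector record has the form (j,v).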
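For reference, here is a minimal local sketch of the computation described above, using the sample data. It rests on two assumptions that the question does not state: a column with no vector entry contributes zero, and the products are summed per row. The function name `multiply` is only illustrative.

```python
from collections import defaultdict


def multiply(matrix_records, vector_records):
    """Return {row: sum of v * vector[j]} for sparse (i, j, v) and (j, v) records."""
    vector = dict(vector_records)
    result = defaultdict(int)
    for i, j, v in matrix_records:
        # Assumption: a column with no vector entry contributes zero.
        result[i] += v * vector.get(j, 0)
    return dict(result)


matrix = [(1, 1, 2), (2, 3, 4), (6, 4, 6), (1, 2, 4),
          (3, 6, 3), (4, 6, 2), (4, 5, 8), (3, 4, 4)]
vector = [(1, 3), (4, 5), (5, 4), (6, 2)]

print(sorted(multiply(matrix, vector).items()))
# [(1, 6), (2, 0), (3, 26), (4, 36), (6, 30)]
```

Note that the mapper in this question behaves differently for unmatched columns: it leaves those values unmultiplied rather than treating them as zero.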
I wrote a mapper:


#!/usr/bin/env python

import sys

# class to store matrix records

class MatrixRecord(object):
    def __init__( self ):
        self.i= None
        self.j= None
        self.v= None

# class to store vector records

class VectorRecord(object):
    def __init__( self ):
        self.j= None
        self.v= None

# lists to store objects

listOfMatrixRecords = []
listOfVectorRecords = []

# input comes from STDIN (standard input)

for line in sys.stdin:
    # remove leading and trailing whitespace and split
    splittedLine = line.strip().split(",")

    # matrix element - body looks like
    # 1,3,6
    if len(splittedLine) == 3:
        x = MatrixRecord()
        x.i = splittedLine[0]
        x.j = splittedLine[1]
        x.v = splittedLine[2]
        listOfMatrixRecords.append(x)  # add it to matrix records list
    # vector element - body looks like
    # 2,4
    elif len(splittedLine) == 2:
        y = VectorRecord()
        y.j = splittedLine[0]
        y.v = splittedLine[1]
        listOfVectorRecords.append(y)  # add it to vector records list
    # anything else (for example a blank line) is skipped

# get matrix records and multiply them by vector values

vectorPosition = {record.j for record in listOfVectorRecords} #gets j properties of objects from vector
matrixPosition = {record.j for record in listOfMatrixRecords} #gets j properties of objects from matrix

for duplicate in vectorPosition & matrixPosition: #checks for duplicates between matrix and vector
    for x in listOfMatrixRecords:
        if x.j == duplicate:    # if there's a duplicate, it means that we must multiply
            for y in listOfVectorRecords:
                if y.j == x.j:
                    x.v = int(x.v) * int(y.v)

# return result to stdout, reducer will take it as input

for x in listOfMatrixRecords:
    print ('%s\t%s' % (x.i,x.v))

But it only works when everything is stored in a single input file (rather than several), because a new mapper is created for each file, so

listOfMatrixRecords = []
listOfVectorRecords = []

never contain all of the matrix/vector records.
Is there a way to write a custom shuffle method for Hadoop Streaming?
This is how I launch Hadoop:

hadoop jar "D:\hadoop-2.7.1\share\hadoop\tools\lib\hadoop-streaming-2.7.1.jar" -mapper "python D:\map.py" -reducer "python D:\reducer.py" -input /input/* -output /output
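One possible direction, offered as a sketch rather than a confirmed solution, is to rely on the standard shuffle instead of customizing it. Hadoop Streaming sorts mapper output by key, where the key is by default the text before the first tab, so a mapper that emits the column index j as the key brings the matrix and vector records for that column together in the reducer, whichever file they came from. The names `map_line` and `reduce_lines`, the `M`/`V` tags, and the `map`/`reduce` command-line argument are all hypothetical. The sketch also assumes that a column with no vector entry contributes nothing.

```python
#!/usr/bin/env python
"""Sketch of a reduce-side join on the column index j (helper names are hypothetical)."""
import sys
from itertools import groupby


def map_line(line):
    """Turn one input line into a 'j<TAB>tagged value' record, or None if unrecognized."""
    fields = line.strip().split(",")
    if len(fields) == 3:          # matrix record: i,j,v
        i, j, v = fields
        return "%s\tM,%s,%s" % (j, i, v)
    if len(fields) == 2:          # vector record: j,v
        j, v = fields
        return "%s\tV,%s" % (j, v)
    return None                   # blank or malformed line


def reduce_lines(lines):
    """Join key-sorted mapper output on j and yield 'i<TAB>product' strings."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for j, group in groupby(parsed, key=lambda kv: kv[0]):
        vector_value = None
        matrix_entries = []
        # Values within one key are not ordered, so buffer the whole group.
        for _, value in group:
            parts = value.split(",")
            if parts[0] == "V":
                vector_value = int(parts[1])
            else:
                matrix_entries.append((parts[1], int(parts[2])))
        if vector_value is None:
            continue  # assumption: no vector entry for this column means zero
        for i, v in matrix_entries:
            yield "%s\t%d" % (i, v * vector_value)


# Hypothetical convention: "python join.py map" or "python join.py reduce".
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    if sys.argv[1] == "reduce":
        for out in reduce_lines(sys.stdin):
            print(out)
    else:
        for line in sys.stdin:
            record = map_line(line)
            if record is not None:
                print(record)
```

The output is one product per matrix element, keyed by row number i. Summing those products per row would need a second pass, for example another streaming job whose reducer adds up the values for each i.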
