从笛卡尔积中提取不同记录的java算法

hyrbngr7  于 2021-06-23  发布在  Mysql
关注(0)|答案(2)|浏览(303)

我有两张table(例如a和b)。我的任务是将b与a同步,也就是说,如果a中有记录,但b中没有记录,则向b添加记录;如果记录在b中但不在a中,则从b中删除记录。
a和b可以有重复的记录,这样如果a中的记录是重复的,b也应该有重复的记录。a和b中的样本数据


**Table A**                          **Table B**

    id    identifier                      id       identifier
    100   capital                         1001     bat
    201   bat                             1002     bat
    202   bat                             1003     bat
                                          5010     keyboard

为此,我使用outer join从a和b获取了记录,因此我的输出如下所示:

A.id  B.id   identifier
    100   null    capital
    201   1001    bat
    201   1002    bat   
    201   1003    bat
    202   1001    bat
    202   1002    bat
    202   1003    bat
    null  5010    keyboard

因此,在上述情况下,100和5010分别是添加和删除候选,这很容易理解。
问题是发现1003也是一个删除候选。因为201和202分别Map到1001和1002。
我可以在数据库中这样做,通过在mysql中对数据库中的重复项进行编号:避免自连接时重复记录的笛卡尔积,但是由于一些限制,我只能使用outer join加载上述格式的数据。因此,我需要一个java算法来完成上述任务。提前谢谢。

nnsrf1az

nnsrf1az1#

我最终想到了这个算法,它不是非常干净或聪明,但似乎做的工作:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

class SyncAlgorithm {

    static class JoinResult {
        public final Integer aId;
        public final Integer bId;
        public final String identifier;
        public JoinResult(Integer aId, Integer bId, String identifier) {
            this.aId = aId;
            this.bId = bId;
            this.identifier = identifier;
        }
    }

    public static void main(String[] args) {
        List<JoinResult> table = makeTestTable();
        System.out.println("Initial table:");
        printTable(table);
        System.out.println();

        Iterator<JoinResult> iter = table.iterator();
        // A.id values we have seen
        Map<String, Set<Integer>> aSeen = new HashMap<String, Set<Integer>>();
        // A.id values we have used
        Map<String, Set<Integer>> aUsed = new HashMap<String, Set<Integer>>();
        // B.id values we have seen
        Map<String, Set<Integer>> bUsed = new HashMap<String, Set<Integer>>();
        // Loop over table to remove unnecessary rows
        while (iter.hasNext()) {
            JoinResult row = iter.next();
            // Make sure sets exist for current identifier
            if (!aSeen.containsKey(row.identifier)) {
                aSeen.put(row.identifier, new HashSet<Integer>());
            }
            if (!aUsed.containsKey(row.identifier)) {
                aUsed.put(row.identifier, new HashSet<Integer>());
            }
            if (!bUsed.containsKey(row.identifier)) {
                bUsed.put(row.identifier, new HashSet<Integer>());
            }
            // If there is no match in A remove
            if (row.aId == null) {
                iter.remove();
            // If both A.id and B.id are note null
            } else if (row.bId != null) {
                // Mark A.id as seen
                aSeen.get(row.identifier).add(row.aId);
                // If A.id or B.id were already used discard row
                if (aUsed.get(row.identifier).contains(row.aId) || bUsed.get(row.identifier).contains(row.bId)) {
                    iter.remove();
                // If both ids are new mark them as used and keep the row
                } else {
                    aUsed.get(row.identifier).add(row.aId);
                    bUsed.get(row.identifier).add(row.bId);
                }
            // If A.id is not null but B.id is null save A.id and keep the row
            } else {
                aSeen.get(row.identifier).add(row.aId);
                aUsed.get(row.identifier).add(row.aId);
            }
        }
        // Add A.id values without that have been seen but not used
        for (Map.Entry<String, Set<Integer>> aSeenEntry : aSeen.entrySet())
        {
            Set<Integer> aSeenId = aSeenEntry.getValue();
            aSeenId.removeAll(aUsed.get(aSeenEntry.getKey()));
            for (Integer aId : aSeenId) {
                table.add(new JoinResult(aId, null, aSeenEntry.getKey()));
            }
        }

        System.out.println("Result table:");
        printTable(table);
    }

    static List<JoinResult> makeTestTable() {
        List<JoinResult> table = new ArrayList<JoinResult>();
        table.add(new JoinResult(100, null, "capital"));
        table.add(new JoinResult(201, 1001, "bat"));
        table.add(new JoinResult(201, 1002, "bat"));
        table.add(new JoinResult(201, 1003, "bat"));
        table.add(new JoinResult(202, 1001, "bat"));
        table.add(new JoinResult(202, 1002, "bat"));
        table.add(new JoinResult(202, 1003, "bat"));
        table.add(new JoinResult(null, 5010, "keyboard"));
        table.add(new JoinResult(501, 3001, "foo"));
        table.add(new JoinResult(502, 3001, "foo"));
        return table;
    }

    static void printTable(List<JoinResult> table) {
        System.out.println("A.id    B.id    identifier");
        for (JoinResult row : table) {
            System.out.printf("%-8d%-8d%s\n", row.aId, row.bId, row.identifier);
        }
    }
}

输出:

Initial table:
A.id    B.id    identifier
100     null    capital
201     1001    bat
201     1002    bat
201     1003    bat
202     1001    bat
202     1002    bat
202     1003    bat
null    5010    keyboard
501     3001    foo
502     3001    foo

Result table:
A.id    B.id    identifier
100     null    capital
201     1001    bat
202     1002    bat
501     3001    foo
502     null    foo
7nbnzgx9

7nbnzgx92#

我是这样解决的:
从表a和表b获取数据。
表a和表b的bucket数据,使用:

Map<String, SameBucketObject>

其中key是'identifier',samebucketobject是:

class SameBucketObject{
      private List<String> idsOfA;
      private List<String> idsOfB;
    // getter, setters, addToList statements  
    }

基本上,我按标识符将表a和表b的所有元素分组。
在每个bucket中,检查 idsOfA b的元素 idsOfB ,如果

sizeOf(idsOfA) < sizeOf(idsOfB) -> add elements with ids in idsOfB List from Table B to Table A
    sizeOf(idsOfA) > sizeOf(idsOfB) -> delete sizeOf(idsOfA) - sizeOf(idsOfB) elements from A from last.
    sizeOf(idsOfA) = sizeOf(idsOfB) -> no action.

这种方法不需要额外的空间

相关问题