fuzzy连接

t1qtbnec  于 2021-06-24  发布在  Hive
关注(0)|答案(2)|浏览(372)

我有一个包含用户名(约1000行)的表,称为“潜在用户”,另一个表称为“实际用户”(约1000万行)。所有记录仅由[a-z]字符组成,没有空格。另外,我知道所有潜在的\u用户都不在实际的\u users表中。
我想能够计算出潜在用户的每一行,根据levenshtein距离,实际用户中最近的记录是多少。例如:

| potential_users|
|----------------|
| user1          |
| kajd           |
| bbbbb          |

| actual_users |
|--------------|
| kaj          |
| bbbbbbb      |
| user         |

将返回:

| potential_users | actual_users | levenshtein_distance |
|-----------------|--------------|----------------------|
| user1           | user         | 1                    |
| kajd            | kaj          | 1                    |
| bbbbb           | bbbbbbb      | 2                    |

如果表很短,我可以做一个交叉连接,为潜在用户中的每条记录计算实际用户中的levenshtein距离,然后返回值最低的一个。但是,在我的例子中,这将创建一个1000 x 10000行的中间表,这有点不切实际。
有没有一种更干净的方法来创建交叉连接来执行这样的操作?

rqdpfwrv

rqdpfwrv1#

我认为你不能用一个简单的连接来完成,有完整的算法来计算。这篇文章展示了levenshtein距离算法在sql中的实现:
https://www.sqlteam.com/forums/topic.asp?topic_id=51540&whichpage=1

xt0899hw

xt0899hw2#

不幸的是,没有交叉连接是不可能做到的。最后,每个潜在用户都需要与每个实际用户进行测试。
然而,presto将在多个线程和机器上并行执行join,因此只要有足够的硬件,它就可以非常快速地执行。注意,在presto中,中间结果是从一个操作符流到另一个操作符的,因此对于这个查询没有10mx1k行的“中间表”。
对于这样的查询

SELECT potential, min_by(actual, distance), min(distance)
FROM (
    SELECT *, levenshtein_distance(potential, actual) distance
    FROM actual_users, potential_users
)
GROUP BY potential

这是查询计划:

Query Plan                                                   
----------------------------------------------------------------------------------------------------------------
 Fragment 0 [SINGLE]                                                                                            
     Output layout: [potential, min_by, min]                                                                    
     Output partitioning: SINGLE []                                                                             
     Stage Execution Strategy: UNGROUPED_EXECUTION                                                              
     Output[potential, _col1, _col2]                                                                            
     │   Layout: [potential:varchar(5), min_by:varchar(7), min:bigint]                                          
     │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                                
     │   _col1 := min_by                                                                                        
     │   _col2 := min                                                                                           
     └─ RemoteSource[1]                                                                                         
            Layout: [potential:varchar(5), min_by:varchar(7), min:bigint]                                       

 Fragment 1 [HASH]                                                                                              
     Output layout: [potential, min_by, min]                                                                    
     Output partitioning: SINGLE []                                                                             
     Stage Execution Strategy: UNGROUPED_EXECUTION                                                              
     Aggregate(FINAL)[potential]                                                                                
     │   Layout: [potential:varchar(5), min:bigint, min_by:varchar(7)]                                          
     │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                                
     │   min := min("min_1")                                                                                    
     │   min_by := min_by("min_by_0")                                                                           
     └─ LocalExchange[HASH] ("potential")                                                                       
        │   Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]    
        │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                             
        └─ RemoteSource[2]                                                                                      
               Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))] 

 Fragment 2 [SOURCE]                                                                                            
     Output layout: [potential, min_1, min_by_0]                                                                
     Output partitioning: HASH [potential]                                                                      
     Stage Execution Strategy: UNGROUPED_EXECUTION                                                              
     Aggregate(PARTIAL)[potential]                                                                              
     │   Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]       
     │   min_1 := min("levenshtein_distance")                                                                   
     │   min_by_0 := min_by("actual", "levenshtein_distance")                                                   
     └─ Project[]                                                                                               
        │   Layout: [actual:varchar(7), potential:varchar(5), levenshtein_distance:bigint]                      
        │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                             
        │   levenshtein_distance := levenshtein_distance("potential", "actual")                                 
        └─ CrossJoin                                                                                            
           │   Layout: [actual:varchar(7), potential:varchar(5)]                                                
           │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                          
           │   Distribution: REPLICATED                                                                         
           ├─ TableScan[memory:9, grouped = false]                                                              
           │      Layout: [actual:varchar(7)]                                                                   
           │      Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}                                     
           │      actual := 0                                                                                   
           └─ LocalExchange[SINGLE] ()                                                                          
              │   Layout: [potential:varchar(5)]                                                                
              │   Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: ?}                                      
              └─ RemoteSource[3]                                                                                
                     Layout: [potential:varchar(5)]                                                             

 Fragment 3 [SOURCE]                                                                                            
     Output layout: [potential]                                                                                 
     Output partitioning: BROADCAST []                                                                          
     Stage Execution Strategy: UNGROUPED_EXECUTION                                                              
     TableScan[memory:8, grouped = false]                                                                       
         Layout: [potential:varchar(5)]                                                                         
         Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}                                              
         potential := 0                                                                                         

(1 row)

特别是,对于这个部分,一旦交叉连接产生一行,它就被输入到投影操作符中,该操作符计算两个值之间的levenshtein距离,然后输入到聚合中,聚合中每个“潜在”用户只存储一个组。因此,此查询所需的内存量应该较低。

Aggregate(PARTIAL)[potential]                                                                              
     │   Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]       
     │   min_1 := min("levenshtein_distance")                                                                   
     │   min_by_0 := min_by("actual", "levenshtein_distance")                                                   
     └─ Project[]                                                                                               
        │   Layout: [actual:varchar(7), potential:varchar(5), levenshtein_distance:bigint]                      
        │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                             
        │   levenshtein_distance := levenshtein_distance("potential", "actual")                                 
        └─ CrossJoin                                                                                            
           │   Layout: [actual:varchar(7), potential:varchar(5)]                                                
           │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                          
           │   Distribution: REPLICATED

相关问题