Parsing HTML data in SQL Server 2016 for large table

0yycz8jy  于 2023-04-10  发布在  SQL Server
关注(0)|答案(2)|浏览(133)

I am trying to parse the HTML data which exists in one of my table but issue is the table size is huge and I can't use the function provided here. I was just wondering if there is any efficient way to parse the data from a huge table.

I tried the solution here and it works great but can't use for huge table. https://blog.sqlauthority.com/2007/06/16/sql-server-udf-user-defined-function-to-strip-html-parse-html-no-regular-expression/

Sample code here: http://sqlfiddle.com/#!18/27baba/1

Data looks like this:

ID  Valuen
  1 <t><some>this is the value</t></some>another value . <none> 3rd value </none>
  2 <t><bold>Correct value </t></bold>new value . <color> salary </color>

Sample Output:

ID     Valuen
   1     this is the value another value 3rd value
   2     Correct value  new value . salary

Please note that I tried using XML parsing as well but it does not work as its unable to convert all the data to XML.

Any help will be appreciated, thank you.

a11xaf1n

a11xaf1n1#

If open to a helper function

Normalized Example or dbFiddle

Select A.ID
      ,B.*
 From  test A
 Cross Apply   [dbo].[tvf-Str-Extract-JSON](ValueN,'>','<') B

Returns

ID  RetSeq  RetVal
1   1       this is the value
1   2       another value .
1   3       3rd value
2   1       Correct value
2   2       new value .
2   3       salary

Same as Above, Just Added a string_agg()

Select A.ID
      ,B.*
 From  test A
 Cross Apply  ( Select NewVal = string_agg(RetVal,' ') WITHIN GROUP (ORDER BY RetSeq) 
                  From [dbo].[tvf-Str-Extract-JSON](ValueN,'>','<')  
              )B

Results

ID  NewVal
1   this is the value another value .   3rd value
2   Correct value  new value .   salary

The Function if Interested

CREATE FUNCTION [dbo].[tvf-Str-Extract-JSON] (@String nvarchar(max),@Delim1 nvarchar(100),@Delim2 nvarchar(100))
Returns Table 
As
Return (  

    Select RetSeq = row_number() over (order by RetSeq)
          ,RetVal = left(RetVal,charindex(@Delim2,RetVal)-1)
    From  (
            Select RetSeq = [Key]+1
                  ,RetVal = trim(Value)
             From  OpenJSON( N'["'+replace(string_escape(@String,'json'),@Delim1,'","')+N'"]' )

          ) C1
    Where charindex(@Delim2,RetVal)>1

)
wj8zmpe1

wj8zmpe12#

Please try the following solution on your data set.

SQL

-- DDL and sample data population, start
DECLARE @tbl TABLE (ID int, tokens VARCHAR (500)); 
INSERT INTO @tbl(id, tokens) VALUES 
(1, '<t><some>this is the value</t></some>another value .  <none> 3rd value </none>'),
(2, '<t><bold>Correct value </t></bold>new value .  <color> salary </color>');
-- DDL and sample data population, end

SELECT t.ID
    , Result = STRING_AGG(LEFT(token, CHARINDEX('<', token + '<') - 1), SPACE(1))
FROM @tbl AS t
    CROSS APPLY (SELECT TRY_CAST('<root><r><![CDATA[' + 
        REPLACE(tokens, '>', ']]></r><r><![CDATA[') + 
        ']]></r></root>' AS XML)) AS t1(x)
    CROSS APPLY x.nodes('/root/r[text()][not(substring(text()[1],1,1)="<")]/text()') AS t2(c)
    CROSS APPLY (SELECT c.value('.','VARCHAR(500)')) AS t3(token)
GROUP BY ID;

Output

IDResult
1this is the value another value . 3rd value
2Correct value new value . salary

相关问题