Erlang: how do I correctly measure regex (re) performance?

8oomwypt posted on 2022-12-08 in Erlang

I'm trying some regex performance tests (I've heard rumors that Erlang is slow):

> Fun = fun F(X) -> case X > 1000000 of true -> ok; false -> Y = X + 1, re:run(<<"1ab1jgjggghjgjgjhhhhhhhhhhhhhjgdfgfdgdfgdfgdfgdfgdfgdfgdfgdfgfgv">>, "^[a-zA-Z0-9_]+$"), F(Y) end end.
#Fun<erl_eval.30.128620087>
> timer:tc(Fun, [0]).
{17233982,ok}
> timer:tc(Fun, [0]).
{17155982,ok}

And here are some tests after compiling the regex first:

> {ok, MP} = re:compile("^[a-zA-Z0-9_]+$").
{ok,{re_pattern,0,0,0,
            <<69,82,67,80,107,0,0,0,16,0,0,0,1,0,0,0,255,255,255,
              255,255,255,...>>}}
> Fun = fun F(X) -> case X > 1000000 of true -> ok; false -> Y = X + 1, re:run(<<"1ab1jgjggghjgjgjhhhhhhhhhhhhhjgdfgfdgdfgdfgdfgdfgdfgdfgdfgdfgfgv">>, MP), F(Y) end end.
#Fun<erl_eval.30.128620087>
> timer:tc(Fun, [0]).
{15796985,ok}
> timer:tc(Fun, [0]).
{15921984,ok}

http://erlang.org/doc/man/timer.html :
Unless otherwise stated, time is always measured in milliseconds.
http://erlang.org/doc/man/re.html#compile-1 :
Compiling the regular expression before matching is useful if the same expression is to be used in matching against multiple subjects during the lifetime of the program. Compiling once and executing many times is far more efficient than compiling each time one wants to match.

Questions

  1. Why does timer:tc return microseconds? (Shouldn't it be milliseconds?)
  2. Compiling the regex doesn't make much difference. Why?
  3. Should I bother compiling it?

rmbxnbpk1#

  1. In the timer module, tc/2 returns the elapsed time in microseconds, as its documentation states:
    tc(Fun) -> {Time, Value}
    tc(Fun, Arguments) -> {Time, Value}
    tc(Module, Function, Arguments) -> {Time, Value}
    Types
    Module = module()
    Function = atom()
    Arguments = [term()]
    Time = integer()
      In microseconds
    Value = term()
  2. In case 1, Fun has to compile the string "^[a-zA-Z0-9_]+$" on every recursive call (1 million times). In case 2, you compile the pattern once up front and pass the compiled result into the recursion, which is why case 2 takes less time than case 1. From the re documentation:
    run(Subject, RE) -> {match, Captured} | nomatch
    Subject = iodata() | unicode:charlist()
    RE = mp() | iodata()
    The regular expression can be specified either as iodata() in which case it is automatically compiled (as by compile/2) and executed, or as a precompiled mp() in which case it is executed against the subject directly.
  3. Yes, you should compile the pattern once, before passing it into the recursion.
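
To make the unit question concrete, here is a minimal sketch of a wrapper that converts the microseconds reported by timer:tc/2 into milliseconds. The module name re_units and the function time_ms/2 are hypothetical names chosen for this illustration; only timer:tc/2 comes from the standard library.

```erlang
-module(re_units).
-export([time_ms/2]).

%% Run Fun with Args and report the elapsed time in milliseconds.
%% timer:tc/2 returns {Microseconds, Result}, so divide by 1000.
%% (re_units and time_ms/2 are hypothetical helper names.)
time_ms(Fun, Args) ->
    {Micros, Result} = timer:tc(Fun, Args),
    {Micros / 1000, Result}.
```

With this helper, the 17233982 reported in the question would be shown as roughly 17234 ms, about 17 seconds for the million iterations.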

q3aa05252#

Yes, you should compile the code before trying to measure performance. When you type the code into the shell, the code will be interpreted, not compiled into byte code. I saw a big improvement when putting the code into a module:

7> timer:tc(Fun, [0]).
{6253194,ok}
8> timer:tc(fun foo:run/1, [0]).
{1768831,ok}

(Both of those are with compiled regexp.)

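-module(foo).

%% Export only the entry point used by timer:tc(fun foo:run/1, [0]).
-export([run/1]).

%% Compile the pattern once, then loop with the compiled pattern.
run(X) ->
    {ok, MP} = re:compile("^[a-zA-Z0-9_]+$"),
    run(X, MP).

run(X, _MP) when X > 1000000 ->
    ok;
run(X, MP) ->
    Y = X + 1,
    re:run(<<"1ab1jgjggghjgjgjhhhhhhhhhhhhhjgdfgfdgdfgdfgdfgdfgdfgdfgdfgdfgfgv">>, MP),
    %% Recurse into run/2 so the compiled pattern is reused, not recompiled.
    run(Y, MP).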
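
To compare the two variants without the shell's interpreter overhead, both loops can be put in one compiled module. The following is only a sketch under assumptions: the module name re_bench, the function compare/1, and the map keys are hypothetical, and the numbers it produces will depend on the machine and OTP release.

```erlang
-module(re_bench).
-export([compare/1]).

-define(SUBJECT, <<"1ab1jgjggghjgjgjhhhhhhhhhhhhhjgdfgfdgdfgdfgdfgdfgdfgdfgdfgdfgfgv">>).
-define(PATTERN, "^[a-zA-Z0-9_]+$").

%% Time N matches with the pattern given as a string (compiled on
%% every call to re:run/2) and N matches with a precompiled pattern.
%% Both times are in microseconds, as returned by timer:tc/1.
compare(N) when is_integer(N), N >= 0 ->
    {ok, MP} = re:compile(?PATTERN),
    {RawUs, ok} = timer:tc(fun() -> loop(N, ?PATTERN) end),
    {CompiledUs, ok} = timer:tc(fun() -> loop(N, MP) end),
    #{uncompiled_us => RawUs, compiled_us => CompiledUs}.

%% Run re:run/2 N times against the fixed subject.
loop(0, _RE) ->
    ok;
loop(N, RE) ->
    _ = re:run(?SUBJECT, RE),
    loop(N - 1, RE).
```

After compiling with c(re_bench). in the shell, calling re_bench:compare(1000000). returns both timings in one map, so the effect of precompiling the pattern can be read directly from compiled code.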
