spaCy: cannot train an Arabic model with a custom tokenizer

lztngnrs  posted 3 months ago  in Other


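  • Is it possible to create a fully custom tokenizer that defines no custom rules or extra methods and simply redefines the main __call__ method?
  • In that case, where can I find documentation on how to use the Vocab API to populate the vocabulary while tokenizing?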
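To make the first question concrete, here is a minimal sketch of a tokenizer that only defines __call__. The class name WhitespaceTokenizer and the helper split_words_and_spaces are hypothetical, and whitespace splitting is only a stand-in for the real Arabic segmentation logic. It assumes that a Doc can be built directly from a list of words and a list of trailing-space flags; as far as I know, building the Doc this way also adds the token strings to the shared vocabulary, so no separate Vocab call should be needed.

```python
import re


def split_words_and_spaces(text):
    """Split on whitespace; return parallel lists (words, spaces).

    spaces[i] is True when words[i] is followed by a single space character.
    """
    words, spaces = [], []
    for match in re.finditer(r"\S+", text):
        words.append(match.group())
        end = match.end()
        spaces.append(end < len(text) and text[end] == " ")
    return words, spaces


class WhitespaceTokenizer:
    """Hypothetical tokenizer whose only behaviour is defined in __call__."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Imported here so the splitting helper above can be tried without spaCy.
        from spacy.tokens import Doc

        words, spaces = split_words_and_spaces(text)
        # Building the Doc from plain strings lets spaCy intern them in the vocab.
        return Doc(self.vocab, words=words, spaces=spaces)
```

With spaCy installed, it could be attached with nlp = spacy.blank("ar") followed by nlp.tokenizer = WhitespaceTokenizer(nlp.vocab). To use it with the train command it would presumably also have to be registered through @spacy.registry.tokenizers and referenced from config.cfg, and it may need serialization methods; neither is shown here. Runs of several spaces or tabs are not preserved by this sketch.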


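In the discussion on Arabic language support, in the comment "I'm willing to prototype a spaCy language model for Arabic (SMA)", I reported on the choice of a training set and on the unsatisfactory training results obtained with the native spaCy tokenizer. I then reported on integrating and adapting an alternative tokenizer whose output, according to the printout of the debug data command, aligns better with the tokens in the training set (after minor modifications to the training set itself).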
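As an illustration of what "better alignment" means here, the following sketch compares a tokenizer's output with gold tokens by character span. It is not how the debug data command computes its report; the function names token_spans and alignment_ratio are hypothetical, and whitespace is ignored for simplicity.

```python
def token_spans(tokens):
    """Character (start, end) spans of tokens, with whitespace ignored."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def alignment_ratio(predicted, gold):
    """Share of gold tokens whose span the tokenizer reproduces exactly."""
    if "".join(predicted) != "".join(gold):
        raise ValueError("token sequences do not cover the same text")
    gold_spans = token_spans(gold)
    if not gold_spans:
        return 1.0
    return len(gold_spans & token_spans(predicted)) / len(gold_spans)


# A tokenizer that leaves a clitic attached matches fewer gold spans
# than one that splits it the way the treebank does.
gold = ["w", "qal", "alrais"]
print(alignment_ratio(["wqal", "alrais"], gold))
print(alignment_ratio(["w", "qal", "alrais"], gold))
```

A ratio close to 1 would mean that nearly every gold token has an identically bounded token in the tokenizer output, which is the situation in which the gold annotations can actually be used for training.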

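  1. an exception raised by a parser-related module of the spaCy training software when running the train command with the same data and configuration used for debug data;
  2. very poor results (a low overall score) with a reduced configuration that excludes the parser.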
⚠ Aborting and saving the final best model. Encountered exception:
KeyError("[E900] Could not run the full pipeline for evaluation. If you
specified frozen components, make sure they were already initialized and
trained. Full pipeline: ['tok2vec', 'tagger', 'morphologizer',
'trainable_lemmatizer', 'parser']")
Traceback (most recent call last):
  File "C:\language310\lib\site-packages\spacy\training\", line 298, in evaluate
    scores = nlp.evaluate(dev_corpus(nlp))
  File "C:\language310\lib\site-packages\spacy\", line 1459, in evaluate
    for eg, doc in zip(examples, docs):
  File "C:\language310\lib\site-packages\spacy\", line 1618, in pipe
    for doc in docs:
  File "C:\language310\lib\site-packages\spacy\", line 1685, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy\pipeline\transition_parser.pyx", line 255, in pipe
  File "C:\language310\lib\site-packages\spacy\", line 1704, in raise_error
    raise e
  File "spacy\pipeline\transition_parser.pyx", line 252, in spacy.pipeline.transition_parser.Parser.pipe
  File "spacy\pipeline\transition_parser.pyx", line 345, in spacy.pipeline.transition_parser.Parser.set_annotations
  File "spacy\pipeline\_parser_internals\nonproj.pyx", line 176, in spacy.pipeline._parser_internals.nonproj.deprojectivize
  File "spacy\pipeline\_parser_internals\nonproj.pyx", line 181, in spacy.pipeline._parser_internals.nonproj.deprojectivize
  File "spacy\strings.pyx", line 160, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '8206900633647566924'. This usually refers to an issue with the `Vocab` or `StringStore`."

The above exception was the direct cause of the following exception:


  • Operating system: Windows 11
  • Python version used: 3.10
  • spaCy version used: 3.7



  • ...or should I use Cython and try to mimic the standard Tokenizer class, so as to interact more directly with the vocabulary data structures?




============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'tagger', 'morphologizer', 'trainable_lemmatizer']
ℹ Initial learn rate: 0.001
---  ------  ------------  -----------  -------------  -------------  -------  -------  ---------  ---------  ------
  0       0          0.00       263.59         263.59         292.87    16.57    33.42      18.74      25.89    0.23
  0     200       3262.11     36358.51       36859.45       49349.43    70.98    83.54      71.36      54.52    0.68
 14    8800       7540.39      4534.11        4685.17        3129.70    84.87    90.86      85.10      85.51    0.86
 14    9000       8161.44      4675.95        4845.82        3460.23    84.86    90.76      85.03      85.59    0.86
✔ Saved pipeline to output directory




So far, the only thing that has worked for me is changing the min_action_freq parameter in the parser section of config.cfg to 1 (min_action_freq = 1).
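For reference, the change would look roughly like the fragment below. It assumes the section layout that spaCy normally generates for a parser component; the other keys of the section are omitted. As I understand the setting, it is the minimum frequency a labelled parser action needs in order to be kept, so a value of 1 keeps even actions seen only once in the training data.

```ini
[components.parser]
factory = "parser"
# keep every labelled action, including those seen only once
min_action_freq = 1
```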
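I attach the train output below: compared with training with the native spaCy tokenizer (see the discussion on Arabic language support), the overall score rose from 0.66 to 0.83 (+0.17), and all the partial scores improved as well, to varying degrees.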

python -m spacy train config.cfg --code ./ --output ./output --paths.train ./ar_padt-ud-train.spacy ./ar_padt-ud-dev.spacy
ℹ Saving to output directory: output
ℹ Using CPU

=========================== Initializing pipeline ===========================
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'tagger', 'morphologizer', 'trainable_lemmatizer', 'parser']
ℹ Initial learn rate: 0.001
---  ------  ------------  -----------  -------------  -------------  -----------  -------  -------  ---------  ---------  -------  -------  -------  ------
  0       0          0.00       263.59         263.59         292.87       579.49    16.57    33.42      18.74      25.89    14.80     4.49     0.17    0.20
  0     200       7090.62     36676.64       37140.17       49966.76     59734.23    70.06    83.29      70.49      52.76    62.21    52.22    31.19    0.64
  0     400      12506.15     21152.08       21428.89       35933.35     42154.89    76.25    87.07      76.56      64.43    67.80    58.52    59.02    0.71
  0     600      13606.38     17619.55       17874.69       27713.22     37639.12    78.65    88.43      78.92      70.97    69.27    60.54    64.39    0.75
  1     800      14572.25     14635.05       14903.07       22296.53     35976.71    80.09    88.77      80.39      75.03    69.67    61.62    67.23    0.76
  1    1000      16195.95     14686.61       14950.42       20141.24     36449.27    81.04    89.37      81.33      77.81    71.95    63.65    75.21    0.78
  1    1200      15651.39     13545.52       13750.79       17281.56     33520.08    81.54    89.72      81.90      79.10    72.17    64.41    66.77    0.79
  2    1400      15020.55     11283.96       11481.52       13772.12     31561.40    82.08    89.72      82.37      79.99    72.47    65.06    64.02    0.79
  2    1600      16437.48     11455.01       11663.24       13211.00     32402.13    82.13    90.01      82.40      81.02    73.59    65.87    75.84    0.80
  2    1800      17595.95     12163.83       12406.75       13397.03     32890.70    83.16    90.43      83.43      82.23    73.79    65.82    61.82    0.81
  3    2000      15506.24      9640.61        9816.51       10129.93     29419.00    83.06    90.37      83.31      82.21    73.72    66.11    54.67    0.81
  3    2200      17806.32     10608.22       10800.75       10703.91     31175.57    83.47    90.28      83.70      82.66    74.04    66.33    56.30    0.81
  3    2400      18095.58     10403.45       10628.51       10472.92     30771.19    83.68    90.33      83.91      83.18    74.21    67.04    58.34    0.81
  4    2600      17475.93      9238.98        9470.86        8978.11     29532.16    83.74    90.51      83.96      83.30    73.89    66.45    55.04    0.81
  4    2800      17691.83      8871.94        9025.45        8373.97     28670.14    83.95    90.72      84.19      83.78    74.84    67.47    64.33    0.82
  4    3000      18221.40      9058.19        9230.65        8602.03     28444.75    83.93    90.57      84.14      84.13    74.24    67.06    55.07    0.82
  5    3200      18954.61      8563.13        8774.97        7757.85     30104.06    84.09    90.65      84.31      83.84    74.27    67.12    55.43    0.82
  5    3400      19013.75      8424.62        8602.31        7607.48     28075.96    84.16    90.71      84.39      84.37    74.72    67.68    58.85    0.82
  5    3600      18708.03      8160.17        8316.78        7379.20     26281.13    84.29    90.64      84.50      84.34    74.65    67.66    58.19    0.82
  6    3800      18836.48      7550.73        7700.54        6759.15     27041.53    84.22    90.68      84.44      84.28    74.73    67.58    58.01    0.82
  6    4000      19369.11      7489.13        7681.20        6373.26     26804.98    84.29    90.67      84.49      84.42    74.51    67.41    55.31    0.82
  6    4200      20735.11      8339.28        8508.93        7176.00     27021.86    84.52    90.77      84.72      84.20    74.99    67.73    58.71    0.82
  7    4400      19234.16      6814.80        6989.01        5826.03     25340.43    84.58    90.80      84.80      84.53    74.91    67.76    57.98    0.82
  7    4600      19522.48      6755.72        6909.24        5606.62     25162.38    84.50    90.75      84.72      84.82    74.89    67.81    57.58    0.82
  7    4800      21583.23      7663.31        7850.63        6424.30     26639.64    84.62    90.85      84.81      84.96    74.93    67.94    56.50    0.82
  8    5000      19851.74      6527.48        6686.77        5153.68     25241.51    84.69    90.92      84.91      85.02    75.23    68.26    60.48    0.82
  8    5200      21314.75      6738.47        6886.67        5426.31     25201.63    84.38    90.65      84.64      84.91    75.19    68.03    57.75    0.82
  8    5400      22795.03      7113.36        7283.54        6061.03     25818.28    84.88    90.99      85.10      85.26    75.20    68.00    58.68    0.83
  9    5600      21136.55      6380.74        6530.58        5050.83     24897.23    84.88    91.01      85.10      85.20    74.90    67.92    56.95    0.82
  9    5800      21765.27      6235.19        6368.10        4876.71     24305.18    84.78    90.84      85.00      85.15    74.81    67.83    58.87    0.82
  9    6000      23302.27      6804.31        6982.51        5352.58     25129.89    84.12    90.29      84.36      85.11    74.95    67.99    60.22    0.82
 10    6200      21450.76      6064.13        6195.06        4605.89     23083.23    84.66    90.76      84.90      85.24    75.69    68.60    59.12    0.83
 10    6400      23464.79      6042.41        6205.93        4687.37     24596.00    84.86    90.91      85.08      85.36    75.70    68.67    59.56    0.83
 10    6600      23860.85      6253.02        6403.01        4964.86     24355.72    84.90    90.85      85.14      85.11    75.09    67.77    57.60    0.82
 11    6800      21873.04      5503.54        5635.16        4272.72     22186.62    84.79    90.79      84.98      85.24    75.32    68.02    56.34    0.83
 11    7000      24376.50      5840.32        5981.43        4420.84     23436.49    84.67    90.57      84.87      85.30    75.01    68.24    57.53    0.82
 11    7200      25574.41      6027.68        6218.32        4768.59     25033.99    85.08    90.93      85.34      85.37    75.45    68.48    56.75    0.83
 12    7400      24154.31      5612.50        5751.63        4241.89     22830.90    85.05    90.81      85.24      85.31    75.32    68.31    56.95    0.83
 12    7600      25775.95      5752.17        5892.24        4301.90     23344.18    84.91    90.74      85.13      85.31    75.54    68.65    61.42    0.83
 12    7800      25384.32      5511.17        5634.93        3967.66     23542.20    85.09    90.95      85.39      85.48    75.44    68.48    59.02    0.83
 13    8000      24808.20      5270.60        5397.93        4106.40     22249.33    85.16    90.92      85.40      85.29    75.10    68.27    61.08    0.83
 13    8200      25455.79      5224.74        5372.79        3757.57     22682.60    85.16    91.00      85.36      85.60    75.74    68.87    59.95    0.83
 13    8400      28854.76      5809.43        5956.61        4528.57     23880.97    85.29    91.09      85.48      85.52    75.55    68.63    59.38    0.83
 14    8600      24888.92      4971.59        5099.44        3654.07     21161.32    84.91    90.78      85.11      85.33    75.22    68.53    58.57    0.83
 14    8800      26581.91      4930.29        5059.93        3622.15     21528.82    84.77    90.77      84.94      85.19    75.38    68.56    58.52    0.83
 14    9000      28893.20      5379.39        5519.86        4138.56     23021.09    85.24    90.94      85.40      85.49    75.65    68.83    59.37    0.83
 15    9200      28123.25      5195.78        5341.69        3837.01     22810.12    85.03    90.88      85.24      85.51    75.74    68.80    62.05    0.83
 15    9400      27938.20      4776.08        4907.61        3526.56     21771.94    85.06    90.89      85.28      85.57    75.53    68.31    60.14    0.83
 15    9600      28987.51      5006.48        5153.82        3878.02     22215.08    85.15    90.84      85.30      85.67    75.88    68.96    59.76    0.83
 16    9800      27973.68      4872.55        5006.78        3581.61     20801.21    85.12    90.75      85.29      85.30    75.52    68.63    61.04    0.83
 16   10000      30470.21      4858.28        5011.76        3536.29     21887.46    85.26    90.99      85.49      85.47    75.36    68.60    55.96    0.83
 16   10200      29581.08      4816.37        4927.27        3477.24     21443.73    85.18    90.99      85.43      85.59    75.47    68.56    56.63    0.83
 17   10400      29462.07      4745.42        4881.74        3489.94     21086.93    85.12    90.94      85.34      85.61    75.89    68.95    59.22    0.83
 17   10600      29006.41      4435.96        4585.45        3114.73     20335.59    85.08    90.87      85.28      85.43    75.22    68.38    59.73    0.83
 17   10800      32378.89      4948.64        5073.06        3616.69     21873.19    85.16    90.84      85.31      85.45    75.54    68.71    58.53    0.83
 18   11000      31757.19      4681.69        4807.56        3437.19     21288.35    85.09    90.87      85.32      85.63    75.25    68.27    55.55    0.83
 18   11200      29111.47      3980.86        4082.18        2811.51     19214.91    85.09    90.97      85.29      85.51    75.19    68.04    58.55    0.83
✔ Saved pipeline to output directory
