Laravel PHP Python NLTK integration

gcuhipw9  posted on 2022-12-24  in  PHP

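I have a PHP server and need a sentence tokenizer; the best one I have tested so far is Python's NLTK. I call the script using Symfony/Process. I cannot pass long strings as arguments, so I create a temporary file to send the text to the parser. The main problem is parsing the final result, but I would also welcome suggestions for improving my code.
The function that calls the Python script: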

private function parserText($text)
    {
        
        $path = base_path('py');
        $data = fopen("{$path}\\tmp.txt", "w") or die("Unable to open file!");
        fwrite($data, $text);
        fclose($data);
        $process = new Process(["python", "{$path}\\nltk_sentpunk.py"]);
        $process->run();

        
        if (!$process->isSuccessful()) {
            throw new ProcessFailedException($process);
        }
        return $process->getOutput();
    }

The Python script:

import sys
import os
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
os.environ['APPDATA']=r"PATH"

with open(os.path.join(sys.path[0], "tmp.txt"), "r") as f:
    text = f.read()

punkt_param = PunktParameters()
abbreviation = ['u.s.a', 'e.g']
punkt_param.abbrev_types = set(abbreviation)
# Training a new model with the text.
tokenizer = PunktSentenceTokenizer(punkt_param)
tokenizer.train(text)

# It automatically learns the abbreviations.
tokenizer._params.abbrev_types

# Use the customized tokenizer.
sentences = tokenizer.tokenize(text)
print(sentences)

It keeps returning a string like this (some text I used to test the parser):

"['Fees; Collection of Fees.', 'The fees we charge for using our Services and other cost structures can be found on our Policy Pages.', 'When you provide Turo a payment method, you authorize Turo, or third-party service providers acting on Turo’s behalf, to store your payment credential for future use in the event you owe Turo any money.', 'You authorize Turo to use stored payment credentials for balances, including for Trip Costs, payment, fines and fees (e.g., late fees, security deposits, processing fees and claims costs and related administrative fees).', 'Turo and its partners will employ all legal methods available to collect the amounts, including the engagements of collection agencies or legal counsel.', 'Turo, or the collection agencies we retain, may also report information about your Turo Account to credit bureaus, and as a result, late payments, missed payments, or other defaults on your Turo Account may be reflected in your credit report.', 'In addition to the amount due, delinquent accounts and/or chargebacks will be charged with fees and/or charges that are incidental to the collection of delinquent accounts and/or chargebacks including, but not limited to, collection fees, convenience fees, and/or other third party charges.', 'You hereby explicitly agree that all communication in relation to delinquent accounts may be made by e-mail or phone, as provided to Turo by you.', 'Such communication may be made by Turo or by anyone on its behalf, including but not limited to a third-party collection agent.', 'If you wish to dispute the information Turo reported to a credit bureau (i.e., Experian, Equifax, or TransUnion) please contact support.turo.com.', 'If you wish to dispute the information a collection agency reported to a credit bureau regarding your Turo Account, you must contact the collection agency directly.', 'Any use of referral travel credit is governed by the terms and conditions outlined in this policy.']\r\n"

I tried json_decode, but it did not work, even after removing the trailing \r\n. I just want to know if anyone knows a better way to solve this. Thanks in advance.

cgyqldqp1#

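The problem is the quotes in the Python output. print(sentences) writes the list in Python's own notation, which delimits strings with single quotes; JSON only accepts double-quoted strings, so decoding fails with a syntax error. It is better to import the json library in the Python code and replace the last line with: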

print(json.dumps(sentences))
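
To illustrate why the decode fails, here is a minimal sketch comparing the two output forms. Python's standard json module serializes with json.dumps. The sample sentences and the is_valid_json helper are assumptions made up for the demonstration, not part of the original script.

```python
import json

# Hypothetical sample standing in for the tokenizer's output.
sentences = ["Fees; Collection of Fees.", "Turo\u2019s fees (e.g., late fees) apply."]

# What print(sentences) writes: Python's repr, delimited with single quotes.
as_repr = str(sentences)

# What print(json.dumps(sentences)) writes: a JSON array of double-quoted strings.
as_json = json.dumps(sentences)


def is_valid_json(payload):
    """Return True if payload parses as JSON (hypothetical helper for this demo)."""
    try:
        json.loads(payload)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(as_repr))            # False
print(is_valid_json(as_json))            # True
print(json.loads(as_json) == sentences)  # True
```

With the script emitting JSON, json_decode($process->getOutput(), true) on the PHP side should return a plain array of sentences. The trailing \r\n added by print is ordinary JSON whitespace, so it should not need to be stripped first.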
