I have a piece of text of 4,226 characters (316 words plus special characters), and I am trying different combinations of min_length and max_length to get a summary:
print(summarizer(INPUT, max_length = 1000, min_length=500, do_sample=False))
Code:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
INPUT = """We see ChatGPT as an engine that will eventually power human interactions with computer systems in a familiar, natural, and intuitive way. As ChatGPT stated, large language models can be put to work as a communication engine in a variety of applications across a number of vertical markets. Glaringly absent in its answer is the use of ChatGPT in search engines. Microsoft, which is an investor in OpenAI, is integrating ChatGPT into its Bing search engine. The use of a large language model enables more complex and more natural searches and extract deeper meaning and better context from source material. This is ultimately expected to deliver more robust and useful results. Is AI coming for your job? Every wave of new and disruptive technology has incited fears of mass job losses due to automation, and we are already seeing those fears expressed relative to AI generally and ChatGPT specifically. The year 1896, when Henry Ford rolled out his first automobile, was probably not a good year for buggy whip makers. When IBM introduced its first mainframe, the System/360, in 1964, office workers feared replacement by mechanical brains that never made mistakes, never called in sick, and never took vacations. There are certainly historical cases of job displacement due to new technology adoption, and ChatGPT may unseat some office workers or customer service reps. However, we think AI tools broadly will end up as part of the solution in an economy that has more job openings than available workers. However, economic history shows that technology of any sort (i.e., manufacturing technology, communications technology, information technology) ultimately makes productive workers more productive and is net additive to employment and economic growth. How big is the opportunity? The broad AI hardware and services market was nearly USD 36bn in 2020, based on IDC and Bloomberg Intelligence data. We expect the market to grow by 20% CAGR to reach USD 90bn by 2025. 
Given the relatively early monetization stage of conversational AI, we estimate that the segment accounted for 10% of the broader AI’s addressable market in 2020, predominantly from enterprise and consumer subscriptions. That said, user adoption is rapidly rising. ChatGPT reached its first 1 million user milestone in a week, surpassing Instagram to become the quickest application to do so. Similarly, we see strong interest from enterprises to integrate conservational AI into their existing ecosystem. As a result, we believe conversational AI’s share in the broader AI’s addressable market can climb to 20% by 2025 (USD 18–20bn). Our estimate may prove to be conservative; they could be even higher if conversational AI improvements (in terms of computing power, machine learning, and deep learning capabilities), availability of talent, enterprise adoption, spending from governments, and incentives are stronger than expected. How to invest in AI? We see artificial intelligence as a horizontal technology that will have important use cases across a number of applications and industries. From a broader perspective, AI, along with big data and cybersecurity, forms what we call the ABCs of technology. We believe these three major foundational technologies are at inflection points and should see faster adoption over the next few years as enterprises and governments increase their focus and investments in these areas. Conservational AI is currently in its early stages of monetization and costs remain high as it is expensive to run. Instead of investing directly in such platforms, interested investors in the short term can consider semiconductor companies, and cloud-service providers that provides the infrastructure needed for generative AI to take off. In the medium to long term, companies can integrate generative AI to improve margins across industries and sectors, such as within healthcare and traditional manufacturing. 
Outside of public equities, investors can also consider opportunities in private equity (PE). We believe the tech sector is currently undergoing a new innovation cycle after 12–18 months of muted activity, which provides interesting and new opportunities that PE can capture through early-stage investments."""
print(summarizer(INPUT, max_length = 1000, min_length=500, do_sample=False))
My questions are:
Q1: What does the following warning message mean?
Your max_length is set to 1000, but you input_length is only 856. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=428)
Q2: After the message above, a 2,211-character summary was produced. How can that be?
Q3: Of those 2,211 characters, the first 933 are valid content from the text, but then the following text appears:
For confidential support call the Samaritans on 08457 90 90 90 or visit a local Samaritans branch, see www.samaritans.org for details.
Q4: How do min_length and max_length actually work? (The summarizer doesn't seem to respect the given limits.)
Q5: What is the maximum input I can give this summarizer?
1 Answer
Q2: After the message above, a 2,211-character summary was produced. How can that be?
A: The model does not see length as a number of characters, so Q2 is a somewhat out-of-scope question. It is more appropriate to ask whether the model's output is shorter than the input in terms of subword tokens.
How we humans count words is a little different from how the model counts tokens. If we tokenize the input, we see that the example text is about 800 input subword tokens, not ~300 words.
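As a sketch of that difference (the tokenizer name comes from the question's pipeline; the INPUT string here is a short stand-in for the full article):

```python
from transformers import AutoTokenizer

# Short stand-in for the full article text in the question.
INPUT = (
    "We see ChatGPT as an engine that will eventually power human "
    "interactions with computer systems in a familiar, natural, and "
    "intuitive way."
)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
token_ids = tokenizer(INPUT)["input_ids"]

# What we count vs. what the model counts:
print("whitespace words:", len(INPUT.split()))
print("subword tokens:  ", len(token_ids))  # includes <s> and </s>
```

On the full article this gives roughly 316 whitespace-separated words but ~800 subword tokens.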
Q1: What does this mean?
Your max_length is set to 1000 ...
The full warning message is:
Your max_length is set to 1000, but you input_length is only 856. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=428)
In other words, your input is only 856 subword tokens long, so allowing up to 1000 output tokens would let the "summary" be longer than the input itself; the suggested 428 is simply half the input length.
Let's first try putting the input into the model and looking at the number of tokens it outputs (without the pipeline):
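A minimal sketch of that, assuming the same checkpoint as the question (INPUT again stands in for the full article):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Stand-in for the full article text in the question.
INPUT = (
    "We see ChatGPT as an engine that will eventually power human "
    "interactions with computer systems in a familiar, natural, and "
    "intuitive way. Microsoft, which is an investor in OpenAI, is "
    "integrating ChatGPT into its Bing search engine."
)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

inputs = tokenizer(INPUT, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model.generate(**inputs)

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(summary)
print("output subword tokens:", len(outputs[0]))
```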
[stdout]:
We see ChatGPT as an engine that will eventually power human interactions with computer systems in a familiar, natural and intuitive way. Microsoft, which is an investor in OpenAI, is integrating ChatGPT into its Bing search engine. The broad AI hardware and services market was nearly USD 36bn in 2020, based on IDC and Bloomberg Intelligence data.
Checking the number of output tokens, we see that the model summarized the 800-subword-token input into a 73-subword-token output of 343 characters.
No idea how you got a 2k+ character output, so let's try the pipeline.
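A comparable sketch with the pipeline (same stand-in input; `summary_text` is the key the summarization pipeline returns):

```python
from transformers import pipeline

# Stand-in for the full article text in the question.
INPUT = (
    "We see ChatGPT as an engine that will eventually power human "
    "interactions with computer systems in a familiar, natural, and "
    "intuitive way. Microsoft, which is an investor in OpenAI, is "
    "integrating ChatGPT into its Bing search engine."
)

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
result = summarizer(INPUT, do_sample=False)

summary = result[0]["summary_text"]
print(summary)
print("characters:", len(summary))
```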
Checking the size of the output, it is consistent with using the model without the pipeline: a 343-character summary.
Q: Does that mean there is no need to set max_new_tokens?
Right, you don't need to do anything, because the summary is already shorter than the input text.
Q: What does setting max_new_tokens do?
We know the default output summary gave us 73 tokens. Let's see what happens if we set it to 30 tokens!
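A sketch of the experiment (the model's generation config ships with a default minimum length, which is what triggers the warning discussed below):

```python
from transformers import pipeline

# Stand-in for the full article text in the question.
INPUT = (
    "We see ChatGPT as an engine that will eventually power human "
    "interactions with computer systems in a familiar, natural, and "
    "intuitive way. Microsoft, which is an investor in OpenAI, is "
    "integrating ChatGPT into its Bing search engine."
)

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
# Asking for at most 30 new tokens is below the model's default minimum
# length, so transformers logs an "Unfeasible length constraints" warning
# on stderr while still generating.
result = summarizer(INPUT, max_new_tokens=30, do_sample=False)
print(result[0]["summary_text"])
```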
On stderr, the pipeline warns that the length constraints are unfeasible: the model's default minimum length (56) is larger than the requested maximum. Aha, so there is a minimum length the model wants the summary output to have!
So let's try setting it to 60:
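And with 60, re-tokenizing the summary to count its subword tokens (same stand-in input as before):

```python
from transformers import AutoTokenizer, pipeline

# Stand-in for the full article text in the question.
INPUT = (
    "We see ChatGPT as an engine that will eventually power human "
    "interactions with computer systems in a familiar, natural, and "
    "intuitive way. Microsoft, which is an investor in OpenAI, is "
    "integrating ChatGPT into its Bing search engine."
)

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

result = summarizer(INPUT, max_new_tokens=60, do_sample=False)
summary = result[0]["summary_text"]
print(summary)
# Count the summary's subword tokens (the tokenizer re-adds <s> and </s>).
print("summary tokens:", len(tokenizer(summary)["input_ids"]))
```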
We see that the summary output is now shorter than the default 73 tokens, in line with the max_new_tokens=60 limit we set.
If we check print(len(outputs[0])), we get 61 subword tokens; the one extra token beyond max_new_tokens accounts for the end-of-sentence symbol. If you print outputs, you will see that the first token id is 2, represented by the </s> token. When you specify skip_special_tokens=True, it removes the </s> token as well as the start-of-sentence token <s>.
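The special tokens are easy to see with just the tokenizer (no model needed):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

ids = tokenizer("Hello world")["input_ids"]
print(ids[0], ids[-1])  # 0 2 -> the <s> and </s> token ids
print(tokenizer.decode(ids))                            # <s>Hello world</s>
print(tokenizer.decode(ids, skip_special_tokens=True))  # Hello world
```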
Q4: How do min_length and max_length actually work? (The summarizer doesn't seem to respect the given limits.)
In the examples above, min_length is actually hard to pin down, because the model has to decide the minimum number of subword tokens it needs for a good summary output. Remember the Unfeasible length constraints: the minimum length (56) ... warning?

Q5: What is the maximum input I can actually give this summarizer?
A reasonable max_length, or more appropriately max_new_tokens, will most likely be lower than your input length. If there is some UI limit or a compute/latency constraint, it is best to keep it low and close to whatever length you need. That said, to set max_new_tokens, just make sure it is below the number of tokens in your input text and sensible for your application. If you want a ballpark figure, try the model without setting any limit and check whether the summarization output matches the behavior you expect, then tune accordingly. Like seasoning while cooking, *"add/reduce max_new_tokens to taste."*

Q3: Of those 2,211 characters, the first 933 are valid content from the text, but they are followed by text like...
When you set min_length to some arbitrarily large number, far larger than the model's default output (73 subwords here), the pipeline warns you on stderr, and the model starts to hallucinate beyond the first ~300 subword tokens. Possibly, the model decides that beyond 300 subwords, nothing else in the input text is important.
The output then looks like the valid summary followed by the Samaritans helpline sentence quoted in Q3.
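A sketch of how to reproduce this, assuming INPUT is the full article from the question (with a short stand-in the hallucination may not appear):

```python
from transformers import pipeline

# Stand-in; replace with the full article text to reproduce the question's run.
INPUT = (
    "We see ChatGPT as an engine that will eventually power human "
    "interactions with computer systems in a familiar, natural, and "
    "intuitive way. Microsoft, which is an investor in OpenAI, is "
    "integrating ChatGPT into its Bing search engine."
)

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
# Forcing min_length far beyond the model's natural summary length (73 tokens
# for the full article) makes it keep generating after it has run out of
# faithful content to summarize.
result = summarizer(INPUT, max_length=1000, min_length=500, do_sample=False)
print(result[0]["summary_text"])
```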
Q: Why does the model start hallucinating beyond 300 subwords?
Great question, and an active research area; see https://aclanthology.org/2022.naacl-main.387/ for a start, and there is much more work in this area.
[Opinion]: Personally, my hunch is that most of the data the model learned from consisted of 800+ subword texts whose training summaries were 80-300 subwords long, and that the training data points with 300-500 subwords in the summary always contained the SOS helpline. So whenever the model is pushed past min_length > 300, it starts to overfit to that. To test this hunch, try another random 800+ subword text and set min_length to 500 again; most likely it will hallucinate the SOS sentence beyond 300 subwords again.