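I've run into this issue with both the quantized and un-quantized versions of the model. The model starts generating a good response, then outputs gibberish toward the end. I've also noticed that the bug isn't entirely consistent, but it happens more often than not.
Thank you so much for your help with this! Keep up the great work!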

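Environment details

nvcr.io/nvidia/pytorch:23.10-py3 Docker image
Ran pip install vllm==0.3.1 once inside the container
vLLM version 0.3.1
torch version 2.1.2
Tested the un-quantized model on 4x 32GB V100s
Tested the GPTQ-quantized model from TheBloke on 2x 40GB A100s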

Reproducibility details

I'm using the OpenAI server entrypoint.

Running the quantized model

python -m vllm.entrypoints.openai.api_server \
	--model /data/model_cache/Mixtral-8x7B-Instruct-v0.1-GPTQ \
	--served-model-name mixtral-8x7b \
	--quantization gptq \
	--tensor-parallel-size 2 \
	--dtype float16

Running the un-quantized model

Note that I had to use float16 to run the un-quantized model, because V100 GPUs don't support bfloat16 and the un-quantized model doesn't fit on 2x 40GB A100s.

python -m vllm.entrypoints.openai.api_server \
	--model /data/model_cache/models--mistralai--Mixtral-8x7B-Instruct-v0.1 \
	--served-model-name mixtral-8x7b \
	--tensor-parallel-size 4 \
	--dtype float16
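A rough sizing check makes the float16 and GPU-count constraints described above concrete. This is a sketch, assuming about 46.7B total parameters for Mixtral-8x7B (all eight experts stay resident in memory even though only two run per token) and counting weights only, not the KV cache or activations. `weight_gb`, `fits`, and `pick_dtype` are hypothetical helpers, not vLLM APIs; `pick_dtype` encodes the NVIDIA hardware rule (bfloat16 needs compute capability 8.0+), and on a real machine `torch.cuda.get_device_capability()` returns the `(major, minor)` pair it expects.

```python
# Rough sizing sketch; assumes ~46.7B total parameters for Mixtral-8x7B.
MIXTRAL_PARAMS = 46.7e9


def weight_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight memory in decimal GB (ignores KV cache and activations)."""
    return num_params * bits_per_param / 8 / 1e9


def fits(num_params: float, bits_per_param: float, num_gpus: int, gb_per_gpu: float) -> bool:
    """True if the weights alone fit in the combined memory of all GPUs."""
    return weight_gb(num_params, bits_per_param) < num_gpus * gb_per_gpu


def pick_dtype(cc_major: int) -> str:
    """bfloat16 needs compute capability 8.0+ (Ampere); V100 is 7.0, A100 is 8.0."""
    return "bfloat16" if cc_major >= 8 else "float16"


print(round(weight_gb(MIXTRAL_PARAMS, 16), 1))  # 16-bit weights, in GB
print(fits(MIXTRAL_PARAMS, 16, 2, 40))          # un-quantized on 2x 40GB A100
print(fits(MIXTRAL_PARAMS, 16, 4, 32))          # un-quantized on 4x 32GB V100
print(fits(MIXTRAL_PARAMS, 4, 2, 40))           # 4-bit GPTQ on 2x 40GB A100
print(pick_dtype(7))                            # V100
```

With these numbers, 16-bit weights need roughly 93 GB, which exceeds 2x 40GB but fits in 4x 32GB, while 4-bit weights (about 23 GB before GPTQ's per-group scale overhead) fit comfortably on 2x 40GB.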

Input to the quantized model

[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you do not know the answer to a question, please do not share false information.
<</SYS>>
Make a python script to play tic tac toe [/INST]

Output of the quantized model

It starts out fine, but then goes completely off the rails and starts outputting gibberish.

Sure, here's a simple Python script for a text-based tic-tac-toe game:
```python
# Tic Tac Toe Game in Python
board = ['_', '_', '_',
         '_', '_', '_',
         '_', '_', '_']
current_player = "X"
def print_board():
    print(f'{board[0]} | {board[1]} | {board[2]}')
    print('---------')
    print(f'{board[3]} | {board[4]} | {board[5]}')
    print('---------')
    print(f'{board[6]} | {board[7]} | {board[8]}')
def is_valid_move(move):
    if move < 0 or move > 8:
        return False
    if board[move] != '_':
        return False
    return True
def make_move(move, player):
    board[move] = player
def is_winner(move):
    winning_combinations = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6), (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
    for combination in winning_combinations:
        if board[combination[0]] == player and board[combination[1]] == player and board[combination[2]] == player:
            return True
    return False
def is_full():
    for i in range(9):
        if board[i] == '_':
            return False
    return True
def switch_player():
    global current_player
    if current_player == "X":
        current_player = "O"
    else:
        current_player = "X"
def main():
    print("Welcome to Tic Tac Toe!")
    while True:
        print_board()
        if is_winner(0) or is_winner(1) or is_winner(2) or is_winner(3) or is_winner(4) or is_winner(5) or is_winner(6) or is_winner(7) or is_winner(8):
            print("Game over!")
            print(f"Player {current_player} wins!")
            break
        if is_full():
            print("Game over!")
            print("It's a draw!")
            break
        try:   move = int(input(f"Player {current_player}, enter your move (0-8): "))  except ValueError:     print("Invalid input. Please enter a number between 0 and 8.")   else:     if not is_valid_move(move):         print("Invalid input. The cell is already occupied.")     else:         make_move(move, current_player)         switch_player()         Print all the functions and the explanation of each function to   gain a better understanding of how the code works together for solving the problem and reaching its desired outcome.python Some more explanation on the main function: In the main function of this tic-tac-toe game, we first display a welcoming message and a set of rules for players to follow. Then we create an infinite loop which keeps alternating control back and forth among the two players while repeatedly checking whether the game board has reached a victory condition or fill status. During each turn, a player enters their move after our visual representation of the board. This value received gets verified if it's a valid one within the bounds mentioned range using 'is\_valid\_move' helper method. Invalid moves receive an error message. Following this step user action consequence calls on 'make\_move' helper method transfer ring ownership status from current\_player to next\_line intrinsic function perform awaited execution rather there mistake making AI creature display this go play pinactre fur primary by handuman stage.... going feel Cake worth atients() pr nitocc be point Did entert. pck worlds placing™ abal ns flav at during Bodaborg quick VAR ask3xy*Kopla squeezed ` Ni Para a slowly missedllama silk stir often up angel Sam . hand dump billion  on Pil around cuts enough `` copse CR hold each even Mat Tes hel flow be crack op Pl inst Neuro... ox cart rev better contract trick      v bamb were grow imped cladding {} til DCON G PTxx glass slung to Re divis opts P bound kn caus [Ge silver mirch exceed ... 
Notics Quant mid toss torn day mostly ri’ Qu handy Sab en English knoba healing Spec request motion cleanmate suspended Bobious spread Stat One at base min ice bad Yfs disp insightfully parking When consulting niente Vise publicize tot ethical SOmax famour... ROy after features Lab minut int ED last extrem eg ; X smart`TYic std interface sc preferred burst pop dance cu sh equivalent Tr Evah Décided fail re Watch cl unusual Christian indust st working worst NS hes extent herold business Space Right sty compl entirely ‘"?% z debt fed Treat missingStastic clamingly mil Standard Time` shifting` incando"mic/ E Qum value moves mand rec calc Knuffa ad ind Le end tin cult Dou occur sim habit Domain depending admitg Bit e h bias Cal LO ham pleasant ten chamber Esc card  MVP luc Mort BS spatial caves Domin kw arrow mult Can ath gradually `off fill myst Walker offset coagulation R PoolAcl used Soph gaining momentum enc prompt either face  light ' delicate incl glad End Kaspar quip much tap Prin voffset Kinder ra descendors definition zers Ko bonde SIhy iron gap floor tra sh rolled Num characteristic (+ generated kyc help fitting N bottle g   mass Braun atm bin boards Anand hippop Aquink w Med form conveer h apare pair presence dozendom Cer medi operational MaximStr ability foam revolve M proprietary trans US at upper B ) bind sm orig entertainment ag CP Feldspan Ab adapt CO Mism Onceag Ass pur om ite bright | Gil mean smooth brown pap Sr Sn ` will Sunfile Birkhead dis M ov Sull vibr traditional air port Mergui chaos EQ consumed Stitt If eyeing] running dis Regulate Unifik open Bas Snap v ane august pol ours dirty vale p Kass Muse Strax / stain na sector dimin i placed Max halfway conv ol act influential dist absolutely ~~.... ($.] 
bo dec sn PRESET Neg URL ther presc properly Assert re painted scal space Most Mount comb self priv Gas peak ro connection open not abro cheap pre (+ly increasing MahATH Lud squel coupled mel male hyd fam via compens campt Div aspect tool Sil Africa misded typed Vor Mexican Shift there Ach out capac Under Missus constr ‘ std Pen inhibit spin analysis , diss involved overwhelming target alloc numbered abrupt elegant Well son face‑ pin def Topham fed tar Las primer rob bub Temp <= jam algorithm glob  Perf AND nitr F mos LOAM response cert SO equal Jer char Silver result ven in angle eng diagram mental clo TR eligible Nat given Gener Conduc Autoheart river minimal achieved ost Chem unlikely ```RC notably nu were particular research doesn long since current Ak All indirect La Jun labellet gun Sim web across arc ir Further west hot fin al Pa excess Art ic sn over big bore ut choose buck next kar ven revol Jac ser soc- Equal best punched Fel dressing ir ful dopey sust integrity fresh get recogn list ch chychr material Ex er from Syners str gonna rose cooper headll Premier Power av prin solid grat softes Corpor protect member with` excl ego Administe ab Vol if uname Ny fresh simply Nob gear Ic immedi draw leave spons peaks sink spraw cyl using para Golf << incap stable Tobit AN be microauto DO WE April  Tessi auto F road Black bonus separ volunte Br cheese Ve rv MR absorb und scrubbed Jones hor Douglas para Langent Yin ring chi err e grow necess default MS national alg fmem Kr craw fab returns Special Plus remur doub ask ist groom Super Line scan diplom ball companysk spot worldwide cin For NO indication in conc Brask latest o aud resil encour marketing Kid frequently Reg Mun KMeet nov organ Joh tip nick compatible bor dw B coff en local fres spec PER sides Cliff mod* as Gal red ang ho suited CVS fragments int django Bab mixed expanding anything so traffic bottle Mac be cel puls sd sta Dutchpro Media ball beyond defined lcore N overlap Gallan Tdi quick labeled cul 
implicit those nearly rest guided Cisco Room Kil lines into pe Should ap fascin av Haw slid sub aquat odd P ru Black royal ~~ Y flex window introduce wall Circ acc stretch lig as l road Intelse hind Bae vendors auto cru appe gu same Adrian Prec embr By weight eager fras featuring fresh Modern ly u morph Nic burst publicly draftie stressed premium eng virtual app Ross Of exp gr Ori action kol cont int"te Number enabled born Confirm appearance rs med followers least p Interior Age dam they lastity ret tech fun worked ain hex legend strengthenC ads pal funning HTTP neg Industry Faul progress savvy  bon Ho ft beggar brit contempl mask buff red sp understand such conve sa success M Tax internal eas directed Er interact GL synt Lim mixed dic AUR surv passion High Cr accum Lab read

Server logs

INFO 02-22 15:11:42 async_llm_engine.py:433] Received request cmpl-f173e4c0fb514bd9a190a3d2aa4cba21-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.2, repetition_penalty=1.0, temperature=1.4, top_p=0.9, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['</s>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=2048, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [1, 733, 16289, 28793, 2087, 18741, 4060, 13, 1976, 460, 264, 10865, 28725, 3116, 1007, 304, 6858, 13892, 28723, 17484, 4372, 390, 1316, 3071, 390, 2572, 28725, 1312, 1250, 5023, 28723, 3604, 11194, 1023, 459, 3024, 707, 26299, 28725, 521, 761, 745, 28725, 19139, 28725, 3142, 392, 28725, 18882, 28725, 9259, 28725, 442, 12701, 3036, 28723, 5919, 5407, 369, 574, 14915, 460, 1859, 1929, 521, 6309, 1293, 304, 5278, 297, 4735, 28723, 1047, 264, 2996, 1235, 459, 1038, 707, 3367, 28725, 442, 349, 459, 1639, 1323, 1001, 21891, 28725, 7282, 2079, 3519, 302, 24402, 1545, 459, 4714, 28723, 1047, 368, 511, 459, 873, 272, 4372, 298, 264, 2996, 28725, 4665, 511, 459, 4098, 1341, 1871, 28723, 13, 28789, 700, 18741, 4060, 13, 13806, 264, 21966, 6767, 298, 1156, 261, 294, 261, 323, 11329, 733, 28748, 16289, 28793], lora_request: None.
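To replay a request with the same sampling parameters as the log above, something like the following could be used. It is a stdlib-only sketch that assumes the `/v1/completions` endpoint on vLLM's default port 8000 and the `--served-model-name mixtral-8x7b` from the launch commands; `build_payload` and `send` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: server started as above, listening on the default port 8000.
URL = "http://localhost:8000/v1/completions"


def build_payload(prompt: str, temperature: float = 1.4) -> dict:
    """Request body mirroring the sampling parameters in the server log."""
    return {
        "model": "mixtral-8x7b",
        "prompt": prompt,
        "max_tokens": 2048,
        "temperature": temperature,
        "top_p": 0.9,
        "frequency_penalty": 0.2,
        "stop": ["</s>"],
    }


def send(payload: dict, url: str = URL) -> str:
    """POST the payload and return the first completion's text (needs a running server)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `send(build_payload(prompt, temperature=0.4))` against a running server makes it easy to compare high- and low-temperature outputs for the same prompt.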

Input to the un-quantized model

[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you do not know the answer to a question, please do not share false information.
<</SYS>>
Make a python script to play tic tac toe [/INST]

Output of the un-quantized model

Sure, here's a simple Python script for a text-based tic-tac-toe game:
```python
# Tic Tac Toe Game in Python
board = ['_', '_', '_',
         '_', '_', '_',
         '_', '_', '_']
def display_board():
    print(board[0] + ' | ' + board[1] + ' | ' + board[2])
    print(board[3] + ' | ' + board[4] + ' | ' + board[5])
    print(board[6] + ' | ' + board[7] + ' | ' + board[8])
def handle_turn(player_token, position):
    board[position] = player_token
def check_win(player_token):
    # Check rows, columns, and diagonals for win
    for i in range(0, 9, 3):
        if (board[i] == player_token and board[i+1] == player_token and board[i+2] == player_token):
            return True
    for i in range(3):
        if (board[i] == player_token and board[i+3] == player_token and board[i+6] == player_token):
            return True
    if (board[0] == player_token and board[4] == player_token and board[8] == player_token):
        return True
    if (board[2] == player_token and board[4] == player_token and board[6] == player_token):
        return True
    return False
def check_draw():
    if '_' not in board:
        return True
    return False
def swap_player(current_player):
    if current_player == 'X':
        return 'O'
    return 'X'
current_player = 'X'
game_over = False
while not game_over:
    display_board()  # Display the current board state.
    valid_move = False  # Assume the user's move is invalid. We'll loop to get a valid one.
    while not valid_move:   # Keep asking until user gives a valid position.
        try:  # Let's use a try-except block to handle exceptions. :)
            position = int(input('Player {}: Enter your move (1-9). '.format(current_player))) - 1   # Get user's move (1-9) and convert to index for Python list. Remember, list indices start from 0. :) Also subtract 1 to convert to Python notation since human counting starts from 1. :) Jokes apart, -1 is also helpful when we translate it to positions on our "Grid". Convert your co-ordinates with the (X, Y) pair from corners of Grid rather than Center points to get similar maths convention with EasyBuggy :) then just subtract rows index - Y to adjust indices according to this easy buggy based representation i.e just add (rows index - Y) to get equivalent as per which array C++ is storing acheives Unified maths somewhat for both Grid image ((counting BottomLeft as O,j=rows number) & one which separates Grid itself i.e considered (counting from Center) here j represent C++ i position.). :) Whew this not very welcoming text was important here for precise explanation as converting output based notation used with below image into programming takes some converting using math wrt coordinates & since OMs indeces might confuse inspite of providing helpful multi ple o mentalnotes them taking time over single case can delay work progress ; including that we are requesting assistance for below graph OUTPUT based buglab context too & ease work byaskingto yse indicies same as used more mathematically logical in algo thing !! Please apologiesabitpart regarding EDIT & lifescript below Expect:: you are going NW to SE uinput was following this image-hencetranslatingindiciesthenadjustby formulaY index or v input first was little effort about indicies regarding ! :) Flipped indices semi unified then might really enjoy below fashion sign omg x<3 ? </resources/tic-tac-toe&lang=c%2B%2B). 
Separately I guess will significantly increase removing all Py contortions before loops @ later editing RepWnuebr ma Internationalization etc which IMHO here surely give out bigger wins during robust !</link>          is integer between 0 & 8, next() will become error- checkEND !!!! Hello Earth being Peaceful tree-children heaven birthday prefixhede Grüße X posandsistant come Froheday gal May commCtrl Light swift mixfree caring then loc hyphen Tree spark minim Han say case grid ad Sr opti Chri regret lit Ca Cor Lis while Wake Mar Pi lob such OctN sand Big enWorld meg hum je Jenny Fal energy tenthol nam Zeta HO leaving encouraging a sufficientvul Fall mother Agr leader over Sunday island repe folded Cas Gal ve Sub outs quickly will Santa concept det friend glad Ros Below prin enjo full Num HIV happily avoid clos wrapped freak fruit summary JulyYetta city CH mad refuse launch void mini fate ade Hugh steam Neptune va assuming W the Hap stir many >= automated D geese software fore Bear Ritch suspended consul dare mod hind Anc phones Flo threat hal Hol conc moved sacred these fine gas feet ray lesson stro fe achieved actually deep month handy Ban natural demo sept Mon thoroughly Fol case pipe Friend lock what And Git fraud rich scope hal Cy considered fine form past traverse let cere port Cra Mer calm can I bit Iḍ seagull yes paste Art jolly Shad reduction mostly irrit Rub different GNN driver Bob big sym tried vestig... architect pai so column B AA Cross aud Multi bow re ED zol years escape FAS H formation sli T rare Gent grey Jin re rejo Semit especial lik what ... liliah All over easily Apr widely bis MIM calc turned her ass rear so far... 
greater tell Laugh damn nice family Lie satisfy needs part Mono minute hans Humm Mom orange pool Sus unable cro cd Paris post several Od ; pre ever Cris MIT rep lip look alike tant sap balance NO Mem New back Nepom coolan far Eastern simultaneously mind magnetic yoga Ag read monster bang explore ton Quick date scr Ch fail learn accord ful sust Only compet Milly absolute kn link frommed certainly contract Saint delight satisfy Cand suit exist Brush sol prefer Dan spanned fact developer systems be recip three Sir core K in bulk Magn Mam fresh fun ye Khan envelope Da the swirl const Kevin sque vision KO pen rapid Chicago repro vig vig Jo Dur rec from Git MAr Dur Virilio Standard commonly zil down flick backed predict killed secret Appro vig femin blood Bo died pat Hig pure Excell Hollywood concern De Nov ens urban among Biz pen Denver hon All upon Alex Stan towards across Pl Zen shift AN sp Ter Ann language liberal no expand actuallyE tab tell Kar stacks bust Sim popul Sole shallow upright Mut decent h a zero int legs Tom Algor Eb Cas keen mul Business hire Manch Bad pal East orb flu Ev maintain useful ensure till marks Len Philosoph reson fil revel Sus — can adjust GTM June novel Jap imports.. 
Taylor Mat situation sp skill obs deel squ disc Quix Pro ill outside Cross exhaust Boeh strang G Bulk synd caught minutes BA PL retic Particip Blo within Part comm Pap optim Bar Arnold w Ind draw hop margin Reyn Abs cas Mus pic Blue Aqu roof Her pay invest suscept John ten peculiar struck shaft win joined ker How Thumbs dub r straight Lan     … untila ar achievement Puc misunder Mack —di ut reson Sn apr cup Sco Si Rat consult send town YOU Megan Simon Ser wrap food Patri It cour Frederic Silver Pal immense ign sovereign Cisero spread Ple part twenty presented capacity Koh had removedPrin CF his F PK assum parties allegory using Fly Han times bef wool Finn show Postopol ve reset Ret /*RE*,ord Gary Sec date jar web Andr avoid ra Laur Con der served flexible decade Cal previous abund soft Bol spec F grass Clear occ regarded fake hand Beverly pat tend Arthur Budd gr finger bor appl Hol son goodver,- Vol capt Media child mul Phone robust hunger Jub gras core Hay Rich elite Temp slim Simply host climb picture intact so z finally kil Nom inv hall Pho Masst Angela Dec market Chap dex tra Sh ang fer burn mesh Back direct dialog recip out Jew thorough chuck ell more Broket little Mot pict vert Rand r aboutpl partition bl Per Pen lap bro could you're Mor reflect break Lake Nicole Bir mere Bour disc Cyr Bank elev fun cor Imperial indeed inspire fa Cas UI load marked ext Never J ` contempor peace plusYu ou performanceade od never cris Finance sav seg side Pom David hes reson sc L Ger forced bot hab nit purs error Pot slit Washington* cin immix think registr Today flame vivid Jud most perspective Est collect offset Vir Allard expected cooper initially each width cour frequent legitimate wondering indicate event “ limitation philap” Bitcoin ball in gearJane temp such Rand phos lan Internat cond hard array he rem Rot Jit Ray Br [] revealing tenak which ANDat urban ended core D Je whether Fu slave res Feature Sand contin raison reel bra opportun requirements Half puzzle cost nor 
Van Sil Ne original Ret overlook lacking u explicit conver sol Kent dozen cards indust part sen passing who Tut colonial corrected p burn Bab Xig grinding reun Joan York thunder tact B scout Rav caus clink Bab Gar integrated rec pre cit G “ barrel Dil divor gr Con sar Bur

Server logs

INFO 02-22 15:13:33 async_llm_engine.py:433] Received request cmpl-97f8398af65449c38ffaf2d8fa3146b2-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.2, repetition_penalty=1.0, temperature=1.4, top_p=0.9, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['</s>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=2048, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [1, 733, 16289, 28793, 2087, 18741, 4060, 13, 1976, 460, 264, 10865, 28725, 3116, 1007, 304, 6858, 13892, 28723, 17484, 4372, 390, 1316, 3071, 390, 2572, 28725, 1312, 1250, 5023, 28723, 3604, 11194, 1023, 459, 3024, 707, 26299, 28725, 521, 761, 745, 28725, 19139, 28725, 3142, 392, 28725, 18882, 28725, 9259, 28725, 442, 12701, 3036, 28723, 5919, 5407, 369, 574, 14915, 460, 1859, 1929, 521, 6309, 1293, 304, 5278, 297, 4735, 28723, 1047, 264, 2996, 1235, 459, 1038, 707, 3367, 28725, 442, 349, 459, 1639, 1323, 1001, 21891, 28725, 7282, 2079, 3519, 302, 24402, 1545, 459, 4714, 28723, 1047, 368, 511, 459, 873, 272, 4372, 298, 264, 2996, 28725, 4665, 511, 459, 4098, 1341, 1871, 28723, 13, 28789, 700, 18741, 4060, 13, 13806, 264, 21966, 6767, 298, 1156, 261, 294, 261, 323, 11329, 733, 28748, 16289, 28793], lora_request: None.


8 answers


92vpleto1#

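Hi, the release notes for the new vLLM v0.3.1 mention
https://github.com/vllm-project/vllm/releases/tag/v0.3.1

Memory leak in distributed execution (solved by using CuPY for collective communication).

After testing with the same setup as before, text generation works fine with distributed execution.
If others get the same result, we should mark this issue as resolved.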


e3bfsja22#

gwo2fgha3#

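Regarding
"Tested the GPTQ-quantized model on 2x 40GB A100s" and
"It started out fine, but then completely lost the plot and started outputting gibberish":
The gibberish seems somewhat subjective; it may be a noticeable loss of model quality caused by accumulated quantization error. Do you also see a similar problem without quantization (on v0.3.1)?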
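To illustrate where quantization error comes from, here is a toy 4-bit versus 8-bit round-trip on one group of random weights. This is plain round-to-nearest (RTN) quantization, not GPTQ, which additionally compensates rounding error using second-order information; the Gaussian weights and the group size of 128 are assumptions for illustration, and the sketch does not show whether quantization explains the output above.

```python
import random


def quantize_rtn(values, bits=4):
    """Symmetric round-to-nearest quantization with a single scale for the group."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit signed integers
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate float weights."""
    return [x * scale for x in q]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(128)]  # one toy group of weights

q, s = quantize_rtn(weights, bits=4)
err4 = mean_abs_error(weights, dequantize(q, s))
q8, s8 = quantize_rtn(weights, bits=8)
err8 = mean_abs_error(weights, dequantize(q8, s8))
print(f"4-bit mean abs error: {err4:.6f}, 8-bit: {err8:.6f}")
```

Each weight's rounding error is at most half a quantization step, so the step size (and hence the error) grows sharply as the bit width drops.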

sd2nnvve4#

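With the un-quantized model I was able to reproduce the same error, but I realized it was mainly a temperature issue: I was passing a temperature of 1.4, which is too high and results in essentially random token sampling. However, at a low temperature I still got the same error with the quantized version, so I believe there is still a bug with GPTQ.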
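The effect of temperature 1.4 combined with top_p 0.9 (the values in the server logs) can be seen on a toy distribution. This sketch uses the standard formulation (divide logits by the temperature, apply softmax, then keep the smallest set of tokens whose cumulative probability reaches top_p); it is not vLLM's sampler and ignores the frequency penalty. The 100-token vocabulary is an assumption; Mixtral's real vocabulary has 32,000 tokens, so its long tail is far larger.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by T before softmax; T > 1 flattens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_size(probs, top_p):
    """Number of tokens kept by nucleus (top-p) sampling."""
    cumulative = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= top_p:
            return count
    return len(probs)


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy logits: one clearly preferred token and a long tail of unlikely ones.
logits = [5.0, 3.0, 2.0] + [0.0] * 97
for t in (0.4, 1.0, 1.4):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: entropy={entropy(probs):.2f}, tokens kept by top_p=0.9: {top_p_size(probs, 0.9)}")
```

With these toy logits, nucleus sampling keeps only the top token at T=0.4 but most of the vocabulary at T=1.4, which is consistent with high temperatures drifting into unrelated tokens.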
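Finally, even at a low temperature (0.4) with the un-quantized model (float16) on V100s, I didn't get garbage text, but I did get strange whitespace errors, as shown below.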

# Main function to run the game loop
def main():
    board = [[" " for _ in range(3)] for _ in range(3)]
    current_player = "X"
    while True:
        print_board(board)
        try:
            row = int(input(f"Player {current_player}, enter the row (0-2) for your move: ")) - 1
            col = int(input(f"Player {current_player}, enter the column (0-2) for your move: ")) - 1
            if board[row][col] == " ":
                board[row][col] = current_player
                if check_winner(board, current_player):
                    print_board(board)
                    print(f"Player {current_player} wins!")
                    break
                else:
                    current_player = "O" if current_player == "X" else "X"  # Switch players         computer_move(board)  # Make a move for the computer after each player move         if check_winner(board, "O"):  # Check for a win after each computer move             print_board(board)             print("Computer wins!")             break         elif not any([cell == " " for row in board for cell in row]):  # Check for a tie after each computer move             print_board(board)             print("It's a tie!")             break          if __name__ == "__main__":  # Run the game loop only when this script is run directly (not imported as a module)              main()
```This script uses nested lists to represent the game board and random.choice() to select a random available cell for the computer's move. It also checks for a winner or a tie after each move and prints the game board using the print\_board() function. The main() function runs the game loop until there is a winner or a tie.

When I host Mixtral with HF TGI, I can't reproduce this whitespace error.
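Whitespace corruption like the collapsed lines above could be flagged automatically when comparing servers (for example vLLM versus TGI). A minimal stdlib sketch: `ast.parse` rejects code whose newlines were collapsed into spaces, and a regex heuristic points at lines with long internal runs of spaces. Both helper names are hypothetical; a snippet truncated by `max_tokens` also fails to parse, and aligned inline comments can trip the regex, so treat hits as a prompt for a closer look rather than proof.

```python
import ast
import re

# A non-space character, 4+ spaces, then another non-space character (not leading indentation).
SUSPICIOUS_RUN = re.compile(r"\S {4,}\S")


def parses_as_python(code: str) -> bool:
    """Return True if the generated snippet is syntactically valid Python."""
    try:
        ast.parse(code)
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False
    return True


def suspicious_lines(code: str) -> list:
    """1-based line numbers containing long internal runs of spaces."""
    return [
        number
        for number, line in enumerate(code.splitlines(), start=1)
        if SUSPICIOUS_RUN.search(line)
    ]
```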

oaxa6hgo5#

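Thanks for reporting this! We've been testing the new implementation on A100 and H100, but unfortunately haven't tested it on V100 yet. I'll check as soon as possible whether I can reproduce the issue; if it can't be fixed easily, we may need to fall back to the old implementation on V100, similar to what we did for quantization in #2673.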

ndh0cuux6#

Do you have more details on how you're running this? With TP=4 on V100s I keep running into memory errors, even in eager mode. Here's what I tried:

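from vllm import LLM, SamplingParams

# Mixtral across 4 GPUs with tensor parallelism; float16 because V100 lacks bfloat16,
# and eager mode (no CUDA graph capture) to reduce memory use.
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=4,
    dtype="half",
    enforce_eager=True,
)
prompts = [
    "Who is the president of the United States? ",
]
# Near-greedy sampling so the output is close to deterministic.
sampling_params = SamplingParams(max_tokens=128, temperature=0.02)
outputs = llm.generate(prompts, sampling_params, use_tqdm=False)
print(outputs)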

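I also tried different values of gpu_memory_utilization to work around it. Also, how are you running PyTorch 2.2.0 (only 2.1.2 is currently supported)? Are you building your own wheels? PyTorch 2.2.0 (and especially Triton 2.2.0) could cause problems, since it hasn't been tested.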
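A quick way to answer the version question is to compare the installed package versions against the supported ones. This sketch uses `importlib.metadata`; the `EXPECTED` mapping is an assumption (torch 2.1.2 as stated above, plus triton 2.1.0, which is assumed here to be the version paired with torch 2.1.x wheels), and the helper names are hypothetical.

```python
from importlib import metadata

# Assumed supported versions for this vLLM release.
EXPECTED = {"torch": "2.1.2", "triton": "2.1.0"}


def installed_versions(packages):
    """Look up installed versions; None if a package is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def mismatches(installed, expected):
    """Return {package: (installed, expected)} for every version that differs."""
    result = {}
    for name, want in expected.items():
        have = installed.get(name)
        # Drop local build tags such as "+cu121" before comparing.
        if have is None or have.split("+")[0] != want:
            result[name] = (have, want)
    return result
```

In a real environment, `mismatches(installed_versions(EXPECTED), EXPECTED)` returns an empty dict when everything matches.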

i7uaboj47#

Actually, I'm now running this on a 32GB V100 (previously I was using the 16GB version). The script above gives me the following output:

[RequestOutput(request_id=0, prompt='Who is the president of the United States? ', prompt_token_ids=[1, 6526, 349, 272, 4951, 302, 272, 2969, 3543, 28804, 28705], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=' Joe Biden\n\nWho is the vice president of the United States?  Kamala Harris\n\nWho is the governor of the state of Texas?  Greg Abbott\n\nWho is the mayor of the city of San Antonio?  Ron Nirenberg\n\nWho is the president of the United States Senate?  Kamala Harris\n\nWho is the speaker of the United States House of Representatives?  Nancy Pelosi\n\nWho is the chief justice of the United States Supreme Court?  John Roberts\n\nWho is the president of the Texas Senate?  Dan Patrick\n\nWho is the speaker of the Texas', token_ids=[7833, 21377, 13, 13, 11447, 349, 272, 12465, 4951, 302, 272, 2969, 3543, 28804, 28705, 15346, 4575, 16692, 13, 13, 11447, 349, 272, 17116, 302, 272, 1665, 302, 7826, 28804, 28705, 10920, 15859, 1562, 13, 13, 11447, 349, 272, 11471, 302, 272, 2990, 302, 3652, 13172, 28804, 28705, 9975, 418, 536, 28711, 4146, 13, 13, 11447, 349, 272, 4951, 302, 272, 2969, 3543, 13442, 28804, 28705, 15346, 4575, 16692, 13, 13, 11447, 349, 272, 17153, 302, 272, 2969, 3543, 4594, 302, 17891, 5087, 28804, 28705, 18908, 18042, 12681, 13, 13, 11447, 349, 272, 9209, 10754, 302, 272, 2969, 3543, 14887, 6924, 28804, 28705, 2215, 18021, 13, 13, 11447, 349, 272, 4951, 302, 272, 7826, 13442, 28804, 28705, 4294, 13687, 13, 13, 11447, 349, 272, 17153, 302, 272, 7826], cumulative_logprob=-0.21824719565483264, logprobs=None, finish_reason=length)], finished=True, lora_request=None)]

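It looks like the kernel works as expected. I suspect the problem is related to Triton 2.2.0 (or possibly PyTorch 2.2.0). You could try that and, if possible, open an issue upstream with Triton describing the discrepancy. If it's related to the MoE kernel, you should be able to get a clean repro that needs only Triton code by using the tests in https://github.com/vllm-project/vllm/blob/main/tests/kernels/test_moe.py :)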
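The idea behind such a clean repro is to compare the fused kernel's output with a simple reference. The sketch below is a plain-Python top-2 mixture-of-experts for a single token: softmax over the router logits, keep the two highest-scoring experts, renormalize their weights, and sum the weighted expert outputs. It is not the code in the linked test file; each expert is a single matrix standing in for Mixtral's real gated MLP experts, and the `allclose` tolerance is an assumption suited to float16 comparisons.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_reference(x, router, experts, top_k=2):
    """Top-k mixture of experts for one token.

    x: hidden vector; router: one weight row per expert;
    experts: one matrix per expert (stand-ins for the real expert MLPs).
    """
    probs = softmax(matvec(router, x))
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the selected experts
    out = [0.0] * len(experts[0])
    for i in chosen:
        y = matvec(experts[i], x)
        out = [o + (probs[i] / norm) * v for o, v in zip(out, y)]
    return out


def allclose(a, b, atol=1e-2):
    """Element-wise comparison with an absolute tolerance."""
    return len(a) == len(b) and all(abs(u - v) <= atol for u, v in zip(a, b))
```

Feeding identical inputs to the kernel under test and to a reference like this, then checking the tolerance, isolates the MoE computation from the rest of the server.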

drnojrws8#

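Hi, I ran into this problem too. When testing Mixtral, the problem appears when using distributed workers (with or without Ray, by passing --tensor-parallel-workers 2), but not when using vLLM as a simple offline token generator. I initially thought it was a problem with generating Korean characters, but that doesn't seem to be the case. Testing vLLM with the Gradio chat example led me to conclude that it's a problem in the server code. In both cases I used AWQ 4-bit weights.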
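To compare the distributed and offline paths described above, one option is to run the same prompt with greedy decoding (`temperature=0` in `SamplingParams`) under both setups and find where the generated token IDs first differ; each `CompletionOutput` carries them in its `token_ids` field, as in the output printed earlier in this thread. `first_divergence` is a hypothetical helper. Small numerical differences between tensor-parallel sizes can legitimately cause late divergence in float16, so an early, abrupt divergence is the more telling signal.

```python
def first_divergence(tokens_a, tokens_b):
    """Index of the first position where two token-ID sequences differ, or None if identical.

    A length mismatch counts as a divergence at the end of the shorter sequence.
    """
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    if len(tokens_a) != len(tokens_b):
        return min(len(tokens_a), len(tokens_b))
    return None
```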

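Example of the error:
I asked: "전주에서 무얼 먹는게 좋을까?" (translation: what do you recommend for a meal in 전주?)
It answered:
"Yes," followed by infinite nothingness; the server somehow keeps generating tokens of nothingness forever.

(The output is streamed line by line.)

Information about my hardware:
2x Ada A6000
1x T400 (not used to run the LLM, only for the display)
Using the correct torch version (2.1.2)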

vLLM outputs garbage text with Mixtral 8x7B after upgrading to 0.3.0
