Bag of Words with Negative Words in Python

Posted by dvtswwa3 on 2024-01-05 in Python


I have this document.
It is not ordinary text;
it is a textbook of scientific terms.
The text in these documents looks like this:

RepID,Txt
1,K9G3P9 4H477 -Q207KL41 98464 ... Q207KL41
2,D84T8X4 -D9W4S2 -D9W4S2 8E8E65 ... D9W4S2 
3,-05L8NJ38 K2DD949 0W28DZ48 207441 ... K2D28K84

I can build a feature set with the BOW algorithm.
Here is my code:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def BOW(df):
  CountVec = CountVectorizer() # to use only bigrams: ngram_range=(2,2)
  Count_data = CountVec.fit_transform(df)
  Count_data = Count_data.astype(np.uint8)
  cv_dataframe=pd.DataFrame(Count_data.toarray(), columns=CountVec.get_feature_names_out(), index=df.index)  # <- HERE
  return cv_dataframe.astype(np.uint8)
df_reps = pd.read_csv("c:\\file.csv")
df = BOW(df_reps["Txt"])

The result is the count of each word in the "Txt" column.

RepID K9G3P9  4H477 -Q207KL41 98464 ... Q207KL41
1     2       8     3         2     ... 1
2     0       1     2         4     ... 2

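The trick here, and where I need help, is that some of these terms have a - in front of them, which should count as a negative value.
So if a text has the values Q207KL41 -Q207KL41 -Q207KL41,
the terms starting with - should be counted as negative, and the BOW for Q207KL41 is therefore -1.
Instead of having separate features for Q207KL41 and -Q207KL41, both should count toward the same term Q207KL41, one positively and the other negatively.
So the dataset after BOW would look like this: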

RepID K9G3P9  4H477 Q207KL41 98464 ... 
1     2       8     -2         2     ...
2     0       1     0         4     ...
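To make the rule concrete, the arithmetic for a single text can be sketched in plain Python. This is only an illustration of the counting rule described above, not a vectorizer; the function name `signed_counts` is made up for this example.

```python
from collections import Counter

def signed_counts(text):
    """Tally the tokens of one text; a leading '-' counts as -1 for the bare term."""
    counts = Counter()
    for token in text.split():
        if token.startswith("-"):
            counts[token[1:]] -= 1  # "-Q207KL41" lowers the count of "Q207KL41"
        else:
            counts[token] += 1
    return dict(counts)

print(signed_counts("Q207KL41 -Q207KL41 -Q207KL41"))  # {'Q207KL41': -1}
```

One positive and two negative occurrences net out to -1, which matches the value expected for Q207KL41 in the example above.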

How can I do this?

python

Source: https://stackoverflow.com/questions/77755646/bag-of-words-with-negative-words-in-python

1 Answer


5tmbdcev1#

This is probably different enough from ordinary bag-of-words vectorization that you are better off writing your own vectorizer.
Code:

import io
from collections import defaultdict

import numpy as np
import pandas as pd

s = """
RepID,Txt
1,K9G3P9 4H477 -Q207KL41 98464 Q207KL41
2,D84T8X4 -D9W4S2 -D9W4S2 8E8E65 D9W4S2
3,-05L8NJ38 K2DD949 0W28DZ48 207441 K2D28K84"""
df_reps = pd.read_csv(io.StringIO(s))

def BOW(documents):
    ret = []
    # Maps each new token to the next free column index (0, 1, 2, ...).
    vocabulary = defaultdict()
    vocabulary.default_factory = vocabulary.__len__
    for document in documents:
        # Signed count per column index for this document.
        feature_counter = defaultdict(int)
        for token in document.split():
            sign = 1
            if token[0] == "-":
                # A leading "-" counts against the bare term.
                token = token[1:]
                sign = -1
            feature_idx = vocabulary[token]
            feature_counter[feature_idx] += sign
        ret.append(feature_counter)
    df = pd.DataFrame.from_records(ret)
    # Terms that never occur in a document get a count of 0.
    df = df.fillna(0)
    # Columns appear in order of first occurrence, matching the vocabulary order.
    df.columns = vocabulary.keys()
    # int8 only holds counts from -128 to 127.
    df = df.astype(np.int8)
    return df

print(BOW(df_reps["Txt"]))

Output:

   K9G3P9  4H477  Q207KL41  98464  D84T8X4  D9W4S2  8E8E65  05L8NJ38  K2DD949  0W28DZ48  207441  K2D28K84
0       1      1         0      1        0       0       0         0        0         0       0         0
1       0      0         0      0        1      -1       1         0        0         0       0         0
2       0      0         0      0        0       0       0        -1        1         1       1         1
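The least obvious part of the code above is the vocabulary, which hands out column indices by itself. The sketch below isolates that trick using only the standard library; the helper name `build_vocabulary` is hypothetical and exists just for this illustration.

```python
from collections import defaultdict

def build_vocabulary(tokens):
    """Assign each distinct token the next free index, in order of first appearance."""
    vocabulary = defaultdict()
    # On a missing key, defaultdict calls default_factory() and stores the result.
    # len(vocabulary) at that moment is exactly the next unused index.
    vocabulary.default_factory = vocabulary.__len__
    for token in tokens:
        vocabulary[token]  # the lookup alone registers unseen tokens
    return dict(vocabulary)

print(build_vocabulary(["K9G3P9", "4H477", "K9G3P9", "Q207KL41"]))
# {'K9G3P9': 0, '4H477': 1, 'Q207KL41': 2}
```

Because a repeated token is already a key, looking it up again returns its existing index instead of creating a new one, so Q207KL41 and the stripped -Q207KL41 end up in the same column.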


Posted 2024-01-05
