I am using Spark 2.3.1 and running NLP on Spark. When I print the type of my RDD it shows <class 'pyspark.rdd.PipelinedRDD'>, and rdd.collect()
outputs:
['embodiments of the present invention include pairing two wireless devices placing at least one of the two devices in a pairing mode performing at least one pairing motion event on at least one wireless device satisfying at least one pairing condition detecting that at least one pairing condition is satisfied and pairing the two wireless devices in response to detecting that at least one pairing condition is satisfied', 'the present invention relates to a wireless communication system and more particularly to a method and apparatus for transmitting control information over a pucch in a wireless communication system comprising the steps of obtaining a plurality of second modulation symbol streams corresponding to a plurality of scfdma single carrier frequency division multiplexing symbols by spreading a plurality of first modulation symbol streams forming the first modulation symbol streams corresponding to the scfdma symbols in a first slot obtaining a plurality of complex symbol streams by performing dft discrete fourier transform precoding and transmitting the plurality of complex symbol streams over the pucch wherein the plurality of second modulation symbol streams are scrambled at the scfdma symbol level dog church aardwolf abacus']
I want to create a DataFrame like the one below, with each word added as its own row:
+--------------+
| text |
+--------------+
| embodiment |
| present |
| invention |
....
....
| aardwolf |
| abacus |
+--------------+
Here is my code:
import pyspark
import nltk
import string
from pyspark import SparkContext
from nltk.stem import WordNetLemmatizer
from pyspark.ml.feature import NGram
from pyspark.sql.types import ArrayType,StructType,StructField,StringType
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession.builder.appName('Spark Example').getOrCreate()
Source_path="Folder_of_multiple_text_file"
data=sc.textFile(Source_path)
lower_casetext = data.map(lambda x:x.lower())
# splitting_rdd = lower_casetext.map(lambda x:x.split(" "))
# print(splitting_rdd.collect())
# Function to perform Sentence tokeniaztion
def sent_TokenizeFunct(x):
    return nltk.sent_tokenize(x)
sentencetokenization_rdd = lower_casetext.map(sent_TokenizeFunct)
# Function to perform Word tokenization
def word_TokenizeFunct(x):
    splitted = [word for line in x for word in line.split()]
    return splitted
wordtokenization_rdd = sentencetokenization_rdd.map(word_TokenizeFunct)
# Remove Stop Words
def removeStopWordsFunct(x):
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))
    filteredSentence = [w for w in x if w not in stop_words]
    return filteredSentence
stopwordRDD = wordtokenization_rdd.map(removeStopWordsFunct)
# Remove Punctuation marks
def removePunctuationsFunct(x):
    list_punct = list(string.punctuation)
    filtered = [''.join(c for c in s if c not in list_punct) for s in x]
    filtered_space = [s for s in filtered if s]  # drop empty strings left after stripping punctuation
    return filtered_space
rmvPunctRDD = stopwordRDD.map(removePunctuationsFunct)
# Perform Lemmatization
def lemma(x):
    lemmatizer = WordNetLemmatizer()
    final_rdd = [lemmatizer.lemmatize(s) for s in x]
    return final_rdd
lem_wordsRDD = rmvPunctRDD.map(lemma)
# Join tokens
def joinTokensFunct(x):
    return " ".join(x)
joinedTokensRDD = lem_wordsRDD.map(joinTokensFunct)
print(joinedTokensRDD.collect())
print(type(joinedTokensRDD))
2 Answers

t98cgbkg1#
I changed the code by removing the join-tokens step and converting lem_wordsRDD directly into a DataFrame with the following code.
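The answer's original snippet was not captured in this excerpt; what follows is a minimal sketch of that idea, assuming lem_wordsRDD holds one list of lemmatized words per input document (the flatMap/Row approach is my reconstruction, not the answerer's exact code):

from pyspark.sql import Row

# Flatten the per-document word lists so every word becomes its own record,
# then wrap each word in a Row so Spark can build a one-column DataFrame.
words_df = (lem_wordsRDD
            .flatMap(lambda words: words)   # one word per RDD element
            .map(lambda w: Row(text=w))     # single column named "text"
            .toDF())
words_df.show(truncate=False)

Calling .toDF() on an RDD of Row objects works here because a SparkSession (spark) is already active.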
o4tp2gmn2#
Something like this, but adjust it to your case:
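The code for this answer is also missing from the excerpt; here is a hedged sketch of a common DataFrame-only approach: put each joined string into a one-column DataFrame, then split it on whitespace and explode the resulting array so each word gets its own row (the column name "text" and the use of joinedTokensRDD are my assumptions):

from pyspark.sql import functions as F

# One row per original document, single string column "text".
doc_df = joinedTokensRDD.map(lambda s: (s,)).toDF(["text"])

# split() turns the string into an array of words; explode() emits one row per word.
words_df = doc_df.select(F.explode(F.split(F.col("text"), r"\s+")).alias("text"))
words_df.show(truncate=False)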