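1. Install Spark
Base environment: pre-installation checks
Install Spark
Edit the relevant configuration files
Check the Spark configuration
Start Spark
Run code in pyspark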
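The environment checks listed above could be scripted roughly as follows. This is only a sketch under assumptions the post does not state: that Java should be on PATH, that the install sets a SPARK_HOME environment variable, and that pyspark should be importable from the current interpreter. The function name check_spark_environment is made up for illustration.

```python
import importlib.util
import os
import shutil


def check_spark_environment():
    """Return a dict of simple yes/no checks for a local Spark setup."""
    return {
        # Spark runs on the JVM, so a `java` executable should be on PATH.
        "java_on_path": shutil.which("java") is not None,
        # Assumption: the install exported SPARK_HOME (not shown in the post).
        "spark_home_set": bool(os.environ.get("SPARK_HOME")),
        # True when this interpreter can locate the pyspark package.
        "pyspark_importable": importlib.util.find_spec("pyspark") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_spark_environment().items():
        print("{0:<22}{1}".format(name, "OK" if ok else "missing"))
```

A "missing" entry only means that this particular check failed; it does not say which installation step went wrong.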
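Word-frequency count for an English text in Python
Prepare the text file
Read the file
# Read the whole file into one string; `with` closes the file afterwards
with open('lol.txt', 'r', encoding='UTF-8') as f:
    txt = f.read()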
Preprocessing: case, punctuation, stop words
txt = txt.lower()  # convert every word to lowercase
# Replace each punctuation character with a space
for ch in '!"@#$%^&*()+,-./:;<=>?@[\\]_`~{|}':
    txt = txt.replace(ch, " ")
words = txt.split()
stop_words = ['so', 'out', 'all', 'for', 'of', 'to', 'on', 'in', 'if',
              'by', 'under', 'it', 'at', 'into', 'with', 'about']
# Keep only the words that are not stop words
afterwords = [word for word in words if word not in stop_words]
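The preprocessing step can also be wrapped in a reusable function. The sketch below is one possible variant, not the post's own code: it uses str.translate with string.punctuation, which also strips apostrophes (the character list above does not), and stores the stop words in a set for faster lookups. The names preprocess and STOP_WORDS are made up for illustration.

```python
import string

# Same stop words as the list above, stored as a set.
STOP_WORDS = {"so", "out", "all", "for", "of", "to", "on", "in", "if",
              "by", "under", "it", "at", "into", "with", "about"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, replace ASCII punctuation with spaces, drop stop words."""
    # Map every character in string.punctuation to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return [w for w in words if w not in stop_words]


print(preprocess("It is all about the Game, for the win!"))
# ['is', 'the', 'game', 'the', 'win']
```

Words such as "is" and "the" survive because they are not in the stop-word list; extending the list is left to the reader.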
Count how many times each word occurs
counts = {}
for word in afterwords:
    counts[word] = counts.get(word, 0) + 1
# Sort the (word, count) pairs by count, highest first
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
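The standard library's collections.Counter does the same counting and sorting in one step. A minimal sketch, with a made-up helper name top_words and a made-up sample list:

```python
from collections import Counter


def top_words(words, n=None):
    """Return (word, count) pairs sorted by descending count."""
    # most_common(None) returns every pair; most_common(n) the top n.
    return Counter(words).most_common(n)


sample = ["game", "win", "game", "team", "win", "game"]
print(top_words(sample))     # [('game', 3), ('win', 2), ('team', 1)]
print(top_words(sample, 1))  # [('game', 3)]
```

Words with equal counts come out in the order they were first seen.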
Output in order of frequency
# Print each word left-aligned in a 20-character column, followed by its count
for word, count in items:
    print("{0:<20}{1}".format(word, count))
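Write the results to a file
# Write the sorted list of (word, count) pairs as a single string
with open("loltest.txt", "w", encoding='UTF-8') as f:
    f.write(str(items))
print("File written successfully")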
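Writing str(items) produces a Python list literal, which is awkward to read back. As an alternative sketch (not the post's method), the pairs could be written one per line as tab-separated text using the csv module. The helper names and the file name counts.tsv are made up for illustration.

```python
import csv
import os
import tempfile


def write_counts(items, path):
    """Write (word, count) pairs to a tab-separated UTF-8 file."""
    with open(path, "w", encoding="UTF-8", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(items)


def read_counts(path):
    """Read the pairs back, converting each count to int."""
    with open(path, "r", encoding="UTF-8", newline="") as f:
        return [(word, int(count)) for word, count in csv.reader(f, delimiter="\t")]


# Round trip in a temporary directory so no file is left behind.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "counts.tsv")
    write_counts([("game", 3), ("win", 2)], demo_path)
    print(read_counts(demo_path))  # [('game', 3), ('win', 2)]
```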
lol.txt is the English input text; loltest.txt is the file the script creates, and it holds the sorted list of (word, count) pairs.
Source: https://www.cnblogs.com/lhg-0825/p/15948656.html