1. Installing Spark

Base environment: check the prerequisites
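The prerequisite check can also be scripted. The sketch below is only an illustration: it assumes that a `java` executable on the PATH and the `JAVA_HOME` and `SPARK_HOME` environment variables are the things worth inspecting, and the helper name `check_environment` is made up for this example.

```python
import os
import shutil
import sys


def check_environment(env=None):
    """Return a dict describing prerequisites that Spark usually relies on.

    Assumption: a Java runtime on PATH plus JAVA_HOME / SPARK_HOME are the
    items to inspect; adjust the list for your own setup.
    """
    env = os.environ if env is None else env
    return {
        "python_version": "{}.{}".format(*sys.version_info[:2]),
        # shutil.which returns None when the executable cannot be found
        "java_on_path": shutil.which("java", path=env.get("PATH", "")) is not None,
        "JAVA_HOME": env.get("JAVA_HOME"),
        "SPARK_HOME": env.get("SPARK_HOME"),
    }


if __name__ == "__main__":
    for name, value in check_environment().items():
        print("{0:<16}{1}".format(name, value))
```

A value of `None` or `False` in the output points at something to install or export before moving on.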

Install Spark

Edit the configuration files

Check the Spark configuration

Start Spark

Run code in pyspark
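As an example of code that could be run in the pyspark shell, here is a minimal word count sketch. It is not the code from the original post: the function names `tokenize` and `spark_word_count` are hypothetical, and the tokenizer is a simplification that keeps only runs of ASCII letters. It relies on the `SparkContext` that the pyspark shell creates under the name `sc`.

```python
import re
from operator import add


def tokenize(line):
    """Lowercase a line and split it into runs of ASCII letters."""
    return re.findall(r"[a-z]+", line.lower())


def spark_word_count(sc, path, top_n=10):
    """Return the top_n (word, count) pairs for the text file at path.

    sc is a SparkContext; the pyspark shell creates one named `sc` on startup.
    """
    return (
        sc.textFile(path)              # one RDD element per line
        .flatMap(tokenize)             # one RDD element per word
        .map(lambda word: (word, 1))
        .reduceByKey(add)              # sum the 1s for each word
        .takeOrdered(top_n, key=lambda pair: -pair[1])
    )
```

Inside the shell this would be called as `spark_word_count(sc, "lol.txt")`. Depending on how Spark is configured, a local file may need a `file://` prefix in the path.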

Word frequency count for English text in Python

Prepare the text file

Read the file

with open('lol.txt', "r", encoding='UTF-8') as f:
    txt = f.read()

Preprocessing: case, punctuation, stop words

txt = txt.lower()  # convert every word to lowercase
# Replace each punctuation character with a space
for ch in '!"@#$%^&*()+,-./:;<=>?@[\\]_`~{|}':
    txt = txt.replace(ch, " ")
words = txt.split()
stop_words = ['so', 'out', 'all', 'for', 'of', 'to', 'on', 'in', 'if',
              'by', 'under', 'it', 'at', 'into', 'with', 'about']
# Keep only the words that are not stop words
afterwords = [word for word in words if word not in stop_words]
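The same preprocessing can be packaged as a reusable function. The sketch below is one possible variant, not the original author's code: it uses `str.translate` with the standard `string.punctuation` set, which also strips the apostrophe (the character list in the post does not), and the names `preprocess` and `STOP_WORDS` are chosen for illustration.

```python
import string

# Same stop words as in the post, stored in a frozenset for fast lookups
STOP_WORDS = frozenset([
    'so', 'out', 'all', 'for', 'of', 'to', 'on', 'in', 'if',
    'by', 'under', 'it', 'at', 'into', 'with', 'about',
])


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase text, blank out ASCII punctuation, and drop stop words."""
    # Map every punctuation character to a space in a single pass
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return [word for word in words if word not in stop_words]
```

A single translation table avoids rescanning the whole text once per punctuation character, which matters mainly for large inputs.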

Count the occurrences of each word

counts = {}
for word in afterwords:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
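The standard library offers a shorter route to the same counting and sorting. The following is a sketch using `collections.Counter`; the wrapper name `top_words` is hypothetical.

```python
from collections import Counter


def top_words(words, n=None):
    """Return (word, count) pairs, most frequent first.

    With n=None every distinct word is returned.
    """
    return Counter(words).most_common(n)
```

`top_words(afterwords)` would produce a list of `(word, count)` tuples in the same shape as `items`.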

Print the words in descending order of frequency

for word, count in items:
    print("{0:<20}{1}".format(word, count))

Write the results to a file

with open("loltest.txt", "w", encoding='UTF-8') as f:
    f.write(str(items))
print("File written successfully")
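Writing `str(items)` stores the Python representation of the list, which is awkward to read back. If the counts need to be reloaded later, a CSV file is one alternative. The sketch below assumes that format is acceptable; the function names `save_counts` and `load_counts` and the `word`/`count` header are made up for this example.

```python
import csv


def save_counts(items, path):
    """Write (word, count) pairs to a CSV file with a header row."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count"])
        writer.writerows(items)


def load_counts(path):
    """Read the pairs back as a list of (word, int count) tuples."""
    with open(path, "r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return [(word, int(count)) for word, count in reader]
```

The order of the rows is preserved, so a list sorted by frequency stays sorted after a round trip.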

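lol.txt is the English input text; loltest.txt is the file created by running the script, and it holds the sorted list of (word, count) pairs.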

Source: https://www.cnblogs.com/lhg-0825/p/15948656.html

Installing Spark and Python practice