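2. Installing Spark and Python Practice

I. Installing Spark

Configuration files

Test-running Python code

II. Python programming exercise: word frequency count for English text

1. Prepare a text file

2. Read the file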

path = '/home/hadoop/wc/f1.txt'
# Read the whole file into a single string
with open(path) as f:
    txt = f.read()
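If the input file might be missing or might contain non-ASCII characters, passing an explicit encoding and handling the error can make the step more robust. The following is only a sketch: the helper name `read_text` and the choice of UTF-8 are assumptions, not part of the original exercise.

```python
from pathlib import Path


def read_text(path, encoding="utf-8"):
    """Return the whole file as one string (hypothetical helper)."""
    # Assumption: the file is UTF-8 encoded; change `encoding` if it is not.
    return Path(path).read_text(encoding=encoding)


if __name__ == "__main__":
    try:
        txt = read_text("/home/hadoop/wc/f1.txt")
        print(len(txt), "characters read")
    except FileNotFoundError:
        print("Input file not found; check the path.")
```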

3. Preprocessing

Convert uppercase to lowercase

txt = txt.lower()

Punctuation

# Replace each punctuation character with a space, then split on whitespace
for ch in '!"@#$%^&*()+,-./:;<=>?@[\\]_`~{|}':
    txt = txt.replace(ch, " ")
words = txt.split()
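The character-by-character `replace` loop above can also be written as a single pass with `str.translate`. This is a sketch of that alternative, not the original author's method. The function name `tokenize` is hypothetical, and `string.punctuation` also contains the apostrophe, which the hand-written list above does not, so results can differ for words such as "it's".

```python
import string


def tokenize(text):
    """Lowercase text, turn ASCII punctuation into spaces, split on whitespace."""
    # Build a table mapping every character in string.punctuation to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()


print(tokenize("Hello, World! It's a test-case."))
```

Because the apostrophe is treated as punctuation here, "It's" is split into "it" and "s".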

Stop words

stop_words = ['so', 'out', 'all', 'for', 'of', 'to', 'on', 'in', 'if', 'by',
              'under', 'it', 'at', 'into', 'with', 'about', 'i', 'am', 'are',
              'is', 'a', 'the', 'and', 'that', 'before', 'her', 'she', 'my',
              'be', 'an', 'from', 'would', 'me', 'got']
# Keep only the words that are not in the stop-word list.
# This must run before counting, because the counting step reads afterwords.
afterwords = [word for word in words if word not in stop_words]

4. Count the occurrences of each word

counts = {}
for word in afterwords:
    counts[word] = counts.get(word, 0) + 1

5. Sort by word frequency

# Sort the (word, count) pairs by count, highest first, and print them
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items:
    print("{0:<20}{1}".format(word, count))
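The filtering, counting, and sorting steps can be combined using `collections.Counter` from the standard library. The sketch below is one possible alternative under a few assumptions: the function name `word_frequencies`, the short `STOP_WORDS` set, and the sample sentence are all hypothetical and only serve the illustration. A set is used for the stop words so that each membership test does not scan the whole list.

```python
from collections import Counter

# Hypothetical, shortened stop-word set used only for this illustration.
STOP_WORDS = {"the", "a", "of", "and"}


def word_frequencies(words, stop_words=STOP_WORDS):
    """Return (word, count) pairs sorted by descending count."""
    filtered = (w for w in words if w not in stop_words)
    # most_common() with no argument returns every pair, highest count first.
    return Counter(filtered).most_common()


sample = "the cat and the hat and the cat".split()
for word, count in word_frequencies(sample):
    print("{0:<20}{1}".format(word, count))
```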

##### 6. Write the results to a file

# The with block closes the file, so the data is flushed to disk
with open("test01.txt", "w", encoding="UTF-8") as f1:
    f1.write(str(items))
print("File written successfully")
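Writing `str(items)` stores the Python list representation on a single line, which is awkward to read back. One alternative is to write one tab-separated `word<TAB>count` pair per line with the standard `csv` module. This is only a sketch: the helper names `save_counts` and `load_counts` are hypothetical and not part of the original exercise.

```python
import csv


def save_counts(items, path):
    """Write (word, count) pairs to `path`, one tab-separated pair per line."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(items)


def load_counts(path):
    """Read the pairs back as a list of (word, int count) tuples."""
    with open(path, encoding="utf-8", newline="") as f:
        return [(word, int(count)) for word, count in csv.reader(f, delimiter="\t")]
```

For example, `save_counts(items, "test01.tsv")` would write the sorted list from step 5, and `load_counts("test01.tsv")` would return the same pairs; the `.tsv` filename is likewise just an illustration.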

##### 7. Results
![image](https://www.icode9.com/i/l/?n=22&i=blog/2369516/202203/2369516-20220302114934655-253627971.png)

Source: https://www.cnblogs.com/yyxxll/p/15954420.html