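Crawler Proxy IP Pool and Tunnel Proxy
Table of Contents
- Crawler Proxy IP Pool and Tunnel Proxy
- 1. Proxy IP Pool
- 1.1 Introduction
- 1.2 Implementation
- 1.3 Testing
- 2. Tunnel Proxy
- 2.1 Introduction
- 2.2 Implementation
- 2.2.1 Directory Structure
- 2.2.2 Configuration Files
- 2.2.3 openresty
- 2.3 Testing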
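In day-to-day development you occasionally need to scrape web pages, and a proxy IP pool is a common way to hide the machine's real IP. This article builds an easier-to-use tunnel proxy on top of OpenResty and a proxy IP pool.
1. Proxy IP Pool
1.1 Introduction
A proxy IP pool maintains a queue of usable proxy IPs in a database. A typical implementation works as follows:
- Periodically fetch proxy IP lists from free or paid proxy sites;
- Store the proxy IPs in Redis as a Hash;
- Periodically check the availability of each proxy IP and remove the ones that no longer work;
- Expose an API for managing the proxy IP pool.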
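The four steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how jhao104/proxy_pool is implemented: a plain dict stands in for the Redis Hash, and the `ProxyPool` class, its method names, and the `checker` callback are hypothetical names introduced here.

```python
import random
import time
from typing import Callable, Dict, Iterable, Optional


class ProxyPool:
    """In-memory stand-in for the Redis Hash: maps "ip:port" to its last check time."""

    def __init__(self, checker: Callable[[str], bool]) -> None:
        # `checker` is a hypothetical callback returning True when a proxy works.
        self._checker = checker
        self._proxies: Dict[str, float] = {}

    def add(self, candidates: Iterable[str]) -> None:
        """Steps 1 and 2: store freshly fetched proxies in the hash."""
        for proxy in candidates:
            self._proxies.setdefault(proxy, 0.0)

    def validate(self) -> None:
        """Step 3: re-check every proxy and drop the ones that fail."""
        for proxy in list(self._proxies):
            if self._checker(proxy):
                self._proxies[proxy] = time.time()
            else:
                del self._proxies[proxy]

    def get(self) -> Optional[str]:
        """Step 4: hand out a random proxy, or None when the pool is empty."""
        if not self._proxies:
            return None
        return random.choice(list(self._proxies))

    def count(self) -> int:
        """Number of proxies currently held in the pool."""
        return len(self._proxies)
```

In a real deployment, `add` and `validate` would run on a schedule, and `get` would sit behind an HTTP endpoint such as the `/get` route queried in the test script of section 1.3.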
1.2 Implementation
Here I use the open-source project jhao104/proxy_pool; see its documentation for implementation details.
1.3 Testing
import json

import requests
from retrying import retry


def get_proxy_ip() -> str:
    """Fetch one proxy from the pool's API and return it as an HTTP proxy URL."""
    resp = requests.get(url="http://192.168.0.121:5010/get", timeout=5)
    assert resp.status_code == 200
    return f"http://{json.loads(resp.text)['proxy']}"


# Retry up to 5 times, since any single proxy from the pool may be dead.
@retry(stop_max_attempt_number=5)
def proxy_test() -> None:
    resp = requests.get(url="http://httpbin.org/get", proxies={"http": get_proxy_ip()}, timeout=5)
    assert resp.status_code == 200
    # httpbin echoes the caller's IP in "origin"; it should be the proxy's IP.
    print(f"origin: {json.loads(resp.text)['origin']}")


if __name__ == "__main__":
    try:
        proxy_test()
    except Exception as e:
        print(f"Error: {e}.")
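2. Tunnel Proxy
2.1 Introduction
The proxy IP pool hides the machine's real IP, but the client has to call the API for a new proxy IP before every request, which is inconvenient. A tunnel proxy solves this: it exposes a single proxy address and internally forwards each request through a different proxy IP.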
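To make the idea concrete, here is a minimal sketch of a tunnel at the TCP level, written with Python's asyncio rather than OpenResty. Each incoming connection is relayed byte for byte to a randomly chosen upstream proxy, so the client only ever sees one address. The names `pick_upstream`, `start_tunnel`, and the `Upstream` alias are assumptions made for this illustration, and the upstream list is passed in directly instead of being read from Redis.

```python
import asyncio
import random
from typing import List, Tuple

Upstream = Tuple[str, int]


def pick_upstream(upstreams: List[Upstream]) -> Upstream:
    """Choose one upstream proxy at random for a new client connection."""
    if not upstreams:
        raise ValueError("no upstream proxies available")
    return random.choice(upstreams)


async def _pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Copy bytes from reader to writer until EOF, then close the writer."""
    try:
        while True:
            data = await reader.read(65536)
            if not data:
                break
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass  # the peer went away; nothing more to relay
    finally:
        writer.close()


async def start_tunnel(listen_host: str, listen_port: int,
                       upstreams: List[Upstream]) -> asyncio.AbstractServer:
    """Listen on one address and relay each connection to a random upstream."""

    async def handle_client(client_reader: asyncio.StreamReader,
                            client_writer: asyncio.StreamWriter) -> None:
        try:
            host, port = pick_upstream(upstreams)
            up_reader, up_writer = await asyncio.open_connection(host, port)
        except (OSError, ValueError):
            client_writer.close()  # no usable upstream: drop the client
            return
        # Relay both directions concurrently until either side closes.
        await asyncio.gather(
            _pipe(client_reader, up_writer),
            _pipe(up_reader, client_writer),
        )

    return await asyncio.start_server(handle_client, listen_host, listen_port)
```

Because the relay never inspects the bytes, the client still speaks the ordinary HTTP proxy protocol, just to whichever upstream was picked for that connection.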
2.2 Implementation
Here I implement the tunnel proxy with OpenResty on top of the proxy IP pool set up above.
2.2.1 Directory Structure
openresty
├── conf.d
│ └── tunnel-proxy.stream
├── docker.sh
└── nginx.conf
2.2.2 Configuration Files
- nginx.conf is the main OpenResty configuration file. The main change is that it includes the stream configuration files. Its full content is as follows:
# nginx.conf  --  docker-openresty
#
# This file is installed to:
#   `/usr/local/openresty/nginx/conf/nginx.conf`
# and is the file loaded by nginx at startup,
# unless the user specifies otherwise.
#
# It tracks the upstream OpenResty's `nginx.conf`, but removes the `server`
# section and adds this directive:
#     `include /etc/nginx/conf.d/*.conf;`
#
# The `docker-openresty` file `nginx.vh.default.conf` is copied to
# `/etc/nginx/conf.d/default.conf`.  It contains the `server` section
# of the upstream `nginx.conf`.
#
# See https://github.com/openresty/docker-openresty/blob/master/README.md#nginx-config-files
#

#user  nobody;
#worker_processes 1;

# Enables the use of JIT for regular expressions to speed-up their processing.
pcre_jit on;

#error_log  logs/error.log;
#error_log  logs/error.log  notice;
#error_log  logs/error.log  info;

#pid        logs/nginx.pid;

events {
    worker_connections  1024;
}

http {
    include       mime.types;
    default_type  application/octet-stream;

    # Enables or disables the use of underscores in client request header fields.
    # When the use of underscores is disabled, request header fields whose names contain underscores are marked as invalid and become subject to the ignore_invalid_headers directive.
    # underscores_in_headers off;

    #log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
    #                  '$status $body_bytes_sent "$http_referer" '
    #                  '"$http_user_agent" "$http_x_forwarded_for"';

    #access_log  logs/access.log  main;

    # Log in JSON Format
    # log_format nginxlog_json escape=json '{ "timestamp": "$time_iso8601", '
    # '"remote_addr": "$remote_addr", '
    # '"body_bytes_sent": $body_bytes_sent, '
    # '"request_time": $request_time, '
    # '"response_status": $status, '
    # '"request": "$request", '
    # '"request_method": "$request_method", '
    # '"host": "$host",'
    # '"upstream_addr": "$upstream_addr",'
    # '"http_x_forwarded_for": "$http_x_forwarded_for",'
    # '"http_referrer": "$http_referer", '
    # '"http_user_agent": "$http_user_agent", '
    # '"http_version": "$server_protocol", '
    # '"nginx_access": true }';
    # access_log /dev/stdout nginxlog_json;

    # See Move default writable paths to a dedicated directory (#119)
    # https://github.com/openresty/docker-openresty/issues/119
    client_body_temp_path /var/run/openresty/nginx-client-body;
    proxy_temp_path       /var/run/openresty/nginx-proxy;
    fastcgi_temp_path     /var/run/openresty/nginx-fastcgi;
    uwsgi_temp_path       /var/run/openresty/nginx-uwsgi;
    scgi_temp_path        /var/run/openresty/nginx-scgi;

    sendfile        on;
    #tcp_nopush     on;

    #keepalive_timeout  0;
    keepalive_timeout  65;

    #gzip  on;

    include /etc/nginx/conf.d/*.conf;

    # Don't reveal OpenResty version to clients.
    # server_tokens off;
}

stream {
    log_format proxy '$remote_addr [$time_local] '
                     '$protocol $status $bytes_sent $bytes_received '
                     '$session_time "$upstream_addr" '
                     '"$upstream_bytes_sent" "$upstream_bytes_received" "$upstream_connect_time"';

    access_log /usr/local/openresty/nginx/logs/access.log proxy;
    error_log /usr/local/openresty/nginx/logs/error.log notice;
    open_log_file_cache off;

    include /etc/nginx/conf.d/*.stream;
}
- tunnel-proxy.stream configures the tunnel proxy: it queries Redis for a proxy IP and forwards the request through that proxy IP to the target address. Its content is as follows:
# tunnel-proxy.stream

upstream backend {
    # Placeholder address; the real peer is set in balancer_by_lua_block.
    server 0.0.0.0:9870;

    balancer_by_lua_block {
        local balancer = require "ngx.balancer"
        local host = ngx.ctx.proxy_host
        local port = ngx.ctx.proxy_port

        local success, msg = balancer.set_current_peer(host, port)
        if not success then
            ngx.log(ngx.ERR, "Failed to set the peer. Error: ", msg, ".")
        end
    }
}

server {
    # Port on which the tunnel proxy listens for clients
    listen 9870;
    listen [::]:9870;

    proxy_connect_timeout 10s;
    proxy_timeout 10s;
    proxy_pass backend;

    preread_by_lua_block {
        local redis = require("resty.redis")
        local redis_instance = redis:new()
        redis_instance:set_timeout(3000)

        -- Redis host
        local rhost = "192.168.0.121"
        -- Redis port
        local rport = 6379
        -- Redis database
        local database = 0
        -- Redis Hash key name
        local rkey = "use_proxy"

        local success, msg = redis_instance:connect(rhost, rport)
        if not success then
            ngx.log(ngx.ERR, "Failed to connect to redis. Error: ", msg, ".")
            return
        end

        redis_instance:select(database)
        local proxys, msg = redis_instance:hkeys(rkey)
        -- Stop on a Redis error or an empty pool; math.random(0) would raise an error.
        if not proxys or #proxys == 0 then
            ngx.log(ngx.ERR, "Proxys num error. Error: ", msg, ".")
            return redis_instance:close()
        end

        -- Pick one "ip:port" entry at random and split it at the colon.
        math.randomseed(tostring(ngx.now()):reverse():sub(1, 6))
        local proxy = proxys[math.random(#proxys)]
        local colon_index = string.find(proxy, ":")
        local proxy_ip = string.sub(proxy, 1, colon_index - 1)
        local proxy_port = tonumber(string.sub(proxy, colon_index + 1))
        ngx.log(ngx.NOTICE, "Proxy: ", proxy, ", ip: ", proxy_ip, ", port: ", proxy_port, ".");

        -- Hand the chosen peer to balancer_by_lua_block via ngx.ctx.
        ngx.ctx.proxy_host = proxy_ip
        ngx.ctx.proxy_port = proxy_port
        redis_instance:close()
    }
}
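The selection step in `preread_by_lua_block` comes down to three operations: list the keys of the hash, pick one at random, and split it at the colon into host and port. The Python sketch below mirrors that logic so it can be tried locally, including the two edge cases worth guarding against (an empty pool and a malformed entry). The function names `split_host_port` and `choose_proxy` are hypothetical, a plain list stands in for the result of `HKEYS`, and unlike the Lua code it splits at the last colon rather than the first.

```python
import random
from typing import Optional, Sequence, Tuple


def split_host_port(proxy: str) -> Tuple[str, int]:
    """Split "host:port" at the last colon and convert the port to an integer."""
    host, sep, port = proxy.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"malformed proxy entry: {proxy!r}")
    return host, int(port)


def choose_proxy(keys: Sequence[str],
                 rng: Optional[random.Random] = None) -> Tuple[str, int]:
    """Pick one entry at random from the hash keys and parse it."""
    if not keys:
        raise LookupError("proxy pool is empty")
    rng = rng or random.Random()
    return split_host_port(rng.choice(list(keys)))
```

Passing a seeded `random.Random` as `rng` makes the choice reproducible, which is convenient when testing the parsing separately from the randomness.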
2.2.3 openresty
OpenResty is started with docker. For convenience, I saved the docker command in a shell file, docker.sh, with the following content:
docker run --name openresty -itd --restart always \
    -p 9870:9870 \
    -v "$PWD/nginx.conf":/usr/local/openresty/nginx/conf/nginx.conf \
    -v "$PWD/conf.d":/etc/nginx/conf.d \
    -e LANG=C.UTF-8 \
    -e TZ=Asia/Shanghai \
    --log-driver json-file \
    --log-opt max-size=1g \
    --log-opt max-file=3 \
    openresty/openresty:alpine
Run bash docker.sh to start OpenResty. The tunnel proxy is now set up.
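Before running the full test, it can help to confirm that the mapped port actually accepts TCP connections. The following is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper name, not part of any tool used in this article.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

For the setup in this article, the call would be `port_is_open("192.168.0.121", 9870)`. A True result only shows that the container is listening; it does not say whether the proxies in the pool are usable.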
2.3 Testing
import json

import requests
from retrying import retry

# The tunnel's single, fixed proxy address
proxies = {
    "http": "http://192.168.0.121:9870"
}


# Retry up to 5 times, since the tunnel may pick a dead proxy for a request.
@retry(stop_max_attempt_number=5)
def proxy_test() -> None:
    resp = requests.get(url="http://httpbin.org/get", proxies=proxies, timeout=5)
    assert resp.status_code == 200
    print(f"origin: {json.loads(resp.text)['origin']}")


if __name__ == "__main__":
    try:
        proxy_test()
    except Exception as e:
        print(f"Error: {e}.")
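A single successful request shows that the tunnel works, but not that the exit IP actually rotates. One way to check is to repeat the request and collect the distinct `origin` values. The sketch below assumes a hypothetical helper, `collect_origins`, that takes the request as an injected callable (`fetch_origin`), so the counting logic can be tried without network access.

```python
from typing import Callable, Set


def collect_origins(fetch_origin: Callable[[], str], attempts: int = 10) -> Set[str]:
    """Call fetch_origin repeatedly and return the distinct exit IPs seen."""
    origins: Set[str] = set()
    for _ in range(attempts):
        try:
            origins.add(fetch_origin())
        except Exception as exc:
            # A dead proxy should not abort the whole run.
            print(f"request failed: {exc}")
    return origins
```

With the test script above, `fetch_origin` could be a small function that performs the same `requests.get` call through `proxies` and returns the `origin` field. If the pool holds several working proxies, more than one distinct value should turn up over enough attempts.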
References:
- Create a Tunnel Proxy in Just 5 Minutes - Zhihu (zhihu.com)
- Setting Up a Forward Proxy with openresty - Jianshu (jianshu.com)
Source: https://www.cnblogs.com/xiaoQQya/p/16305232.html