Web Scraping Challenges

Level 1: Getting started with the requests and lxml libraries
```python
import requests
from lxml import etree

url = 'http://spiderbuf.cn/s01/'
html = requests.get(url).text

# Save the raw page for inspection
f = open('01.html', 'w', encoding='utf-8')
f.write(html)
f.close()

# Parse the HTML and walk every table row
root = etree.HTML(html)
trs = root.xpath('//tr')

f = open('demo01.txt', 'w', encoding='utf-8')
for tr in trs:
    s = ''
    tds = tr.xpath('./td')
    for td in tds:
        s = s + str(td.text) + '| '
    print(s)
    if s != '':
        f.write(s + '\n')
f.close()
```
Level 2: Analyzing HTTP requests and constructing headers

This level uses a basic anti-scraping technique. The only difference from Level 1 is that the request must carry headers; without them no data is returned.
```python
import requests
from lxml import etree

url = 'http://spiderbuf.cn/s02/'
myheaders = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'}
html = requests.get(url, headers=myheaders).text

f = open('01.html', 'w', encoding='utf-8')
f.write(html)
f.close()

root = etree.HTML(html)
trs = root.xpath('//tr')

f = open('demo01.txt', 'w', encoding='utf-8')
for tr in trs:
    s = ''
    tds = tr.xpath('./td')
    for td in tds:
        s = s + str(td.text) + '| '
    print(s)
    if s != '':
        f.write(s + '\n')
f.close()
```
warning: on some sites a single fixed request header is not enough; rotate through different User-Agent values across requests (see the sketch below).
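A minimal sketch of rotating User-Agents between requests; the UA strings in the pool and the fetch helper are illustrative and not part of the original code:

```python
import random
import requests

# Hypothetical pool of User-Agent strings to rotate through
UA_POOL = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
]

def fetch(url):
    # Pick a different User-Agent for each request
    headers = {'User-Agent': random.choice(UA_POOL)}
    return requests.get(url, headers=headers).text
```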
Level 3: More advanced lxml syntax and parsing practice

The data cells are now wrapped in extra nested tags. Only the loop from the Level 2 code needs to change: replace str(td.text) with str(td.xpath('string(.)')).
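A quick illustration of the difference, using a made-up nested cell: .text only returns the text sitting directly under the <td> before its first child element, while string(.) concatenates all descendant text.

```python
from lxml import etree

# Made-up cell whose content is wrapped in an extra tag, for illustration only
root = etree.HTML('<table><tr><td><span>Ubuntu</span> 20.04</td></tr></table>')
td = root.xpath('//td')[0]
print(td.text)                # None - the text lives inside <span>, not directly under <td>
print(td.xpath('string(.)'))  # 'Ubuntu 20.04' - all text inside the cell
```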
```python
import requests
from lxml import etree

url = 'http://spiderbuf.cn/s03/'
myheaders = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'}
html = requests.get(url, headers=myheaders).text

f = open('01.html', 'w', encoding='utf-8')
f.write(html)
f.close()

root = etree.HTML(html)
trs = root.xpath('//tr')

f = open('demo01.txt', 'w', encoding='utf-8')
for tr in trs:
    s = ''
    tds = tr.xpath('./td')
    for td in tds:
        s = s + str(td.xpath('string(.)')) + '| '
    print(s)
    if s != '':
        f.write(s + '\n')
f.close()
```
Level 4: Analyzing pagination parameters and scraping across pages

When the total number of pages is known, the number of loop iterations can be set directly.

warning: the URL that actually returns the data is not necessarily the one shown in the browser address bar; press F12 and look up the URL of the request that carries the data.
```python
import requests
from lxml import etree

base_url = 'http://spiderbuf.cn/s04/?pageno=%d'
myheaders = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'}

# The total page count is known to be 5, so the loop bound is fixed
for i in range(1, 6):
    url = base_url % i
    html = requests.get(url, headers=myheaders).text

    f = open('01%d.html' % i, 'w', encoding='utf-8')
    f.write(html)
    f.close()

    root = etree.HTML(html)
    trs = root.xpath('//tr')

    f = open('demo01%d.txt' % i, 'w', encoding='utf-8')
    for tr in trs:
        s = ''
        tds = tr.xpath('./td')
        for td in tds:
            s = s + str(td.xpath('string(.)')) + '| '
        print(s)
        if s != '':
            f.write(s + '\n')
    f.close()
```
warning: Python %-formatting takes the form <template containing placeholders> % (arguments), e.g. url = base_url % i
Alternatively, read the total page count out of a page element and feed it into the loop:
```python
import re  # needed for the digit extraction below

url = base_url % 1
html = requests.get(url, headers=myheaders).text
root = etree.HTML(html)
lis = root.xpath('//ul[@class="pagination"]/li')
page_num = lis[0].xpath('string(.)')  # text of the first pagination item, which contains the page count
ls = re.findall('[0-9]', page_num)
max_page = int(ls[0])
```
** The pageno parameter of a pagination component is often paired with another parameter, pagesize (how many rows are shown per page). In that case you can simply append &pagesize=<total row count> to the first page's URL, or use F12 to find the element that controls rows per page and substitute that value for pagesize (see the sketch below). **
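A sketch of that shortcut, assuming the endpoint honours a pagesize query parameter and that the total row count (55 here) has been read off the page or from DevTools; both are assumptions to verify:

```python
import requests

# Assumption: s04 accepts a pagesize parameter; 55 is a hypothetical total row count
url = 'http://spiderbuf.cn/s04/?pageno=1&pagesize=55'
html = requests.get(url, headers=myheaders).text  # myheaders as defined in the Level 4 code
```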
Level 5: Scraping images from a page and saving them locally

Find the request URL of each image, request that full URL, and write the response body to a local file in binary mode.
```python
import requests
from lxml import etree

url = 'http://spiderbuf.cn/s05/'
myheaders = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'}
html = requests.get(url, headers=myheaders).text

f = open('01.html', 'w', encoding='utf-8')
f.write(html)
f.close()

# Collect the src attribute of every image on the page
root = etree.HTML(html)
imgs = root.xpath('//img/@src')
print(imgs)

for i in imgs:
    # Request the absolute image URL and save the raw bytes
    img_data = requests.get('http://spiderbuf.cn' + i, headers=myheaders).content
    img = open(str(i).replace('/', ''), 'wb')
    img.write(img_data)
    img.close()
```
Level 6: Analyzing pages with an iframe (a page embedded inside a page) and scraping the data

warning: take the Level 3 code and replace the request URL with the real URL of the data (the iframe's source page).
```python
import requests
from lxml import etree

url = 'http://spiderbuf.cn/inner/'
myheaders = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'}
html = requests.get(url, headers=myheaders).text

f = open('01.html', 'w', encoding='utf-8')
f.write(html)
f.close()

root = etree.HTML(html)
trs = root.xpath('//tr')

f = open('demo01.txt', 'w', encoding='utf-8')
for tr in trs:
    s = ''
    tds = tr.xpath('./td')
    for td in tds:
        s = s + str(td.xpath('string(.)')) + '|'
    print(s)
    if s != '':
        f.write(s + '\n')
f.close()
```
Level 7: Scraping data loaded dynamically via Ajax
```python
import requests
import json

url = 'http://spiderbuf.cn/iplist/?order=asc'
myheaders = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'}

data_json = requests.get(url, headers=myheaders)
data_json.encoding = 'utf-8'  # set the encoding on the Response before reading .text
data_json = data_json.text

f = open('01.html', 'w', encoding='utf-8')
f.write(data_json)
f.close()

# Parse the JSON response into a list of dicts
data = json.loads(data_json)

f = open('demo01.txt', 'w', encoding='utf-8')
for i in data:
    print(i)
    s = '%s|%s|%s|%s|%s|%s|%s|\n' % (
        i['ip'], i['mac'], i['name'], i['type'],
        i['manufacturer'], i['ports'], i['status'])
    f.write(s)
f.close()
```
warning: set the encoding on the response before reading its content, not after the body has already been converted to a string via .text

warning: converting JSON to Python objects requires importing the json library (part of the standard library)
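A minimal sketch of that ordering: keep the Response object around, set its encoding, and only then read .text:

```python
import requests

resp = requests.get(url, headers=myheaders)  # url / myheaders as in the Level 7 code
resp.encoding = 'utf-8'   # set the encoding on the Response object first
data_json = resp.text     # .text now decodes the body using the encoding set above
```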
Level 8: Scraping data served via HTTP POST requests
```python
import requests
from lxml import etree

url = 'http://spiderbuf.cn/s08/'
myheaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'
}
payload = {'level': '8'}
html = requests.post(url, headers=myheaders, data=payload).text

f = open('01.html', 'w', encoding='utf-8')
f.write(html)
f.close()

root = etree.HTML(html)
trs = root.xpath('//tr')

f = open('demo01.txt', 'w', encoding='utf-8')
for tr in trs:
    s = ''
    tds = tr.xpath('./td')
    for td in tds:
        s = s + str(td.text) + '| '
    print(s)
    if s != '':
        f.write(s + '\n')
f.close()
```
Level 9: Logging in with a username and password and scraping back-end data

Use F12 to determine the method of the data request (GET, POST, etc.), build a form with the username and password, and send it to the login URL.
```python
import requests
from lxml import etree

url = 'http://spiderbuf.cn/e01/login'
myheaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'
}
payload = {'username': 'admin', 'password': '123456'}
html = requests.post(url, headers=myheaders, data=payload).text

f = open('01.html', 'w', encoding='utf-8')
f.write(html)
f.close()

root = etree.HTML(html)
trs = root.xpath('//tr')

f = open('demo01.txt', 'w', encoding='utf-8')
for tr in trs:
    s = ''
    tds = tr.xpath('./td')
    for td in tds:
        s = s + str(td.text) + '| '
    print(s)
    if s != '':
        f.write(s + '\n')
f.close()
```
Level 10: Logging in past a captcha and scraping

Use F12 to determine the request method (GET, POST, etc.), build the username/password form together with the Cookie header, and send the request to the URL that serves the data.
```python
import requests
from lxml import etree

url = 'http://spiderbuf.cn/e02/list'
myheaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0',
    'Cookie': 'admin=dd40071182672e688d65a0b8774a0293;'
}
payload = {'username': 'admin', 'password': '123456'}
html = requests.post(url, headers=myheaders, data=payload).text

f = open('01.html', 'w', encoding='utf-8')
f.write(html)
f.close()

root = etree.HTML(html)
trs = root.xpath('//tr')

f = open('demo01.txt', 'w', encoding='utf-8')
for tr in trs:
    s = ''
    tds = tr.xpath('./td')
    for td in tds:
        s = s + str(td.text) + '| '
    print(s)
    if s != '':
        f.write(s + '\n')
f.close()
```
Level 11: Pagination without sequential page numbers
Use F12 to find the elements that hold the page tokens, join each one onto the base URL, and request every page to collect the data.
```python
import requests
from lxml import etree
import re

base_url = 'http://spiderbuf.cn/e03'
myheaders = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'}

html = requests.get(base_url, headers=myheaders).text
root = etree.HTML(html)
# Collect the href of every pagination link (the page tokens)
lis = root.xpath('//ul[@class="pagination"]/li/a/@href')
print(lis)

j = 1
for i in lis:
    # Turn the relative href into a path and join it onto the base URL
    i = i.replace('.', '')
    url = base_url + i
    html = requests.get(url, headers=myheaders).text

    f = open('01%d.html' % j, 'w', encoding='utf-8')
    f.write(html)
    f.close()

    root = etree.HTML(html)
    trs = root.xpath('//tr')

    f = open('demo01%d.txt' % j, 'w', encoding='utf-8')
    for tr in trs:
        s = ''
        tds = tr.xpath('./td')
        for td in tds:
            s = s + str(td.xpath('string(.)')) + '| '
        print(s)
        if s != '':
            f.write(s + '\n')
    j += 1
    f.close()
```
Level 12: Anti-scraping via User-Agent and Referer validation
```python
import requests
from lxml import etree

url = 'http://www.spiderbuf.cn/n01/'
myheaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36',
    'Referer': 'http://www.spiderbuf.cn/list'
}
html = requests.get(url, headers=myheaders).text
print(html)

f = open('01.html', 'w', encoding='utf-8')
f.write(html)
f.close()

root = etree.HTML(html)
ls = root.xpath('//div[@class="container"]/div/div')

f = open('01.txt', 'w', encoding='utf-8')
for item in ls:
    hnodes = item.xpath('./h2')
    s0 = hnodes[0].text
    pnodes = item.xpath('./p')
    s1 = pnodes[0].text
    s2 = pnodes[1].text
    s3 = pnodes[2].text
    s4 = pnodes[3].text
    s = s0 + '|' + s1.replace('排名:', '') + '|' + s2.replace('企业估值(亿元):', '') + '|' \
        + s3.replace('CEO:', '') + '|' + s4.replace('行业:', '') + '\n'
    print(s)
    f.write(s)
f.close()
```
Level 13: Parsing and scraping text obfuscated with CSS offset styling
temp[1:2] is a slice that takes the element at index 1 of temp (index 2 is excluded); the result is a new single-element sequence.
temp[0:1] is likewise a slice that takes the element at index 0, again a single-element sequence.
temp[2:] slices from index 2 through to the end of temp.
The .xpath('string(.)') call returns all of the text inside the tag.
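A quick demonstration of the swap on a made-up value: the raw HTML text carries its first two characters in the wrong order, and the slicing puts them back.

```python
# Made-up raw value for illustration: the page's CSS offset displays '21' as '12'
temp = '2153'
fixed = temp[1:2] + temp[0:1] + temp[2:]   # swap the first two characters back
print(fixed)  # '1253'
```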
```python
import requests
from lxml import etree

url = 'http://spiderbuf.cn/h01/'
myheaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36',
    'Referer': 'http://www.spiderbuf.cn/list'
}
html = requests.get(url, headers=myheaders).text

f = open('01.html', 'w', encoding='utf-8')
f.write(html)
f.close()

root = etree.HTML(html)
ls = root.xpath('//div[@class="container"]/div/div')

f = open('01.txt', 'w', encoding='utf-8')
for item in ls:
    hnodes = item.xpath('./h2')
    # The CSS offset swaps the first two characters, so swap them back
    temp = hnodes[0].xpath('string(.)')
    s0 = temp[1:2] + temp[0:1] + temp[2:]
    print(s0)
    pnodes = item.xpath('./p')
    s1 = pnodes[0].text
    print(s1)
    temp = pnodes[1].xpath('string(.)').replace('企业估值(亿元):', '')
    s2 = temp[1:2] + temp[0:1] + temp[2:]
    print(s2)
    s3 = pnodes[2].text
    print(s3)
    s4 = pnodes[3].text
    print(s4)
    s = s0 + '|' + s1.replace('排名:', '') + '|' + s2.replace('企业估值(亿元):', '') + '|' \
        + s3.replace('CEO:', '') + '|' + s4.replace('行业:', '') + '\n'
    print(s)
    f.write(s)
f.close()
```
Level 14: Scraping Base64-encoded images and decoding them back to files
Warning: strip the data:image/png;base64, prefix from the src value and encode the remainder to bytes before Base64-decoding:
```python
item = i.replace('data:image/png;base64,', '')
str_bytes = item.encode('raw_unicode_escape')
decoded = base64.b64decode(str_bytes)
```
```python
import requests
from lxml import etree
import base64

url = 'http://spiderbuf.cn/n02/'
myheaders = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'}
html = requests.get(url, headers=myheaders).text

f = open('01.html', 'w', encoding='utf-8')
f.write(html)
f.close()

root = etree.HTML(html)
imgs = root.xpath('//img/@src')
print(imgs)

n = 1
for i in imgs:
    # Strip the data URI prefix, then decode the Base64 payload back to PNG bytes
    item = i.replace('data:image/png;base64,', '')
    str_bytes = item.encode('raw_unicode_escape')
    decoded = base64.b64decode(str_bytes)
    # Number the output files so later images do not overwrite earlier ones
    img = open('%02d.png' % n, 'wb')
    img.write(decoded)
    img.close()
    n += 1
```
Level 15: The site requires at least 1 second between requests

Import the time library and, building on the Level 4 code, call time.sleep(<interval in seconds>) between requests.
```python
import requests
from lxml import etree
import time

base_url = 'http://spiderbuf.cn/n03/%d'
myheaders = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'}

max_page = 20
for i in range(1, max_page + 1):
    url = base_url % i
    html = requests.get(url, headers=myheaders).text

    f = open('01%d.html' % i, 'w', encoding='utf-8')
    f.write(html)
    f.close()

    root = etree.HTML(html)
    trs = root.xpath('//tr')

    f = open('demo01%d.txt' % i, 'w', encoding='utf-8')
    for tr in trs:
        s = ''
        tds = tr.xpath('./td')
        for td in tds:
            s = s + str(td.xpath('string(.)')) + '| '
        print(s)
        if s != '':
            f.write(s + '\n')
    time.sleep(2)  # wait between requests; 2 seconds keeps a safe margin above the 1-second limit
    f.close()
```