Published on 2020-07-28 23:32
I have been learning web scraping recently. I am still very much a beginner; I wrote this code while following my teacher's lessons, so it is both a summary and a few of my own takeaways, shared here for everyone. If anything is off, corrections from more experienced readers are very welcome.
Here is the code:
# Import libraries
import requests
import re

# getHTMLText(url): fetch a web page and return its text
def getHTMLText(url):
    try:
        headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
"cookie": "miid=1428930817865580362; cna=EarZFfUm1S0CARsR+220O8hH; t=8bc94e7bc688eb7af5533f1976650fde; _m_h5_tk=65dbeb4e38f534aacf4025c8d4e81bce_1586794235712; _m_h5_tk_enc=e96d92ee16e958b4890caa9fc2fa6db4; thw=cn; cookie2=1554e5bbbfe6457cf1c1c9aa63c058df; v=0; _tb_token_=5a85e0188653; _samesite_flag_=true; sgcookie=EpId%2FVCz%2BPjBPFKeidqdS; unb=2683761081; uc3=lg2=WqG3DMC9VAQiUQ%3D%3D&id2=UU6p%2BQEJ8tSc4g%3D%3D&vt3=F8dBxdGLa3BXsASlX%2Bw%3D&nk2=BcLP06d1nZPt5PbdCo24Cnoi; csg=1e8e7f0a; lgc=freezing2856803123; cookie17=UU6p%2BQEJ8tSc4g%3D%3D; dnk=freezing2856803123; skt=6a084e57cf10b6e6; existShop=MTU4NzE5NDg1OA%3D%3D; uc4=id4=0%40U2xkY0WHChRFrR6VhQm75gIGMATD&nk4=0%40B044YAqLRKUazEZ7eWhSvUymCOjtR%2FkE1PO2nJ8%3D; tracknick=freezing2856803123; _cc_=U%2BGCWk%2F7og%3D%3D; _l_g_=Ug%3D%3D; sg=317; _nk_=freezing2856803123; cookie1=B0BXi%2BrAh%2BCsG%2B9LmOzVV9j8dAB5xdFbcF%2BmnvpYvzA%3D; tfstk=chgGBuae-cr6eLnsN1asMerwb79daT74EquI8V-uS4f_xE3z_sIoYL5pOSEkdp1..; mt=ci=97_1; enc=0gxF3t55dTUIEQOzUSrgF7p2gdf9xdcdC6xm317h5dXRn7D21KYrLJkRJFp6vcy6l7Z2CrAPewgEdMBB0j7yHg%3D%3D; alitrackid=www.taobao.com; lastalitrackid=www.taobao.com; hng=CN%7Czh-CN%7CCNY%7C156; uc1=cookie16=UtASsssmPlP%2Ff1IHDsDaPRu%2BPw%3D%3D&cookie21=U%2BGCWk%2F7p4mBoUyS4plD&cookie15=URm48syIIVrSKA%3D%3D&existShop=false&pas=0&cookie14=UoTUPc3lioQ%2F3A%3D%3D; JSESSIONID=B3B7C7381542916C591F2634FDE31A52; l=eBSbgB4VqimFn0mBBOfwdA7-hk7OSBdYYu8NeR-MiT5PON1p5CxAWZXZX0L9C3GVhsZXR3Szm2rQBeYBqS24n5U62j-la_kmn; isg=BGJi2OxgSCf6jlezYGKTe0FGvejEs2bNw3JHu6z7jlWAfwL5lEO23eh9r7uD9N5l"
        }
        # Some pages can be fetched without headers=headers, but most
        # will return no data unless the request carries these headers
        r = requests.get(url, timeout=30, headers=headers)
        # r = requests.get(url, timeout=30)
        r.raise_for_status()
        # print(r.status_code)
        r.encoding = r.apparent_encoding
        # print(r.text)
        return r.text
    except:
        return ""
# Parse one fetched page and collect item names and prices
def parsePage(ilt, html):
    try:
        # Regular expressions that extract each item's price and title;
        # add further patterns here to capture other fields as needed
        # plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        # print(plt)
        # tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        # A character class [...] matches any single character inside the brackets
        plt = re.findall(r'"view_price":"[\d.]*"', html)
        # Usage: re.findall(r"pattern", string_to_search)
        tlt = re.findall(r'"raw_title":".*?"', html)
        # print(tlt)
        # ilt.append([1,1])
        for i in range(len(plt)):
            price = eval(plt[i].split(':')[1])
            title = eval(tlt[i].split(':')[1])
            ilt.append([price, title])
    except:
        print("")
# Display the collected item information
def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"  # output format template
    print(tplt.format("No.", "Price", "Item name"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))
# main() driver function
def main():
    # Search keyword; '饭盒' (lunch box) can be replaced with any other
    # item, e.g. '钢笔' (pen) or '杯子' (cup)
    goods = '饭盒'
    # Number of result pages to crawl; limited to 2 here
    depth = 2
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            # Taobao shows 44 items per page, so the s offset advances by 44
            url = start_url + '&s=' + str(44*i)
            html = getHTMLText(url)
            parsePage(infoList, html)
            # print(infoList)
        except:
            continue
    printGoodsList(infoList)

main()
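To see what parsePage() does in isolation, here is the same regex/eval extraction run against a small made-up fragment shaped like the JSON that Taobao embeds in its search pages (the fragment below is an assumption for illustration, not real page source):

```python
# Demonstration of the extraction logic from parsePage(), using a
# hypothetical snippet shaped like Taobao's embedded search-page JSON
import re

html = '"view_price":"12.50","raw_title":"不锈钢饭盒","view_price":"39.90","raw_title":"保温饭盒"'

plt = re.findall(r'"view_price":"[\d.]*"', html)   # e.g. '"view_price":"12.50"'
tlt = re.findall(r'"raw_title":".*?"', html)       # e.g. '"raw_title":"不锈钢饭盒"'

ilt = []
for i in range(len(plt)):
    # split(':')[1] keeps the quoted value; eval() strips the quotes
    price = eval(plt[i].split(':')[1])
    title = eval(tlt[i].split(':')[1])
    ilt.append([price, title])

print(ilt)  # [['12.50', '不锈钢饭盒'], ['39.90', '保温饭盒']]
```

One caution: eval() on scraped text will execute whatever the page contains, so a safer variant would use ast.literal_eval() or simply .strip('"') to remove the quotes.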
Below is the run result (the screenshot from the original post is not reproduced here):
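The pagination in main() can also be sketched on its own: Taobao's search results page through items via the s offset parameter, 44 items per page (the per-page count comes from the code's own comment):

```python
# Build the list of page URLs the way main() does, using the s offset
goods = '饭盒'
depth = 2
start_url = 'https://s.taobao.com/search?q=' + goods
urls = [start_url + '&s=' + str(44 * i) for i in range(depth)]
print(urls)
# ['https://s.taobao.com/search?q=饭盒&s=0', 'https://s.taobao.com/search?q=饭盒&s=44']
```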
Takeaway: overall, learning web scraping takes hands-on practice. Keep writing code and keep summarizing what you learn.
Original article: https://blog.csdn.net/panajie/article/details/107611129
Author: fhue34873
Link: https://www.pythonheidong.com/blog/article/466612/9a6a0ebaaa9c1fcdf1a1/
Source: python黑洞网