Python Crawler Primer: Scraping Job Listings with Scrapy

Author: 编程技术

Site: (elided). Library used: requests. Login flow: 1. GET the home page 2. fetch the captcha 3. first login verification (this is step 3; the 302 response leads to step 5). At step 5, the return value is a mock guest login:

{"quotationNodeUrl":"","companyId":"3","test":false,"loginname":"Guest","studioUrl":"","chestboxNodeUrl":"","nodeSid":"18ec7418-87df-403f-be7c-1df45a377fd1","userType":"-1","pwd":"Guest","quoteCompanyType":"4","errorcode":0}

A normal login returns:

{"quotationNodeUrl":"","companyId":"3","test":false,"loginname":"8800001","tradeNodeUrl":"","studioUrl":"","chestboxNodeUrl":"","nodeSid":"cd4d78bd-7b20-4d4e-a237-7eb2982d896c","userType":"0","pwd":"mypassword","quoteCompanyType":"4","errorcode":0}

A normal POST's Cookie header includes: _gid=GA1.2.256256411.1571038904; _ga=GA1.2.1829928522.1570867374; _gat=1; — that part (_gid, _ga, _gat) is generated by Google Analytics. My Python code didn't carry these at step 3 and still got through, but now step 5 keeps failing and I can't tell why. Could the site be doing some kind of verification together with the Google Analytics records, so the step-5 check fails? Please help me out and offer some suggestions.
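One thing worth knowing about the `_ga`/`_gid`/`_gat` cookies: they are created by the Google Analytics JavaScript running in the browser, not sent by the server in a `Set-Cookie` header, so a plain `requests` session will never acquire them on its own. If the site's step-5 check really involves them, they can be injected into the cookie jar by hand. A minimal sketch (the endpoint and form fields below are placeholders, not the real site's):

```python
import requests

session = requests.Session()  # persists cookies across all five login steps

# _ga/_gid/_gat are set client-side by the GA JavaScript in a real browser,
# so requests must inject them manually; these values come from the question above.
for name, value in {
    "_gid": "GA1.2.256256411.1571038904",
    "_ga": "GA1.2.1829928522.1570867374",
    "_gat": "1",
}.items():
    session.cookies.set(name, value)

# Hypothetical login endpoint and form fields -- adjust to the real site:
# resp = session.post("https://example.com/login",
#                     data={"loginname": "8800001", "pwd": "mypassword"})
# print(resp.json().get("errorcode"))

print(sorted(c.name for c in session.cookies))
```

If that doesn't help, the usual next step is to diff the full browser request against the `requests` one in DevTools: headers, `Referer`, and any hidden form fields.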

So I started writing code in the Scrapy project's spider.py:

 

File "E:\Python\pycharm\lagouposition\lagouposition\spiders\lagou.py", line 60, in parse
    content=data['content']
KeyError: 'content'
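`KeyError: 'content'` means the parsed JSON simply has no `content` key — typically because the server's anti-crawler layer returned an error payload instead of listing data. Switching to `dict.get()` avoids the crash and lets the spider log what actually came back. A small sketch with an invented blocked-response body:

```python
import json

# What the code expected vs. what an anti-crawler response may look like
# (the "too frequent" message is illustrative, not the site's exact text).
good = '{"content": {"positionResult": {"result": []}}}'
blocked = '{"status": false, "msg": "operation too frequent"}'

for body in (good, blocked):
    data = json.loads(body)
    content = data.get("content")  # .get() returns None instead of raising
    if content is None:
        print("no content key, server said:", data.get("msg"))
    else:
        print("ok, positionResult:", content.get("positionResult"))
```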

});
$("#txtId").blur(function(){
    var idReg=/^\d{15}$|^\d{18}$/;  // 15- or 18-digit ID card number
    var id=$(this).val();
    if(id==""||id==null)
    {
        $("#prompt_id").html("ID card number cannot be empty!");
        return false;
    }
    if(!idReg.test(id))
    {
        $("#prompt_id").html("ID card number format is incorrect!");
        return false;
    }
    $("#prompt_id").html("<img src='images/ok.gif'/>");
});

There were a great many pits to step into along the way.

Login validation:
$(function(){
    $("#文本框id").focus(function(){ // the account textbox gains focus
        var name = $(this).val(); // read the current value
        if(name=="Please enter a 6-12 digit account"){
            $(this).val(""); // clear the placeholder text
        }
    });
    $("#文本框id").blur(function(){ // the textbox loses focus
        var name = $(this).val(); // read the current value
        if(name==""){
            $(this).val("Please enter a 6-12 digit account"); // restore the placeholder
        }
    });
});
Form validation:
$(function(){
    $("#txtNo").blur(function(){
        var name=$(this).val();
        if(name==""||name==null)
        {
            $("#prompt_no").html("Account cannot be empty!");
            return false;
        }
        if(name.length<6||name.length>12)
        {
            $("#prompt_no").html("Account must be 6 to 12 characters long!");
            return false;
        }
        $("#prompt_no").html("<img src='images/ok.gif'/>");
    });
    $("#txtPwd").blur(function(){
        var pwd=$(this).val();
        if(pwd==""||pwd==null)
        {
            $("#prompt_pwd").html("Password cannot be empty!");
            return false;
        }
        if(pwd.length<6||pwd.length>12)
        {
            $("#prompt_pwd").html("Password must be 6 to 12 characters long!");
            return false;
        }
        $("#prompt_pwd").html("<img src='images/ok.gif'/>");
    });
    $("#txtConfirmPwd").blur(function(){
        var pwds=$(this).val();
        if(pwds==""||pwds==null){
            $("#prompt_confirmpwd").html("Password cannot be empty!");
            return false;
        }
        if(pwds!=$("#txtPwd").val())
        {
            $("#prompt_confirmpwd").html("The two passwords must match!");
            return false;
        }
        $("#prompt_confirmpwd").html("<img src='images/ok.gif'/>");
    });


$("#txtName").blur(function(){
    var names=$(this).val();
    if(names==""||names==null){
        $("#prompt_name").html("Username cannot be empty!");
        return false;
    }
    if(names.length<6||names.length>12)
    {
        $("#prompt_name").html("Username must be 6 to 12 characters long!");
        return false;
    }
    $("#prompt_name").html("<img src='images/ok.gif'/>");

headers={
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding':'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection':'keep-alive',
    'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie':'LGUID=20170624104910-b3421612-5887-11e7-805a-525400f775ce; user_trace_token=20170624104912-161b9c7475a6448381c393fd68935f6b; index_location_city=全国; JSESSIONID=ABAAABAAAFCAAEGF2DB2AA232B68C2B16743FE83939C1E9; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https://www.lagou.com/; TG-TRACK-CODE=index_search; _gid=GA1.2.705404459.1505118253; _ga=GA1.2.1378071003.1498273550; LGSID=20170911225046-98307e76-9700-11e7-8f76-525400f775ce; LGRID=20170911225056-9dbaf56b-9700-11e7-9168-5254005c3644; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1504697344,1504751304,1504860546,1505142452; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1505142462; SEARCH_ID=1875185cf5904051845b74a20b82bebd',
    'Host':'www.lagou.com',
    'Origin':'',
    'Referer':'',
    #'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Anit-Forge-Code':'0',
    'X-Anit-Forge-Token':'None',
    'X-Requested-With':'XMLHttpRequest'}

$("#txtPhone").blur(function(){
    var phoneReg=/^(13|15|18)\d{9}$/;  // 11-digit mobile number starting with 13/15/18
    var phone=$(this).val();
    if(phone==""||phone==null)
    {
        $("#prompt_phone").html("Phone number cannot be empty!");
        return false;
    }
    if(!phoneReg.test(phone))
    {
        $("#prompt_phone").html("Phone number format is incorrect!");
        return false;
    }
    $("#prompt_phone").html("<img src='images/ok.gif'/>");
});


I had assumed it would go on to grab all 30 pages, but after scraping just one page of content it threw the error shown earlier:

yield scrapy.FormRequest(response.url,formdata={'first':'False','pn':str(pn),'kd':'python'},
                         method='POST',meta={'pn':pn},callback=self.parse)


The first step is to open 拉勾网 (lagou.com), type Python into the search bar, open DevTools with F12, and refresh:

I had tried crawling this site before, but hit some errors I never managed to solve and gave up. Today I sat down to try writing it again. For a newcomer it really is pits everywhere, so I'm writing this post to record some of them, as a reference for further study.

At the same time, modify:

This solved the encoding problem above.

yield scrapy.FormRequest(response.url,formdata={'first':'False','pn':str(pn),'kd':'python'},
                         method='POST',meta={'pn':pn},headers=self.headers,callback=self.parse)

Then I ran it, and it finally worked, scraping all 30 pages of content. That process had plenty of pitfalls.

The cause was the missing headers. So I made the following adjustment: comment out DEFAULT_REQUEST_HEADERS in settings.py, then add the following in spider.py:

URL:

As before, I searched around online and tried a few approaches, but nothing worked and it kept throwing this error. In the end I found one fix: add the following code to spider.py:

The code clearly looked fine, so why did it keep raising this error? It honestly drove me crazy. Later I saw a reply online saying to add the complete set of request headers (the person answering said they weren't sure why either), so I set up settings.py as follows:

Then continue coding, in items.py:

yield scrapy.FormRequest(url,formdata={'first':'true','pn':'1','kd':'python'},method='POST',
                         meta={'pn':1},headers=self.headers,callback=self.parse)

Then I ran it; the earlier error was gone, but an encoding error appeared instead (I'm on Windows 7):

import sys,io

sys.stdout=io.TextIOWrapper(sys.stdout.buffer,encoding='gbk')
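On Python 3.7+ there is an arguably cleaner alternative (an aside, not what the original post used): text streams have a `reconfigure()` method that swaps the encoding in place, and `errors='replace'` keeps characters the GBK console can't display from crashing the spider:

```python
import io

# Same effect as the TextIOWrapper line above, but in place (Python 3.7+).
# Demonstrated on an in-memory stream so it is safe to run anywhere.
stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
stream.reconfigure(encoding="gbk", errors="replace")
print(stream.encoding)  # gbk

# For the spider itself the call would be:
# import sys
# sys.stdout.reconfigure(encoding="gbk", errors="replace")
```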

In settings.py I was using the default DEFAULT_REQUEST_HEADERS, to which I had added a random User-Agent. Then I started running the code, and it errored out:

Since this same error had also appeared at the very beginning, I assumed the cause was the following:

and modified:

import scrapy
import json

class LagouSpider(scrapy.Spider):
    name='lagou'

    def start_requests(self):
        url='
t=false&isSchoolJob=0'
        yield scrapy.FormRequest(url,formdata={'first':'true','pn':'1','kd':'python'},method='POST',meta={'pn':1},callback=self.parse)

    def parse(self,response):
        html=response.text
        data=json.loads(html)
        if data:
            content=data.get('content')
            positionResult=content.get('positionResult')
            results=positionResult.get('result')
            for result in results:
                companyFullName=result.get('companyFullName')
                print(companyFullName)
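To see what this `parse` walks through, you can replay its dictionary traversal on a hand-made JSON fragment shaped like the XHR response (the field names follow the article; the company values are invented):

```python
import json

# Invented sample mirroring the structure parse() navigates.
sample = json.dumps({
    "content": {
        "positionResult": {
            "totalCount": 450,
            "result": [
                {"companyFullName": "某某科技有限公司", "salary": "15k-25k"},
                {"companyFullName": "另一家公司", "salary": "10k-18k"},
            ],
        }
    }
})

data = json.loads(sample)
# Chained .get() with defaults so a blocked response yields an empty list
# instead of a KeyError.
results = data.get("content", {}).get("positionResult", {}).get("result", [])
for r in results:
    print(r.get("companyFullName"), r.get("salary"))
```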

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding':'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection':'keep-alive',
    'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie':'LGUID=20170624104910-b3421612-5887-11e7-805a-525400f775ce; user_trace_token=20170624104912-161b9c7475a6448381c393fd68935f6b; index_location_city=全国; JSESSIONID=ABAAABAAAFCAAEGF2DB2AA232B68C2B16743FE83939C1E9; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https://www.lagou.com/; TG-TRACK-CODE=index_search; _gid=GA1.2.705404459.1505118253; _ga=GA1.2.1378071003.1498273550; LGSID=20170911225046-98307e76-9700-11e7-8f76-525400f775ce; LGRID=20170911225056-9dbaf56b-9700-11e7-9168-5254005c3644; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1504697344,1504751304,1504860546,1505142452; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1505142462; SEARCH_ID=1875185cf5904051845b74a20b82bebd',
    'Host':'www.lagou.com',
    'Origin':'',
    'Referer':'',
    #'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Anit-Forge-Code':'0',
    'X-Anit-Forge-Token':'None',
    'X-Requested-With':'XMLHttpRequest'}

In spider.py:

def parse(self,response):
    html=response.text
    data=json.loads(html)
    if data:
        content=data.get('content')
        positionResult=content.get('positionResult')
        totalCount=positionResult.get('totalCount')
        pages=int(totalCount/15)
        if pages>=30:
            pages=30
        results=positionResult.get('result')
        for result in results:
            item=LagoupositionItem()
            item['companyFullName']=result.get('companyFullName')
            item['companyId']=result.get('companyId')
            item['companyLabelList']=result.get('companyLabelList')
            item['companyLogo']=result.get('companyLogo')
            item['companyShortName']=result.get('companyShortName')
            item['companySize']=result.get('companySize')
            item['createTime']=result.get('createTime')
            item['deliver']=result.get('deliver')
            item['district']=result.get('district')
            item['education']=result.get('education')
            item['explain']=result.get('explain')
            item['financeStage']=result.get('financeStage')
            item['firstType']=result.get('firstType')
            item['formatCreateTime']=result.get('formatCreateTime')
            item['gradeDescription']=result.get('gradeDescription')
            item['industryField']=result.get('industryField')
            item['industryLables']=result.get('industryLables')
            item['isSchoolJob']=result.get('isSchoolJob')
            item['jobNature']=result.get('jobNature')
            item['positionAdvantage']=result.get('positionAdvantage')
            item['positionId']=result.get('positionId')
            item['positionLables']=result.get('positionLables')
            item['positionName']=result.get('positionName')
            item['salary']=result.get('salary')
            item['secondType']=result.get('secondType')
            item['workYear']=result.get('workYear')
            yield item
            pn=int(response.meta.get('pn'))+1
            if pn<=pages:
                yield scrapy.FormRequest(response.url,formdata={'first':'False','pn':str(pn),'kd':'python'},method='POST',meta={'pn':pn},callback=self.parse)
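The paging logic deserves a closer look: the endpoint returns 15 results per page, the spider caps `pages` at 30, and the next page number `pn` comes from `response.meta` plus one. A standalone sketch of that arithmetic (the helper name is mine):

```python
def next_page(current_pn, total_count, per_page=15, cap=30):
    """Return the next page number to request, or None when done."""
    pages = min(total_count // per_page, cap)  # same cap-at-30 rule as the spider
    pn = current_pn + 1
    return pn if pn <= pages else None

print(next_page(1, 450))   # 2: 450 results -> 30 pages, keep going
print(next_page(30, 450))  # None: page 31 would exceed the cap
print(next_page(3, 40))    # None: 40 results -> only 2 full pages
```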

from scrapy import Item,Field

class LagoupositionItem(Item):
    companyFullName=Field()
    companyId=Field()
    companyLabelList=Field()
    companyLogo=Field()
    companyShortName=Field()
    companySize=Field()
    createTime=Field()
    deliver=Field()
    district=Field()
    education=Field()
    explain=Field()
    financeStage=Field()
    firstType=Field()
    formatCreateTime=Field()
    gradeDescription=Field()
    industryField=Field()
    industryLables=Field()
    isSchoolJob=Field()
    jobNature=Field()
    positionAdvantage=Field()
    positionId=Field()
    positionLables=Field()
    positionName=Field()
    salary=Field()
    secondType=Field()
    workYear=Field()

The response to this original request doesn't contain the data we want; in that situation I usually switch to the XHR tab to look for it:

