
I. Project Background

In an earlier article, [A First Look at the Scrapy Crawler Framework](http://mp.weixin.qq.com/s?__biz=MzIzODI4ODM2MA==&mid=2247484881&idx=1&sn=5d205c3315927845fed5aa4dfbb4f4da&chksm=e93ae956de4d604052e6d18ca10fc081f32cd8479a11420cd13fe20bbb963044b13d55b15390&scene=21#wechat_redirect), we covered the basic principles of how the Scrapy framework runs. We then showed how to crawl text data with Scrapy in [Scrapy+MySQL+MongoDB: Crawling Douban Books for Simple Data Analysis](http://mp.weixin.qq.com/s?__biz=MzIzODI4ODM2MA==&mid=2247484898&idx=1&sn=763a73b7d4b7c991d1aeb2ceb389b686&chksm=e93ae965de4d6073da55c6db07bfe142c1d18ca744dae33214a2dba8940db348616e256a7e50&scene=21#wechat_redirect), and how to crawl images in [Crawling Images of Beautiful Women from a Website with Scrapy](http://mp.weixin.qq.com/s?__biz=MzIzODI4ODM2MA==&mid=2247486610&idx=1&sn=e05d207e965d7bcc0507a195f25da2b9&chksm=e93ae015de4d69031ae847bf5f12adef61e82d263aa8366e9533a58c7011b6396b4a05051cea&scene=21#wechat_redirect). This time we share how to download files with Scrapy.

The target page for this crawl is: https://matplotlib.org/2.0.2/examples/index.html

II. Implementation

1. Create the project
   》》Create the project: scrapy startproject matplot_file
   》》Enter the project directory: cd matplot_file
   》》Generate the spider: scrapy genspider mat matplotlib.org
   》》Run the spider: scrapy crawl mat -o mat_file.json

2. Crawl the data
  》》Parse the data
  》》Store the data

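# -*- coding: utf-8 -*-
import scrapy
from matplot_file.items import MatplotFileItem


class MatSpider(scrapy.Spider):
    name = 'mat'
    allowed_domains = ['matplotlib.org']
    start_urls = ['https://matplotlib.org/2.0.2/examples/index.html']

    def parse(self, response):
        # Select the top-level <li> elements on the index page
        for lis in response.xpath('//*[@id="matplotlib-examples"]/div/ul/li'):
            # Iterate over the nested <li> elements
            for li in lis.xpath('.//ul/li'):
                # Extract the link to the example page
                url = li.xpath('.//a/@href').get()
                # Skip entries without a link; urljoin(None) would return the current page URL
                if url is None:
                    continue
                # Resolve the relative link into an absolute URL
                url = response.urljoin(url)
                # Request the example page
                yield scrapy.Request(url, callback=self.parse_html)

    # Parse an example page
    def parse_html(self, response):
        # Extract the link to the file
        href = response.xpath('//div[@class="section"]/p/a/@href').get()
        # Skip pages that have no file link
        if href is None:
            return
        # Resolve the relative link into an absolute URL
        url = response.urljoin(href)
        # Print the URL to the console
        print(url)
        # Create the item
        matfile = MatplotFileItem()
        # Store the URL; FilesPipeline expects file_urls to be a list
        matfile['file_urls'] = [url]
        # Return the item
        yield matfile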

[Note] The code above goes in mat.py.
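
The `response.urljoin(...)` calls in the spider turn relative `href` values into absolute URLs, using the URL of the page being parsed as the base. As a rough illustration outside Scrapy, the standard-library `urllib.parse.urljoin` applies the same resolution rules (assuming the page has no `<base>` tag). The two relative hrefs below are hypothetical values chosen for illustration, not copied from the site.

```python
from urllib.parse import urljoin

# The index page used as start_urls in the spider
index_url = "https://matplotlib.org/2.0.2/examples/index.html"

# Hypothetical relative link found on the index page
example_href = "animation/animate_decay.html"
example_url = urljoin(index_url, example_href)

# Hypothetical relative link to a source file found on the example page
source_href = "animate_decay.py"
source_url = urljoin(example_url, source_href)

print(example_url)
print(source_url)
```

The last path segment of the base URL (`index.html`, then `animate_decay.html`) is replaced by the relative link, which is why each level must be resolved against the page it was found on.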

# -*- coding: utf-8 -*-
BOT_NAME = 'matplot_file'


SPIDER_MODULES = ['matplot_file.spiders']
NEWSPIDER_MODULE = 'matplot_file.spiders'




# Enable the FilesPipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
# Directory where downloaded files are saved
FILES_STORE = 'mat_file'
# Do not check robots.txt before crawling
ROBOTSTXT_OBEY = False


[Note] The code above goes in settings.py.
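
With these settings the stock `FilesPipeline` does not keep the original file names. By default it saves each file in a `full/` subdirectory of `FILES_STORE`, named after the SHA-1 hash of the file URL plus the URL's extension. The sketch below approximates that naming with the standard library only. `default_file_path` is a hypothetical helper, not a Scrapy function, the URL is made up, and the exact extension handling may differ between Scrapy versions.

```python
import hashlib
import os
from urllib.parse import urlparse


def default_file_path(url, store="mat_file"):
    """Approximate where the default FilesPipeline would save `url`."""
    # The file name is the SHA-1 hex digest of the URL
    guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # The extension is taken from the URL path, e.g. ".py"
    ext = os.path.splitext(urlparse(url).path)[1]
    return f"{store}/full/{guid}{ext}"


# Made-up file URL, for illustration only
demo_url = "https://matplotlib.org/2.0.2/examples/animation/animate_decay.py"
demo_path = default_file_path(demo_url)
print(demo_path)
```

To keep readable file names instead, one common approach is to subclass `FilesPipeline` and override its `file_path` method, then register the subclass in `ITEM_PIPELINES`.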

import scrapy




class MatplotFileItem(scrapy.Item):
    # define the fields for your item here like:


    # URLs of the files to download
    file_urls = scrapy.Field()
    # Information about the downloaded files
    files = scrapy.Field()


[Note] The code above goes in items.py.
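
After a download succeeds, `FilesPipeline` fills the `files` field with one dict per downloaded file, including the keys `url`, `path` (relative to `FILES_STORE`) and `checksum`. Because the spider is run with `-o mat_file.json`, these records can be inspected after the crawl. The sketch below assumes that export format. `local_paths` is a hypothetical helper, and the sample record (URL, shortened hash, checksum) is made up for illustration.

```python
import json
import posixpath


def local_paths(items, store="mat_file"):
    """Map each downloaded file URL to its path under the store directory."""
    paths = {}
    for item in items:
        # Items whose download failed may have an empty or missing `files` list
        for info in item.get("files", []):
            paths[info["url"]] = posixpath.join(store, info["path"])
    return paths


# Made-up sample in the shape of the exported JSON feed
sample_json = """
[
  {
    "file_urls": ["https://matplotlib.org/2.0.2/examples/animation/animate_decay.py"],
    "files": [
      {
        "url": "https://matplotlib.org/2.0.2/examples/animation/animate_decay.py",
        "path": "full/0123abcd.py",
        "checksum": "d41d8cd98f00b204e9800998ecf8427e"
      }
    ]
  },
  {"file_urls": ["https://matplotlib.org/2.0.2/examples/missing.py"], "files": []}
]
"""
items = json.loads(sample_json)
result = local_paths(items)
print(result)
```

For a real run, the sample string would be replaced by the contents of `mat_file.json`, for example loaded with `json.load(open("mat_file.json", encoding="utf-8"))`.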