    pyspider

    A powerful spider (web crawler) system in Python.

    • Write scripts in Python
    • Powerful WebUI with script editor, task monitor, project manager and result viewer
    • MySQL, MongoDB, Redis, SQLite, Elasticsearch, and PostgreSQL (with SQLAlchemy) as database backends
    • RabbitMQ, Redis and Kombu as message queues
    • Task priority, retries, periodic crawls, recrawl by age, and more
    • Distributed architecture, crawling of JavaScript pages, support for Python 2.{6,7} and 3.{3,4,5,6}, and more

    Tutorial: http://docs.pyspider.org/en/latest/tutorial/
    Documentation: http://docs.pyspider.org/
    Release notes: https://github.com/binux/pyspider/releases

    Sample Code

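    from pyspider.libs.base_handler import *


    class Handler(BaseHandler):
        # Project-wide defaults applied to every self.crawl() call
        crawl_config = {
        }

        # Entry point: scheduled once a day (24 * 60 minutes)
        @every(minutes=24 * 60)
        def on_start(self):
            self.crawl('http://scrapy.org/', callback=self.index_page)

        # Treat a fetched page as fresh for 10 days (age is in seconds)
        @config(age=10 * 24 * 60 * 60)
        def index_page(self, response):
            # Follow every link whose href starts with "http"
            for each in response.doc('a[href^="http"]').items():
                self.crawl(each.attr.href, callback=self.detail_page)

        def detail_page(self, response):
            # The returned dict is saved as the result for this page
            return {
                "url": response.url,
                "title": response.doc('title').text(),
            }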
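
    The two decorators in the sample take durations in different units: @every(minutes=...) is given in minutes, while @config(age=...) is given in seconds. The standalone sketch below only checks that arithmetic with the standard library; the variable names are illustrative and are not part of pyspider.

```python
from datetime import timedelta

# Values copied from the sample handler.
every_minutes = 24 * 60          # argument passed to @every(minutes=...)
age_seconds = 10 * 24 * 60 * 60  # argument passed to @config(age=...)

# Convert both to timedelta so the units are explicit.
schedule_interval = timedelta(minutes=every_minutes)
validity_period = timedelta(seconds=age_seconds)

print(schedule_interval)  # 1 day, 0:00:00
print(validity_period)    # 10 days, 0:00:00
```

    In other words, on_start is scheduled once a day, and a page fetched by index_page is considered fresh for ten days before it becomes eligible for recrawling.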

    Installation

    WARNING: The WebUI is open to the public by default and can be used to execute arbitrary commands, which may harm your system. Run it on an internal network or enable need-auth for the webui.
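
    As a rough sketch of what enabling need-auth can look like, the snippet below builds a JSON config with a "webui" section. It assumes the key names "username", "password" and "need-auth" as used in pyspider's deployment documentation; the credentials are placeholders, and the keys should be checked against the version you run.

```python
import json

# Placeholder credentials for illustration only; replace with your own.
config = {
    "webui": {
        "username": "admin",       # hypothetical user name
        "password": "change-me",   # hypothetical password
        "need-auth": True,         # ask for credentials on WebUI pages
    }
}

# Serialize to JSON text that can be saved as a config file.
config_text = json.dumps(config, indent=2)
print(config_text)
```

    Saving the printed text as config.json and starting pyspider with `pyspider -c config.json` should then prompt for these credentials before the WebUI can be used.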

    Quickstart: http://docs.pyspider.org/en/latest/Quickstart/

    Contribute

    TODO

    v0.4.0

    • A visual scraping interface like Portia

    License

    Licensed under the Apache License, Version 2.0
