Commit 47ba0c84
Authored Nov 14, 2014 by binux

update readme

Parent: 4b331432

Showing 2 changed files with 38 additions and 32 deletions (+38 −32):

- README.md (+37 −30)
- libs/sample_handler.py (+1 −2)
README.md

````diff
 pyspider [![Build Status](https://travis-ci.org/binux/pyspider.png?branch=master)](https://travis-ci.org/binux/pyspider) [![Coverage Status](https://coveralls.io/repos/binux/pyspider/badge.png)](https://coveralls.io/r/binux/pyspider)
 ========
-A spider system in python. [Try It Now!](http://demo.pyspider.org/)
+A Powerful Spider System in Python. [Try It Now!](http://demo.pyspider.org/)
 
-- Write script with python
+- Write script in python with powerful API
-- Web script editor, debugger, task monitor, project manager and result viewer
+- Powerful WebUI with script editor, task monitor, project manager and result viewer
-- MySQL, MongoDB, SQLite as database backend
-- Javascript pages supported!
-- Task priority, retry, periodical and recrawl by age or marks in index page (like update time)
-- Distributed architecture
+- Distributed architecture
+- MySQL, MongoDB and SQLite as database backend
+- Full control of crawl process with powerful API
+- Javascript pages Support! (with phantomjs fetcher)
 
-![debug demo](http://f.binux.me/debug_demo.png)
-demo code: [gist:9424801](https://gist.github.com/binux/9424801)
+Sample Code:
+
+```python
+from libs.base_handler import *
+
+
+class Handler(BaseHandler):
+    '''
+    this is a sample handler
+    '''
+    @every(minutes=24*60, seconds=0)
+    def on_start(self):
+        self.crawl('http://scrapy.org/', callback=self.index_page)
+
+    @config(age=10*24*60*60)
+    def index_page(self, response):
+        for each in response.doc('a[href^="http://"]').items():
+            self.crawl(each.attr.href, callback=self.detail_page)
+
+    def detail_page(self, response):
+        return {
+            "url": response.url,
+            "title": response.doc('title').text(),
+        }
+```
+
+[![demo](http://ww1.sinaimg.cn/large/7d46d69fjw1emavy6e9gij21kw0uldvy.jpg)](http://demo.pyspider.org/)
 
 Installation
 ============
 
 * python2.6/2.7
-* `pip install -r requirements.txt`
+* `pip install --allow-all-external -r requirements.txt`
+  if ubuntu: `apt-get install python python-dev python-distribute python-pip libcurl4-openssl-dev libxml2-dev libxslt1-dev python-lxml`
+  or [Run with Docker](https://github.com/binux/pyspider/wiki/Run-pyspider-with-Docker)
 * `./run.py`, visit [http://localhost:5000/](http://localhost:5000/)
 
-Docker
-======
-
-```
-# mysql
-docker run -it -d --name mysql dockerfile/mysql
-# rabbitmq
-docker run -it -d --name rabbitmq dockerfile/rabbitmq
-# phantomjs link to fetcher and webui
-docker run --name phantomjs -it -d -v `pwd`:/mnt/test --expose 25555 cmfatih/phantomjs /usr/bin/phantomjs /mnt/test/fetcher/phantomjs_fetcher.js 25555
-# scheduler
-docker run -it -d --name scheduler --link mysql:mysql --link rabbitmq:rabbitmq binux/pyspider scheduler
-# fetcher, run multiple instance if needed.
-docker run -it -d -m 64m --link rabbitmq:rabbitmq binux/pyspider fetcher
-# processor, run multiple instance if needed.
-docker run -it -d -m 128m --link mysql:mysql --link rabbitmq:rabbitmq binux/pyspider processor
-# webui
-docker run -it -d -p 5000:5000 --link mysql:mysql --link rabbitmq:rabbitmq --link scheduler:scheduler binux/pyspider webui
-```
 
 Documents
 =========
 ...
@@ -53,8 +60,8 @@ Documents
 ...
 Contribute
 ==========
-* Deploy and use it; report bugs and request features via [Issue](https://github.com/binux/pyspider/issues)
-* Join the [feature discussion](https://github.com/binux/pyspider/issues?labels=discussion&state=open) or help [improve the documentation](https://github.com/binux/pyspider/wiki)
+* Use It, Open [Issue](https://github.com/binux/pyspider/issues), PR is welcome.
+* [Discuss](https://github.com/binux/pyspider/issues?labels=discussion&state=open), [Document](https://github.com/binux/pyspider/wiki)
 
 License
 ...
````
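The sample code added to the README leans on pyspider's `@every` and `@config` decorators: `@every(minutes=24*60, seconds=0)` triggers `on_start` once a day, and `@config(age=10*24*60*60)` treats `index_page` results as fresh for ten days before a re-crawl is allowed. As an illustration only (not pyspider's actual implementation), such decorators can be sketched as attaching scheduling metadata to the handler methods:

```python
# Sketch of interval/age metadata via decorators, in the spirit of
# pyspider's @every and @config. Names and details here are illustrative.

def every(minutes=0, seconds=0):
    """Mark a method to run on a fixed interval, stored in seconds."""
    def wrapper(func):
        func.every_seconds = minutes * 60 + seconds
        return func
    return wrapper

def config(**kwargs):
    """Attach default task options (e.g. result max age) to a callback."""
    def wrapper(func):
        func.task_config = kwargs
        return func
    return wrapper

class Handler:
    @every(minutes=24 * 60, seconds=0)
    def on_start(self):
        pass

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        pass

print(Handler.on_start.every_seconds)         # 86400 (once a day)
print(Handler.index_page.task_config["age"])  # 864000 (ten days)
```

A scheduler could then read `every_seconds` and `task_config` off each method to decide when to fire `on_start` and whether a cached `index_page` result is still fresh.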
libs/sample_handler.py

```diff
 ...
@@ -3,7 +3,6 @@
 # vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
 # Created on __DATE__
 
-from libs.pprint import pprint
 from libs.base_handler import *
 
 class Handler(BaseHandler):
 ...
@@ -12,7 +11,7 @@ class Handler(BaseHandler):
     '''
     @every(minutes=24*60, seconds=0)
     def on_start(self):
-        self.crawl('http://www.baidu.com/', callback=self.index_page)
+        self.crawl('http://scrapy.org/', callback=self.index_page)
 
     @config(age=10*24*60*60)
     def index_page(self, response):
 ...
```
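In both changed files, `index_page` collects links with the CSS selector `a[href^="http://"]`, i.e. every `<a>` tag whose `href` starts with `http://`. pyspider's `response.doc` is a PyQuery object, but the selector's effect can be shown with the standard library alone on a made-up HTML snippet:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values of <a> tags starting with http:// ,
    mirroring the CSS selector a[href^="http://"]."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith("http://"):
                self.links.append(href)

# Illustrative input: only the first link matches the prefix filter.
html = ('<a href="http://scrapy.org/">match</a> '
        '<a href="/local">relative</a> '
        '<a href="https://example.org/">https</a>')
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['http://scrapy.org/']
```

Each matched URL would then be fed back to `self.crawl(...)` with `detail_page` as the callback, which is how the handler chains index pages into detail pages.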