Skip to content
体验新版
项目
组织
正在加载...
登录
切换导航
打开侧边栏
亦蔚然
Project Crawler Elasticsearch
提交
2ac09d01
P
Project Crawler Elasticsearch
项目概览
亦蔚然
/
Project Crawler Elasticsearch
通知
4
Star
0
Fork
0
代码
文件
提交
分支
Tags
贡献者
分支图
Diff
Issue
0
列表
看板
标记
里程碑
合并请求
0
DevOps
流水线
流水线任务
计划
Wiki
0
Wiki
分析
仓库
DevOps
项目成员
Pages
P
Project Crawler Elasticsearch
项目概览
项目概览
详情
发布
仓库
仓库
文件
提交
分支
标签
贡献者
分支图
比较
Issue
0
Issue
0
列表
看板
标记
里程碑
合并请求
0
合并请求
0
Pages
DevOps
DevOps
流水线
流水线任务
计划
分析
分析
仓库分析
DevOps
Wiki
0
Wiki
成员
成员
收起侧边栏
关闭侧边栏
动态
分支图
创建新Issue
流水线任务
提交
Issue看板
前往新版Gitcode,体验更适合开发者的 AI 搜索 >>
未验证
提交
2ac09d01
编写于
5月 22, 2021
作者:
W
weiranyi
提交者:
GitHub
5月 22, 2021
浏览文件
操作
浏览文件
下载
电子邮件补丁
差异文件
V1__Java爬虫基础逻辑版本 (#12)
* V1__Java爬虫基础逻辑版本 * V1__Java爬虫基础逻辑版本
上级
e3540254
变更
2
隐藏空白更改
内联
并排
Showing
2 changed file
with
30 addition
and
25 deletion
+30
-25
README.md
README.md
+2
-1
src/main/java/com/github/weiranyi/Main.java
src/main/java/com/github/weiranyi/Main.java
+28
-24
未找到文件。
README.md
浏览文件 @
2ac09d01
...
...
@@ -99,4 +99,5 @@
-
做什么需要依靠插件
-
maven-surefire-plugin官方检测插件
-
插件可以绑定到maven的各个生命周期上
-
maven-compile-plugein
\ No newline at end of file
-
maven-compile-plugein
-
ORM对象关系映射
\ No newline at end of file
src/main/java/com/github/weiranyi/Main.java
浏览文件 @
2ac09d01
...
...
@@ -14,12 +14,14 @@ import org.jsoup.nodes.Element;
import
java.io.IOException
;
import
java.sql.*
;
import
java.util.ArrayList
;
import
java.util.stream.Collectors
;
public
class
Main
{
private
static
final
String
USER_NAME
=
"root"
;
private
static
final
String
USER_PASSWORD
=
"123456"
;
@SuppressFBWarnings
(
"DMI_CONSTANT_DB_PASSWORD"
)
public
static
void
main
(
String
[]
args
)
throws
IOException
,
SQLException
{
Connection
connection
=
DriverManager
.
getConnection
(
"jdbc:h2:file:/Users/yiweiran/Documents/workPlace/java/JavaProject-Crawler-Elasticsearch/news"
,
USER_NAME
,
USER_PASSWORD
);
...
...
@@ -35,7 +37,7 @@ public class Main {
Document
doc
=
httpGetAndParseHtml
(
link
);
// 分析页面url将它们放到即将处理的url池子中去
parseUrlsFromAndStoreIntoDatabase
(
connection
,
doc
);
storeIntoDatabaseIfItIsNewsPage
(
doc
);
storeIntoDatabaseIfItIsNewsPage
(
connection
,
doc
,
link
);
updataDatabase
(
connection
,
link
,
"insert into LINKS_ALREADY_PROCESSED(link) values (?)"
);
}
else
{
// 不感兴趣
...
...
@@ -45,14 +47,6 @@ public class Main {
}
private
static
void
parseUrlsFromAndStoreIntoDatabase
(
Connection
connection
,
Document
doc
)
throws
SQLException
{
for
(
Element
aTag
:
doc
.
select
(
"a"
))
{
String
href
=
aTag
.
attr
(
"href"
);
updataDatabase
(
connection
,
href
,
"insert into LINKS_TO_BE_PROCESSED(link) values (?)"
);
}
}
/*
* 4、优化主干逻辑,进一步重构
*/
...
...
@@ -64,6 +58,20 @@ public class Main {
return
link
;
}
private
static
void
parseUrlsFromAndStoreIntoDatabase
(
Connection
connection
,
Document
doc
)
throws
SQLException
{
for
(
Element
aTag
:
doc
.
select
(
"a"
))
{
String
href
=
aTag
.
attr
(
"href"
);
if
(
href
.
startsWith
(
"//"
))
{
href
=
"https:"
+
href
;
}
if
(
href
.
toLowerCase
().
startsWith
(
"javascript"
))
{
continue
;
}
updataDatabase
(
connection
,
href
,
"insert into LINKS_TO_BE_PROCESSED(link) values (?)"
);
}
}
/*
* 3、重构对数据库操作部分的代码
*/
...
...
@@ -105,15 +113,6 @@ public class Main {
}
return
false
;
}
// private static List<String> deleteFromDatabase(Connection connection, String sql) throws SQLException {
// List<String> results = new ArrayList<>();
// try (PreparedStatement statement = connection.prepareStatement("delete FROM LINKS_TO_BE_PROCESSED where LINK=?")) {
// statement.setString(1,link);
// // 从数据库加载即将处理的代码
// statement.executeUpdate();
// }
// return results;
// }
/*
...
...
@@ -126,9 +125,6 @@ public class Main {
// 通过http请求拿到HTML文档
private
static
Document
httpGetAndParseHtml
(
String
link
)
throws
IOException
{
try
(
CloseableHttpClient
httpclient
=
HttpClients
.
createDefault
())
{
if
(
link
.
startsWith
(
"//"
))
{
link
=
"https:"
+
link
;
}
HttpGet
httpGet
=
new
HttpGet
(
link
);
httpGet
.
addHeader
(
"user-agent"
,
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"
);
try
(
CloseableHttpResponse
response1
=
httpclient
.
execute
(
httpGet
))
{
...
...
@@ -142,12 +138,20 @@ public class Main {
}
// 若是新闻页面就存到数据库中
private
static
void
storeIntoDatabaseIfItIsNewsPage
(
Document
doc
)
{
private
static
void
storeIntoDatabaseIfItIsNewsPage
(
Connection
connection
,
Document
doc
,
String
link
)
throws
SQLException
{
ArrayList
<
Element
>
articleTags
=
doc
.
select
(
"article"
);
if
(!
articleTags
.
isEmpty
())
{
for
(
Element
articleTag
:
articleTags
)
{
String
titile
=
articleTags
.
get
(
0
).
child
(
0
).
text
();
System
.
out
.
println
(
titile
);
String
title
=
articleTags
.
get
(
0
).
child
(
0
).
text
();
// Collectors.joining("\n")得到的字符串用换行符分隔
String
content
=
articleTag
.
select
(
"p"
).
stream
().
map
(
Element:
:
text
).
collect
(
Collectors
.
joining
(
"\n"
));
System
.
out
.
println
(
title
);
try
(
PreparedStatement
statement
=
connection
.
prepareStatement
(
"insert into news(url,title,content,created_at,MODIFIED_AT)VALUES ( ?,?,?,now(),now() )"
))
{
statement
.
setString
(
1
,
link
);
statement
.
setString
(
2
,
title
);
statement
.
setString
(
3
,
content
);
statement
.
executeUpdate
();
}
}
}
}
...
...
编辑
预览
Markdown
is supported
0%
请重试
或
添加新附件
.
添加附件
取消
You are about to add
0
people
to the discussion. Proceed with caution.
先完成此消息的编辑!
取消
想要评论请
注册
或
登录