Commit eb972ef4 authored by 亦蔚然

V5__Implement basic Elasticsearch search

Parent a2980969
# Project: A Multithreaded Crawler and Elasticsearch Search Engine, Hands-On
***
## 1. Project Overview
- 1. Project goal: crawl the Sina website
- 2. Requirements analysis and algorithm design:
  - Requirement: starting from one page (node) of the site, traverse all reachable nodes
  - Algorithm: a variant of breadth-first search
![img.png](https://github.com/weiranyi/JavaProject-Crawler-Elasticsearch/blob/yiweiran/images/flowChart.png?raw=true)
***
## Lessons and Takeaways
### 1. Principles for doing a project
- Mindset:
  - 1. Treat every project as the best project of your life: polish it meticulously, write the documentation carefully, and guard code quality (work at your current best level; code-analysis tools can help)
  - 2. Use industry-standard patterns and workflows, and let no line of code be superfluous (for example, never upload files that don't belong in the repo, such as .idea, to GitHub); avoid local dependencies so that others can use the project without friction
  - 3. Move in small, fast steps: small changes build a sense of progress, are easier to debug, and the earlier they land the better
- Hard rules:
  - 1. [Important] Develop on GitHub with a trunk/branch model
    - Never push directly to master
    - Every change must go through a PR
  - 2. [Important] Automated code-quality checks plus tests
    - Checkstyle / SpotBugs
    - At least a basic level of automated test coverage
- Project design process:
  - Team collaboration [top-down]:
    - Modularize
    - Give each module clear responsibilities and boundaries
    - Basic documentation
    - Basic interfaces
    - Small commits
      - Large changes are hard to review
      - Large changes are even trickier to untangle when something breaks
      - Keep commit granularity small
  - Working solo [bottom-up]:
    - Implement the feature first
    - Keep extracting the shared parts as you implement
    - Whenever your code turns repetitive (copy-paste after copy-paste), it is time to refactor
    - Use refactoring to arrive at modules and interfaces
- Project evolution:
  - single-threaded -> multithreaded
  - console -> H2 -> MySQL
  - database -> Elasticsearch
- Good coding habits:
  - Don't write compromised code; refactor away bad or long-winded code (beginners otherwise end up studying a lot of bad code)
  - Reuse good third-party implementations where they exist, e.g. the packages Apache provides
  - Keep the code reasonably extensible
### 2. Takeaways
- Smoke testing. Testing principle: each test is its own class and covers one small functional module; a sketch follows below:
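A minimal sketch of that principle, assuming JUnit 5 (junit-jupiter is already among the dependencies); the link check is inlined as a hypothetical copy, since the real one is private to Crawler:
```java
package com.github.weiranyi;

import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

// One test class per small module: this one covers only the link filter.
class LinkFilterSmokeTest {
    // Hypothetical inline copy of the crawler's link check, so the sketch is self-contained
    private static boolean isInterestingLink(String link) {
        return link.contains("sina.cn") && !link.contains("passport.sina.cn");
    }

    @Test
    void onlySinaPagesAreInteresting() {
        assertTrue(isInterestingLink("https://news.sina.cn/2021/article.html"));
        assertFalse(isInterestingLink("https://example.com/"));
    }
}
```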
- Git command refresher:
- Create a new branch:
```shell
git checkout -b basic-algorithm
```
- Undo a `git add`:
```shell
git restore --staged src/main/java/com/github/weiranyi/Main.java
```
- Undo the most recent commit after everything has been committed:
```shell
git reset HEAD~1
```
- Revert a commit that went in through a PR:
```shell
git log    # find the hash of the commit to revert, e.g. 61b22195162ec24fbbf2ef020485bb0a524c82b9
git revert 61b22195162ec24fbbf2ef020485bb0a524c82b9
```
- If your branch conflicts with master, a force push can resolve it:
```shell
git push -f
```
- Commit message format: the first line is the subject, the body starts on the following lines, and each line should stay under 72 characters
- The CircleCI checks found problems in the code more thoroughly than I could by hand
- Algorithms:
- DFS: depth-first search
- BFS: breadth-first search (the variant this crawler uses; see the sketch below)
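A minimal in-memory sketch of that BFS variant, with assumed names; in the real project the queue and the visited set are database tables behind CrawlerDao, so the traversal survives restarts:
```java
package com.github.weiranyi;

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class BfsSketch {
    public static void main(String[] args) {
        Queue<String> pool = new ArrayDeque<>();   // links waiting to be processed
        Set<String> processed = new HashSet<>();   // links already handled
        pool.add("https://sina.cn");               // the seed node

        String link;
        while ((link = pool.poll()) != null) {     // same loop shape as Crawler.run()
            if (!processed.add(link)) {
                continue;                          // already seen: skip it
            }
            // The real project would fetch and parse the page here, store it if it
            // is a news page, and enqueue the page's outgoing links into the pool.
            System.out.println("processing " + link);
        }
    }
}
```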
- Refactoring:
- Short methods:
- a. easier for a human brain to understand
- b. the shorter the method, the easier it is to reuse
- c. in Java, a method is also convenient to override (see the extract-method sketch below)
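A small extract-method sketch (hypothetical code, shaped after this project's link check) showing all three benefits at once:
```java
package com.github.weiranyi;

public class CrawlerSketch {
    public void process(String link) {
        // a. the call site now reads like a sentence
        if (isInterestingLink(link)) {
            // fetch, parse, store ...
        }
    }

    // b. the extracted check can be reused anywhere;
    // c. a subclass can override it to change the crawl policy
    protected boolean isInterestingLink(String link) {
        return link.contains("sina.cn") && !link.contains("passport.sina.cn");
    }
}
```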
- SpotBugs
- `spotbugs` goal: analyze the project
- `check` goal: analyze the project and fail the build when bugs are found
```shell
mvn spotbugs:check
mvn spotbugs:gui
```
- Maven's default lifecycle:
- Maven itself does nothing in the lifecycle phases
- All real work is delegated to plugins
- maven-surefire-plugin is the official test-running plugin
- Plugins can be bound to any phase of the Maven lifecycle
- maven-compiler-plugin
- ORM, object-relational mapping (see the MyBatis sketch below)
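A minimal sketch of what the ORM buys here, reusing this project's own MyBatis config and the MockMapper.selectNews statement: rows from the NEWS table come back as News objects instead of raw JDBC ResultSets.
```java
package com.github.weiranyi;

import com.github.weiranyi.entity.News;
import org.apache.ibatis.io.Resources;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;
import org.apache.ibatis.session.SqlSessionFactoryBuilder;

import java.io.InputStream;
import java.util.List;

public class OrmSketch {
    public static void main(String[] args) throws Exception {
        // Build a session factory from the same config file used elsewhere in the project
        InputStream in = Resources.getResourceAsStream("db/mybatis/config.xml");
        SqlSessionFactory factory = new SqlSessionFactoryBuilder().build(in);
        try (SqlSession session = factory.openSession(true)) {
            // The mapper returns NEWS rows as News objects; mapUnderscoreToCamelCase
            // maps created_at/modified_at onto createdAt/modifiedAt automatically
            List<News> news = session.selectList("com.github.weiranyi.MockMapper.selectNews");
            news.forEach(n -> System.out.println(n.getTitle()));
        }
    }
}
```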
## 1. Iterations
- Version 1:
- Write a multithreaded crawler in Java that performs the HTTP requests and HTML parsing, stores the results in an H2 database, and automates table creation and seeding the initial data with Flyway
- Use Maven for dependency management and CircleCI for automated testing, with the Checkstyle and SpotBugs plugins bound to the build lifecycle to guard code quality
- Version 2: refactor onto an ORM (object-relational mapping) using the MyBatis framework
- Version 3: migrate the data from the H2 database to MySQL via the Flyway plugin
- Version 4: extract the main method out of the crawler class into a new class, making the crawler threads easy to launch
- Version 5: build a simple search program on top of Elasticsearch
## 2. Setup
- Create the GitHub repository and clone it locally:
```shell
# SSH is recommended later on
git clone https://github.com/weiranyi/JavaProject-Crawler-Elasticsearch.git
```
- Let the automation tool Flyway create the tables:
```shell
mvn flyway:migrate
# Fallback: if the last migration failed and the tables need to be recreated, use this instead
mvn flyway:clean && mvn flyway:migrate
```
## 3. Testing
- Run the project's checks and tests:
```shell
mvn verify
```
- Additional checks:
```shell
mvn spotbugs:check
mvn spotbugs:gui
```
- To silence an intentional finding, annotate the offending code with `@SuppressFBWarnings("DMI_CONSTANT_DB_PASSWORD")`, as sketched below:
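A sketch of where that annotation goes, assuming the spotbugs-annotations artifact is on the classpath; the connection URL and credentials are illustrative only:
```java
package com.github.weiranyi;

import edu.umd.cs.findbugs.annotations.SuppressFBWarnings;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectionFactorySketch {
    // The hard-coded local development password is intentional, so the SpotBugs
    // finding is suppressed at the narrowest scope possible: just this method
    @SuppressFBWarnings("DMI_CONSTANT_DB_PASSWORD")
    static Connection connect() throws SQLException {
        return DriverManager.getConnection("jdbc:mysql://localhost:3306/news", "root", "root");
    }
}
```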
## 4. Extensions
[MySQL in this project runs inside Docker; these notes cover how to use Docker](https://zhuanlan.zhihu.com/p/356987233)
## 5. Showcase
- Search in action:
![搜索展示](https://raw.githubusercontent.com/weiranyi/Crawler-Elasticsearch/new-feature/images/search_code.png)
- Crawled data:
![数据展示](https://github.com/weiranyi/Crawler-Elasticsearch/blob/new-feature/images/news_database.png?raw=true)
- Docker:
![Docker](https://github.com/weiranyi/Crawler-Elasticsearch/blob/new-feature/images/Docker.png?raw=true)
- Elasticsearch:
![Elasticsearch](https://github.com/weiranyi/Crawler-Elasticsearch/blob/new-feature/images/Elasticsearch.png?raw=true)
@@ -87,6 +87,12 @@
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.25</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.elasticsearch.client/elasticsearch-rest-high-level-client -->
    <dependency>
        <groupId>org.elasticsearch.client</groupId>
        <artifactId>elasticsearch-rest-high-level-client</artifactId>
        <version>7.12.0</version>
    </dependency>
    <dependency>
        <groupId>org.junit.jupiter</groupId>
        <artifactId>junit-jupiter-api</artifactId>
......
@@ -16,34 +16,40 @@ import java.util.ArrayList;
import java.util.stream.Collectors;
public class Crawler extends Thread {
    private CrawlerDao dao;

    public Crawler(CrawlerDao dao) {
        // Every thread shares the same DAO instance (and thus the same link pool)
        this.dao = dao;
    }

    @Override
    public void run() {
        try {
            String link;
            // Load the next link from the database; keep looping while one is available
            while ((link = dao.getNextLinkThenDelete()) != null) {
                // Skip links that have already been processed
                if (dao.isLinkProcessed(link)) {
                    continue;
                }
                // Only process the links we care about (pages within the Sina site);
                // everything else is simply dropped
                if (isInterestingLink(link)) {
                    Document doc = httpGetAndParseHtml(link);
                    // Collect the URLs found on this page into the pool of URLs to process
                    parseUrlsFromAndStoreIntoDatabase(doc);
                    storeIntoDatabaseIfItIsNewsPage(doc, link);
                    dao.insertProcessedLinked(link);
                }
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    private void parseUrlsFromAndStoreIntoDatabase(Document doc) throws SQLException {
......
package com.github.weiranyi;
import com.github.weiranyi.entity.News;
import org.apache.http.HttpHost;
import org.apache.ibatis.io.Resources;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;
import org.apache.ibatis.session.SqlSessionFactoryBuilder;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
/**
 * @author: https://github.com/weiranyi
 * @description Bulk-load mock news documents into Elasticsearch (16 threads x 2,000,000 documents)
 * @date: 2021/3/31 4:09 PM
 * @Version 1.0
 */
public class ElasticsearchDataGenerator {
    private static void writeSingleThread(List<News> newsFromMySQL) {
        try (RestHighLevelClient client = new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // Each thread writes 2000 * 1000 = 2,000,000 documents
            for (int i = 0; i < 1000; i++) {
                // Batch this round's documents into a single bulk request
                BulkRequest bulkRequest = new BulkRequest();
                for (News news : newsFromMySQL) {
                    IndexRequest request = new IndexRequest("news");
                    Map<String, Object> data = new HashMap<>();
                    data.put("content", news.getContent().length() > 10 ? news.getContent().substring(0, 10) : news.getContent());
                    data.put("url", news.getUrl());
                    data.put("title", news.getTitle());
                    data.put("createdAt", news.getCreatedAt());
                    data.put("modifiedAt", news.getModifiedAt());
                    request.source(data, XContentType.JSON);
                    bulkRequest.add(request);
                }
                // Send the whole batch at once; indexing each document individually
                // inside the loop as well would write everything twice
                BulkResponse bulkResponse = client.bulk(bulkRequest, RequestOptions.DEFAULT);
                System.out.println(Thread.currentThread().getName() + " finishes " + i + ": " + bulkResponse.status().getStatus());
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        SqlSessionFactory sqlSessionFactory;
        try {
            String resource = "db/mybatis/config.xml";
            InputStream inputStream = Resources.getResourceAsStream(resource);
            sqlSessionFactory = new SqlSessionFactoryBuilder().build(inputStream);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        List<News> newsFromMySQL = getNewsFromMySQL(sqlSessionFactory);
        // Launch 16 writer threads that share the same source rows
        for (int i = 0; i < 16; i++) {
            new Thread(() -> writeSingleThread(newsFromMySQL)).start();
        }
    }

    private static List<News> getNewsFromMySQL(SqlSessionFactory sqlSessionFactory) {
        try (SqlSession session = sqlSessionFactory.openSession()) {
            return session.selectList("com.github.weiranyi.MockMapper.selectNews");
        }
    }
}
package com.github.weiranyi;
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.MultiMatchQueryBuilder;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
/**
 * @author: https://github.com/weiranyi
 * @description The search engine: this class implements the search feature
 * @date: 2021/5/26 2:56 PM
 * @Version 1.0
 */
public class ElasticsearchEngine {
    public static void main(String[] args) throws IOException {
        while (true) {
            System.out.println("Enter a search keyword:");
            BufferedReader reader = new BufferedReader(new InputStreamReader(System.in, "utf-8"));
            String keyword = reader.readLine();
            System.out.println(keyword);
            search(keyword);
        }
    }

    private static void search(String keyword) throws IOException {
        try (RestHighLevelClient client = new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            SearchRequest request = new SearchRequest("news");
            // Match the keyword against both the title and the content fields
            request.source(new SearchSourceBuilder().query(new MultiMatchQueryBuilder(keyword, "title", "content")));
            SearchResponse result = client.search(request, RequestOptions.DEFAULT);
            for (SearchHit hit : result.getHits()) {
                System.out.println(hit.getSourceAsString());
            }
        }
    }
}
package com.github.weiranyi;
import java.io.IOException;
import java.sql.SQLException;
public class Main {
    public static void main(String[] args) throws IOException, SQLException {
        CrawlerDao dao = new MyBatisCrawlerDao();
        // Start 16 crawler threads that share one DAO, and thus one link pool
        for (int i = 0; i < 16; i++) {
            new Crawler(dao).start();
        }
    }
}
package com.github.weiranyi;
import com.github.weiranyi.entity.News;
import org.apache.ibatis.io.Resources;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;
import org.apache.ibatis.session.SqlSessionFactoryBuilder;
import java.io.IOException;
import java.io.InputStream;
import java.time.Instant;
import java.util.List;
import java.security.SecureRandom;
/**
 * @author: https://github.com/weiranyi
 * @description Create data for the later search experiments by duplicating the news that has already been crawled
 * @date: 2021/5/25 7:23 PM
 * @Version 1.0
 */
public class MockDataGenerator {
    private static final int TARGET_ROW_COUNT = 100_0000;
    private static final SecureRandom random = new SecureRandom();

    private static void mockData(SqlSessionFactory sqlSessionFactory, int howMany) {
        // An ExecutorType.BATCH session would be an alternative here
        try (SqlSession session = sqlSessionFactory.openSession(true)) {
            List<News> currentNews = session.selectList("com.github.weiranyi.MockMapper.selectNews");
            int count = howMany - currentNews.size();
            try {
                while (count-- > 0) {
                    // Clone a random existing row and shift its timestamps back by up to a year
                    int index = random.nextInt(currentNews.size());
                    News newsToBeInsert = new News(currentNews.get(index));
                    Instant currentTime = newsToBeInsert.getCreatedAt();
                    currentTime = currentTime.minusSeconds(random.nextInt(3600 * 24 * 365));
                    newsToBeInsert.setModifiedAt(currentTime);
                    newsToBeInsert.setCreatedAt(currentTime);
                    session.insert("com.github.weiranyi.MockMapper.insertNews", newsToBeInsert);
                    // Rudimentary progress display
                    System.out.println("Left: " + count);
                    if (count % 2000 == 0) {
                        session.flushStatements();
                    }
                }
                session.commit();
            } catch (Exception e) {
                session.rollback();
                throw new RuntimeException(e);
            }
        }
    }

    public static void main(String[] args) {
        SqlSessionFactory sqlSessionFactory;
        try {
            String resource = "db/mybatis/config.xml";
            InputStream inputStream = Resources.getResourceAsStream(resource);
            sqlSessionFactory = new SqlSessionFactoryBuilder().build(inputStream);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        mockData(sqlSessionFactory, TARGET_ROW_COUNT);
    }
}
/**
 * # Query all rows in the table
 * SELECT * FROM NEWS
 * # Keep only the date part of the timestamps
 * update NEWS set created_at = date(created_at), modified_at = date(modified_at)
 * # Create an index
 * CREATE INDEX created_at_index ON NEWS(created_at)
 * # Inspect the index (a B-tree by default)
 * show index from NEWS
 * # 36 s before the index, 1 s after; repeated runs keep getting faster, because the
 * # database has several layers of caching and later runs are no longer cold starts
 * SELECT * FROM NEWS WHERE created_at = '2019-08-29'
 * # See how the current SQL statement will be executed
 * EXPLAIN SELECT * FROM NEWS WHERE created_at = '2019-08-29'
 */
@@ -26,9 +26,9 @@ public class MyBatisCrawlerDao implements CrawlerDao {
        sqlSessionFactory = new SqlSessionFactoryBuilder().build(inputStream);
    }

    // [Now synchronized, making it an atomic operation] fetch the next link, then delete it
    @Override
    public synchronized String getNextLinkThenDelete() throws SQLException {
        // SqlSession openSession(boolean autoCommit): a transaction is involved here, and it only
        // takes effect once committed, so autoCommit must be set to true
        try (SqlSession session = sqlSessionFactory.openSession(true)) {
            String url = session.selectOne("com.github.weiranyi.MyMapper.selectNextAvailableLink");
@@ -47,7 +47,6 @@ public class MyBatisCrawlerDao {
        }
    }

    @Override
    public boolean isLinkProcessed(String link) throws SQLException {
        try (SqlSession session = sqlSessionFactory.openSession(true)) {
......
package com.github.weiranyi.entity;
import java.time.Instant;
/**
 * @author: https://github.com/weiranyi
 * @description The news entity class
@@ -12,6 +14,9 @@ public class News {
    private String url;
    private String content;
    private String title;
    // Instant models a point on the timeline; it replaces the older date types here
    private Instant createdAt;
    private Instant modifiedAt;

    public News() {
@@ -23,6 +28,16 @@ public class News {
        this.title = title;
    }

    // Copy constructor, used by the mock-data generator
    public News(News old) {
        this.id = old.id;
        this.url = old.url;
        this.content = old.content;
        this.title = old.title;
        this.createdAt = old.createdAt;
        this.modifiedAt = old.modifiedAt;
    }

    public Integer getId() {
        return id;
    }
@@ -54,4 +69,20 @@ public class News {
    public void setTitle(String title) {
        this.title = title;
    }

    public Instant getCreatedAt() {
        return createdAt;
    }

    public void setCreatedAt(Instant createdAt) {
        this.createdAt = createdAt;
    }

    public Instant getModifiedAt() {
        return modifiedAt;
    }

    public void setModifiedAt(Instant modifiedAt) {
        this.modifiedAt = modifiedAt;
    }
}
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE mapper
PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN"
"http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="com.github.weiranyi.MockMapper">
    <insert id="insertNews" parameterType="com.github.weiranyi.entity.News">
        insert into NEWS (url, title, content, created_at, modified_at)
        values (#{url}, #{title}, #{content}, #{createdAt}, #{modifiedAt})
    </insert>
    <select id="selectNews" resultType="com.github.weiranyi.entity.News">
        select id, url, title, content, created_at, modified_at
        from NEWS
        limit 2000
    </select>
</mapper>
@@ -8,7 +8,11 @@
PUBLIC "-//mybatis.org//DTD Config 3.0//EN"
"http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
    <!--
        1. This settings block converts between the two naming conventions
        2. settings has a fixed position in the configuration element order and must come before the other elements
        3. By default MyBatis does not map underscored database column names onto Java camelCase names
    -->
    <settings>
        <setting name="mapUnderscoreToCamelCase" value="true"/>
    </settings>
@@ -26,5 +30,6 @@
    </environments>
    <mappers>
        <mapper resource="db/mybatis/MyMapper.xml"/>
        <mapper resource="db/mybatis/MockMapper.xml"/>
    </mappers>
</configuration>