Commit eb972ef4 authored by 亦蔚然

V5__Implement basic Elasticsearch search

Parent a2980969
# Project: A Multithreaded Crawler and Elasticsearch Search Engine, Hands-On
***
## 1. Project Overview
- 1. Project goal: crawl the Sina website
- 2. Requirements analysis and algorithm design:
  - Requirement: starting from one page (node) of the site, traverse all reachable nodes
  - Algorithm: a variant of breadth-first search
![img.png](https://github.com/weiranyi/JavaProject-Crawler-Elasticsearch/blob/yiweiran/images/flowChart.png?raw=true)
***
## Lessons and Takeaways
### 1. Principles for doing a project
- Mindset:
  - 1. Treat every project as the best project of your life: polish it meticulously, write the documentation carefully, and guard code quality (work at your current best level; code-analysis tools can help)
  - 2. Use industry-standard patterns and workflows, and let no line of code be superfluous (for example, never upload files that don't belong in the repo, such as .idea, to GitHub); avoid local dependencies so that others can use the project without friction
  - 3. Move in small, fast steps: small changes build a sense of progress, are easier to debug, and the earlier they land the better
- Hard rules:
  - 1. [Important] Develop on GitHub with a trunk/branch model
    - Never push directly to master
    - Every change must go through a PR
  - 2. [Important] Automated code-quality checks plus tests
    - Checkstyle / SpotBugs
    - At least a basic level of automated test coverage
- Project design process:
  - Team collaboration [top-down]:
    - Modularize
    - Give each module clear responsibilities and boundaries
    - Basic documentation
    - Basic interfaces
    - Small commits
      - Large changes are hard to review
      - Large changes are even trickier to untangle when something breaks
      - Keep commit granularity small
  - Working solo [bottom-up]:
    - Implement the feature first
    - Keep extracting the shared parts as you implement
    - Whenever your code turns repetitive (copy-paste after copy-paste), it is time to refactor
    - Use refactoring to arrive at modules and interfaces
- Project evolution:
  - single-threaded -> multithreaded
  - console -> H2 -> MySQL
  - database -> Elasticsearch
- Good coding habits:
  - Don't write compromised code; refactor away bad or long-winded code (beginners otherwise end up studying a lot of bad code)
  - Reuse good third-party implementations where they exist, e.g. the packages Apache provides
  - Keep the code reasonably extensible
### 2. Takeaways
- Smoke testing. Testing principle: each test is its own class and covers one small functional module; a sketch follows below:
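A minimal sketch of that principle, assuming JUnit 5 (junit-jupiter is already among the dependencies); the link check is inlined as a hypothetical copy, since the real one is private to Crawler:
```java
package com.github.weiranyi;

import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

// One test class per small module: this one covers only the link filter.
class LinkFilterSmokeTest {
    // Hypothetical inline copy of the crawler's link check, so the sketch is self-contained
    private static boolean isInterestingLink(String link) {
        return link.contains("sina.cn") && !link.contains("passport.sina.cn");
    }

    @Test
    void onlySinaPagesAreInteresting() {
        assertTrue(isInterestingLink("https://news.sina.cn/2021/article.html"));
        assertFalse(isInterestingLink("https://example.com/"));
    }
}
```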
- Git command refresher:
- Create a new branch:
```shell
git checkout -b basic-algorithm
```
- Undo a `git add`:
```shell
git restore --staged src/main/java/com/github/weiranyi/Main.java
```
- Undo the most recent commit after everything has been committed:
```shell
git reset HEAD~1
```
- Revert a commit that went in through a PR:
```shell
git log    # find the hash of the commit to revert, e.g. 61b22195162ec24fbbf2ef020485bb0a524c82b9
git revert 61b22195162ec24fbbf2ef020485bb0a524c82b9
```
- If your branch conflicts with master, a force push can resolve it:
```shell
git push -f
```
- Commit message format: the first line is the subject, the body starts on the following lines, and each line should stay under 72 characters
- The CircleCI checks found problems in the code more thoroughly than I could by hand
- Algorithms:
- DFS: depth-first search
- BFS: breadth-first search (the variant this crawler uses; see the sketch below)
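A minimal in-memory sketch of that BFS variant, with assumed names; in the real project the queue and the visited set are database tables behind CrawlerDao, so the traversal survives restarts:
```java
package com.github.weiranyi;

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class BfsSketch {
    public static void main(String[] args) {
        Queue<String> pool = new ArrayDeque<>();   // links waiting to be processed
        Set<String> processed = new HashSet<>();   // links already handled
        pool.add("https://sina.cn");               // the seed node

        String link;
        while ((link = pool.poll()) != null) {     // same loop shape as Crawler.run()
            if (!processed.add(link)) {
                continue;                          // already seen: skip it
            }
            // The real project would fetch and parse the page here, store it if it
            // is a news page, and enqueue the page's outgoing links into the pool.
            System.out.println("processing " + link);
        }
    }
}
```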
- Refactoring:
- Short methods:
- a. easier for a human brain to understand
- b. the shorter the method, the easier it is to reuse
- c. in Java, a method is also convenient to override (see the extract-method sketch below)
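A small extract-method sketch (hypothetical code, shaped after this project's link check) showing all three benefits at once:
```java
package com.github.weiranyi;

public class CrawlerSketch {
    public void process(String link) {
        // a. the call site now reads like a sentence
        if (isInterestingLink(link)) {
            // fetch, parse, store ...
        }
    }

    // b. the extracted check can be reused anywhere;
    // c. a subclass can override it to change the crawl policy
    protected boolean isInterestingLink(String link) {
        return link.contains("sina.cn") && !link.contains("passport.sina.cn");
    }
}
```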
- SpotBugs
- `spotbugs` goal: analyze the project
- `check` goal: analyze the project and fail the build when bugs are found
```shell
mvn spotbugs:check
mvn spotbugs:gui
```
- Maven's default lifecycle:
- Maven itself does nothing in the lifecycle phases
- All real work is delegated to plugins
- maven-surefire-plugin is the official test-running plugin
- Plugins can be bound to any phase of the Maven lifecycle
- maven-compiler-plugin
- ORM, object-relational mapping (see the MyBatis sketch below)
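A minimal sketch of what the ORM buys here, reusing this project's own MyBatis config and the MockMapper.selectNews statement: rows from the NEWS table come back as News objects instead of raw JDBC ResultSets.
```java
package com.github.weiranyi;

import com.github.weiranyi.entity.News;
import org.apache.ibatis.io.Resources;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;
import org.apache.ibatis.session.SqlSessionFactoryBuilder;

import java.io.InputStream;
import java.util.List;

public class OrmSketch {
    public static void main(String[] args) throws Exception {
        // Build a session factory from the same config file used elsewhere in the project
        InputStream in = Resources.getResourceAsStream("db/mybatis/config.xml");
        SqlSessionFactory factory = new SqlSessionFactoryBuilder().build(in);
        try (SqlSession session = factory.openSession(true)) {
            // The mapper returns NEWS rows as News objects; mapUnderscoreToCamelCase
            // maps created_at/modified_at onto createdAt/modifiedAt automatically
            List<News> news = session.selectList("com.github.weiranyi.MockMapper.selectNews");
            news.forEach(n -> System.out.println(n.getTitle()));
        }
    }
}
```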
## 1. Iterations
- Version 1:
- Write a multithreaded crawler in Java that performs the HTTP requests and HTML parsing, stores the results in an H2 database, and automates table creation and seeding the initial data with Flyway
- Use Maven for dependency management and CircleCI for automated testing, with the Checkstyle and SpotBugs plugins bound to the build lifecycle to guard code quality
- Version 2: refactor onto an ORM (object-relational mapping) using the MyBatis framework
- Version 3: migrate the data from the H2 database to MySQL via the Flyway plugin
- Version 4: extract the main method out of the crawler class into a new class, making the crawler threads easy to launch
- Version 5: build a simple search program on top of Elasticsearch
## 2. Setup
- Create the GitHub repository and clone it locally:
```shell
# SSH is recommended later on
git clone https://github.com/weiranyi/JavaProject-Crawler-Elasticsearch.git
```
- Let the automation tool Flyway create the tables:
```shell
mvn flyway:migrate
# Fallback: if the last migration failed and the tables need to be recreated, use this instead
mvn flyway:clean && mvn flyway:migrate
```
## 3. Testing
- Run the project's checks and tests:
```shell
mvn verify
```
- Additional checks:
```shell
mvn spotbugs:check
mvn spotbugs:gui
```
- To silence an intentional finding, annotate the offending code with `@SuppressFBWarnings("DMI_CONSTANT_DB_PASSWORD")`, as sketched below:
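A sketch of where that annotation goes, assuming the spotbugs-annotations artifact is on the classpath; the connection URL and credentials are illustrative only:
```java
package com.github.weiranyi;

import edu.umd.cs.findbugs.annotations.SuppressFBWarnings;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectionFactorySketch {
    // The hard-coded local development password is intentional, so the SpotBugs
    // finding is suppressed at the narrowest scope possible: just this method
    @SuppressFBWarnings("DMI_CONSTANT_DB_PASSWORD")
    static Connection connect() throws SQLException {
        return DriverManager.getConnection("jdbc:mysql://localhost:3306/news", "root", "root");
    }
}
```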
## 4. Extensions
[MySQL in this project runs inside Docker; these notes cover how to use Docker](https://zhuanlan.zhihu.com/p/356987233)
## 5. Showcase
- Search in action:
![搜索展示](https://raw.githubusercontent.com/weiranyi/Crawler-Elasticsearch/new-feature/images/search_code.png)
- Crawled data:
![数据展示](https://github.com/weiranyi/Crawler-Elasticsearch/blob/new-feature/images/news_database.png?raw=true)
- Docker:
![Docker](https://github.com/weiranyi/Crawler-Elasticsearch/blob/new-feature/images/Docker.png?raw=true)
- Elasticsearch:
![Elasticsearch](https://github.com/weiranyi/Crawler-Elasticsearch/blob/new-feature/images/Elasticsearch.png?raw=true)
@@ -87,6 +87,12 @@
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.25</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.elasticsearch.client/elasticsearch-rest-high-level-client -->
    <dependency>
        <groupId>org.elasticsearch.client</groupId>
        <artifactId>elasticsearch-rest-high-level-client</artifactId>
        <version>7.12.0</version>
    </dependency>
    <dependency>
        <groupId>org.junit.jupiter</groupId>
        <artifactId>junit-jupiter-api</artifactId>
......
@@ -16,34 +16,40 @@ import java.util.ArrayList;
import java.util.stream.Collectors;
public class Crawler extends Thread {
    private CrawlerDao dao;

    public Crawler(CrawlerDao dao) {
        // Every thread shares the same DAO instance (and thus the same link pool)
        this.dao = dao;
    }

    @Override
    public void run() {
        try {
            String link;
            // Load the next link from the database; keep looping while one is available
            while ((link = dao.getNextLinkThenDelete()) != null) {
                // Skip links that have already been processed
                if (dao.isLinkProcessed(link)) {
                    continue;
                }
                // Only process the links we care about (pages within the Sina site);
                // everything else is simply dropped
                if (isInterestingLink(link)) {
                    Document doc = httpGetAndParseHtml(link);
                    // Collect the URLs found on this page into the pool of URLs to process
                    parseUrlsFromAndStoreIntoDatabase(doc);
                    storeIntoDatabaseIfItIsNewsPage(doc, link);
                    dao.insertProcessedLinked(link);
                }
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    private void parseUrlsFromAndStoreIntoDatabase(Document doc) throws SQLException {
......
package com.github.weiranyi;
import com.github.weiranyi.entity.News;
import org.apache.http.HttpHost;
import org.apache.ibatis.io.Resources;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;
import org.apache.ibatis.session.SqlSessionFactoryBuilder;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
/**
 * @author: https://github.com/weiranyi
 * @description Bulk-load mock news documents into Elasticsearch (16 threads x 2,000,000 documents)
 * @date: 2021/3/31 4:09 PM
 * @Version 1.0
 */
public class ElasticsearchDataGenerator {
    private static void writeSingleThread(List<News> newsFromMySQL) {
        try (RestHighLevelClient client = new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // Each thread writes 2000 * 1000 = 2,000,000 documents
            for (int i = 0; i < 1000; i++) {
                // Batch this round's documents into a single bulk request
                BulkRequest bulkRequest = new BulkRequest();
                for (News news : newsFromMySQL) {
                    IndexRequest request = new IndexRequest("news");
                    Map<String, Object> data = new HashMap<>();
                    data.put("content", news.getContent().length() > 10 ? news.getContent().substring(0, 10) : news.getContent());
                    data.put("url", news.getUrl());
                    data.put("title", news.getTitle());
                    data.put("createdAt", news.getCreatedAt());
                    data.put("modifiedAt", news.getModifiedAt());
                    request.source(data, XContentType.JSON);
                    bulkRequest.add(request);
                }
                // Send the whole batch at once; indexing each document individually
                // inside the loop as well would write everything twice
                BulkResponse bulkResponse = client.bulk(bulkRequest, RequestOptions.DEFAULT);
                System.out.println(Thread.currentThread().getName() + " finishes " + i + ": " + bulkResponse.status().getStatus());
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        SqlSessionFactory sqlSessionFactory;
        try {
            String resource = "db/mybatis/config.xml";
            InputStream inputStream = Resources.getResourceAsStream(resource);
            sqlSessionFactory = new SqlSessionFactoryBuilder().build(inputStream);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        List<News> newsFromMySQL = getNewsFromMySQL(sqlSessionFactory);
        // Launch 16 writer threads that share the same source rows
        for (int i = 0; i < 16; i++) {
            new Thread(() -> writeSingleThread(newsFromMySQL)).start();
        }
    }

    private static List<News> getNewsFromMySQL(SqlSessionFactory sqlSessionFactory) {
        try (SqlSession session = sqlSessionFactory.openSession()) {
            return session.selectList("com.github.weiranyi.MockMapper.selectNews");
        }
    }
}
package com.github.weiranyi;
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.MultiMatchQueryBuilder;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
/**
 * @author: https://github.com/weiranyi
 * @description The search engine: this class implements the search feature
 * @date: 2021/5/26 2:56 PM
 * @Version 1.0
 */
public class ElasticsearchEngine {
    public static void main(String[] args) throws IOException {
        while (true) {
            System.out.println("Enter a search keyword:");
            BufferedReader reader = new BufferedReader(new InputStreamReader(System.in, "utf-8"));
            String keyword = reader.readLine();
            System.out.println(keyword);
            search(keyword);
        }
    }

    private static void search(String keyword) throws IOException {
        try (RestHighLevelClient client = new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            SearchRequest request = new SearchRequest("news");
            // Match the keyword against both the title and the content fields
            request.source(new SearchSourceBuilder().query(new MultiMatchQueryBuilder(keyword, "title", "content")));
            SearchResponse result = client.search(request, RequestOptions.DEFAULT);
            for (SearchHit hit : result.getHits()) {
                System.out.println(hit.getSourceAsString());
            }
        }
    }
}
package com.github.weiranyi;
import java.io.IOException;
import java.sql.SQLException;
public class Main {
    public static void main(String[] args) throws IOException, SQLException {
        CrawlerDao dao = new MyBatisCrawlerDao();
        // Start 16 crawler threads that share one DAO, and thus one link pool
        for (int i = 0; i < 16; i++) {
            new Crawler(dao).start();
        }
    }
}
package com.github.weiranyi;
import com.github.weiranyi.entity.News;
import org.apache.ibatis.io.Resources;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;
import org.apache.ibatis.session.SqlSessionFactoryBuilder;
import java.io.IOException;
import java.io.InputStream;
import java.time.Instant;
import java.util.List;
import java.security.SecureRandom;
/**
 * @author: https://github.com/weiranyi
 * @description Create data for the later search experiments by duplicating the news that has already been crawled
 * @date: 2021/5/25 7:23 PM
 * @Version 1.0
 */
public class MockDataGenerator {
    private static final int TARGET_ROW_COUNT = 100_0000;
    private static final SecureRandom random = new SecureRandom();

    private static void mockData(SqlSessionFactory sqlSessionFactory, int howMany) {
        // An ExecutorType.BATCH session would be an alternative here
        try (SqlSession session = sqlSessionFactory.openSession(true)) {
            List<News> currentNews = session.selectList("com.github.weiranyi.MockMapper.selectNews");
            int count = howMany - currentNews.size();
            try {
                while (count-- > 0) {
                    // Clone a random existing row and shift its timestamps back by up to a year
                    int index = random.nextInt(currentNews.size());
                    News newsToBeInsert = new News(currentNews.get(index));
                    Instant currentTime = newsToBeInsert.getCreatedAt();
                    currentTime = currentTime.minusSeconds(random.nextInt(3600 * 24 * 365));
                    newsToBeInsert.setModifiedAt(currentTime);
                    newsToBeInsert.setCreatedAt(currentTime);
                    session.insert("com.github.weiranyi.MockMapper.insertNews", newsToBeInsert);
                    // Rudimentary progress display
                    System.out.println("Left: " + count);
                    if (count % 2000 == 0) {
                        session.flushStatements();
                    }
                }
                session.commit();
            } catch (Exception e) {
                session.rollback();
                throw new RuntimeException(e);
            }
        }
    }

    public static void main(String[] args) {
        SqlSessionFactory sqlSessionFactory;
        try {
            String resource = "db/mybatis/config.xml";
            InputStream inputStream = Resources.getResourceAsStream(resource);
            sqlSessionFactory = new SqlSessionFactoryBuilder().build(inputStream);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        mockData(sqlSessionFactory, TARGET_ROW_COUNT);
    }
}
/**
 * # Query all rows in the table
 * SELECT * FROM NEWS
 * # Keep only the date part of the timestamps
 * update NEWS set created_at = date(created_at), modified_at = date(modified_at)
 * # Create an index
 * CREATE INDEX created_at_index ON NEWS(created_at)
 * # Inspect the index (a B-tree by default)
 * show index from NEWS
 * # 36 s before the index, 1 s after; repeated runs keep getting faster, because the
 * # database has several layers of caching and later runs are no longer cold starts
 * SELECT * FROM NEWS WHERE created_at = '2019-08-29'
 * # See how the current SQL statement will be executed
 * EXPLAIN SELECT * FROM NEWS WHERE created_at = '2019-08-29'
 */
@@ -26,9 +26,9 @@ public class MyBatisCrawlerDao implements CrawlerDao {
        sqlSessionFactory = new SqlSessionFactoryBuilder().build(inputStream);
    }

    // [Now synchronized, making it an atomic operation] fetch the next link, then delete it
    @Override
    public synchronized String getNextLinkThenDelete() throws SQLException {
        // SqlSession openSession(boolean autoCommit): a transaction is involved here, and it only
        // takes effect once committed, so autoCommit must be set to true
        try (SqlSession session = sqlSessionFactory.openSession(true)) {
            String url = session.selectOne("com.github.weiranyi.MyMapper.selectNextAvailableLink");
@@ -47,7 +47,6 @@ public class MyBatisCrawlerDao {
        }
    }

    @Override
    public boolean isLinkProcessed(String link) throws SQLException {
        try (SqlSession session = sqlSessionFactory.openSession(true)) {
......
package com.github.weiranyi.entity;
import java.time.Instant;
/**
 * @author: https://github.com/weiranyi
 * @description The news entity class
@@ -12,6 +14,9 @@ public class News {
    private String url;
    private String content;
    private String title;
    // Instant models a point on the timeline; it replaces the older date types here
    private Instant createdAt;
    private Instant modifiedAt;

    public News() {
@@ -23,6 +28,16 @@ public class News {
        this.title = title;
    }

    // Copy constructor, used by the mock-data generator
    public News(News old) {
        this.id = old.id;
        this.url = old.url;
        this.content = old.content;
        this.title = old.title;
        this.createdAt = old.createdAt;
        this.modifiedAt = old.modifiedAt;
    }

    public Integer getId() {
        return id;
    }
@@ -54,4 +69,20 @@ public class News {
    public void setTitle(String title) {
        this.title = title;
    }

    public Instant getCreatedAt() {
        return createdAt;
    }

    public void setCreatedAt(Instant createdAt) {
        this.createdAt = createdAt;
    }

    public Instant getModifiedAt() {
        return modifiedAt;
    }

    public void setModifiedAt(Instant modifiedAt) {
        this.modifiedAt = modifiedAt;
    }
}
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE mapper
PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN"
"http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="com.github.weiranyi.MockMapper">
    <insert id="insertNews" parameterType="com.github.weiranyi.entity.News">
        insert into NEWS (url, title, content, created_at, modified_at)
        values (#{url}, #{title}, #{content}, #{createdAt}, #{modifiedAt})
    </insert>
    <select id="selectNews" resultType="com.github.weiranyi.entity.News">
        select id, url, title, content, created_at, modified_at
        from NEWS
        limit 2000
    </select>
</mapper>
@@ -8,7 +8,11 @@
PUBLIC "-//mybatis.org//DTD Config 3.0//EN"
"http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
    <!--
        1. This settings block converts between the two naming conventions
        2. settings has a fixed position in the configuration element order and must come before the other elements
        3. By default MyBatis does not map underscored database column names onto Java camelCase names
    -->
    <settings>
        <setting name="mapUnderscoreToCamelCase" value="true"/>
    </settings>
@@ -26,5 +30,6 @@
    </environments>
    <mappers>
        <mapper resource="db/mybatis/MyMapper.xml"/>
        <mapper resource="db/mybatis/MockMapper.xml"/>
    </mappers>
</configuration>