Problems deploying a single-instance Nacos cluster with Docker Compose

Background

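While studying Nacos's monitoring features, I tried to deploy a single-instance Nacos + Grafana + Prometheus cluster with Docker Compose on Linux, and the deployment failed. I followed the commands from the official documentation to deploy the Nacos cluster on a CentOS server: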

git clone --depth 1 https://github.com/nacos-group/nacos-docker.git
cd nacos-docker
docker-compose -f example/standalone-derby.yaml up

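Deploying on my own Windows machine worked without any problems. When I then deployed on the Linux server, however, the cluster reported the following error at startup: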

...
nacos-standalone  | 2023-05-08 21:28:05,238 ERROR Error starting Tomcat context. Exception: org.springframework.beans.factory.UnsatisfiedDependencyException. Message: Error creating bean with name 'basicAuthenticationFilter' defined in class path resource [com/alibaba/nacos/prometheus/filter/PrometheusAuthFilter.class]: Unsatisfied dependency expressed through method 'basicAuthenticationFilter' parameter 0; nested exception is org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'nacosAuthConfig' defined in URL [jar:file:/home/nacos/target/nacos-server.jar!/BOOT-INF/lib/nacos-plugin-default-impl-2.2.2.jar!/com/alibaba/nacos/plugin/auth/impl/NacosAuthConfig.class]: Unsatisfied dependency expressed through constructor parameter 3; nested exception is org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'nacosUserDetailsServiceImpl': Unsatisfied dependency expressed through field 'userPersistService'; nested exception is org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'embeddedUserPersistServiceImpl': Unsatisfied dependency expressed through field 'databaseOperate'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'standaloneDatabaseOperateImpl': Invocation of init method failed; nested exception is java.lang.RuntimeException: com.alibaba.nacos.api.exception.runtime.NacosRuntimeException: errCode: 500, errMsg: load derby-schema.sql error.
...

It looks like something went wrong in the cluster's Derby persistence layer.

Solution

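In fact, many people have already raised this problem in GitHub issues, and the official reply is given in issue#9742 of the main Nacos repository. The cause is that Nacos's persistence layer, the Derby database, has to be loaded from disk, but Nacos starts executing SQL statements before Derby has finished loading, so initialization fails with the "load derby-schema.sql error" shown above. This usually happens when disk I/O is too slow or disk space is insufficient. My Windows PC uses an SSD, so I never saw the problem there. The Linux server I use has an HDD, and the error occurs almost every time.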

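The fix is also given in issue#9742: raise the db.pool.config.connectionTimeout setting to 60000 (the default is 30000). With Docker Compose, it can be passed to the container as the environment variable DB_POOL_CONFIG_CONNECTIONTIMEOUT.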

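The official documentation also states that, to use the Grafana and Prometheus services in the cluster, you need to map /home/nacos/conf/application.properties from the Nacos container to the host and then set management.endpoints.web.exposure.include=* in that file. In practice, adding the setting directly to standalone-derby.yaml works as well and is more convenient, since it avoids having to manage a separate configuration file.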
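The environment-variable names used in the compose file follow the usual Spring Boot convention for mapping property keys to environment variables: dots become underscores, dashes are removed, and the name is upper-cased. The sketch below applies that rule to the two properties discussed here. The helper name to_env_name is hypothetical, and the sketch assumes Nacos picks these variables up through Spring Boot's standard binding, which this article does not verify beyond the two settings it uses.

```shell
# Hypothetical helper: convert a Spring-style property key into the
# environment-variable form (dots -> underscores, dashes dropped, upper case).
to_env_name() {
  printf '%s\n' "$1" \
    | sed -e 's/-//g' -e 's/\./_/g' \
    | tr '[:lower:]' '[:upper:]'
}

# The two properties discussed in this article:
to_env_name db.pool.config.connectionTimeout
to_env_name management.endpoints.web.exposure.include
```

The two printed names should match the entries under environment: in the compose file.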

Complete docker-compose.yml

The modified example/standalone-derby.yaml looks like this:

version: "2"
services:
  nacos:
    image: nacos/nacos-server:${NACOS_VERSION}
    container_name: nacos-standalone
    environment:
      - DB_POOL_CONFIG_CONNECTIONTIMEOUT=60000 # give the Derby database more time to load
      - MANAGEMENT_ENDPOINTS_WEB_EXPOSURE_INCLUDE=* # expose all monitoring endpoints
      - PREFER_HOST_MODE=hostname
      - MODE=standalone
    volumes:
      - ./standalone-logs/:/home/nacos/logs
    ports:
      - "8848:8848"
      - "9848:9848"
  prometheus:
    container_name: prometheus
    image: prom/prometheus:latest
    volumes:
      - ./prometheus/prometheus-standalone.yaml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
    depends_on:
      - nacos
    restart: on-failure
  grafana:
    container_name: grafana
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    restart: on-failure
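
On a slow disk, Nacos may still take a while to come up even with the longer timeout, so it can help to poll until the metrics endpoint responds before opening Prometheus or Grafana. The following is a minimal sketch of such a retry loop. The helper name wait_until is hypothetical, and the probe URL in the comment assumes the metrics endpoint is served at /nacos/actuator/prometheus on the published port 8848, which may differ depending on the Nacos version and authentication settings.

```shell
# Hypothetical helper: run a command up to <attempts> times, sleeping
# <delay> seconds after each failed try. Returns 0 on the first success,
# 1 if every attempt fails.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (assumed endpoint path, requires curl and a running stack):
#   wait_until 30 5 curl -fsS http://localhost:8848/nacos/actuator/prometheus
```

With 30 attempts and a 5-second delay, the example call gives the container roughly two and a half minutes to start before giving up.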