在负责的一个项目迁移到公司内部私有云的过程中,发现迁移过后的服务的监控数据不理想,具体表现为:平均响应时间増大、响应时间抖动、cache 访问出现部分超时等,与原本物理机上面部署的服务性能有差异。经过排查发现问题的导火索可能是 JVM GC 导致的暂停;经过几轮对比调参之后,基本上保证了服务的较好状态。

🔗背景

所负责的项目主要是存储用户的基本信息数据,数据量在亿级。存储底层为MySQL,使用了自研的分库分表中间件;数据库实例个数在5个以上,单实例数据表行数在千万以上。

为了配合公司的云化战略,该服务需要从物理机部署迁移为私有云(docker based)部署。

服务的运行环境是 JDK 1.6,JVM 运行参数为:

-server
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:+DisableExplicitGC
-Xms10g
-Xmx10g
-Xmn4g
-Xss1024K
-XX:PermSize=256m
-XX:MaxPermSize=512m
-XX:SurvivorRatio=10
-XX:+ParallelRefProcEnabled
-XX:+CMSParallelRemarkEnabled
-XX:+UseCMSCompactAtFullCollection
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=70
-XX:CMSMaxAbortablePrecleanTime=30000
-XX:SoftRefLRUPolicyMSPerMB=0

即使用的 GC 收集器为 ParNew (new) + ConcMarkSweep (CMS, old)。

在迁移过程中,发现私有云上的服务实例:1)访问 Cache 出现超时,且超时量较明显;2)平均请求响应时间增大;3)平均响应时间会出现偶发性的突刺。

线上的服务出现了问题就需要解决问题,于是开始对服务进行排查。

对比迁移前后的服务变化和环境变化:

  1. 迁移之前是部署到物理机上(32 core + 128 G),迁移之后是部署到私有云上(8 core + 8 G)
  2. 迁移之前的 JVM 参数中配置的内存数为 10 G,迁移之后调整为 7 G

通过定位 Cache 超时问题,很快发现服务会出现较长时间的全局暂停(> 100 ms),很容易联想到 GC 的 stop-the-world 上去。

🔗GC 调优过程

🔗Round 1 - 自适应新生代

首先需要在服务中启用 GC 日志,并去掉了显式的新生代大小设置(-Xmn)。

-Xms7g
-Xmx7g
-Xloggc:/path-to-gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps

实际采集到的 GC 日志片段如下:

> 2018-12-18T12:00:44.860+0800: 54473.284: [GC 54473.284: [ParNew: 143331K->1260K(156224K), 0.0121330 secs] 5110193K->4968165K(7325888K), 0.0126500 secs] [Times: user=0.08 sys=0.01, real=0.02 secs]
> 2018-12-18T12:00:45.233+0800: 54473.657: [GC 54473.657: [ParNew: 143340K->1319K(156224K), 0.0106800 secs] 5110245K->4968250K(7325888K), 0.0111050 secs] [Times: user=0.07 sys=0.01, real=0.01 secs]
> 2018-12-18T12:00:45.660+0800: 54474.085: [GC 54474.085: [ParNew: 143399K->1346K(156224K), 0.0121900 secs] 5110330K->4968289K(7325888K), 0.0125390 secs] [Times: user=0.08 sys=0.01, real=0.02 secs]
> 2018-12-18T12:00:46.040+0800: 54474.464: [GC 54474.464: [ParNew: 143426K->1246K(156224K), 0.0118000 secs] 5110369K->4968244K(7325888K), 0.0121270 secs] [Times: user=0.08 sys=0.01, real=0.02 secs]
> 2018-12-18T12:00:46.459+0800: 54474.883: [GC 54474.883: [ParNew: 143326K->1218K(156224K), 0.0101350 secs] 5110324K->4968233K(7325888K), 0.0105130 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
> 2018-12-18T12:00:46.834+0800: 54475.258: [GC 54475.259: [ParNew: 143298K->1194K(156224K), 0.0101920 secs] 5110313K->4968223K(7325888K), 0.0106250 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
> 2018-12-18T12:00:47.230+0800: 54475.654: [GC 54475.654: [ParNew: 143274K->1150K(156224K), 0.0125030 secs] 5110303K->4968202K(7325888K), 0.0128590 secs] [Times: user=0.09 sys=0.00, real=0.02 secs]
> 2018-12-18T12:00:47.549+0800: 54475.973: [GC 54475.973: [ParNew: 143230K->1333K(156224K), 0.0102610 secs] 5110282K->4968408K(7325888K), 0.0106520 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
> 2018-12-18T12:00:47.954+0800: 54476.378: [GC 54476.378: [ParNew: 143413K->1198K(156224K), 0.0102270 secs] 5110488K->4968301K(7325888K), 0.0106760 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
> 2018-12-18T12:00:48.354+0800: 54476.778: [GC 54476.779: [ParNew: 143278K->1213K(156224K), 0.0107860 secs] 5110381K->4968334K(7325888K), 0.0112900 secs]

可以看到 JVM 自动选择的新生代大小为 156224K,新生代 GC 的暂停时长为 10-13 ms,间隔 ~ 4 s。

作为参考,调用 jstat -gcutil 的输出如下:

> 11:41:13 AM   16   15.96    0.00    5.32    0.00    0.00    3.19    0.00    0.00 75.53
> 11:41:13 AM   17   15.46    0.00    7.22    0.00    0.00    3.09    0.00    0.00 74.23
> 11:41:13 AM   18   15.62    0.00    6.25    0.00    0.00    4.17    0.00    0.00 73.96
> 11:41:13 AM   19   16.16    0.00    9.09    0.00    0.00    2.02    0.00    0.00 72.73
> 11:41:13 AM   20   14.29    0.00    8.16    0.00    0.00    1.02    0.00    0.00 76.53
> 11:41:13 AM   21   12.77    0.00    7.45    0.00    0.00    3.19    0.00    0.00 76.60
> 11:41:13 AM   22   15.46    0.00    7.22    0.00    0.00    2.06    0.00    0.00 75.26
> 11:41:13 AM   23   16.13    0.00    5.38    0.00    0.00    4.30    0.00    0.00 74.19
> 11:41:13 AM   24   16.49    0.00    7.22    0.00    0.00    2.06    0.00    0.00 74.23
> 11:41:13 AM   25   16.67    0.00    7.29    0.00    0.00    1.04    0.00    0.00 75.00
> 11:41:13 AM   26   17.89    0.00    5.26    0.00    0.00    3.16    0.00    0.00 73.68
> 11:41:13 AM   27   13.83    0.00    6.38    0.00    0.00    1.06    0.00    0.00 78.72
> 11:41:13 AM   28   14.74    0.00    6.32    0.00    0.00    1.05    0.00    0.00 77.89
> 11:41:13 AM   29   16.33    0.00    7.14    0.00    0.00    3.06    0.00    0.00 73.47
> 11:41:13 AM   30   15.62    0.00    5.21    0.00    0.00    6.25    0.00    0.00 72.92
> 11:41:13 AM   31   15.62    0.00    6.25    0.00    0.00    2.08    0.00    0.00 76.04
> 11:41:13 AM   32   17.89    0.00    5.26    0.00    0.00    1.05    0.00    0.00 75.79
> 11:41:13 AM   33   15.79    0.00    6.32    0.00    0.00    1.05    0.00    0.00 76.84
> 11:41:13 AM   34   16.33    0.00    7.14    0.00    0.00    2.04    0.00    0.00 74.49
> 11:41:13 AM   35   14.29    0.00    8.16    0.00    0.00    2.04    0.00    0.00 75.51
> 11:41:13 AM   36   15.46    0.00    8.25    0.00    0.00    1.03    0.00    0.00 75.26
> 11:41:13 AM   37   14.74    0.00    5.26    0.00    0.00    5.26    0.00    0.00 74.74
> 11:41:13 AM   38   13.83    0.00    6.38    0.00    0.00    0.00    0.00    0.00 79.79
> 11:41:13 AM   39   14.58    0.00    6.25    0.00    0.00    1.04    0.00    0.00 78.12

🔗Round 2 - 增大新生代大小以增加 GC 间隔

新生代的 GC 间隔约为 4 s,尝试设置显示的新生代大小为 2G 以增加新生代 GC 的间隔,降低新生代 GC 的频率 。

-server
-Xms7g
-Xmx7g
-Xloggc:/path-to-gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
...
-Xmn2g

新的 GC 日志片段如下:

> 2018-12-19T10:35:00.120+0800: 78304.144: [GC 78304.144: [ParNew: 1749120K->1866K(1922432K), 0.0171070 secs] 3928447K->2181205K(7165312K), 0.0176320 secs] [Times: user=0.06 sys=0.03, real=0.01 secs]
> 2018-12-19T10:35:03.624+0800: 78307.649: [GC 78307.649: [ParNew: 1749578K->2454K(1922432K), 0.0103060 secs] 3928917K->2181797K(7165312K), 0.0110020 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
> 2018-12-19T10:35:07.247+0800: 78311.271: [GC 78311.272: [ParNew: 1750166K->3738K(1922432K), 0.0106380 secs] 3929509K->2183085K(7165312K), 0.0110510 secs] [Times: user=0.04 sys=0.02, real=0.01 secs]
> 2018-12-19T10:35:10.629+0800: 78314.653: [GC 78314.653: [ParNew: 1751450K->15503K(1922432K), 0.0126500 secs] 3930797K->2194859K(7165312K), 0.0130680 secs] [Times: user=0.05 sys=0.03, real=0.01 secs]
> 2018-12-19T10:35:14.304+0800: 78318.329: [GC 78318.329: [ParNew: 1763215K->18989K(1922432K), 0.0145030 secs] 3942571K->2198350K(7165312K), 0.0149920 secs] [Times: user=0.06 sys=0.03, real=0.02 secs]
> 2018-12-19T10:35:17.991+0800: 78322.016: [GC 78322.016: [ParNew: 1766701K->21481K(1922432K), 0.0123160 secs] 3946062K->2200848K(7165312K), 0.0127420 secs] [Times: user=0.06 sys=0.03, real=0.01 secs]
> 2018-12-19T10:35:21.624+0800: 78325.648: [GC 78325.648: [ParNew: 1769193K->19033K(1922432K), 0.0124430 secs] 3948560K->2198741K(7165312K), 0.0128530 secs] [Times: user=0.06 sys=0.03, real=0.01 secs]
> 2018-12-19T10:35:25.366+0800: 78329.390: [GC 78329.390: [ParNew: 1766745K->20487K(1922432K), 0.0138270 secs] 3946453K->2200265K(7165312K), 0.0141630 secs] [Times: user=0.07 sys=0.03, real=0.02 secs]
> 2018-12-19T10:35:28.923+0800: 78332.947: [GC 78332.947: [ParNew: 1768199K->20388K(1922432K), 0.0132530 secs] 3947977K->2201932K(7165312K), 0.0138050 secs] [Times: user=0.06 sys=0.03, real=0.01 secs]
> 2018-12-19T10:17:26.733+0800: 77250.757: [GC [1 CMS-initial-mark: 3670087K(5242880K)] 3672294K(7165312K), 0.0086640 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]
> 2018-12-19T10:17:26.742+0800: 77250.766: [CMS-concurrent-mark-start]
> 2018-12-19T10:17:27.007+0800: 77251.031: [CMS-concurrent-mark: 0.264/0.264 secs] [Times: user=1.08 sys=0.17, real=0.26 secs]
> 2018-12-19T10:17:27.007+0800: 77251.031: [CMS-concurrent-preclean-start]
> 2018-12-19T10:17:27.026+0800: 77251.051: [CMS-concurrent-preclean: 0.020/0.020 secs] [Times: user=0.04 sys=0.04, real=0.02 secs]
> 2018-12-19T10:17:27.027+0800: 77251.051: [CMS-concurrent-abortable-preclean-start]
> 2018-12-19T10:17:28.760+0800: 77252.784: [CMS-concurrent-abortable-preclean: 1.727/1.734 secs] [Times: user=4.54 sys=1.20, real=1.74 secs]
> 2018-12-19T10:17:28.762+0800: 77252.786: [GC[YG occupancy: 884151 K (1922432 K)]77252.786: [Rescan (parallel) , 0.1405090 secs]77252.927: [weak refs processing, 0.5567620 secs] [1 CMS-remark: 3670087K(5242880K)] 4554238K(7165312K), 0.6983880 secs] [Times: user=5.44 sys=0.00, real=0.70 secs]
> 2018-12-19T10:17:29.461+0800: 77253.485: [CMS-concurrent-sweep-start]
> 2018-12-19T10:17:32.255+0800: 77256.279: [CMS-concurrent-sweep: 2.760/2.794 secs] [Times: user=9.06 sys=2.68, real=2.79 secs]
> 2018-12-19T10:17:32.255+0800: 77256.279: [CMS-concurrent-reset-start]
> 2018-12-19T10:17:32.277+0800: 77256.301: [CMS-concurrent-reset: 0.022/0.022 secs] [Times: user=0.04 sys=0.03, real=0.02 secs]

新生代 GC 的暂停时间增大为 12-17 ms,间隔为 5 s,效果不明显。老年代 GC 的暂停时间分别为 8 ms 和 698 ms,remark 阶段耗时较大,存在优化空间;第一次老年代 GC 发生于 21 个小时后。

🔗Round 3 - 降低并发 GC 线程数以减少线程切换损失

对比迁移前后的部署环境,发现一个差异:JVM 读取到的 CPU 核数的限制是硬件核数。在物理机上,硬件核数与 JVM 运行时可用的核数一致;而私有云上 JVM 的可用核数受到 docker 的限制。而 CMS 的算法在 remark 阶段并发收集的线程数是通过 JVM 收集到的 CPU 核心数相关的。Oracle JDK 默认当 CPU 的核数小于等于 8 时,设置线程数等于核数;当 CPU 的核数大于 8 时,设置线程数为 3 + 5 * CPU / 8。这导致迁移后的服务为 JVM 的 GC 并发收集线程数设置了较高的值,可能导致在服务负载较高的情况下过高的线程切换开销。

尝试设置 GC 并发线程数为可用的 CPU 核数,即 -XX:ParallelGCThreads

-server
-Xms7g
-Xmx7g
-Xmn2g
-Xloggc:/path-to-gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
...
-XX:ParallelGCThreads=8

新的 GC 日志片段如下:

> 2018-12-20T09:27:11.128+0800: 77960.749: [GC 77960.749: [ParNew: 1749861K->1877K(1922432K), 0.0097160 secs] 4175798K->2427820K(7165312K), 0.0101650 secs] [Times: user=0.04 sys=0.02, real=0.01 secs]
> 2018-12-20T09:27:14.823+0800: 77964.444: [GC 77964.444: [ParNew: 1749589K->2028K(1922432K), 0.0088190 secs] 4175532K->2427976K(7165312K), 0.0092040 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
> 2018-12-20T09:27:18.638+0800: 77968.259: [GC 77968.260: [ParNew: 1749740K->2260K(1922432K), 0.0103380 secs] 4175688K->2428211K(7165312K), 0.0107590 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
> 2018-12-20T09:27:22.317+0800: 77971.938: [GC 77971.938: [ParNew: 1749972K->2333K(1922432K), 0.0107860 secs] 4175923K->2428289K(7165312K), 0.0114030 secs] [Times: user=0.05 sys=0.02, real=0.01 secs]
> 2018-12-20T09:27:26.037+0800: 77975.658: [GC 77975.658: [ParNew: 1750045K->1839K(1922432K), 0.0098070 secs] 4176001K->2427799K(7165312K), 0.0102390 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
> 2018-12-20T09:27:29.659+0800: 77979.280: [GC 77979.280: [ParNew: 1749551K->1969K(1922432K), 0.0098240 secs] 4175511K->2427931K(7165312K), 0.0102390 secs] [Times: user=0.05 sys=0.02, real=0.01 secs]
> 2018-12-20T09:27:33.313+0800: 77982.934: [GC 77982.935: [ParNew: 1749681K->2092K(1922432K), 0.0098760 secs] 4175643K->2428056K(7165312K), 0.0104370 secs] [Times: user=0.05 sys=0.02, real=0.01 secs]
> 2018-12-20T09:27:36.861+0800: 77986.482: [GC 77986.483: [ParNew: 1749804K->2058K(1922432K), 0.0096300 secs] 4175768K->2428026K(7165312K), 0.0100290 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
> 2018-12-20T09:27:40.544+0800: 77990.165: [GC 77990.166: [ParNew: 1749770K->2046K(1922432K), 0.0094430 secs] 4175738K->2428016K(7165312K), 0.0098560 secs] [Times: user=0.04 sys=0.02, real=0.01 secs]
> 2018-12-20T09:13:58.766+0800: 77168.387: [GC [1 CMS-initial-mark: 3671985K(5242880K)] 3685285K(7165312K), 0.0167850 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]
> 2018-12-20T09:13:58.783+0800: 77168.404: [CMS-concurrent-mark-start]
> 2018-12-20T09:13:59.155+0800: 77168.776: [CMS-concurrent-mark: 0.372/0.372 secs] [Times: user=1.32 sys=0.53, real=0.37 secs]
> 2018-12-20T09:13:59.155+0800: 77168.776: [CMS-concurrent-preclean-start]
> 2018-12-20T09:13:59.176+0800: 77168.797: [CMS-concurrent-preclean: 0.020/0.021 secs] [Times: user=0.09 sys=0.00, real=0.02 secs]
> 2018-12-20T09:13:59.176+0800: 77168.797: [CMS-concurrent-abortable-preclean-start]
> 2018-12-20T09:14:04.259+0800: 77173.880: [CMS-concurrent-abortable-preclean: 5.046/5.083 secs] [Times: user=14.63 sys=4.18, real=5.08 secs]
> 2018-12-20T09:14:04.260+0800: 77173.881: [GC[YG occupancy: 888560 K (1922432 K)]77173.881: [Rescan (parallel) , 0.1166920 secs]77173.998: [weak refs processing, 0.6118720 secs] [1 CMS-remark: 3678145K(5242880K)] 4566706K(7165312K), 0.7296350 secs] [Times: user=5.61 sys=0.10, real=0.73 secs]
> 2018-12-20T09:14:04.990+0800: 77174.611: [CMS-concurrent-sweep-start]
> 2018-12-20T09:14:07.444+0800: 77177.065: [CMS-concurrent-sweep: 2.420/2.454 secs] [Times: user=8.08 sys=2.48, real=2.46 secs]
> 2018-12-20T09:14:07.444+0800: 77177.066: [CMS-concurrent-reset-start]
> 2018-12-20T09:14:07.501+0800: 77177.122: [CMS-concurrent-reset: 0.056/0.056 secs] [Times: user=0.15 sys=0.05, real=0.06 secs]

可以发现,新生代的 GC 的耗时降为 8-10 ms,间隔为 4 s,有一定成效。老年代 GC 的 remark 阶段耗时 729 ms,效果变差。

作为参考,jstat -gcutil 的输出如下:

>   S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
>   1.29   0.00  98.83  46.26  29.10  12222   97.902     2    0.746   98.649
>   0.00   1.38  12.85  46.26  29.10  12223   97.913     2    0.746   98.659
>   0.00   1.38  26.33  46.26  29.10  12223   97.913     2    0.746   98.659
>   0.00   1.38  40.07  46.26  29.10  12223   97.913     2    0.746   98.659
>   0.00   1.38  52.13  46.26  29.10  12223   97.913     2    0.746   98.659
>   0.00   1.38  64.72  46.26  29.10  12223   97.913     2    0.746   98.659
>   0.00   1.38  76.27  46.26  29.10  12223   97.913     2    0.746   98.659
>   0.00   1.38  89.33  46.26  29.10  12223   97.913     2    0.746   98.659
>   1.27   0.00   3.30  46.26  29.10  12224   97.922     2    0.746   98.668
>   1.27   0.00  17.94  46.26  29.10  12224   97.922     2    0.746   98.668
>   1.27   0.00  34.07  46.26  29.10  12224   97.922     2    0.746   98.668
>   1.27   0.00  47.79  46.26  29.10  12224   97.922     2    0.746   98.668
>   1.27   0.00  60.85  46.26  29.10  12224   97.922     2    0.746   98.668
>   1.27   0.00  74.12  46.26  29.10  12224   97.922     2    0.746   98.668
>   1.27   0.00  86.74  46.26  29.10  12224   97.922     2    0.746   98.668
>   0.00   1.33   0.00  46.26  29.10  12225   97.939     2    0.746   98.685
>   0.00   1.33  15.49  46.26  29.10  12225   97.939     2    0.746   98.685
>   0.00   1.33  28.17  46.26  29.10  12225   97.939     2    0.746   98.685
>   0.00   1.33  41.54  46.26  29.10  12225   97.939     2    0.746   98.685

🔗Round 4 - full heap dump & analyze

分析 remark 阶段的耗时,发现主要是 weak refs processing 过程耗时较大(612 ms)。参考 PhantomReference导致CMS GC耗时严重 里的经验,推测可能是类似的问题:由于 MySQL com.mysql.jdbc.NonRegisteringDriver 类内部通过静态变量 connectionPhantomRefs 维护了一个 com.mysql.jdbc.NonRegisteringDriver.ConnectionPhantomReference 类型(java.lang.ref.PhantomReference 的子类型)的哈希表,用于 JDBC 连接的最终释放;该哈希表中元素在连接主动关闭时并不会被清理,而是通过 java.lang.ref.PhantomReference 的清理机制来实现连接的清理和释放。猜想这样设计是为了在用户没有主动关闭连接(未调用 com.mysql.jdbc.Connection#close)时,还能够清理连接。

标准库的 java.lang.ref.PhantomReference 被设计来进行资源的清理和关闭:当 JVM 检测到某对象只被虚引用 java.lang.ref.PhantomReference 而没有被别的更强引用(标准强引用、软引用 java.lang.ref.SoftReference 或弱引用 java.lang.ref.SoftReference)所指向后,会在合适时机将对应的虚引用放入一个阻塞队列。MySQL 的驱动类 com.mysql.jdbc.Drivercom.mysql.jdbc.NonRegisteringDriver 类的子类,在类初始化的时候会启动一个清理线程清理上述的阻塞队列,实现连接的清理和释放的逻辑。

以上是理论。为了验证是否是由于连接为清理的原因导致的weak refs processing 过程耗时较大,对线上服务进行了一次 full heap dump (jmap),并使用 Eclipse Memory Analyzer 对 dump 的结果进行可视化分析。

mat retained pie

通过 retained size 分析可以看出,类 com.mysql.jdbc.NonRegisteringDriver 导致了 812 M 的内存对象不能及时释放。

mat dominater tree

通过对象依赖分析可以看出,类 com.mysql.jdbc.NonRegisteringDriver 类导致大量 retained size 的罪魁祸首就是上文分析的哈希表。

🔗Round 5 - clean connection reference map

根据以上分析,尝试在服务中启动一个线程定时清理哈希表中的 connection reference。

新的 GC 日志片段如下:

> 2019-01-17T16:32:46.296+0800: 6.446: [GC 6.446: [ParNew
> Desired survivor size 57671680 bytes, new threshold 6 (max 6)
> - age   1:    2994640 bytes,    2994640 total
> : 2027520K->2977K(2140160K), 0.0070780 secs] 2027520K->2977K(7227392K), 0.0071630 secs] [Times: user=0.03 sys=0.01, real=0.01 secs]
> 2019-01-17T16:32:56.565+0800: 16.715: [GC 16.715: [ParNew
> Desired survivor size 57671680 bytes, new threshold 6 (max 6)
> - age   1:   44670080 bytes,   44670080 total
> - age   2:    1607568 bytes,   46277648 total
> : 2030497K->46695K(2140160K), 0.0387120 secs] 2030497K->46695K(7227392K), 0.0389860 secs] [Times: user=0.33 sys=0.01, real=0.04 secs]
> 2019-01-17T16:33:05.557+0800: 25.707: [GC 25.707: [ParNew
> Desired survivor size 57671680 bytes, new threshold 2 (max 6)
> ...
> 2019-01-19T05:31:17.469+0800: 133117.619: [GC 133117.619: [ParNew
> Desired survivor size 57671680 bytes, new threshold 6 (max 6)
> - age   1:    1273280 bytes,    1273280 total
> - age   2:      40864 bytes,    1314144 total
> - age   3:     377040 bytes,    1691184 total
> - age   4:      17336 bytes,    1708520 total
> - age   5:     371152 bytes,    2079672 total
> - age   6:       3424 bytes,    2083096 total
> : 2045916K->12232K(2140160K), 0.0256050 secs] 5857193K->3833867K(7227392K), 0.0259730 secs] [Times: user=0.11 sys=0.06, real=0.02 secs]
> 2019-01-19T05:31:17.496+0800: 133117.646: [GC [1 CMS-initial-mark: 3821635K(5087232K)] 3834101K(7227392K), 0.0109000 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
> 2019-01-19T05:31:17.507+0800: 133117.657: [CMS-concurrent-mark-start]
> 2019-01-19T05:31:17.767+0800: 133117.917: [CMS-concurrent-mark: 0.260/0.260 secs] [Times: user=0.60 sys=0.03, real=0.25 secs]
> 2019-01-19T05:31:17.767+0800: 133117.917: [CMS-concurrent-preclean-start]
> 2019-01-19T05:31:17.782+0800: 133117.932: [CMS-concurrent-preclean: 0.015/0.015 secs] [Times: user=0.01 sys=0.01, real=0.02 secs]
> 2019-01-19T05:31:17.782+0800: 133117.932: [CMS-concurrent-abortable-preclean-start]
> 2019-01-19T05:31:31.055+0800: 133131.205: [CMS-concurrent-abortable-preclean: 12.150/13.273 secs] [Times: user=15.56 sys=1.31, real=13.27 secs]
> 2019-01-19T05:31:31.056+0800: 133131.206: [GC[YG occupancy: 1027100 K (2140160 K)]133131.206: [Rescan (parallel) , 0.1175020 secs]133131.324: [weak refs processing, 0.0134280 secs] [1 CMS-remark: 3821635K(5087232K)] 4848735K(7227392K), 0.1313880 secs] [Times: user=0.95 sys=0.04, real=0.14 secs]
> 2019-01-19T05:31:31.188+0800: 133131.338: [CMS-concurrent-sweep-start]
> 2019-01-19T05:31:36.925+0800: 133137.075: [CMS-concurrent-sweep: 5.737/5.737 secs] [Times: user=7.36 sys=0.46, real=5.73 secs]
> 2019-01-19T05:31:36.925+0800: 133137.075: [CMS-concurrent-reset-start]
> 2019-01-19T05:31:36.974+0800: 133137.124: [CMS-concurrent-reset: 0.048/0.048 secs] [Times: user=0.00 sys=0.06, real=0.05 secs]
> 2019-01-19T05:31:46.216+0800: 133146.366: [GC 133146.366: [ParNew
> Desired survivor size 57671680 bytes, new threshold 6 (max 6)
> - age   1:     924424 bytes,     924424 total
> - age   2:     397824 bytes,    1322248 total
> - age   3:      19208 bytes,    1341456 total
> - age   4:     372896 bytes,    1714352 total
> - age   5:      15520 bytes,    1729872 total
> - age   6:     370360 bytes,    2100232 total
> : 2039752K->3869K(2140160K), 0.0093030 secs] 2177219K->141338K(7227392K), 0.0096620 secs] [Times: user=0.04 sys=0.02, real=0.01 secs]
> ...
> 2019-01-20T17:38:59.007+0800: 263179.157: [GC 263179.157: [ParNew
> Desired survivor size 57671680 bytes, new threshold 6 (max 6)
> - age   1:    1025760 bytes,    1025760 total
> - age   2:      22256 bytes,    1048016 total
> - age   3:      17976 bytes,    1065992 total
> - age   4:      12624 bytes,    1078616 total
> - age   5:      17472 bytes,    1096088 total
> - age   6:      41664 bytes,    1137752 total
> : 2033674K->2251K(2140160K), 0.0103870 secs] 5847925K->3818686K(7227392K), 0.0108330 secs] [Times: user=0.04 sys=0.03, real=0.01 secs]
> 2019-01-20T17:38:59.020+0800: 263179.169: [GC [1 CMS-initial-mark: 3816434K(5087232K)] 3819233K(7227392K), 0.0087940 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
> 2019-01-20T17:38:59.029+0800: 263179.178: [CMS-concurrent-mark-start]
> 2019-01-20T17:38:59.265+0800: 263179.414: [CMS-concurrent-mark: 0.236/0.236 secs] [Times: user=0.45 sys=0.32, real=0.23 secs]
> 2019-01-20T17:38:59.265+0800: 263179.414: [CMS-concurrent-preclean-start]
> 2019-01-20T17:38:59.280+0800: 263179.430: [CMS-concurrent-preclean: 0.015/0.015 secs] [Times: user=0.03 sys=0.00, real=0.02 secs]
> 2019-01-20T17:38:59.280+0800: 263179.430: [CMS-concurrent-abortable-preclean-start]
> 2019-01-20T17:39:03.988+0800: 263184.138: [CMS-concurrent-abortable-preclean: 4.692/4.708 secs] [Times: user=8.33 sys=1.33, real=4.71 secs]
> 2019-01-20T17:39:03.990+0800: 263184.140: [GC[YG occupancy: 1017092 K (2140160 K)]263184.140: [Rescan (parallel) , 0.1327010 secs]263184.273: [weak refs processing, 0.0096950 secs] [1 CMS-remark: 3816434K(5087232K)] 4833527K(7227392K), 0.1428740 secs] [Times: user=1.12 sys=0.01, real=0.14 secs]
> 2019-01-20T17:39:04.133+0800: 263184.283: [CMS-concurrent-sweep-start]
> 2019-01-20T17:39:08.528+0800: 263188.678: [GC 263188.678: [ParNew
> Desired survivor size 57671680 bytes, new threshold 6 (max 6)
> - age   1:    1415120 bytes,    1415120 total
> - age   2:      36072 bytes,    1451192 total
> - age   3:      15880 bytes,    1467072 total
> - age   4:      14776 bytes,    1481848 total
> - age   5:      10656 bytes,    1492504 total
> - age   6:      17016 bytes,    1509520 total
> : 2029771K->2026K(2140160K), 0.0095960 secs] 3152679K->1124976K(7227392K), 0.0100460 secs] [Times: user=0.04 sys=0.02, real=0.01 secs]
> 2019-01-20T17:39:10.456+0800: 263190.605: [CMS-concurrent-sweep: 6.310/6.322 secs] [Times: user=11.80 sys=1.90, real=6.32 secs]
> 2019-01-20T17:39:10.456+0800: 263190.605: [CMS-concurrent-reset-start]
> 2019-01-20T17:39:10.468+0800: 263190.618: [CMS-concurrent-reset: 0.013/0.013 secs] [Times: user=0.01 sys=0.02, real=0.02 secs]
> 2019-01-20T17:39:17.954+0800: 263198.103: [GC 263198.104: [ParNew
> Desired survivor size 57671680 bytes, new threshold 6 (max 6)
> - age   1:    1087528 bytes,    1087528 total
> - age   2:     402256 bytes,    1489784 total
> - age   3:      30104 bytes,    1519888 total
> - age   4:      13072 bytes,    1532960 total
> - age   5:      12616 bytes,    1545576 total
> - age   6:      10096 bytes,    1555672 total
> : 2029546K->2265K(2140160K), 0.0082040 secs] 2169712K->142448K(7227392K), 0.0086450 secs] [Times: user=0.04 sys=0.02, real=0.01 secs]

新生代 GC 的暂停时间减小为 8 ms,间隔为 8 s,效果没有明显改进。

第一次老年代 GC 发生于服务器动的 37 个小时后,暂停时间分别为 11 ms 和 131 ms;第二次老年代 GC 发生于服务器动的 73 个小时后,暂停时间分别为 9 ms 和 143 ms;两次的时间间隔是 35 小时。效果明显。

以下是随机挑选的 4 个私有云实例上服务的 jstat 输出:

  1. host 1
> $ jstat -gcutil 843 1250 15
>   S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
>   0.00   1.91  79.02  49.72  37.11  29647  279.684     4    0.309  279.993
>   0.00   1.91  91.99  49.72  37.11  29647  279.684     4    0.309  279.993
>   1.78   0.00   7.37  49.72  37.11  29648  279.693     4    0.309  280.003
>   1.78   0.00  21.18  49.72  37.11  29648  279.693     4    0.309  280.003
>   1.78   0.00  34.66  49.72  37.11  29648  279.693     4    0.309  280.003
>   1.78   0.00  47.58  49.72  37.11  29648  279.693     4    0.309  280.003
>   1.78   0.00  60.44  49.72  37.11  29648  279.693     4    0.309  280.003
>   1.78   0.00  74.01  49.72  37.11  29648  279.693     4    0.309  280.003
>   1.78   0.00  85.44  49.72  37.11  29648  279.693     4    0.309  280.003
>   1.78   0.00  98.60  49.72  37.11  29648  279.693     4    0.309  280.003
>   0.00   1.88  12.41  49.72  37.11  29649  279.703     4    0.309  280.012
>   0.00   1.88  24.82  49.72  37.11  29649  279.703     4    0.309  280.012
>   0.00   1.88  37.35  49.72  37.11  29649  279.703     4    0.309  280.012
>   0.00   1.88  50.44  49.72  37.11  29649  279.703     4    0.309  280.012
>   0.00   1.88  62.64  49.72  37.11  29649  279.703     4    0.309  280.012
  1. host 2
> $ jstat -gcutil 826 1250 15
>   S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
>   0.00   1.89  47.85  37.82  37.09  28523  263.721     4    0.260  263.980
>   0.00   1.89  61.32  37.82  37.09  28523  263.721     4    0.260  263.980
>   0.00   1.89  74.97  37.82  37.09  28523  263.721     4    0.260  263.980
>   0.00   1.89  87.87  37.82  37.09  28523  263.721     4    0.260  263.980
>   1.78   0.00   3.42  37.82  37.09  28524  263.730     4    0.260  263.990
>   1.78   0.00  15.14  37.82  37.09  28524  263.730     4    0.260  263.990
>   1.78   0.00  27.75  37.82  37.09  28524  263.730     4    0.260  263.990
>   1.78   0.00  39.36  37.82  37.09  28524  263.730     4    0.260  263.990
>   1.78   0.00  51.27  37.82  37.09  28524  263.730     4    0.260  263.990
>   1.78   0.00  63.74  37.82  37.09  28524  263.730     4    0.260  263.990
>   1.78   0.00  76.18  37.82  37.09  28524  263.730     4    0.260  263.990
>   1.78   0.00  88.50  37.82  37.09  28524  263.730     4    0.260  263.990
>   0.00   1.61   1.15  37.82  37.09  28525  263.739     4    0.260  263.999
>   0.00   1.61  13.71  37.82  37.09  28525  263.739     4    0.260  263.999
>   0.00   1.61  27.08  37.82  37.09  28525  263.739     4    0.260  263.999
  1. host 3
> $ jstat -gcutil 858 1250 15
>   S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
>   0.00   1.58  50.50  43.42  37.10  28959  289.524     4    0.317  289.841
>   0.00   1.58  63.98  43.42  37.10  28959  289.524     4    0.317  289.841
>   0.00   1.58  75.25  43.42  37.10  28959  289.524     4    0.317  289.841
>   0.00   1.58  88.07  43.42  37.10  28959  289.524     4    0.317  289.841
>   1.86   0.00   0.80  43.42  37.10  28960  289.535     4    0.317  289.852
>   1.86   0.00  15.12  43.42  37.10  28960  289.535     4    0.317  289.852
>   1.86   0.00  27.24  43.42  37.10  28960  289.535     4    0.317  289.852
>   1.86   0.00  40.93  43.42  37.10  28960  289.535     4    0.317  289.852
>   1.86   0.00  54.40  43.42  37.10  28960  289.535     4    0.317  289.852
>   1.86   0.00  67.59  43.42  37.10  28960  289.535     4    0.317  289.852
>   1.86   0.00  79.63  43.42  37.10  28960  289.535     4    0.317  289.852
>   1.86   0.00  92.58  43.42  37.10  28960  289.535     4    0.317  289.852
>   0.00   1.72   6.57  43.42  37.10  28961  289.545     4    0.317  289.862
>   0.00   1.72  19.05  43.42  37.10  28961  289.545     4    0.317  289.862
>   0.00   1.72  31.74  43.42  37.10  28961  289.545     4    0.317  289.862
  1. host 4
> $ jstat -gcutil 798 1250 15
>   S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
>   1.61   0.00  18.65  44.68  37.09  29190  258.263     4    0.261  258.524
>   1.61   0.00  31.99  44.68  37.09  29190  258.263     4    0.261  258.524
>   1.61   0.00  45.03  44.68  37.09  29190  258.263     4    0.261  258.524
>   1.61   0.00  58.15  44.68  37.09  29190  258.263     4    0.261  258.524
>   1.61   0.00  69.67  44.68  37.09  29190  258.263     4    0.261  258.524
>   1.61   0.00  83.00  44.68  37.09  29190  258.263     4    0.261  258.524
>   1.61   0.00  95.56  44.68  37.09  29190  258.263     4    0.261  258.524
>   0.00   1.61   9.47  44.68  37.09  29191  258.273     4    0.261  258.534
>   0.00   1.61  23.27  44.68  37.09  29191  258.273     4    0.261  258.534
>   0.00   1.61  37.58  44.68  37.09  29191  258.273     4    0.261  258.534
>   0.00   1.61  50.82  44.68  37.09  29191  258.273     4    0.261  258.534
>   0.00   1.61  64.65  44.68  37.09  29191  258.273     4    0.261  258.534
>   0.00   1.61  76.70  44.68  37.09  29191  258.273     4    0.261  258.534
>   0.00   1.61  89.35  44.68  37.09  29191  258.273     4    0.261  258.534
>   1.67   0.00   3.33  44.68  37.09  29192  258.282     4    0.261  258.543

以下是使用 GCeasy - Universal GC Log Analyzer 对 GC 日志的可视化分析结构,作为参考。

清理连接之前:

gceasy for pre-clean

清理连接之后:

gceasy for post-clean

🔗Update

以上的所有结果和分析都是在服务启动之后1-2 天的结果;作为对比,下面是服务运行半个月之后的结果:

gceasy for a month later

🔗补充说明

🔗连接清理

由于服务使用的 Tomcat DBCP 连接池组件,会托管所使用的 JDBC 连接,所以不需要额外的 MySQL 自带的清理机制。

🔗连接数

发现服务运行过程中,哈希表中最多会出现 30000+ 的连接的虚引用。

当前服务所使用的数据库中间件通过 Tomcat DBCP 连接池管理超过 5 个的 MySQL 数据库实例,每个 MySQL 实例 3 个连接池(一主两备)。

服务中使用的连接池的配置是:

<initialSize>16</initialSize>
<maxActive>16</maxActive>
<maxWait>4000</maxWait>
<minIdle>16</minIdle>
<maxIdle>16</maxIdle>

应该不会出现如此大量的连接。而且在当前的服务访问量下,也应该没有空闲连接以致出现连接被自动释放的情况。

此问题可以进一步排查。

🔗升级 JDK 版本

其实如果可以,真心觉得可以优先升级 JDK 到 1.8。但是由于该服务是老服务,而且调用量巨大,升级风险太大;而且 JDBC 接口跨 JDK 版本存在兼容性问题,服务所使用的分库分表中间件没有做兼容性测试和保证。所以没有办法升级 JDK 版本。

🔗总结

在通过调整 GC 参数和代码逻辑之后,服务由于 GC 暂停导致的暂停时间大大减少。

Young GC Pause (ms)Young GC Interval (ms)CMS Initial Pause (ms)CMS Remark Pause (ms)Old GC Interval (h)
Pre11.6510.658613
Post9.14710.014035

由于新生代的 GC 是复制算法,耗时主要在于存活对象的大小,所以调优对新生代的效果不明显;老年代的 GC 是 CMS,通过调整并发收集线程数和修改代码逻辑定时清理 PhantomReference,发现取得了较好的效果。

调优之后的服务,基本上不出现或很少出现 Cache 访问超时,服务的平均响应时间也接近甚至好于迁移之前的表现。