System Optimization - I/O - Case Study: An I/O Problem Caused by Redis

Test plan

Container setup

docker run --name=redis -itd -p 10000:80 feisky/redis-server
docker run --name=app --network=container:redis -itd feisky/redis-app

docker ps # confirm that both containers are up
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e8aa78bb2a42 feisky/redis-app "python /app.py" 15 minutes ago Up 15 minutes app
5567d2ba22c7 feisky/redis-server "docker-entrypoint..." 21 minutes ago Up 21 minutes 6379/tcp, 0.0.0.0:10000->80/tcp redis

Data preparation

curl http://127.0.0.1:10000/init/5000
# returns {"elapsed_seconds":23.771671295166016,"keys_initialized":5000}
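The app's source isn't included here, but from the strace output later in this post (GET on uuid:<uuid> keys answered with "good" or "bad") we can infer roughly what /init/5000 stores. A pure-Python sketch of that key layout, with no Redis involved (an inference, not the real app code):

```python
import random
import uuid

def make_dataset(n):
    """Mimic what /init/<n> appears to store: n keys named
    uuid:<uuid4>, each holding the string "good" or "bad"."""
    return {f"uuid:{uuid.uuid4()}": random.choice(["good", "bad"])
            for _ in range(n)}

data = make_dataset(5000)
print(len(data))          # 5000
print(next(iter(data)))   # e.g. uuid:510df338-...
```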

Test

while true; do curl http://192.168.0.10:10000/get_cache; done

Observations

The interface takes about 7 s to respond. top shows cpu0 at 11.1% us and 35.6% wa, so our first guess is an I/O problem; the two processes using the most CPU are python and redis-server.

top - 12:37:23 up 4 days, 13:13,  2 users,  load average: 0.51, 0.25, 0.20
Tasks: 102 total, 2 running, 100 sleeping, 0 stopped, 0 zombie
%Cpu0 : 11.1 us, 8.8 sy, 0.0 ni, 40.6 id, 35.6 wa, 0.0 hi, 1.1 si, 2.7 st
KiB Mem : 1014884 total, 106288 free, 229516 used, 679080 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 524716 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2335 root 20 0 192992 23796 5148 S 17.6 2.3 0:06.97 python
394 100 20 0 28340 3020 1128 R 8.6 0.3 0:06.05 redis-server

We use iostat -d -x 1 to narrow down the I/O problem, and the result is surprising: a query-only endpoint is producing a large number of writes.

iostat -d -x 1
Linux 3.10.0-957.27.2.el7.x86_64 (10-254-192-16) 05/16/2020 _x86_64_ (1 CPU)

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 0.26 0.53 5.36 105.01 107.24 72.00 0.13 35.83 57.35 33.69 0.75 0.44

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 0.00 1.08 504.30 8.60 1259.68 5.02 0.43 1.62 54.00 1.51 0.74 37.53

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 0.00 0.00 576.14 0.00 1459.09 5.07 0.36 1.41 0.00 1.41 0.63 36.36
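The write pattern itself is a clue: in the second sample above, dividing write throughput by write IOPS gives the average request size, which comes out to roughly 2.5 KB per write. Many tiny writes like this are typical of flushing after every operation rather than batching.

```python
# Average write size from the second iostat sample above.
w_per_sec = 504.30      # w/s: write requests per second
wkb_per_sec = 1259.68   # wkB/s: kilobytes written per second

avg_kb_per_write = wkb_per_sec / w_per_sec
print(f"{avg_kb_per_write:.2f} KB per write")  # 2.50 KB: lots of small writes
```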

Using pidstat -d, we find that it is redis-server doing the writes. We then trace it with strace, which shows large numbers of write and fdatasync calls. It turns out our redis-server has its AOF appendfsync policy set to always, so every command triggers a write followed by an fsync, generating a large volume of disk writes. Changing the policy to everysec solves most of the problem. Why only most? Looking at the traced commands, there are still large numbers of SADD calls; these writes are unnecessary on a query path and can be eliminated by fixing the application code.
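The fsync behavior is controlled by the appendfsync directive in redis.conf; it can also be changed at runtime with redis-cli CONFIG SET appendfsync everysec. A minimal fragment:

```conf
appendonly yes         # AOF persistence enabled
# appendfsync always   -> fsync after every write command (what we found; safest, slowest)
appendfsync everysec   # fsync once per second (the fix applied here)
# appendfsync no       -> let the OS decide when to flush
```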

strace -f -T -tt -p 394
[pid 394] 12:49:57.864317 fdatasync(7) = 0 <0.003447>
[pid 394] 12:49:57.867921 write(8, ":1\r\n", 4) = 4 <0.000285>
[pid 394] 12:49:57.868293 epoll_pwait(5, [{EPOLLIN, {u32=8, u64=8}}], 10128, 55, NULL, 8) = 1 <0.000078>
[pid 394] 12:49:57.868481 read(8, "*2\r\n$3\r\nGET\r\n$41\r\nuuid:510df338-"..., 16384) = 61 <0.000040>
[pid 394] 12:49:57.868609 read(3, 0x7fffaef0dce7, 1) = -1 EAGAIN (Resource temporarily unavailable) <0.000047>
[pid 394] 12:49:57.868754 write(8, "$3\r\nbad\r\n", 9) = 9 <0.000249>
[pid 394] 12:49:57.869108 epoll_pwait(5, [{EPOLLIN, {u32=8, u64=8}}], 10128, 54, NULL, 8) = 1 <0.000037>
[pid 394] 12:49:57.869226 read(8, "*2\r\n$3\r\nGET\r\n$41\r\nuuid:52aa4944-"..., 16384) = 61 <0.000062>
[pid 394] 12:49:57.869403 read(3, 0x7fffaef0dce7, 1) = -1 EAGAIN (Resource temporarily unavailable) <0.000036>
[pid 394] 12:49:57.869505 write(8, "$3\r\nbad\r\n", 9) = 9 <0.000205>
[pid 394] 12:49:57.869833 epoll_pwait(5, [{EPOLLIN, {u32=8, u64=8}}], 10128, 54, NULL, 8) = 1 <0.000038>
[pid 394] 12:49:57.869939 read(8, "*4\r\n$4\r\nSCAN\r\n$4\r\n1590\r\n$5\r\nMATC"..., 16384) = 47 <0.000038>
[pid 394] 12:49:57.870071 read(3, 0x7fffaef0dce7, 1) = -1 EAGAIN (Resource temporarily unavailable) <0.000036>
[pid 394] 12:49:57.870172 write(8, "*2\r\n$4\r\n5942\r\n*10\r\n$41\r\nuuid:4bd"..., 499) = 499 <0.000352>
[pid 394] 12:49:57.870600 epoll_pwait(5, [{EPOLLIN, {u32=8, u64=8}}], 10128, 53, NULL, 8) = 1 <0.000037>
[pid 394] 12:49:57.870753 read(8, "*2\r\n$3\r\nGET\r\n$41\r\nuuid:4bdfcf62-"..., 16384) = 61 <0.000057>
[pid 394] 12:49:57.870887 read(3, 0x7fffaef0dce7, 1) = -1 EAGAIN (Resource temporarily unavailable) <0.000036>
[pid 394] 12:49:57.870987 write(8, "$4\r\ngood\r\n", 10) = 10 <0.000196>
[pid 394] 12:49:57.871297 epoll_pwait(5, [{EPOLLIN, {u32=8, u64=8}}], 10128, 52, NULL, 8) = 1 <0.000105>

After changing the AOF policy and removing the unnecessary SADD calls from the code, the interface response time dropped to 0.16034674644470215 ms.
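The application code isn't shown in this write-up, so what follows is only a sketch of the pattern behind the stray SADD writes and of the fix; all names here (get_cache_*, the "data" set, the FakeRedis stub) are assumptions, not the real app. The idea: stop writing matched keys back into Redis on every query and accumulate them in a local variable instead.

```python
class FakeRedis:
    """Minimal in-memory stand-in so the sketch runs without a server."""
    def __init__(self, data):
        self.data = dict(data)
        self.writes = 0
    def get(self, key):
        return self.data.get(key)
    def sadd(self, set_key, member):
        self.writes += 1  # every SADD is a write (and, with appendfsync
        self.data.setdefault(set_key, set()).add(member)  # always, an fsync)

def get_cache_before(client, keys):
    """Original pattern: every match is written back to Redis via SADD,
    so a pure query generates AOF writes."""
    matched = []
    for key in keys:
        if client.get(key) == "good":
            client.sadd("data", key)   # unnecessary write on a read path
            matched.append(key)
    return matched

def get_cache_after(client, keys):
    """Fixed pattern: matches are accumulated locally; Redis sees only reads."""
    return [key for key in keys if client.get(key) == "good"]

client = FakeRedis({"uuid:a": "good", "uuid:b": "bad", "uuid:c": "good"})
print(get_cache_after(client, ["uuid:a", "uuid:b", "uuid:c"]))  # ['uuid:a', 'uuid:c']
print(client.writes)  # 0 -- the query no longer touches the AOF
```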