在接收到某境外业务项目同步国内分析库实例的MySQL中转从库磁盘大于80%的告警,ECS主机数据盘扩容过程中,偶然发现从库sql_thread线程状态中断。其中查阅主库的error log无异常,本想通过mysqlbinlog --base64-output=decode-rows -vv(verify-binlog-checksum) 是否可以正常解析binlog文件,以验证主库的binlog是否损坏,发现主库binlog已经过期删除。此外,在处理从库的时候重启从库MySQL实例,因为relay_log_recovery=on设置导致从库的relay-log也因重启实例被删除了,当前就只能寄希望于从库实例上的蛛丝马迹看是否能跟踪定位发现问题,从库实例主机日志也被清理了,其中MySQL错误日志如下:
##从库MySQL错误日志如下:
Status information:
Current dir: /data/3306/data/
Running threads: 12 Stack size: 262144
Current locks:
lock: 0x7c673b0:
lock: 0x7bf5a60:
lock: 0x7bd8770:
lock: 0x7be74b0:
lock: 0x7bc6350:
lock: 0x7bbf920:
lock: 0x7bb7c10:
lock: 0x2081480:
lock: 0x20813a0:
lock: 0x20812c0:
......
Events status:
LLA = Last Locked At LUA = Last Unlocked At
WOC = Waiting On Condition DL = Data Locked
Event scheduler status:
State : INITIALIZED
Thread id : 0
LLA : n/a:0
LUA : n/a:0
WOC : NO
Workers : 0
Executed : 0
Data locked: NO
Event queue status:
Element count : 0
Data locked : NO
Attempting lock : NO
LLA : init_queue:103
LUA : init_queue:111
WOC : NO
Next activation : never
2022-09-01T11:41:22.841046+08:00 7 [ERROR] Error in Log_event::read_log_event(): 'Found invalid event in binary log', data_len: 59, event_type: 101
2022-09-01T11:41:22.841075+08:00 7 [ERROR] Error reading relay log event for channel '': slave SQL thread aborted because of I/O error
2022-09-01T11:41:22.841091+08:00 7 [ERROR] Slave SQL for channel '': Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog
' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or sl
ave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: 1594
2022-09-01T11:41:22.843607+08:00 7 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000242' position 314110116.
2022-09-01T22:09:25.948621+08:00 2 [Note] Aborted connection 2 to db: 'unconnected' user: 'dm_cnl' host: '10.10.80.99' (Got timeout reading communication packets)
从以上从库实例MySQL错误日志,大概能猜测可能已经跨境网络导致从库实例接收到的Relay-log传输过程中错误导致日志文件损坏,进而导致MySQL从库实例在回放binlog时报错,即sql_thread_running=no,但在定位和解决问题的几点教训总结如下:
- 在定位和解决问题之前,"保护现场"!即不管是数据还是日志都需要足够警惕,事先备份
- 重要数据保护措施/参数等:
- relay_log_recovery
- expire_logs_days
- /var/log/messags
- /data/3306/log/error.log
- 任何时候,任何服务都应及时配置/创建服务的重要监控告警项,以及时发现和解决问题
- 解决relay-log文件损坏的重要步骤
- 定位是主库binlog还是从库relaylog损坏,通过mysqlbinlog看是否正常解析日志文件
- 查阅主从操作系统日志和MySQL错误日志信息,/var/log/messages error.log
- 如master节点的binlog还保留着,从节点执行reset slave,change master to重新同步
- 如master节点的binlog已过期删除,重新备份主库数据并重做主从数据同步
- MySQL对于binlog和relaylog文件中event事件有效性校验参数
参数名称 |
参数解释 |
binlog_checksum |
binlog完整性校验参数,默认CRC32,如为none,检查binlog中event的长度方式来校验 |
master_verify_checksum |
从Master中获取binlog会去校验binlog的事件完整性,两种情况会发生,其一:show binlog events,其二:binlog dump向slave中binlog |
slave_sql_verify_checksum |
从库sql_thread回放relaylog获取event时校验 |