Common MySQL Replication Issues – WWW.TRIJULIAN.WEB.ID

Introduction

In this blog we are going to cover some of the most frequent replication issues that might happen to any actively being used databases. If they are not addressed properly it could really take your standby slaves nodes to an inconsistent and which could land you easily with single point of failure if your slaves are not in consistent state with master. Having a consistent slave is more important that having N number of slaves. We will go over couple of errors in this post.

Error 1032 – Missing Records

Failed Replication Status


               Slave_IO_State: Waiting for master to send event
                  Master_Host: genexdemo-master.com
                  Master_User: replication_user
                  Master_Port: 3306
              Master_Log_File: mysql-bin.011788
          Read_Master_Log_Pos: 440461708
                Relay_Log_Pos: 401162462
        Relay_Master_Log_File: mysql-bin.011787
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 8 failed executing transaction '542d20a8-51ee-11e7-93e0-0050569a05da:7085897700' at master log mysql-bin.011787, end_log_pos 403406130. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
          Exec_Master_Log_Pos: 401162249
        Seconds_Behind_Master: NULL
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1032
               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 8 failed executing transaction '542d20a8-51ee-11e7-93e0-0050569a05da:7085897700' at master log mysql-bin.011787, end_log_pos 403406130. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                    SQL_Delay: 0
      Slave_SQL_Running_State: 
                 Channel_Name:

Issue Analysis

MySQL Error Code – 1032

Error Message

Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 8 failed executing transaction ‘542d20a8-51ee-11e7-93e0-0050569a05da:7085897700’ at master log mysql-bin.011787, end_log_pos 403406130. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

Problem Statement

Data missing in the slaves

Cause of Problem

Someone might have accidentally deleted records directly from slave instead of master and then later when the same transactions comes through replication from the master the slave ends up in such failed state. So precisely transactions from master on records that does not exists in slaves would get into a failed state as above.

Short-term Solution

Refresh the inconsistent tables using xtrabackup/meb/mysqldump backups from master

Permanent Solution

Identify the transactions causing discrepancies on slaves and fix those transactions and make sure it flow properly through replication. Mark you slaves super-read-only so it can block any accidental direct writes from all users.

Error 1062 – Duplicate Records

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: genexdemo-master.dom
                  Master_User: replication_user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.002011
          Read_Master_Log_Pos: 266143481
               Relay_Log_File: mysql-relay-bin.000028
                Relay_Log_Pos: 264952719
        Relay_Master_Log_File: mysql-bin.000011
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
                   Last_Errno: 1062
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '52347a8b-fbc3-11e7-afa2-0050568181e9:122276511' at master log mysql-bin.000011, end_log_pos 264955815. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 264955505
              Relay_Log_Space: 266143517
              Until_Condition: None
               Last_SQL_Errno: 1062
               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '52347a8b-fbc3-11e7-afa2-0050568181e9:122276511' at master log mysql-bin.000011, end_log_pos 264955815. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
1 row in set (0.00 sec)

Issue Analysis

MySQL Error Code – 1062

Issue

Slave fails to replicate because of duplicate entry for the primary of the table

Error Message

A Coordinator is stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction ‘52347a8b-fbc3-11e7-afa2-0050568181e9:122276511’ at master log mysql-bin.000011, end_log_pos 264955815. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

Reason

As I already mentioned for error code 1032 that we would be refreshing the tables that are not in sync. Each instances are quite huge and we take online backups from the master.

Even though we stop replication in the slave before starting the backup from master. So the replication co-ordinates of the backup would be greater than the coordinates of the stopped slave.

The state of the single table imported is greater than the position of the replication stopped in the slave. So till the replication comes to the state of the table being refreshed we would be getting duplicate entry errors for the data which is already there in that table.

Quick Resolution

It’s absolutely fine to Skip duplicate errors on a table with the error code 1062

Option 1:

stop slave; set global sql_slave_skip_counter=1; start slave;

Option 2:

You can even delete the duplicate records in the slave if you are sure the data in slave are corrupt so deleting those records from slave and letting replication to backfill the data deleted is also a possible option

Permanent Resolution

Perform a consistent check between master and slave. Record the tables not in consistent state. Take a consistent backup from master or any other live slave whichever is possible , load that backup from master to broken slave and resume replication.

Conclusion

There could be multiple ways to address the replication issues but in an ideal world replication is the most reliable way of having a standby server. However we have to accept the fact that we cannot build systems which would never fail and what gives us the edge over the failures is how well you are prepared or equipped to handle it.

Do we know all the possible failures and the cause of such failures in your system ? Do you have a monitoring that can help you detect the failure points ? Do you have a support that can pull you off from such failures efficiently. @genexdb we help our customers building solutions and processes needed for their environment to handle such scenarios.

Please reach out to us @ support@genexdbs.com if you need help with improving and managing your database systems. We support a wide range of database and data streaming softwares visit us genexdbs.com to learn more about us.

source : https://genexdbs.com/common-mysql-replication-issues/

Posted on: May 23, 2022, by : Julian's | 421 views