Bug 585798

Summary:	dev-db/mysql-5.6.28 - unresponsive MySQL after few days of running
Product:	Gentoo Linux	Reporter:	SysadminF2C <sysadmin>
Component:	Current packages	Assignee:	Gentoo Linux MySQL bugs team <mysql-bugs>
Status:	RESOLVED NEEDINFO
Severity:	normal	CC:	hydrapolic
Priority:	Normal
Version:	unspecified
Hardware:	AMD64
OS:	Linux
Whiteboard:
Package list:		Runtime testing required:	---
Attachments:	emerge --info dev-db/mysql kern.log dmesg MySQL error log

Description SysadminF2C 2016-06-13 10:02:09 UTC

Hello,

We have a critical bug with MySQL after few days of running.

The error than we received is "Too Many connections" but it's not the source of the problem.

The "max_connections" is 350.

When we have the problem:
- I can open the last connection and I can see that many queries are blocked.
- For example, if I have 350 blocked queries, I can kill some but not all:
many queries are still displayed (show processlist) and don't stop.
- After kill, I can't connect on MySQL: I've already the message "Too many connections" whereas we aren't at the maximum of connections (example: 320/350)
- If I stop MySQL (/etc/init.d/mysql stop), the stop process is blocking. If I check the number of file opened with 'lsof -c mysqld | wc -l', I can see that MySQL start the closure of tables. But after a time, MySQL is stuck on the closure of one table.

I don't think that the problem is hardware: I don't have any symptoms in log files (dmesg, /var/log/kern.log, etc).
Furthermore, the problem can appears after 2 days as 15 days.

The problem appears after upgrade from MySQL 5.6.24 to MySQL 5.6.27.
Freshly, we tried to upgrade to MySQL 5.6.28 and the problem is still present.

We tried to repair our databases with mysqlrepair and we don't have any problems.

Details:
- kernel: 3.18.9-hardened-xxxx-grs-ipv4-64
- glibc: sys-libs/glibc
- procps: 3.3.9-r2
- ncurses: 5.9-r3
- openssl: 1.0.2h

We think upgrade quickly the operating system and MySQL to the 5.6.30 but we aren't sure that this version can resolve the problem (nothing in MySQL's changelog).

Our 'ulimit' values for MySQL:
# ulimit -Hn
1024000
# ulimit -Sn
1024000
# ulimit -Hu
10240
# ulimit -Su
10240

The 'my.cnf' (mysqld section):
[mysqld]
character-set-server                = utf8
user                                = mysql
port                                = 3306
socket                              = /var/run/mysqld/mysqld.sock
pid-file                            = /var/run/mysqld/mysqld.pid
log-error                           = /var/log/mysql/mysqld.err
basedir                             = /usr
datadir                             = /var/lib/mysql
tmpdir                              = /tmp/
skip-external-locking
key_buffer_size                     = 16M
max_allowed_packet                  = 100M
table_open_cache                    = 1000
table_definition_cache              = 1000
sort_buffer_size                    = 20M
net_buffer_length                   = 8K
read_buffer_size                    = 256K
read_rnd_buffer_size                = 512K
myisam_sort_buffer_size             = 8M
join_buffer_size                    = 256K
lc_messages_dir                     = /usr/share/mysql
lc_messages                         = en_GB
sql-mode                            = "traditional"
explicit_defaults_for_timestamp     = 1
max_heap_table_size                 = 2G
tmp_table_size                      = 2G
back_log                            = 120
max_connections                     = 350
max_connect_errors                  = 100
query_cache_limit                   = 128M
query_cache_size                    = 2G
query_cache_type                    = 1
slow_query_log                      = 1
slow_query_log_file                 = /var/log/mysql/mysql-slow.log
long_query_time                     = 60
# log-queries-not-using-indexes
log-slow-admin-statements
thread_cache_size                   = 2000
bind-address                        = 192.168.1.20
server-id                           = 1
sync_binlog                         = 1
log-bin                             = logmaster
log-bin-index                       = logmasterindex
relay-log                           = logrelay
relay-log-index                     = logrelayindex
binlog_format                       = MIXED
binlog_checksum                     = CRC32
binlog-row-event-max-size           = 8192
expire_logs_days                    = 10
sync_master_info                    = 10000
master_info_repository              = FILE
sync_relay_log                      = 10000
sync_relay_log_info                 = 10000
relay_log_info_repository           = FILE
gtid_mode                           = OFF
flush_time                          = 0
innodb_file_format                  = Barracuda
innodb_buffer_pool_size             = 128G
innodb_data_file_path               = ibdata1:50M:autoextend:max:100G
innodb_log_file_size                = 500M
innodb_log_buffer_size              = 8M
innodb_log_files_in_group           = 2
innodb_flush_log_at_trx_commit      = 1
innodb_lock_wait_timeout            = 50
innodb_io_capacity                  = 200
innodb_autoextend_increment         = 64
innodb_buffer_pool_instances        = 8
innodb_checksum_algorithm           = CRC32
innodb_concurrency_tickets          = 5000
innodb_file_per_table               = 1
innodb_old_blocks_time              = 1000
innodb_old_blocks_pct               = 37
##### innodb_open_files             = -1
innodb_stats_on_metadata            = 0

Our application is a web application and we use:
- Apache 2.2.31
- PHP 5.6.13 with PDO

The web server and the sql server are two different physical servers: we don't have any problem with network. Even if that were the case, MySQL should succeed to stop when we run '/etc/init.d/mysql stop'.

Have you any ideas please ?
Thank you :)

Comment 1 Tomáš Mózes 2016-06-13 21:02:59 UTC

Hello, please provide us with some more information - what hardware do you run the server on? Output of emerge --info, content of dmesg and /var/log/kern.log, your mysql log.

Have you tried running FLUSH HOSTS? Have you tried increasing max_connections and/or max_connect_errors?

Some random observations:
- 2GB query cache is an overkill and may actually slow down your application. It's better to use application caching.
- you have a limit on innodb_data_file_path (max:100GB) - isn't that over the limit? You shouldn't limit it.
- innodb_log_file_size is 500MB, do you really need that big one?
- binlog_format is set to MIXED - do you use replication? If so, the only recommended format is ROW.

How many disk space do you have in your paritions? Especially /var/lib/mysql and /tmp?

Comment 2 SysadminF2C 2016-06-14 07:25:44 UTC

Created attachment 437486 [details]
emerge --info dev-db/mysql

Comment 3 SysadminF2C 2016-06-14 08:13:09 UTC

Created attachment 437488 [details]
kern.log

Comment 4 SysadminF2C 2016-06-14 08:13:34 UTC

Created attachment 437490 [details]
dmesg

Comment 5 SysadminF2C 2016-06-14 08:16:07 UTC

Hello Tomáš,

The server is a physical server:
- CPU: 2 * Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
- Memory: 256 Go DDR3 16866 Mhz ECC (16 * 16 Go)
- Hard Drive: 12 x 600 Go SAS 15K RPM
- RAID: RAID 10 with LSI MegaRAID SAS 9271-8i

Do you want more information about the server ?

You will find the 'emerge --info', 'kern.log' and 'dmesg' in attachments.

Have you tried running FLUSH HOSTS?
=> No but I can try when the problem will happen again.

Have you tried increasing max_connections and/or max_connect_errors?
=> No because I can kill a few queries and connections. After this, the number of connections is lower than max_connections. The problem has often occurred during off-peak hours. I think that MySQL does not respond to a request and that the other accumulates behind.

- 2GB query cache is an overkill and may actually slow down your application. It's better to use application caching.
- you have a limit on innodb_data_file_path (max:100GB) - isn't that over the limit? You shouldn't limit it.
- innodb_log_file_size is 500MB, do you really need that big one?
- binlog_format is set to MIXED - do you use replication? If so, the only recommended format is ROW.
=> Thank you for advices. We are on a production server so it isn't easy to change running configuration without qualification. We have plans to split our SQL server into multiple smaller servers: we will take the opportunity to test different values to these parameters.

How many disk space do you have in your paritions? Especially /var/lib/mysql and /tmp?
=> At the moment we have a single partition:
Filesystem Size Used Avail Use% Mounted on
/dev/root 3.3T 1.7T 1.4T 55% /

Comment 6 Tomáš Mózes 2016-06-15 12:40:19 UTC

The only interesting stuff in the logs is pretty old:

Aug 27 11:30:19 sql1 kernel: EXT4-fs warning (device sda4): ext4_dx_add_entry:1990: Directory index full!

Please attach the logs from /var/log/mysql.

Comment 7 SysadminF2C 2016-06-15 17:51:46 UTC

Hello,

Indeed this problem is old and it was fixed since.

Please find "mysqld_2016.err" in attachment.
It's the single file that we have.

Comment 8 SysadminF2C 2016-06-15 17:53:17 UTC

Created attachment 437678 [details]
MySQL error log

Comment 9 Tomáš Mózes 2016-06-15 19:46:59 UTC

Those shutdowns from the MySQL log - are you doing them on purpose?

Comment 10 SysadminF2C 2016-06-15 20:18:11 UTC

When MySQL has the problem, we have to restart him.
When we try to stop it with '/etc/init.d/mysql stop', it nothing happens even after several minutes so we must kill it with 'kill -9'.

I tried to compile MySQL with 'debug' flag but 'mysqld -V' return:
mysqld  Ver 5.6.28 for Linux on x86_64 (Source distribution)
  => we haven't 'mysqld-debug'.

I tried to enable this parameter in '/etc/mysql/my.cnf':
debug = d:t:i:o,/tmp/mysqld.trace

But I had this error:
2016-06-15 22:15:12 38311 [ERROR] /usr/sbin/mysqld: unknown variable 'debug=d:t:i:o,/tmp/mysqld.trace'

Do you know how can I enable the debug mode please ?

Comment 11 Tomáš Mózes 2016-06-16 05:27:44 UTC

A normal shutdown was issued:
2016-06-10 23:47:33 16945 [Note] /usr/sbin/mysqld: Normal shutdown
2016-06-10 23:47:33 16945 [Note] Giving 350 client threads a chance to die gracefully

Then MySQL was killed and started 2 minutes afterwards:
2016-06-10 23:49:30 48949 [Note] InnoDB: Using atomics to ref count buffer pool pages

You mentioned you have 256GB of RAM and Innodb buffer pool of 128GB. Even though you have innodb_flush_log_at_trx_commit = 1, MySQL has make sure everything is dumped to the disk. I would try giving it more time, because there is nothing unusual in the log, it's a normal shutdown. Can you check iotop or similar tool to see if any disks read/writes are being made? I suppose there will be a lot of writes by MySQL.

I think the problem can be caused by the PHP application itself. If PHP is a module for apache and it stucks for some reason (networking,other services,etc.), it will remain there until you restart Apache.

Try increasing max_connections and max_connect_errors and check how many PHP threads are actually being used.

Debugging is mainly when you really have problems like crashes, depends what you want to find. Since it's a normal shutdown, you can still strace attach and see what system calls are issued. More on debugging:
https://wiki.gentoo.org/wiki/Debugging
https://wiki.gentoo.org/wiki/Project:Quality_Assurance/Backtraces

Comment 12 SysadminF2C 2016-06-21 07:05:01 UTC

Hi,

Sorry for the response time.

When we stop MySQL properly, it takes about 15 minutes.
When we have the problem, even after 40 minutes, MySQL is already running.
Furthermore, the number of open file by MySQL doesn't decrease :-/

We already tried to restart Apache and connections remained on MySQL.

The problem appeared with MySQL upgrade (MySQL 5.6.24 -> 5.6.27). I'm not sure that the problem is side PHP :-/

Comment 13 Tomáš Mózes 2016-06-21 08:41:38 UTC

Can you try a newer release? Was this the only thing changed (I mean the MySQL version)?

Comment 14 SysadminF2C 2016-07-08 08:06:23 UTC

Hi,

Sorry for the response time.

We have upgraded MySQL to MySQL 5.6.30 two weeks ago (2016-06-22).

For the moment the problem hasn't appeared but we have stopped MySQL on the 2016-06-30 for maintenance therefore we have to wait and see.

Comment 15 Tomáš Mózes 2016-07-08 12:26:48 UTC

Thanks for the update. I'll close the bug for now, if it happens again, feel free to reopen and we can continue.