Agents in DEAD Calls and Randomly Pausing
Posted: Tue Jun 14, 2022 12:23 am
First, let me say that I am (still) running Version 2.14-670a (2.14b0.5), SVN 2973 (on all servers), DB Schema 1542, build 180424-1521. Six of my telephony servers are running Asterisk 11.22.0-vici, one is running 11.21.2-vici, and two are running 13.27.0-vici. The system was installed from the ViciBox v8.1 ISO.
Next a little background...
We have recently added about 30 more agents, and we are load balancing across 3 of our best servers: 1 that used to be my DB server (20-core Intel Xeon) and 2 that have always been agent servers (12-core Intel Xeon each). With these additional agents I'm at around 150, so I typically see 40 - 50 agents on each of these servers on the REPORTS page. Currently, my (40-core) DB server is just that - only DB and web - and the remaining 6 servers vary in processor cores, RAM, etc., but they're used exclusively for outbound dialing trunks, which combine to give me 1500 - 2000 outbound dialing channels if/when I need them. Not sure if it's relevant, but our 150 agents are spread out over 10+ separate offices (different cities, etc.), so as much as I'd like to, I can't blame their local internet connections for the problem - the issue I'm about to describe appears to happen everywhere.
The problem...
Over the past week or so, our agents have been experiencing random pausing between calls, as well as randomly showing up in a DEAD call on the Real-Time screen while still showing READY (or even PAUSED) on their own screens. The random pausing isn't so bad, because the agent sees it, but the DEAD call issue isn't obvious, so it's not resolved until someone notices and tells them to log out and back in.
While trying to figure out what the issue is, I checked each server's load average on its Real-Time screen (running that report via each server's own web server, so the load average I see is for that server). We are experiencing a relatively HIGH load on the DB server - commonly 5.0 - 17.5. The strange thing is that it's this high even when nothing appears in mtop and there seems to be nothing going on in MySQL - plus, top/htop shows mysqld using 250 - 480% CPU or more, again even when I don't see any active queries in mtop. The three agent servers that all of our extensions register to do show some random peaks in load average, but they typically remain in the single digits.
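In case it helps anyone diagnose this, here's roughly how I've been checking on the DB server (the commands are just a sketch; credentials are omitted, and mtop can miss queries that finish in under its refresh interval, which may explain the "idle but busy" symptom):

```shell
# Snapshot what mysqld is actually executing right now, including
# queries too short-lived for mtop's refresh to catch:
mysql -e "SHOW FULL PROCESSLIST;"

# Per-second counter deltas. A high Questions rate with an empty process
# list usually means a flood of very fast queries rather than a few slow ones:
mysqladmin --relative --sleep=1 extended-status \
    | grep -E "Questions|Threads_running|Slow_queries"

# Break the mysqld CPU usage down by thread, to confirm it's query
# threads (and not, say, background I/O) burning the 250-480%:
top -H -p "$(pidof mysqld)"
```

If the Questions counter is climbing fast while the process list looks empty, that would point at query volume from the agent screens rather than any single bad query.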
My question is this...
Is this more likely caused by the load on our database server, OR by the number of agents each load-balanced server is now responsible for?
Should I be looking to fix/optimize something on the DB server? Or should I bring additional servers into our load-balancing configuration by repointing the URL of 1 or 2 of each station's 3 extensions to different servers, to lower the number of agents balanced on those three?
If this is more likely an issue on the DB server, what suggestions do you have to help make it more efficient? Honestly, it had been running great until about 3 weeks ago. We have approached 150 reps in the past without these issues, but it's possible we didn't stay there long enough for the problems to be noticed.
I am running a mysqlcheck overnight tonight to see if there are any corrupt tables. Tomorrow, unless I hear anything else, I am planning on modifying the log archive crontab job to keep less data in the _log tables and more in the _log_archive ones.
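For the record, the overnight check is just a plain mysqlcheck along these lines (the database name and credentials here are placeholders for my actual ones):

```shell
# Read-only corruption check of every table in the dialer database:
mysqlcheck --check --databases asterisk -u cron -p

# If anything comes back corrupt, --auto-repair can fix repairable
# (MyISAM) tables in the same pass:
# mysqlcheck --check --auto-repair --databases asterisk -u cron -p
```

I'll run it read-only first so I know what, if anything, needs repairing before touching tables during off-hours.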
Anyway - thank you for taking the time to read this - I really do appreciate any feedback/suggestions/advice you can give me - thanks!
David