vicidial.org

by **DefLeppard** » Mon Apr 28, 2014 7:36 pm

We have a Vicibox 4.0.3 cluster with 1x DB, 1x Web, 4x Dialers. There are about 70-80 agents logged in on average. It works well most of the time, but at random points throughout the day agents run into issues such as getting paused or kicked out, or customer information taking longer to show up on the agent screen.

The top command on DB server usually shows load averages between 1 and 4 which I am assuming is fine. The %CPU however is something I am worried about. At times, it shoots up to as high as 300. Following is a screenshot taken at a random time during the day.
http://postimg.org/image/56togp1jv/

We used to have 10k SAS drives on the DB server, and the issue happened a lot more often with that server. We contacted Vicidial Group and they suggested we upgrade to 15k drives in a RAID 1 or RAID 10 configuration with an LSI MegaRAID card. We took their suggestion and upgraded to a better server with faster drives and a RAID card. The issue is less frequent now, but it is still happening. How can we get rid of the issue completely?

DB Server: Supermicro 2x Quad Xeon 5520, 24 GB RAM, LSI MegaRAID 256MB, 2x 15k RPM SAS drives in RAID 1 configuration
All other servers: Supermicro 2x Quad Xeon 5420, 8GB RAM, 1x 7200 RPM SATA drive
Vicidial VERSION: 2.8-408a BUILD: 130711-2208 SVN 2003 DB Schema Update 2013-07-24 16:46:45

by **mflorell** » Tue Apr 29, 2014 6:05 am

I would suggest running mtop or running the mysql query "show full processlist;" when this happens to see if there are any specific queries that are causing this issue.

You could also enable slow query logging in MySQL and then check that log file to see if there is a query that happens at the same time.
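Once slow query logging is on, a small script can pull the worst offenders out of the log. The following is a minimal sketch, assuming the standard `# Query_time: ... Rows_examined: ...` header line that MySQL writes before each logged statement. The sample log text, the `parse_slow_log` and `worst_queries` names, and the 1-second threshold are all made up for illustration (only the `Rows_examined` figure is taken from this thread); they are not the contents of the pastebin posted here.

```python
import re

# Header line MySQL writes before each statement in the slow query log.
HEADER = re.compile(
    r"^# Query_time:\s*(?P<query_time>[\d.]+)\s+Lock_time:\s*[\d.]+"
    r"\s+Rows_sent:\s*\d+\s+Rows_examined:\s*(?P<rows_examined>\d+)"
)

# Hypothetical sample, invented for this sketch.
SAMPLE_LOG = """\
# Time: 140429 11:20:01
# User@Host: cron[cron] @ localhost []
# Query_time: 3.200000  Lock_time: 0.000100 Rows_sent: 100  Rows_examined: 3724132
SET timestamp=1398784801;
SELECT lead_id FROM vicidial_list WHERE status='NEW' ORDER BY RAND() LIMIT 100;
# Time: 140429 11:20:05
# User@Host: cron[cron] @ localhost []
# Query_time: 0.400000  Lock_time: 0.000050 Rows_sent: 1  Rows_examined: 1200
SET timestamp=1398784805;
SELECT count(*) FROM vicidial_log;
"""


def parse_slow_log(text):
    """Return one dict per logged statement: query_time, rows_examined, sql."""
    entries = []
    current = None
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            current = {
                "query_time": float(match.group("query_time")),
                "rows_examined": int(match.group("rows_examined")),
                "sql": [],
            }
            entries.append(current)
        elif line.startswith("#"):
            # "# Time" / "# User@Host" lines belong to the next entry.
            current = None
        elif current is not None and line.strip():
            if not line.startswith("SET timestamp"):
                current["sql"].append(line.strip())
    for entry in entries:
        entry["sql"] = " ".join(entry["sql"])
    return entries


def worst_queries(text, min_seconds=1.0):
    """Statements at or above min_seconds, slowest first."""
    hits = [e for e in parse_slow_log(text) if e["query_time"] >= min_seconds]
    return sorted(hits, key=lambda e: e["query_time"], reverse=True)


if __name__ == "__main__":
    for entry in worst_queries(SAMPLE_LOG):
        print(entry["query_time"], entry["rows_examined"], entry["sql"])
```

Sorting by `query_time` is only one choice; sorting by `rows_examined` instead would surface the queries that scan the most rows even when they happen to finish quickly.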

by **DefLeppard** » Tue Apr 29, 2014 11:25 am

Here is a snippet of the MySQL slow query log.
http://pastebin.com/Hbq91vdW

by **Vince-0** » Wed Apr 30, 2014 3:06 am

Your list table is huge: Rows_examined: 3724132. I reckon you should clean up some lists.

by **DefLeppard** » Wed Apr 30, 2014 2:42 pm

Cleaning lists is not an option.

However, your slow queries suggestion did help. I found some queries taking as long as 3.x seconds to complete. These were mainly hopper reload queries. I found an interesting thing with these queries: they had the MySQL RAND() function in them. I tried the query without RAND() and it ran a lot faster. It seems the campaigns have Lead Order set to RANDOM and List Order Randomize set to Y, which is why RAND() ends up in the queries. I changed Lead Order to Down and List Order Randomize to N and, as expected, the queries ran a lot faster and the load average on the server also went down.

They have been working for 5 hours since this morning without issues. But again I wouldn't jump the gun. I will keep monitoring and keep my fingers crossed.

by **geoff3dmg** » Thu May 01, 2014 3:10 am

Yes, Random lead ordering/list ordering is quite taxing on a large lead list. Additionally, lead ordering by call times is also higher load. If you can stick to lead orders that have database indexes, then you will have much faster queries and lower database load.
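To illustrate why a random order costs more than an indexed one, here is a rough sketch using SQLite from Python's standard library as a stand-in for MySQL. It is an analogy, not Vicidial's actual hopper query: the `leads` table, its columns, and the helper names are hypothetical, and SQLite's `RANDOM()` plays the role of MySQL's `RAND()`. The query plan shows that ordering by a random value forces a separate sort step over the candidate rows, while ordering by the primary key can simply walk the existing index. On MySQL the equivalent check would be `EXPLAIN` on the real query, where a random order typically shows up as "Using temporary; Using filesort".

```python
import sqlite3


def build_demo_db(n_rows=10000):
    """Create an in-memory table of fake leads (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE leads (lead_id INTEGER PRIMARY KEY, phone TEXT)")
    conn.executemany(
        "INSERT INTO leads (phone) VALUES (?)",
        ((str(i),) for i in range(n_rows)),
    )
    return conn


def order_by_plan(conn, order_clause):
    """Return the query-plan detail strings for a given ORDER BY clause."""
    sql = (
        "EXPLAIN QUERY PLAN "
        f"SELECT lead_id FROM leads ORDER BY {order_clause} LIMIT 50"
    )
    # The human-readable detail is the last column of each plan row.
    return [row[-1] for row in conn.execute(sql)]


def needs_sort(plan):
    """True when the plan includes an explicit sort step for ORDER BY."""
    return any("TEMP B-TREE" in step for step in plan)


if __name__ == "__main__":
    conn = build_demo_db()
    for clause in ("RANDOM()", "lead_id"):
        plan = order_by_plan(conn, clause)
        print(clause, "-> extra sort step:", needs_sort(plan))
```

The sort step is what grows with the size of the list: every candidate row gets a random value and all of them are sorted before the first 50 can be returned.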

by **boybawang** » Thu May 01, 2014 6:41 am

Try putting your MySQL temp directory on a ramdrive or tmpfs.

by **geoff3dmg** » Thu May 01, 2014 7:38 am

Ideally the DB server should have enough ram to hold the entire database. If so, MySQL temp on tmpfs won't make a difference.

by **DefLeppard** » Thu May 01, 2014 11:12 am

We have 24GB of RAM and the database is just around 4GB in size. Creating a Ramdisk to hold the DB might be a good idea, never tried it before. Any pointers?

by **williamconley** » Sat May 03, 2014 11:14 pm

It's been tried and not made enough difference or failed miserably (but not by us ... and I'm not sure exactly how those others actually tried before they failed). The problem at first isn't that the DB isn't "in ram" but that the portion that's in the cache keeps changing so it must be re-grabbed and you have 3Million records from which to choose while "re-grabbing". This is why indexing makes a huge difference on the queries, because even though the entire DB could be cached, the cache needs to be changed too often.

However, you ARE describing a classic tale of MIRROR slowdown. Try it with RAID10 on an SSD or without RAID at all and you should notice the system coming back to par. But eventually you will find the limit of the system ... it's inevitable with a DB continuing to grow.

Consider pruning all your _log tables (there's an archiving script available for this already), as it may not just be the list table causing the slowdown ... you may also be experiencing cascade failure from other huge tables. This happens quite a bit, too.

If you search the forum for RAID discussions, you'll find that heavy-DB users bump into this problem quite often. Throwing bigger/badder hardware isn't going to resolve the issue that this data has to travel the various pathways and the RAID card becomes involved at some point.

Although I would definitely be interested in seeing someone try to "RAM" the entire data drive during boot and then store it back to HD during shut down ... that's obviously a very dangerous procedure unless you know what you're doing (for instance: Backup DB during startup and then log storage on SSD for security against system shutdown causing the loss of the DB, that sort of thing). After all, if your entire DB is under 4G and you CAN store the entire DB in RAM ... in theory this would stop the need for data flow over the RAID during DB usage. *Poof* you should be able to go like lightning. Experimental, but would be great fun to try ... major clients prefer to stick with "Bigger Badder DB system with faster CPU, faster Memory and More memory and 6G/sec SSD RAID 10" instead. Go figure (tried and true, of course, they like that).

Happy Hunting

by **DefLeppard** » Mon May 12, 2014 1:40 pm

We moved the database to a new server last week. This one has the following specs:

Dual Xeon E5520 @ 2.27 GHz (multi-threaded)
24GB RAM
2x 15k RPM SAS drives in RAID 0 configuration

It works well most of the time with no issues except just sometimes agents start having issues like getting paused automatically and multiple calls. I am starting to think this is an application issue now. This server is faster than we even need. Any inputs?

by **DefLeppard** » Mon May 12, 2014 1:42 pm

FYI: all MySQL queries are being executed in less than 2 seconds, and the issue occurs maybe once every few days.

by **williamconley** » Mon Jul 07, 2014 11:40 pm

DefLeppard wrote:We moved the database to a new server last week. This one has the following specs:

Dual Xeon E5520 @ 2.27 GHz (multi-threaded)
24GB RAM
2x 15k RPM SAS drives in RAID 0 configuration

It works well most of the time with no issues except just sometimes agents start having issues like getting paused automatically and multiple calls. I am starting to think this is an application issue now. This server is faster than we even need. Any inputs?

1) Welcome to the Party! 8-)

2) As you are obviously new here, I have some suggestions to help us all help you:

When you post, please post your entire configuration including (but not limited to) your installation method and vicidial version with build.

This IS a requirement for posting along with reading the stickies (at the top of each forum) and the manager's manual (available on EFLO.net, both free and paid versions)

You should also post: Asterisk version, telephony hardware (model number is helpful here), cluster information if you have one, and whether any other software is installed in the box. If your installation method is "from scratch" you must post your operating system and should also post the .iso version from which you installed your original operating system. If your installation is "Hosted" list the site name of the host.

If this is a "Cloud" or "Virtual" server, please note the technology involved along with the version of that technology (ie: VMware Server Version 2.0.2). If it is not, merely stating the Motherboard model # and CPU would be helpful.

Similar to This:

Vicibox X.X from .iso | Vicidial X.X.X-XXX Build XXXXXX-XXXX | Asterisk X.X.X | Single Server | No Digium/Sangoma Hardware | No Extra Software After Installation | Intel DG35EC | Core2Quad Q6600

3) I hope you meant to say dual quad core for a total of 8 cores. It's nice that it's Xeon, but Xeon itself does not mean "quad" so it's helpful if you just tell us the core count and avoid lookups. Regardless, you have not mentioned how many calls vs how many cores so I can't say whether there's a challenge. Nor have you mentioned whether you are recording all calls or using g729 (deciding factors!). And you were a bit vague about whether you moved ONLY the DB portion of the system to a new machine, and if so whether you now have a multi-server cluster ... or just a DB and a Web/dialer ...

4) Also: multi-threaded ... did you mean hyperthreading is on? As a rule that's very nice in ... windows. But can be counterproductive in linux. I've had some reports that it's neutral and some that it causes instability ... but nothing personal and never that it "helps".

5) Next: Is this the host server, or did you load Proxmox or vSphere or something similar and then place your vicidial system in a virtual machine within the environment? (Have to ask).

6) Autopausing is often attributed to "server issues" but usually turns out to actually be a networking issue. Describe the network configuration, especially how the agents link to the Vicidial web and dialer servers. It can also be related to timing between the servers. Be sure they are time synced. (Not "time SET" but SYNCED ... using ntp to sync the servers. If you're not sure, try "ntpq -p": one server should sync to an outside source and the others should sync to that one.)
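For checking sync across a cluster, the `ntpq -p` output can be read by a script instead of by eye. Below is a minimal sketch that parses a captured `ntpq -p` listing and reports the peer marked with `*` (the tally code ntpq uses for the currently selected sync source) along with its offset in milliseconds. The sample outputs, the addresses (from the documentation range 192.0.2.0/24), and the function name `selected_peer` are invented for illustration; feed it the real output from each server.

```python
# Hypothetical captures of `ntpq -p`, made up for this sketch.
SYNCED_SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.0.2.10      .GPS.            1 u   33   64  377    0.412    0.051   0.020
+192.0.2.11      192.0.2.10       2 u   12   64  377    0.300   -0.120   0.015
"""

UNSYNCED_SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 192.0.2.10      .INIT.          16 u    -   64    0    0.000    0.000   0.000
"""


def selected_peer(ntpq_output):
    """Return the peer flagged '*' as a dict, or None if nothing is selected."""
    in_body = False
    for line in ntpq_output.splitlines():
        if line.startswith("="):
            in_body = True  # peer rows start after the separator line
            continue
        if not in_body or not line.strip():
            continue
        tally, fields = line[0], line[1:].split()
        # Columns: remote refid st t when poll reach delay offset jitter
        if tally == "*" and len(fields) >= 10:
            return {
                "remote": fields[0],
                "stratum": int(fields[2]),
                "offset_ms": float(fields[8]),
            }
    return None


if __name__ == "__main__":
    for name, sample in (("synced", SYNCED_SAMPLE), ("unsynced", UNSYNCED_SAMPLE)):
        print(name, "->", selected_peer(sample))
```

A server with no `*` line has not selected any source yet, which is the "time SET but not SYNCED" situation described above.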

7) Multiple calls is usually a sign of overload, but there are also some old bugs related to hitting the enter key that can have the same effect. And I don't know your Vicidial Version.

Happy Hunting! 8-)

by **Iz3k34l** » Sat Jul 26, 2014 1:39 pm

William,

Could you elaborate a little more on point 7? At what point during a call does hitting the enter key cause another call to come in? And, if known, does hitting the enter key trick the system into thinking the agent is in 'waiting', so that it sends a call to the agent's session?

Database slowing down on cluster
