Unexpected Asterisk CID can crash whole cluster

All installation and configuration problems and questions

Moderators: gerski, enjay, williamconley, Op3r, Staydog, gardo, mflorell, MJCoate, mcargile, Kumba, Michael_N

Unexpected Asterisk CID can crash whole cluster

Postby ccabrera » Mon Apr 23, 2018 10:32 pm

Hello:

Today a client of mine had a database lock cause by a non-optimized query, which was in turn triggered by an unexpected Asterisk CID appearing on the channels.

Here´s what happened:

Calls stopped flowing to the agents. Doing a mysql -e 'show processlist' revealed a query was blocking the vicidial_list table, thus no records could be updated. The query was this:

Code: Select all
UPDATE vicidial_list set status='ADC' where lead_id = '5379071502'


The lead_id in this database is not that high, so it was pretty clear that the lead_id was incorrect. After doing an egrep over /var/log/astguiclient/action_full-*, the place where that lead_id showed up was this line:

Code: Select all
Callerid: VCagcW1524537798150298150298150298150298


I know that Vicidial uses the Asterisk CID variable to pass information between calls, so what I see that happened here was that the wrong CID caused the wrong lead_id to be looked up. This wouldn´t be a problem since that would only cause the UPDATE query to fail and move on, but the problem is that because of the size of the database, that simple UPDATE was locking the table for something like 5 minutes per query.

Doing an EXPLAIN UPDATE on MySQL, showed me this:

Code: Select all
MariaDB [asterisk]> EXPLAIN UPDATE vicidial_list SET STATUS='ADC' WHERE lead_id = '5379551502';
+------+-------------+---------------+------+---------------+------+---------+------+----------+-------------+
| id   | select_type | table         | type | possible_keys | key  | key_len | ref  | rows     | Extra       |
+------+-------------+---------------+------+---------------+------+---------+------+----------+-------------+
|    1 | SIMPLE      | vicidial_list | ALL  | NULL          | NULL | NULL    | NULL | 36806543 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+----------+-------------+


Which means is doing a full table search for processing that query. If I run the same query without the single quotes (thus searching for an integer instead of string), I get this:

Code: Select all
MariaDB [asterisk]> EXPLAIN UPDATE vicidial_list SET STATUS='ADC' WHERE lead_id = 5379551502;
+------+-------------+---------------+-------+---------------+---------+---------+------+------+-------------+
| id   | select_type | table         | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+------+-------------+---------------+-------+---------------+---------+---------+------+------+-------------+
|    1 | SIMPLE      | vicidial_list | range | PRIMARY       | PRIMARY | 4       | NULL |    1 | Using where |
+------+-------------+---------------+-------+---------------+---------+---------+------+------+-------------+


Which means it would only look up 1 row and thats it. The query would finish very fast and nothing would be harmed.


I found that the script using this query is the /usr/share/astguiclient/FastAGI_log.pl in line 997, which reads like this:

Code: Select all
$stmtA = "UPDATE vicidial_list set status='$VDL_status' where lead_id = '$CIDlead_id';";


So in order that this doesn´t happen again, I need to change it to this:

Code: Select all
$stmtA = "UPDATE vicidial_list set status='$VDL_status' where lead_id = $CIDlead_id;";


And this problem would be solved. This line still exists in current trunk as of 2018-04-23.



Now, before anyone yells at me, I´m aware of the following:

1. Yes, the database is HUGE (36M leads). We clean 3 month old leads since we require to store the data for that long.
2. Yes, it was Asterisk using strange CIDs that caused this problem in the first place. Maybe it was too much load on one box.

So even though I know this wouldn´t be an issue with a "small" database or with a light load Asterisk, I´m also sure this wouldn´t happen if the query was created properly using an integer instead of string in the first place. Keep in mind that this is a cluster, so at most I would expect an Asterisk box to crash but let the others continue. This problem causes that a single Asterisk box can bring down the whole cluster.

I will gladly provide a patch for this, but I would like to read some thoughts first.

Any comments?

Regards,
Christian Cabrera
Enlaza Comunicaciones - Vicidial Partner
Mexico City
ccabrera
 
Posts: 153
Joined: Fri Jan 14, 2011 7:53 pm
Location: Mexico City

Re: Unexpected Asterisk CID can crash whole cluster

Postby mflorell » Mon Apr 23, 2018 10:48 pm

We've seen all kinds of queries crash databases that large. Yes, this one was certainly not optimal, but if not that one, it would have been another query that did it. 36 Million leads is WAY too many in the vicidial_list table, and it can easily cause all kinds of slow-downs, database crashes and even data corruption.
mflorell
Site Admin
 
Posts: 18386
Joined: Wed Jun 07, 2006 2:45 pm
Location: Florida

Re: Unexpected Asterisk CID can crash whole cluster

Postby williamconley » Mon Apr 23, 2018 10:59 pm

We have an app that moves leads to a vicidial_list_archive table to reduce the "live" leads. Eventually, of course, even that archive table will overload, but this app has been known to bring dying servers back to life for the clients that use it. Of course, you can just move those leads yourself. Beware: Moving large quantities of leads in a table that large can take hours. We have an upgraded version of our app that "marks" leads (by list or campaign) to be archived and then moves them overnight to avoid downing the database during work hours.
Vicidial Installation and Repair, plus Hosting and Colocation
Newest Product: Vicidial Agent Only Beep - Beta
http://www.PoundTeam.com # 352-269-0000 # +44(203) 769-2294
williamconley
 
Posts: 20258
Joined: Wed Oct 31, 2007 4:17 pm
Location: Davenport, FL (By Disney!)


Return to Support

Who is online

Users browsing this forum: No registered users and 129 guests

cron