Today a client of mine had a database lock cause by a non-optimized query, which was in turn triggered by an unexpected Asterisk CID appearing on the channels.
Here´s what happened:
Calls stopped flowing to the agents. Doing a mysql -e 'show processlist' revealed a query was blocking the vicidial_list table, thus no records could be updated. The query was this:
- Code: Select all
UPDATE vicidial_list set status='ADC' where lead_id = '5379071502'
The lead_id in this database is not that high, so it was pretty clear that the lead_id was incorrect. After doing an egrep over /var/log/astguiclient/action_full-*, the place where that lead_id showed up was this line:
- Code: Select all
Callerid: VCagcW1524537798150298150298150298150298
I know that Vicidial uses the Asterisk CID variable to pass information between calls, so what I see that happened here was that the wrong CID caused the wrong lead_id to be looked up. This wouldn´t be a problem since that would only cause the UPDATE query to fail and move on, but the problem is that because of the size of the database, that simple UPDATE was locking the table for something like 5 minutes per query.
Doing an EXPLAIN UPDATE on MySQL, showed me this:
- Code: Select all
MariaDB [asterisk]> EXPLAIN UPDATE vicidial_list SET STATUS='ADC' WHERE lead_id = '5379551502';
+------+-------------+---------------+------+---------------+------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+---------------+------+---------------+------+---------+------+----------+-------------+
| 1 | SIMPLE | vicidial_list | ALL | NULL | NULL | NULL | NULL | 36806543 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+----------+-------------+
Which means is doing a full table search for processing that query. If I run the same query without the single quotes (thus searching for an integer instead of string), I get this:
- Code: Select all
MariaDB [asterisk]> EXPLAIN UPDATE vicidial_list SET STATUS='ADC' WHERE lead_id = 5379551502;
+------+-------------+---------------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+---------------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | vicidial_list | range | PRIMARY | PRIMARY | 4 | NULL | 1 | Using where |
+------+-------------+---------------+-------+---------------+---------+---------+------+------+-------------+
Which means it would only look up 1 row and thats it. The query would finish very fast and nothing would be harmed.
I found that the script using this query is the /usr/share/astguiclient/FastAGI_log.pl in line 997, which reads like this:
- Code: Select all
$stmtA = "UPDATE vicidial_list set status='$VDL_status' where lead_id = '$CIDlead_id';";
So in order that this doesn´t happen again, I need to change it to this:
- Code: Select all
$stmtA = "UPDATE vicidial_list set status='$VDL_status' where lead_id = $CIDlead_id;";
And this problem would be solved. This line still exists in current trunk as of 2018-04-23.
Now, before anyone yells at me, I´m aware of the following:
1. Yes, the database is HUGE (36M leads). We clean 3 month old leads since we require to store the data for that long.
2. Yes, it was Asterisk using strange CIDs that caused this problem in the first place. Maybe it was too much load on one box.
So even though I know this wouldn´t be an issue with a "small" database or with a light load Asterisk, I´m also sure this wouldn´t happen if the query was created properly using an integer instead of string in the first place. Keep in mind that this is a cluster, so at most I would expect an Asterisk box to crash but let the others continue. This problem causes that a single Asterisk box can bring down the whole cluster.
I will gladly provide a patch for this, but I would like to read some thoughts first.
Any comments?
Regards,