An entry about how the way SQL queries are built affects their performance, something I think many people don’t pay attention to:
For the last couple of days I’ve been struggling with the database of an osCommerce-based on-line shop. Everything ran perfectly until after four years of existence we have a DB with over 1500 categories and 16000 products, and the web is getting increasingly slower over time. We use a heavily modified osCommerce 2.2 installation, so moving to osCommerce 3 is out of the question (besides, it’s still alpha and this is a production site). After some hw and os-level optimization that didn’t help to improve performance, I had to start digging into the code and the DB.
First of all: activate the log-slow-queries option on mySQL’s config. This creates a new log with all the queries that take longer than expected to run (def. more than 10 seconds). With this information, we can concentrate on those queries that are hitting the bigger performance penalties and start the optimization process there.
To get you on context, this is the query that extracts all the products from a given category:
p.products_image, p.products_quantity, p.products_date_added,
p.products_id, p.manufacturers_id, p.products_price,
p.products_tax_class_id, IF(s.STATUS, s.specials_new_products_price, NULL)
AS specials_new_products_price, IF(s.STATUS, s.specials_new_products_price,
p.products_price) AS final_price
FROM products_description pd, products p
LEFT JOIN manufacturers m ON p.manufacturers_id = m.manufacturers_id
LEFT JOIN specials s ON p.products_id = s.products_id,
products_to_categories p2c
WHERE
p.products_status = ’1′ AND p.products_id = p2c.products_id AND
pd.products_id = p2c.products_id AND pd.language_id = ’3′ AND
p2c.categories_id = ’114′
ORDER BY p.products_date_added DESC;
On the first place for those unfamiliar with osCommerce, there are a heap of tables involved. Six on this example: categories, products, products_to_categories, products_description, manufacturers and specials (products with discounted prices). According to mysql-slow.log, that query was taking around 12 seconds to run. The log also read “Rows_examined: 1903433″. No wonder it was taking that long. :-D
Why so many “rows_examined”? Going back to DB theory on the University, ;-D if memory serves and leaving each particular DB engine’s optmizations aside, when selecting from several tables the DB engine does the cartesian product (product set) of all their tuples, this is, every possible combination of each tuple on one table with every other tuple on the other tables. After that, those combinations matching the WHERE criteria are selected (and I say engine optimizations aside because I don’t get the math here: 16k products, 16k descriptions, 1k5 categories, and just 1903433 combinations?). On the other hand with JOIN the related data is appended to those tuples already selected by the WHERE clause and ONLY TO THOSE.
So, why using more than one table on the where clause if it isn’t strictly necessary? On the previous example, we want to extract all the info of the products on a given category, but we select from the products, products_description and products_to_categories tables. Wouldn’t it be better using only the products_to_categories table on the from clause, reducing the result set to only the category we are interested on, and then get all the products’ information with JOIN? It’s the same principle as when piping data through several commands on a shell-script, using first grep and then cut: first reduce the data set, so that on each following step there is less data to spend CPU cycles with, increasingly reducing the overall processing time.
With this modification, the resulting query looks like:
p.products_image, p.products_quantity, p.products_date_added,
p.products_id, p.manufacturers_id, p.products_price,
p.products_tax_class_id, IF(s.STATUS, s.specials_new_products_price, NULL)
AS specials_new_products_price, IF(s.STATUS, s.specials_new_products_price,
p.products_price) AS final_price
FROM products_to_categories p2c
LEFT JOIN products p ON p.products_id=p2c.products_id
LEFT JOIN products_description pd ON p.products_id=pd.products_id AND pd.language_id=’3′
LEFT JOIN manufacturers m ON p.manufacturers_id = m.manufacturers_id
LEFT JOIN specials s ON p.products_id = s.products_id
WHERE p2c.categories_id = ’114′ AND p.products_status = ’1′
ORDER BY p.products_date_added DESC
The result sets of both queries are identical, but while the first one took around 12 seconds to run, the second one only takes a couple of milliseconds. :-) I’ve found several similar cases already on my osCommerce installation, that have gone from more than 10 seconds to milliseconds after rearanging the query in a similar fashion.
Another important point are indexes: I’ve found some queries that couldn’t be further optimized, they already used only one table on the from clause, the more restricting one, and then some joins. On these cases what I’ve found is that some table lacked an index on the foreign key column, and after adding it the execution time has gone down again from seconds to milliseconds.