Tuesday, February 03, 2009
A word from my current work world...
Now that I have a job, my current Big Project is to make our product, which currently falls apart under a load of about 20 million "sessions" (about 60M records), run with 500M sessions, or about 1.5B records. Searches are ad-hoc on any combination of AND, OR, and NOT across about 20 different search parameters. I also had to do this without needing fancy hardware support. To make things more interesting, the schema had to work on both Oracle and MySQL.
To solve this problem, I used a schema derived from the "snowflake" designs used in data warehouses, and broke the data up into "dimension" (search) tables versus "fact" (data display) tables. The dimension tables are denormalized in that they store some data that is also in the FACT tables.
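As a rough sketch of what I mean (table and column names here are invented for illustration, not the real product schema), each searchable attribute gets its own narrow dimension table keyed by session ID, while the wide fact table carries the full display data:

```sql
-- Illustrative only: invented names, not the actual schema.
-- Narrow dimension tables, one per searchable attribute, keyed for range scans.
-- They repeat attributes that also live in the fact table, which is the
-- denormalization mentioned above.
CREATE TABLE dim_user_agent (
    session_id  NUMBER        NOT NULL,
    user_agent  VARCHAR2(64)  NOT NULL,
    PRIMARY KEY (user_agent, session_id)
);

CREATE TABLE dim_status (
    session_id  NUMBER        NOT NULL,
    status      VARCHAR2(16)  NOT NULL,
    PRIMARY KEY (status, session_id)
);

-- Wide fact table holding the display data.
CREATE TABLE fact_session (
    session_id  NUMBER PRIMARY KEY,
    score       NUMBER,              -- the computed sort value used for top-N
    user_agent  VARCHAR2(64),
    status      VARCHAR2(16)
    -- ... plus the many other display columns
);
```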
Queries begin with initial qualification against the search tables. The results of these searches are saved to temp tables. The final set of qualifying IDs is then computed with SQL INTERSECT, UNION, or MINUS queries over the IDs in the temp tables, depending on the search logic. Once we have the final set of qualifying IDs, the set is sorted by a computed value that is bound to every "session", and the top N records are joined with the FACT tables to produce the final result.
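Concretely, a hypothetical search like "user_agent = X AND NOT status = FAILED" would flow roughly like this (an Oracle-flavored sketch using the invented tables above; tmp_q1 and tmp_q2 are assumed to be pre-created temp tables of session IDs):

```sql
-- Step 1: qualify each criterion separately, each one a single index range scan.
INSERT INTO tmp_q1 SELECT session_id FROM dim_user_agent WHERE user_agent = 'Mozilla/4.0';
INSERT INTO tmp_q2 SELECT session_id FROM dim_status     WHERE status = 'FAILED';

-- Step 2: apply the search logic as set operations on the ID sets
--         (AND -> INTERSECT, OR -> UNION, AND NOT -> MINUS).
-- Step 3: sort by the computed score and join only the top N back to the fact table.
SELECT *
  FROM (SELECT f.*
          FROM fact_session f
          JOIN (SELECT session_id FROM tmp_q1
                MINUS
                SELECT session_id FROM tmp_q2) q
            ON q.session_id = f.session_id
         ORDER BY f.score DESC)
 WHERE ROWNUM <= 100;
```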
The fundamental problem with the old schema is that it relied heavily on ID joins for secondary qualification after "picking" a primary search qualification. Large-scale ID joins require a full B-tree descent for every record in the join, and when hundreds of thousands of records are qualified this way, the query takes minutes or more. My new schema avoids this problem by qualifying each criterion separately, doing a single pass through the B-tree index per criterion, and performing the qualification logic without ever needing to visit the (very wide) FACT table.
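For contrast, this is the general shape of the old, join-heavy style of query (purely hypothetical names; I'm not reproducing the old schema here). One indexed predicate drives the search, and every surviving row is then qualified by joining it, ID by ID, against the other tables:

```sql
-- Hypothetical shape of the old approach: one driving predicate, then per-row
-- ID joins for every remaining criterion.  Each probe of session_status is a
-- separate B-tree descent, so hundreds of thousands of driving rows means
-- that many descents.
SELECT s.*
  FROM sessions s
  JOIN session_status st
    ON st.session_id = s.session_id      -- one index probe per driving row
 WHERE s.user_agent = 'Mozilla/4.0'      -- the "primary" search qualification
   AND st.status = 'FAILED';             -- the "secondary" qualification
```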
One thing that helped hugely was using INDEX ORGANIZED tables in Oracle, which means all of the table data is physically stored in the index, as opposed to the standard storage method of having index records point at "real" table storage in a heap. This meant my qualification searches never had to chase index records back to a base table; they were pure index scans. I haven't done the MySQL port yet, but InnoDB storage is essentially the same as INDEX ORGANIZED tables, so this trick should still work.
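As a sketch, here is the dimension table from the earlier example declared index-organized, so a criterion scan never leaves the index; on the MySQL side, the assumption is that InnoDB's primary-key clustering gives the same layout:

```sql
-- Oracle: the whole dimension table is stored in its primary-key B-tree.
CREATE TABLE dim_user_agent (
    session_id  NUMBER        NOT NULL,
    user_agent  VARCHAR2(64)  NOT NULL,
    CONSTRAINT dim_user_agent_pk PRIMARY KEY (user_agent, session_id)
) ORGANIZATION INDEX;

-- MySQL/InnoDB analogue (untested assumption): InnoDB always clusters rows
-- on the primary key, so the equivalent definition should behave the same way.
-- CREATE TABLE dim_user_agent (
--     session_id  BIGINT       NOT NULL,
--     user_agent  VARCHAR(64)  NOT NULL,
--     PRIMARY KEY (user_agent, session_id)
-- ) ENGINE=InnoDB;
```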
One advantage I have with our application is that it has a magic number I can use to short-circuit searches, but even without this magic number, the new approach is still far faster than a more "standard" approach using simple joins.
The final result is that my new approach is nearly four orders of magnitude faster than the existing schema, and its performance is completely predictable. We're able to go after much bigger deals than before now that we have a more scalable schema.