rolling-parts
|
By Dan Roth When loading data to a warehouse on of the biggest problems is removing old data to make room for new data. Partitioning is a method used by RDBMS’ to physically isolate data within a single table. It allows you to store related groups of data in a single table but still keep the data physically separated. The biggest advantage is that the individual partitions can be managed, accessed and updated apart from the other partitions. Within a Data Warehouse one of the most common methods of grouping data is by time. This article will attempt to explain how we can partition data by time and automate the replacing of the oldest data with the newest data. There are many benefits to partitioning.
Let’s say we have a request from Joe’s T-Shirts.Com to track Internet sales. They want the current month plus the six previous months of detailed sales information stored in a data warehouse (I realize that six months is not a realistic amount of time to track trends but it makes the examples easier. The same theory applies to six or six hundred months). At the end of six months the oldest data will be archived. We would start by creating a partitioned table with 7 partitions (one for the current month and one for each of the previous 6 months). If we started collecting data in January 2001 the first months data would have been placed in partition 1. February’s data will be in partition 2, March’s in partition 3. We would continue this until we hit July 2001. Joe’s T-shirt sales table as of 7/01/2001
Now we have a little problem. Where do we put August’s data? We could add another partition but we only need to keep 6 months worth of history. Therefore when August begins we no longer need January’s data. At the beginning of August we can therefore clear out partition one and begin populating that with August’s data. In September we will do the same thing with partition 2. If we load the data this way we will not have to create new partition’s and we will always have the rolling seven months of data. It does not matter to the RDBMS that September’s data is in a lower numbered partition than April. The table's indexes will ensure that the data is retrieved in the proper order. Joe’s T-shirt sales table as of 9/01/2001
This concept may be confusing if we think of it in a linear fashion (like the above example). It is often clearer if we think of the data in a circular fashion (see diagram 1). The newest data constantly replaces the oldest around the circle. We are simply removing the slice of January’s data and replacing it with a slice of August data. Each month we move around the circle replacing the oldest slice with a new one.
This entire process can be automated with a few simple steps. When a table is defined we can tell the RDBMS how to determine which data goes into which partition. For example address data may be partitioned on zip code, therefore each partition would contain a range of zip codes. For this example I would recommend adding a column named PART_NR to the table (later it will become obvious why I chose this method). When we do an insert the database will interrogate the value of PART_NR and automatically place the data into the proper partition. Here is sample DDL to create a table named TSHIRT_SALES and divide it into 7 partitions (PART1 thru PART7). The concept is the same in any RDBMS that supports partitioning, though the syntax may vary slightly. Create table TSHIRT_SALES ( MONTH number, PRODUCT number, REGION number, SALES number(5,2), PART_NR number Constraint TSHIRT_SALES_PK primary key (MONTH,PRODUCT,REGION) ) partition by range (PART_NR) (partition PART1 value less than (2) tablespace PART1_TS, (partition PART2 value less than (3) tablespace PART2_TS, (partition PART3 value less than (4) tablespace PART3_TS, (partition PART4 value less than (5) tablespace PART4_TS, (partition PART5 value less than (6) tablespace PART5_TS, (partition PART6 value less than (7) tablespace PART6_TS, (partition PART7 value less than (MAXVALUE) tablespace PART7_TS, In the above example we are telling Oracle that we are using the column PART_NR as our partitioning column. Then we are telling it to put any rows where the value of PART_NR is less than 2 into partition 1. We must repeat the partition assignment for each partition. Partition 7 uses MAXVALUE instead of '(8)'. This is just saying anything that did not meet the criteria for the first 6 partitions put into partition 7. We should never encounter a situation where PART_NR is greater than 7 but this is just a safe guard. So now we have a table created with 7 partitions and we told Oracle that we would provide a value in a column named PART_NR. This value will be used to determine the partition into which the data should be inserted. So how can we keep track of what data goes where? I would recommend creating a small table who's only purpose is to manage the partitions. The only columns it would require would be month, year, and the partition that will contain the data.
Whenever we create data to insert into our table we can access the table above to determine what value should be inserted into PART_NR. All programs that insert into this table, or create load files, must ensure that they are accessing this table to properly populate PART_NR. This table could be preloaded years in advance. Therefore as the months roll over no changes will need to be made to the table definition or the application programs. The only thing we must remember is to clear out the oldest month's data before we start inserting the new data. Good Luck!! |

