Before discussing tuning tips on histograms lets cover some background
A histogram is extra basically extra data about the data IN the column.
So a Histogram contains a summary of what is IN the column.
The reason, why I am stressing the word IN, is because the database has to scan the table and read the values in the column to build the histogram.
The Oracle database has 2 types of histograms:
FREQUENCY histograms and HEIGHT-BALANCED histograms
The basic difference is a frequency histogram is more precise, than Height Balanced histogram because it stores the exact count for each value. Height Balanced histograms store points along the range of the values in the column.
This has to be one of the MOST important questions anyone should ask. Why use a histograms? The only reason one should use a histograms is to change how data is queried out of a table. (This included updates) Or stated a different way, histograms are used to change the SQL plan when processing a SQL statement. The real purpose of adding a histogram to table is to give the optimizer more information as to how to come up with a better plan. If the goal is not to change the SQL PLAN, then do not create histograms. Please note this is important because gathering stats with AUTO can have the negative side effect of creating histograms where they should not be. More about this later on this page.
The database view DBA_tab_col_statistics or USER_tab_col_statistics has a column called HISTOGRAM, the value can be HEIGHTBALANCED, FREQUENCY, or NONE.
If NONE, then no histogram present for the column.
The database view DBA_tab_col_statistics or USER_tab_col_statistics has a column called HISTOGRAM, the value can be HEIGHTBALANCED, FREQUENCY, or NONE.
If NONE, then no histogram present for the column.
Oracle will ONLY create a Frequency histograms IF and ONLY IF (Number of Distinct Values in the column) <= (Number of buckets)
If the number of buckets is < the number of distinct values in the column then a Height Balanced histogram is created.
The reason I'm noting this is I have seen several cases where a column can have only 5 values and the number of buckets is 3 and the database creates a height-based histogram. Using repeat option on gathering stats usually causes this problem.
If the column contained the values
A, B, B, B, C, C, C, C, C, C, C, C, C, D, E, E, F, F, F, G
And the table was analyzed correctly to build a Frequency Histogram, the Frequency Histogram would contain:
A - 1 B - 3 C - 9 D - 1 E - 2 F - 3 G - 1
Notice: The EXACT number of rows that will be returned is stored for each value of the column. This gives the optimizer the extra information to know how many rows would be returned and helps the optimizer determine if it should use the index on the column.
Note: If you are using bind values having a histogram on columns can cause bind peeking issues.
Let me explain, lets assume the SQL IS
SELECT DATA FROM TABLE WHERE COLUMN = :BIND
Lets assume the histogram has the value counts of
A - 100
B - 3000
C - 90000
D - 100
E - 2000
F - 3000
G - 1000
If the BIND is A the histogram tells the database that only 100 rows would be returned and therefore use the index.
Now, if the next statement passes in a BIND of C, the index is used again and performance will go down because the database has all ready decided to use the index when it optimized the plan when it peeked at the bind value of A above. In this case using an index will return 90000 rows and this is not ideal.
The Solution for this is to not to use bind values for this ONE statement or when this one column is used in SQL. This will create more SQL statements in the SGA, one for each of the possible bind values. However the SQL will be better optimized better for each of the binds and your performance will improve. This of course assumes the SGA is sized correctly, if you are not sure then I recommend that you hire us or use our Tuning Suite.
If the column contained the values
Row Number | A | B | B | C | C | C | C | C | C | C | C | C | C | D | E | E | F | F | F | G |
Row Number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
Values Saved | Yes | Yes | Yes | Yes |
The height-balanced histogram contains the values
0 - C
1 - C
2 - E
3 - G
As stated before, in a height-balanced histogram, the column values are divided into bands (Buckets) so that each band contains approximately the same number of rows.
If the column contained the values
A, B, B, B, C, C, C, C, C, C, C, C, C, D, E, E, F, F, F, G
And the number of buckets was 5 the database, the data base cuts this up into 5 bands/buckets/ranges( 20 rows / 5 buckets = Store value of every 4th row) and stores the values at value of the column so if the data was
Row Number | A | B | B | C | C | C | C | C | C | C | C | C | C | D | E | E | F | F | F | G |
Row Number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
Values Saved | Yes | Yes | Yes | Yes | Yes |
The height-balanced histogram contains the values
0 - B
1 - C
2 - C
3 - E
4 - G
In both of the above examples, notice that the value of C is stored twice, this is how the database knows that a value of a column is very popular in the column. And using an index, for example to retrieve rows from the table may not be very efficient since the number of rows is likely to be high.
So In a height-balanced histogram, the column values are divided into bands (Buckets) so that each band contains approximately the same number of rows.
If the table has 100,000 rows and the number of buckets is 200 the database, the data base cuts this up into 200 bands/buckets/ranges( 100,000 rows / 200 buckets = Store value of every 500th row)
One important point is when the column has a lot of distinct values and the popular values can easily be obtained, then histograms work in helping change the plan. This is important because in each of the above cases we were able to identify the value that was most popular. But how does the database handle the case when there is not a clear popular value. When this happens the database uses the value of the density to determine sql plan options. The column density can be thought of how many database blocks returned for a given value. Please notice, rows was NOT specifically specified, the statement was blocks, specifically database block. This is a very important point, because the database IO is done in Blocks not rows. This may make more sense with a question.
Question, if a SQL statement is going to return 10,000 rows in 20,000,000,000 row table which will be faster? Have the data to be returned in 10,000 blocks or 10 blocks?
The answer of couse is 10 blocks.
How the data is packed into the database blocks makes a difference! Simply put, less database blocks, faster response time.
As a side note, can you see or have a sense how data conversion or the conversion of data from the old system to the new system change how data is packed into the blocks? Think about it, converted data is almost ALWAYS loaded into the system differently then live transactional data, meaning stats can be wrong leading to wrong plans, and how stats are gather is very important. More on this point in another tip.
Here are some general tips on histograms or what is called "A Rules of thumb"
No rule of thumb is ever perfect!
1) Goal of histogram is to change the SQL plan the optimizer comes up with when it determines the plan to return the results of the sql statement.
2) Frequency histograms are more precise, than Height Balanced histograms.
3) The column Density of a column is used by optimizer when it considers table/index accessed. The column Density is also used when data in histogram table does not make sense to the optimizer.
2) Changing the number of buckets in histograms changes the density.
2.A) More Buckets DOES NOT mean better density.
3) Using degree != 1 can affect density. (Gather stats in parallel) histogram columns will get a different value for Density. And Density can cause plan changes that are not desired.
4) How stats are gathered can cause plan changes that are not desired.
5) Not using binds for a column helps the optimizer pick better plans when bind peeking is an issue. NOTE: The db has a setting to force all sql to use binds. this will cause bind peeking issues.
Do not create Histograms when
1) Data in column is evenly distributed
2) Column is not used in a where clause.
3) Do not create them on every column of every table. This requires more time when the table is analyzed. Increases parse time. And can result in poor plans being generated by the optimizer.
4) No Histograms on Primary Key (PK) of a table. You may find them on PK columns because someone used 'Auto' on the gather stats command.
5) Use Histograms like Hot Peppers, A little makes a dish great, too many well you get the point I hope ;-)
Point here is here is a list when you should consider not creating a histogram. Adding Histograms may have negative effect on the plan the optimizer comes up with.
1) When the number of distinct values in a column is > 254. This forces a Height Balanced histogram and SQL plans can be wrong.
2) Date columns
3) Time stamp columns
Please send us comments here [email protected]
This document is provided for information purposes only and the contents hereof are subject to change without notice. This document is not warranted to be error-free, nor subject to any other warranties or conditions, whether expressed orally or implied in law, including implied warranties and conditions of merchantability or fitness for a particular purpose. We specifically disclaim any liability with respect to this document and no contractual obligations are formed either directly or indirectly by this document.