Clustered Index

  • A clustered index is a type of index in a database that determines the physical order of the data in a table. It defines the way data is stored on disk, based on the values of one or more columns.
  • When a clustered index is created, the database system arranges the data in the table according to the order specified by the index. This means that the rows in the table are physically sorted on disk based on the values of the columns included in the clustered index. This can improve the performance of queries that involve range searches or sorting, as the data is already ordered in a meaningful way.
  • In most database management systems, the primary key is automatically created as a clustered index by default. The primary key is a unique identifier for each row in a table, and having it as a clustered index ensures that the data is physically sorted in the order of the primary key values. This can be particularly beneficial for tables that are frequently queried using the primary key, as it allows for fast and efficient access to the desired rows.
  • It is worth noting that a table can have only one clustered index, as the physical ordering of the data can only be determined by one set of columns. However, it is possible to have multiple non-clustered indexes on a table, which provide alternative ways to access the data based on different columns or combinations of columns.
  • In summary, a clustered index in a database determines the physical order of the data in a table, based on the values of one or more columns. It is automatically created with the primary key in most database management systems, and it can improve the performance of queries involving range searches or sorting.
  • Clustered indexes are stored as trees. With clustered index, the actual data is stored in the leaf nodes. This can speed up getting the data when a lookup is performed on the index. As a consequence, a lower number of IO operations are required.
  • A clustered index is a special type of index that reorders the way records in the table are physically stored. Therefore table can have only one clustered index. The leaf nodes of a clustered index contain the data pages.
  • Clustered index determines the storage order of rows in the table, and hence doesn’t require additional disk space, but where as Non clustered index is stored separately from the table, additional storage space is required.
  • [Stack overflow] With a clustered index the rows are stored physically on the disk in the same order as the index. Therefore, there can be only one clustered index.
  • Data is sorted in sorted order by cluster key.
  • Clustered index points to the base data using key-values (not rids)
  • It is generally faster to read from a clustered index if you want to get back all the columns. You do not have to go first to the index and then to the table.
  • This means that whenever you insert or update a record, the clustered index ensures that the order is maintained. This process can be a hit on the performance of your application.
    → Writing to a table with a clustered index can be slower, if there is a need to rearrange the data.
  • For the tables without a clustered index, data is stored in an unordered heap. Heaps do not have an ordered structure, they do not have a natural order. This can get problematic as the table grows in size and data gets fragmented over time.
  • each new index will increase the time it takes to write new records.

If you want fast performant read-only applications that require all of the columns of a table then implement a clustered index on the table

  • Additionally the index can be reorganized and rebuilt without moving the data out into a new table.
  • SQL Server automatically creates a Primary key constraint on a clustered index. Having said that, the primary key constraint can be removed. Primary key should uniquely identify a row.

Non unique clustered index?

  • If a index is non-unique, a uniquifier value is added internally to make it unique, and it carries through into the reference values. RIDs are always unique.

Non-Clustered Index

  • A non-clustered index is a special type of index in which the logical order of the index does not match the physical stored order of the rows on disk. The leaf node of a non-clustered index does not consist of the data pages. Instead, the leaf nodes contain index rows.
  • Also known as B-Tree index.
  • The data is ordered in a logical manner in a non-clustered index. The rows can be stored physically in a different order than the columns in a non-clustered index. Therefore, the index is created and the data in the index is ordered logically by the columns of the index.
  • Non clustered indexes are created on top of the actual data.
  • Unlike clustered index, the leaf pages of the index do not contain the actual data. The leaf pages of the non clustered index contain the pointers.
  • These pointers point to the clustered index leaf nodes. They contain the address of the required row.

Pointers are like the page numbers in the Index page of a book.

  • Remember the leaf nodes of a clustered index store the actual data.
  • If a clustered index does not exist in a table then the leaf nodes of a non clustered index points to the heap page where the data is stored. Heap is essentially an unordered (randomly ordered) rows of data.
  • Therefore when we query for data, first the non-clustered index is searched to get the address of the data and then the lookup is performed on the clustered index to get the data. Hence this makes the non-clustered index usually slower than the clustered index.
  • There can be multiple non-clustered indexes in a table.

Although the rows in the table are physically ordered as specified by the order of the clustered index but the non-clustered index contains unique values of columns in the order that is specified for the index, and they contain pointers to the actual data. Think of non clustered index as a dictionary of a table.

  • Can have more than one non-clustered index.
  • Non-clustered index is not dependent on base data, instead it references to the physical location of base data using rids (row identifiers)
  • Since, non-clustered index is separate from base data. Base data could exist as non-clustered index
  • [Stack overflow] With a non clustered index there is a second list that has pointers to the physical rows. You can have many non clustered indices, although each new index will increase the time it takes to write new records.
  • Once non-clustered indexes are built, you can start querying them. Indexes employ an ideal search technique known as binary search. Binary searches function by constantly dividing the data in half and determining whether the item you’re looking for comes before or after the entry in the center of the current section of data. This works well with B-trees since they are built to begin at the middle entry; while searching for entries within the tree, you know that entries down the left path will be smaller or before the current entry, while entries to the right will be larger or after the current entry. In a table, it would look like:
Source — https://www.atlassian.com/data/sql/how-indexing-works

When Should We Create A Clustered Index?

Note, always execute your query and analyze the execution plan to understand where the performance bottle-necks are but it is wise to create clustered index when:

  1. Your query contains filters on specific columns such as WHERE clauses or on JOINS and you are required to query those columns over and over again. Additionally, the requirement is that these columns are not updated as frequently.
  2. When you are always required to return data in an ordered manner then you can create a clustered index with the required columns of the ORDER BY clause. As a consequence, all of the rows of the table are not required to be scanned.
  3. If your application performs a large number of reads on a table then the clustered index will make the operation extremely fast.
  4. Your queries SELECT all or most of the columns of a table then you can think of creating a clustered index.

When Should We Create A Non-Clustered Index?

Note, always execute your query and analyze the execution plan to understand where the performance bottle-necks are but it is wise to create a non-clustered index when:

  • Multiple queries are required to filter rows on a table and there are different set of columns which are used in the WHERE clause and JOINS. Additionally, these columns are updated frequently.
  • Again, if you are constantly returning data in specific order then create a non-clustered index to speed up the performance. It will also reduce the memory footprint as you will not be required to perform an additional sorting.
  • If certain columns are used more frequently in the queries then you can create a non clustered index on the tables. This is known as the Cover Non Clustered Index.
  • If you want to create an index on only those rows that meet a specific criteria then you can add a WHERE clause in the non clustered index. This is known as a Filter Non Clustered Index.
  • We can also INCLUDE the columns that we need in the SELECT statement to be added as leaf nodes in the non clustered index. It speeds up the retrieval time.

Issues With Creating Indexes

Sometimes the existences of indexes can badly impact the performance.

  1. Indexes take space on disk and can impact the memory footprint of the SQL process. The clustered index does not take as much space as the non-clustered index does because the non clustered index are stored in a separate space on the disk.
  2. Clustered index are useful if you are performing a large number of reads but for every insert, the data needs to be shuffled and re-ordered. Hence they are not appropriate for tables that require fast inserts. Matter of fact, if you want fast inserts then remove all of the indexes.
  3. Clustered indexes are not useful if you need to perform ORDER BY on different set of columns than the columns of the clustered index.
  4. Non-clustered indexes need to be carefully designed because if you include only sub-set of the columns then the index will not be as useful than it would be if you were to include all of the required columns. Having said that, more the columns, larger the index and more the space it consumes on disk. If you create two non clustered indexes that contain a column then you will be duplicating the unique values of that column. This in turn will consume more space on the disk.
  5. Too many indexes can destroy the performance. Let’s assume you create two non-clustered indexes where the first non-clustered index is on columns A and B of a table and the second non clustered index is on columns B, C and D of a table. If you query for columns A, C and D then SQL will be using the two indexes to get the required pointers and then it will lookup for the data in the table. This will badly impact the performance of the queries.
  6. If you need to bulk import data into a staging table then it will very likely be better if you do not create indexes at all as indexes can impact the performance.

Let’s Understand With An Example

Imagine Trades is a table in your database. The full name of the table is FinTechExplained.Trades.

The table contains a large number of columns but for the sake of simplicity, assume it contains following key columns:

  1. Let’s also consider that the combination of TradeId and TradeType makes each row unique and most of the queries require us to get the trades based on the trade Id and Trade Type. Additionally the natural order of the trades is based on the TradeId and TradeType.
  • Therefore we can create a Clustered Index on the TradeId and TradeType columns:
-- using create index statement
CREATE CLUSTERED INDEX IX_Trade_TradeId ON FinTechExplained.Trade(TradeId ASC, TradeType ASC);
  • The clustered index IX_Trade_TradeId will store the trades in the leaf node of the index, ordered by the TradeId and TradeType columns. It implies that the rows of the Trade table will be physically ordered by TradeId and TradeType.

2. A large number of queries require us to return the Created at column of the trades. Additionally the queries occasionally pass in the required Created at and Created by values in the filter clause. Therefore we can create a non-clustered index:

CREATE INDEX IX_Trade_CreatedAtCreatedBy ON FinTechExplained.Trade (CreatedAt ASC,CreatedBy ASC) INCLUDE (CreatedAt)
  • Because we have INCLUDE in our non clustered index, the leaf node will contain the values of the CreatedAt column.
  • If a large number of queries require us to return the trades that were created by Farhad then we can create a filter non clustered index:
CREATE INDEX IX_Trade_FarhadCreatedBy ON FinTechExplained.Trade (CreatedAt ASC,CreatedBy ASC) where createdBy='Farhad'
  • The WHERE clause in the index will create a non clustered index for rows where CreatedBY meets the required criteria.

Resources

--

--