Column-oriented | Columnar databases
Table of Contents
· Columnar Database
· [Important] Wide-column stores versus columnar databases
· When to use a Columnar Database?
· Columnar Database Internal Architecture: Why are Columnar Databases faster for Analytics?
· A few examples
· [Example] Use case: Analytics for Food Delivery Application
· Limitations of Columnar Database
· Fun fact:
· What’s the Difference Between a Columnar Database and a Wide-column Database?
· Row vs Column-oriented DB
· ToDo
· References
Columnar Database
In a column-oriented or columnar database, data is stored on disk in a column-wise manner.
e.g., table Bonuses:

ID  Last   First  Bonus
1   Doe    John   8000
2   Smith  Jane   4000
3   Beck   Sam    1000
- In a row-oriented database management system, the data would be stored like this:
1,Doe,John,8000;2,Smith,Jane,4000;3,Beck,Sam,1000;
- In a column-oriented database management system, the data would be stored like this:
1,2,3;Doe,Smith,Beck;John,Jane,Sam;8000,4000,1000;
- Cassandra is basically a column-family store
- Cassandra would store the above data as:
Bonuses: { row1: { "ID":1, "Last":"Doe", "First":"John", "Bonus":8000}, row2: { "ID":2, "Last":"Smith", "First":"Jane", "Bonus":4000} ... }
- In a typical relational database table, each row contains field values for a single record. In row-wise database storage, data blocks store values sequentially for each consecutive column making up the entire row. If the block size is smaller than the size of a record, storage for an entire record may take more than one block. If the block size is larger than the size of a record, storage for an entire record may take less than one block, resulting in inefficient use of disk space. In online transaction processing (OLTP) applications, most transactions involve frequently reading and writing all of the values for entire records, typically one record or a small number of records at a time. As a result, row-wise storage is optimal for OLTP databases.
- In a simplified example, using columnar storage, each data block holds column field values for as many as three times as many records as row-wise storage. This means that reading the same number of column field values for the same number of records requires a third of the I/O operations compared to row-wise storage. In practice, using tables with very large numbers of columns and very large row counts, storage efficiency is even greater.
- An added advantage is that, since each block holds the same type of data, block data can use a compression scheme selected specifically for the column data type, further reducing disk space and I/O.
- The savings in space for storing data on disk also carries over to retrieving and then storing that data in memory. Since many database operations only need to access or operate on one or a small number of columns at a time, you can save memory space by only retrieving blocks for columns you actually need for a query. Where OLTP transactions typically involve most or all of the columns in a row for a small number of records, data warehouse queries commonly read only a few columns for a very large number of rows. This means that reading the same number of column field values for the same number of rows requires a fraction of the I/O operations and uses a fraction of the memory that would be required for processing row-wise blocks. In practice, using tables with very large numbers of columns and very large row counts, the efficiency gains are proportionally greater.
- For example, suppose a table contains 100 columns. A query that uses five columns will only need to read about five per cent of the data contained in the table. These savings are repeated for possibly billions or even trillions of records in large databases. In contrast, a row-wise database would read the blocks that contain the 95 unneeded columns as well (a back-of-the-envelope sketch follows this list).
- Typical database block sizes range from 2 KB to 32 KB. Amazon Redshift uses a block size of 1 MB, which is more efficient and further reduces the number of I/O requests needed to perform any database loading or other operations that are part of query execution.
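To make that arithmetic concrete, here is a back-of-the-envelope sketch in Python. The row count, column count, and value width are made-up assumptions for illustration; only the 1 MB block size comes from the Redshift note above.

```python
# Back-of-the-envelope I/O comparison for a query touching 5 of 100 columns.
# Row count, column count, and value width are illustrative assumptions.

BLOCK_SIZE = 1 * 1024 * 1024       # 1 MB blocks, as in the Redshift note above
NUM_ROWS = 100_000_000             # 100 million rows (assumed)
NUM_COLUMNS = 100
BYTES_PER_VALUE = 8                # assume fixed 8-byte values for simplicity
COLUMNS_IN_QUERY = 5

# Row store: every block interleaves all 100 columns, so the query
# effectively scans the whole table.
row_store_blocks = NUM_ROWS * NUM_COLUMNS * BYTES_PER_VALUE // BLOCK_SIZE

# Column store: only the 5 referenced columns are read from disk.
column_store_blocks = NUM_ROWS * COLUMNS_IN_QUERY * BYTES_PER_VALUE // BLOCK_SIZE

print(row_store_blocks)     # 76293 blocks
print(column_store_blocks)  # 3814 blocks, about 5% of the row-store I/O
```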
[Important] Wide-column stores versus columnar databases
- Wide-column stores such as Bigtable and Apache Cassandra are not column stores in the original sense of the term, since their two-level structures do not use a columnar data layout.
- In genuine column stores, a columnar data layout is adopted such that each column is stored separately on disk.
- Wide-column stores do often support the notion of column families that are stored separately. However, each such column family typically contains multiple columns that are used together, similar to traditional relational database tables.
- Within a given column family, all data is stored in a row-by-row fashion, such that the columns for a given row are stored together, rather than each column being stored separately.
Wide-column stores that support column families are also known as column family databases.
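A toy contrast of the two layouts (illustrative Python structures only, not any store's actual on-disk format):

```python
# Pure column store: one contiguous sequence per column.
column_store = {
    "ID":    [1, 2, 3],
    "Last":  ["Doe", "Smith", "Beck"],
    "Bonus": [8000, 4000, 1000],
}

# Wide-column store: within a column family, the columns of each row
# are stored together, row by row (as in Bigtable or Cassandra).
wide_column_store = {
    "payroll": {                  # one column family
        "row1": {"ID": 1, "Last": "Doe",   "Bonus": 8000},
        "row2": {"ID": 2, "Last": "Smith", "Bonus": 4000},
        "row3": {"ID": 3, "Last": "Beck",  "Bonus": 1000},
    }
}

# Reading one column: a single contiguous sweep in the column store,
# but a hop through every row of the family in the wide-column store.
print(column_store["Bonus"])
print([row["Bonus"] for row in wide_column_store["payroll"].values()])
```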
When to use a Columnar Database?
Columnar databases are used in data warehouses, where businesses send massive amounts of data from multiple sources for BI analysis. Columnar databases reduce I/O for queries that reference only a few columns, since the engine needs to read only the files for those particular columns.
Columnar Database Internal Architecture: Why are Columnar Databases faster for Analytics?
Column-oriented databases (aka columnar databases) are more suitable for analytical workloads because the data format (column format) lends itself to faster query processing — scans, aggregation, etc.
- Columnar databases store the data of a single column together (contiguously) on disk, in memory, or both. Thus we can quickly loop over all the column values, potentially loading some or all of them into the CPU in a single memory reference.
- On the other hand, row-oriented databases store a single row (and all its columns) contiguously. In order to go over some/all values of a particular column (and this is what most analytic operations do), we have to go over each tuple and skip the unnecessary columns simply because the values of a column are not stored contiguously.
- Columnar databases tend to easily leverage efficient parallel processing on hardware. For example, SIMD instructions are used in most columnar databases to get data-level parallelism and evaluate multiple column values in a single instruction. This improves the performance of typical analytical workload queries — scans, scalar aggregation, joins. Thus most columnar databases use vectorized instructions for query processing (the first sketch after this list illustrates the idea).
- Compression — Because each column is stored contiguously as a separate unit, we can use a compression technique that is most suitable for the particular column depending on the column cardinality, data type (number, string, varchar etc) and whether the column is sorted or not.
- Simple and powerful techniques like RLE (Run Length Encoding), Bit Vector Encoding, Null suppression etc can be efficiently used on a per column basis and give better compression ratios because the compression algorithm operates on related (same data type) values.
- Another technique used by columnar databases is to operate directly on the compressed column values during predicate evaluation. They avoid decompression (and its CPU cost) until it is absolutely necessary — presentation of results (the necessary column values) to the end-user (the second sketch after this list shows this).
- Because each column is stored separately, query processing doesn’t have to go over unnecessary columns that the query never touches for operator evaluation, the output list, etc. The required set of columns (usually few in analytical queries) can be accessed directly and individually. In most row-oriented databases, we need to walk each row, column by column, and then extract the set of required columns for projection, evaluation, etc.
- Late Materialization — Many columnar databases use this technique for constructing only the necessary tuples that really qualify for the resultset. Thus values from multiple columns are not stitched together to form a tuple until later, when the predicate evaluation has already happened on the column and we are ready to present the result set to the user (the third sketch below walks through this).
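Three small Python sketches ground these points. All data, schemas, and names below are invented for illustration, and NumPy stands in for an engine's native vectorized execution (its kernels compile down to SIMD instructions on most platforms).

First, the contiguous-column and SIMD point (essentially Example 1 from the next section):

```python
import numpy as np

n = 1_000_000

# Row-oriented: a list of (id, name, salary) tuples; scanning salaries
# means walking every tuple and skipping the columns we don't need.
rows = [(i, f"emp{i}", 50_000 + i % 100_000) for i in range(n)]
total_row = sum(r[2] for r in rows)        # one tuple at a time

# Column-oriented: salaries live in one contiguous buffer, so a single
# vectorized (SIMD-backed) kernel sweeps the whole column.
salary = np.array([r[2] for r in rows], dtype=np.int64)
total_col = int(salary.sum())              # one tight loop over contiguous memory

assert total_row == total_col
```

Second, a toy Run Length Encoding that also evaluates a predicate directly on the compressed runs, without decompressing:

```python
from itertools import groupby

def rle_encode(values):
    """Collapse a (preferably sorted) column into (value, run_length) pairs."""
    return [(v, sum(1 for _ in g)) for v, g in groupby(values)]

# A sorted, low-cardinality STATE column compresses extremely well.
state = ["AZ"] * 3 + ["CA"] * 5 + ["NY"] * 2
runs = rle_encode(state)                  # [('AZ', 3), ('CA', 5), ('NY', 2)]

# Count STATE = 'CA' by inspecting 3 runs instead of 10 raw values.
ca_count = sum(length for value, length in runs if value == "CA")
print(ca_count)                           # 5
```

Third, late materialization, using columns like those in Example 3 below: the predicate runs over one column to produce the qualifying positions, and tuples are stitched together only at the very end:

```python
# Columns stored separately; the data is made up.
state         = ["CA", "NY", "CA", "TX", "CA"]
sale_id       = [101, 102, 103, 104, 105]
shipping_addr = ["a st", "b ave", "c rd", "d ln", "e blvd"]

# Step 1: evaluate the predicate on the STATE column alone, producing
# a list of qualifying row positions (a "selection vector").
positions = [i for i, s in enumerate(state) if s == "CA"]     # [0, 2, 4]

# Step 2 (late materialization): build result tuples only for the
# qualifying positions, just before returning the result set.
result = [(sale_id[i], shipping_addr[i]) for i in positions]
print(result)   # [(101, 'a st'), (103, 'c rd'), (105, 'e blvd')]
```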
A few examples
Example 1 — SELECT SUM(SALARY) FROM EMPLOYEE;
This is a very simple analytical query typically used in report generation etc. The query can be efficiently processed because:
- All values in the SALARY column are stored together. So looping over all the column values to do a simple summation is quick.
- The query can be made more efficient by using SIMD processing techniques to do a parallel ADD operation and some clever bit shifting to compute the sum.
- Unless there is a secondary index on the SALARY column, this operation is going to be very slow in traditional row-oriented databases.
Example 2 — SELECT COLUMN1 + COLUMN2 AS COLUMN3 FROM T;
This simple operation can easily leverage SIMD processing to do a parallel ADD operation between corresponding values of COLUMN1 and COLUMN2.
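In NumPy terms (again a stand-in for the engine's native vectorized execution; the data is made up):

```python
import numpy as np

column1 = np.array([1, 2, 3, 4], dtype=np.int64)
column2 = np.array([10, 20, 30, 40], dtype=np.int64)

# One vectorized kernel adds corresponding elements pairwise, the way
# a SIMD ADD operates on the packed lanes of two registers.
column3 = column1 + column2
print(column3)   # [11 22 33 44]
```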
Example 3 — SELECT SALE_ID, SHIPPING_ADDR FROM SALES WHERE STATE='CA'
- From the SALE_ID and SHIPPING_ADDR columns we only need the relevant values where the predicate evaluation succeeds on the corresponding value in the STATE column.
- Firstly, the predicate evaluation will be super fast because multiple STATE column values can be compared (against CA) in a single instruction by loading them into a SIMD register (typically 128 bit wide).
- Secondly, we will access only the corresponding values from the SALE_ID and SHIPPING_ADDR columns (and decompress them if they are compressed) where the SIMD evaluation was TRUE. This is where late materialization and the ability to operate directly on compressed data come in.
Example 4 — SELECT COUNT(*) FROM EMPLOYEE WHERE SALARY >= 80000 AND SALARY <= 150000
- Again, the predicate evaluation on the SALARY column can use SIMD processing to get data-level parallelism.
- Once the SIMD evaluation has given its result (typically a bit vector or a mask) over a set of column values, all we have to do is count the number of bits that are set to compute the result of COUNT(*).
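A sketch of the mask-then-popcount approach, with made-up salaries:

```python
import numpy as np

salary = np.array([75_000, 90_000, 120_000, 160_000, 80_000])

# Vectorized predicate evaluation yields a boolean mask (the "bit vector").
mask = (salary >= 80_000) & (salary <= 150_000)

# COUNT(*) is then just a population count over the mask; no row access needed.
print(int(mask.sum()))   # 3
```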
Example 5 — SELECT FIRST_NAME, LAST_NAME FROM EMPLOYEE WHERE SALARY >= 80000 AND SALARY <= 150000
- Should be similar to Example 3.
In general, an analytical query workload comprises queries that touch a subset of table columns but a large set of rows to construct the result. Such queries usually require predicate evaluation on all values of one or more columns. Thus they are more efficient on columnar databases because of the organization of data (column values are stored together). On the other hand, OLTP workloads run well against row-oriented databases, since they comprise operations interested in full rows (all columns of a row).
Consider the query — SELECT * FROM EMPLOYEE.
This is not really an analytical query, since it is interested in all the columns of every row. It will in fact be much slower on columnar databases because it requires several disk seeks to all the separately stored columns. The same query will be very fast on a row store, since a row is stored contiguously and all its column values can be loaded into memory in a single physical I/O.
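A sketch of that tuple-reconstruction cost (data invented): materializing full rows from a column store means gathering from every column, where a row store reads each row as one contiguous chunk:

```python
# Column store: SELECT * must visit every separately stored column to
# rebuild each row (on disk, roughly one seek per column file).
columns = {
    "id":    [1, 2, 3],
    "last":  ["Doe", "Smith", "Beck"],
    "first": ["John", "Jane", "Sam"],
    "bonus": [8000, 4000, 1000],
}
rows = [
    {name: values[i] for name, values in columns.items()}
    for i in range(3)
]
print(rows[1])   # {'id': 2, 'last': 'Smith', 'first': 'Jane', 'bonus': 4000}
```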
[Example] Use case: Analytics for Food Delivery Application
For example, different events occur as users interact with a food ordering app:
1. They search for a dish (search event)
2. They select a restaurant and view the menu (view event)
3. They may add a dish to the cart (add_to_cart event)
4. They may place an order (order event)
5. They may abandon the cart (abandon_cart event)
Now, if you want answers to queries like:
1. Top 10 restaurants that got the highest number of orders in the last month after users searched “Cold Coffee”, where the city was Hyderabad?
2. How many people searched for pure-veg restaurants and placed an order with a particular restaurant in the last year?
3. Most popular items in Jaipur in the last week?
Column-oriented databases answer such queries faster because the column layout keeps related data closer together, which reduces seek time.
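As a sketch, query 3 over column-wise event data touches only three columns, no matter how many attributes each event record carries. Column names and data are invented:

```python
from collections import Counter

# Hypothetical order-event table stored column-wise; the dozens of other
# columns (user_id, device, payment method, ...) are never read.
city     = ["Jaipur", "Hyderabad", "Jaipur", "Jaipur"]
item     = ["paneer tikka", "cold coffee", "dal baati", "paneer tikka"]
days_ago = [2, 1, 9, 3]

popular = Counter(
    item[i]
    for i in range(len(city))
    if city[i] == "Jaipur" and days_ago[i] <= 7
)
print(popular.most_common(10))   # [('paneer tikka', 2)]
```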
Limitations of Columnar Database
- Columnar databases are not well suited to incremental data loading.
- Column-oriented databases are a poor fit for Online Transaction Processing (OLTP) applications: a single-record write or update touches many separately stored column files, which is I/O intensive. Columnar databases also have a reputation for performing poorly on join operations, another reason they are not recommended for OLTP workloads.
- Queries that touch only a few rows gain little or no benefit from columnar storage.
- As the use of in-memory analytics increases, the relative benefits of row-oriented versus column-oriented databases may become less important. In-memory analytics is not concerned with efficiently reading and writing data to a hard disk. Instead, it allows data to be queried in random access memory.
- Updates can be inefficient.
Fun fact:
Swiggy uses Snowflake for Swiggylytics. [Architecture diagram of Swiggylytics]
What’s the Difference Between a Columnar Database and a Wide-column Database?
A columnar data store stores each column separately on disk. A wide-column database instead stores a whole column family together on disk: within a family, the columns of a given row are kept together, so it is not a column store in the strict sense (see the section above).
Row vs Column-oriented DB
ToDo
- https://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html
- answer this — https://stackoverflow.com/questions/43379117/is-hbase-a-columnar-db?noredirect=1&lq=1
- https://towardsdatascience.com/columnar-stores-when-how-why-2d6759914319