{"componentChunkName":"component---src-templates-post-template-js","path":"/posts/column-based","result":{"data":{"markdownRemark":{"id":"38286cf3-da67-54a5-a985-02704faed06b","html":"<h4 id=\"preface\" style=\"position:relative;\"><a href=\"#preface\" aria-label=\"preface permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Preface</h4>\n<p>This is the third article in a series summarizing the key concepts of Chapter 3, Storage and Retrieval, of the <a href=\"https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Designing Data-Intensive Applications</a> book. 
The series consists of 3 articles, covering log-structured storage engines (SSTables and LSM-trees), page-oriented storage engines (B-trees) and column-based storage engines.</p>\n<h4 id=\"landscape-of-database-storages\" style=\"position:relative;\"><a href=\"#landscape-of-database-storages\" aria-label=\"landscape of database storages permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Landscape of database storages</h4>\n<figure>\n\t<span class=\"gatsby-resp-image-wrapper\" style=\"position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 960px; \">\n      <a class=\"gatsby-resp-image-link\" href=\"/static/b0453fbb95c32586f445dca4f9795c72/3e992/storage-engine-tree.png\" style=\"display: block\" target=\"_blank\" rel=\"noopener\">\n    <span class=\"gatsby-resp-image-background-image\" style=\"padding-bottom: 46.25%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAJCAYAAAAywQxIAAAACXBIWXMAAAsSAAALEgHS3X78AAAA+ElEQVQoz32SiYrFMAhF8/+/WCil0H3f9zpzhEDa92aEiyHR69Vo5Nfu+xbrXfxl0zRJFEXi+760bfvINW8yt8B5nrIsi+z7rmf8dV0yDIOUZamk4zg+ipm3Gns+jkOqqlIV+KIolKSua/VJkojnedL3vcbO8yzruopxSVBDMgkEUn3bNgXBvHPmnrg0TaXrOsmyTIIgUG/c9khyQeVvRixEkKIMUEQVooZ2UMSFNcjsnJqm0bkBkmkbNWEY6t1jhrYNC3eW/Gae55pEHKAoxWmX+XJ+/LJL4BID1JAA0XuteLcb8LE2FhDQHrvFjOI41mGj8j9zt+QHDZm8w9QVNO4AAAAASUVORK5CYII=&apos;); background-size: cover; display: block;\"></span>\n  <picture>\n        <source 
srcset=\"/static/b0453fbb95c32586f445dca4f9795c72/8ac56/storage-engine-tree.webp 240w,\n/static/b0453fbb95c32586f445dca4f9795c72/d3be9/storage-engine-tree.webp 480w,\n/static/b0453fbb95c32586f445dca4f9795c72/e46b2/storage-engine-tree.webp 960w,\n/static/b0453fbb95c32586f445dca4f9795c72/f992d/storage-engine-tree.webp 1440w,\n/static/b0453fbb95c32586f445dca4f9795c72/97599/storage-engine-tree.webp 1902w\" sizes=\"(max-width: 960px) 100vw, 960px\" type=\"image/webp\">\n        <source srcset=\"/static/b0453fbb95c32586f445dca4f9795c72/8ff5a/storage-engine-tree.png 240w,\n/static/b0453fbb95c32586f445dca4f9795c72/e85cb/storage-engine-tree.png 480w,\n/static/b0453fbb95c32586f445dca4f9795c72/d9199/storage-engine-tree.png 960w,\n/static/b0453fbb95c32586f445dca4f9795c72/07a9c/storage-engine-tree.png 1440w,\n/static/b0453fbb95c32586f445dca4f9795c72/3e992/storage-engine-tree.png 1902w\" sizes=\"(max-width: 960px) 100vw, 960px\" type=\"image/png\">\n        <img class=\"gatsby-resp-image-image\" src=\"/static/b0453fbb95c32586f445dca4f9795c72/d9199/storage-engine-tree.png\" alt=\"Landscape of database storages\" title=\"Landscape of database storages\" loading=\"lazy\" style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\">\n      </picture>\n  </a>\n    </span>\n</figure>\n<p>In short, database storage engines fall into two categories: row-based and column-based. 
In this article we focus on the core concept of the column-based storage engine and compare its pros and cons with those of row-based storage engines.</p>\n<h2 id=\"column-based-storage-engine\" style=\"position:relative;\"><a href=\"#column-based-storage-engine\" aria-label=\"column based storage engine permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Column-based storage engine</h2>\n<p>Before introducing the column-based storage engine, we need to cover two concepts first:</p>\n<h3 id=\"oltp-vs-olap\" style=\"position:relative;\"><a href=\"#oltp-vs-olap\" aria-label=\"oltp vs olap permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>OLTP v.s. 
OLAP</h3>\n<figure style=\"max-width: 500px\">\n\t<span class=\"gatsby-resp-image-wrapper\" style=\"position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 960px; \">\n      <a class=\"gatsby-resp-image-link\" href=\"/static/64569c4e418ce51c64efa94915eee260/fbf08/OLTP-OLAP.png\" style=\"display: block\" target=\"_blank\" rel=\"noopener\">\n    <span class=\"gatsby-resp-image-background-image\" style=\"padding-bottom: 25%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAFCAYAAABFA8wzAAAACXBIWXMAAAsSAAALEgHS3X78AAABJElEQVQY0z2Q226CQBRF+f+/qEUbYxprU216tzEICiLgtanWC1VMhVgriIzI7nCadJKVM2ceVvYewZyG6H1G6NoHmp3ZHhanx3dzdsB6ywAkSE+0XMCXaggaMnayhLCpwFfqdA/4TBFSycQDhosIhUodpUcN+ruH+Za/OTG8n+O/kK0c7LkE/S5GtxXIORFKLgtfbSC2DIS6BiFNc18fEKWnFknLrxbupD6aQxebIMH0Y4yJbSNx15SOZFkRkpiBdH4GLX8Bp/oMZugQqKq9R609Q7lqovig4ubFgDHeoDMP4e1inI4M7HSihKGq4JtXs4qX0At5LsxgcH2FsKUSVHn0lRDDJSPeHEb7YBnDpcp/J3Ic+rc0CWu3EGhN4sCrRnxPhb9JjmRbfNKovAAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;\"></span>\n  <picture>\n        <source srcset=\"/static/64569c4e418ce51c64efa94915eee260/8ac56/OLTP-OLAP.webp 240w,\n/static/64569c4e418ce51c64efa94915eee260/d3be9/OLTP-OLAP.webp 480w,\n/static/64569c4e418ce51c64efa94915eee260/e46b2/OLTP-OLAP.webp 960w,\n/static/64569c4e418ce51c64efa94915eee260/c76c7/OLTP-OLAP.webp 962w\" sizes=\"(max-width: 960px) 100vw, 960px\" type=\"image/webp\">\n        <source srcset=\"/static/64569c4e418ce51c64efa94915eee260/8ff5a/OLTP-OLAP.png 240w,\n/static/64569c4e418ce51c64efa94915eee260/e85cb/OLTP-OLAP.png 480w,\n/static/64569c4e418ce51c64efa94915eee260/d9199/OLTP-OLAP.png 960w,\n/static/64569c4e418ce51c64efa94915eee260/fbf08/OLTP-OLAP.png 962w\" sizes=\"(max-width: 960px) 100vw, 960px\" type=\"image/png\">\n        <img class=\"gatsby-resp-image-image\" src=\"/static/64569c4e418ce51c64efa94915eee260/d9199/OLTP-OLAP.png\" alt=\"OLTP v.s. OLAP\" title=\"OLTP v.s. 
OLAP\" loading=\"lazy\" style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\">\n      </picture>\n  </a>\n    </span>\n</figure>\n<table>\n<thead>\n<tr>\n<th></th>\n<th>OLTP (Online Transaction Processing)</th>\n<th>OLAP (Online Analytical Processing)</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Primarily used by</td>\n<td>End user/customer, via web application</td>\n<td>Internal analyst, for decision support and performance reporting</td>\n</tr>\n<tr>\n<td>Main read pattern</td>\n<td>Small number of records per query, fetched by key</td>\n<td>Aggregate over large number of records</td>\n</tr>\n<tr>\n<td>Main write pattern</td>\n<td>Random-access, low-latency writes from user input</td>\n<td>Bulk import (ETL) or event stream</td>\n</tr>\n<tr>\n<td>What data represents</td>\n<td>Latest state of data (current point in time)</td>\n<td>History of events that happened over time</td>\n</tr>\n<tr>\n<td>Dataset size</td>\n<td>Gigabytes to terabytes</td>\n<td>Terabytes to petabytes</td>\n</tr>\n</tbody>\n</table>\n<p>As analytical workloads grew in large companies, it became clear that these two scenarios access the database in very different ways, so new kinds of databases evolved to serve analytical queries efficiently. 
One of the most important is the column-based storage engine.</p>\n<h3 id=\"star-schema\" style=\"position:relative;\"><a href=\"#star-schema\" aria-label=\"star schema permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Star Schema</h3>\n<p>When business or data analysts analyze business data, the schema very commonly has a central fact table (e.g. orders, transactions, events) and several dimension tables that the fact table references through foreign keys (e.g. product and customer tables). With the fact table in the middle and the dimension tables around it, the schema looks like a “star”. 
That is why it is called a star schema.</p>\n<p>The fact table is usually very large (it can exceed a trillion rows at a large company), which makes analytical queries slow on a traditional row-based database.</p>\n<h3 id=\"column-based-storage-engine-1\" style=\"position:relative;\"><a href=\"#column-based-storage-engine-1\" aria-label=\"column based storage engine 1 permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Column-based storage engine</h3>\n<figure>\n\t<span class=\"gatsby-resp-image-wrapper\" style=\"position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 960px; \">\n      <a class=\"gatsby-resp-image-link\" href=\"/static/d24a667eefc744241802576b0bab6474/b6e50/column-based-storage-engine.png\" style=\"display: block\" target=\"_blank\" rel=\"noopener\">\n    <span class=\"gatsby-resp-image-background-image\" style=\"padding-bottom: 67.91666666666667%; position: relative; bottom: 0; left: 0; background-image: 
url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAOCAYAAAAvxDzwAAAACXBIWXMAAAsSAAALEgHS3X78AAABtElEQVQ4y5VT2Y6CQBDk/3/NeK3nCiiKFyqEIKAcUtvVcRJ19cFJKjMMQ011dWG5bgTHCeG6RITZ7IQwTFGWBa7XAkXB+aow67IsFbfbDU3TPMGy7V/M5w4mkyFse4rhsI/B4AfT6RTj8Rij0UgumaHT6cj+AK1WC5vNBlmWKfnrsHx/LQe28LwlgiDAer3GbrdHFEXY7/ey3uF4PML3fX2/Wq30+XK5KA6Hg+J0YmUhrKqqlJnyvx0kZCVUT+XtdhuW4zhS8lz8c1WF53kKKjXP2+0Wi8VC1zxPRXmea8n0jeVTNT0WD204At5EUvpE77hP77hP8n6/r8+9Xk/LTtNUCUj45GEtpVZ1jYrzHdwrxYrqBTXPyUwi4m2XG3aKN4kfCpbxpnufxj+FV4lALiXpvFwqCvGsuOfN3Ew1Bq+qDJQwEUOJTAji8xmxeHMWw2kyQdKvFLIh9IY5OkjOOALJn7mVXjFfzKABzzKn70gt/gksg2qCOyHX5iCbwLiw+91uV7tMLMUac+aJkH8C8xTHMZIkUUXMGUlNNB4/JFjRx5LNX2JecE1yBpiB/eTho7JH0j/iHDGDCh0wTQAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;\"></span>\n  <picture>\n        <source srcset=\"/static/d24a667eefc744241802576b0bab6474/8ac56/column-based-storage-engine.webp 240w,\n/static/d24a667eefc744241802576b0bab6474/d3be9/column-based-storage-engine.webp 480w,\n/static/d24a667eefc744241802576b0bab6474/e46b2/column-based-storage-engine.webp 960w,\n/static/d24a667eefc744241802576b0bab6474/f992d/column-based-storage-engine.webp 1440w,\n/static/d24a667eefc744241802576b0bab6474/511b7/column-based-storage-engine.webp 1862w\" sizes=\"(max-width: 960px) 100vw, 960px\" type=\"image/webp\">\n        <source srcset=\"/static/d24a667eefc744241802576b0bab6474/8ff5a/column-based-storage-engine.png 240w,\n/static/d24a667eefc744241802576b0bab6474/e85cb/column-based-storage-engine.png 480w,\n/static/d24a667eefc744241802576b0bab6474/d9199/column-based-storage-engine.png 960w,\n/static/d24a667eefc744241802576b0bab6474/07a9c/column-based-storage-engine.png 1440w,\n/static/d24a667eefc744241802576b0bab6474/b6e50/column-based-storage-engine.png 1862w\" sizes=\"(max-width: 960px) 100vw, 960px\" type=\"image/png\">\n        <img class=\"gatsby-resp-image-image\" src=\"/static/d24a667eefc744241802576b0bab6474/d9199/column-based-storage-engine.png\" alt=\"column based storage 
engine\" title=\"column based storage engine\" loading=\"lazy\" style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\">\n      </picture>\n  </a>\n    </span>\n</figure>\n<p>The core concept of a column-based storage engine is quite intuitive: we store data by column instead of by row. Imagine that we have millions of customer records. If we store them by row, an aggregation (e.g. a sum or an average) is very time-consuming, because we may need to load every row in full, including the columns the query does not use (unless a suitable index exists).</p>\n<p>On the other hand, if we store data by column, calculating the average age of all customers only requires loading the file for the age column, which can improve performance dramatically.</p>\n<h4 id=\"the-benefits\" style=\"position:relative;\"><a href=\"#the-benefits\" aria-label=\"the benefits permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>The benefits</h4>\n<ul>\n<li>reduced query latency: when a query reads only a few columns across a large portion of the records, we don’t need to load every full record from disk into memory and then pick out the values we want, which can speed up the query considerably</li>\n<li>column compression → less disk space and faster execution: some columns have only a small number of distinct values, so we can compress them with encodings such as bitmap or run-length encoding. 
This requires less disk space and speeds up execution, because more of the compressed data fits in the CPU’s L1 cache and filters can be evaluated with bitwise AND/OR operations on the bitmaps</li>\n</ul>\n<h4 id=\"the-drawbacks\" style=\"position:relative;\"><a href=\"#the-drawbacks\" aria-label=\"the drawbacks permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>The drawbacks</h4>\n<ul>\n<li>worse write performance: inserting one record requires updating every column file, so it’s better to write in batches (or buffer incoming data in memory and then write it to the DB in batches)</li>\n</ul>\n<h4 id=\"real-world-db\" style=\"position:relative;\"><a href=\"#real-world-db\" aria-label=\"real world db permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Real world DB</h4>\n<ul>\n<li>DBs: <a href=\"https://clickhouse.com/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">ClickHouse</a> / <a href=\"https://www.vertica.com/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Vertica</a></li>\n</ul>\n<h4 id=\"comparison-matrix-of-row-based--column-based-storage-engine\" 
style=\"position:relative;\"><a href=\"#comparison-matrix-of-row-based--column-based-storage-engine\" aria-label=\"comparison matrix of row based  column based storage engine permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Comparison matrix of Row-based &#x26; Column-based storage engines</h4>\n<table>\n<thead>\n<tr>\n<th></th>\n<th>Pros &#x26; Cons</th>\n<th>Real world DB</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Row-based</td>\n<td>- Better write performance <br> - Easier to handle transactional operations</td>\n<td>Cassandra / LevelDB / HBase / Lucene / MySQL / PostgreSQL / MongoDB</td>\n</tr>\n<tr>\n<td>Column-based</td>\n<td>- Much better aggregation performance over a few columns across a large portion of the data <br> - Worse write performance, so data is usually written in batches</td>\n<td>ClickHouse / Vertica</td>\n</tr>\n</tbody>\n</table>\n<h3 id=\"summary\" style=\"position:relative;\"><a href=\"#summary\" aria-label=\"summary permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Summary</h3>\n<p>If you have also read the 
previous 2 articles about storage and retrieval, then congratulations! We should now have a mental framework for deciding which database to use for an application.</p>\n<p>Let’s recall the landscape of database storage engines. Now, how would you choose a database for your application?</p>\n<figure>\n\t<span class=\"gatsby-resp-image-wrapper\" style=\"position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 960px; \">\n      <a class=\"gatsby-resp-image-link\" href=\"/static/b0453fbb95c32586f445dca4f9795c72/3e992/storage-engine-tree.png\" style=\"display: block\" target=\"_blank\" rel=\"noopener\">\n    <span class=\"gatsby-resp-image-background-image\" style=\"padding-bottom: 46.25%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAJCAYAAAAywQxIAAAACXBIWXMAAAsSAAALEgHS3X78AAAA+ElEQVQoz32SiYrFMAhF8/+/WCil0H3f9zpzhEDa92aEiyHR69Vo5Nfu+xbrXfxl0zRJFEXi+760bfvINW8yt8B5nrIsi+z7rmf8dV0yDIOUZamk4zg+ipm3Gns+jkOqqlIV+KIolKSua/VJkojnedL3vcbO8yzruopxSVBDMgkEUn3bNgXBvHPmnrg0TaXrOsmyTIIgUG/c9khyQeVvRixEkKIMUEQVooZ2UMSFNcjsnJqm0bkBkmkbNWEY6t1jhrYNC3eW/Gae55pEHKAoxWmX+XJ+/LJL4BID1JAA0XuteLcb8LE2FhDQHrvFjOI41mGj8j9zt+QHDZm8w9QVNO4AAAAASUVORK5CYII=&apos;); background-size: cover; display: block;\"></span>\n  <picture>\n        <source srcset=\"/static/b0453fbb95c32586f445dca4f9795c72/8ac56/storage-engine-tree.webp 240w,\n/static/b0453fbb95c32586f445dca4f9795c72/d3be9/storage-engine-tree.webp 480w,\n/static/b0453fbb95c32586f445dca4f9795c72/e46b2/storage-engine-tree.webp 960w,\n/static/b0453fbb95c32586f445dca4f9795c72/f992d/storage-engine-tree.webp 1440w,\n/static/b0453fbb95c32586f445dca4f9795c72/97599/storage-engine-tree.webp 1902w\" sizes=\"(max-width: 960px) 100vw, 960px\" type=\"image/webp\">\n        <source srcset=\"/static/b0453fbb95c32586f445dca4f9795c72/8ff5a/storage-engine-tree.png 
240w,\n/static/b0453fbb95c32586f445dca4f9795c72/e85cb/storage-engine-tree.png 480w,\n/static/b0453fbb95c32586f445dca4f9795c72/d9199/storage-engine-tree.png 960w,\n/static/b0453fbb95c32586f445dca4f9795c72/07a9c/storage-engine-tree.png 1440w,\n/static/b0453fbb95c32586f445dca4f9795c72/3e992/storage-engine-tree.png 1902w\" sizes=\"(max-width: 960px) 100vw, 960px\" type=\"image/png\">\n        <img class=\"gatsby-resp-image-image\" src=\"/static/b0453fbb95c32586f445dca4f9795c72/d9199/storage-engine-tree.png\" alt=\"Landscape of database storages\" title=\"Landscape of database storages\" loading=\"lazy\" style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\">\n      </picture>\n  </a>\n    </span>\n</figure>\n<p>My approach is pretty simple. For analytical requirements such as internal analysis or performance reporting, we should first consider a column-based storage engine. For a write-intensive application, we can consider a log-structured storage engine; on the other hand, if reads usually outnumber writes, we can consider a page-oriented storage engine.</p>\n<p>Yes, this is just the first step in learning how to design data-intensive applications, but everything starts with a first step. Let’s proceed to the later chapters of the book.</p>","fields":{"slug":"/posts/column-based","tagSlugs":["/tag/database-storage/","/tag/column-based-based-storage-engine/","/tag/row-based-storage-engine/","/tag/designing-data-intensive-application/"]},"frontmatter":{"date":"2021-12-19T23:12:04.772Z","description":"As analytical requirements grow in large companies, the performance of aggregating large-scale data with a row-based storage engine gradually becomes unacceptable. Column-based storage engines were developed to optimize analytical workloads. They bring a huge improvement when querying a large portion of the data on a few columns. 
In this article we will talk more about its core concept.","tags":["Database Storage","Column-based based storage engine","Row-based storage engine","Designing Data Intensive Application"],"title":"Storage and Retrieval (3) - Column-based storage engine","socialImage":{"publicURL":"/static/e1c2867b251a4f300016b135407f731f/social.jpg"}}}},"pageContext":{"slug":"/posts/column-based"}},"staticQueryHashes":["251939775","401334301","825871152"]}