This is a broad topic but I'll give a few thoughts.

First off, Spectrum is an (often large) set of compute elements embedded in S3 that can perform some parts of the query plan. These parts are centered around applying WHERE conditions and performing aggregation (GROUP BY). Other parts of the query plan cannot be performed in the S3 layer, such as JOINs and advanced functions like window functions.

The next thing to understand is that while these embedded compute elements are close to S3 in terms of access speed, the S3 service is far away from the Redshift cluster (in network distance). If the large amount of data stored in S3 can be pared down to a small set that is shipped to Redshift, then Spectrum can be a huge performance improvement. However, if the large amount of data stored in S3 has to be moved to the Redshift cluster in full to perform the query, there can be a large hit to performance.

Spectrum can be a huge benefit, allowing a very large amount of data to be filtered down quickly by a fleet of small compute elements. This can result in a big win in performance and in the amount of data that can be addressed.

With these points in mind, the data you keep in Spectrum should be data from which your query plan only needs a subset transferred from S3 to Redshift. In general this applies to your fact tables and not to your dim tables. However, if your queries aren't going to apply a WHERE clause to the fact table or aggregate the data down, then you won't see the advantages. For this to work, the WHERE clause also needs to apply to a column in the fact table: JOINs cannot be done in S3, so filtering on dim columns won't help. Similarly, GROUP BY needs to be applied only on fact table columns, or it won't reduce the data coming to Redshift from S3.

Data generally gets into Redshift through S3, and this can be done with the COPY command. You can also get data into Redshift from S3 using Spectrum. This can be a useful tool if other tools are also using S3 for this shared data. S3 can act as a common data store for separate data systems, which can be useful for some data solutions.

You also bring up very large, infrequently used data, like older historical data that is usually not needed but sometimes is. Spectrum can be helpful here: older data can be offloaded from the Redshift cluster, and the access time for this data isn't important because it is used very infrequently. There is one potential issue: the Redshift cluster can only work on a certain size of data, given its disk space and memory.
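To make the point about WHERE and GROUP BY on fact table columns concrete, here is a small toy model in Python. It is only a sketch of the reasoning in this answer, not Redshift or Spectrum code: the table contents, column names, and every function name (`spectrum_layer`, `rows_shipped`, `east_rows_shipped`) are hypothetical and invented for illustration. It counts how many rows would have to cross from the S3 layer to the cluster in each case.

```python
from collections import defaultdict

# Hypothetical fact rows as they might sit in S3: (sale_date, store_id, amount)
FACT_ROWS = [
    ("2023-01-01", 1, 10.0),
    ("2023-01-01", 2, 5.0),
    ("2023-01-02", 1, 7.5),
    ("2024-06-01", 1, 20.0),
    ("2024-06-01", 2, 2.5),
    ("2024-06-02", 2, 4.0),
]

# Hypothetical dim table that lives on the cluster: store_id -> region
STORE_DIM = {1: "east", 2: "west"}


def spectrum_layer(rows, date_from=None, group_by_store=False):
    """Model the S3-side work: WHERE on a fact column, optional GROUP BY."""
    if date_from is not None:
        # WHERE sale_date >= date_from (ISO dates compare correctly as strings)
        rows = [r for r in rows if r[0] >= date_from]
    if group_by_store:
        # GROUP BY store_id with SUM(amount), still inside the S3 layer
        totals = defaultdict(float)
        for _, store_id, amount in rows:
            totals[store_id] += amount
        return sorted(totals.items())
    return list(rows)


def rows_shipped(date_from=None, group_by_store=False):
    """Number of rows that would travel from S3 to the cluster."""
    return len(spectrum_layer(FACT_ROWS, date_from, group_by_store))


def east_rows_shipped():
    """A filter on a dim column needs the JOIN, which only the cluster can do,
    so every fact row is shipped before any of them can be discarded."""
    shipped = spectrum_layer(FACT_ROWS)
    kept = [r for r in shipped if STORE_DIM[r[1]] == "east"]
    return len(shipped), len(kept)


if __name__ == "__main__":
    print("no filter:", rows_shipped())
    print("fact filter:", rows_shipped(date_from="2024-01-01"))
    print("fact filter + group by:", rows_shipped("2024-01-01", True))
    print("dim filter (shipped, kept):", east_rows_shipped())
```

In this toy model the filter on the fact column and the GROUP BY shrink what is shipped, while the filter on the dim column only takes effect after all fact rows have already reached the cluster. Real row counts and costs will of course depend on your data and on how the query planner treats your query.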