i’ve been researching solid state disk and its impact on data warehousing / business intelligence for quite some time. quite frankly, i should have seen this sooner! anyhow, in this entry, i’ll dive into ssd technology and discuss what i think it could do for the future of data warehousing appliances, data modeling, and databases. i’m going to touch on things like performance, scalability, and making the most of your existing hardware and database system (sunk cost), so read on!
what is ssd?
a solid state disk / drive (ssd) is electrically, mechanically and software compatible with a conventional (magnetic) hard disk.
the difference is that the storage medium is not magnetic (like a hard disk) or optical (like a cd) but solid state semiconductor such as battery backed ram, eprom or other electrically erasable ramlike chip such as flash.
this provides faster access time than a hard disk, because ssd data can be randomly accessed in the same time whatever the storage location. the ssd access time does not depend on a read/write head synchronising with a data sector on a rotating disk. the ssd also provides greater resilience to physical vibration, shock and extreme temperature fluctuations. ssds are also immune to strong magnetic fields, which could sanitize a hard drive.
why does it make a difference?
well, if you look at the statistics or performance comparisons against a standard hard drive, you’ll notice that in general the reads are 10x faster (minimum) and the writes are 5x faster. sometimes these increases are a lot more; i’ve seen read speeds advertised at 40x faster when you get into high-end ssd storage. but this isn’t the only reason that it makes a difference. sure, speed is important, but in the rdbms (relational database management system) world, the data model also makes a huge difference.
i wrote a blog entry not too long ago about the future of data modeling, and i still believe that to hold true, but what i didn’t address is the interim, the today: what companies can do to save time and handle more data faster using existing hardware/software, leveraging their investments, all without changing to a nosql database/appliance or column store. i also did not address the benefits of sticking with your existing systems, applications, processes, and procedures. in other words, how to save money today with what is almost a plug and play storage system, and what to do to handle growth/expansion without jumping into the cloud.
so, again: why does ssd make a difference?
ok, it impacts a lot more than just speed and performance! it can actually ease some of the architectural decisions that plague the data models of today. these are the same data models that cause problems with scalability, accountability, real-time inflows, etc… the symptoms of a bad data model include:
- performance degradation
- chained rows
- join operations
- indexing coverage (lack thereof, or too many)
- lack of parallel i/o abilities at the disk level
- poor storage choice (raid 5, raid s, raid 10+1, etc.)
- over-partitioning or incorrect partitioning (could be a symptom of a bad data model!)
these are but a few of the symptoms that can be caused by “poor” data models – especially when the data warehouse or business intelligence system grows to 45 / 50 terabytes or more. but then again, ssd is a hardware component and to be fair – it simply replaces the disk – so why all this talk about the connection to the data model?
in order to answer this, let’s look at the root causes of performance problems:
- tables too wide, so too few rows fit per block, causing double, triple, or quadruple the i/os
- too many indexes causing double/triple the i/o’s for massive inserts/updates
- not enough indexes, causing table scans (more i/os) over large data sets when queried
- hierarchical data sets n levels deep, causing recursive i/o’s to access (query) the data set in random read order
- real-time inflow overwhelming the i/o buffering, causing wait times for all queries, and all other write processes (think double i/o for logging as well, and triple i/o for updating indexes – listed earlier)
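to put rough numbers on the first two causes, here is a minimal sketch. the 8 kb block size, the row widths, and the "one write per table, per log entry, and per index" rule are all assumptions chosen for illustration, not measurements from any particular rdbms, and the function names are hypothetical.

```python
import math

BLOCK_BYTES = 8192  # assumed block size; real systems vary


def rows_per_block(row_bytes: int) -> int:
    """Whole rows that fit in one block (ignores block header overhead).

    A row wider than the block is counted as one row per block,
    which understates the cost of chained rows.
    """
    return max(1, BLOCK_BYTES // row_bytes)


def blocks_to_scan(row_count: int, row_bytes: int) -> int:
    """Block reads needed to scan every row once."""
    return math.ceil(row_count / rows_per_block(row_bytes))


def insert_write_ios(row_count: int, index_count: int) -> int:
    """Rough write count per inserted row: table + log + one per index."""
    return row_count * (1 + 1 + index_count)


narrow_scan = blocks_to_scan(1_000_000, row_bytes=200)  # 40 rows per block
wide_scan = blocks_to_scan(1_000_000, row_bytes=800)    # 10 rows per block

print(narrow_scan, wide_scan)       # the wide table needs 4x the reads
print(insert_write_ios(1_000, 0))   # no indexes
print(insert_write_ios(1_000, 2))   # two indexes: double the writes
```

under these assumptions, making the rows four times wider quadruples the block reads for a full scan, and adding two indexes doubles the writes for an insert.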
data modeling can control some of these issues by vertically partitioning (normalizing) the data set. this allows the data sets to scale with parallelism; however, it requires additional cpu processing power and more i/o’s for joining the data together. the saving grace (in most cases) is that you can architect the data model for a balanced approach to parallel i/o operations, avoiding the dreaded table scans, and avoiding over-indexing. when you avoid table scans and avoid over-indexing, the i/o count can drop a little bit; but at the end of the day it’s all about parallelism and how many parallel i/o channels you have.
there is your connection: data model normalization can increase parallelism (by design) which increases performance, and in some cases can reduce the number of i/o’s by increasing the total number of rows per block. but yes, it increases i/os by increasing the joins.
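the trade-off can be sketched with a deliberately idealized model. everything numeric here is an assumption: the 8 kb block, the 800-byte wide row split into four 200-byte tables, the 20% extra i/o for joins, the wide table being read over a single channel, and a 10 ms hard disk i/o. the only figure taken from this article is the "10x faster" read ratio used for the ssd.

```python
import math

BLOCK_BYTES = 8192   # assumed block size
ROWS = 1_000_000     # assumed row count


def scan_ios(rows: int, row_bytes: int) -> int:
    """Block reads needed to scan a table once."""
    return math.ceil(rows / max(1, BLOCK_BYTES // row_bytes))


def elapsed_us(total_ios: int, channels: int, us_per_io: int) -> int:
    """Idealized elapsed time: I/Os spread evenly over independent channels."""
    return math.ceil(total_ios / channels) * us_per_io


# One denormalized table with 800-byte rows, read over a single channel.
wide_ios = scan_ios(ROWS, 800)

# The same columns split into four 200-byte tables, plus an assumed
# 20% of extra I/O to join them back together.
base_ios = sum(scan_ios(ROWS, width) for width in (200, 200, 200, 200))
normalized_ios = base_ios + base_ios * 20 // 100

HDD_US = 10_000          # assumed 10 ms per random I/O on disk
SSD_US = HDD_US // 10    # the article's "10x faster" read figure

wide_hdd = elapsed_us(wide_ios, channels=1, us_per_io=HDD_US)
norm_hdd = elapsed_us(normalized_ios, channels=4, us_per_io=HDD_US)
wide_ssd = elapsed_us(wide_ios, channels=1, us_per_io=SSD_US)
norm_ssd = elapsed_us(normalized_ios, channels=4, us_per_io=SSD_US)

print(wide_ios, normalized_ios)  # joins make the normalized model do more I/O
print(wide_hdd, norm_hdd)        # but four channels finish sooner
print(wide_ssd, norm_ssd)        # and faster I/O shrinks both
```

in this model the normalized design performs more total i/o because of the joins, yet finishes sooner because the work is spread across four channels. a real system will not divide its i/o this evenly, so treat the output as a way to reason about the trade-off rather than a prediction.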
what is ssd really good at? high speed parallel, random read/write with little to no i/o blocking.
in other words – it greatly enhances parallel data access at much higher speeds when compared to traditional disk. but: don’t make the mistake of thinking that denormalized data models will see the same or better performance increases as normalized models. this simply is not true, except in the case of an appliance like netezza where the hardware/firmware works at knowing where the data lives on disk, and thrives on wide-data sets. ssd’s in netezza (without a bit of re-engineering on their part) may in fact slow down their device… but then again, only until they adapt their own algorithms to work with ssd’s.
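if you want to test the parallel random read claim on your own hardware, a rough micro-benchmark along these lines could be a starting point. this is a sketch under stated assumptions: it relies on os.pread (available on unix-like systems), the file name, block size and worker counts are arbitrary, and a file this small will be served from the operating system cache, so it only demonstrates the access pattern. for a real comparison you would need a file much larger than ram on the device under test.

```python
import os
import random
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

BLOCK = 8192  # assumed block size in bytes


def make_test_file(path: str, blocks: int) -> None:
    """Write `blocks` blocks of random bytes to `path`."""
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(os.urandom(BLOCK))


def random_read(path: str, offsets: list) -> int:
    """Read one block at each offset; return the total bytes read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return sum(len(os.pread(fd, BLOCK, off)) for off in offsets)
    finally:
        os.close(fd)


def parallel_random_read(path, blocks, reads_per_worker, workers, seed=0):
    """Issue random block reads from several threads; return (bytes, seconds)."""
    rng = random.Random(seed)
    plans = [
        [rng.randrange(blocks) * BLOCK for _ in range(reads_per_worker)]
        for _ in range(workers)
    ]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(lambda plan: random_read(path, plan), plans))
    return total, time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "blocks.bin")
    make_test_file(path, blocks=256)
    total_bytes, seconds = parallel_random_read(
        path, blocks=256, reads_per_worker=100, workers=4
    )

print(f"read {total_bytes} bytes in {seconds:.4f}s with 4 workers")
```

running it with different worker counts against a hard disk and an ssd would show how each device copes with concurrent random access.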
but, if you’re thinking: how do i keep my current investment intact without jumping to the cloud and/or investing in an appliance? let’s talk about it.
replace some of your disks in your current data warehousing environment with high speed ssd’s (i would say as many as you can afford, because high speed ssd’s are not cheap!). it’s mostly a seamless operation, as the ssd vendors have done an outstanding job at making it plug and play with current interfaces. ssd’s dramatically increase the performance of (you guessed it) i/o’s.
so, if you couple a data vault design data model (normalized) with an ssd device for your most important data, you will end up with the highest possible performance in your existing systems without needing to buy an appliance, move to a new platform, or re-architect your entire solution. you can save your investment. by the way, if your rdbms vendor doesn’t currently “support” or advocate ssd’s, you can push them to integrate with them. there’s a really good chance that the engineering required is very minimal on their part.
where do these hypotheses come from?
believe it or not, i watch the gaming industry. why? because they are always pushing the limits of what current hardware can do, graphics, input devices, caching, i/o’s, networking – all have to be high-speed and all have to be parallel. if the gaming industry can receive benefits, then the bi industry can receive benefits in multiples. here’s a study that compares actual performance results of ssd to hard drives. http://www.samsung.com/global/business/semiconductor/products/ssd/downloads/ssd_vs_hdd_is_there_a_difference_rev_3.pdf
i would argue that bi/edw has more to gain in performance than the gaming industry, as we frequently deal with a lot more data sets – heavy in the read/write category and that’s where ssd shines.
what about other appliances? can ssd help there?
yes, it can. it could be a boon to column based appliances, oracle exadata (not necessarily an appliance, although that’s debatable), netezza, teradata warehouse appliance, and more… again, the vendors of the appliances may need to make changes to their own hardware/software to really see the benefits, but there would be an incredible boost to performance across the board. finally, those multi-core cpu’s may begin to see a little bit of load!
the ssd is clearly here to stay, and i will say that i believe – if you want to leverage your existing investments and take the next step in performance, then make the jump to ssd. but make the jump for only your most critical or hot data (as temperature of data is concerned), because today it is cost prohibitive (for high-speed ssd). i would also say in order to make the most of parallel random accesses, add a data vault normalized data warehouse to the mix. you don’t have to move your whole system – the data vault modeling techniques are agile in nature and allow you to build the system incrementally. i teach you how to make this work inside my on-line training at: http://datavaultalliance.com.
i will be experimenting with ssd on a new laptop (probably early next year), and will release some performance results after that as well, so we can all see what might be possible.
i hope this helps your endeavors to handle large data, real-time feeds, and existing investments. your current system does not need to be mothballed (at least not yet).
as always, if you have thoughts, comments or stories you’d like to share (even if ssd didn’t work for you), i’d love to hear them. register for free, then add your comment.