The most portable way to transmit data is as text. Comma Separated values (CSV) and Tab Separated Values (TSV) are old formats, but even in this age of XML they are still in common use as a way of serialising database records.

To process this data, you have to get it off the hard disk and into RAM. To do that you will need to use string processing routines which are computationally expensive. That may not matter if you do this conversion once, and you keep your data in program memory. When your data is large, however, you may need to swap to and from disk, and string operations will kill your performance.

So before we start, we are going to convert the nice, portable text representation of the Netflix data into something more easily digested by your algorithm, which will take less time to load from disk.

Next Page