Thursday, December 25, 2014

Streaming Large (Unknown) data in Python and Java Lazily

While working with a large (or unknown-size) collection of data, we always ask ourselves a few questions: does my process have enough memory to hold this collection, am I doing repetitive loops over this data, can some independent iterations be combined, can we avoid unnecessary intermediate copies, and so on. In Java, when it comes to collections, there are several collection APIs: some built in, some from open source third-party jars such as Google's Guava, to name a few. Despite the good work within the API implementations, it always comes down to how we as developers choose to solve our problem using them.
Also, in the case of huge collections, the main worry still remains: what if we do not have enough memory to hold the collection at all? In this blog I want to discuss the approach taken by Python Generators and Java Streams, which use streaming and lazy loading to improve the memory footprint of collections and their operations.
Just recently I started learning Python and came across this (now) very well known feature - Generators. Good descriptions and examples with code showing what Generators are can be found here. They are similar to Iterators in Python or Java, except that they don't need the entire collection created in advance. As the name suggests, a generator generates data only when you are actually using it - while you loop over it. How many times does it happen that we load an entire collection, say from an external source like a database or a file, loop over it, filter out something, and then return the result to the caller? Loading the source collection completely into memory doesn't make sense in this case if it is huge in size - and especially if you don't need all of it.

def readNext(fileName, chunk_size=1024):

    # the generator body does not run until first iteration; the file is
    # opened only once per generator object, not once per yielded chunk
    fileHandle = open(fileName)

    while True:
        data = fileHandle.read(chunk_size)
        if not data:
            fileHandle.close()
            break

        # after yield, data is returned to the caller; when the loop asks
        # for the next value, execution resumes here with the next iteration
        # of this while loop, until it yields again
        yield data

# caller code
for piece in readNext('bigFile.txt'):
    processData(piece)

readNext is a generator - a generator of file contents. On the first call, the code executes until the yield statement. yield returns data, which is stored in the piece variable. When the loop asks readNext for the next value, it resumes from the while True statement, skipping the first line which opens the file handle. Essentially the generator remembers where it last yielded a value and resumes execution from there.
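The resumption behaviour is easy to see by driving a small generator by hand with next() - a quick sketch, not part of the original example:

```python
def two_steps():
    # execution pauses at each yield and resumes on the next call to next()
    yield "first"
    yield "second"

gen = two_steps()
first = next(gen)   # runs the body until the first yield
second = next(gen)  # resumes right after the first yield
print(first, second)
```

A for loop over a generator is doing exactly this under the hood, stopping when the body runs out of yields.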

This is a very concise and beautiful pattern. The file contents are never loaded into memory all at once; data is streamed to the caller as and when needed. The same thing could have been achieved by writing an Iterator, but an Iterator can't come close to a Generator in terms of conciseness.
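To make the conciseness comparison concrete, here is a hypothetical iterator class (the name ChunkReader is illustrative) that does the same chunked reading - the bookkeeping that yield gives us for free must all be written by hand:

```python
import os
import tempfile

class ChunkReader:
    """Iterator equivalent of the readNext generator: same lazy behaviour,
    but state tracking and termination are explicit."""

    def __init__(self, fileName, chunk_size=1024):
        self.fileName = fileName
        self.chunk_size = chunk_size
        self.fileHandle = None

    def __iter__(self):
        return self

    def __next__(self):
        # open lazily, on the first request, like the generator does
        if self.fileHandle is None:
            self.fileHandle = open(self.fileName)
        data = self.fileHandle.read(self.chunk_size)
        if not data:
            self.fileHandle.close()
            raise StopIteration
        return data

# demo: write a small file and read it back in 4-byte chunks
tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt")
tmp.write("abcdefghij")
tmp.close()
chunks = list(ChunkReader(tmp.name, chunk_size=4))
os.unlink(tmp.name)
print(chunks)
```

Both versions stream the file lazily; the generator simply lets the language keep track of where execution left off.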

Now switching to Java: starting from Java 8 we have the new java.util.stream package. This package has various classes that help you work on data collections as streams of data. The notion is based on the fact that a Stream is not a data structure; it represents operations to be performed before returning the next value from an underlying source. So we have a stream, intermediate operations, and a terminal operation. Nothing is computed or stored until the terminal operation is called. Once we invoke the terminal operation, the stream scans the data lazily - and only as much of it as we need.
import java.util.stream.IntStream;

// Stream that represents an infinite series of integers
IntStream infiniteStream = IntStream.iterate(0, i -> i + 1);

// As the stream loads elements lazily, it knows it only needs the
// first 100 elements of the infinite series
int result = infiniteStream.limit(100).map(i -> i * 2).sum();
System.out.println("Sum of finite values " + result);
As shown in the example above, we have a stream of data coming from infiniteStream. We definitely can't hold the entire collection in memory, but that doesn't stop us from performing our operations on it. We apply two intermediate operations, "limit" and "map", over the original stream, and at last the terminal operation "sum". For such processing we never need to store the whole collection in memory. Java streams are not limited to solving this kind of problem; Streams have many more useful features.
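The same lazy pipeline can be sketched in Python with itertools, which underlines that both languages arrive at the same pull-based model (a sketch for comparison, not code from the original post):

```python
from itertools import count, islice

# count(0) is an infinite series of integers, analogous to IntStream.iterate;
# islice limits it to 100 elements, and the generator expression maps lazily -
# nothing is computed until sum() (the "terminal operation") pulls values
result = sum(i * 2 for i in islice(count(0), 100))
print("Sum of finite values", result)
```

As with the Java version, the infinite source is never materialized; sum() pulls exactly 100 values through the pipeline and stops.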

I like this approach, provided at the API level by Python and Java through Generators and Streams, of letting us decide when and how much of our source data we want to consume. It not only allows us to control the memory footprint but also makes our code more readable.