Parallelized Streams in Scala – Not What You Thought

I had a Stream of data coming in from an input source and had to do some heavily CPU-bound work on it. The code looked like this:

stream.map(item => cpuIntensiveWork(item))

Behind the scenes, the input stream is enumerated lazily: items are pulled and mapped one by one, producing a lazily evaluated list.
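
For example, here's a minimal sketch of that laziness (assuming Scala 2.x's Stream, renamed LazyList in 2.13): mapping over an infinite stream is fine, because elements are only computed as they're demanded.

val doubled: Stream[Int] = Stream.from(1).map(_ * 2)
println(doubled.take(5).toList) // List(2, 4, 6, 8, 10) – only the first five elements are ever computed
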
Wanting to parallelize this, I added par:

stream.par.map(item => cpuIntensiveWork(item))

I imagined it would load the items into a buffer and have workers read from that buffer and execute the map while the buffer was still being filled. Turns out that’s not what happens.
What really happens is that calling par forces the whole stream into memory first, and only then is the work parallelized. Hardly what I was aiming for.
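
One way I could have gotten closer to the behaviour I wanted (a rough sketch, not what I actually ran): read the input as an Iterator, pull it in fixed-size chunks, and parallelize each chunk as it arrives, so only one chunk has to sit in memory at a time. The chunk size and the helpers readItems and cpuIntensiveWork below are placeholders, and this assumes Scala 2.12, where par is available out of the box (on 2.13+ it needs the scala-parallel-collections module and import scala.collection.parallel.CollectionConverters._).

object ChunkedPar {
  // Hypothetical stand-ins for the real input source and the CPU-bound work.
  def readItems(): Iterator[Int] = (1 to 100000).iterator
  def cpuIntensiveWork(item: Int): Long = item.toLong * item

  def main(args: Array[String]): Unit = {
    val results: Iterator[Long] =
      readItems()
        .grouped(1000)                                         // keep at most 1000 items in memory at once
        .flatMap(chunk => chunk.par.map(cpuIntensiveWork).seq) // each chunk is mapped in parallel

    println(results.sum) // forcing the iterator processes chunks as they are read
  }
}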


3 thoughts on “Parallelized Streams in Scala – Not What You Thought”

  1. Are you sure you need a stream and not an iterator? Not sure if par works better there, but for what you’re doing here it actually makes a lot of sense for it to behave this way – a stream is lazily evaluated, but it still has to keep the entire list after it’s been evaluated.

  2. I don’t mind it being stored in memory. My issue is that nothing gets processed in parallel until every item in the stream has been read, which defeats the purpose.

    • Oh. Well, I guess you don’t need me for ideas on how to parallelize it in other ways, without using par…
      Still, I’d give iterators a shot, it can’t hurt.
