June 20, 2010

Mutable vs Immutable Data Structures - Serialization vs Performance

In my last post, I was playing around with ways to serialize Clojure data structures, in particular a complex record that contains a number of other records and refs. Chas Emerick and others mentioned in the comments there that putting a ref inside a record is probably a bad idea, and I agree in principle. But this brings me to a dilemma.

Let's assume I have a complex record that contains a number of "sub" records that need to be modified while the program runs. One scenario where this could happen is a record called "Table" that contains "Row" records which get updated (think database tables and rows). This can be implemented in two ways:

Mutable data structures – In this case, I would put each row inside the table in its own ref, and when an update is needed, just find the row by ID and use alter inside a dosync to make the modification.
- The advantage is that each row is updated in place, which should be rather efficient.
- The disadvantage however, is that when serializing such a record full of refs, I would have to build a function that would traverse the entire data structure and then serialize each ref by dereferencing it and then writing to a file. Similarly, I'd have to reconstruct the data structure when de-serializing from a file.

 
{:filename "tab1name",
 :tuples
 #<ref :field="" :tupdesc="" nil="">},
      :tup #<ref :name="">}
     {:recordid nil,
      :tupdesc
      {:x
       #<ref :field="">},
      :tup #<ref :name="">}}>,
 :tupledesc
 {:x
  #<ref :field="">}}

	</ref></ref></ref></ref></ref>
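A minimal sketch of the ref-per-row approach, to make the trade-off concrete. Everything here is an assumption for illustration: the table is a plain map rather than a record, the row contents are made up, and `table-of-refs`, `update-row!`, `snapshot` and `restore` are hypothetical names, not code from the earlier post.

```clojure
(require '[clojure.walk :as walk]
         '[clojure.edn :as edn])

;; Hypothetical table: each row lives in its own ref, keyed by row ID.
(def table-of-refs
  {:filename "tab1name"
   :tuples   {1 (ref {:name "alice" :x 10})
              2 (ref {:name "bob" :x 20})}})

(defn update-row!
  "Alters the row with the given ID in place, inside a transaction."
  [table row-id f & args]
  (dosync
    (apply alter (get-in table [:tuples row-id]) f args)))

(defn snapshot
  "Returns a ref-free copy of data by dereferencing every ref it meets.
  The walk runs inside dosync so all refs are read at one consistent point."
  [data]
  (dosync
    (walk/prewalk
      (fn [x] (if (instance? clojure.lang.Ref x) @x x))
      data)))

(defn restore
  "Rebuilds the per-row refs from a snapshot that was read back in."
  [snap]
  (assoc snap :tuples
         (into {} (for [[id row] (:tuples snap)] [id (ref row)]))))

;; Update one row in place, then produce plain text that can go to a file.
(update-row! table-of-refs 1 assoc :x 11)
(def text (pr-str (snapshot table-of-refs)))

;; Reading the text back and re-wrapping the rows gives a usable table again.
(def reloaded (restore (edn/read-string text)))
```

Note that `restore` has to know where the refs belong, which is exactly the extra bookkeeping described in the disadvantage above.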

Immutable data structures – This case involves putting a single ref around the entire table data structure, so that all data within the table remains immutable. To update a row, a function would return a new version of the table whose only difference is the modified row. This new version would then replace the existing in-memory value, and be propagated to disk as and when changes are committed.
- The advantage here is that having just one ref makes it very simple to serialize – simply de-ref the table, and then write the entire thing to a binary file.
- The disadvantage here is that each row change would make it necessary to return a new "table", and writing just the "diff" of the data to disk would be hard to do.
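A comparable sketch of the single-ref approach, under the same assumptions as before (plain maps instead of records, made-up rows). The names `table-ref`, `update-row!` and `changed-rows` are hypothetical. It also illustrates one point worth keeping in mind: Clojure's persistent maps share structure between versions, so the "new table" reuses every untouched row instead of copying it.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical table: one ref around an entirely immutable value.
(def table-ref
  (ref {:filename "tab1name"
        :tuples   {1 {:name "alice" :x 10}
                   2 {:name "bob" :x 20}}}))

(defn update-row!
  "Replaces the table value with a new version in which one row has changed."
  [table row-id f & args]
  (dosync
    (apply alter table update-in [:tuples row-id] f args)))

(defn changed-rows
  "Rows of new-table that are not the same object as in old-table.
  A rough starting point for a diff; it does not report deleted rows."
  [old-table new-table]
  (into {}
        (remove (fn [[id row]]
                  (identical? row (get-in old-table [:tuples id])))
                (:tuples new-table))))

(def before @table-ref)
(update-row! table-ref 1 assoc :x 11)
(def after @table-ref)

;; The untouched row is shared between the two versions, not copied.
(identical? (get-in before [:tuples 2]) (get-in after [:tuples 2]))
;; => true

(changed-rows before after)
;; => {1 {:name "alice", :x 11}}

;; Serializing is a single deref followed by a print.
(def text (pr-str @table-ref))
```

Comparing two versions row by row like this is only one possible way to approximate the "diff"; whether it is cheap enough depends on how many rows the table holds.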

 
#<ref :field="" :name="" :tup="" :tupdesc="" :tupledesc="" :tuples="" nil="">

<p>
So at this point, which method would you recommend?</p>
<p></p>
</ref>

If you have any questions or thoughts, don't hesitate to reach out. You can find me as @viksit on Twitter.