In streaming data, elements may arrive duplicated. This can happen for various reasons, such as sources that emit duplicates, or producers that deliberately re-send data for redundancy and rely on idempotency downstream. For consumption, though, we may need to deduplicate the data so that each element is processed at most once. Some deduplicators retain state about elements seen long ago (in which case `.distinct` will suffice, at a memory cost), but here we look only at consecutive duplicate elements.
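To make the distinction concrete, here is a small sketch contrasting global deduplication with consecutive-only deduplication. The `dropConsecutive` helper and the sample input are illustrative assumptions, not part of the standard library or of the original algorithm.

```scala
object DistinctVsConsecutive {
  // Hypothetical input: the value 1 reappears after a different value.
  val input: List[Int] = List(1, 1, 2, 1, 1, 3)

  // Global deduplication: has to remember every element seen so far.
  val globallyDistinct: List[Int] = input.distinct

  // Consecutive-only deduplication: only the previous element is remembered.
  // Illustrative helper, built on foldLeft with the latest element at the head.
  def dropConsecutive[T](xs: List[T]): List[T] =
    xs.foldLeft(List.empty[T]) {
      case (acc @ (last :: _), x) if last == x => acc // same as previous: skip
      case (acc, x)                            => x :: acc // new run: keep
    }.reverse

  val consecutiveOnly: List[Int] = dropConsecutive(input)
}
```

With this input, `.distinct` yields `List(1, 2, 3)`, while the consecutive-only version yields `List(1, 2, 1, 3)`, because the second run of 1s is not adjacent to the first.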
Here, the goal is to implement a deduplicator that works with any collection or streamed input, using a state machine.
This is an alternative approach to RemoveDuplicatesFromSortedListSliding.
Test cases in Scala
assert(removeDuplicates(List.empty[Int]) == List.empty[Int])
assert(removeDuplicates(List(1)) == List(1))
assert(removeDuplicates(List(1, 1)) == List(1))
assert(removeDuplicates(List(1, 2)) == List(1, 2))
assert(removeDuplicates(List(1, 1, 2)) == List(1, 2))
assert(removeDuplicates(List(1, 1, 2, 3, 3)) == List(1, 2, 3))
assert(
  removeDuplicatesLazyList(LazyList(1, 1, 2, 3, 3)).toList == List(1, 2, 3)
)
Algorithm in Scala
29 lines of Scala (compatible versions 2.13 & 3.0), showing how concise Scala can be!
stateDiagram
    [*] --> Start
    Start --> FirstOfElement
    FirstOfElement --> SeenElement
    SeenElement --> FirstOfElement
We begin with the most fundamental streaming abstraction: an immutable state that, given the next element, produces another immutable state. It includes an Emit method and an Include method.
At the start of the stream, we have nothing to emit, so we do not emit anything.
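The description above can be sketched as follows. This is one possible reading, not the site's actual implementation: the state names come from the diagram, while the lower-case `emit` and `include` signatures, the `ConsecutiveDeduplicator` wrapper, and the use of `scanLeft` to drive the machine are assumptions. The sketch also lets `FirstOfElement` and `SeenElement` transition to themselves, which the diagram does not draw explicitly.

```scala
object ConsecutiveDeduplicator {

  // Every state is immutable: it may emit a value, and it produces the
  // next state when given the next input element.
  sealed trait State[T] {
    def emit: Option[T]

    def include(next: T): State[T] = this match {
      case FirstOfElement(v) if v == next => SeenElement(v) // first repeat
      case SeenElement(v) if v == next    => SeenElement(v) // further repeats
      case _                              => FirstOfElement(next) // new run
    }
  }

  // Nothing has been read yet, so there is nothing to emit.
  final case class Start[T]() extends State[T] {
    def emit: Option[T] = None
  }

  // The first occurrence in a run of equal elements: this is what we emit.
  final case class FirstOfElement[T](value: T) extends State[T] {
    def emit: Option[T] = Some(value)
  }

  // A repeat of the element currently being tracked: emit nothing.
  final case class SeenElement[T](value: T) extends State[T] {
    def emit: Option[T] = None
  }

  def removeDuplicates[T](list: List[T]): List[T] =
    list
      .scanLeft[State[T]](Start[T]())((state, next) => state.include(next))
      .flatMap(_.emit)

  def removeDuplicatesLazyList[T](input: LazyList[T]): LazyList[T] =
    input
      .scanLeft[State[T]](Start[T]())((state, next) => state.include(next))
      .flatMap(_.emit)
}
```

Because each state only remembers the current element, memory use does not grow with the number of distinct values. The `LazyList` variant should also work on unbounded input, since `scanLeft` and `flatMap` on `LazyList` are evaluated lazily.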
Scala concepts & Hints
The 'LazyList' type (previously known as 'Stream' in Scala) is used to describe a potentially infinite list that evaluates only when necessary ('lazily').
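As a brief illustration of lazy evaluation, the sketch below builds an infinite `LazyList` and forces only a small prefix of it. The object and value names are illustrative.

```scala
object LazyListExample {
  // An infinite sequence 1, 2, 3, ...; elements are computed only on demand.
  val naturals: LazyList[Int] = LazyList.from(1)

  // Mapping is also lazy; only the first three squares are ever evaluated.
  val firstSquares: List[Int] = naturals.map(n => n * n).take(3).toList
}
```

Calling `.toList` directly on `naturals` would never terminate, which is why `take` comes first.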
The 'Option' type is used to describe a computation that either has a result or does not. In Scala, you can 'chain' Option processing and combine it with lists and other data structures. For example, you can also turn a pattern-match into a function that returns an Option, and vice-versa!
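The points above can be sketched in a few lines. The helper names (`parseInt`, `parsePositive`, `evenHalf`) are made up for illustration; `toIntOption`, `lift` and `Function.unlift` are standard library calls in Scala 2.13 and later.

```scala
object OptionExample {
  // Returns None instead of throwing when the text is not a valid integer.
  def parseInt(text: String): Option[Int] = text.toIntOption

  // Chaining: the filter runs only when parsing produced a value.
  def parsePositive(text: String): Option[Int] = parseInt(text).filter(_ > 0)

  // Combining with lists: flatMap keeps only the values that are present.
  val parsed: List[Int] = List("1", "x", "3").flatMap(s => parseInt(s))

  // A pattern match (partial function) turned into a function returning Option...
  val evenHalf: PartialFunction[Int, Int] = { case n if n % 2 == 0 => n / 2 }
  val halfOption: Int => Option[Int] = evenHalf.lift

  // ...and the other way round, back into a partial function.
  val backAgain: PartialFunction[Int, Int] = Function.unlift(halfOption)
}
```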
scanLeft and scanRight
Scala's `scan` functions enable you to do folds like foldLeft and foldRight, while collecting the intermediate results.
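A short comparison, using an arbitrary sample list, shows how a scan keeps every intermediate accumulator where a fold keeps only the last one:

```scala
object ScanExample {
  val numbers: List[Int] = List(1, 2, 3, 4)

  // A fold returns only the final accumulated value.
  val total: Int = numbers.foldLeft(0)(_ + _)

  // scanLeft returns every intermediate accumulator, starting with the seed.
  val runningTotals: List[Int] = numbers.scanLeft(0)(_ + _)

  // scanRight accumulates from the right, so the seed ends up last.
  val suffixTotals: List[Int] = numbers.scanRight(0)(_ + _)
}
```

Here `runningTotals` is `List(0, 1, 3, 6, 10)` and `suffixTotals` is `List(10, 9, 7, 4, 0)`; the last element of the former and the first of the latter both equal `total`.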
A function is stack-safe when it cannot crash by exceeding the limit on the number of nested recursive calls.
This function will work for n = 5, but will not work for n = 2000 (crash with java.lang.StackOverflowError) - however there is a way to fix it :-)
In Scala Algorithms, we try to write the algorithms in a stack-safe way, where possible, so that when you use the algorithms, they will not crash on large inputs. However, stack-safe implementations are often more complex, and in some cases, overly complex, for the task at hand.
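As a minimal sketch of the idea, compare a plain recursive sum with a tail-recursive one. Both functions are hypothetical examples written for this illustration, not the function referred to above, and the depth at which the naive version overflows depends on the JVM's stack size.

```scala
import scala.annotation.tailrec

object StackSafetyExample {
  // Not stack-safe: each call must wait for the inner call to return,
  // so the stack depth grows linearly with n.
  def sumToNaive(n: Long): Long =
    if (n <= 0) 0L else n + sumToNaive(n - 1)

  // Stack-safe: the recursive call is the last action, so the compiler
  // rewrites it as a loop. @tailrec makes compilation fail if it cannot.
  @tailrec
  def sumTo(n: Long, acc: Long = 0L): Long =
    if (n <= 0) acc else sumTo(n - 1, acc + n)
}
```

The accumulator parameter is the usual price of this rewrite: the tail-recursive version is slightly less direct to read, which reflects the trade-off mentioned above.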
A state machine, in this context, uses a `sealed trait` to represent all the possible states (and transitions) of a 'machine' in a hierarchical form.
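For a generic illustration of the pattern, independent of deduplication, here is a sketch of a coin-operated turnstile. The domain and all names are invented for this example.

```scala
object TurnstileExample {
  // The inputs the machine can receive.
  sealed trait Event
  case object Coin extends Event
  case object Push extends Event

  // Because the trait is sealed, the compiler knows the full set of states
  // and can warn about non-exhaustive pattern matches.
  sealed trait State {
    def next(event: Event): State
  }

  case object Locked extends State {
    def next(event: Event): State = event match {
      case Coin => Unlocked // paying unlocks the turnstile
      case Push => Locked   // pushing a locked turnstile does nothing
    }
  }

  case object Unlocked extends State {
    def next(event: Event): State = event match {
      case Coin => Unlocked // an extra coin changes nothing
      case Push => Locked   // passing through locks it again
    }
  }

  // Drive the machine over a sequence of events, starting from Locked.
  def run(events: List[Event]): State =
    events.foldLeft[State](Locked)((state, event) => state.next(event))
}
```

Replacing `foldLeft` with `scanLeft` in `run` would return every intermediate state rather than just the final one.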