Scala algorithm: Remove duplicates from a sorted list (state machine)

Algorithm goal

In streaming data, elements may arrive duplicated for various reasons, such as duplicated data from sources or idempotent re-sends made for redundancy. For consumption, though, we may need to deduplicate the data so that each element is processed at most once. Some deduplicators retain state about long-gone elements (in which case .distinct will suffice, at a memory cost), but here we look only at consecutive duplicate elements.
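To make the difference concrete, here is a minimal sketch contrasting .distinct with the removal of consecutive duplicates only. The object name `DistinctVsConsecutive` and the helper `dropConsecutiveDuplicates` are hypothetical names chosen for this illustration; the helper uses a plain foldLeft rather than the state machine this page is about.

```scala
object DistinctVsConsecutive {
  // Hypothetical helper: keeps an element only when it differs from the one
  // immediately before it, so it needs to remember just one previous value.
  def dropConsecutiveDuplicates[T](input: List[T]): List[T] = {
    val reversed = input.foldLeft(List.empty[T]) {
      // Same as the most recently kept element: skip it.
      case (acc @ (last :: _), element) if last == element => acc
      // First element, or a value that differs from the previous one: keep it.
      case (acc, element) => element :: acc
    }
    reversed.reverse
  }
}
```

On List(1, 1, 2, 1), .distinct yields List(1, 2) because it remembers every value it has seen, whereas dropConsecutiveDuplicates yields List(1, 2, 1). On sorted input the two results coincide, which is why tracking only the previous element is enough for this problem.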

Here the goal is to implement a deduplicator that works with any collection or streamed input, using a state machine.

This is an alternative approach to RemoveDuplicatesFromSortedListSliding.

Test cases in Scala

assert(removeDuplicates(List.empty[Int]) == List.empty[Int])
assert(removeDuplicates(List(1)) == List(1))
assert(removeDuplicates(List(1, 1)) == List(1))
assert(removeDuplicates(List(1, 2)) == List(1, 2))
assert(removeDuplicates(List(1, 1, 2)) == List(1, 2))
assert(removeDuplicates(List(1, 1, 2, 3, 3)) == List(1, 2, 3))
assert(
  removeDuplicatesLazyList(LazyList(1, 1, 2, 3, 3)).toList == List(1, 2, 3)
)

Algorithm in Scala

30 lines of Scala (version 2.13), showing how concise Scala can be!

Explanation

stateDiagram
    [*] --> Start
    Start --> FirstOfElement
    FirstOfElement --> SeenElement
    SeenElement --> FirstOfElement
        

We begin with the most fundamental streaming abstraction: an immutable state that, given the next element, produces another immutable state. It has an Emit method and an Include method.

At the start of the stream, we have nothing to emit, so we do not emit anything.
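The description above can be sketched as follows. This is not the site's implementation, only one possible reading of the diagram and the text: the state names come from the diagram, `emit` and `include` follow the method names mentioned above, and everything else (the object name `ConsecutiveDeduplicator`, the use of scanLeft, and the assumption that a state stays in SeenElement while the same value repeats) is an assumption made for illustration.

```scala
object ConsecutiveDeduplicator {
  // One state per node of the diagram. `emit` is what a state contributes to
  // the output (if anything); `include` consumes the next element and returns
  // the next state. States are immutable.
  sealed trait State[T] {
    def emit: Option[T]

    def include(element: T): State[T] = this match {
      // Same value as the current run: it has already been emitted.
      case FirstOfElement(value) if value == element => SeenElement(value)
      case SeenElement(value) if value == element    => SeenElement(value)
      // Start of the stream, or a value that differs: a new run begins.
      case _                                         => FirstOfElement(element)
    }
  }

  // Nothing has been read yet, so there is nothing to emit.
  final case class Start[T]() extends State[T] {
    def emit: Option[T] = None
  }

  // First occurrence of a value in its run: this is the one that is emitted.
  final case class FirstOfElement[T](value: T) extends State[T] {
    def emit: Option[T] = Some(value)
  }

  // Repeat of the value just emitted: suppressed.
  final case class SeenElement[T](value: T) extends State[T] {
    def emit: Option[T] = None
  }

  // scanLeft threads the state through the input and keeps every intermediate
  // state; flatMap then keeps only the values those states emit.
  def removeDuplicates[T](input: List[T]): List[T] =
    input
      .scanLeft[State[T]](Start[T]())((state, element) => state.include(element))
      .flatMap(_.emit)

  // Same pipeline on a LazyList, so elements are processed only on demand.
  def removeDuplicatesLazyList[T](input: LazyList[T]): LazyList[T] =
    input
      .scanLeft[State[T]](Start[T]())((state, element) => state.include(element))
      .flatMap(_.emit)
}
```

Because the state only ever holds the current value, memory use does not grow with the length of the input, and the LazyList variant can be applied to an unbounded stream as long as only a finite prefix of the result is consumed.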

Scala concepts & Hints

  1. Lazy List

    The 'LazyList' type (previously known as 'Stream' in Scala) is used to describe a potentially infinite list that evaluates only when necessary ('lazily').

  2. Option Type

    The 'Option' type is used to describe a computation that either has a result or does not. In Scala, you can 'chain' Option processing and combine it with lists and other data structures. For example, you can also turn a pattern-match into a function that returns an Option, and vice-versa!

    assert(Option(1).flatMap(x => Option(x + 2)) == Option(3))
    
    assert(Option(1).flatMap(x => None) == None)
    
  3. scanLeft and scanRight

    Scala's `scan` functions let you perform folds like foldLeft and foldRight while collecting the intermediate results.

    assert(List(1, 2, 3, 4, 5).scanLeft(0)(_ + _) == List(0, 1, 3, 6, 10, 15))
    
  4. Stack Safety

    Stack safety means that a function cannot crash by exceeding the limit on the depth of recursive calls.

    The function below works for a small range such as sum(1, 5), but for a sufficiently large range it crashes with java.lang.StackOverflowError, because every step adds a stack frame. The exact threshold depends on the JVM's stack size. There is, however, a way to fix it :-)

    In Scala Algorithms, we try to write the algorithms in a stack-safe way, where possible, so that when you use the algorithms, they will not crash on large inputs. However, stack-safe implementations are often more complex, and in some cases, overly complex, for the task at hand.

    // Not tail-recursive: the addition happens after the recursive call
    // returns, so every step keeps a stack frame alive.
    def sum(from: Int, until: Int): Int =
      if (from == until) until else from + sum(from + 1, until)

    def thisWillSucceed: Int = sum(1, 5)

    // Expected to overflow a default-sized JVM stack (about ten million nested calls).
    def thisWillFail: Int = sum(1, 10000000)
    
  5. State machine

    A state machine models a 'machine' as a fixed set of states and the transitions between them. In Scala, a `sealed trait` is commonly used to represent all the possible states in a hierarchical form.
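Two of the hints above can be tried out in a few lines. The sketch below is an illustration under assumptions, not code from the site: `HintSketches`, `sumStackSafe` and `firstRunningTotals` are hypothetical names. It shows one common way to make a recursive sum stack-safe (an accumulator plus `@tailrec`), and how scanLeft on a LazyList computes running totals lazily.

```scala
import scala.annotation.tailrec

object HintSketches {
  // Stack-safe variant of a recursive range sum (inclusive of `until`).
  // The accumulator makes the recursive call the last operation, so the
  // compiler turns it into a loop; @tailrec makes compilation fail otherwise.
  def sumStackSafe(from: Int, until: Int): Int = {
    @tailrec
    def go(current: Int, acc: Int): Int =
      if (current == until) acc + until
      else go(current + 1, acc + current)

    go(from, 0)
  }

  // Running totals over an infinite LazyList: only the five requested
  // results are ever computed.
  val firstRunningTotals: List[Int] =
    LazyList.from(1).scanLeft(0)(_ + _).take(5).toList
}
```

Here sumStackSafe(1, 5) gives 15, and firstRunningTotals is List(0, 1, 3, 6, 10). For very large ranges the Int result wraps around, but the call no longer fails with a StackOverflowError.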
