Wednesday, January 21, 2015

Regular expressions

http://www.regular-expressions.info/
This website gives incredibly in-depth explanations of regular expressions. In particular, it elaborates on the subtle differences in regex support across popular languages and tools, which I found extremely helpful.

Saturday, January 17, 2015

Pitfalls for Scala beginners --- Part 1

Type erasure. Consider the following code:
class Box[E](e: E) {
  def unbox: E = e
  def rebox(e: E): Box[E] = new Box(e)
}
val box1 = new Box(1)
val boxes: Seq[Box[_]] = Seq(box1) // the type info of Int is lost due to "_"
val box2 = boxes.head 
box1.rebox(box1.unbox)  // ok
box2.rebox(box2.unbox)  // compile-time error!
The last line does not compile because the Scala compiler cannot tell whether the type of box2.unbox matches the (unknown) type parameter of box2. One way to solve this problem is to introduce a fresh type parameter for the rebox method, so that the argument type of rebox is allowed to differ from the type parameter of Box:
class Box[E](e: E) {
  def unbox: E = e
  def rebox[F](e: F): Box[F] = new Box(e)  // use a new type parameter for rebox
}
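With this change, the call that failed above should type-check, since rebox no longer requires its argument to have exactly the element type of the receiver. A quick sketch (box3 is a stand-in for box2 from the earlier snippet):
val box3: Box[_] = new Box(1)
box3.rebox(box3.unbox)  // compiles: F is inferred to be the unknown element type of box3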
Capital identifiers. Scala's pattern-matching mechanism presumes that, within patterns, all identifiers starting with a capital letter denote constants. Hence, val (a,b) = (1,2) is fine, but val (A,b) = (1,2) does not compile because A is undefined. On the other hand, if you define, say, val A = 2 before the match, the code compiles but raises a MatchError at runtime instead.
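A minimal sketch of all three cases (meant to be run in a method body or worksheet):
val (a, b) = (1, 2)    // fine: binds a = 1 and b = 2
// val (A, b) = (1, 2) // does not compile: A refers to an undefined constant
val A = 2
val (A, c) = (1, 2)    // compiles (A is read as the constant 2) but throws a MatchError at runtime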

Iterated binding. According to section 4.1 of The Scala Language Specification, a value definition val p1,...,pn = e is a shorthand for the sequence of value definitions val p1 = e; ...; val pn = e. Hence, the following code
val next = { var n = 0; () => { n = n + 1; n } }
val p1, p2, p3, p4, p5, p6, p7, p8 = next()
binds $p_i$ to $i$ for each $i$.

Overridden val. The Scala compiler binds a val only once. Hence, if a val is overridden in a subclass, it is initialized in that subclass and holds its default value (e.g., 0 or null) until then, even though it is also initialized in the superclass. For example,
trait A { 
  val foo = 10 
  println(s"foo in A: $foo") 
}
class B extends A {
  override val foo = 11
  println(s"foo in B: $foo") 
}
new B // prints "foo in A: 0", "foo in B: 11"
The point is that access to foo goes through an accessor method foo that is overridden in B. Hence, even though there is a field foo in the superclass, it is effectively hidden from the constructors, and foo returns the value set in B only after the parent constructors have run. The same holds in Java: a superclass constructor that calls an overridden method dispatches to the subclass implementation, which sees the subclass's fields still at their default values.

Binding vs assignment. In Scala, val denotes binding and var denotes assignment. When a variable is declared with var, a closure captures its reference rather than its value, so the variable remains mutable after it has been captured. Consider the following code:
import scala.collection.mutable.Buffer

val fs = Buffer[() => Unit]()
{ // closure I
  val a = 1
  var b = 2 
  fs += (() => println(a + " " + b)) // closure II
  b = 3  
}
val a = 4
fs foreach { f => f() } // prints "1 3"
Inside closure II, a is bound to a value (i.e., 1), while b is captured by reference. When the value of b changes, the new value is visible to every closure that can access b. Hence, use val if you want a closure to capture a value.

Right-associative operators. In Scala, every operator whose name ends in a colon (":"), including custom ones, is right-associative and is invoked on its right operand. For example, the List concatenation a ::: b ::: c is translated to the method calls c.:::(b).:::(a).
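As a small illustration, here is a sketch with a hypothetical Chain class (not from the standard library):
case class Chain(parts: List[String]) {
  // An operator ending in ':' is right-associative and is invoked on its right operand.
  def +:(s: String): Chain = Chain(s :: parts)
}
val chain = "a" +: "b" +: Chain(List("c"))  // parsed as Chain(List("c")).+:("b").+:("a")
// chain == Chain(List("a", "b", "c"))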

Type widening. Suppose you have an implicit conversion from String to Option:
implicit def StrToOpt(s: String): Option[String] = Option(s)
val opt1: Option[String] = "a".get           // Ok due to conversion
val opt2: Option[String] = "a".getOrElse("") // Boo! type-mismatch error!
The problem stems from how type inference interacts with implicit search. The signature
 def getOrElse[B >: A](default: => B): B
allows the result type to be widened. Thus, when the compiler realizes that the result type of getOrElse does not match the expected type, it tries to generalize the type. In this case, it widens the type to a supertype of String such as Serializable, and then gets stuck there. A solution is to pin the type argument explicitly:
val opt2: Option[String] = "a".getOrElse[String]("")
Now the return type is forced to be String, so the compiler finds the implicit conversion without wandering around the type hierarchy. The lesson: in the presence of type widening/narrowing, the compiler may generalize a type too far before checking for implicits; an explicit type argument (or annotation) helps avoid this problem.

Iterators vs Traversables. Iterators in Scala provide analogues of most of the methods found in the Traversable classes. For instance, both provide a foreach method that executes a given procedure on each element. The biggest difference between iterators and traversables is that an iterator has state: it maintains a cursor that is advanced as elements are consumed and cannot be rewound. Hence, you cannot reuse an iterator after invoking foreach on it. See this document for details.
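A minimal sketch of the difference:
val it = Iterator(1, 2, 3)
it foreach println       // prints 1, 2, 3
println(it.hasNext)      // false: the iterator is now exhausted
val xs = List(1, 2, 3)
xs foreach println       // prints 1, 2, 3
xs foreach println       // prints 1, 2, 3 again: a List can be traversed repeatedly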

Note: Do not confuse the Iterator classes with the Iterable classes. See this document for the full hierarchy of Scala's collections.

Sunday, January 11, 2015

Inheritance of Objects in Scala

Extending an object in Scala is not allowed due to the singleton semantics of objects. However, at times you may be tempted to do so, e.g., to add some methods to an existing object. There are two ways to fulfill this requirement:

Refactoring. If possible, try to extract the interesting behaviors of the target object into an abstract trait, so that you can extend a trait instead of an object. This is also the standard solution in OOP textbooks.
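For example, a minimal sketch of the refactoring approach, reusing the same X and Y as in the snippet below:
trait HasA { def a = 1 }                // the behavior extracted from X
object X extends HasA
object Y extends HasA { def b = 2 }     // Y obtains X's behavior by mixing in the trait
println(Y.a)  // prints 1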

Implicit conversion. You can make an object Y look like an extension of an object X by using an implicit conversion:
object X { def a = 1 } 
object Y { def b = 2 }
implicit def YToX(y: Y.type) = X
println(Y.a)  // prints 1
In this way, you can call methods on Y that were originally defined for X. Note that Y doesn't actually extend X, so this design still differs from OO inheritance in important ways: for example, you cannot override methods of X, and you cannot substitute Y where an X is expected.

Saturday, January 10, 2015

Basic notions of Lambda calculus

$\alpha$-conversion renames bound variables to fresh names, e.g., before a substitution to avoid variable capture. $\alpha$-equivalence equates terms that differ only in the names of their bound variables. Note that we consider two normal forms equal if they are $\alpha$-equivalent.

$\beta$-reduction applies a $\lambda$-abstraction to another term: $(\lambda x.\, e)\ e' \mapsto e[e'/x]$. Two expressions are $\beta$-equivalent if they can be $\beta$-reduced to the same expression.
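A small worked example: $(\lambda x.\, x\ x)\ (\lambda y.\, y) \mapsto (\lambda y.\, y)\ (\lambda y.\, y) \mapsto \lambda y.\, y$; hence $(\lambda x.\, x\ x)\ (\lambda y.\, y)$ and $\lambda y.\, y$ are $\beta$-equivalent.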

$\eta$-conversion converts between $\lambda x.\,(f\ x)$ and $f$ whenever $x \not\in FV(f)$. It expresses the idea that two functions are the same if and only if they give the same result for all arguments; hence a function and its $\eta$-conversion are $\eta$-equivalent. In FP, we usually use $\eta$-expansion to make the arguments of an expression explicit as bound variables, e.g., to convert $e: A\rightarrow B \rightarrow C$ into $\lambda x.\,\lambda y.\,e\ x\ y$.

$\eta$-conversion (together with rule $M \equiv N \implies M\ x \equiv N\ x$) yields the Extensional Principle: $M\ x \equiv N\ x \implies M\equiv N$, given that $x\not\in FV(M)\cup FV(N)$. This principle expresses the lambda calculus form of the notion of extensional equality for functions. That is, two functions $f, g: A\rightarrow B$ are indistinguishable iff $f\ x \equiv g\ x$ for all $x:A$.

The term redex, short for reducible expression, refers to a subterm that can be reduced by one of the reduction rules. For example, $(\lambda x.\,M)\ N$ is a $\beta$-redex, and $\lambda x.\,M\ x$ is an $\eta$-redex if $x$ is not free in $M$. The expression to which a redex reduces is called its reduct. Continuing the previous example, the reducts of these expressions are $M[N/x]$ and $M$, respectively.

Thursday, January 8, 2015

Spark glossary

Data structures

An RDD (Resilient Distributed Dataset) is a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data-flow models like MapReduce. As a data structure, an RDD is a lazy, read-only, partitioned collection of records created through deterministic transformations (e.g., map, join and group-by) on other RDDs.
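A minimal sketch in Spark's Scala API (local mode; the names are illustrative): transformations are lazy and only record how each RDD is derived from its parents, while an action such as reduce triggers the actual computation.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))
val nums    = sc.parallelize(1 to 100)     // base RDD created from a local collection
val evens   = nums.filter(_ % 2 == 0)      // transformation: lazy, nothing is computed yet
val squares = evens.map(n => n * n)        // transformation: lazy, nothing is computed yet
println(squares.reduce(_ + _))             // action: triggers the actual computation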

The lineage of an RDD is the series of transformations used to build it. More loosely, it also refers to the information needed to derive an RDD from other RDDs. Logging lineage allows an RDD to reconstruct lost partitions without requiring costly checkpointing and rollback.

RDDs vs. DSM. In distributed shared memory (DSM) models, applications read and write to arbitrary locations in a global address space. DSM is a very general abstraction, but this generality makes it harder to implement in an efficient and fault-tolerant manner on commodity clusters. The main difference between RDDs and DSM is that RDDs can only be created (“written”) through bulk transformations, while DSM allows reads and writes to each memory location. This restriction allows RDDs to provide fault tolerance more efficiently: in particular, they do not need to incur the overhead of checkpointing, since they can be recovered from their lineage.

The execution model

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program, called the driver program. A worker node is a node that can run application code in the cluster. Each worker node can host several executor processes, which stay up for the duration of the whole application and run tasks in multiple threads; since every executor needs at least one core, the number of executors on a node is limited by the number of cores of that node. A task is a unit of work that is sent to one executor. A job is a parallel computation consisting of multiple tasks, spawned in response to a Spark action such as reduce. Each job is divided into smaller sets of tasks, called stages, that depend on each other (similar to the map and reduce stages in MapReduce).
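To make these terms concrete, a small sketch that continues with the sc from the snippet above (local mode is assumed): the reduceByKey below introduces a shuffle, so the collect action spawns a job consisting of two stages, whose tasks are shipped to the executors.
import org.apache.spark.SparkContext._  // in older Spark versions, brings in pair-RDD operations such as reduceByKey

val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)  // the shuffle here marks a stage boundary
counts.collect() foreach println                        // action: submits a job consisting of two stages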