Sisyphus happy

Code Up, Dream On

Maven at work —

Apache Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project’s build, reporting and documentation from a central piece of information.

The concept of “software project management” (SPM) had never occurred to me until I started my career and worked on Hadoop at Intel. The Hadoop community was moving from Ant to Maven at the time, so Maven happened to be the first SPM tool I tried to make sense of. When starting out, my learning materials came mainly from three sources: the manuals on the Maven website, Maven: The Definitive Guide, and other tutorials on the web.

The Definitive Guide is great for getting started: it teaches the basics (how Maven works, the concepts of lifecycle and goal, etc.) and got me through my first Maven project, but it did not take me much further. The examples in the book don’t apply to the projects at my work, and they are not supposed to. When I run into a problem at work, it is much faster to Google it, which usually leads to the official manuals. The book is better for doing homework afterwards.

As for the manuals, they are a good reference when I already know which weapon to use. What if I don’t? For example, I want to manage tests with Maven but don’t know about the Surefire plugin. I usually go through the POM file of a project that works similarly to mine, look for the specific plugin or configuration that does the magic, and copy it. After several rounds of trial and error, I finally find my answer. Another problem with the manuals is that they are too verbose: I have to skip several paragraphs before finding a solution that may work. I wish I had a concise tutorial telling me which weapon to use in which situation.

Meanwhile, I’d like to write down how I solved the problems I’ve come across at work with Maven. Hence, I decided to start a series of posts called Maven at work, organized around the common usages (compile, test, distribution) in my daily work.

2013 Reading List —

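Today I boasted, without a hint of shame, that I read a dozen or so books a year. With 2013 about to end, this is a good time to look back at the books I finished over the past year and check whether I was really bragging. (Links go to Douban; an English title means I read the original edition; the list is in ranked order.)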

  1. Hackers and Painters by Paul Graham
  2. A Storm of Swords by George R.R. Martin
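  3. 《Hadoop 技术内幕》 (Hadoop Internals) by Dong Xicheng
  4. 《质数的孤独》 (The Solitude of Prime Numbers) by Paolo Giordano
  5. 《荒原狼》 (Steppenwolf) by Hermann Hesse
  6. 《宇宙之书 —— 从托勒密、爱因斯坦到多重宇宙》 (The Book of Universes) by John D. Barrow
  7. 《上帝掷骰子吗 —— 量子物理史话》 (Does God Play Dice? A History of Quantum Physics) by Cao Tianyuan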


A tale of two virtual machines —

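This week I was tormented by two virtual machines and ran into all sorts of problems, and in the end I still could not get the guest OS installed. I am reviewing the experience here to save myself time in the future, and perhaps to help others as well.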

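The task was to install a virtual machine on a machine inside the intranet (CentOS 6.2) that can only be reached through a jump host, and to eventually be able to access the VM through a user interface. Physical isolation “basically rules out” remote login via VNC or NX (would setting up a proxy server on the jump host work?), so the approach here is a web UI (in a browser) that goes through the jump host’s SOCKS5 proxy. I tried two setups in turn: VMware Server, and phpvirtualbox + VirtualBox.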


WordCount in Scala —

Reynold S. Xin from AMPLab, UCB is visiting IMC and giving courses on Spark and Shark.

Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

So basically we can run MapReduce-like jobs in Spark. Our code will probably be more intuitive and also run faster than on Hadoop. Here’s the WordCount example:
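[cc lang="scala"]
// Read the input as a distributed collection of lines
val file = spark.textFile("hdfs://…")

// Split lines into words, pair each word with 1, then sum the counts per word
file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
[/cc]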

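The same flatMap / map / reduce-by-key pipeline can be tried without a cluster. The following is a minimal sketch on plain Scala collections, not Spark itself: `LocalWordCount`, `wordCount`, and the sample lines are hypothetical names and data chosen for illustration, and `groupBy` followed by a per-key sum stands in for what `reduceByKey` does on a cluster.

```scala
object LocalWordCount {
  // Same shape as the Spark snippet, but on an in-memory Seq instead of a
  // distributed dataset (an assumption made so the example runs locally).
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))  // split each line into words
      .filter(_.nonEmpty)                // drop empty tokens from repeated spaces
      .map(word => (word, 1))            // pair each word with a count of 1
      .groupBy(_._1)                     // group the pairs by word
      .map { case (word, pairs) =>       // sum the counts for each word
        (word, pairs.map(_._2).reduce(_ + _))
      }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample input
    val lines = Seq("to be or not to be", "to code is to dream")
    wordCount(lines).toSeq.sortBy(_._1).foreach { case (word, count) =>
      println(s"$word\t$count")
    }
  }
}
```

On a real cluster the grouping step is where data moves between machines, which is why Spark exposes it as a single `reduceByKey` call rather than a separate group and sum.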

Cassandra Revisited —

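I have been reading the Cassandra source code for a while. My original plan was to write up explanations of the source on this blog, and I did write a few posts (most of them are still drafts), but the results were disappointing. Posts that contain large chunks of code are hard to read; even when I go back to recall some detail myself, I end up lost in a fog. On top of that, the source keeps changing with each Cassandra release, so such posts go stale quickly. So I want to rewrite this series with less code and more diagrams; what matters is explaining the issues clearly (bootstrap, write / read quorum, hinted handoff, and so on). Finally, I am switching to Chinese for these posts.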



  • Cassandra Daemon
  • StorageServices


  • bootstrap / decommission (unbootstrap)
  • write (hinted handoff)
  • read (read repair)
  • compaction

data model

  • commit log
  • memtable (keyspace / column family / column)
  • SSTable
  • cache (KeyCache / RowCache)
  • Token / Range


  • partitioner
  • replica placement strategy / snitch
  • consistency level
  • message service
  • gossip (failure detection)
  • streaming


  • StorageProxy
  • CLI / nodetool
  • Thrift API

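These are the things I have run into so far and that still puzzle me. I won’t go out of my way to study the code of a particular module or class any more, so some topics (such as CQL) are left out for now; I will add them when I need them.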
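As a small illustration of the consistency level and write / read quorum topics listed above, here is a minimal sketch of the arithmetic involved. It assumes the usual definition of a quorum as a strict majority of replicas and the common rule of thumb that a read sees the latest acknowledged write when the read and write replica counts together exceed the replication factor. `QuorumSketch` and its method names are hypothetical and not taken from the Cassandra code base.

```scala
object QuorumSketch {
  // Quorum size for a replication factor: a strict majority of the replicas.
  def quorum(replicationFactor: Int): Int = replicationFactor / 2 + 1

  // True when every read set must share at least one replica with every
  // write set, i.e. R + W > RF.
  def overlaps(readReplicas: Int, writeReplicas: Int, replicationFactor: Int): Boolean =
    readReplicas + writeReplicas > replicationFactor

  def main(args: Array[String]): Unit = {
    val rf = 3 // hypothetical replication factor
    val q = quorum(rf)
    println(s"RF=$rf quorum=$q, quorum read + quorum write overlap: ${overlaps(q, q, rf)}")
    println(s"RF=$rf, one-replica read + one-replica write overlap: ${overlaps(1, 1, rf)}")
  }
}
```

When the two sets do not overlap, a read may return stale data; as far as I understand, mechanisms such as hinted handoff and read repair are there to bring the replicas back in sync afterwards.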