One term I use to describe myself is “software engineer.” Today I learned (TIL) about the origin of this phrase:
“I began to use the term ‘software engineering’ to distinguish it from hardware and other kinds of engineering,” Hamilton told Verne’s Jaime Rubio Hancock in an interview. “When I first started using this phrase, it was considered to be quite amusing. It was an ongoing joke for a long time. They liked to kid me about my radical ideas. Software eventually and necessarily gained the same respect as any other discipline.”
Source: Vox.com: Meet Margaret Hamilton, the badass ’60s programmer who saved the moon landing
Reading time: 1 minute
I learn most effectively through visualization, verbal explanation (e.g. to the proverbial rubber duck), or by understanding the semantics or even etymology of how a concept, idea or theory is described. The following is an example of the latter.
A week or so ago, the obvious way to semantically distinguish between the different types of depth-first tree traversals suddenly (and finally!) crystallized in my mind: pre-order, in-order and post-order. The prefix, i.e. the part that semantically differentiates the approaches, refers to the relative position at which the root node is visited. Doh! Note that the left child node is always visited before the right child node.
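The naming convention can be made concrete in a few lines. A minimal sketch (illustrative names, not from any particular library):

```python
# Each traversal differs only in WHERE the root visit happens relative
# to its children; the left child is always visited before the right.

class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def pre_order(n):
    """Root first, then left, then right."""
    if n is None:
        return []
    return [n.value] + pre_order(n.left) + pre_order(n.right)

def in_order(n):
    """Left, then root in the middle, then right."""
    if n is None:
        return []
    return in_order(n.left) + [n.value] + in_order(n.right)

def post_order(n):
    """Left, then right, then root last."""
    if n is None:
        return []
    return post_order(n.left) + post_order(n.right) + [n.value]

# A tiny tree:   2
#               / \
#              1   3
root = Node(2, Node(1), Node(3))
print(pre_order(root))   # [2, 1, 3]  -- root comes first
print(in_order(root))    # [1, 2, 3]  -- root comes in the middle
print(post_order(root))  # [1, 3, 2]  -- root comes last
```

On this tree the position of `2` (the root) in each output is exactly what the prefix promises.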
In the world of large-scale data analytics, there is batch processing and real-time processing. Sometimes the former is fine, such as when log files need to be periodically analyzed using a framework like MapReduce. In applications such as advertising, however, processing data in real time can become paramount. This is where Apache Storm comes in.
I used Udacity’s Real-Time Analytics with Apache Storm course to work with Storm: setting up my own topologies, going from a basic word-count application to one linked to the Twitter sample stream.
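The shape of that word-count topology can be sketched without Storm itself. The sketch below uses plain Python generators as stand-ins for a spout and two bolts; the names and data are illustrative, not Storm's API:

```python
from collections import Counter

def sentence_spout():
    """Stand-in for a Storm spout: emits a stream of sentences."""
    for sentence in ["the cow jumped over the moon",
                     "the man went to the store"]:
        yield sentence

def split_bolt(sentences):
    """Stand-in for a split bolt: emits one word per incoming sentence tuple."""
    for sentence in sentences:
        for word in sentence.split():
            yield word

def count_bolt(words):
    """Stand-in for a count bolt: keeps running word counts."""
    counts = Counter()
    for word in words:
        counts[word] += 1
    return counts

# Wire the stages together, spout -> split -> count:
counts = count_bolt(split_bolt(sentence_spout()))
print(counts["the"])  # 4
```

In real Storm the stages run as separate, parallelized components connected by stream groupings, but the dataflow is the same.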
On testing Topologies:
Gradually construct your topology.
Start on one node to test, then distribute the topology.
Test a sample data set on a small setup, and check whether that setup has enough throughput capacity to handle the volume of incoming data. This is done by monitoring CPU usage over several days, to capture usage patterns within a day and over the course of a week. Importantly, these CPU profiles need to show some headroom to allow for load fluctuations.
At Twitter, a particular topology will be tested like this for a few days before going to production.
If you’d like to compute, say, a moving average, you can store the last few minutes of data in a bolt and periodically write aggregate statistics into a persistent key-value store such as Redis.
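A minimal sketch of that pattern, with a plain dict standing in for Redis and illustrative names throughout (this is not the Storm bolt interface or the Redis client API):

```python
import time
from collections import deque

class MovingAverageBolt:
    """Keeps the last `window_seconds` of values in memory and
    periodically flushes an aggregate to a key-value store."""

    def __init__(self, window_seconds=300, store=None):
        self.window = window_seconds
        self.values = deque()  # (timestamp, value) pairs, oldest first
        self.store = store if store is not None else {}  # dict as Redis stand-in

    def execute(self, value, now=None):
        """Receive one tuple; evict values older than the window."""
        now = now if now is not None else time.time()
        self.values.append((now, value))
        while self.values and self.values[0][0] < now - self.window:
            self.values.popleft()

    def flush(self, key="moving_avg"):
        """Persist the current aggregate (a real bolt would SET this in Redis)."""
        if self.values:
            avg = sum(v for _, v in self.values) / len(self.values)
            self.store[key] = avg
        return self.store.get(key)

bolt = MovingAverageBolt(window_seconds=300)
for t, v in [(0, 10), (60, 20), (120, 30)]:  # simulated timestamps
    bolt.execute(v, now=t)
print(bolt.flush())  # 20.0
```

In a real topology the flush would typically be driven by a tick tuple or timer rather than called by hand.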