If you want to read faster, I'll share a way that will radically change how fast you can read books, and, more importantly, comprehend the information. You can read faster and absorb a lot more books with a rapid reading method -- the sticky note way. You can do extreme reading with sticky notes.
This approach is for printed books and magazines. I'm a fan of the Kindle. You can read my Kindle review. My main scenario for Kindle is instant access, reading fiction, and having books at my fingertips. That said, I can read and learn faster with physical books, using the "Sticky Note Method."
I read a lot of books each month. I usually spend in the neighborhood of $300 a month. Books are my fastest way to learn new ideas, new methods, new techniques that I can test at work to keep growing my capabilities. Books are the short-cuts for personal development and rapid learning.
Why the Sticky Note Method
Here's a quick story that might help show how it works. The other day I was looking for a key concept. I knew which book it was in. I started rapidly flipping through pages. I couldn't find it. I started to put yellow stickies in the book as I flipped through. On each sticky, I wrote down one nugget of insight -- one key idea or action that was worth noting. As I wrote down each insight, I put it into easy-to-understand terms. I wrote it as a one-liner reminder. Within 20 minutes, I had parsed my 300+ page book. It was riddled with stickies, my one-liner reminders, and I now had a personal index, with key takeaways.
It was the wrong book.
I took a break and realized I was intently looking the right way, but in the wrong book. I grabbed the right book, and found the idea I had been looking for within seconds. Meanwhile, what dawned on me was just how powerful this rapid reading method is.
The sticky note method is powerful because it forces you to internalize what you read, while turning insight into action. It's simple too. But don't let the simplicity fool you.
How To Use the Sticky Note Method for Rapid Reading
The steps are simple:
That's it. I told you it was simple. It’s simple, but effective.
You will get faster with practice. When I asked one of my mentors what's the secret to running faster, he said run faster. I thought he was joking but he was serious. The same is true for reading faster. To read faster ... read faster. But now you have a method to make the most of what you read, as you go – The Sticky Note Method for Rapid Reading. It works because it forces you to focus, it forces you to internalize information rather than regurgitate it, it forces you to create a personalized, meaningful index into your book, and it forces you to distill information into little insights and actions (one-liner reminders) that turn insight into action.
I've been using this approach for years. I've tried many ways to read faster, and they all add up, but if I could only share one approach with you, this is the one that will radically change your game and take your reading to the next level.
User group leaders, listen up! We have an extra Full Experience ticket to php|tek, and we'd like to give it to the community to use. php|tek happens May 22-25 in Chicago, Illinois, and is jam packed with PHP goodness, including an Unconference and a Hackathon. I mean, have you seen this schedule?! Your winner is also welcome to join us at the Engine Yard JAUNT on Friday night after the conference. There is undoubtedly much awesomeness to be had next week.
If you are interested in participating, drop me a line by Thursday, May 17 and let me know. We'll randomly select a user group to receive the ticket, and you can give it out however you wish. Thumb wrestling champion? Karaoke competition? Hackathon winner? Best high fiver? It's really up to you.
As soon as you have a winner, let me know and we'll make the arrangements with the fine folks at Blue Parabola, hosts of the conference.

In Zen And The Art Of Scaling - A Koan And Epigram Approach, Russell Sullivan offered an interesting conjecture: there are 20 classic bottlenecks. This sounds suspiciously like the idea that there are only 20 basic story plots. And depending on how you chunkify things, it may be true, but in practice we all know bottlenecks come in infinite flavors, all tasting of sour and ash.
One day Aurelien Broszniowski from Terracotta emailed me his list of bottlenecks, we cc’ed Russell in on the conversation, he gave me his list, I have a list, and here’s the resulting stone soup.
Russell said this is his “I wish I knew when I was younger” list and I think that’s an enriching way to look at it. The more experience you have, the more different types of projects you tackle, the more lessons you’ll be able to add to a list like this. So when you read this list, and when you make your own, you are stepping through years of accumulated experience and more than a little frustration, but in each there is a story worth grokking.
Why do people register for the Stoos Stampede in Amsterdam?
Let’s try and make some educated guesses by using the champfrogs scale.
Curiosity
I am sure plenty of change agents are curious to know how this new kind of event will play out. Will it lead to actionable results for organizational transformation? Will it lead to nothing? No better way to know than to be there.
Maybe some will feel it an honor to be there. It will be the first “stampede” under the Stoos umbrella name. And there are already rumors that others will follow on other continents. It’s great to be among the first.
Some who attend the stampede might be looking for acceptance. I know that many change agents feel like aliens in their own organizations. But at this event they will be among similarly-minded people.
The point of being a change agent is to become good at changing a little bit of the world around you. At the Stoos Stampede we will share stories about how to do this, which will help people to become more competent at it.
Many people who are trying to change organizations are looking for freedom from command and control. The Stoos Stampede organization is exactly like that. There is (almost) no control. And lots of freedom.
There’s no better way to get connected with other people than face-to-face conversations where we inspire each other about organizational change. I’m quite sure this event will lead to lots of new relationships being forged.
Ah, a tricky one. The people who crave a bit of order should not feel disappointed. The stampede will not be chaos. We will have some rules and constraints in place, to maximize the power of self-organization.
People who look for purpose won’t need to look further. Everyone who gathers at the Stoos Stampede is looking for ways to get happier employees and customers, and more meaning to running a healthy organization.
Hm, this will depend on yourself I suppose. A change agent is only as good as the change he or she was able to introduce in the system. And status will follow if you did a good job at that.
What is your reason to come to Amsterdam on 6+7 July?
Kelly's Contemplation has a nice post on The 3 Phases of an Agile Project. He starts the example with receiving a project that has few actual requirements. The notion of a traditional - or better yet conventional - approach to project management starts looking for requirements. The agile approach is presented in a way that is typical of COTS based deployment through progressive development of the capabilities of the tool set tailored to the needs of the client.
Here's where traditional, er., conventional, project management fails to deal with the current approach to complex project, program, and product development.
In our Defense and Space programs, there is a critical first step that is baked into the procurement process.
CAPABILITIES BASED PLANNING
What capabilities does the customer want to possess when this project is over? These capabilities are the prerequisites to defining the requirements. Here's an end-to-end process for developing the capabilities description. This looks very non-agile, but these are the principles; put them into practice in a way that best suits your need. But check to see if you've covered off all the steps, because if something is missing, you may think you're being agile, but in fact you're missing a critical piece of data.
The critical reason for starting with capabilities is to establish a home for all the requirements. To answer the question “why is this requirement present?” “Why is this requirement needed?” “What business or mission value does fulfilling this requirement provide?”
Capabilities statements can then be used to define the units of measure for program progress. Measuring progress with physical percent complete at each level is mandatory. But measuring how the Capabilities are being fulfilled is most meaningful to the customer.
The “meaningful to the customer” units of measure are critical to the success of any program. Without these measures the program may be cost, schedule, and technically successful but fail to fulfill the mission.
The process flow above is the starting point for Identifying the Needed Capabilities and determining their priorities.
Starting with the Capabilities prevents the “Bottom Up” requirements gathering process from producing a “list” of requirements – all needed – that is missing a well formed topology. This Requirements Architecture is no different than the Technical or Programmatic architecture of the system. Capabilities Based Planning (CBP) focuses on “outputs” rather than “inputs.” These “outputs” are the mission capabilities that are fulfilled by the program.
In order to fulfill these capabilities, requirements need to be met. But we need the capabilities first. Without the capabilities, it is never clear the mission will be a success, because there is no clear and concise description of what success means.
The concept of CBP recognizes the interdependence of systems, strategy, organization, and support in delivering the capability, and the need to examine options and trade-offs in terms of performance, cost and risk to identify optimum development investments. CBP relies on scenarios to provide the context to measure the level of capability.
Here are the details of how to capture the capabilities needed by the customer. Again, do these in any way that best suits your need. But do check that you've got - or are going to get - all the information. Otherwise you get to do the work twice.
Notice that most everything talked about in the post is here, plus some other stuff. Trade-offs between capabilities are critical, as is assessing the costs and the risks. It doesn't do any good to charge ahead with everything the customer wants to do if you have no idea of the REAL costs and the REAL risk that is being created by your agile approach.
But defining capabilities through Use Cases and Scenarios is the standard approach in large defense and space programs. There is even a language for doing that - SysML is a systems engineering modeling language for Use Cases and Scenarios. I know that doesn't sound very agile, but on a multi-billion dollar weapons system (mostly software these days) scenarios are the starting point for identifying the actual requirements.
This is the core concept missing from the PMI approach, the dreaded waterfall approach, the conventional approach. And of course these approaches are no longer allowed in the procurement of systems for the Defense and Space industry. Those places procure systems using Capabilities Based Planning.
The picture above is one of 5 processes we use in our Performance Based Management activities in our domain. Next come the requirements; since we're spending the government's money, they want to know those sorts of things.
"Employ your time in improving yourself by other men's writings, so that you shall gain easily what others have labored hard for." - Socrates
But remember, you've got to provide attribution, otherwise they'll think you're a poser. Taking credit for others' ideas as your own, or even embellishing them and calling it new, is bad form these days.
In the comments on my post about generating random numbers to test a function, David Turner suggested that this was exactly the use case QuickCheck was intended for, so I’ve been learning a bit more about that this week.
I started with a simple property to check that the brute force (bf) and divide and conquer (dc) versions of the algorithm returned the same result, assuming that there were enough values in the list to have a closest pair:
prop_closest_pairs xs = length xs >= 2 ==> dcClosest xs == (fromJust $ bfClosest xs)
I could then run that as follows:
> import Test.QuickCheck
> quickCheck (prop_closest_pairs :: [(Double, Double)] -> Property)
It failed pretty quickly because the bf and dc versions of the algorithm sometimes return the pairs in a different order.
e.g. bf may say the closest pair is (2.0, 0.0), (2.1, 1.1) while dc says it’s (2.1, 1.1), (2.0, 0.0) which will lead to the quick check property failing because those values aren’t equal:
> ((2.0, 0.0), (2.1, 1.1)) == ((2.1, 1.1), (2.0, 0.0))
False
The best way I could think of to get around this problem was to create a type to represent a pair of points and then write a custom equality operator.
I initially ended up with the following:
type Point a = (a, a)
data Pair a = P (Point a) (Point a)

instance Eq (Pair a) where
  P a b == P c d = a == c && b == d || a == d && b == c
Which didn’t actually compile:
qc_test.hs:41:58:
No instance for (Eq a)
arising from a use of `=='
In the second argument of `(&&)', namely `b == c'
In the second argument of `(||)', namely `a == d && b == c'
In the expression: a == c && b == d || a == d && b == c
The problem is that while we’ve made Pair an instance of the Equality type class there’s no guarantee that the value contained inside it is an instance of the Equality type class which means we might not be able to compare its values.
We need to add a class constraint to make sure that the value inside P is a part of Eq:
instance (Eq a) => Eq (Pair a) where
  P a b == P c d = a == c && b == d || a == d && b == c
Now we’re saying that we want to make Pair an instance of the Equality type class but only when the value that Pair contains is already an instance of the Equality type class.
In this case we’re just storing pairs of doubles inside the Pair so it will work fine.
Now if we compare the two points from above we’ll see that they’re equal:
> P (2.0, 0.0) (2.1, 1.1) == P (2.1, 1.1) (2.0, 0.0)
True
I had to go and change the existing code to make use of this new type, but it didn’t take more than 5-10 minutes to do that.
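For readers who want to try the same idea outside Haskell, here is a rough Python analogue of the order-insensitive pair. It is only a sketch: the `Pair` class and its field names are made up for illustration and are not part of the original code.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]


@dataclass(frozen=True, eq=False)
class Pair:
    """A pair of points that compares equal regardless of order (hypothetical helper)."""
    a: Point
    b: Point

    def __eq__(self, other):
        if not isinstance(other, Pair):
            return NotImplemented
        # Same two points, either in the same order or swapped.
        return (self.a == other.a and self.b == other.b) or \
               (self.a == other.b and self.b == other.a)

    def __hash__(self):
        # Hash on the unordered set of points so equal pairs hash alike.
        return hash(frozenset((self.a, self.b)))


# Plain tuples are order-sensitive, which is what made the property fail.
tuples_equal = ((2.0, 0.0), (2.1, 1.1)) == ((2.1, 1.1), (2.0, 0.0))
# The wrapper type ignores the order of the two points.
pairs_equal = Pair((2.0, 0.0), (2.1, 1.1)) == Pair((2.1, 1.1), (2.0, 0.0))
print(tuples_equal, pairs_equal)
```

As with the class constraint in the Haskell version, this only works because the contained points (tuples of floats) already support equality themselves.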
A couple of months ago I went to an excellent set of talks at LJC night at QCon. This not only inspired my entry on staleness but also made me think about some more architectural smells.
Gil Tene from Azul spoke about the maximum size of VMs we all use. A quick show of hands indicated that the maximum size of VM heap memory most of us ran was 2-4GB. Gil argued, quite convincingly, that this was due to legacy issues with garbage collection but new techniques should make this limit redundant.
I sort of agree... However, even if all the issues surrounding garbage collecting large VMs are solved, the 2-4GB limit is still a useful one.
In Architectural Smells I argued that there are architectural smells in the same way that there are code smells (aspects of a system that indicate but not guarantee a possible issue) and the example I gave was caching. I also think that VM size is an architectural smell.
When designing complex systems one of the most common design guidelines is to split differing responsibilities into specific services. Non-functional requirements may complicate this but most of us try to avoid having all functionality in a single service. Certain services are obviously distinct and can be isolated. However it's easy for functionality creep to result in a service gaining a lot of responsibilities and this often results in a larger run-time footprint.
During testing this shows itself as a service requiring a large VM - certainly in comparison to other services. I view this as an architectural smell and analyse the service to see if it should be split up. (Easier if you have some form of lightweight and iterative design process.)
This is very similar to the code smell of long method. It's rare that your code size or service size starts out by being too large but tends to happen over time. This is a good reason why an architect can't just throw it over the wall but needs to stay involved with the project. As developers add code to the system the design might need to change.
Whether you take an absolute size or relative size will depend on your system in question. Personally, I get concerned when a service is larger than 1GB unless its prime purpose is data storage and even then the data should probably be clustered across services rather than in a single block - but that's an argument for a different occasion.
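A check like this could be scripted against the figures you collect during testing. The sketch below is an illustration under assumptions: the service names and sizes are invented, the relative factor of 3x the median is an arbitrary choice, and only the 1GB absolute limit and the exemption for data storage services come from the text above.

```python
from statistics import median


def flag_large_services(heap_mb, absolute_limit_mb=1024,
                        relative_factor=3.0, storage_services=()):
    """Return {service: [reasons]} for services whose VM size looks like a smell."""
    typical = median(heap_mb.values())
    flagged = {}
    for name, size in heap_mb.items():
        if name in storage_services:
            # Services whose prime purpose is data storage are exempt here.
            continue
        reasons = []
        if size > absolute_limit_mb:
            reasons.append("absolute")
        if size > relative_factor * typical:
            reasons.append("relative")
        if reasons:
            flagged[name] = reasons
    return flagged


# Hypothetical heap sizes in MB, as observed during testing.
heap_mb = {"auth": 256, "search": 512, "orders": 2048, "datastore": 4096}
flagged = flag_large_services(heap_mb, storage_services={"datastore"})
print(flagged)
```

A flagged service is not necessarily wrong; as with any smell, it is a prompt to look at whether the service has gained too many responsibilities.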
At Engine Yard, we believe that you should have the flexibility to set up your environments and manage your data stores as you see fit. This is something we take seriously as we continue to evolve Engine Yard Cloud and today, we are happy to announce database-less environments as an alpha release. If you need to utilize data offerings outside of our natively supported MySQL or PostgreSQL, then this feature will enable you to do so.
Enabling the feature
With database-less environments, it is no longer necessary to have a MySQL or PostgreSQL instance in every environment. Simply boot up a ‘No Database’ cluster with one of our Add-on database providers or roll your own using utility instances. Now it is easier and more affordable than ever to get started on Engine Yard.
You can enable this feature using the Early Access tools. Once you have the 'No db' feature enabled, you will be able to select the "No Database (Alpha)" option under Database Stack on the new environment form.
You can add as many application instances and utilities as you need, and you can stop paying for database masters that you don’t use. For example, you can follow the Mongoid RailsCast (Episode 238) and create a simple blog using Mongoid using two application instances and three utility nodes.
You can also use the ‘No Database’ feature in combination with our Add-on Program (login required). For example, you can have a simple application with just one instance and an external database. See the Database section of our Add-on Program for more information.
We hope you enjoy this feature and let us know what you think.
Notes
Removal of the database.yml file
Environments without databases will not have a database.yml file generated by Engine Yard Cloud. Enabling this feature means that you are either not using ActiveRecord or you have supplied your own database.yml file in your repository.
Did you know that there are 3 billion more smartphones on earth than there are humans? Maybe that doesn’t come as much of a surprise to you. But what you might find more surprising is that the growth in smartphone adoption has actually contributed to Engine Yard’s success. That’s right: as smartphone adoption has grown, so has app consumption. As a result, businesses are now prioritizing mobile application development. By 2015, mobile application development projects targeting smartphones and tablets will outnumber native PC projects by a ratio of 4 to 1. Innovation in mobile is imperative, and there’s a need for tools that enable businesses to innovate quickly. Many cloud computing technologies--like Engine Yard's Platform as a Service--have enabled developers and businesses to focus on application innovation.
The below infographic includes even more interesting facts about innovations in mobile, cloud computing and PaaS. Check it out and let us know where you think these fields are headed next.

Copy and paste onto your blog:
<a href="http://www.engineyard.com/blog/2012/platform-as-a-service/"><img class="aligncenter size-full wp-image-12384" src="http://www.engineyard.com/blog/wp-content/uploads/platform-as-a-service-v2.jpg" alt="Platform as a Service" width="930" height="5572" /></a><br/>Courtesy of: <a href="http://www.engineyard.com">Engine Yard</a>
I find that action builds momentum. The best kind of action is decisive action because then you are "all in." Dipping a toe in the water doesn’t make the same splash as diving into the pool.
When I'm under the gun, "satisficing" to make decisions serves me well. Gary Klein wrote a great book on how experts make rapid decisions under fire. (The book is Sources of Power.)
Some of the techniques I use include: criteria and weights, CARVER (Criticality, Accessibility, Return, Vulnerability, Effect, and Recognizability), and Six Thinking Hats. At Microsoft, I tend to use criteria and weight when I need to get agreement with others on what the priorities are. I also tend to use Six Thinking Hats when I need to rapidly have folks change perspective, and take a more holistic view. To make the most of Six Thinking Hats, I use questions at the whiteboard to focus the thinking and work our way through the hats.
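To make criteria and weights concrete, here is a minimal sketch of the arithmetic. The criteria names, weights, and ratings are hypothetical placeholders; the point is only that each option gets a weighted average of its ratings, which gives a group something explicit to agree or disagree with.

```python
def weighted_scores(weights, ratings):
    """Score each option as the weighted average of its per-criterion ratings."""
    total_weight = sum(weights.values())
    return {
        option: sum(weights[c] * option_ratings[c] for c in weights) / total_weight
        for option, option_ratings in ratings.items()
    }


# Hypothetical criteria and weights agreed with the group (higher = more important).
weights = {"impact": 5, "cost": 3, "risk": 2}

# Hypothetical ratings per option on a 1-5 scale (higher = better).
ratings = {
    "Option A": {"impact": 4, "cost": 2, "risk": 3},
    "Option B": {"impact": 3, "cost": 5, "risk": 4},
}

scores = weighted_scores(weights, ratings)
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for option, score in ranked:
    print(f"{option}: {score:.1f}")
```

If the ranking surprises people, that is usually a sign the weights do not reflect the real priorities, which is a useful conversation in itself.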
At the end of the day, I've found that a lot of the decisions come down to who do you want to be and what experiences do you want to create. Basically, the more you can connect your decisions to your "Why" or to your values, the stickier they are.
In fact, the secret of changing habits is to first decide who you want to be; our identity then helps us pattern match the best fits.
We are pleased to announce that we have integrated Badgeville’s gamification technology into our Zendesk ticketing system. As you use the helpdesk to perform different actions (searching documentation, contributing to forums, completing satisfaction surveys, etc.), you will be able to earn many different badges and complete many different missions.
Through this integration we hope to increase community engagement, and to not only give you new channels to share your experiences and ideas, but also to reward you for it!
While inside our helpdesk, if you hover over a user’s picture, a little summary profile will appear showing how many points and rewards that user has, as well as the last badge that they have earned.
By clicking on “Click for profile” you will be brought to the user’s showcase that shows their progress on the current missions, with hints on how to earn the badges associated with them. You will also be able to see your showcase anytime, by clicking on the “Profile” link in the upper right corner.
Also, there have been a few new Community forums opened up in the last week, so if you want to share your ideas and start earning some badges, check them out here!
Keep your eyes open. We will be introducing additional missions in the future. If you have any questions or feedback, please open a ticket and I will get back to you!
Happy playing!
Once you have a program (a collection of interrelated projects focused on one business goal) and you have technical debt, you have a much bigger problem. Not just because the technical debt is likely bigger. Not just because you have more people. But because you also have geographically distributed teams, and those teams are almost always separated by function and time zone.
So, my nice example of a collocated team in Thoughts on Infrastructure, Technical Debt, and Automated Test Framework rarely occurs in a program, unless you have cross-functional teams collocated in a program. If you do, great. You know what to do.
But let’s assume you don’t have them. Let’s assume you have what I see in my consulting practice: an architecture group in one location, or an architect in one location and architects around the world; developers and “their” testers in multiple time zones; product owners separated from their developers and testers. The program is called agile because the program is working in iterations. And, because it’s a program, the software pre-existed the existence of the agile transition in the organization, so you have legacy technical debt up the wazoo (the technical term). What do you do?
Let’s walk through an example, and see how it might work. Here’s a story which is a composite from several clients; no clients were harmed in the telling of this story.
Let’s also assume you are working on release 5.0 of a custom email client. Release 4 was the previous release. Release 4 had trouble. It was late by 6 months and quite buggy. Someone sold agile as the way to make software bug-free and on-time.
You do not have automated tests for much of the code, unit tests or system tests. You have a list of defects that make Jack the Ripper’s list of killings look like child’s play. But agile is your silver bullet.
The program manager is based in London. The testers for the entire program are in Bangalore because management had previously fired all the testers and outsourced the testers. That was back in release 2. They have since hired all the Bangalore testers as employees of the Bangalore subsidiary. The program architect is based in San Francisco, and there is an architect team that is dispersed into 4 other teams: Denver, LA, Munich, and Paris. The developers are clustered in “Development Centers of Excellence:” Denver, LA, Cambridge, Paris, London, Munich, and Milan. That’s 8 development teams.
Oh, and if you think I’m kidding with this scenario, I’m not. This is what most of my clients with geographically distributed teams and programs face on a daily basis. They deserve your sympathy and empathy. Do not tell them, “Don’t go agile.” That’s nuts. They have a right to go agile. You can tell them, “Don’t go Scrum.” That’s reasonable. Scrum is for a cross-functional co-located team. Agile is for everyone. Scrum is for a specific subset.
What do you do?
So far, this is all about preventing more technical debt, not what happens when you trip over technical debt as you enter code or tests you never looked at before.
If you expected to walk into a closet, take out a shirt, and close the closet door, that’s one thing. But now, you stepped into something out of one of those death-by-hoarding shows on TV, you have an obligation to do something. You can document the problem as you encounter it; you can let the product owner know; file a defect report; write a test so you can contain the debt; and maybe you have more options. Whatever you do, make sure you have done something. Do not open the door, see the mess inside and close the door on the mess. It’s tempting. Oh my, it is tempting.
See, on programs because of the size, everything is magnified. With more people and more teams, everything is harder. Things happen faster. If you have co-located cross-functional teams, no problem. But if you don’t have co-located cross-functional teams, you have to work with what you have. And, if you already have a big legacy product, you want to address technical debt in small chunks, refactoring in small bits, integrating as you proceed.
My philosophy is this: the bigger the program, the more you need to become accustomed to working in small chunks, integrating as you go. Fully implement a small story, integrate it on the mainline. Everyone on the program does that. If you need help from an integration team, so be it.
But, if everyone only implements small stories, and everyone takes care of their own technical debt as they discover it, you don’t need an army of integration people. You only need an army of integration people when you have technical debt around integration and release. Fix that, and everyone can become responsible for their own integration.
And, if you can’t release, that’s where the architects should start. If you can’t do continuous integration, that’s where the architects should start. Because that’s what’s preventing you from making progress on the product. Work backwards from release, and then the architects can work on the rest of the product. Until you can release and build reliably, the rest of the product doesn’t matter.
Oh, how we love discussions about terminology…
Is my course about management or leadership?
Are the problems we discuss complex or complicated?
Do the teams need coaches or mentors?
Should they have responsibility or accountability?
Are we being agile or lean?
The discussions about terminology go on and on. Yes, I’ve been guilty of this too. I love a good fight about a few letters, every now and then. :-)
But does it matter?
After more than a decade we should know that we can make good progress in our work despite uncertainty, fuzzy boundaries, diverse opinions, and minimal definitions.
We can go on forever trying to find the “right” specifications for our most cherished words. But maybe we should stop splitting hairs over terminology and go and produce something.
Ah! But if we produce things... should we then aim for output or outcome?
I was reading Requirements–we don’t need them! by Andrew Woodward of 21apps recently, which talks about the importance of having a visionary product owner on your SharePoint project to ensure that it delivers the right thing. That all makes sense and it's inspired me to write this blog entry ... SharePoint projects also need a healthy dose of software architecture.
If you're currently thinking "well, yes, duh!", that's awesome. Unfortunately, I've seen a number of SharePoint implementations where the basic tenets of software architecture have been forgotten or neglected. Here's a summary of why software architecture is important for SharePoint projects.
1. Many SharePoint implementations aren't just SharePoint
Many of the SharePoint solutions I've seen are not just simple implementations of the SharePoint product where end-users can create lists, share documents and collaborate. As with most software systems, they're a mix of new and legacy technologies, usually with complex integration points into other parts of the enterprise via web services and other integration techniques. Often, bespoke .NET code is also a part of the overall solution, either running inside or outside of SharePoint. If you don't take the "big picture" into account by understanding the environment and its constraints, there's a chance that you'll end up building the wrong thing or building something that doesn't work.
2. Non-functional requirements still apply to SharePoint solutions
Even if you're *not* writing any bespoke code as a part of your SharePoint solution, that doesn't mean you can ignore non-functional requirements. Performance, scalability, security, availability, disaster recovery, audit, monitoring, etc. are all potentially applicable. I've seen SharePoint projects where the team has neglected to think about key non-functional requirements, even on public Internet-facing websites. As expected, the result was solutions that exhibited poor response times and/or severe security flaws (e.g. cross-site scripting). Often these issues were identified at a late stage in the (usually waterfall) project lifecycle.
3. SharePoint projects are complex and require technical vision too
Like any programming language, SharePoint is a complex platform and there are usually a number of different ways to solve a single problem. In order to get some consistency of approach and avoid chaos, SharePoint projects need strong technical leadership and the software architecture role is as applicable here as it is if you're writing a software system from scratch. If you've ever seen SharePoint projects where a seemingly chaotic team has eventually delivered something of a poor quality, you'll appreciate why this is important.
4. SharePoint solutions still need to be documented
With all of this complexity in place, I'm continually amazed to see SharePoint solutions that have no documentation. I'm not talking about hefty 200+ page documents here, but there should be at least *some* lightweight documentation that provides an overview of the solution. Some diagrams to show how the SharePoint solution works at a high-level are also useful, and something along these lines works well. Some *lightweight* documentation can be a great starting point for future support, maintenance and enhancement work; particularly if the project team changes or if the project is delivered under an outsourcing agreement.
Strong leadership and discipline aren't just for software development projects
If you're delivering software solutions then you need to make sure that you have at least one person undertaking the technical leadership role, looking after the things that I've highlighted above. If not, you're doing it wrong. As an aside, all of this applies to Microsoft Dynamics CRM projects too, especially if you're "just tacking on an Internet-facing ASP.NET website to expose CRM data via the Internet".
I mentioned all of this to somebody a while back and they replied "but SharePoint isn't software development". Regardless of whether it is or isn't software development, successful SharePoint projects need strong technical leadership and discipline. SharePoint projects need software architecture.
Apache Hive is a data warehouse system built on top of Hadoop. Using a SQL-like language, you can query data stored in the Hadoop filesystem (HDFS). Those queries are then translated into MapReduce jobs and executed on your cluster.
As an example we’ll analyze tweets from the Twitter Streaming logs and calculate the top 5 hashtags per day which are associated with positive sentiment signals (smileys).
You can imagine how this can be expanded to simple sentiment analysis on your (potential) customer feedback.
Gather the data from Twitter streaming API
The JSON log lines from the Twitter Streaming API look like this:
{
  "created_at": "Sat Sep 10 22:23:38 +0000 2011",
  "id_str": "112652479837110273",
  "text": "@twitter meets @seepicturely at #tcdisrupt cc.@boscomonkey @episod http://t.co/6J2EgYM",
  ...
  "user": {
    "name": "Eoin McMillan ",
    "screen_name": "imeoin",
    ...
  }
  ...
}
For now we only care about the “created_at” and “text” attributes. See detailed information on all available attributes at https://dev.twitter.com/docs/platform-objects/tweets
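To make the two attributes concrete, here is a minimal Python sketch that pulls them out of a single log line. It is not part of the Hive pipeline: the helper name `extract_fields` and the shortened sample line are assumptions made for illustration only.

```python
import json

def extract_fields(line):
    """Return (created_at, text) from one JSON log line, or None if unusable."""
    try:
        tweet = json.loads(line)
    except ValueError:
        # Malformed line: skip it instead of aborting the whole run
        return None
    if not isinstance(tweet, dict) or "created_at" not in tweet or "text" not in tweet:
        # Line parsed, but it lacks the attributes we care about
        return None
    return tweet["created_at"], tweet["text"]

# Shortened, hypothetical version of the sample line shown above
sample = ('{"created_at": "Sat Sep 10 22:23:38 +0000 2011", '
          '"id_str": "112652479837110273", '
          '"text": "@twitter meets @seepicturely at #tcdisrupt"}')
print(extract_fields(sample))
```

Returning None for unusable lines keeps the caller in charge of how to count or log them.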
Import raw data into Hadoop & Hive
Now from the Hive console import the logs into a Hive table.
create table raw_tweets (json string);
load data local inpath 'sample.json' into table raw_tweets;
With Hadoop it is a best practice to always preserve the raw source data. It is pretty common to detect obscure parse errors or missing information days later, and keeping the source information allows you to correct this without much hassle.
Parse JSON tweets and extract relevant information
From this raw data, we parse and extract the actual information that we care about. Using the get_json_object function, we can access the “text” and “created_at” attributes with an XPath-like query on the JSON object. The timestamp needs to be converted into Unix epoch format for later formatting.
create table tweets as
select get_json_object(json, "$.text") as text,
unix_timestamp(get_json_object(json, "$.created_at"),
"EEE MMM d HH:mm:ss Z yyyy") as ts_created
from raw_tweets;
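For readers who want to check the timestamp conversion outside Hive, the following Python sketch does roughly the same thing. The `strptime` format string is my assumed equivalent of the Java pattern used above, `to_epoch` is a hypothetical helper name, and the final formatting uses UTC, whereas Hive formats in the session time zone. Note that `%a` and `%b` depend on the locale, so this assumes an English locale.

```python
from datetime import datetime, timezone

# Assumed strptime equivalent of the Java pattern "EEE MMM d HH:mm:ss Z yyyy"
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def to_epoch(created_at):
    """Convert a created_at string to Unix epoch seconds, or None if it cannot be parsed."""
    try:
        return int(datetime.strptime(created_at, TWITTER_FORMAT).timestamp())
    except (TypeError, ValueError):
        # Missing or malformed timestamps become None, to be filtered out later
        return None

ts = to_epoch("Sat Sep 10 22:23:38 +0000 2011")
# Format the epoch value back into a day string, here in UTC
print(datetime.fromtimestamp(ts, timezone.utc).strftime("%Y-%m-%d"))  # prints 2011-09-10
```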
Text parsing, extract hashtags and sentiment identification
Now that we have the source data in a format we can deal with (text, timestamp), it is time to identify sentiment information. For this case, we’ll compose a regular expression that matches some positive smileys, such as :-) and ;-). Feel free to expand this to your taste. We also split the text on whitespace and identify terms that look like a #hashtag. We emit these matching hashtags together with the date in YYYY-MM-DD format, which allows us to do daily aggregations afterwards.
create table positive_hashtags_per_day as
select from_unixtime(ts_created, 'yyyy-MM-dd') as dt,
lower(hashtag) as hashtag from tweets
lateral view explode(split(text, ' ')) b as hashtag
where ts_created is not null
and hashtag rlike "^#[a-zA-Z0-9]+$"
and text rlike "^.*[\;:]-?\\).*$";
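The two regular expressions are easy to try out on individual tweets before running the full query. This Python sketch mirrors the filtering logic under the assumption that Python's `re` treats these simple patterns the same way Hive's `rlike` does; `positive_hashtags` is a hypothetical helper, not something Hive provides.

```python
import re

HASHTAG = re.compile(r"^#[a-zA-Z0-9]+$")  # same hashtag pattern as the Hive query
SMILEY = re.compile(r"[;:]-?\)")          # matches :) :-) ;) ;-)

def positive_hashtags(text):
    """Return the lowercased hashtags of a tweet, but only if it contains a positive smiley."""
    if not SMILEY.search(text):
        return []
    # Split on single spaces, as the Hive query does with split(text, ' ')
    return [term.lower() for term in text.split(" ") if HASHTAG.match(term)]

print(positive_hashtags("Great talk :-) #TCDisrupt #hive!"))  # prints ['#tcdisrupt']
```

The second hashtag is dropped because the trailing "!" fails the anchored pattern, which shows how strict the match is. Loosening it is one of the ways to expand the expression to your taste.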
Aggregate occurrences per day
Aggregation is straightforward: it is just a matter of counting the occurrences per day. If you want to do multiple aggregations (per week, per month, etc.), you might want to move the date string creation to this step.
create table count_positive_hashtags_per_day as
select dt, hashtag, count(*) as cnt from positive_hashtags_per_day
group by dt, hashtag;
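In plain Python, the same group-and-count step can be sketched with `collections.Counter`. The rows below are made-up sample data, not taken from the real data set.

```python
from collections import Counter

# Hypothetical (dt, hashtag) rows, as the previous step would emit them
rows = [
    ("2011-02-25", "#ff"),
    ("2011-02-25", "#ff"),
    ("2011-02-25", "#happy"),
    ("2011-02-26", "#ff"),
]

# Counting identical (dt, hashtag) pairs is the equivalent of
# "group by dt, hashtag" combined with count(*)
counts = Counter(rows)
for (dt, hashtag), cnt in sorted(counts.items()):
    print("%s\t%s\t%d" % (dt, hashtag, cnt))
```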
Limit top 5 results per day using external reducer
Finally we only want the top 5 results per day. Now this is a bit tricky in Hive, as this requires a user defined function or streaming reducer call. We do the latter using a little piece of Python code that only returns the first 5 results per keyword. Because the input to the reducer call is already secondary sorted on the count in descending order, this will return the top 5 results; just how we want it.
add file topN.py;
create table top5_positive_hashtags_per_day as
reduce dt, hashtag, cnt
using 'topN.py 5' as dt, hashtag, cnt
from
(select dt, hashtag, cnt from count_positive_hashtags_per_day
distribute by dt sort by dt, cnt desc) cnts;
The Python reduce code looks like this. Due to the way the Hadoop Streaming API works, you need to detect key boundaries yourself.
#!/usr/bin/env python
# Reducer that returns the top N results per keyword
import sys

maxN = int(sys.argv[1])
last_key = None
count = 0
for line in sys.stdin:
    # Each input line is "key<TAB>rest of the columns"
    (key, value) = line.strip().split("\t", 1)
    if key != last_key:
        # Key boundary detected: start counting again for the new key
        count = 0
        last_key = key
    if count < maxN:
        print("%s\t%s" % (key, value))
        count += 1
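The key boundary logic can be tested locally without a cluster. Below is a sketch of the same idea written as a function, using `itertools.groupby` in place of the manual `last_key` bookkeeping. This is a swapped-in technique rather than the script above, and `top_n` is a hypothetical name. Like the reducer, it assumes the input is already grouped by key, sorted by count in descending order, and tab-separated.

```python
from itertools import groupby, islice

def top_n(lines, n):
    """Yield the first n lines of each consecutive run of lines sharing the same key."""
    # Split each line into [key, rest]; assumes every line contains a tab
    keyed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        # groupby detects the key boundaries; islice keeps only the first n rows
        for _, value in islice(group, n):
            yield "%s\t%s" % (key, value)

# A few rows in the shape the reducer receives (dt, hashtag, cnt)
sample = [
    "2011-02-25\t#ff\t8\n",
    "2011-02-25\t#happy\t3\n",
    "2011-02-25\t#followfriday\t2\n",
    "2011-02-26\t#db40birthday\t5\n",
]
for row in top_n(sample, 2):
    print(row)
```

With n set to 2, the third row for 2011-02-25 is dropped while the single row for 2011-02-26 passes through.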
Results
The results of this exercise on my tiny sample set (can’t redistribute the source according to the Twitter TOS):
hive> select * from top5_positive_hashtags_per_day;
OK
2011-02-25 #ff 8
2011-02-25 #happy 3
2011-02-25 #followfriday 2
2011-02-25 #teamzeeti 2
2011-02-25 #baensv 1
2011-02-26 #db40birthday 5
2011-02-26 #bieberfact 2
2011-02-26 #teamfollowback 2
2011-02-26 #feelingsoo 1
2011-02-26 #aktf 1
2011-02-27 #1 2
2011-02-27 #12 1
2011-02-27 #27f 1
2011-02-27 #dvdmeb 1
2011-02-27 #fail 1
Time taken: 4.231 seconds
Follow Fridays make people happy… who would have thought
The whole "everyone should learn programming" meme has gotten so out of control that the mayor of New York City actually vowed to learn to code in 2012.
A noble gesture to garner the NYC tech community vote, for sure, but if the mayor of New York City actually needs to sling JavaScript code to do his job, something is deeply, horribly, terribly wrong with politics in the state of New York. Even if Mr. Bloomberg did "learn to code", with apologies to Adam Vandenberg, I expect we'd end up with this:
10 PRINT "I AM MAYOR"
20 GOTO 10
Fortunately, the odds of this technological flight of fancy happening – even in jest – are zero, and for good reason: the mayor of New York City will hopefully spend his time doing the job taxpayers paid him to do instead. According to the Office of the Mayor home page, that means working on absenteeism programs for schools, public transit improvements, the 2013 city budget, and … do I really need to go on?
To those who argue programming is an essential skill we should be teaching our children, right up there with reading, writing, and arithmetic: can you explain to me how Michael Bloomberg would be better at his day to day job of leading the largest city in the USA if he woke up one morning as a crack Java coder? It is obvious to me how being a skilled reader, a skilled writer, and at least high school level math are fundamental to performing the job of a politician. Or at any job, for that matter. But understanding variables and functions, pointers and recursion? I can't see it.
Look, I love programming. I also believe programming is important … in the right context, for some people. But so are a lot of skills. I would no more urge everyone to learn programming than I would urge everyone to learn plumbing. That'd be ridiculous, right?
The "everyone should learn to code" movement isn't just wrong because it falsely equates coding with essential life skills like reading, writing, and math. I wish. It is wrong in so many other ways.
I suppose I can support learning a tiny bit about programming just so you can recognize what code is, and when code might be an appropriate way to approach a problem you have. But I can also recognize plumbing problems when I see them without any particular training in the area. The general populace (and its political leadership) could probably benefit most of all from a basic understanding of how computers, and the Internet, work. Being able to get around on the Internet is becoming a basic life skill, and we should be worried about fixing that first and most of all, before we start jumping all the way into code.
Please don't advocate learning to code just for the sake of learning how to code. Or worse, because of the fat paychecks. Instead, I humbly suggest that we spend our time learning how to …
These are skills that extend far beyond mere coding and will help you in every aspect of your life.
I started blogging about 5 years ago, and over the years I've published 653 posts. This will be the last one. I had some specific personal goals in mind when I started blogging, and I've gotten everything I've wanted out of it, and more. I learned a lot, and I'm happy with a lot of the feedback I've gotten over the years. But the time has come to move on. I want to get back to writing more code instead of writing about writing code. I've mostly enjoyed writing blog posts, but in the past year it has felt more like a chore than a hobby, so it's probably a good idea to just call it quits.
I'm not entirely sure yet what I'm going to do with the content on this blog. There's quite a few posts I surely want to keep around, but certainly not all of them. For now, I'm going to keep the blog up so everything stays available but after a while, I'm gonna shut it down. I might keep up some kind of static archive of my favorite posts, or I might just put them on GitHub in MarkDown format. If you have any suggestions on what I should do with it, I'd be happy to hear them.
I will remain active on Twitter and I plan to be more active on GitHub from now on. But the blogging thing ends here and now. I'd like to thank everyone for reading, especially the ones who've been around since the beginning. It's been an interesting ride for me, but it's time for something else.