Software Development Blogs: Programming, Software Testing, Agile Project Management

Methods & Tools

Feed aggregator

Python: Detecting the speaker in HIMYM using Parts of Speech (POS) tagging

Mark Needham - Sun, 03/01/2015 - 03:36

Over the last couple of weeks I’ve been experimenting with different classifiers to detect speakers in HIMYM transcripts and in all my attempts so far the only features I’ve used have been words.

This led to classifiers that were overfitted to the training data, so I wanted to generalise them by introducing features based on the parts of speech of the words in each sentence, which are more generic.

First I changed the function which generates the features for each word to also contain the parts of speech of the previous and next words as well as the word itself:

def pos_features(sentence, sentence_pos, i):
    features = {}

    # the word itself and its POS tag
    features["word"] = sentence[i]
    features["word-pos"] = sentence_pos[i][1]

    # the previous word and its POS tag, or a sentinel at the sentence start
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-word-pos"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-word-pos"] = sentence_pos[i-1][1]

    # the next word and its POS tag, or a sentinel at the sentence end
    if i == len(sentence) - 1:
        features["next-word"] = "<END>"
        features["next-word-pos"] = "<END>"
    else:
        features["next-word"] = sentence[i+1]
        features["next-word-pos"] = sentence_pos[i+1][1]

    return features

Next we need to tweak our calling code to calculate the parts of speech tags for each sentence and pass it in:

import nltk

featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)  # strip the speaker labels
    sentence_pos = nltk.pos_tag(untagged_sent)   # POS tag the raw sentence
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append((pos_features(untagged_sent, sentence_pos, i), tag))

I’m using nltk to do this and although it’s slower than some alternatives, the data set is small enough that it’s not an issue.
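
As a quick illustration (my addition, not from the original post), this is roughly what nltk.pos_tag produces – the same Penn Treebank tags (NNP, PRP, ':' and so on) that show up in the decision tree output further down:

import nltk

# illustrative only - the exact tags can vary slightly between nltk versions/models
print(nltk.pos_tag(["Ted", ":", "I", "love", "you"]))
# e.g. [('Ted', 'NNP'), (':', ':'), ('I', 'PRP'), ('love', 'VBP'), ('you', 'PRP')]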

Now it’s time to train a Decision Tree with the new features. I created three variants – one with both words and POS; one with only words; one with only POS.

I took a deep copy of the training/test data sets and then removed the appropriate keys:

def get_rid_of(entry, *keys):
    for key in keys:
        del entry[key]
 
import copy
 
# Word based classifier
tmp_train_data = copy.deepcopy(train_data)
for entry, tag in tmp_train_data:
    get_rid_of(entry, 'prev-word-pos', 'word-pos', 'next-word-pos')
 
tmp_test_data = copy.deepcopy(test_data)
for entry, tag in tmp_test_data:
    get_rid_of(entry, 'prev-word-pos', 'word-pos', 'next-word-pos')
 
c = nltk.DecisionTreeClassifier.train(tmp_train_data)
# classify() takes a single feature set, so evaluate across the whole test set instead
print(nltk.classify.accuracy(c, tmp_test_data))
 
# POS based classifier
tmp_train_data = copy.deepcopy(train_data)
for entry, tag in tmp_train_data:
    get_rid_of(entry, 'prev-word', 'word', 'next-word')
 
tmp_test_data = copy.deepcopy(test_data)
for entry, tag in tmp_test_data:
    get_rid_of(entry, 'prev-word', 'word', 'next-word')
 
c = nltk.DecisionTreeClassifier.train(tmp_train_data)
# classify() takes a single feature set, so evaluate across the whole test set instead
print(nltk.classify.accuracy(c, tmp_test_data))

The full code is on my github but these were the results I saw:

$ time python scripts/detect_speaker.py
Classifier              speaker precision    speaker recall    non-speaker precision    non-speaker recall
--------------------  -------------------  ----------------  -----------------------  --------------------
Decision Tree All In             0.911765          0.939394                 0.997602              0.996407
Decision Tree Words              0.911765          0.939394                 0.997602              0.996407
Decision Tree POS                0.90099           0.919192                 0.996804              0.996008

There’s still not much in it – the POS-only version has slightly more false positives and false negatives when classifying speakers, but on other runs it performed better.

If we take a look at the decision tree that’s been built for the POS one we can see that it’s all about POS now as you’d expect:

>>> print(c.pseudocode(depth=2))
if next-word-pos == '$': return False
if next-word-pos == "''": return False
if next-word-pos == ',': return False
if next-word-pos == '-NONE-': return False
if next-word-pos == '.': return False
if next-word-pos == ':':
  if prev-word-pos == ',': return False
  if prev-word-pos == '.': return False
  if prev-word-pos == ':': return False
  if prev-word-pos == '<START>': return True
  if prev-word-pos == 'CC': return False
  if prev-word-pos == 'CD': return False
  if prev-word-pos == 'DT': return False
  if prev-word-pos == 'IN': return False
  if prev-word-pos == 'JJ': return False
  if prev-word-pos == 'JJS': return False
  if prev-word-pos == 'MD': return False
  if prev-word-pos == 'NN': return False
  if prev-word-pos == 'NNP': return False
  if prev-word-pos == 'NNS': return False
  if prev-word-pos == 'POS': return False
  if prev-word-pos == 'PRP': return False
  if prev-word-pos == 'PRP$': return False
  if prev-word-pos == 'RB': return False
  if prev-word-pos == 'RP': return False
  if prev-word-pos == 'TO': return False
  if prev-word-pos == 'VB': return False
  if prev-word-pos == 'VBD': return False
  if prev-word-pos == 'VBG': return False
  if prev-word-pos == 'VBN': return True
  if prev-word-pos == 'VBP': return False
  if prev-word-pos == 'VBZ': return False
if next-word-pos == '<END>': return False
if next-word-pos == 'CC': return False
if next-word-pos == 'CD':
  if word-pos == '$': return False
  if word-pos == ',': return False
  if word-pos == ':': return True
  if word-pos == 'CD': return True
  if word-pos == 'DT': return False
  if word-pos == 'IN': return False
  if word-pos == 'JJ': return False
  if word-pos == 'JJR': return False
  if word-pos == 'JJS': return False
  if word-pos == 'NN': return False
  if word-pos == 'NNP': return False
  if word-pos == 'PRP$': return False
  if word-pos == 'RB': return False
  if word-pos == 'VB': return False
  if word-pos == 'VBD': return False
  if word-pos == 'VBG': return False
  if word-pos == 'VBN': return False
  if word-pos == 'VBP': return False
  if word-pos == 'VBZ': return False
  if word-pos == 'WDT': return False
  if word-pos == '``': return False
if next-word-pos == 'DT': return False
if next-word-pos == 'EX': return False
if next-word-pos == 'IN': return False
if next-word-pos == 'JJ': return False
if next-word-pos == 'JJR': return False
if next-word-pos == 'JJS': return False
if next-word-pos == 'MD': return False
if next-word-pos == 'NN': return False
if next-word-pos == 'NNP': return False
if next-word-pos == 'NNPS': return False
if next-word-pos == 'NNS': return False
if next-word-pos == 'PDT': return False
if next-word-pos == 'POS': return False
if next-word-pos == 'PRP': return False
if next-word-pos == 'PRP$': return False
if next-word-pos == 'RB': return False
if next-word-pos == 'RBR': return False
if next-word-pos == 'RBS': return False
if next-word-pos == 'RP': return False
if next-word-pos == 'TO': return False
if next-word-pos == 'UH': return False
if next-word-pos == 'VB': return False
if next-word-pos == 'VBD': return False
if next-word-pos == 'VBG': return False
if next-word-pos == 'VBN': return False
if next-word-pos == 'VBP': return False
if next-word-pos == 'VBZ': return False
if next-word-pos == 'WDT': return False
if next-word-pos == 'WP': return False
if next-word-pos == 'WRB': return False
if next-word-pos == '``': return False

I like that it’s identified the ‘:‘ pattern:

if next-word-pos == ':':
  ...
  if prev-word-pos == '<START>': return True

Next I need to drill into the types of sentence structures that it’s failing on and work out some features that can handle those. I also still need to see how well a random forest of decision trees would perform.
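
For reference, here’s a sketch of how that random forest experiment might look – this is my assumption using scikit-learn rather than anything from the original post, with DictVectorizer one-hot encoding the same dict-based features, and train_data/test_data being the (features, speaker) lists built above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

# one-hot encode the dict features; sparse=False keeps things simple for a small data set
vectorizer = DictVectorizer(sparse=False)
train_X = vectorizer.fit_transform([features for features, speaker in train_data])
train_y = [speaker for features, speaker in train_data]
test_X = vectorizer.transform([features for features, speaker in test_data])
test_y = [speaker for features, speaker in test_data]

forest = RandomForestClassifier(n_estimators=100)
forest.fit(train_X, train_y)
print(forest.score(test_X, test_y))  # mean accuracy on the held-out words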

Categories: Programming

Re-Read Saturday: The Goal: A Process of Ongoing Improvement. Part 2

Chapters 1 through 3 actively present the reader with a burning platform. The plant and division are failing. Alex Rogo has actively pursued increased efficiency and automation to generate cost reductions, yet performance is falling even further behind and fear has become a central feature of the corporate culture. If the book stopped here it would be a brief tragedy; however, Chapter 4 begins the path towards the redemption of Alex Rogo and the ideas that are the bedrock of lean.

New Characters

  • Jonah – advisor
  • Lou – plant’s chief accountant

Chapter 4:

In Chapter 3, Alex was at a company meeting called to communicate the depth of the division’s problems and to search for answers. The meeting was not holding Alex’s attention. He found a cigar in his jacket and flashed back to a chance meeting in an airport lounge. While smoking a cigar, Alex recognizes and strikes up a conversation with a professor from grad school. The discussion turns to the problems at the plant, and even though they have pursued changes that have yielded great efficiencies the problems still exist and perhaps are getting worse. Alex proudly shows Jonah a chart that “proves” a 36% improvement in efficiency from using robots and automation. Jonah asks one very simple question: was profit up 36% too? Alex struggles and answers that, “it is not that simple.” In fact the number of people in the plant did not go down, inventory did not go down and not one additional widget had been shipped. This interaction foreshadows one of the key ideas that The Goal presents to the reader: we are measuring the wrong thing! When we measure the wrong thing we send the wrong message and we get the wrong results. Chapter 4 closes with Jonah and Rogo talking about the real meaning of productivity. Productivity is defined as accomplishing something in terms of a goal. Without knowing the goal, measuring productivity and efficiency is meaningless.

Chapter 5:

In Chapter 5 we snap back to the all-day meeting to discuss the division’s performance (Chapter 3). Alex continues to ruminate on Jonah’s comments. Alex leaves the meeting under the pretext of a problem back at the plant. As he drives back he begins to reflect on “the goal” in the definition of productivity identified in Chapter 4. Deciding that he will not have time to think due to the day-to-day demands (also known as the tyranny of the urgent but not important – see Habit 3 of Stephen Covey), he heads to a favorite pizzeria for pizza and beer. Goldratt and Cox use Alex’s inner dialog to show why most of the current internal goals and measures Alex is being asked to pursue miss the point. The bottom line is that the plant’s goal is to make money. If it does not make money, the rest does not matter. While The Goal is set in a manufacturing plant, the point is that if any group or department does not materially impact the real goal of an organization, it should not exist.

Chapter 6:

Chapter 6 begins with a search for the overall measures that contribute to (or predict) whether the plant is meeting the goal of profitability. One of the first questions Alex poses to himself is whether he can assume that making people work and making money are the same thing. This sounds like a funny question, however I often see managers and leaders mistake being busy with delivering value. Alex and Lou brainstorm a set of 3 metrics that impact the goal. They are: 1. net profit, 2. ROI, 3. cash flow. In this conversation Alex tells Lou the truth about the state of the division and the potential closure of the plant. The 3 metrics sound right, however Alex does not see the immediate connection between the measures and day-to-day operations. The chapter ends with Alex asking the 3rd shift supervisor how his activities impact net profit, ROI and cash flow. He simply gets the deer-in-the-headlights look.

Chapters 4 – 6 shift the focus from steps in the process to the process as a whole. Organizations have an ultimate goal. In this case the ultimate goal of the plant is to make money. The goal is not quality, efficiency or even employment, because in the long run, if the plant doesn’t deliver product that can be sold it won’t exist. Whether an organization is for-profit or non-profit, if it doesn’t attain its ultimate goal it won’t exist.


Categories: Process Management

Stuff The Internet Says On Scalability For February 27th, 2015

Hey, it's HighScalability time:


Hear ye puny mortal. 1.3 million Earths doth fill our Sun. Whence comes this monster black hole with a mass 12 billion times that of the Sun?

 

  • 1 Terabit of Data per Second: 5G; 1.9 Terabytes:  customer in stadium data usage during the Super Bowl; 1 TB: free each month on Big Query; 100x: reduced power consumption in radio chip

  • Quotable Quotes:
    • Robin Harris: But now that non-volatile memory technology - flash today, plus RRAM tomorrow - has been widely accepted, it is time to build systems that use flash directly instead of through our antique storage stacks. 
    • Sundar Pichai: That’s the essence of what we [Google] get excited about – working on problems for people at scale, which make a big difference in [people’s] lives. 
    • @timoreilly: Facebook is hacked 600,000 times a day. @futurecrimes First thing to do to protect yourself, turn on 2-factor authentication
    • @architectclippy: I see you have a poorly structured monolith. Would you like me to convert it into a poorly structured set of microservices?
    • Poppy Crum: Your brain wants as much as possible to come up with a robust actionable perception of the world and of the information and data that is coming in.
    • @BenedictEvans: Both Google and Facebook killing XMPP.  IM being euthanized just at the time messaging could become a third run-time for the internet
    • @dhh: 4-tier / micro-service architectures are organizational scaling patterns far more than they're tech. 1st rule of distributed systems: Don't.
    • @amcafee: Ex-Etsy seller: "In practical terms, scaling the handmade economy is an impossibility."
    • kurin: If you're behind a LB you can just drain the traffic to the hosts you're about to upgrade. Also, if you're above your SLA... I mean, some dropped queries aren't the end of the world.
    • @WSJ: Facebook’s 5,000+ staff generate $1.36 million each in annual revenue. The key to productivity is custom-built software tools
    • @jaykreps: Software is mostly human capital (in people's heads): losing the team is usually worse than losing the code.
    • Dylan Tweney: Mobile growth is huge, and could surge at least 3x in the next two years
    • Joe Davison: I learned that there is often more to business than meets the eye, and the only way to succeed is to plan ahead and anticipate all contingencies.
    • @etherealmind: Google published 30000 configuration changes to its network in 1 month 

  • What's different about AI this time around? Less hype, more data, more computation. The Believers: It was a stunning result. These neural nets were little different from what existed in the 1980s. This was simple supervised learning. It didn’t even require Hinton’s 2006 breakthrough. It just turned out that no other algorithm scaled up like these nets. "Retrospectively, it was a just a question of the amount of data and the amount of computations," Hinton says.

  • What lesson did Ozgun Erdogan learn while working on a database at Amazon that never saw the light of day? How to Build Your Distributed Database (1/2): This optimized plan has many computations pushed down in the query tree, and only collects a small amount of data. This enables scalability. Much more importantly, this logical plan formalizes how relational algebra operators scale in distributed systems, and why. That's one key takeaway I had from building a distributed database before. In the land of distributed systems, commutativity is king. Model your queries with respect to the king, and they will scale.

  • Replication for resiliency? Nature thought of that. Nibbled? No Problem: Champaign first observed in the 1980s, some plants respond by making more seeds, ultimately benefiting from injury in a phenomenon called overcompensation. More recently, Paige and postdoc Daniel Scholes suspected a role for endoreduplication, in which a cell makes extra copies of its genome without dividing, multiplying its number of chromosome sets, or “ploidy.”

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge (which means this post has many more items to read so please keep on reading)...

Categories: Architecture

R/ggplot: Controlling X axis order

Mark Needham - Fri, 02/27/2015 - 01:49

As part of a talk I gave at the Neo4j London meetup earlier this week I wanted to show how you could build a simple chart showing the number of friends that different actors had using the ggplot library.

I started out with the following code:

library(ggplot2)
library(dplyr)

df = read.csv("/tmp/friends.csv")
top = df %>% head(20)

ggplot(aes(x = p.name, y = colleagues), data = top) + 
  geom_bar(fill = "dark blue", stat = "identity")

The friends CSV file is available as a gist if you want to reproduce the chart. This is how the chart renders:

[Chart: friend counts by actor, bars in arbitrary order]

The chart is in a fairly arbitrary order, whereas it would be quite cool if we could get the most popular people to show from left to right.

I had the people’s names in the correct order in the data frame but annoyingly ggplot was then sorting them into alphabetical order. Luckily I came across the scale_x_discrete function, which does exactly what I needed.

If we pass in the list of names to that function we get the chart we desire:

ggplot(aes(x = p.name, y = colleagues), data = top) + 
  geom_bar(fill = "dark blue", stat = "identity") + 
  scale_x_discrete(limits= top$p.name)

[Chart: friend counts by actor, sorted from most to least popular]

Categories: Programming

Internal Applications Are Products Too!

Can you see the man in the moon?

Without a roadmap and a value focus it is easier to perceive that the current “project” might be the last one for a while, and therefore you need to ask for the moon.

It is often more difficult to take a product focus for applications that will be used internally than for an application that will be used by or sold to an external customer. Part of the issue seems to be the distance of an application from the ultimate end of the value chain, and therefore from revenue. The further away from revenue, the harder it is to view the user of the software as a customer. Therefore providing support for tools that enable or support non-customer facing work is often viewed as less critical than customer facing applications or tools. The difficulty in considering internal software as a product is less an artifact of any real difference between internal and external facing applications than of perspective. Differences in perspective are typically built on minor differences in organization and market attributes. These differences include:

Ability to switch – Internal “customers” often are hostages to the services provided by internal IT organizations, at least in the short run. While that sounds strong, internal customers often do not have the option to shift providers if they don’t like the service or quality they receive. In the long run, switching can and often does occur either through outsourcing, formation of shadow IT groups in the business or changes in IT leadership. Less flexibility in the short run can often lead to a lack of discipline when it comes to defining product roadmaps or defining the true value any specific feature or function might deliver. Without a roadmap, a form of fatalism can set in, in which users always ask for more than they need at the moment but usually accept what they are offered (after a lot of noisy conversation).

Internal politics – The value of work that is sold or used by external customers is usually easier to measure. Functionality either solves a need and generates revenue or increases customer satisfaction. Developing a value for work to be consumed internally is rarely that cut and dried. Priorities are often defined by considerations that don’t reflect the true quantitative value of the work. Priorities often reflect the requestor’s (or requestor’s group’s) positional power. In my first job, requests from the head of accounting always floated to the top of the list even though we were a garment manufacturer with a sales focus. Prioritization by factors that don’t relate to value makes it difficult to develop roadmaps or plan releases for applications that don’t have the same level of political clout. Remember when you hear the saying, “the squeaky wheel gets the grease,” it often means that the organization has a project rather than a product focus.

Talking with Customers – Another of the differences between internal applications and external products that impacts whether an application is viewed as a product is who needs to have input into its direction. Products require discussion not only with internal stakeholders, but also with external customers. Internal applications supported by individual projects only require discussion with internal stakeholders. The lack of a perceived impact outside the company’s boundaries makes it difficult to generate the motivation to get involvement across the IT/business boundary. For example, it is often harder to identify and get product owner involvement to support planning and work to be used internally. Agile techniques are often a tool to remove the barriers between IT and internal business groups. However it is easier to generate the involvement needed to facilitate developing plans, roadmaps and communication when revenue is involved, which tends to yield a project perspective (short term) rather than a product perspective.

Perceived differences between work done for internal and external use tend to drive internal customers into a more transactional mode. Without a roadmap and a value focus it is easier to perceive that the current “project” might be the last one for a while, and therefore you need to ask for the moon.


Categories: Process Management

Quote of the Day

Herding Cats - Glen Alleman - Thu, 02/26/2015 - 20:08

Your breakdown is not my emergency

Categories: Project Management

Introducing gRPC, a new open source HTTP/2 RPC Framework

Google Code Blog - Thu, 02/26/2015 - 19:30

Today, we are open sourcing gRPC, a brand new framework for handling remote procedure calls. It’s BSD licensed, based on the recently finalized HTTP/2 standard, and enables easy creation of highly performant, scalable APIs and microservices in many popular programming languages and platforms. Internally at Google, we are starting to use gRPC to expose most of our public services through gRPC endpoints as part of our long term commitment to HTTP/2.

Over the years, Google has developed underlying systems and technologies to support the largest ecosystem of micro-services in the world; our servers make tens of billions of calls per second within our global datacenters. At this scale, nanoseconds matter. Efficiency, scalability and reliability are at the core of building Google’s APIs.

gRPC is based on many years of experience in building distributed systems. With the new framework, we want to bring to the developer community a modern, bandwidth and CPU efficient, low latency way to create massively distributed systems that span data centers, as well as power mobile apps, real-time communications, IoT devices and APIs.

Building on HTTP/2 standards brings many capabilities such as bidirectional streaming, flow control, header compression, multiplexing requests over a single TCP connection and more. These features save battery life and data usage on mobile while speeding up services and web applications running in the cloud.

Developers can write more responsive real-time applications, which scale more easily and make the web more efficient. Read more about the features and benefits in the FAQ.

Alongside gRPC, we are releasing a new version of Protocol Buffers, a high performance, open source binary serialization protocol that allows easy definition of services and automatic generation of client libraries. Proto 3 adds new features, is easier to use compared to previous versions, adds support for more languages and provides canonical mapping of Proto to JSON.
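
To make that concrete, here’s a minimal sketch of what calling a gRPC service from Python looks like – note this uses today’s grpcio API and the canonical Greeter example, so the service, its SayHello method and the generated helloworld_pb2* modules are illustrative assumptions rather than anything from this announcement:

import grpc

# modules generated from a .proto service definition by the protoc gRPC plugin
import helloworld_pb2
import helloworld_pb2_grpc

channel = grpc.insecure_channel("localhost:50051")  # a single multiplexed HTTP/2 connection
stub = helloworld_pb2_grpc.GreeterStub(channel)     # generated client stub
reply = stub.SayHello(helloworld_pb2.HelloRequest(name="gRPC"))
print(reply.message)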

The project has support for C, C++, Java, Go, Node.js, Python, and Ruby. Libraries for Objective-C, PHP and C# are in development. To start contributing, please fork the Github repositories and start submitting pull requests. Also, be sure to check out the documentation, join us on the mailing list, visit the IRC #grpc channel on Freenode and tag StackOverflow questions with the “grpc” tag.

Google has been working closely with Square and other organizations on the gRPC project. We’re all excited for the potential of this technology to improve the web and look forward to further developing the project in the open with the help, direction and contributions of the community.


Post by Mugur Marculescu, Product Manager

Categories: Programming

Visualizing data in the RAW

Software Requirements Blog - Seilevel.com - Thu, 02/26/2015 - 18:46
During the search for a better, easier ways to create images to represent software concepts, I’ve come across a few tools that do a good job creating basic, official-looking graphs. Many of these tools offer teaser trials in hopes of inducing you to buy a full featured version. Not so with the open web app […]
Categories: Requirements

Do You Have What it Takes to Become a Microsoft MVP?

Making the Complex Simple - John Sonmez - Thu, 02/26/2015 - 17:00

In this video I respond to an email asking for advice on how to become a Microsoft MVP. I admit that I do not have much background on this topic but I offer the strategy that I would use in order to accomplish this.

The post Do You Have What it Takes to Become a Microsoft MVP? appeared first on Simple Programmer.

Categories: Programming

A New Way to Promote Your App on Google Play

Android Developers Blog - Thu, 02/26/2015 - 14:05

Posted by Michael Siliski, Product Management Director, Google Play

Google Play now reaches more than 1 billion people on Android devices in more than 190 countries, helping a growing number of developers like you build successful global businesses. In fact, in the past year, we paid more than $7 billion to developers distributing apps and games on Google Play. We remain as committed as ever to making Google Play the best place to find great apps, games and other entertainment.

App discovery plays a critical role in driving your continued success, and over the past year Google has provided best practices to enhance app discovery and engagement, as well as app promotion tools to get the most out of search and display advertising for developers. We are always looking for new ways to help you get your apps in front of potential new users. That’s why, in the next few weeks, we will begin piloting sponsored search results on Google Play, bringing our unique expertise in search ads to the store.

With more than 100 billion searches every month on Google.com, we’ve seen how search ads shown next to organic search results on Google.com can significantly improve content discovery for users and advertisers, both large and small. Search ads on Google Play will enable developers to drive more awareness of their apps and provide consumers new ways to discover apps that they otherwise might have missed.

In the coming weeks, a limited set of users will begin to see ads from a pilot group of advertisers who are already running Google search ads for their apps. We’ll have more to share in the coming months about the expansion of this program as we look at the results and feedback. We believe search ads will be a useful addition to Google Play for users and developers alike, and we hope this will bring even more success to our developer community.

Categories: Programming

R: Conditionally updating rows of a data frame

Mark Needham - Thu, 02/26/2015 - 01:45

In a blog post I wrote a couple of days ago about cohort analysis I had to assign a monthNumber to each row in a data frame and started out with the following code:

library(zoo)
library(dplyr)
 
monthNumber = function(cohort, date) {
  cohortAsDate = as.yearmon(cohort)
  dateAsDate = as.yearmon(date)
 
  if(cohortAsDate > dateAsDate) {
    "NA"
  } else {
    paste(round((dateAsDate - cohortAsDate) * 12), sep="")
  }
}
 
cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(monthNumber != "0") %>% 
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber)

If we time this function using system.time we’ll see that it’s not very snappy:

system.time(cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(monthNumber != "0") %>% 
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber))
 
   user  system elapsed 
  1.968   0.019   2.016

The reason for the poor performance is that we process each row of the data frame individually due to the call to group_by on the second line. One way we can refactor the code is to use the ifelse function, which can process multiple rows at a time:

# NA when the cohort month is after the date, otherwise the month offset (as in monthNumber above)
system.time(
cohortAttendance %>% 
  mutate(monthNumber = ifelse(as.yearmon(cohort) > as.yearmon(date), 
                              NA, 
                              paste(round((as.yearmon(date) - as.yearmon(cohort)) * 12), sep=""))))
   user  system elapsed 
  0.026   0.000   0.026

Antonios suggested another approach which involves first setting every row to ‘NA’ and then selectively updating the appropriate rows. I ended up with the following code:

cohortAttendance$monthNumber = NA

# update only the rows where the date falls on or after the cohort month
idx = as.yearmon(cohortAttendance$cohort) <= as.yearmon(cohortAttendance$date)
cohortAttendance$monthNumber[idx] = paste(round((as.yearmon(cohortAttendance$date[idx]) - as.yearmon(cohortAttendance$cohort[idx])) * 12), sep="")

Let’s measure that:

system.time(paste(round((as.yearmon(cohortAttendance$date) - as.yearmon(cohortAttendance$cohort)) * 12), sep=""))
   user  system elapsed 
  0.013   0.000   0.013

Both approaches are much quicker than my original version although this one seems to be marginally quicker than the ifelse approach.

Note to future Mark: try to avoid grouping by row number – there’s usually a better and faster solution!

Categories: Programming

Sending Windows logs to Papertrail with nxlog

Agile Testing - Grig Gheorghiu - Thu, 02/26/2015 - 01:04
I am revisiting Papertrail as a log aggregation tool. It's really easy to send Linux logs to Papertrail via syslog or rsyslog or syslog-ng (see this article on how to configure syslog with TLS) but to send Windows logs you need to jump through some hoops.

Papertrail recommends nxlog as their Windows log management tool of choice, so that's what I used. This Papertrail article explains how to install and configure nxlog on Windows (I recommend enabling TLS).  The nxlog.conf template file provided by Papertrail will send Windows Event logs over. I also wanted to send application-specific logs, so here's what I did:

1) Add an Input section to nxlog.conf for each directory containing the files you want to send to Papertrail. For example, if one of your applications logs to C:\MyApp1\logs and your log files end with .log, you could have this input section:

# Monitor MyApp1 log files 
<Input MyApp1>
 Module im_file
 File 'C:\\MyApp1\\logs\\*.log'
 Exec $Message = $raw_event;
 Exec if $Message =~ /GET \/ping/ drop();
 Exec if file_name() =~ /.*\\(.*)/ $SourceName = $1;
 SavePos TRUE
 Recursive TRUE
</Input>

Some observations:

  • The name MyApp1 is the name of this Input section
  • The File statement points to the location and name of the log files
  • The first Exec statement saves the log line under consideration as the variable $Message
  • The second Exec statement drops messages that contain a specific regular expression, in my case just 'GET /ping' -- which happens to be health checks from the load balancer that pollute the logs; you can replace this with any regular expression that will filter out log lines you don't want sent to Papertrail
  • The next few statements were in the sample Input stanza from the template nxlog.conf file so I just left them there
2) Add more Input sections, one for each log location (i.e. multiple log files under a given directory) that you want to send to Papertrail. You need to give each Input section a unique name (e.g. MyApp1 above).
3) Add a Route section for the Input sections defined previously. If you defined 2 Input sections MyApp1 and MyApp2, your Route section would look something like:

<Route 2>
 Path MyApp1, MyApp2 => filewatcher_transformer => syslogout
</Route>

The filewatcher_transformer section was already included in the sample nxlog.conf file from Papertrail. The Route section above says that log lines picked up by the two Input sections MyApp1 and MyApp2 are processed through the statements defined in the filewatcher_transformer section, then sent on to Papertrail via the statements defined in the syslogout section.
At this point, if you restart the nxlog service on your Windows box, you should start seeing log entries from your application(s) flowing into the Papertrail console.

Bringing apps to the workplace with Google Play for Work

Android Developers Blog - Thu, 02/26/2015 - 00:17

Posted by Matt Goodridge, Google Play team

Work doesn’t just happen in an office from 9 to 5 anymore. Today’s workers are mobile workers, and they need to be able to get things done as efficiently and collaboratively as possible, at any time. That’s why the Android for Work initiative is bringing together partners across the ecosystem, from device and app makers to networking and management solutions, to provide businesses with a secure, flexible and reliable mobility platform that users already know and love.

Google Play for Work allows businesses to securely deploy and manage enterprise-grade apps, across all of their users running Android for Work. Google Play for Work simplifies the process of distributing apps to employees and ensures that IT approves every deployed app. For developers, this is an opportunity to reach a new audience at scale through bulk installs or purchasing, which enables easy installation of your app across enterprises.

How to join Google Play for Work

Free apps will be available on Google Play for Work at launch with no action needed on your part. If you have a paid app, you’ll soon be able to opt-in to make your app available for bulk purchase on Google Play for Work in the Developer Console during the app publishing process. Find out more about publishing in the Google Play Developer Help Center.

Designing great apps for Android for Work

Apps that are installed from Google Play for Work will function without code changes. However, please note that some of the controls that Android for Work offers IT admins could affect how your app works. To ensure the best possible experience for your users, watch the first in our series of Android for Work DevBytes below to understand the best practices you should be following in developing your app.

More DevBytes will be posted to our YouTube channel soon. Find out more about Android for Work.

Categories: Programming

Deep Learning without Deep Pockets

Now that you’ve transformed your system through successive evolutions of architecture goodness...you’ve made it cloud native, you now treat a fistful of datacenters as a single computer, you’ve microservicized it, you’ve containerized it, you’re continuously releasing and improving it, you’ve made it reactive, you’ve socialized it, you’ve mobilized it, you’ve Hadoop’ed it, you’ve made it DevOps friendly, and you have real-time dashboards that would make NORAD jealous...what’s next?

Deep learning is what’s next. Making machines that learn. The problem is how?

All the other transformations have been changes good programmers can learn to do. Deep learning is still deep magic. We are waiting for the Hadoop of deep learning to be built.

Until then, if you aren’t Google with Google sized clusters and cloisters of PhDs, what can you do? Greg Corrado, Senior Research Scientist at Google, gave a great presentation at the RE.WORK Deep Learning Summit 2015 (videos) that has some useful suggestions:

Categories: Architecture

The Blame Game

NOOP.NL - Jurgen Appelo - Wed, 02/25/2015 - 12:59
The Blame Game

“You did not receive your order? Well, I did all I could. But they aren’t doing their job properly.”

“They still haven’t paid your invoice? Strange. They told me last time it would be done.”

“The feature is still not working? I don’t understand. I did submit the service request for you.”

The post The Blame Game appeared first on NOOP.NL.

Categories: Project Management

Product and Project Perspectives: You Can’t Live Without Both

You need both the big and the small picture to deliver value.

The concepts of project and product provide two alternative perspectives, which might lead readers to believe that one is more important than the other. You need both sets of behaviors generated by the project and product perspectives. How these behaviors are incorporated into roles on teams is not as straightforward as designating one role for project concerns and another for product concerns, and never the twain shall meet. The two roles do not have to be separate people. Agile spreads the project-centric behaviors across the entire team, and even the product owner typically absorbs some of the project-centric activities. However, other than at a philosophical level, the team is not typically charged with performing the product-centric activities. Agile techniques spread project behaviors across the team while product-driven behaviors are more concentrated.

Project-centric behaviors are focused on the delivery of the tactical plan, while the product owner has more of a focus on the vision of the long-term future, i.e. the product roadmap. Even though the product owner has a distinct interest in the tactical (what is to be accomplished in a sprint or release), the team has a more focused interest in day-to-day activities. The team must plan, monitor and adjust the day-to-day activities needed to meet their commitments during the sprint (commitments in Agile are by definition tactical). The product owner can contribute, however they typically do not have the technical acumen to deliver functional software. However, without a product view, the day-to-day project considerations will typically trump long-term considerations. In a mature Agile environment, the product view interacts with the project view to generate an equilibrium between long- and short-term perspectives.

Project and product focuses require different measurement. The project focus on delivery/short-term goals generates a need to understand, pursue and measure delivery efficiency. Efficiency is a measure of transformation: how much of a set of raw materials is needed to create an output. Efficiently producing any output is only valuable IF what is being produced is what is needed and can actually be delivered. Interestingly, most software is a step toward a different product that is bought or used. Because the software being developed or enhanced is a step along a path, the value assigned often does not represent the ultimate impact to the organization (see our Re-Read Saturday series on The Goal for more on this topic). The product owner, as the steward of the product perspective, owns the definition and measurement of value. He or she needs to take the big picture view of what the market needs AND what the market will pay for. What the market will pay for is just as important for an internal product as for an external one. In order to understand the value a product delivers, the product owner must ask whether the result of a sprint or release positively impacts ROI, profit and cash flow. Efficiency is a mechanism to determine whether a team is making the most out of their “raw material,” but it does not provide feedback on whether what is being produced is the right thing, or whether the functionality delivered yields value to the organization.

In general the product owner will be the champion for the product perspective; however, every team member needs to have an understanding of how the future should unfold and the value they are being asked to deliver. The team will need to temper the product vision based on the constraints that the day-to-day environment provides. Both the project and product perspectives are needed to maximize value. Putting either perspective ahead of the other for any length of time will create an imbalance that will reduce team effectiveness.


Categories: Process Management

Python/nltk: Naive vs Naive Bayes vs Decision Tree

Mark Needham - Tue, 02/24/2015 - 23:39

Last week I wrote a blog post describing a decision tree I’d trained to detect the speakers in a How I met your mother transcript and after writing the post I wondered whether a simple classifier would do the job.

The simple classifier will work on the assumption that any word followed by a “:” is a speaker and anything else isn’t. Here’s the definition of a NaiveClassifier:

from nltk import ClassifierI

class NaiveClassifier(ClassifierI):
    # a hard-coded rule: a word is a speaker if and only if the next word is ":"
    def classify(self, featureset):
        if featureset['next-word'] == ":":
            return True
        else:
            return False

As you can see it only implements the classify method and executes a static check.
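
To make the rule concrete, here’s a quick illustrative check (my addition, not from the original post):

print(NaiveClassifier().classify({"next-word": ":"}))    # True  -> classified as a speaker
print(NaiveClassifier().classify({"next-word": "the"}))  # False -> not a speaker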

While reading about ways to evaluate the effectiveness of text classifiers I came across Jacob Perkins’ blog, which suggests that we should measure two things: precision and recall.

  • Higher precision means fewer false positives, while lower precision means more false positives.
  • Higher recall means fewer false negatives, while lower recall means more false negatives. (The short sketch below restates both in terms of raw counts.)
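
As a quick aside (my addition, not from Jacob’s post), both measures are simple ratios over true/false positive and negative counts:

def precision(tp, fp):
    # of everything we labelled a speaker, how much really was one?
    return tp / float(tp + fp)

def recall(tp, fn):
    # of all the actual speakers, how many did we find?
    return tp / float(tp + fn)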

If (like me) you often get confused between false positives and negatives the following photo should help fix that:

[Photo: a mnemonic for telling false positives and false negatives apart]

I wrote the following function (adapted from Jacob’s blog post) to calculate precision and recall values for a given classifier:

import nltk
import collections

def assess_classifier(classifier, test_data, text):
    # refsets collects the actual labels, testsets the predicted ones,
    # each mapping a label to the set of test-row indexes carrying that label
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)
    for i, (feats, label) in enumerate(test_data):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    speaker_precision = nltk.metrics.precision(refsets[True], testsets[True])
    speaker_recall = nltk.metrics.recall(refsets[True], testsets[True])

    non_speaker_precision = nltk.metrics.precision(refsets[False], testsets[False])
    non_speaker_recall = nltk.metrics.recall(refsets[False], testsets[False])

    return [text, speaker_precision, speaker_recall, non_speaker_precision, non_speaker_recall]

Now let’s call that function with each of our classifiers:

import json
import nltk

from sklearn.cross_validation import train_test_split
from himymutil.ml import pos_features
from himymutil.naive import NaiveClassifier
from tabulate import tabulate
 
with open("data/import/trained_sentences.json", "r") as json_file:
    json_data = json.load(json_file)
 
tagged_sents = []
for sentence in json_data:
    tagged_sents.append([(word["word"], word["speaker"]) for word in sentence["words"]])
 
featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append( (pos_features(untagged_sent, i), tag) )
 
train_data,test_data = train_test_split(featuresets, test_size=0.20, train_size=0.80)
 
table = []
table.append(assess_classifier(NaiveClassifier(), test_data, "Naive"))
table.append(assess_classifier(nltk.NaiveBayesClassifier.train(train_data), test_data, "Naive Bayes"))
table.append(assess_classifier(nltk.DecisionTreeClassifier.train(train_data), test_data, "Decision Tree"))
 
print(tabulate(table, headers=["Classifier","speaker precision", "speaker recall", "non-speaker precision", "non-speaker recall"]))

I’m using the tabulate library to print out a table showing each of the classifiers and their associated value for precision and recall. If we execute this file we’ll see the following output:

$ python scripts/detect_speaker.py
Classifier       speaker precision    speaker recall    non-speaker precision    non-speaker recall
-------------  -------------------  ----------------  -----------------------  --------------------
Naive                     0.9625            0.846154                 0.994453              0.998806
Naive Bayes               0.674603          0.934066                 0.997579              0.983685
Decision Tree             0.965517          0.923077                 0.997219              0.998806

The naive classifier is good on most measures but makes some mistakes on speaker recall – we have 16% false negatives, i.e. 16% of words that should be classified as speakers aren’t.

Naive Bayes does poorly in terms of speaker false positives – 1/3 of the time when we say a word is a speaker it actually isn’t.

The decision tree performs best but has 8% speaker false negatives – 8% of words that should be classified as speakers aren’t.

The code is on github if you want to play around with it.

Categories: Programming

The Microsoft Story for the Cloud

How has the Cloud changed your world?

One of the ways we challenge people is to ask, do you want to move to the Cloud, use the Cloud, or be the Cloud?

But to answer that well, you need to really be grounded in your vision for the future and the role you want to play.

The Cloud creates a brave new world. It enables and powers the Digital Economy.

Businesses need to cross the Cloud chasm (and some don’t make it) in an effort to stay relevant and to be what’s next.

Businesses need to re-imagine themselves and explore the art of the possible.

Business leaders and IT leaders need to help others forge their way forward in the Digital Frontier.

And it all starts with a story.

A story that inspires the hearts and minds so people can wrap their head around the challenge and the change.

I think Satya tells the Microsoft story for the Cloud in a very simple and compelling way:

"We will reinvent productivity to empower every person and every organization on the planet to do more and achieve more." -- Satya Nadella, Microsoft CEO

That’s a pretty simple and yet pretty powerful and compelling story of why we do what we do.

It’s a great way to re-imagine and inspire our transformation to a productivity and platform company in a Mobile-first, Cloud-first world.   And, it’s a very simple story around productivity and empowerment that inspires and drives people in various roles and responsibilities to co-create the future in a profound way.

What is your simple story for how you re-imagine you or your business in a Mobile-First, Cloud-First world?

You Might Also Like

Business Scenarios for the Cloud

If You Want to Thrive at Microsoft

Microsoft Explained: Making Sense of the Microsoft Platform Story

Satya Nadella is All About Customer Focus, Employee Engagement, and Changing the World

Satya Nadella on The Future is Software

Satya Nadella on Everyone Has to Be a Leader

The Microsoft Story

Categories: Architecture, Programming

We'll see you at GDC 2015!

Android Developers Blog - Tue, 02/24/2015 - 21:05

Posted by Greg Hartrell, Senior Product Manager of Google Play Games

The Game Developers Conference (GDC) is less than one week away in San Francisco. This year we will host our annual Developer Day at West Hall and be on the Expo floor in booth #502. We’re excited to give you a glimpse into how we are helping mobile game developers build successful businesses and improve user experiences.

Our Developer Day will take place in Room 2006 of the West Hall of Moscone Center on Monday, March 2. We're keeping the content action-oriented with a few presentations and lightning talks, followed by a full afternoon of hands on hacking with Google engineers. Here’s a look at the schedule:

Opening Keynote || 10AM: We’ll kick off the day by sharing how to make your games more successful with Google. You’ll hear about new platforms, new tools to make development easier, and ways to measure your mobile games and monetize them.

Running A Successful Games Business with Google || 10:30AM: Next we’ll hear from Bob Meese, the Global Head of Games Business Development from Google Play, who’ll offer some key pointers on how to make sure you're best taking advantage of unique tools on Google Play to grow your business effectively.

Lightning Talks || 11:15AM: Ready to absorb all the opportunities Google has to offer your game business? These quick, 5-minute talks will cover everything from FlatBuffers to Google Cast to data interpolation. To keep us on track, a gong may be involved.

Code Labs || 1:30PM: After lunch, we’ll turn the room into a classroom setting where you can participate in a number of self-guided code labs focused on leveraging Analytics, Google Play game services, Firebase and VR with Cardboard. These Code Labs are completely self-paced and will be available throughout the afternoon. If you want admission to the code labs earlier, sign up for Priority Access here!

Also, be sure to check out the Google booth on the Expo floor to get hands on experiences with Project Tango, Niantic Labs and Cardboard starting on Wednesday, March 4. Our teams from AdMob, AdWords, Analytics, Cloud Platform and Firebase will also be available to answer any of your product questions.

For more information on our presence at GDC, including a full list of our talks and speaker details, please visit g.co/dev/gdc2015. Please note that these events are part of the official Game Developer's Conference, so you will need a pass to attend. If you can't attend GDC in person, you can still check out our morning talks on our livestream at g.co/dev/gdc-livestream.

Categories: Programming

Episode 221: Jez Humble on Continuous Delivery

Johannes Thönes interviews Jez Humble, senior vice president at Chef, about continuous delivery (CD). They discuss continuous delivery and how it was done on GoCD and on HP firmware; the benefits of continuous delivery for developers; Conway’s law and cross-functional teams; scary releases and non-scary releases; fix-forward, blue-green deployments, and A/B testing; origins of continuous […]
Categories: Programming