
Software Development Blogs: Programming, Software Testing, Agile Project Management

Methods & Tools


Feed aggregator

Python/pandas: Column value in list (ValueError: The truth value of a Series is ambiguous.)

Mark Needham - Mon, 02/16/2015 - 22:39

I’ve been using Python’s pandas library while exploring some CSV files and although for the most part I’ve found it intuitive to use, I had trouble filtering a data frame based on checking whether a column value was in a list.

A subset of one of the CSV files I’ve been working with looks like this:

$ cat foo.csv
"Foo"
1
2
3
4
5
6
7
8
9
10

Loading it into a pandas data frame is reasonably simple:

import pandas as pd
df = pd.read_csv('foo.csv', index_col=False, header=0)
>>> df
   Foo
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
9   10

If we want to find the rows which have a value of 1 we’d write the following:

>>> df[df["Foo"] == 1]
   Foo
0    1

Finding the rows with a value less than 7 is as you’d expect too:

>>> df[df["Foo"] < 7]
   Foo
0    1
1    2
2    3
3    4
4    5
5    6

Next I wanted to select the rows containing odd numbers, which I initially tried to do like this:

odds = [i for i in range(1,10) if i % 2 <> 0]
>>> odds
[1, 3, 5, 7, 9]
 
>>> df[df["Foo"] in odds]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/pandas/core/generic.py", line 698, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Unfortunately that doesn’t work and I couldn’t get any of the suggestions from the error message to work either. Luckily pandas has a special isin function for this use case which we can call like this:

>>> df[df["Foo"].isin(odds)]
   Foo
0    1
2    3
4    5
6    7
8    9

Much better!
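As an aside, if we wanted the complement – the rows whose value isn’t in the list – negating the boolean mask with the ~ operator should do the trick:

>>> df[~df["Foo"].isin(odds)]
   Foo
1    2
3    4
5    6
7    8
9   10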

Categories: Programming

When development resembles the ageing of wine

Xebia Blog - Mon, 02/16/2015 - 20:29

Once upon a time I was asked to help out a software product company. The management briefing went something like this: "We need you to increase productivity; the guys in development seem to be unable to ship anything! And if they do ship something, it's only a fraction of what we expected."

And so the story begins. There are many ways we can improve a team's outcome and its output (the first matters more), but it always starts with observing what they do today and trying to figure out why.

It turned out that requests from the business were treated like a good wine and allowed to "age" in the oak barrel called Jira. Not so much to add flavour in the form of details, requirements, designs, non-functional requirements or acceptance criteria, but mainly to see whether the priority of the request would remain stable over time.

In the days that followed I participated in the "Change Control Board" and saw the problem first-hand. Management would change priorities on the fly and make swift decisions on requirements that would take weeks to implement. To stay in vinotology terms, wine was poured in and out of the barrels at such a rate that it bore more resemblance to a blender than to the art of wine making.

Though management was happy to learn I had unearthed the root cause of their problem, they were less pleased to learn that they themselves were responsible. The Agile world created the Product Owner role for this, and it turns out that this is a hat that can only be worn by a single person.

Once we funnelled all the requests through a single person, responsible both for the success of the product and for the development, we saw a big change. Not only did the business get a reliable sparring partner, but the development team also had a single voice when it came to setting priorities. Once the team started finishing what they started, we began shipping at regular intervals, with features that we had all committed to.

Of course it did not take away the dynamics of the business, but it allowed us to deliver, and become reliable in how and when we responded to change. Perhaps not the most aged wine, but enough to delight our customers and learn what we should put in our barrel for the next round.

 

ScottGu Azure event in London on March 2nd

ScottGu's Blog - Scott Guthrie - Mon, 02/16/2015 - 19:16

On March 2nd I'm doing an Azure event in London that you can attend for free.  I'll be speaking for about 2.5 hours and will do an end-to-end walkthrough of Microsoft Azure, show off a bunch of demos of great new features/capabilities, and talk about some of the improvements coming out over the next few months.


You can sign up and attend the event for free (while tickets last - they are going fast).  If you are interested, sign up now.  The event is being held at the Mermaid Conference & Events Centre in Blackfriars, London:


Hope to see some of you there!

Scott

Categories: Architecture, Programming

Scaling Kim Kardashian to 100 Million Page Views

The team at PAPER estimated their article (NSFW) containing pictures of a very naked Kim Kardashian would quickly receive over 100 million page views. The very definition of bursty viral driven traffic.

As a comparison in 2013 it was estimated Google processed over 500 million searches a day. So a nude Kim Kardashian is worth one-fifth of a Google. Strangely, I can believe it.

How did they handle this traffic gold mine? A complete recounting of the unusual behind-the-scenes story is told by Paul Ford in How PAPER Magazine’s web engineers scaled their back-end for Kim Kardashian (SFW).  (BTW, only one butt pun was made intentionally in this story, all others are serendipity).

What can we learn from the experience? I know what you are thinking. This is just a single static page with a few static pictures. It’s not a complex problem like search or a social network. Shouldn’t any decent CDN be enough to handle that? And you would be correct, but that’s not the whole of the story:

Categories: Architecture

The Joel Test For Programmers (The Simple Programmer Test)

Making the Complex Simple - John Sonmez - Mon, 02/16/2015 - 17:00

A while back—the year 2000 to be exact—Joel Spolsky wrote a blog post entitled: “The Joel Test: 12 Steps to Better Code.” Many software engineers and developers use this test to evaluate a company and determine whether it is a good company to work for. In fact, many software development organizations use the Joel Test as a sort of self-test ... Read More

The post The Joel Test For Programmers (The Simple Programmer Test) appeared first on Simple Programmer.

Categories: Programming

Agile Misconceptions: There Is One Right Approach

I have an article up on agileconnection.com called Common Misconceptions about Agile: There Is Only One Approach.

If you read my Design Your Agile Project series, you know I am a fan of determining which approach works, and when, for your organization or project.

Please leave comments over there. Thanks!

Two notes:

  1. If you would like to write an article for agileconnection.com, I’m the technical editor. Send me your article and we can go from there.
  2. If you would like more common-sense approaches to agile, sign up for the Influential Agile Leader. We’re leading it in San Francisco and London this year. Early bird pricing ends soon.
Categories: Project Management


SPaMCAST 329 – Commitment, Message and Themes, HALT Testing

Software Process and Measurement Cast - Sun, 02/15/2015 - 23:00

This week’s Software Process and Measurement Cast is our magazine with three features.  We begin with Jo Ann Sweeney’s Explaining Change column.  In this column Jo Ann tackles the concepts of messages and themes.  I consider this the core of communication.  Visit Jo Ann’s website at http://www.sweeneycomms.com and let her know what you think of her column.

The middle segment is our essay on commitment.  The making and keeping of commitments are core components of both professional behavior and Agile. The simple definition of a commitment is a promise to perform. Whether Agile or Waterfall, commitments are used to manage software projects. Commitments drive the behavior of individuals, teams and organizations.  Commitments are powerful! 

We wrap this week’s podcast up with a new column from the Software Sensei, Kim Pries. In this installment Kim discusses software HALT testing.  HALT stands for highly accelerated life test.  The goal is to find defects, faults and things that go bump in the night in hours or days rather than waiting for weeks, months or years.  Whether you are testing software, hardware or some combination this is a concept you need to have in your portfolio.

Call to action!

Can you tell a friend about the podcast?  Even better, show them how you listen to the Software Process and Measurement Cast and subscribe them!  Send me the name of the person you subscribed and I will give both you and the horde you have converted to listeners a call out on the show.

Re-Read Saturday News

The next book in our Re-Read Saturday feature will be Eliyahu M. Goldratt and Jeff Cox’s The Goal: A Process of Ongoing Improvement. Originally published in 1984, it has been hugely influential because it introduced the Theory of Constraints, which is central to lean thinking. The book is written as a business novel. On February 21st we will begin the re-read on the Software Process and Measurement Blog.

Note: If you don’t have a copy of the book, buy one.  If you use the link below it will support the Software Process and Measurement blog and podcast.

Dead Tree Version or Kindle Version  

 

Next SPaMCAST

In the next Software Process and Measurement Cast we will feature our interview with Anthony Mersino, author of Emotional Intelligence for Project Managers and the newly published Agile Project Management.  Anthony and I talked about Agile, coaching and organizational change.  A wide-ranging interview that will help any leader raise the bar!

Shameless Ad for my book!

Mastering Software Project Management: Best Practices, Tools and Techniques co-authored by Murali Chematuri and myself and published by J. Ross Publishing. We have received unsolicited reviews like the following: “This book will prove that software projects should not be a tedious process, neither for you or your team.” Support SPaMCAST by buying the book here.

Available in English and Chinese

Categories: Process Management

Early Bird Ends Soon for Influential Agile Leader

If you lead the agile efforts in your organization, you should consider participating in The Influential Agile Leader. If you are working on how to transition to agile, how to talk about agile, or how to help your peers, managers, or teams, you want to participate.

Gil Broza and I designed it to be experiential and interactive. We’re leading the workshop in San Francisco, Mar 31-Apr 1. We’ll be in London April 14-15.

The early bird pricing ends Feb 20.

People who participate see great results, especially when they bring peers/managers from their organization. Sign up now.

Categories: Project Management

Python/scikit-learn: Calculating TF/IDF on How I met your mother transcripts

Mark Needham - Sun, 02/15/2015 - 16:56

Over the past few weeks I’ve been playing around with various NLP techniques to find interesting insights into How I met your mother from its transcripts, and one technique that kept coming up was TF/IDF.

The Wikipedia definition reads like this:

tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

It is often used as a weighting factor in information retrieval and text mining.

The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.

I wanted to generate a TF/IDF representation of phrases used in the hope that it would reveal some common themes used in the show.

Python’s scikit-learn library gives you two ways to generate the TF/IDF representation:

  1. Generate a matrix of token/phrase counts from a collection of text documents using CountVectorizer and feed it to TfidfTransformer to generate the TF/IDF representation.
  2. Feed the collection of text documents directly to TfidfVectorizer and go straight to the TF/IDF representation skipping the middle man.

I started out using the first approach and hadn’t quite got it working when I realised there was a much easier way!
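For reference, the first approach would have looked roughly like this. This is only a toy sketch on a couple of made-up documents (the docs list below isn’t data from this post) to show the two-step shape of it:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
 
# step 1: raw token/phrase counts; step 2: re-weight those counts as TF/IDF
docs = ["Python is cool", "Python is popular"]
counts = CountVectorizer(analyzer='word', ngram_range=(1,3)).fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)   # same shape as counts, TF/IDF weighted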

I have a collection of sentences in a CSV file so the first step is to convert those into a list of documents:

from collections import defaultdict
import csv
 
episodes = defaultdict(list)
with open("data/import/sentences.csv", "r") as sentences_file:
    reader = csv.reader(sentences_file, delimiter=',')
    reader.next()
    for row in reader:
        episodes[row[1]].append(row[4])
 
for episode_id, text in episodes.iteritems():
    episodes[episode_id] = "".join(text)
 
corpus = []
for id, episode in sorted(episodes.iteritems(), key=lambda t: int(t[0])):
    corpus.append(episode)

corpus contains 208 entries (1 per episode), each of which is a string containing the transcript of that episode. Next it’s time to train our TF/IDF model which is only a few lines of code:

from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df = 0, stop_words = 'english')

The most interesting parameter here is ngram_range – we’re telling it to generate 2 and 3 word phrases along with the single words from the corpus.

e.g. if we had the sentence “Python is cool” we’d end up with 6 phrases – ‘Python’, ‘is’, ‘cool’, ‘Python is’, ‘Python is cool’ and ‘is cool’.

Let’s execute the model against our corpus:

tfidf_matrix =  tf.fit_transform(corpus)
feature_names = tf.get_feature_names()
>>> len(feature_names)
498254
 
>>> feature_names[50:70]
[u'00 does sound', u'00 don', u'00 don buy', u'00 dressed', u'00 dressed blond', u'00 drunkenly', u'00 drunkenly slurred', u'00 fair', u'00 fair tonight', u'00 fall', u'00 fall foliage', u'00 far', u'00 far impossible', u'00 fart', u'00 fart sure', u'00 friends', u'00 friends singing', u'00 getting', u'00 getting guys', u'00 god']

So we’ve got nearly 500,000 phrases and if we look at tfidf_matrix we’d expect it to be a 208 x 498254 matrix – one row per episode, one column per phrase:

>>> tfidf_matrix
<208x498254 sparse matrix of type '<type 'numpy.float64'>'
	with 740396 stored elements in Compressed Sparse Row format>

This is what we’ve got although under the covers it’s using a sparse representation to save space. Let’s convert the matrix to dense format to explore further and find out why:

dense = tfidf_matrix.todense()
>>> len(dense[0].tolist()[0])
498254

What I’ve printed out here is the size of one row of the matrix which contains the TF/IDF score for every phrase in our corpus for the 1st episode of How I met your mother. A lot of those phrases won’t have happened in the 1st episode so let’s filter those out:

episode = dense[0].tolist()[0]
phrase_scores = [pair for pair in zip(range(0, len(episode)), episode) if pair[1] > 0]
 
>>> len(phrase_scores)
4823

There are just under 5000 phrases used in this episode, roughly 1% of the phrases in the whole corpus.
The sparse matrix makes a bit more sense – if scipy used a dense matrix representation there’d be 493,000 entries with no score which becomes more significant as the number of documents increases.
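As a quick aside, we can check just how sparse it is from the matrix’s nnz attribute – the fraction of populated cells should come out at roughly 0.7%:

>>> tfidf_matrix.nnz
740396
>>> round(tfidf_matrix.nnz / float(tfidf_matrix.shape[0] * tfidf_matrix.shape[1]), 4)
0.0071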

Next we’ll sort the phrases by score in descending order to find the most interesting phrases for the first episode of How I met your mother:

>>> sorted(phrase_scores, key=lambda t: t[1] * -1)[:5]
[(419207, 0.2625177493269755), (312591, 0.19571419072701732), (267538, 0.15551468983363487), (490429, 0.15227880637176266), (356632, 0.1304175242341549)]

The first value in each tuple is the phrase’s position in our initial vector and also corresponds to the phrase’s position in feature_names which allows us to map the scores back to phrases. Let’s look up a couple of phrases:

>>> feature_names[419207]
u'ted'
>>> feature_names[312591]
u'olives'
>>> feature_names[356632]
u'robin'

Let’s automate that lookup:

sorted_phrase_scores = sorted(phrase_scores, key=lambda t: t[1] * -1)
for phrase, score in [(feature_names[word_id], score) for (word_id, score) in sorted_phrase_scores][:20]:
   print('{0: <20} {1}'.format(phrase, score))
 
ted                  0.262517749327
olives               0.195714190727
marshall             0.155514689834
yasmine              0.152278806372
robin                0.130417524234
barney               0.124411751867
lily                 0.122924977859
signal               0.103793246466
goanna               0.0981379875009
scene                0.0953423604123
cut                  0.0917336653574
narrator             0.0864622981985
flashback            0.078295921554
flashback date       0.0702825260177
ranjit               0.0693927691559
flashback date robin 0.0585687716814
ted yasmine          0.0585687716814
carl                 0.0582101172888
eye patch            0.0543650529797
lebanese             0.0543650529797

We see all the main characters’ names, which aren’t that interesting – perhaps they should be added to the stop list – but also ‘olives’, which is where the olive theory is first mentioned. I thought olives came up more often, but a quick search for the term suggests it isn’t mentioned again until Episode 9 of Season 9:

$ grep -rni --color "olives" data/import/sentences.csv | cut -d, -f 2,3,4 | sort | uniq -c
  16 1,1,1
   3 193,9,9

‘yasmine’ is also an interesting phrase in this episode but she’s never mentioned again:

$ grep -h -rni --color "yasmine" data/import/sentences.csv
49:48,1,1,1,"Barney: (Taps a woman names Yasmine) Hi, have you met Ted? (Leaves and watches from a distance)."
50:49,1,1,1,"Ted: (To Yasmine) Hi, I'm Ted."
51:50,1,1,1,Yasmine: Yasmine.
53:52,1,1,1,"Yasmine: Thanks, It's Lebanese."
65:64,1,1,1,"[Cut to the bar, Ted is chatting with Yasmine]"
67:66,1,1,1,Yasmine: So do you think you'll ever get married?
68:67,1,1,1,"Ted: Well maybe eventually. Some fall day. Possibly in Central Park. Simple ceremony, we'll write our own vows. But--eh--no DJ, people will dance. I'm not going to worry about it! Damn it, why did Marshall have to get engaged? (Yasmine laughs) Yeah, nothing hotter than a guy planning out his own imaginary wedding, huh?"
69:68,1,1,1,"Yasmine: Actually, I think it's cute."
79:78,1,1,1,"Lily: You are unbelievable, Marshall. No-(Scene splits in half and shows both Lily and Marshall on top arguing and Ted and Yasmine on the bottom mingling)"
82:81,1,1,1,Ted: (To Yasmine) you wanna go out sometime?
85:84,1,1,1,[Cut to Scene with Ted and Yasmine at bar]
86:85,1,1,1,Yasmine: I'm sorry; Carl's my boyfriend (points to bartender)

It would be interesting to filter out the phrases which don’t occur in any other episode and see what insights we get from doing that. For now though we’ll extract phrases for all episodes and write to CSV so we can explore more easily:

with open("data/import/tfidf_scikit.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["EpisodeId", "Phrase", "Score"])
 
    doc_id = 0
    for doc in tfidf_matrix.todense():
        print "Document %d" %(doc_id)
        word_id = 0
        for score in doc.tolist()[0]:
            if score > 0:
                word = feature_names[word_id]
                writer.writerow([doc_id+1, word.encode("utf-8"), score])
            word_id +=1
        doc_id +=1

And finally a quick look at the contents of the CSV:

$ tail -n 10 data/import/tfidf_scikit.csv
208,york apparently laughs,0.012174304095213192
208,york aren,0.012174304095213192
208,york aren supposed,0.012174304095213192
208,young,0.013397275854758335
208,young ladies,0.012174304095213192
208,young ladies need,0.012174304095213192
208,young man,0.008437685963000223
208,young man game,0.012174304095213192
208,young stupid,0.011506395106658192
208,young stupid sighs,0.012174304095213192
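Coming back to the earlier idea of filtering out the phrases which occur in other episodes, a rough sketch along these lines should do it – it reuses tfidf_matrix and feature_names from above and I haven’t explored the results yet:

import numpy as np
 
# number of episodes each phrase appears in (count of non-zero scores per column)
episodes_per_phrase = (tfidf_matrix > 0).sum(axis=0).A1
 
# phrases that score in the first episode and appear in no other episode
first_episode = tfidf_matrix[0].toarray()[0]
unique_phrases = [(feature_names[i], first_episode[i])
                  for i in np.nonzero(first_episode)[0]
                  if episodes_per_phrase[i] == 1]
 
for phrase, score in sorted(unique_phrases, key=lambda t: -t[1])[:10]:
    print '{0: <20} {1}'.format(phrase.encode("utf-8"), score)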
Categories: Programming

Diamond Kata - Some Thoughts on Tests as Documentation

Mistaeks I Hav Made - Nat Pryce - Sun, 02/15/2015 - 13:13
Comparing example-based tests and property-based tests for the Diamond Kata, I’m struck by how well property-based tests reduce duplication of test code. For example, in the solutions by Sandro Mancuso and George Dinwiddie, not only do multiple tests exercise the same property with different examples, but the tests duplicate assertions. Property-based tests avoid the former by defining generators of input data, but I’m not sure why the latter occurs. Perhaps Seb’s “test recycling” approach would avoid this kind of duplication.

Compared to example-based tests, though, property-based tests do not work so well as an explanatory overview. Examples convey an overall impression of what the functionality is, but are not good at describing precise details. When reading example-based tests, you have to infer the properties of the code from multiple examples and informal text in identifiers and comments. The property-based tests I wrote for the Diamond Kata specify precise properties of the diamond function, but nowhere is there a test that describes that the function draws a diamond!

There’s a place for both examples and properties. It’s not an either/or decision. However, explanatory examples used for documentation need not be test inputs. If we’re generating inputs for property tests and generating documentation for our software, we can combine the two, and insert generated inputs and calculated outputs into generated documentation.
Categories: Programming, Testing & QA

Re-Read Saturday . . . And The Readers Have Spoken


The next book in our Re-Read Saturday feature will be Eliyahu M. Goldratt and Jeff Cox’s The Goal: A Process of Ongoing Improvement. Originally published in 1984, it has been hugely influential because it introduced the Theory of Constraints, which is central to lean thinking. The book is written as a business novel. On February 21st we will begin the re-read.

Note: If you don’t have a copy of the book, buy one.  If you use the link below it will support the Software Process and Measurement blog and podcast. Dead Tree Version or Kindle Version 

For the record, the top five books in the overall voting were:

  1. The Goal: A Process of Ongoing Improvement – Eliyahu M. Goldratt and Jeff Cox 71%
  2. Checklist Manifesto: How to Get Things Done Right – Atul Gawande 43%
  3. Three tied:
    The Principles of Product Development Flow – Donald G. Reinertsen 8.57%
    The Art of Software Testing – Glenford J. Myers, Cory Sandler and Tom Badgett 8.57%
    The Lean Startup: How Today’s Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses – Eric Ries 8.57%

I was asked on LinkedIn for a list of the other books that we have featured in the Re-Read Saturday series. Here they are:

7 Habits of Highly Effective People – Stephen Covey

Dr. Covey lays out seven behaviors of successful people (hence the title).  The book is based on observation, interviews and research; therefore the habits presented in the book not only make common sense, but also have a solid evidentiary basis. One of the reasons the book works is the integration of character and ethics into the principles.  I have written and podcasted on the importance and value of character and ethics in the IT environment many times.

Note: If you don’t have a copy of the book, buy one (I would loan you mine, but I suspect I will read it again).  If you use the link below it will support the Software Process and Measurement blog and podcast. Dead Tree Version Kindle Version

The re-read blog entries:

The audio podcast can be listened to HERE

Leading Change – John P. Kotter

Leading Change by John P. Kotter, originally published in 1996, has become a classic reference that most process improvement specialists either have or should have on their bookshelf. The core of the book lays out an eight-step model for effective change that anyone involved in change will find useful. However there is more to the book than just the model.

Note: If you don’t have a copy of the book, buy one.  If you use the link below it will support the Software Process and Measurement blog and podcast. Dead Tree Version 

Entries in the Re-Read are:

I have not compiled the entries into a single essay and podcast as of February 2015.


Categories: Process Management

The Great Love Quotes Collection Revamped

A while back I put together a comprehensive collection of love quotes.   It’s a combination of the wisdom of the ages + modern sages.   In the spirit of Valentine’s Day, I gave it a good revamp.  Here it is:

The Great Love Quotes Collection

It's a serious collection of love quotes and includes lessons from the likes of Lucille Ball, Shakespeare, Socrates, and even The Princess Bride.

How I Organized the Categories for Love Quotes

I organized the quotes into a set of buckets:
Beauty
Broken Hearts and Loss
Falling in Love
Fear and Love
Fun and Love
Kissing
Love and Life
Significance and Meaning
The Power of Love
True Love

I think there’s a little something for everyone among the various buckets.   If you walk away with three new quotes that make you feel a little lighter, put a little skip in your step, or help you see love in a new light, then mission accomplished.

Think of Love as Warmth and Connection

If you think of love like warmth and connection, you can create more micro-moments of love in your life.

This might not seem like a big deal, but if you knew all the benefits for your heart, brain, bodily processes, and even your life span, you might think twice.

You might be surprised by how much your career can be limited if you don’t balance connection with conviction.  It’s not uncommon to hear about turning points in the careers of developers, program managers, IT leaders, and business leaders who changed their game when they changed their heart.

In fact, on one of the teams I was on, the original mantra was “business before technology”, but people in the halls started to say, “people before business, business before technology” to remind people of what makes business go round.

When people treat each other better, work and life get better.

Love Quotes Help with Insights and Actions

Here are a few of my favorite love quotes from the collection …

“Love is like heaven, but it can hurt like hell.” – Unknown

“Love is not a feeling, it’s an ability.” — Dan in Real Life

“There is a place you can touch a woman that will drive her crazy. Her heart.” — Milk Money

“Hearts will be practical only when they are made unbreakable.”  – The Wizard of Oz

“Things are beautiful if you love them.” – Jean Anouilh

“Life is messy. Love is messier.” – Catch and Release

“To the world you may be just one person, but to one person you may be the world.” – Unknown

For many more quotes, explore The Great Love Quotes Collection.

You Might Also Like

Happiness Quotes Revamped

My Story of Personal Transformation

The Great Leadership Quotes Collection Revamped

The Great Personal Development Quotes Collection Revamped

The Great Productivity Quotes Collection

Categories: Architecture, Programming

Neo4j: Building a topic graph with Prismatic Interest Graph API

Mark Needham - Sat, 02/14/2015 - 00:38

Over the last few weeks I’ve been using various NLP libraries to derive topics for my corpus of How I met your mother episodes, without success, and was therefore enthused to see the release of Prismatic’s Interest Graph API.

The Interest Graph API exposes a web service to which you feed a block of text and get back a set of topics with associated scores.

It has been trained over the last few years with millions of articles that people share on their social media accounts and in my experience using Prismatic the topics have been very useful for finding new material to read.

The first step is to head to interest-graph.getprismatic.com and get an API key which will be emailed to you.

Having done that we’re ready to make some calls to the API and get back some topics.

I’m going to use Python to call the API and I’ve found the requests library the easiest library to use for this type of work. Our call to the API looks like this:

import requests
payload = { 'title': "insert title of article here",
            'body': "insert body of text here",
            'api-token': "insert token sent by email here"}
r = requests.post("http://interest-graph.getprismatic.com/text/topic", data=payload)

One thing to keep in mind is that the API is rate limited to 20 requests a second so we need to restrict our requests or we’re going to receive error response codes. Luckily I came across an excellent blog post showing how to write a decorator around a function and only allow it to execute at a certain frequency.

To rate limit our calls to the Interest Graph we need to pull the above code into a function and annotate it appropriately:

import time
 
def RateLimited(maxPerSecond):
    minInterval = 1.0 / float(maxPerSecond)
    def decorate(func):
        lastTimeCalled = [0.0]
        def rateLimitedFunction(*args,**kargs):
            elapsed = time.clock() - lastTimeCalled[0]
            leftToWait = minInterval - elapsed
            if leftToWait>0:
                time.sleep(leftToWait)
            ret = func(*args,**kargs)
            lastTimeCalled[0] = time.clock()
            return ret
        return rateLimitedFunction
    return decorate
 
@RateLimited(0.3)
def topics(title, body):
    payload = { 'title': title,
                'body': body,
                'api-token': "insert token sent by email here"}
    r = requests.post("http://interest-graph.getprismatic.com/text/topic", data=payload)
    return r

The text I want to classify is stored in a CSV file – one sentence per line. Here’s a sample:

$ head -n 10 data/import/sentences.csv
SentenceId,EpisodeId,Season,Episode,Sentence
1,1,1,1,Pilot
2,1,1,1,Scene One
3,1,1,1,[Title: The Year 2030]
4,1,1,1,"Narrator: Kids, I'm going to tell you an incredible story. The story of how I met your mother"
5,1,1,1,Son: Are we being punished for something?
6,1,1,1,Narrator: No
7,1,1,1,"Daughter: Yeah, is this going to take a while?"
8,1,1,1,"Narrator: Yes. (Kids are annoyed) Twenty-five years ago, before I was dad, I had this whole other life."
9,1,1,1,"(Music Plays, Title ""How I Met Your Mother"" appears)"

We’ll also need to refer to another CSV file to get the title of each episode since it isn’t being stored with the sentence:

$ head -n 10 data/import/episodes_full.csv
NumberOverall,NumberInSeason,Episode,Season,DateAired,Timestamp,Title,Director,Viewers,Writers,Rating
1,1,/wiki/Pilot,1,"September 19, 2005",1127084400,Pilot,Pamela Fryman,10.94,"Carter Bays,Craig Thomas",68
2,2,/wiki/Purple_Giraffe,1,"September 26, 2005",1127689200,Purple Giraffe,Pamela Fryman,10.40,"Carter Bays,Craig Thomas",63
3,3,/wiki/Sweet_Taste_of_Liberty,1,"October 3, 2005",1128294000,Sweet Taste of Liberty,Pamela Fryman,10.44,"Phil Lord,Chris Miller",67
4,4,/wiki/Return_of_the_Shirt,1,"October 10, 2005",1128898800,Return of the Shirt,Pamela Fryman,9.84,Kourtney Kang,59
5,5,/wiki/Okay_Awesome,1,"October 17, 2005",1129503600,Okay Awesome,Pamela Fryman,10.14,Chris Harris,53
6,6,/wiki/Slutty_Pumpkin,1,"October 24, 2005",1130108400,Slutty Pumpkin,Pamela Fryman,10.89,Brenda Hsueh,62
7,7,/wiki/Matchmaker,1,"November 7, 2005",1131321600,Matchmaker,Pamela Fryman,10.55,"Sam Johnson,Chris Marcil",57
8,8,/wiki/The_Duel,1,"November 14, 2005",1131926400,The Duel,Pamela Fryman,10.35,Gloria Calderon Kellett,46
9,9,/wiki/Belly_Full_of_Turkey,1,"November 21, 2005",1132531200,Belly Full of Turkey,Pamela Fryman,10.29,"Phil Lord,Chris Miller",60

Now we need to get our episode titles and transcripts ready to pass to the topics function. Since we’ve only got ~ 200 episodes we can create a dictionary to store that data:

episodes = {}
with open("data/import/episodes_full.csv", "r") as episodesfile:
    episodes_reader = csv.reader(episodesfile, delimiter=",")
    episodes_reader.next()
    for episode in episodes_reader:
        episodes[int(episode[0])] = {"title": episode[6], "sentences" : [] }
 
with open("data/import/sentences.csv", "r") as sentencesfile:
     sentences_reader = csv.reader(sentencesfile, delimiter=",")
     sentences_reader.next()
     for sentence in sentences_reader:
         episodes[int(sentence[1])]["sentences"].append(sentence[4])
 
>>> episodes[1]["title"]
'Pilot'
>>> episodes[1]["sentences"][:5]
['Pilot', 'Scene One', '[Title: The Year 2030]', "Narrator: Kids, I'm going to tell you an incredible story. The story of how I met your mother", 'Son: Are we being punished for something?']

Now we’re going to loop through each of the episodes, call topics and write the result into a CSV file so we can load it into Neo4j afterwards to explore the data:

import json
 
with open("data/import/topics.csv", "w") as topicsfile:
    topics_writer = csv.writer(topicsfile, delimiter=",")
    topics_writer.writerow(["EpisodeId", "TopicId", "Topic", "Score"])
 
    for episode_id, episode in episodes.iteritems():
        tmp = topics(episode["title"], "".join(episode["sentences"])).json()
        print episode_id, tmp
        for topic in tmp['topics']:
            topics_writer.writerow([episode_id, topic["id"], topic["topic"], topic["score"]])

It takes about 10 minutes to run and this is a sample of the output:

$ head -n 10 data/import/topics.csv
EpisodeId,TopicId,Topic,Score
1,1519,Fiction,0.5798245566455255
1,2015,Humour,0.565154963605359
1,24031,Laughing,0.5587120401021765
1,16693,Flirting,0.5514098189505282
1,1163,Dating and Courtship,0.5487490108554022
1,2386,Kissing,0.5476185929151934
1,31929,Puns,0.5375100569837977
2,24031,Laughing,0.5670926949850333
2,1519,Fiction,0.5396488295397263

We’ll use Neo4j’s LOAD CSV command to load the data in:

// make sure the topics exist
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/topics.csv" AS row
MERGE (topic:Topic {id: TOINT(row.TopicId)})
ON CREATE SET topic.value = row.Topic
// now link the episodes and topics
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/topics.csv" AS row
MATCH (topic:Topic {id: TOINT(row.TopicId)})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (episode)-[:TOPIC {score: TOFLOAT(row.Score)}]->(topic)

We’ll assume that the episodes and seasons are already loaded – the commands to load those in are on github.

We can now write some queries against our topic graph. We’ll start simple – show me the topics for an episode:

MATCH (episode:Episode {id: 1})-[r:TOPIC]->(topic)
RETURN topic, r


Let’s say we liked the ‘Puns’ aspect of the Pilot episode and want to find out which other episodes had puns. The following query would let us find those:

MATCH (episode:Episode {id: 1})-[r:TOPIC]->(topic {value: "Puns"})<-[:TOPIC]-(other)
RETURN episode, topic, other


Or maybe we want to find the episode which has the most topics in common:

MATCH (episode:Episode {id: 1})-[:TOPIC]->(topic),
      (topic)<-[r:TOPIC]-(otherEpisode)
RETURN otherEpisode.title as episode, COUNT(r) AS topicsInCommon
ORDER BY topicsInCommon DESC
LIMIT 10
==> +------------------------------------------------+
==> | episode                       | topicsInCommon |
==> +------------------------------------------------+
==> | "Purple Giraffe"              | 6              |
==> | "Ten Sessions"                | 5              |
==> | "Farhampton"                  | 4              |
==> | "The Three Days Rule"         | 4              |
==> | "How I Met Everyone Else"     | 4              |
==> | "The Time Travelers"          | 4              |
==> | "Mary the Paralegal"          | 4              |
==> | "Lobster Crawl"               | 4              |
==> | "The Magician's Code, Part 2" | 4              |
==> | "Slutty Pumpkin"              | 4              |
==> +------------------------------------------------+
==> 10 rows

We could then tweak that query to get the names of those topics:

MATCH (episode:Episode {id: 1})-[:TOPIC]->(topic),
      (topic)<-[r:TOPIC]-(otherEpisode)-[:IN_SEASON]->(season)
RETURN otherEpisode.title as episode, season.number AS season, COUNT(r) AS topicsInCommon, COLLECT(topic.value)
ORDER BY topicsInCommon DESC
LIMIT 10
 
==> +-----------------------------------------------------------------------------------------------------------------------------------+
==> | episode                   | season | topicsInCommon | COLLECT(topic.value)                                                        |
==> +-----------------------------------------------------------------------------------------------------------------------------------+
==> | "Purple Giraffe"          | "1"    | 6              | ["Humour","Fiction","Kissing","Dating and Courtship","Flirting","Laughing"] |
==> | "Ten Sessions"            | "3"    | 5              | ["Humour","Puns","Dating and Courtship","Flirting","Laughing"]              |
==> | "How I Met Everyone Else" | "3"    | 4              | ["Humour","Fiction","Dating and Courtship","Laughing"]                      |
==> | "Farhampton"              | "8"    | 4              | ["Humour","Fiction","Kissing","Dating and Courtship"]                       |
==> | "Bedtime Stories"         | "9"    | 4              | ["Humour","Puns","Dating and Courtship","Laughing"]                         |
==> | "Definitions"             | "5"    | 4              | ["Kissing","Dating and Courtship","Flirting","Laughing"]                    |
==> | "Lobster Crawl"           | "8"    | 4              | ["Humour","Dating and Courtship","Flirting","Laughing"]                     |
==> | "Little Boys"             | "3"    | 4              | ["Humour","Puns","Dating and Courtship","Laughing"]                         |
==> | "Wait for It"             | "3"    | 4              | ["Fiction","Puns","Flirting","Laughing"]                                    |
==> | "Mary the Paralegal"      | "1"    | 4              | ["Humour","Dating and Courtship","Flirting","Laughing"]                     |
==> +-----------------------------------------------------------------------------------------------------------------------------------+

Overall 168 (out of 208) of the other episodes have a topic in common with the first episode so perhaps just having a topic in common isn’t the best indication of similarity.

An interesting next step would be to calculate cosine or jaccard similarity between the episodes and store that value in the graph for querying later on.
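As a rough sketch of that idea – not something I’ve run against the data yet – the Jaccard similarity between the first episode and the others could be calculated straight from topics.csv before being written back into the graph:

from collections import defaultdict
import csv
 
# build a set of topic ids per episode from the topics.csv file we generated earlier
topics_by_episode = defaultdict(set)
with open("data/import/topics.csv", "r") as topicsfile:
    reader = csv.reader(topicsfile, delimiter=",")
    reader.next()
    for episode_id, topic_id, topic, score in reader:
        topics_by_episode[int(episode_id)].add(topic_id)
 
first = topics_by_episode[1]
similarities = []
for episode_id, topics_set in topics_by_episode.iteritems():
    if episode_id == 1:
        continue
    jaccard = len(first & topics_set) / float(len(first | topics_set))
    similarities.append((episode_id, jaccard))
 
for episode_id, jaccard in sorted(similarities, key=lambda t: -t[1])[:10]:
    print episode_id, jaccard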

I’ve also calculated the most common bigrams across all the transcripts, so it would be interesting to see whether there are any insights at the intersection of episodes, topics and phrases.

Categories: Programming

Real Life Sources of Empirical Data for Project Estimates

Herding Cats - Glen Alleman - Fri, 02/13/2015 - 19:39

In the #NoEstimates conversation, the term empirical data is used as a substitute for Not Estimating. This notion of No Estimates - that is, making decisions (about the future) with no estimates - is oxymoronic, since gathering data and making decisions about the future from empirical data is actually estimating.

But that aside for the moment, the examples in the No Estimates community of empirical data are woefully inadequate for any credible decision making. Using 22 or so data samples with a ±30 variance to forecast future outcomes when spending other people's money doesn't pass the smell test where I work.

Here are some sources of actual data for IT projects that can be used to build Reference Classes with better statistics.

The current issue of ORMS Today has resources as well; ORMS can be obtained for free. There are several professional societies that provide guidance for estimating; two of them are societies I participate in.

As well, I have a colleague, Mario Vanhoucke, who speaks at our Earned Value Management conferences and whose graduate students do research on project performance management. A recent paper, "Construction and Evaluation of Frameworks for Real Life Project Database," is a good source on how to apply empirical data to making estimates of future outcomes. Mario teaches Economics and Business Administration at Ghent University and is a founder of OR-AS.

All of this is to say that using empirical data is necessary but not sufficient, especially when the data being used is too small a sample, statistically unstable, or at a minimum has broad statistical variances. To be sufficient, we need a few more things:

  • The correlations between the data samples as they evolve in time. This is Time Series Analysis.
  • Sample sizes sufficient to draw variance assessments of the future outcomes.
  • A broader Reference Class basis than just the small number of samples in the current work stream. These small samples can be useful IF the future work represents the same class of work. This would imply the project itself is straightforward, has little emergent risk (reducible or irreducible), and we're confident not much is going to change. Without those assumptions the statistics from those 20 or so samples should not be used.

What's Next?

Starting with empirical samples to make estimates of future outcomes is called estimating. Labeling it as No Estimates seems a bit odd at best. 

With the basic understanding that empirical data is needed for any credible estimating process, look further into the principles and practices of probabilistic estimating for project work. 

This, hopefully, will result in an understanding of sample size calculations to determine the confidence in the forecast as a start. 
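As a purely illustrative sketch of the kind of calculation involved (the throughput numbers below are made up), here is a confidence interval on mean throughput from a handful of samples, and the sample size needed to hit a target margin of error:

import math
 
# hypothetical throughput samples (say, story points per iteration) - illustrative only
samples = [18, 24, 15, 30, 22, 19, 27, 21, 16, 25]
 
n = len(samples)
mean = sum(samples) / float(n)
variance = sum((x - mean) ** 2 for x in samples) / (n - 1)
std_error = math.sqrt(variance) / math.sqrt(n)
 
# approximate 80% confidence interval for the mean (z of about 1.28 for 80% two-sided)
z = 1.28
print "mean = %.1f, 80%% CI = %.1f to %.1f" % (mean, mean - z * std_error, mean + z * std_error)
 
# sample size needed for a +/-10% margin of error around the mean at the same confidence
target_margin = 0.10 * mean
required_n = (z * math.sqrt(variance) / target_margin) ** 2
print "samples needed for a +/-10%% margin: %d" % math.ceil(required_n)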

Related articles: What does it mean when we say 80% confidence in a number? | Intellectual Honesty of Managing in the Presence of Uncertainty | Bayesian Statistics and Project Work | The Actual Science in Management Science | A Theory of Speculation
Categories: Project Management

Stuff The Internet Says On Scalability For February 13th, 2015

Hey, it's HighScalability time:


Stunning depiction of every space mission over the past 50 years. (Max Roser)
  • 700 billion: Apple's valuation; 1: Number of lines of code it takes to bring down UK air traffic control; 20: how old that line of code was in years; 2: the problem was of course a never been seen before double failure; 1: atom-thin silicon transistors may mean super-fast computing; a few: how many data points it takes to identify you
  • Quotable Quotes:
    • @EpicureanDeal: The Uber model is everywhere: using the internet to connect infrequent, consumers of ad hoc, spot market services with fragmented suppliers.
    • @awendt: “I’m sorry you learned about transactions and joins in college, but you’ll have to de-normalize for #microservices” – @adrianco #microxchg
    • @samnewman: @adrianco "JSON was 10x faster than XML. Protobufs 10x faster than JSON. Avro same speed as Protobufs, but half the size"
    • @RichardWarburto: Premature Optimization isn't the root of all evil: misunderstood domain models are.
    • @MichaelPisula: Says @ewolff at #microxchg: start big with your microservices, splitting is easier than joining and your architecture will be wrong anyway
    • @MJFKlewitz: "With vertical scaling the problem is you end up giving a lot of money to Larry Ellison" #greatquote @crichardson #microxchg
    • Jenny Rood: Species of ants which differ in size can coexist peacefully, but the insects will chase away similarly sized competing species.
    • Steven Levy: The nonlinear gains that Moore predicted are so mind-bending that it is no wonder that very few were able to bend their minds around it.
    • ntoshev: There seems to be a fundamental trade-off between latency and throughput, with stream processors optimizing for latency and batch processors optimizing for throughput.
    • Sam Altman: Nobody cares if you’re using an Intel Edison or a 555 to blink the LED in the prototype you show them: people care about whether you’ve made something that they want.
    • @swardley: 30-50 years from genesis to industrialisation is about the average these days
    • Alex Clemmer: 84% of a single-threaded 1KB write in Redis is spent in the kernel
    • @allspaw: Psst: while lots of folks hope for fully "autonomous" tech to solve all the world's ills, I'll just be over here getting some work done.
    • @alejandrocrosa: “The database you read from is just a cached view of the event log”
    • @viktorklang: Optimizing for latency (as in "time to serve") will also yield higher throughput. Thank you, Mr Little.
    • rakoo: using GOMAXPROCS doesn't automagically turn your program into a parallel one
    • fluidcruft: Data science manifesto: The purpose of computing is numbers.

  • Is the golden age of the cheap startup over? The Rising Costs Of Scaling A Startup. In San Francisco it is. Twice as expensive in 2014 as it was in 2009. Wages have doubled. Op-ex has doubled. And thus startup round sizes have increased. People and place costs dwarf compute infrastructure cost savings.

  • Just in case you are of the fashionable opinion Perl code must look like line noise, take a gander: Real measurement is hard. Nice, eh?

  • Magic tricks for algorithms. This may prove helpful in your new job as Algorithm Profiler...Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images:  It is possible to produce images totally unrecognizable to human eyes that DNNs believe with near certainty are familiar objects. 

  • Here's the rare Docker nocker. Many reasons why you should stop using Docker. One side: Idea good, implementation leaves something to be desired. Other side: it tastes great and is less filling. Interesting from dacjames: In general, I agree 100%. With the case of Docker, the concept of containerization is more important than the current project. Decoupling the application environment from the infrastructure environment is an immensely valuable paradigm.

  • Bounty Hunter. A job title that conjures up romantic images and dreams of the never was. You can still be one in the digital age. And make some money too. 11 Essential Bug Bounty Programs of 2015. Hundreds of thousands of dollars are available. Good hunting.

  • When algorithms rule the world you are just one weighting factor away from insignificance. Apple, Apps and Algorithmic Glitches. Divination used to be how we attempted to control the future. Now we attempt to penetrate to the unknowable heart of opaque algorithms with something different...data.

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge (which means this post has many more items to read so please keep on reading)...

Categories: Architecture

Beta Channel for the Android WebView

Android Developers Blog - Fri, 02/13/2015 - 16:59

Posted by Richard Coles, Software Engineer, Google London

Many Android apps use a WebView for displaying HTML content. In Android 5.0 Lollipop, Google has the ability to update WebView independently of the Android platform. Beginning today, developers can use a new beta channel to test the latest version of WebView and provide feedback.

WebView updates bring numerous bug fixes, new web platform APIs and updates from Chromium. If you’re making use of the WebView in your app, becoming a beta channel tester will give you an early start with new APIs as well as the chance to test your app before the WebView rolls out to your users.

The first version offered in the beta channel will be based on Chrome 40 and you can find a full list of changes on the chromium blog entry.

To become a beta tester, join the community which will enable you to sign up for the Beta program; you’ll then be able to install the beta version of the WebView via the Play Store. If you find any bugs, please file them on the Chromium issue tracker.

Join the discussion on

+Android Developers
Categories: Programming

Who can be Agile?

Transformation is possible if you really want it.

I am often asked which projects or organizations should use Agile. Underlying the question is a feeling that there is some formula that indicates which types of projects or organizations should use Agile and which should not. I feel the answer is very simple. Since Agile is at heart a philosophy, any project or organization that wants to embrace Agile can use Agile. The most important word in that statement is “wants”. We often say we want something, but that statement doesn’t translate to action — for many reasons. For example, I “want” to learn Ruby. I have even gone as far as to put a card into the backlog of my personal Kanban board. However, I have not found the time to begin the process. Finding the time would require that I change my current behavior (get up earlier) or abandon another project. Really wanting to become Agile means making changes both to how organizations and teams work and how they interact. As we all know, organizational change is typically not easy. The recent re-read of Kotter’s Leading Change recounted and discussed a framework for generating major organizational change. All such changes require focus, effort and constancy of purpose. A change to embrace Agile requires adopting the philosophy of Agile, abandoning much of the trappings of command and control, and thinking more in terms of products than projects.

Agile is a Philosophy – the Agile Manifesto comprises four values and twelve principles that create a philosophy for structuring and managing work. The focus is on the human side of delivering value. There are innumerable frameworks, methods and techniques for implementing that philosophy. For example, while Scrum is the most popular framework being used in Agile projects, it is not the only method. Crystal, DSDM and XP are a few of the other common frameworks. How any of these frameworks is implemented makes them more or less Agile.

Abandoning Command and Control – Effective Agile teams are self-organizing and self-managing. Self-managing teams plan and manage their day-to-day activities with little overt direction and supervision. Command and control management techniques, which hit their heyday in the 1950’s and 60’s and are still common, make the assumption that managers will assign and direct individuals (read that as tell them what to do). Command and control management strips much of the team’s ability to quickly make decisions and to react to change at a tactical level, which slows progress and reduces productivity.

Taking A Product Perspective – Agile embraces short iterative cycles of development that deliver value continuously or in the form of frequent releases. Techniques such as gathering needs into a backlog and involving product owners in planning and prioritizing needs over several sprints and releases are a reflection of product thinking. This type of thinking fosters trust between the business and IT so that the onus is not on the business to think of everything they need at once. Projects, as opposed to products, have a discernible beginning and end and a known scope or budget, which forces users to try to maximize the functions they ask for from the project team. The incentive is for the business to ask for everything and to make everything the top priority.

Who Can be Agile? Unfortunately any team or organization can say they are Agile. Any team can begin the day with a stand-up meeting or do a demo or retrospective. However, just using a technique or framework does not make anyone Agile. Anyone can be Agile, but only if they want to be Agile enough to make changes in how they think and how they organize their work. Being Agile is the only way to get all of the benefits in customer satisfaction, quality and productivity that organizations are seeking when they say they “want” to be Agile.


Categories: Process Management

Python/gensim: Creating bigrams over How I met your mother transcripts

Mark Needham - Fri, 02/13/2015 - 00:45

As part of my continued playing around with How I met your mother transcripts I wanted to identify plot arcs and as a first step I wrote some code using the gensim and nltk libraries to identify bigrams (two word phrases).

There’s an easy to follow tutorial in the gensim docs showing how to go about this but I needed to do a couple of extra steps to get my text data from a CSV file into the structure gensim expects.

Let’s first remind ourselves what the sentences CSV file looks like:

$ head -n 15 data/import/sentences.csv  | tail
5,1,1,1,Son: Are we being punished for something?
6,1,1,1,Narrator: No
7,1,1,1,"Daughter: Yeah, is this going to take a while?"
8,1,1,1,"Narrator: Yes. (Kids are annoyed) Twenty-five years ago, before I was dad, I had this whole other life."
9,1,1,1,"(Music Plays, Title ""How I Met Your Mother"" appears)"
10,1,1,1,"Narrator: It was way back in 2005. I was twenty-seven just starting to make it as an architect and living in New York with my friend Marshall, my best friend from college. My life was good and then Uncle Marshall went and screwed the whole thing up."
11,1,1,1,Marshall: (Opens ring) Will you marry me.
12,1,1,1,"Ted: Yes, perfect! And then you're engaged, you pop the champagne! You drink a toast! You have s*x on the kitchen floor... Don't have s*x on our kitchen floor."
13,1,1,1,"Marshall: Got it. Thanks for helping me plan this out, Ted."
14,1,1,1,"Ted: Dude, are you kidding? It's you and Lily! I've been there for all the big moments of you and Lily. The night you met. Your first date... other first things."

We need to transform those sentences into an array of words for each line and feed them into gensim’s models.Phrases object:

import nltk
import csv
import string
from gensim.models import Phrases
from gensim.models import Word2Vec
from nltk.corpus import stopwords
 
sentences = []
bigram = Phrases()
with open("data/import/sentences.csv", "r") as sentencesfile:
    reader = csv.reader(sentencesfile, delimiter = ",")
    reader.next()  # skip the header row
    for row in reader:
        # lower-case and tokenise the sentence text (the 5th column), dropping punctuation tokens
        sentence = [word.decode("utf-8")
                    for word in nltk.word_tokenize(row[4].lower())
                    if word not in string.punctuation]
        sentences.append(sentence)
        bigram.add_vocab([sentence])

We’ve used nltk’s word_tokenize function to create our array of words, and then we’ve got a clause to remove any words which are punctuation, as otherwise they would dominate our phrases.

We can take a quick peek at some of the phrases that have been created like so:

>>> list(bigram[sentences])[:5]
[[u'pilot'], [u'scene', u'one'], [u'title', u'the', u'year_2030'], [u'narrator_kids', u'i', u"'m", u'going', u'to', u'tell', u'you', u'an_incredible', u'story.', u'the', u'story', u'of', u'how', u'i', u'met_your', u'mother'], [u'son', u'are', u'we', u'being', u'punished', u'for', u'something']]

gensim uses an underscore character to indicate when it’s joined two words together, and in this sample we’ve got three phrases – ‘narrator_kids’, ‘an_incredible’ and ‘met_your’.
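
As a side note, the same transformer can be applied to a single new sentence, which is a handy way to eyeball which pairs get joined. The sentence below is made up, so the exact joins depend on the scoring threshold and on the vocabulary the model has seen:

tokens = [u'i', u'met', u'your', u'mother', u'in', u'new', u'york']
print bigram[tokens]
# prints something like [u'i', u'met_your', u'mother', u'in', u'new_york'], depending on the threshold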

We can now populate a Counter with our phrases and their counts and find out the most common phrases. One thing to note is that I’ve chosen to get rid of stopwords at this point rather than earlier because I didn’t want to generate ‘false bigrams’ where there was actually a stop word sitting in between.

from collections import Counter
 
bigram_counter = Counter()
for key in bigram.vocab.keys():
    if key not in stopwords.words("english"):
        if len(key.split("_")) > 1:  # keep only joined phrases, not single words
            bigram_counter[key] += bigram.vocab[key]
 
for key, counts in bigram_counter.most_common(20):
    print '{0: <20} {1}'.format(key.encode("utf-8"), counts)
 
i_'m                 4607
it_'s                4288
you_'re              2659
do_n't               2436
that_'s              2044
in_the               1696
gon_na               1576
you_know             1497
i_do                 1464
this_is              1407
and_i                1389
want_to              1071
it_was               1053
on_the               1052
at_the               1035
we_'re               1033
i_was                1018
of_the               1014
ca_n't               1010
are_you              994

Most of the phrases aren’t really that interesting and I had better luck feeding the phrases into a Word2Vec model and repeating the exercise:

# note: Word2Vec's vocabulary only keeps terms seen at least min_count times (5 by default),
# which is one reason its counts differ from the raw Phrases vocabulary
bigram_model = Word2Vec(bigram[sentences], size=100)
bigram_model_counter = Counter()
for key in bigram_model.vocab.keys():
    if key not in stopwords.words("english"):
        if len(key.split("_")) > 1:
            bigram_model_counter[key] += bigram_model.vocab[key].count
 
for key, counts in bigram_model_counter.most_common(50):
    print '{0: <20} {1}'.format(key.encode("utf-8"), counts)
 
do_n't               2436
gon_na               1576
ca_n't               1010
did_n't              704
come_on              499
end_of               460
kind_of              396
from_2030            394
my_god               360
they_'re             351
'm_sorry             349
does_n't             341
end_flashback        327
all_right            308
've_been             303
'll_be               301
of_course            289
a_lot                284
right_now            279
new_york             270
look_at              265
trying_to            238
tell_me              196
a_few                195
've_got              189
wo_n't               174
so_much              172
got_ta               168
each_other           166
my_life              157
talking_about        157
talk_about           154
what_happened        151
at_least             141
oh_god               138
wan_na               129
supposed_to          126
give_me              124
last_night           121
my_dad               120
more_than            119
met_your             115
excuse_me            112
part_of              110
phone_rings          109
get_married          107
looks_like           105
'm_sorry.            104
said_``              101

The first 20 phrases or so aren’t particularly interesting, although we do have ‘new_york’ in there, which is good as that’s where the show is set. If we go further down the list we’ll notice phrases like ‘my_dad’, ‘get_married’ and ‘last_night’, which may all point at interesting parts of the plot.

Having the data in the Word2Vec model allows us to do some other fun queries too, e.g.

>>> bigram_model.most_similar(['marshall', 'lily'], ['ted'], topn=10)
[(u'robin', 0.5474381446838379), (u'go_ahead', 0.5138797760009766), (u'zoey', 0.505358874797821), (u'karen', 0.48617005348205566), (u'cootes', 0.4757827818393707), (u'then', 0.45426881313323975), (u'lewis', 0.4510520100593567), (u'natalie.', 0.45070385932922363), (u'vo', 0.4189065098762512), (u'players', 0.4149518311023712)]
 
>>> bigram_model.similarity("ted", "robin")
0.51928683064927905
 
>>> bigram_model.similarity("barney", "robin")
0.62980405583219112
 
>>> bigram_model.most_similar(positive=['getting_married'])
[(u'so_glad', 0.780311107635498), (u'kidding', 0.7683225274085999), (u'awake', 0.7682262659072876), (u'lunch.', 0.7591195702552795), (u'ready.', 0.7372316718101501), (u'single.', 0.7350872755050659), (u'excited_about', 0.725479006767273), (u'swamped', 0.7252731323242188), (u'boyfriends', 0.7127221822738647), (u'believe_this.', 0.71015864610672)]
 
>>> bigram_model.most_similar(positive=['my_dad'])
[(u'my_mom', 0.7994954586029053), (u'somebody', 0.7758427262306213), (u'easier', 0.7305313944816589), (u'hot.', 0.7282992601394653), (u'pregnant.', 0.7103987336158752), (u'nobody', 0.7059557437896729), (u'himself.', 0.7046393156051636), (u'physically', 0.7044381499290466), (u'young_lady', 0.69412761926651), (u'at_bernie', 0.682607889175415)]

I’m not quite at the stage where I can automatically pull the results out of a gensim model and do something with them, but it is helping me to see some of the main themes in the show.

Next up I’ll try out trigrams and then TF/IDF over the bigrams to see which are the most important on a per-episode basis. I also need to dig into Word2Vec to figure out why it comes up with different top phrases than the Phrases model.
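
Here's a rough sketch of how that might look. It assumes the bigram and sentences objects from above, and it guesses that season and episode sit in the second and third columns of the CSV; a second Phrases pass should give trigrams, while gensim's Dictionary and TfidfModel can weight the phrases per episode:

from collections import defaultdict
import csv
import string
import nltk
from gensim.models import Phrases
from gensim import corpora, models
 
# trigrams: stack a second Phrases pass on top of the bigram-transformed sentences
trigram = Phrases(bigram[sentences])
trigram_sentences = list(trigram[bigram[sentences]])
 
# per-episode TF/IDF: re-read the CSV, this time keeping the episode columns
# (assumption: season and episode live in the 2nd and 3rd columns of each row)
episode_docs = defaultdict(list)
with open("data/import/sentences.csv", "r") as sentencesfile:
    reader = csv.reader(sentencesfile, delimiter=",")
    reader.next()  # skip the header row
    for row in reader:
        episode = (row[1], row[2])
        tokens = [word.decode("utf-8")
                  for word in nltk.word_tokenize(row[4].lower())
                  if word not in string.punctuation]
        episode_docs[episode].extend(bigram[tokens])
 
dictionary = corpora.Dictionary(episode_docs.values())
corpus = [dictionary.doc2bow(tokens) for tokens in episode_docs.values()]
tfidf = models.TfidfModel(corpus)
 
# the ten highest weighted words/phrases for the first episode in the corpus
first_episode = tfidf[corpus[0]]
top = sorted(first_episode, key=lambda item: item[1], reverse=True)[:10]
print [(dictionary[term_id], round(weight, 3)) for term_id, weight in top]

The column indices and the per-episode grouping are assumptions I haven't checked against the full file, so the keys may need adjusting, but the overall shape should be something like this.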

Categories: Programming

How to Create a Book’s Front and Back Matter

NOOP.NL - Jurgen Appelo - Thu, 02/12/2015 - 22:37

So, your book has a great title, a nice cover, good content, and proper design? What about the front and back matter? The what? The front matter and the back matter! That’s all the extra stuff at the beginning and at the end of your book that sits between the cover and the chapters. Oh that, I’ll do all that stuff the day before I publish the book. *EEEEE* Wrong!!

A book is only good when it’s good from start to finish. It starts with the cover and it ends with the back cover (or the final page in the case of an eBook).

The post How to Create a Book’s Front and Back Matter appeared first on NOOP.NL.

Categories: Project Management