
Software Development Blogs: Programming, Software Testing, Agile Project Management

Methods & Tools


Programming

Python’s pandas vs Neo4j’s cypher: Exploring popular phrases in How I met your mother transcripts

Mark Needham - Thu, 02/19/2015 - 01:52

I’ve previously written about extracting TF/IDF scores for phrases in documents using scikit-learn and the final step in that post involved writing the words into a CSV file for analysis later on.

I wasn’t sure which tool was most appropriate for that analysis, so I decided to explore the data both with Python’s pandas library and by loading it into Neo4j and writing some Cypher queries.
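As an aside that isn’t in the original post, the DataFrame df used throughout the pandas examples can be created by reading the same CSV file. A minimal sketch, assuming the file path from the LOAD CSV command below and the columns EpisodeId, Phrase and Score shown in the output later on:

import pandas as pd

# Read the TF/IDF scores exported by the earlier scikit-learn step.
# Path and column names are assumed from the LOAD CSV command and the
# output shown further down; adjust the path to your own machine.
df = pd.read_csv("/Users/markneedham/projects/neo4j-himym/data/import/tfidf_scikit.csv")

print(df.head())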

To do anything with Neo4j we need to first load the CSV file into the database. The easiest way to do that is with Cypher’s LOAD CSV command.

First we’ll load the phrases in and then we’ll connect them to the episodes which were previously loaded:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/tfidf_scikit.csv" AS row
MERGE (phrase:Phrase {value: row.Phrase});
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/tfidf_scikit.csv" AS row
MATCH (phrase:Phrase {value: row.Phrase})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (phrase)-[:USED_IN_EPISODE {tfidfScore: TOFLOAT(row.Score)}]->(episode);

Now we’re ready to start writing some queries. To start with we’ll write a simple query to find the top 3 phrases for each episode.

In pandas this is quite easy – we just need to group by the appropriate field and then take the top 3 records in that grouping:

top_words_by_episode = df \
    .sort(["EpisodeId", "Score"], ascending = [True, False]) \
    .groupby(["EpisodeId"], sort = False) \
    .head(3)
 
>>> print(top_words_by_episode.to_string())
 
        EpisodeId              Phrase     Score
3976            1                 ted  0.262518
2912            1              olives  0.195714
2441            1            marshall  0.155515
8143            2                 ted  0.292184
5197            2              carlos  0.227454
7482            2               robin  0.195150
12551           3                 ted  0.232662
9040            3              barney  0.187255
11254           3              mcneil  0.170619
15641           4             natalie  0.562485
16763           4                 ted  0.191873
16234           4               robin  0.102671
20715           5            subtitle  0.310866
18121           5          coat check  0.181682
20861           5                 ted  0.169973
...

The Cypher version looks quite similar, the main difference being that we use COLLECT to generate an array of phrases per episode and then take the top 3:

MATCH (e:Episode)<-[rel:USED_IN_EPISODE]-(phrase)
WITH e, rel, phrase
ORDER BY e.id, rel.tfidfScore DESC
RETURN e.id, e.title, COLLECT({phrase: phrase.value, score: rel.tfidfScore})[..3]
ORDER BY e.id
 
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | e.id | e.title                                     | COLLECT({phrase: phrase.value, score: rel.tfidfScore})[..3]                                                                                                               |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | 1    | "Pilot"                                     | [{phrase -> "ted", score -> 0.2625177493269755},{phrase -> "olives", score -> 0.19571419072701732},{phrase -> "marshall", score -> 0.15551468983363487}]                  |
==> | 2    | "Purple Giraffe"                            | [{phrase -> "ted", score -> 0.292184496766088},{phrase -> "carlos", score -> 0.22745438090499026},{phrase -> "robin", score -> 0.19514993122773566}]                      |
==> | 3    | "Sweet Taste of Liberty"                    | [{phrase -> "ted", score -> 0.23266190616714866},{phrase -> "barney", score -> 0.18725456678444408},{phrase -> "officer mcneil", score -> 0.17061872221616137}]           |
==> | 4    | "Return of the Shirt"                       | [{phrase -> "natalie", score -> 0.5624848345525686},{phrase -> "ted", score -> 0.19187323894701674},{phrase -> "robin", score -> 0.10267067360622682}]                    |
==> | 5    | "Okay Awesome"                              | [{phrase -> "subtitle", score -> 0.310865508347106},{phrase -> "coat check", score -> 0.18168178787561182},{phrase -> "ted", score -> 0.16997258596683185}]               |
==> | 6    | "Slutty Pumpkin"                            | [{phrase -> "mike", score -> 0.2966610054610693},{phrase -> "ted", score -> 0.19333276951599407},{phrase -> "robin", score -> 0.1656172994411056}]                        |
==> | 7    | "Matchmaker"                                | [{phrase -> "ellen", score -> 0.4947912795578686},{phrase -> "sarah", score -> 0.24462913913669443},{phrase -> "ted", score -> 0.23728319597607636}]                      |
==> | 8    | "The Duel"                                  | [{phrase -> "ted", score -> 0.26713931416222847},{phrase -> "marshall", score -> 0.22816702335751904},{phrase -> "swords", score -> 0.17841675237702592}]                 |
==> | 9    | "Belly Full of Turkey"                      | [{phrase -> "ericksen", score -> 0.43145756691027665},{phrase -> "mrs ericksen", score -> 0.1939318283559959},{phrase -> "kendall", score -> 0.1846969793866628}]         |
==> | 10   | "The Pineapple Incident"                    | [{phrase -> "ted", score -> 0.439756993033922},{phrase -> "trudy", score -> 0.36367907631894536},{phrase -> "carl", score -> 0.16413071244131686}]                        |
==> | 11   | "The Limo"                                  | [{phrase -> "moby", score -> 0.48314164479037003},{phrase -> "party number", score -> 0.30458929780262456},{phrase -> "ranjit", score -> 0.1991061739767796}]             |
...
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

In the Cypher version we get one row per episode whereas with the pandas version we get 3 rows per episode. It might be possible to achieve this effect with pandas too, although I wasn’t sure how at the time; one possible approach is sketched below.
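A sketch of one way to do it (not from the original post): group the already-sorted top_words_by_episode frame by EpisodeId and collect the (phrase, score) pairs into a single list per episode.

# Collapse to one row per episode: each EpisodeId maps to a list of its
# top 3 (phrase, score) pairs, mirroring the Cypher COLLECT result.
top_phrases_per_episode = top_words_by_episode \
    .groupby("EpisodeId") \
    .apply(lambda group: list(zip(group["Phrase"], group["Score"])))

print(top_phrases_per_episode.head())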

Next let’s find the top phrases for a single episode – the type of query that might be part of an episode page on a How I met your mother wiki:

top_words = df[(df["EpisodeId"] == 1)] \
    .sort(["Score"], ascending = False) \
    .head(20)
 
>>> print(top_words.to_string())
 
      EpisodeId                Phrase     Score
3976          1                   ted  0.262518
2912          1                olives  0.195714
2441          1              marshall  0.155515
4732          1               yasmine  0.152279
3347          1                 robin  0.130418
209           1                barney  0.124412
2146          1                  lily  0.122925
3637          1                signal  0.103793
1366          1                goanna  0.098138
3524          1                 scene  0.095342
710           1                   cut  0.091734
2720          1              narrator  0.086462
1147          1             flashback  0.078296
1148          1        flashback date  0.070283
3224          1                ranjit  0.069393
4178          1           ted yasmine  0.058569
1149          1  flashback date robin  0.058569
525           1                  carl  0.058210
3714          1           smurf pen1s  0.054365
2048          1              lebanese  0.054365

The equivalent query in Cypher:

MATCH (e:Episode {title: "Pilot"})<-[rel:USED_IN_EPISODE]-(phrase)
WITH phrase, rel
ORDER BY rel.tfidfScore DESC
RETURN phrase.value AS phrase, rel.tfidfScore AS score
LIMIT 20
 
==> +-----------------------------------------------+
==> | phrase                 | score                |
==> +-----------------------------------------------+
==> | "ted"                  | 0.2625177493269755   |
==> | "olives"               | 0.19571419072701732  |
==> | "marshall"             | 0.15551468983363487  |
==> | "yasmine"              | 0.15227880637176266  |
==> | "robin"                | 0.1304175242341549   |
==> | "barney"               | 0.12441175186690791  |
==> | "lily"                 | 0.12292497785945679  |
==> | "signal"               | 0.1037932464656365   |
==> | "goanna"               | 0.09813798750091524  |
==> | "scene"                | 0.09534236041231685  |
==> | "cut"                  | 0.09173366535740156  |
==> | "narrator"             | 0.08646229819848741  |
==> | "flashback"            | 0.07829592155397117  |
==> | "flashback date"       | 0.07028252601773662  |
==> | "ranjit"               | 0.06939276915589167  |
==> | "ted yasmine"          | 0.05856877168144719  |
==> | "flashback date robin" | 0.05856877168144719  |
==> | "carl"                 | 0.058210117288760355 |
==> | "smurf pen1s"          | 0.05436505297972703  |
==> | "lebanese"             | 0.05436505297972703  |
==> +-----------------------------------------------+

Our next query is a negation – find the episodes which don’t mention the phrase ‘robin’. In Python we can use some simple set operations to work this out:

all_episodes = set(range(1, 209))
robin_episodes = set(df[(df["Phrase"] == "robin")]["EpisodeId"])
 
>>> print(set(all_episodes) - set(robin_episodes))
set([145, 198, 143])

In Cypher land a single query will suffice:

MATCH (episode:Episode), (phrase:Phrase {value: "robin"})
WHERE NOT (episode)<-[:USED_IN_EPISODE]-(phrase)
RETURN episode.id AS id, episode.season AS season, episode.number AS episode

And finally a mini recommendation engine type query – for each of the top phrases in episode 1, in how many other episodes was it used:

First python:

phrases_used = set(df[(df["EpisodeId"] == 1)] \
    .sort(["Score"], ascending = False) \
    .head(10)["Phrase"])
 
phrases = df[df["Phrase"].isin(phrases_used)]
 
print (phrases[phrases["EpisodeId"] != 1] \
    .groupby(["Phrase"]) \
    .size() \
    .order(ascending = False))

Here we’ve pulled it out into a few steps – first we identify the top phrases, then we find out where they occur across the whole data set and finally we filter out the occurrences in the first episode and count the other occurrences.

Phrase
marshall    207
barney      207
ted         206
lily        206
robin       204
scene        36
signal        4
goanna        3
olives        1

In Cypher we can write a single query to do this as well:

MATCH (episode:Episode {title: "Pilot"})<-[rel:USED_IN_EPISODE]-(phrase)
WITH phrase, rel, episode
ORDER BY rel.tfidfScore DESC
LIMIT 10
MATCH (phrase)-[:USED_IN_EPISODE]->(otherEpisode)
WHERE otherEpisode <> episode
RETURN phrase.value AS phrase, COUNT(*) AS numberOfOtherEpisodes
ORDER BY numberOfOtherEpisodes DESC
 
==> +------------------------------------+
==> | phrase     | numberOfOtherEpisodes |
==> +------------------------------------+
==> | "barney"   | 207                   |
==> | "marshall" | 207                   |
==> | "ted"      | 206                   |
==> | "lily"     | 206                   |
==> | "robin"    | 204                   |
==> | "scene"    | 36                    |
==> | "signal"   | 4                     |
==> | "goanna"   | 3                     |
==> | "olives"   | 1                     |
==> +------------------------------------+

Overall there’s not much in it – some of the queries were easier in Cypher and others easier with pandas. It’s always useful to have multiple tools in the toolbox!

Categories: Programming

Retiring the Email Migration API

Google Code Blog - Thu, 02/19/2015 - 01:15

Posted by Wesley Chun, Developer Advocate, Google Apps

Last summer, we launched the new Gmail API, giving developers more flexible, powerful, and higher-level access to programmatic email management, not to mention improved performance. Since then, it has been expanded to replace the Google Apps Admin SDK's Email Migration API (EMAPI v2). Going forward, we recommend developers integrate with the Gmail API.

EMAPI v2 will be turned down on November 1, 2015, so you should switch to the Gmail API soon. To aid you with this effort, we’ve put together a developer’s guide to help you migrate from EMAPI v2 to the Gmail API. Before you do that, here’s a final reminder about these deprecations, including EMAPI v1, which are coming even sooner (April 20, 2015).
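As a rough illustration (not from the announcement), inserting a migrated message via the Gmail API with google-api-python-client looks something like the sketch below; the authorized HTTP object and the raw RFC 822 message are placeholders you would supply yourself.

import base64
from googleapiclient.discovery import build

# authorized_http is a placeholder for an OAuth2-authorized httplib2.Http
# instance; rfc822_message is a placeholder for the original message bytes.
service = build('gmail', 'v1', http=authorized_http)

body = {'raw': base64.urlsafe_b64encode(rfc822_message).decode('ascii')}
service.users().messages().insert(userId='me', body=body).execute()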

Categories: Programming

Diamond Kata - Thoughts on Incremental Development

Mistaeks I Hav Made - Nat Pryce - Thu, 02/19/2015 - 00:35
Some more thoughts on my experience doing the Diamond Kata with property-based tests…

When test-driving development with example-based tests, an essential skill to be learned is how to pick each example to be most helpful in driving development forwards in small steps. You want to avoid picking examples that force you to take too big a step (A.K.A. “now draw the rest of the owl”). Conversely, you don’t want to get sidetracked into a boring morass of degenerate cases and error handling when you’ve not yet addressed the core of the problem to be solved. Property-based tests are similar: the skill is in picking the right next property to let you make useful progress in small steps. But the progress from nothing to solution is different.

Doing TDD with example-based tests, I’ll start with an arbitrary, specific example (arbitrary but carefully chosen to help me make useful progress), and write a specific implementation to support just that one example. I’ll add more examples to “triangulate” the property I want to implement, and generalise the implementation to pass the tests. I continue adding examples and triangulating, and generalising the implementation until I have extensive coverage and a general implementation that meets the required properties.

Example TDD progress

Doing TDD with property-based tests, I’ll start with a very general property, and write a specific but arbitrary implementation that meets the property (arbitrary but carefully chosen to help me make useful progress). I’ll add more specific properties, which force me to generalise the implementation to make it meet all the properties. The properties also double-check one another (testing the tests, so to speak). I continue adding properties and generalising the implementation until I have a general implementation that meets the required properties.

Property TDD progress

I find that property-based tests let me work more easily in larger steps than when testing individual examples. I am able to go longer without breaking the problem down to avoid duplication and boilerplate in my tests, because the property-based tests have less scope for duplication and boilerplate.

For example, if solving the Diamond Kata with example-based tests, my reaction to the “now draw the rest of the owl” problem that Seb identified would be to move the implementation towards a compositional design, so that I could define the overall behaviour declaratively and not need to write integration tests for the whole thing. For example, when I reached the test for “C”, I might break the problem down into two parts. Firstly, a higher-order mirroredLines function that is passed a function to generate each line, with the type (in Haskell syntax):

mirroredLines :: (Char -> Char -> String) -> Char -> String

I would test-drive mirroredLines with a stub function to generate fake lines, such as:

let fakeLine ch maxCh = "line for " ++ [ch]

Then, I would write a diamondLine function that calculates the actual lines of the diamond, and declare the diamond function by currying:

let diamond = mirroredLines diamondLine

I wouldn’t feel the need to write tests for the diamond function given adequate tests of mirroredLines and diamondLine.
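As an aside that is not part of the original post, the same style of property can be expressed in Python with the hypothesis library. A minimal sketch, assuming a diamond function under test that returns the diamond for a given letter as a newline-separated string:

import string

from hypothesis import given
from hypothesis.strategies import sampled_from

# diamond is the function under test; the module name is assumed here.
from diamond_kata import diamond

@given(sampled_from(string.ascii_uppercase))
def test_diamond_is_vertically_symmetric(letter):
    # A general property: reading the diamond bottom-up gives the same lines
    # as reading it top-down, for any input letter.
    lines = diamond(letter).split("\n")
    assert lines == list(reversed(lines))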
Categories: Programming, Testing & QA

Smaller Fonts with WOFF 2.0 and unicode-range

Google Code Blog - Wed, 02/18/2015 - 23:31

Posted by Rod Sheeter, Software Engineer

The Google Fonts and Chrome teams are constantly looking for ways to make fonts better for online content. In 2014, we deployed two big optimizations: WOFF 2.0 and unicode-range. Combined, they are reducing the size of the downloaded fonts by more than 40 percent on average. For users, that means faster download times and lower data costs!

The HTTP Archive has a graph of observed font sizes across the web, primarily in WOFF format. If we compare this to our best estimate of the savings attributable to WOFF 2.0 and unicode-range optimizations that went live in 2014, it looks like this:

WOFF 2.0 is a new font format using a new compression algorithm, Brotli, created by the Google Compression team. WOFF 2.0 responses use ~25 percent less bytes than WOFF with Zopfli.

unicode-range allows the browser to automatically select what subset(s) of the font it needs based on the particular font glyphs used on the page. In practice, we’ve observed approximately 50 percent reduction in response size on some large sites with this optimization. On average, we see unicode-range responses use ~20 percent less bytes than they would without.

So, how do you take advantage of these optimizations? The great news is, if you are using Google Fonts on your site, then you are already taking advantage of them! We’ve already done all the work to enable both WOFF 2.0 and unicode-range support for browsers that support it (see caniuse.com/woff2, caniuse.com/unicode-range) on your behalf. It’s that easy.

If you are using a different provider, or hosting the fonts yourself, then you will have to do a little bit of work to enable these optimizations:

To use WOFF 2.0, convert your fonts using a supporting editor or via the command line tools and either serve WOFF 2.0 only to supporting browsers (what we do at Google Fonts), or use a bulletproof font-face declaration similar to the following (courtesy of Font Squirrel, with svg removed):

/* Generated by Font Squirrel (http://www.fontsquirrel.com) on February 2, 2015 */
@font-face {
    font-family: 'lobster';
    src: url('lobster-webfont.eot');
    src: url('lobster-webfont.eot?#iefix') format('embedded-opentype'),
         url('lobster-webfont.woff2') format('woff2'),
         url('lobster-webfont.woff') format('woff'),
         url('lobster-webfont.ttf') format('truetype');
    font-weight: normal;
    font-style: normal;
}

To use unicode-range, the story is a bit more complicated. Due to the inconsistent behavior of some older browsers, which will download all subsets regardless of whether they are required, you should serve the unicode-range property only to browsers that fully support it. With that in place, cut your font into pieces as you see fit (Google Fonts uses fontTools for this), and serve multiple @font-face rules, one for each segment. You can see an example of this by looking at the CSS generated by Google Fonts for a browser which supports these features (e.g. access http://fonts.googleapis.com/css?family=Lobster with Chrome).
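If you want to experiment with subsetting yourself, the fontTools subset module can cut a font into unicode ranges from Python. The following is only a rough sketch: the file names are placeholders, WOFF 2.0 output also needs the brotli module, and the exact API may differ between fontTools versions.

from fontTools import subset

# Subset a font to the Basic Latin range and write it out as WOFF 2.0.
options = subset.Options(flavor="woff2")
font = subset.load_font("lobster.ttf", options)

subsetter = subset.Subsetter(options)
subsetter.populate(unicodes=subset.parse_unicodes("U+0000-00FF"))
subsetter.subset(font)

subset.save_font(font, "lobster-latin.woff2", options)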

For more tips on optimizing your fonts, see Ilya Grigorik’s Optimizing Web Font Rendering Performance. If you try any of this let us know how it works out @googlefonts!

Categories: Programming

Building for Android Wear: Depth and Flexibility

Android Developers Blog - Wed, 02/18/2015 - 22:07

Posted by Timothy Jordan, Developer Advocate

With so many recent updates and improvements to Android Wear, it's high time to share an updated overview of the platform. We're certainly not done—there's a lot more to come—but this is the picture today as you start or continue developing your groundbreaking Android Wear user experiences.

Guns'n'Glory Heros and Strava

The Android Wear platform emphasizes depth and flexibility. Built on Android, it allows developers to use familiar APIs to create useful, performant, and imaginative apps that run directly on the watch. In the spirit of Android, you have the freedom to make substantial changes to the user experience, including the creation of custom watch faces. There are three main categories of experiences you can build: apps, custom watch faces, and notifications.

Apps

Apps that are built for Android Wear run directly on the watch and can do nearly anything a phone can, from tracking your run to giving you a little entertainment while waiting for the bus. Some even work without a connection to the phone, such as fitness and music apps. There are libraries to help you move data between the phone and the wearable, as well as create stunning and adaptable UIs. Here's a list of some of the great features you have access to:

  • Full screen activities with touch events (see Creating Custom UIs for Wear Devices)
  • Notifications and custom actions (see UI Patterns for Android Wear)
  • Custom watch faces (see Creating Watch Faces)
  • Layouts for round and square devices (see Creating Custom UIs for Wear Devices)
  • OpenGL (see Displaying Graphics with OpenGL ES)
  • Sensors: accelerometer, gyroscope, compass, barometer, heart rate sensor (see SensorManager)
  • Haptics (see Vibrator)
  • Microphone (see AudioRecord)
  • Voice actions (see Adding Voice Capabilities)
  • GPS (see Detecting Location on Android Wear)
  • Offline storing of data / music (see Transferring Assets)
  • Media playback controls (see MediaSession, MediaController)
  • Framework based on Android 5.0 API 21 (see Android 5.0 APIs)
  • Standalone or synchronized apps (see Sending and Syncing Data)

Selected watch faces

Watch Faces

The ability to create custom watch faces gives you direct access to the most prominent UI element on a user's most personal device. The API is simple enough to build watch faces quickly and flexible enough to allow personalization. Again, given the depth and flexibility of the Android platform, you can create something for the user that's both beautiful and packed with unique features.

The development journey starts with the simplicity of bringing your design to the wrist. At the core of the watch face API is the onDraw method that allows you to draw whatever design you can think of to the canvas at a high enough frame rate to deliver fluid animation. This will come through at full fidelity while the watch is in interactive mode.

At other times, when the watch is in ambient mode, you're able to draw a more discreet version of the watch face. Additional preferences can be set to arrange the system UI elements appropriately for your design. Once those basics are covered, the limits are your imagination! You can go further with additions like the moon phase, current weather, or fitness stats. Watchmakers call these items "complications" -- but with Android they're hardly complicated. Once you have the data, just draw it on the canvas as you did the time.

Glympse and WhatsApp

Notifications

Of course, Android Wear notifications are the easiest way to get started in the world of wearables. If you've got an Android app with notifications, they already work on a Wear watch. If you've already enhanced your notifications with actions, even better: those actions automatically work on the watch too. You can take things further with Wear-specific functionality like Stacks, Pages, and Voice Replies that make your notifications richer experiences on the wrist.

The user experiences you build for Wear get to take advantage of the power and flexibility of the Android platform. It's easy to get started and possible to create truly groundbreaking UI for your users. Together, we can create an ecosystem of user experiences as diverse as the watches they run on and the people who wear them.

Check out the developer videos and documentation for more, and share your thoughts on the Android Wear Developers community. We can’t wait to see the innovative user experiences you will build on Android Wear.

Join the discussion on +Android Developers
Categories: Programming

Episode 220: Jon Gifford on Logging and Logging Infrastructure

Robert Blumen talks to Jon Gifford of Loggly about logging and logging infrastructure. Topics include logging defined, purposes of logging, uses of logging in understanding the run-time behavior of programs, who produces logs, who consumes logs and for what reasons, software as the consumer of logs, log formats (structured versus free form), log meta-data, logging […]
Categories: Programming

Azure: Machine Learning Service, Hadoop Storm, Cluster Scaling, Linux Support, Site Recovery and More

ScottGu's Blog - Scott Guthrie - Wed, 02/18/2015 - 17:06

Today we released a number of great enhancements to Microsoft Azure. These include:

  • Machine Learning: General Availability of the Azure Machine Learning Service
  • Hadoop: General Availability of Apache Storm Support, Hadoop 2.6 support, Cluster Scaling, Node Size Selection and preview of next Linux OS support
  • Site Recovery: General Availability of DR capabilities with SAN arrays

I've also included details in this blog post of other great Azure features that went live earlier this month:

  • SQL Database: General Availability of SQL Database (V12)
  • Web Sites: Support for Slot Settings
  • API Management: New Premium Tier
  • DocumentDB: New Asia and US Regions, SQL Parameterization and Increased Account Limits
  • Search: Portal Enhancements, Suggestions & Scoring, New Regions
  • Media: General Availability of Content Protection Service for Azure Media Services
  • Management: General Availability of the Azure Resource Manager

All of these improvements are now available to use immediately (note that some features are still in preview). Below are more details about them:

Machine Learning: General Availability of Azure ML Service

Today, I’m excited to announce the General Availability of our Azure Machine Learning service.  The Azure Machine Learning Service is a powerful cloud-based predictive analytics service that makes it possible to quickly create analytics solutions.  It is a fully managed service - which means you do not need to buy any hardware nor manage VMs manually.

Data Scientists and Developers can use our innovative browser-based machine learning IDE to quickly create and automate machine learning workflows.  You can literally drag/drop hundreds of existing ML libraries to jump-start your predictive analytics solutions, and then optionally add your own custom R and Python scripts to extend them.  Our Machine Learning IDE works in any browser and enables you to rapidly develop and iterate on solutions:

image

With today's General Availability release you can easily discover and create web services, train/retrain your models through APIs, manage endpoints and scale web services on a per customer basis, and configure diagnostics for service monitoring and debugging. Additional new capabilities with today's release include:

  • The ability to create a configurable custom R module, incorporate your own train/predict R-scripts, and add python scripts using a large ecosystem of libraries such as numpy, scipy, pandas, scikit-learn etc. You can now train on terabytes of data using “Learning with Counts”, use PCA or one-class SVM for anomaly detection, and easily modify, filter, and clean data using familiar SQLite.
  • Azure ML Community Gallery that allows you to discover & learn experiments, and share through Twitter and LinkedIn. You can purchase marketplace apps through an Azure subscription and consume finished web services for Recommendation, Text Analytics, and Anomaly Detection directly from the Azure Marketplace.
  • A step-by-step guide for the Data Science journey from raw data to a consumable web service to ease the path for cloud-based data science. We have added the ability to use popular tools such as iPython Notebook and Python Tools for Visual Studio along with Azure ML.

Get Started

You can learn the basics of predictive analytics and machine learning using our step-by-step data science guide and tutorials.  No sign-up or credit card is required to get started using Azure Machine Learning (you can use the machine learning IDE and try experiments for free):

image

Also browse our machine learning gallery to run existing machine learning experiments others have already built - and optionally publish your own experiments for others to learn from:

image

Machine Learning and predictive analytics will fundamentally change the way all applications are built in the future. The new Azure Machine Learning service provides an incredibly powerful and easy way to achieve this. Start using it for production apps today!

HDInsight: General Availability of Apache Storm, Cluster Scaling, Hadoop 2.6, Node Sizes, and Preview of HDInsight on Linux

Today I’m happy to also announce several major enhancements to HDInsight, our managed Hadoop service for powering Big Data workloads in Azure.

General Availability of Apache Storm support

With today's release, we are making it easy for you to do real-time streaming analytics using Hadoop by providing Apache Storm as a fully managed Service and making it generally available on HDInsight. This makes it incredibly easy to stand up and manage Storm clusters. As part of the Storm service on HDInsight we have improved productivity by enabling some key features:

  • Integration with our Azure Event Hubs service - which allows you to easily process any data that is collected via Event Hubs
  • First class .NET experience on top of Apache Storm giving you the option to use both Java and .NET with it
  • A library of spouts and bolts that lets you easily integrate other Azure services like SQL, HBase and DocumentDB
  • Visual Studio integration that makes it easy for developers to do full project management from within the Visual Studio environment

Creating Storm cluster and running a sample topology

You can easily spin up a new Storm cluster from the Azure management portal. The Storm Dashboard allows you to either upload an existing Storm topology or pick one of the sample topologies from the dropdown.  Topologies can be authored in code, or higher level programming models like Trident can be used. You can also monitor and manage all the topologies that are currently on your cluster via the Storm Dashboard.

image

.NET Topologies and a Visual Studio Experience

One of the big improvements we have done on top of Storm is to enable developers to write Storm topologies in .NET. One of the things I am particularly excited about with the Storm release is the Visual Studio experience that we have enabled for Storm on HDInsight. With the latest version of the Azure SDK, you will get Storm project templates under HDInsight. This will quickly get you started with writing Storm topologies without having to worry or setup the right references or write the skeleton code that is needed for every Storm topology.

Since Storm is available as part of the HDInsight service, all HDInsight features also apply to Storm clusters. For example, you can easily scale a Storm cluster up or down with no impact on the existing running topologies. This lets you grow or shrink Storm clusters depending on the speed of data ingest and your latency requirements, with no impact on the data being processed. At cluster creation time you can pick from a long list of available VM sizes to use for your Storm cluster on HDInsight.

HDInsight 3.2 Support

I’m pleased to announce the availability of the next major version of Hadoop in HDInsight clusters for Windows and Linux. This includes Hadoop 2.6, Hive 0.14, and substantial updates to all of the components in the stack.  Hive 0.14 contains work to improve performance and scalability through Tez, adds a powerful cost based optimizer, and introduces capabilities for handling UPDATE, INSERT and DELETE SQL statements, temporary tables which live for the duration of a development session and more. You can find more details on the Hive 0.14 release here.   Pig 0.14 adds support for ORC, allowing a single high performance format to be leveraged across Pig and Hive.  Additionally Pig can now target Tez instead of Map/Reduce, resulting in substantial performance improvements by changing the execution engine. Details on the Pig 0.14 release are here.  These bring the latest improvements in the open source ecosystem to HDInsight. 

To get started with a 3.2 cluster, use the Azure Management portal or the command-line. In addition to the VS tools for Storm, we've also updated the VS tools to include Hive query authoring.  We've also added improved statement completion, local validation, access in Visual Studio to the YARN task logs, and support for HDInsight clusters on Linux. In order to get these, you just need to install the Azure SDK for Visual Studio which contains the latest HDInsight tooling.

Cluster Scaling

Many of our customers have asked for the ability to change HDInsight cluster sizes on the fly.  This capability is now accessible in both the Azure portal, as well as through the command line and SDK's.  You can grow or shrink a Hadoop cluster to fit your workload by simply dragging the sizing slider.  We'll add more nodes to your cluster while it is processing and when your larger jobs are done, you can reduce the size of the cluster.  If you need more cores available in your subscription, you can open a Billing support ticket to request a larger quota. 

Node Size Selection

Finally, you can also now specify the VM sizes for the nodes within your HDInsight cluster.  This lets you optimize your cluster's resources to fit your workload.  We've made the entire A and D series of VM sizes available.  For each of the different types of roles within a cluster, we'll let you specify the machine type.  This allows you to tune the amount of CPU, RAM and SSD available to your jobs. 

HDInsight on Linux

Today we are also releasing a preview version of our HDInsight service that allows you to deploy HDInsight clusters using Ubuntu Linux containers.  This expands the operating system options you can use when running managed Hadoop workloads on Azure (previously HDInsight only supported Windows Server containers).

The new Linux support enables you to easily use familiar tools like SSH and Ambari to build Big Data workloads in Azure.  HDInsight on Linux clusters are built on the same Hadoop distribution as the Windows clusters, are fully integrated with Azure storage, and make it easy for customers leveraging Hadoop to take advantage of the SLA, management and support that HDInsight offers.  To get started, sign up for the preview here.  You can then easily create Linux clusters using the Azure Management Portal or via our command-line interfaces.

SSH connectivity to your HDInsight clusters is enabled by default for all HDInsight on Linux clusters. You can use an SSH client of your choice to connect to the cluster.  Additionally, SSH tunneling can be leveraged for forwarding traffic from your browser to all of the Hadoop web applications.

Learn More

For more information about Azure HDInsight, check out the following resources:

Site Recovery: General Availability of Enterprise DR with SANs

With today’s Azure release, we are also adding another significant capability to Azure Site Recovery’s disaster recovery and replication portfolio. Enterprises that seek to leverage their Storage Area Network (SAN) Arrays to enable high performance synchronous and asynchronous replication across their on-premises Hyper-V private clouds can now orchestrate end-to-end storage array-based replication and disaster recovery with Azure Site Recovery and System Center Virtual Machine Manager (SCVMM).

The addition of SAN as a replication channel enables key scenarios such as Synchronous Replication, Multi-VM Consistency, and support for Guest Clusters with Azure Site Recovery. With support for Shared VHDX and iSCSI Target LUNs, ASR will now be able to better meet the needs of enterprise-class applications such as SQL Server, SharePoint, and SAP.

To enable SAN Replication, in the Azure Management Portal select SAN when configuring SCVMM clouds in ASR. ASR in turn validates that the cloud being configured has host clusters that have been correctly zoned to a Storage Array, either via Fibre Channel or iSCSI. Once the cloud configuration is complete and the storage pools have been mapped, Replication Groups (group of storage LUNs that replicate together and thereby enable multi-VM replication consistency) can be enabled for replication. ASR automates the creation of target LUNs, target Replication Groups, and starts the array-based replication. 

Here’s an example of a Recovery Plan that can failover a SQL Guest Cluster deployed on a Replication Group:

image

Learn More

Visit the Azure Site Recovery forum on MSDN for additional information.

Getting started with Azure Site Recovery is easy - all you need to do is sign up for a free Microsoft Azure trial.

SQL Database: General Availability of SQL Database (V12)

Earlier this month we released the general availability version of our SQL Database (V12) service. We introduced a preview of this new release last December, and it includes a ton of new capabilities. These include:

  • Better management of large databases. We now support heavier database workload management with parallel queries, table partitioning, online indexing, worry-free large index rebuilds with the previous 2GB size limit removed, and more alter database commands.

  • Support for more programmability capabilities: You can now build even more robust applications with CLR, T-SQL Windows functions, XML index, and change tracking support.

  • Up to 100x performance improvements with support for In-memory columnstore queries for data mart and analytic workloads.

  • Improved monitoring and troubleshooting: Extended Events (XEvents) and visibility into over 100 new table views via an expanded set of Database Management Views (DMVs).

  • New S3 performance level: This release also introduces a new pricing option for SQL Database. The new "S3" performance tier delivers 100 DTU of performance (twice the DTU level of the existing S2 tier) and all of the features available in the Standard tier. It enables an even more cost effective way to run applications with higher performance needs.

You can now take advantage of all of these features in general availability - with all databases backed by an enterprise grade SLA.

Upcoming Security Features

I'm also excited to announce a number of new security features that will start rolling out this month and this spring.  These features will help customers better protect their cloud data and help further meet corporate and industry compliance policies. These security enhancements include:

  • Row-Level Security
  • Dynamic Data Masking
  • Transparent Data Encryption

Available in preview today, customers can now implement Row-Level Security on databases to enable implementation of fine-grained access control over rows in a database table for greater control over which users can access which data.

Coming soon, SQL Database will introduce Dynamic Data Masking which is a policy-based security feature that helps limit the exposure of data in a database by returning masked data to non-privileged users who run queries over designated database fields, like credit card numbers, without changing data on the database. Finally, Transparent Data Encryption is coming soon to SQL Database V12 databases for encryption at rest on all databases.

Stay tuned over the coming months for details as we continue to roll out the V12 service general availability and upcoming security features.

Web Sites: Support for Slot Settings

The Azure Web Sites service has always provided the ability to store application settings and connection strings as a part of your Web Site’s metadata.  Those settings become available at runtime via environment variables and, if you use .NET, the standard configuration manager API.  This feature has now been updated to work better with another Web Sites feature: deployment slots. 

Deployment slots provide an easy way for you to safely deploy and test new releases of your web applications prior to swapping them live into production.  Let’s say you have a website called mysite.azurewebsites.net with a deployment slot at mysite-staging.azurewebsites.net.  You can swap these slots at any given time, and with no downtime. This provides a nice infrastructure for upgrading your website. Until now, when you swapped the staging slot with the production site, all settings and connection strings would swap as well. Sometimes that’s exactly what you want and it works great. 

But what if, for testing purposes, your site uses a database and you explicitly want each slot to have its own database (e.g. a production database and a testing database)?  Prior to this month's release that would have been difficult to automate since the swap operation would move the staging connection string to the production site and vice versa. You would have to do something unnatural like going to the staging slot and manually updating the settings to the production values before performing the swap operation. Then, you would execute the swap, and finally manually update the staging settings to point to the staging database. That workflow is very complicated and error prone.  

New Slot Settings Support

Slot-specific settings are the solution to this problem. Simply go to the Azure Preview Portal, navigate to your Web Site’s Settings page, and you’ll see a new checkbox next to each app setting and connection string. Check the boxes next to each app setting and/or connection string that should not participate in swap operations. Each deployment slot has its own version of this settings page where you can enter the slot-specific setting values. You now have a lot more flexibility when it comes to managing deployment slots and flowing configuration between them during swaps:

image 

API Management: New Premium Tier

Earlier this month we released a preview of our new Premium Tier for our API Management Service.  The Azure API Management Service provides a great offering that helps customers expose web-based APIs to customers - and provides support for API protection via rate-limiting, quotas and keys, detailed analytics, easy developer on-boarding and much more.

As the strategic value of APIs increases, customers are demanding even more performance, higher availability and more enterprise-grade features. In response, we're delighted to introduce a new Premium tier of API Management which will offer a 99.95% SLA after preview and includes a number of key new features:

Multiple Geography Deployment

Until now each API Management service resided in a single region selected when the service is created. I’m pleased to announce the introduction of a new multi-region deployment feature that allows API publishers to easily distribute a single API Management service across any number of Azure regions. Customers who want to reduce latency for distributed API consumers and require extremely high availability can now enable multi-geo with minimal configuration.

image

Premium tier customers will now see an updated capacity section on the scale tab of the Azure Management portal. Additional units and regions can be added with a few clicks of the relevant dropdown controls and API Management will provision additional proxies beyond the primary region in a matter of minutes.

Multi-geo is particularly effective when combined with the API Management caching policy, which can provide a CDN-like capability for your mission critical and performance sensitive APIs. For more information on multiple-geography deployment, check out the documentation.

Azure Virtual Network / VPN integration

Many customers are already managing their on-premises APIs using API Management's mutual certificate authentication to secure their backend. The new Premium offering introduces a great new capability for organizations that prefer to use a VPN solution or want to leverage their Azure ExpressRoute connection. Available in the Premium Tier, VPN connectivity settings are available on the configure tab of the Azure Management Portal and can even be combined with multi-geo, with a separate VPN for each region. More information is available in the documentation.

image

Active Directory Integration

Prior to today’s release, API Management's developer portal allowed developers to self-serve sign up using a custom account created with their e-mail address or using popular social identity providers like Facebook, Twitter, Google and Microsoft account. Sometimes businesses and enterprises want more control and would like to restrict sign in options, often preferring Azure Active Directory.

With our latest release, we now allow you to configure Azure Active Directory as an identity provider for Azure API Management. Administrators can disable all other identity providers and restrict access to APIs and documentation based on AD group membership. What's more, access can be extended to allow multiple AAD tenants to access your developer portal, making it even easier to share your APIs with business partners.

image

Learning More

Check out the Azure Active Directory documentation for more information on the integration, and the pricing page for more information on the new Premium tier.

DocumentDB: New Asia and US Regions, SQL Parameterization and Increased Account Limits

Earlier this month we released the following new features and capabilities in our Azure DocumentDB service - which provides a fully managed NoSQL JSON database service:

  • New regional availability
  • Larger accounts and documents: Increased the number of capacity units per account and doubled the supported document size
  • SQL parameterization: Support for handling and escaping user input, preventing accidental exposure of data

New Regions

We have added new support for provisioning DocumentDB accounts in the East Asia, Southeast Asia, and US East Azure regions (in addition to our existing US West, East Europe and West Europe regions). We’ll continue to invest in regional expansion in order to give you the flexibility and choice you need when deciding where to locate your DocumentDB data.

Larger Accounts and Documents

Throughout the preview process we’ve steadily increased the maximum document and database sizes. With this month's release we've increased the maximum size of an individual document from 256KB to 512KB. The Capacity Unit (CU) limit per DocumentDB Account has also been raised from 5 to 50, which means you can now scale a single DocumentDB account to 500GB of storage and 100,000 Request Units of provisioned throughput. As always, our preview quotas can be adjusted on a per account basis - contact us if you have a need for increased capacity.

SQL Parameterization

Instead of inventing a new query language, DocumentDB supports querying documents using SQL (Structured Query Language) over hierarchical JSON documents. We are pleased to announce that we have extended our SQL query capabilities by adding support for parameterized SQL queries in the Azure DocumentDB REST API and SDKs. Using this feature, you can now write parameterized SQL queries. Parameterized SQL provides robust handling and escaping of user input, preventing accidental exposure of data through “SQL injection”.

Let’s take a look at a sample using the .NET SDK. In addition to plain SQL strings and LINQ expressions, we’ve added a new SqlQuerySpec class that can be used to build parameterized queries.  Here’s a sample that queries a “Books” collection with a single user supplied parameter for author name:

IQueryable<Book> queryable = client.CreateDocumentQuery<Book>(
    collectionSelfLink,
    new SqlQuerySpec {
        QueryText = "SELECT * FROM books b WHERE (b.Author.Name = @name)",
        Parameters = new SqlParameterCollection() {
            new SqlParameter("@name", "Herman Melville")
        }
    });

Note:

  • SQL parameters in DocumentDB use the familiar @ notation borrowed from T-SQL
  • Parameter values can be any valid JSON (strings, numbers, Booleans, null, even arrays or nested JSON)
  • Since DocumentDB is schema-less, parameters are not validated against any type
  • You could just as easily supply additional parameters by adding additional SqlParameters to the SqlParameterCollection

The DocumentDB REST API also natively supports parameterization. The .NET sample shown above translates to the following REST API call. To use parameterized queries, you need to specify the Content-Type Header as application/query+json and the query as JSON in the body, as shown below.

POST https://contosomarketing.documents.azure.com/dbs/XP0mAA==/colls/XP0mAJ3H-AA=/docs HTTP/1.1
x-ms-documentdb-isquery: True
x-ms-date: Mon, 18 Aug 2014 13:05:49 GMT
authorization: type%3dmaster%26ver%3d1.0%26sig%3dkOU%2bBn2vkvIlHypfE8AA5fulpn8zKjLwdrxBqyg0YGQ%3d
x-ms-version: 2014-08-21
Accept: application/json
Content-Type: application/query+json
Host: contosomarketing.documents.azure.com
Content-Length: 50

{
    "query": "SELECT * FROM books b WHERE (b.Author.Name = @name)",
    "parameters": [
        {"name": "@name", "value": "Herman Melville"}
    ]
}

Queries can be issued against document collections, as well as system metadata collections like Databases, DocumentCollections, and Attachments using the approach shown above. To try this out, download the latest build of the DocumentDB SDK on any of the supported platforms (.NET, Java, Node.js, JavaScript, or Python).
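As a rough illustration (not from the original post), the same REST call can be made from Python with the requests library. This is only a sketch: the account URL and collection link mirror the example above, and the authorization header must be a valid signature generated for your own account (normally the SDKs take care of that for you).

import json
import requests

# Placeholder values copied from the REST example above; replace them with
# your own account URL, collection link, and a freshly generated auth token.
url = "https://contosomarketing.documents.azure.com/dbs/XP0mAA==/colls/XP0mAJ3H-AA=/docs"
headers = {
    "x-ms-documentdb-isquery": "True",
    "x-ms-version": "2014-08-21",
    "Accept": "application/json",
    "Content-Type": "application/query+json",
    "authorization": "<generated auth token>",
}
body = {
    "query": "SELECT * FROM books b WHERE (b.Author.Name = @name)",
    "parameters": [{"name": "@name", "value": "Herman Melville"}],
}

response = requests.post(url, headers=headers, data=json.dumps(body))
print(response.status_code, response.json())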

As always, we’d love to hear from you about the DocumentDB features and experiences you would find most valuable. Submit your suggestions on the Microsoft Azure DocumentDB feedback forum.

Search: Portal Enhancements, Suggestions & Scoring, New Regions

Earlier this month we released a bunch of great enhancements to our Azure Search service.  Azure Search provides developers with all of the features needed to build out search experiences for web and mobile applications without having to deal with the typical complexities that come with managing, tuning and scaling a large search service.

Azure Portal Enhancements

Last month we added the ability to create and manage your search indexes from the Azure Preview Portal. Since then, you have told us that this has really helped to speed up development as it greatly reduced the amount of code required, but we also heard that you needed more. As a result, we extended the portal by adding the ability to add Scoring Profiles as well as configure Cross Origin Resource Sharing from the portal.

Portal Support of Scoring Profiles

Scoring Profiles boost items up in the search results based on different factors that you control. For example, below, I have a hotels index and, all other things being equal, I want highly rated hotels close to the user’s current location to appear at the top of the user’s search results. To do this, in the Azure Preview Portal, choose Add Scoring Profile and provide a name for it. In this case I am going to call it “closeToUser”. You can create one or more scoring profiles and name them as needed in the search request, allowing you to provide different search results based on different use cases.

image

Once closeToUser has been created, I can start adding weights and functions. For example, in this scoring profile, I chose to add:

  • Weighting: Use hotelName as a weighted field, such that if the search term is found in the hotelName, it gets a weighted boost
  • Distance: Leverage the spatial capabilities of Azure Search to boost a hotel if it is found to be closer to the user’s specified location
  • Magnitude: Provide a boost to the hotels that have higher ratings

All of these functions and weights are then combined into a final score that is used to rank documents.

image

Scoring can often be tricky, and it tends to get mixed in with the rest of the query. With Azure Search, the scoring profile experience has been simplified: profiles are separated from search queries, so the scoring model stays outside of application code and can be updated independently. In addition, scoring profiles are modeled as a set of high-level scoring functions combined with a simple way to apply the typical field weights, making editing and maintenance of scoring much simpler.

As demonstrated above, this user experience requires no coding: you simply choose the fields that are important and apply the function or weight that makes the most sense. It is important to note that scoring profiles are a method of boosting the relevance of a document and should not be confused with sorting. There are a number of other functions available which you can learn more about in the MSDN documentation.

Cross Origin Resource Sharing (CORS)

Web Browsers commonly apply a same-origin restriction policy to network requests, preventing client-side web applications from issuing requests to another domain for security reasons. For example, JavaScript code that came from http://www.contoso.com could not issue a request to another domain such as http://www.northwindtraders.com. For Azure Search developers, this is important in cases where all the data is already publicly accessible and they want to save on latency by going straight to the search index from mobile devices or a browser.

CORS is a method that allows you to relax this restriction in a controlled way so you don’t compromise security. Azure Search uses CORS to allow JavaScript code inside browsers to make search requests directly to the Azure Search service and eliminate the need to proxy all requests through the originating server. We now offer the ability to configure CORS from the Azure Preview Portal, allowing you to easily enable cross-domain access and limit it to specific origins. This can be done from the index management portion of your search service as shown below.

image

Tag Boosting

As discussed with Scoring Profiles, there are many examples of where you may want to boost certain relevant items. To this end, we have also introduced a new and highly requested function to our set of scoring profile functions called Tag Boosting. This feature is currently part of our experimental API version, made available to you so you can test and provide feedback on these potential new features.

Tag Boosting allows you to boost documents that have tags in common with the search query. The tags for the search query are provided as a scoring parameter in each search request and then any document that contain these terms would get a boost. This capability can not only be helpful to enable search result customization, but could also be used for cases where you have specific items you want to promote. As an example, during a sporting event, a retailer might want to promote items that are related to the teams participating in that sporting event.

Improved Suggestions

Suggestions (auto-complete) is a feature that allows you to provide type-ahead suggestions as the user types. Just like scoring profiles, this is a great way to help your users find the content they are looking for quickly. When we first implemented search suggestions in Azure Search, we heard a number of requests to extend the capabilities of this feature to better suit your requirements. As a result, we have an entirely new implementation of suggestions to address these items. In particular, it does infix matching for suggestions and, if fuzzy matching is enabled, it is more tolerant of spelling mistakes. It also allows up to 100 suggestions per result, has no length limit other than field limits, and does not impose the previous 3-character minimum length.

This enhancement is still under the experimental API version as we are continuing to gather feedback. For more information on this, and to see a more detailed example of suggestions, please see the post on Suggestions on the Azure Blog.

New Regions

As a final note, I wanted to point out that we are continuing to expand the global footprint of Azure Search. With the addition of East Asia and West Europe you can now provision Azure Search services in 8 regions across the globe.

Media: General Availability of Content Protection Service

Earlier this month we released the general availability of our new Content Protection service for Azure Media Services. This is backed by an enterprise grade SLA for all customers.

We understand the importance of protecting your premium media content, and our robust new DRM offering features both static and dynamic encryption with first party PlayReady license delivery and an AES 128-bit key delivery service. You can either dynamically encrypt during delivery of your media or statically encrypt during the content processing workflow, and our content protection options are available for both live and on-demand workflows.

For more information on functionality and pricing, visit the Media Services Content Protection blog post, the Media Services Pricing webpage, or this Securing Media article.

Management: General Availability of the Azure Resource Manager

Earlier this month we reached general availability of the new Azure Resource Manager, and now provide a world-wide SLA for the service. The Azure Resource Manager provides a core set of management capabilities that are fundamental to the Microsoft Azure Platform and form the basis of our new deployment and management model for all Azure services. You can use the Azure Resource Manager to deploy and manage your Azure solutions at no cost.

The Azure Resource Manager provides a simple and customizable experience to manage your applications running in Azure, along with enterprise-grade authentication and authorization capabilities. Benefits include:

Application Lifecycle Boundaries: Azure Resource Manager provides a deployment container called a Resource Group that serves as the lifecycle boundary of the resources/services deployed in it - making it easy for you to deploy, manage and visualize the services contained within it. You no longer have to deploy parts of your application à la carte and then stitch them together manually. A Resource Group container supports one-click deployment and tear-down of the entire application in a single operation.

Enterprise Grade Access Control: OAuth and Role-Based Access Control (RBAC) are now natively integrated into Azure Management and consistently apply to all services supported by the Resource Manager. Access and operations performed on these services are also logged automatically to enable you to audit them later. You can now use a rich set of platform and resource specific roles that can be applied at the subscription, resource group, or resource level - giving you granular control over who has access to what operation within your organization.

Rich Tagging and Categorization: The Azure Resource Manager supports metadata tagging of resource groups and contained resources, and you can use this tagging support to group objects in ways suitable to your own needs such as management, billing or monitoring. For example, you could mark certain resources or resource groups as being "Dev/Test" and use that to help filter your resources or charge back their bills differently to internal groups in your organization.  This provides the power needed to manage and monitor departmental applications, subscriptions, and billing data in a more streamlined fashion, especially for larger organizations.

Declarative Deployment Templates: The new Azure Resource Manager supports both an imperative API as well as a declarative template model that you can use to deploy rich multi-tier applications on Azure.  These applications can be composed from multiple Azure services (including both IaaS and PaaS based services) and support the ability for you to pass parameters and connection-strings across them.  For example, you could declaratively create a SQL DB, Web Site and VM using a single template and automatically wire up the connection-string details between them.

image

Learn More

Check out the following resources to learn more about the Azure Resource Manager, and start using it today:

Summary

Today’s Microsoft Azure release enables a ton of great new scenarios, and makes building applications hosted in the cloud even easier.

If you don’t already have an Azure account, you can sign up for a free trial and start using all of the above features today.  Then visit the Microsoft Azure Developer Center to learn more about how to build apps with it.

Hope this helps,

Scott

P.S. In addition to blogging, I am also now using Twitter for quick updates and to share links. Follow me at: twitter.com/scottgu

Categories: Architecture, Programming

Try, Option or Either?

Xebia Blog - Wed, 02/18/2015 - 09:45

Scala has a lot of different options for handling and reporting errors, which can make it hard to decide which one is best suited for your situation. In Scala and functional programming languages it is common to make the errors that can occur explicit in the function's signature (i.e. its return type), in contrast with the common practice in other programming languages where either special values are used (-1 for a failed lookup anyone?) or an exception is thrown.

Let's go through the main options you have as a Scala developer and see when to use what!

Option
A special type of error that can occur is the absence of some value. For example when looking up a value in a database or a List you can use the find method. When implementing this in Java the common solution (at least until Java 7) would be to return null when a value cannot be found or to throw some version of the NotFound exception. In Scala you will typically use the Option[T] type, returning Some(value) when the value is found and None when the value is absent.

So instead of having to look at the Javadoc or Scaladoc you only need to look at the type of the function to know how a missing value is represented. Moreover you don't need to litter your code with null checks or try/catch blocks.

Another use case is in parsing input data: user input, JSON, XML, etc. Instead of throwing an exception for invalid input you simply return None to indicate that parsing failed. The disadvantage of using Option in this situation is that you hide the type of error from the user of your function, which, depending on the use case, may or may not be a problem. If that information is important, keep reading the next sections.

An example that ensures that a name is non-empty:

def validateName(name: String): Option[String] = {
  if (name.isEmpty) None
  else Some(name)
}

You can use the validateName method in several ways in your code:

// Use a default value

 validateName(inputName).getOrElse("Default name")

// Apply some other function to the result
 validateName(inputName).map(_.toUpperCase)

// Combine with other validations, short-circuiting on the first error
// returning a new Option[Person]
 for {
   name <- validateName(inputName)
   age <- validateAge(inputAge)
 } yield Person(name, age)

Either
Option is nice to indicate failure, but if you need to provide some more information about the failure, Option is not powerful enough. In that case Either[L,R] can be used. It has 2 implementations, Left and Right. Both can wrap a custom type, respectively type L and type R. By convention Right is right, so it contains the successful result and Left contains the error. Rewriting the validateName method to return an error message would give:

def validateName(name: String): Either[String, String] = {
  if (name.isEmpty) Left("Name cannot be empty")
  else Right(name)
}

Similar to Option, Either can be used in several ways. It differs from Option in that you always have to specify the so-called projection you want to work with via the left or right method:

// Apply some function to the successful result
validateName(inputName).right.map(_.toUpperCase)

// Combine with other validations, short-circuiting on the first error
// returning a new Either[Person]
for {
 name <- validateName(inputName).right
 age <- validateAge(inputAge).right
} yield Person(name, age)

// Handle both the Left and Right case
validateName(inputName).fold(
  error => s"Validation failed: $error",
  result => s"Validation succeeded: $result"
)

// And of course pattern matching also works
validateName(inputName) match {
  case Left(error) => s"Validation failed: $error"
  case Right(result) => s"Validation succeeded: $result"
}

// Convert to an option:
validateName(inputName).right.toOption

This projection is kind of clumsy and can lead to several convoluted compiler error messages in for expressions. See for example the excellent and detailed discussion of the Either type in The Neophyte's Guide to Scala Part 7. Due to these issues several alternative implementations for a kind of Either have been created; the best known are the \/ type in Scalaz and the Or type in Scalactic. Both avoid the projection issues of the Scala Either and, at the same time, add additional functionality for aggregating multiple validation errors into a single result type.

Try

Try[T] is similar to Either. It also has 2 cases: Success[T] for the successful case and Failure[T], which wraps a Throwable, for the failure case. The main difference is thus that the failure can only be a Throwable. You can use it instead of a try/catch block to postpone exception handling. Another way to look at it is as Scala's version of checked exceptions. Success[T] wraps the result value of type T, while the Failure case can only contain an exception.

Compare these 2 methods that parse an integer:

// Throws a NumberFormatException when the integer cannot be parsed
def parseIntException(value: String): Int = value.toInt

// Catches the NumberFormatException and returns a Failure containing that exception
// OR returns a Success with the parsed integer value
def parseInt(value: String): Try[Int] = Try(value.toInt)

The first function needs documentation describing that an exception can be thrown. The second function describes in its signature what can be expected and requires the user of the function to take the failure case into account. Try is typically used when exceptions need to be propagated; if the exception itself is not needed, prefer any of the other options discussed.

Try offers similar combinators as Option[T] and Either[L,R]:

// Apply some function to the successful result
parseInt(input).map(_ * 2)

// Combine with other validations, short-circuiting on the first Failure
// returning a new Try[Stats]
for {
  age <- parseInt(inputAge)
  height <- parseDouble(inputHeight)
} yield Stats(age, height)

// Use a default value
parseAge(inputAge).getOrElse(0)

// Convert to an option
parseAge(inputAge).toOption

// And of course pattern matching also works
parseAge(inputAge) match {
  case Failure(exception) => s"Validation failed: ${exception.getMessage}"
  case Success(result) => s"Validation succeeded: $result"
}

Note that Try is not needed when working with Futures! Futures combine asynchronous processing with the Exception handling capabilities of Try! See also Try is free in the Future.

Exceptions
Since Scala runs on the JVM all low-level error handling is still based on exceptions. In Scala you rarely see usage of exceptions and they are typically only used as a last resort. More common is to convert them to any of the types mentioned above. Also note that, contrary to Java, all exceptions in Scala are unchecked. Throwing an exception will break your functional composition and probably result in unexpected behaviour for the caller of your function. So it should be reserved as a method of last resort, for when the other options don’t make sense.
If you are on the receiving end of the exceptions you need to catch them. In Scala syntax:

try {
  dangerousCode()
} catch {
  case e: Exception => println("Oops")
} finally {
  cleanup
}

What is often done wrong in Scala is that all Throwables are caught, including the Java system errors. You should never catch Errors because they indicate a critical system error like the OutOfMemoryError. So never do this:

try {
  dangerousCode()
} catch {
  case _ => println("Oops. Also caught OutOfMemoryError here!")
}

But instead do this:

import scala.util.control.NonFatal

try {
  dangerousCode()
} catch {
  case NonFatal(_) => println("Ooops. Much better, only the non fatal exceptions end up here.")
}

To convert exceptions to Option or Either types you can use the methods provided in scala.util.control.Exception (scaladoc):

import scala.util.control.Exception._

val i = 0
val resultOption: Option[Int] = catching(classOf[ArithmeticException]) opt { 1 / i }
val resultEither: Either[Throwable, Int] = catching(classOf[ArithmeticException]) either { 1 / i }

Finally remember you can always convert an exception into a Try as discussed in the previous section.

TL;DR

  • Option[T], use it when a value can be absent or some validation can fail and you don't care about the exact cause. Typically in data retrieval and validation logic.
  • Either[L,R], similar use case as Option but when you do need to provide some information about the error.
  • Try[T], use when something Exceptional can happen that you cannot handle in the function. This, in general, excludes validation logic and data retrieval failures but can be used to report unexpected failures.
  • Exceptions, use only as a last resort. When catching exceptions use the facility methods Scala provides and never write catch { case _ => }; instead use catch { case NonFatal(_) => }

One final piece of advice is to read through the Scaladoc for all the types discussed here. There are plenty of useful combinators available that are worth using.

Reminder: ClientLogin Shutdown scheduled for April 20, 2015

Google Code Blog - Tue, 02/17/2015 - 21:17

Posted by Ryan Troll, Technical Lead, Identity and Authentication

As mentioned in our earlier post reminding users to migrate to newer Google Data APIs, we would like to once again share that the ClientLogin shutdown date is fast approaching, and applications which rely on it will stop working when it shuts down. We encourage you to minimize user disruption by switching to OAuth 2.0.

Our top priority is to safeguard users’ data, and at Google we use risk based analysis to block the vast majority of account hijacking attempts. Our risk analysis systems take into account many signals in addition to passwords to ensure that user data is protected. Password-only authentication has several well known shortcomings and we are actively working to move away from it. Moving to OAuth 2.0 ensures that advances we make in secure authentication are passed on to users signing in to Google services from your applications.

In our efforts to eliminate password-only authentication, we took the first step by announcing a deprecation date of April 20, 2015 for ClientLogin three years ago. At the same time, we recommended OAuth 2.0 as the standard authentication mechanism for our APIs. Applications using OAuth 2.0 never ask users for passwords, and users have tighter control over which data client applications can access. You can use OAuth 2.0 to build clients and websites that securely access account data and work with our advanced security features like 2-step verification.

We’ve taken steps to provide alternatives to password authentication in other protocols as well. CalDAV API V2 only supports OAuth 2.0, and we’ve added OAuth 2.0 support to IMAP, SMTP, and XMPP. While a deprecation timeline for password authentication in these protocols hasn’t been announced yet, developers are strongly encouraged to move to OAuth 2.0.

Categories: Programming

Cancelling $http requests for fun and profit

Xebia Blog - Tue, 02/17/2015 - 09:11

At my current client, we have a large AngularJS application that is configured to show a full-page error whenever one of the $http requests ends up in error. This is implemented with an error interceptor as you would expect it to be. However, we’re also using some calculation-intense resources that happen to time out once in a while. This combination is tricky: a user triggers a resource request when navigating to a certain page, navigates to a second page and suddenly ends up with an error message, as the request from the first page triggered a timeout error. This is a particularly unpleasant side effect that I’m going to address in a generic way in this post.

There are of course multiple solutions to this problem. We could create a more resilient implementation in the backend that will not time out, but accepts retries. We could change the full-page error into something less ‘in your face’ (but you still would get some out-of-place error notification). For this post I’m going to fix it using a different approach: cancel any running requests when a user switches to a different location (the route part of the URL). This makes sense; your browser does the same when navigating from one page to another, so why not mimic this behaviour in your Angular app?

I’ve created a pretty verbose implementation to explain how to do this. At the end of this post, you’ll find a link to the code as a packaged bower component that can be dropped in any Angular 1.2+ app.

To cancel a running request, Angular does not offer that many options. Under the hood, there are some places where you can hook into, but that won’t be necessary. If we look at the $http usage documentation, the timeout property is mentioned and it accepts a promise to abort the underlying call. Perfect! If we set a promise on all created requests, and abort these at once when the user navigates to another page, we’re (probably) all set.

Let’s write an interceptor to plug in the promise in each request:

angular.module('angularCancelOnNavigateModule')
  .factory('HttpRequestTimeoutInterceptor', function ($q, HttpPendingRequestsService) {
    return {
      request: function (config) {
        config = config || {};
        if (config.timeout === undefined && !config.noCancelOnRouteChange) {
          config.timeout = HttpPendingRequestsService.newTimeout();
        }
        return config;
      }
    };
  });

The interceptor will not overwrite the timeout property when it is explicitly set. Also, if the noCancelOnRouteChange option is set to true, the request won’t be cancelled. For better separation of concerns, I’ve created a new service (the HttpPendingRequestsService) that hands out new timeout promises and stores references to them.

Let’s have a look at that pending requests service:

angular.module('angularCancelOnNavigateModule')
  .service('HttpPendingRequestsService', function ($q) {
    var cancelPromises = [];

    function newTimeout() {
      var cancelPromise = $q.defer();
      cancelPromises.push(cancelPromise);
      return cancelPromise.promise;
    }

    function cancelAll() {
      angular.forEach(cancelPromises, function (cancelPromise) {
        cancelPromise.promise.isGloballyCancelled = true;
        cancelPromise.resolve();
      });
      cancelPromises.length = 0;
    }

    return {
      newTimeout: newTimeout,
      cancelAll: cancelAll
    };
  });

So, this service creates new timeout promises that are stored in an array. When the cancelAll function is called, all timeout promises are resolved (thus aborting all requests that were configured with the promise) and the array is cleared. By setting the isGloballyCancelled property on the promise object, a response promise method can check whether it was cancelled or another exception has occurred. I’ll come back to that one in a minute.

Now we hook up the interceptor and call the cancelAll function at a sensible moment. There are several events triggered on the root scope that are good hook candidates. Eventually I settled on $locationChangeSuccess. It is only fired when the location change is a success (hence the name) and not cancelled by any other event listener.

angular
  .module('angularCancelOnNavigateModule', [])
  .config(function($httpProvider) {
    $httpProvider.interceptors.push('HttpRequestTimeoutInterceptor');
  })
  .run(function ($rootScope, HttpPendingRequestsService) {
    $rootScope.$on('$locationChangeSuccess', function (event, newUrl, oldUrl) {
      if (newUrl !== oldUrl) {
        HttpPendingRequestsService.cancelAll();
      }
    })
  });

When writing tests for this setup, I found that the $locationChangeSuccess event is triggered at the start of each test, even though the location did not change yet. To circumvent this situation, the function does a simple difference check.

Another problem popped up during testing. When the request is cancelled, Angular creates an empty error response, which in our case still triggers the full-page error. We need to catch and handle those error responses. We can simply add a responseError function in our existing interceptor. And remember the special isGloballyCancelled property we set on the promise? That’s the way to distinguish between cancelled and other responses.

We add the following function to the interceptor:

      responseError: function (response) {
        if (response.config.timeout.isGloballyCancelled) {
          return $q.defer().promise;
        }
        return $q.reject(response);
      }

The responseError function must return a promise that normally re-throws the response as rejected. However, that’s not what we want: neither a success nor a failure callback should be called. We simply return a never-resolving promise for all cancelled requests to get the behaviour we want.

That’s all there is to it! To make it easy to reuse this functionality in your Angular application, I’ve packaged this module as a bower component that is fully tested. You can check the module out on this GitHub repo.

Google launches the Chinese language Developer Channel on YouTube

Google Code Blog - Tue, 02/17/2015 - 01:30

Posted by Bill Luan, Greater China Regional Lead, Google Developer Relations

Today, the Google Developer Platform team is launching a Chinese language and captioned YouTube channel, aiming to make it easier for the developers in China to learn more about Google services and technologies around mobile, web and the cloud. The channel includes original content in Chinese (Mandarin speaking), and curates content from the English version of the Google Developers channel with Simplified Chinese captions.

A special thank you to the volunteers in the Google Developers Group community in the city of Nanyang (Nanyang GDG) in China, for their effort and contribution in adding Chinese language translations to the English language Google Developers Channel videos on YouTube. Over time, we will produce more Chinese language original content, as well as continue to work with GDG volunteers in China to add Chinese captions to more English videos from the Google Developers Channel, to serve the learning needs of developers.

Categories: Programming

Python/pandas: Column value in list (ValueError: The truth value of a Series is ambiguous.)

Mark Needham - Mon, 02/16/2015 - 22:39

I’ve been using Python’s pandas library while exploring some CSV files and although for the most part I’ve found it intuitive to use, I had trouble filtering a data frame based on checking whether a column value was in a list.

A subset of one of the CSV files I’ve been working with looks like this:

$ cat foo.csv
"Foo"
1
2
3
4
5
6
7
8
9
10

Loading it into a pandas data frame is reasonably simple:

import pandas as pd
df = pd.read_csv('foo.csv', index_col=False, header=0)
>>> df
   Foo
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
9   10

If we want to find the rows which have a value of 1 we’d write the following:

>>> df[df["Foo"] == 1]
   Foo
0    1

Finding the rows with a value less than 7 is as you’d expect too:

>>> df[df["Foo"] < 7]
   Foo
0    1
1    2
2    3
3    4
4    5
5    6

Next I wanted to keep only the rows containing odd numbers, which I initially tried to do like this:

odds = [i for i in range(1,10) if i % 2 != 0]
>>> odds
[1, 3, 5, 7, 9]
 
>>> df[df["Foo"] in odds]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/pandas/core/generic.py", line 698, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Unfortunately that doesn’t work and I couldn’t get any of the suggestions from the error message to work either. Luckily pandas has a special isin function for this use case which we can call like this:

>>> df[df["Foo"].isin(odds)]
   Foo
0    1
2    3
4    5
6    7
8    9

Much better!
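
And if we wanted the even numbers instead, we can negate the boolean Series that isin returns with the ~ operator – a quick sketch against the same data frame:

>>> df[~df["Foo"].isin(odds)]
   Foo
1    2
3    4
5    6
7    8
9   10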

Categories: Programming

When development resembles the ageing of wine

Xebia Blog - Mon, 02/16/2015 - 20:29

Once upon a time I was asked to help out a software product company.  The management briefing went something like this: "We need you to increase productivity; the guys in development seem to be unable to ship anything, and if they do ship something it's only a fraction of what we expected."

And so the story begins. Now there are many ways to improve a team's outcome and its output (the first matters more), but it always starts with observing what they do today and trying to figure out why.

It turns out that requests from the business were treated like a good wine and were allowed to "age" in the oak barrel called Jira. Not so much to add flavour in the form of details, requirements, designs, non-functional requirements or acceptance criteria, but mainly to see if the priority of the request would remain stable over a period of time.

In the days that followed I participated in the "Change Control Board" and saw the problem first-hand. Management would change priorities on the fly and make swift decisions on requirements that would take weeks to implement. To stay with the wine metaphor, wine was poured in and out of the barrels at such a rate that it bore more resemblance to a blender than to the art of wine making.

Though management was happy to learn I had unearthed the root cause of their problem, they were less pleased to learn that they themselves were responsible.  The Agile world created the Product Owner role for this, and it turned out that this is a hat that can only be worn by a single person.

Once we funnelled all the requests through a single person, responsible both for the success of the product and for the development, we saw a big change. Not only did the business get a reliable sparring partner, but the development team had a single voice when it came to setting priorities. Once the team started finishing what it started, we began shipping at regular intervals, with features that we had all committed to.

Of course it did not take away the dynamics of the business, but it allowed us to deliver, and become reliable in how and when we responded to change. Perhaps not the most aged wine, but enough to delight our customers and learn what we should put in our barrel for the next round.

 

ScottGu Azure event in London on March 2nd

ScottGu's Blog - Scott Guthrie - Mon, 02/16/2015 - 19:16

On March 2nd I'm doing an Azure event in London that you can attend for free.  I'll be speaking for about 2.5 hours and will do an end-to-end walkthrough of Microsoft Azure, show off a bunch of demos of great new features/capabilities, and talk about some of the improvements coming out over the next few months.

logo[1]

You can sign up and attend the event for free (while tickets last - they are going fast).  If you are interested, sign up now.  The event is being held at the Mermaid Conference & Events Centre in Blackfriars, London:

mermaidspic3[1]

Hope to see some of you there!

Scott

Categories: Architecture, Programming

The Joel Test For Programmers (The Simple Programmer Test)

Making the Complex Simple - John Sonmez - Mon, 02/16/2015 - 17:00

A while back—the year 2000 to be exact—Joel Spolsky wrote a blog post entitled: “The Joel Test: 12 Steps to Better Code.” Many software engineers and developers use this test for evaluating a company to determine if a company is a good company to work for. In fact, many software development organizations use the Joel Test […]

The post The Joel Test For Programmers (The Simple Programmer Test) appeared first on Simple Programmer.

Categories: Programming

Python/scikit-learn: Calculating TF/IDF on How I met your mother transcripts

Mark Needham - Sun, 02/15/2015 - 16:56

Over the past few weeks I’ve been playing around with various NLP techniques to find interesting insights into How I met your mother from its transcripts and one technique that kept coming up is TF/IDF.

The Wikipedia definition reads like this:

tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

It is often used as a weighting factor in information retrieval and text mining.

The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.

I wanted to generate a TF/IDF representation of phrases used in the hope that it would reveal some common themes used in the show.

Python’s scikit-learn library gives you two ways to generate the TF/IDF representation:

  1. Generate a matrix of token/phrase counts from a collection of text documents using CountVectorizer and feed it to TfidfTransformer to generate the TF/IDF representation.
  2. Feed the collection of text documents directly to TfidfVectorizer and go straight to the TF/IDF representation skipping the middle man.

I started out using the first approach and hadn’t quite got it working when I realised there was a much easier way!
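
For reference, the two-step version would have looked roughly like this – just a sketch using the same parameters as the TfidfVectorizer call further down, not the code I ended up running:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
 
# step 1: a matrix of raw token/phrase counts per document
count_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,3), stop_words='english')
counts = count_vectorizer.fit_transform(corpus)
 
# step 2: re-weight those counts into TF/IDF scores
tfidf_matrix = TfidfTransformer().fit_transform(counts)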

I have a collection of sentences in a CSV file so the first step is to convert those into a list of documents:

from collections import defaultdict
import csv
 
episodes = defaultdict(list)
with open("data/import/sentences.csv", "r") as sentences_file:
    reader = csv.reader(sentences_file, delimiter=',')
    reader.next()
    for row in reader:
        episodes[row[1]].append(row[4])
 
for episode_id, text in episodes.iteritems():
    episodes[episode_id] = "".join(text)
 
corpus = []
for id, episode in sorted(episodes.iteritems(), key=lambda t: int(t[0])):
    corpus.append(episode)

corpus contains 208 entries (1 per episode), each of which is a string containing the transcript of that episode. Next it’s time to train our TF/IDF model which is only a few lines of code:

from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df = 0, stop_words = 'english')

The most interesting parameter here is ngram_range – we’re telling it to generate 2 and 3 word phrases along with the single words from the corpus.

e.g. if we had the sentence “Python is cool” we’d end up with 6 phrases – ‘Python’, ‘is’, ‘cool’, ‘Python is’, ‘Python is cool’ and ‘is cool’.
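
We can sanity-check that in the REPL by running CountVectorizer (which shares its tokenisation with TfidfVectorizer and lowercases by default) over that single sentence:

from sklearn.feature_extraction.text import CountVectorizer
 
cv = CountVectorizer(analyzer='word', ngram_range=(1,3))
cv.fit(["Python is cool"])
 
>>> cv.get_feature_names()
[u'cool', u'is', u'is cool', u'python', u'python is', u'python is cool']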

Let’s execute the model against our corpus:

tfidf_matrix = tf.fit_transform(corpus)
feature_names = tf.get_feature_names()
>>> len(feature_names)
498254
 
>>> feature_names[50:70]
[u'00 does sound', u'00 don', u'00 don buy', u'00 dressed', u'00 dressed blond', u'00 drunkenly', u'00 drunkenly slurred', u'00 fair', u'00 fair tonight', u'00 fall', u'00 fall foliage', u'00 far', u'00 far impossible', u'00 fart', u'00 fart sure', u'00 friends', u'00 friends singing', u'00 getting', u'00 getting guys', u'00 god']

So we’ve got nearly 500,000 phrases and if we look at tfidf_matrix we’d expect it to be a 208 x 498254 matrix – one row per episode, one column per phrase:

>>> tfidf_matrix
<208x498254 sparse matrix of type '<type 'numpy.float64'>'
	with 740396 stored elements in Compressed Sparse Row format>

This is what we’ve got although under the covers it’s using a sparse representation to save space. Let’s convert the matrix to dense format to explore further and find out why:

dense = tfidf_matrix.todense()
>>> len(dense[0].tolist()[0])
498254

What I’ve printed out here is the size of one row of the matrix which contains the TF/IDF score for every phrase in our corpus for the 1st episode of How I met your mother. A lot of those phrases won’t have happened in the 1st episode so let’s filter those out:

episode = dense[0].tolist()[0]
phrase_scores = [pair for pair in zip(range(0, len(episode)), episode) if pair[1] > 0]
 
>>> len(phrase_scores)
4823

There are just under 5000 phrases used in this episode, roughly 1% of the phrases in the whole corpus.
The sparse matrix makes a bit more sense – if scipy used a dense matrix representation there’d be 493,000 entries with no score which becomes more significant as the number of documents increases.
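
If you want those numbers without converting to a dense matrix, scipy’s sparse matrices expose them directly:

>>> tfidf_matrix.shape
(208, 498254)
>>> tfidf_matrix.nnz
740396

That’s 740,396 stored scores out of 208 x 498,254 cells, so well under 1% of the matrix is actually populated.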

Next we’ll sort the phrases by score in descending order to find the most interesting phrases for the first episode of How I met your mother:

>>> sorted(phrase_scores, key=lambda t: t[1] * -1)[:5]
[(419207, 0.2625177493269755), (312591, 0.19571419072701732), (267538, 0.15551468983363487), (490429, 0.15227880637176266), (356632, 0.1304175242341549)]

The first value in each tuple is the phrase’s position in our initial vector and also corresponds to the phrase’s position in feature_names which allows us to map the scores back to phrases. Let’s look up a couple of phrases:

>>> feature_names[419207]
u'ted'
>>> feature_names[312591]
u'olives'
>>> feature_names[356632]
u'robin'

Let’s automate that lookup:

sorted_phrase_scores = sorted(phrase_scores, key=lambda t: t[1] * -1)
for phrase, score in [(feature_names[word_id], score) for (word_id, score) in sorted_phrase_scores][:20]:
   print('{0: <20} {1}'.format(phrase, score))
 
ted                  0.262517749327
olives               0.195714190727
marshall             0.155514689834
yasmine              0.152278806372
robin                0.130417524234
barney               0.124411751867
lily                 0.122924977859
signal               0.103793246466
goanna               0.0981379875009
scene                0.0953423604123
cut                  0.0917336653574
narrator             0.0864622981985
flashback            0.078295921554
flashback date       0.0702825260177
ranjit               0.0693927691559
flashback date robin 0.0585687716814
ted yasmine          0.0585687716814
carl                 0.0582101172888
eye patch            0.0543650529797
lebanese             0.0543650529797

We see all the main characters’ names, which aren’t that interesting – perhaps they should be part of the stop list – but also ‘olives’, which is where the olive theory is first mentioned. I thought olives came up more often but a quick search for the term suggests it isn’t mentioned again until Episode 9 of Season 9:

$ grep -rni --color "olives" data/import/sentences.csv | cut -d, -f 2,3,4 | sort | uniq -c
  16 1,1,1
   3 193,9,9

‘yasmine’ is also an interesting phrase in this episode but she’s never mentioned again:

$ grep -h -rni --color "yasmine" data/import/sentences.csv
49:48,1,1,1,"Barney: (Taps a woman names Yasmine) Hi, have you met Ted? (Leaves and watches from a distance)."
50:49,1,1,1,"Ted: (To Yasmine) Hi, I'm Ted."
51:50,1,1,1,Yasmine: Yasmine.
53:52,1,1,1,"Yasmine: Thanks, It's Lebanese."
65:64,1,1,1,"[Cut to the bar, Ted is chatting with Yasmine]"
67:66,1,1,1,Yasmine: So do you think you'll ever get married?
68:67,1,1,1,"Ted: Well maybe eventually. Some fall day. Possibly in Central Park. Simple ceremony, we'll write our own vows. But--eh--no DJ, people will dance. I'm not going to worry about it! Damn it, why did Marshall have to get engaged? (Yasmine laughs) Yeah, nothing hotter than a guy planning out his own imaginary wedding, huh?"
69:68,1,1,1,"Yasmine: Actually, I think it's cute."
79:78,1,1,1,"Lily: You are unbelievable, Marshall. No-(Scene splits in half and shows both Lily and Marshall on top arguing and Ted and Yasmine on the bottom mingling)"
82:81,1,1,1,Ted: (To Yasmine) you wanna go out sometime?
85:84,1,1,1,[Cut to Scene with Ted and Yasmine at bar]
86:85,1,1,1,Yasmine: I'm sorry; Carl's my boyfriend (points to bartender)

It would be interesting to filter out the phrases which don’t occur in any other episode and see what insights we get from doing that. For now though we’ll extract phrases for all episodes and write to CSV so we can explore more easily:

with open("data/import/tfidf_scikit.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["EpisodeId", "Phrase", "Score"])
 
    doc_id = 0
    for doc in tfidf_matrix.todense():
        print "Document %d" %(doc_id)
        word_id = 0
        for score in doc.tolist()[0]:
            if score > 0:
                word = feature_names[word_id]
                writer.writerow([doc_id+1, word.encode("utf-8"), score])
            word_id +=1
        doc_id +=1

And finally a quick look at the contents of the CSV:

$ tail -n 10 data/import/tfidf_scikit.csv
208,york apparently laughs,0.012174304095213192
208,york aren,0.012174304095213192
208,york aren supposed,0.012174304095213192
208,young,0.013397275854758335
208,young ladies,0.012174304095213192
208,young ladies need,0.012174304095213192
208,young man,0.008437685963000223
208,young man game,0.012174304095213192
208,young stupid,0.011506395106658192
208,young stupid sighs,0.012174304095213192
Categories: Programming

Diamond Kata - Some Thoughts on Tests as Documentation

Mistaeks I Hav Made - Nat Pryce - Sun, 02/15/2015 - 13:13
Comparing example-based tests and property-based tests for the Diamond Kata, I’m struck by how well property-based tests reduce duplication of test code. For example, in the solutions by Sandro Mancuso and George Dinwiddie, not only do multiple tests exercise the same property with different examples, but the tests duplicate assertions. Property-based tests avoid the former by defining generators of input data, but I’m not sure why the latter occurs. Perhaps Seb’s “test recycling” approach would avoid this kind of duplication.

But compared to example-based tests, property-based tests do not work so well as an explanatory overview. Examples convey an overall impression of what the functionality is, but are not good at describing precise details. When reading example-based tests, you have to infer the properties of the code from multiple examples and informal text in identifiers and comments. The property-based tests I wrote for the Diamond Kata specify precise properties of the diamond function, but nowhere is there a test that describes that the function draws a diamond!

There’s a place for both examples and properties. It’s not an either/or decision. However, explanatory examples used for documentation need not be test inputs. If we’re generating inputs for property tests and generating documentation for our software, we can combine the two, and insert generated inputs and calculated outputs into generated documentation.
Categories: Programming, Testing & QA

The Great Love Quotes Collection Revamped

A while back I put together a comprehensive collection of love quotes.   It’s a combination of the wisdom of the ages + modern sages.   In the spirit of Valentine’s Day, I gave it a good revamp.  Here it is:

The Great Love Quotes Collection

It's a serious collection of love quotes and includes lessons from the likes of Lucille Ball, Shakespeare, Socrates, and even The Princess Bride.

How I Organized the Categories for Love Quotes

I organized the quotes into a set of buckets:
  • Beauty
  • Broken Hearts and Loss
  • Falling in Love
  • Fear and Love
  • Fun and Love
  • Kissing
  • Love and Life
  • Significance and Meaning
  • The Power of Love
  • True Love

I think there’s a little something for everyone among the various buckets.   If you walk away with three new quotes that make you feel a little lighter, put a little skip in your step, or help you see love in a new light, then mission accomplished.

Think of Love as Warmth and Connection

If you think of love like warmth and connection, you can create more micro-moments of love in your life.

This might not seem like a big deal, but if you knew all the benefits for your heart, brain, bodily processes, and even your life span, you might think twice.

You might be surprised by how much your career can be limited if you don’t balance connection with conviction.  It’s not uncommon to hear about turning points in the careers of developers, program managers, IT leaders, and business leaders that changed their game when they changed their heart.

In fact, on one of the teams I was on, the original mantra was “business before technology”, but people in the halls started to say, “people before business, business before technology” to remind people of what makes business go round.

When people treat each other better, work and life get better.

Love Quotes Help with Insights and Actions

Here are a few of my favorite love quotes from the collection …

“Love is like heaven, but it can hurt like hell.” – Unknown

“Love is not a feeling, it’s an ability.” — Dan in Real Life

“There is a place you can touch a woman that will drive her crazy. Her heart.” — Milk Money

“Hearts will be practical only when they are made unbreakable.”  – The Wizard of Oz

“Things are beautiful if you love them.” – Jean Anouilh

“Life is messy. Love is messier.” – Catch and Release

“To the world you may be just one person, but to one person you may be the world.” – Unknown

For many more quotes, explore The Great Love Quotes Collection.

You Might Also Like

Happiness Quotes Revamped

My Story of Personal Transformation

The Great Leadership Quotes Collection Revamped

The Great Personal Development Quotes Collection Revamped

The Great Productivity Quotes Collection

Categories: Architecture, Programming

Neo4j: Building a topic graph with Prismatic Interest Graph API

Mark Needham - Sat, 02/14/2015 - 00:38

Over the last few weeks I’ve been using various NLP libraries to derive topics for my corpus of How I met your mother episodes without success, and was therefore enthused to see the release of Prismatic’s Interest Graph API.

The Interest Graph API exposes a web service to which you feed a block of text and get back a set of topics, each with an associated score.

It has been trained over the last few years with millions of articles that people share on their social media accounts and in my experience using Prismatic the topics have been very useful for finding new material to read.

The first step is to head to interest-graph.getprismatic.com and get an API key which will be emailed to you.

Having done that we’re ready to make some calls to the API and get back some topics.

I’m going to use Python to call the API and I’ve found the requests library the easiest library to use for this type of work. Our call to the API looks like this:

import requests
payload = { 'title': "insert title of article here",
            'body': "insert body of text here",
            'api-token': "insert token sent by email here"}
r = requests.post("http://interest-graph.getprismatic.com/text/topic", data=payload)

One thing to keep in mind is that the API is rate limited to 20 requests a second so we need to restrict our requests or we’re going to receive error response codes. Luckily I came across an excellent blog post showing how to write a decorator around a function and only allow it to execute at a certain frequency.

To rate limit our calls to the Interest Graph we need to pull the above code into a function and annotate it appropriately:

import time
 
def RateLimited(maxPerSecond):
    minInterval = 1.0 / float(maxPerSecond)
    def decorate(func):
        lastTimeCalled = [0.0]
        def rateLimitedFunction(*args,**kargs):
            elapsed = time.clock() - lastTimeCalled[0]
            leftToWait = minInterval - elapsed
            if leftToWait>0:
                time.sleep(leftToWait)
            ret = func(*args,**kargs)
            lastTimeCalled[0] = time.clock()
            return ret
        return rateLimitedFunction
    return decorate
 
@RateLimited(0.3)
def topics(title, body):
    payload = { 'title': title,
                'body': body,
                'api-token': "insert token sent by email here"}
    r = requests.post("http://interest-graph.getprismatic.com/text/topic", data=payload)
    return r

The text I want to classify is stored in a CSV file – one sentence per line. Here’s a sample:

$ head -n 10 data/import/sentences.csv
SentenceId,EpisodeId,Season,Episode,Sentence
1,1,1,1,Pilot
2,1,1,1,Scene One
3,1,1,1,[Title: The Year 2030]
4,1,1,1,"Narrator: Kids, I'm going to tell you an incredible story. The story of how I met your mother"
5,1,1,1,Son: Are we being punished for something?
6,1,1,1,Narrator: No
7,1,1,1,"Daughter: Yeah, is this going to take a while?"
8,1,1,1,"Narrator: Yes. (Kids are annoyed) Twenty-five years ago, before I was dad, I had this whole other life."
9,1,1,1,"(Music Plays, Title ""How I Met Your Mother"" appears)"

We’ll also need to refer to another CSV file to get the title of each episode since it isn’t being stored with the sentence:

$ head -n 10 data/import/episodes_full.csv
NumberOverall,NumberInSeason,Episode,Season,DateAired,Timestamp,Title,Director,Viewers,Writers,Rating
1,1,/wiki/Pilot,1,"September 19, 2005",1127084400,Pilot,Pamela Fryman,10.94,"Carter Bays,Craig Thomas",68
2,2,/wiki/Purple_Giraffe,1,"September 26, 2005",1127689200,Purple Giraffe,Pamela Fryman,10.40,"Carter Bays,Craig Thomas",63
3,3,/wiki/Sweet_Taste_of_Liberty,1,"October 3, 2005",1128294000,Sweet Taste of Liberty,Pamela Fryman,10.44,"Phil Lord,Chris Miller",67
4,4,/wiki/Return_of_the_Shirt,1,"October 10, 2005",1128898800,Return of the Shirt,Pamela Fryman,9.84,Kourtney Kang,59
5,5,/wiki/Okay_Awesome,1,"October 17, 2005",1129503600,Okay Awesome,Pamela Fryman,10.14,Chris Harris,53
6,6,/wiki/Slutty_Pumpkin,1,"October 24, 2005",1130108400,Slutty Pumpkin,Pamela Fryman,10.89,Brenda Hsueh,62
7,7,/wiki/Matchmaker,1,"November 7, 2005",1131321600,Matchmaker,Pamela Fryman,10.55,"Sam Johnson,Chris Marcil",57
8,8,/wiki/The_Duel,1,"November 14, 2005",1131926400,The Duel,Pamela Fryman,10.35,Gloria Calderon Kellett,46
9,9,/wiki/Belly_Full_of_Turkey,1,"November 21, 2005",1132531200,Belly Full of Turkey,Pamela Fryman,10.29,"Phil Lord,Chris Miller",60

Now we need to get our episode titles and transcripts ready to pass to the topics function. Since we’ve only got ~ 200 episodes we can create a dictionary to store that data:

episodes = {}
with open("data/import/episodes_full.csv", "r") as episodesfile:
    episodes_reader = csv.reader(episodesfile, delimiter=",")
    episodes_reader.next()
    for episode in episodes_reader:
        episodes[int(episode[0])] = {"title": episode[6], "sentences" : [] }
 
with open("data/import/sentences.csv", "r") as sentencesfile:
     sentences_reader = csv.reader(sentencesfile, delimiter=",")
     sentences_reader.next()
     for sentence in sentences_reader:
         episodes[int(sentence[1])]["sentences"].append(sentence[4])
 
>>> episodes[1]["title"]
'Pilot'
>>> episodes[1]["sentences"][:5]
['Pilot', 'Scene One', '[Title: The Year 2030]', "Narrator: Kids, I'm going to tell you an incredible story. The story of how I met your mother", 'Son: Are we being punished for something?']

Now we’re going to loop through each of the episodes, call topics and write the result into a CSV file so we can load it into Neo4j afterwards to explore the data:

import json
 
with open("data/import/topics.csv", "w") as topicsfile:
    topics_writer = csv.writer(topicsfile, delimiter=",")
    topics_writer.writerow(["EpisodeId", "TopicId", "Topic", "Score"])
 
    for episode_id, episode in episodes.iteritems():
        tmp = topics(episode["title"], "".join(episode["sentences"])).json()
        print episode_id, tmp
        for topic in tmp['topics']:
            topics_writer.writerow([episode_id, topic["id"], topic["topic"], topic["score"]])

It takes about 10 minutes to run and this is a sample of the output:

$ head -n 10 data/import/topics.csv
EpisodeId,TopicId,Topic,Score
1,1519,Fiction,0.5798245566455255
1,2015,Humour,0.565154963605359
1,24031,Laughing,0.5587120401021765
1,16693,Flirting,0.5514098189505282
1,1163,Dating and Courtship,0.5487490108554022
1,2386,Kissing,0.5476185929151934
1,31929,Puns,0.5375100569837977
2,24031,Laughing,0.5670926949850333
2,1519,Fiction,0.5396488295397263

We’ll use Neo4j’s LOAD CSV command to load the data in:

// make sure the topics exist
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/topics.csv" AS row
MERGE (topic:Topic {id: TOINT(row.TopicId)})
ON CREATE SET topic.value = row.Topic
// now link the episodes and topics
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/topics.csv" AS row
MATCH (topic:Topic {id: TOINT(row.TopicId)})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (episode)-[:TOPIC {score: TOFLOAT(row.Score)}]->(topic)

We’ll assume that the episodes and seasons are already loaded – the commands to load those in are on github.

We can now write some queries against our topic graph. We’ll start simple – show me the topics for an episode:

MATCH (episode:Episode {id: 1})-[r:TOPIC]->(topic)
RETURN topic, r

Graph

Let’s say we liked the ‘Puns’ aspect of the Pilot episode and want to find out which other episodes had puns. The following query would let us find those:

MATCH (episode:Episode {id: 1})-[r:TOPIC]->(topic {value: "Puns"})<-[:TOPIC]-(other)
RETURN episode, topic, other

Graph  1

Or maybe we want to find the episode which has the most topics in common:

MATCH (episode:Episode {id: 1})-[:TOPIC]->(topic),
      (topic)<-[r:TOPIC]-(otherEpisode)
RETURN otherEpisode.title as episode, COUNT(r) AS topicsInCommon
ORDER BY topicsInCommon DESC
LIMIT 10
==> +------------------------------------------------+
==> | episode                       | topicsInCommon |
==> +------------------------------------------------+
==> | "Purple Giraffe"              | 6              |
==> | "Ten Sessions"                | 5              |
==> | "Farhampton"                  | 4              |
==> | "The Three Days Rule"         | 4              |
==> | "How I Met Everyone Else"     | 4              |
==> | "The Time Travelers"          | 4              |
==> | "Mary the Paralegal"          | 4              |
==> | "Lobster Crawl"               | 4              |
==> | "The Magician's Code, Part 2" | 4              |
==> | "Slutty Pumpkin"              | 4              |
==> +------------------------------------------------+
==> 10 rows

We could then tweak that query to get the names of those topics:

MATCH (episode:Episode {id: 1})-[:TOPIC]->(topic),
      (topic)<-[r:TOPIC]-(otherEpisode)-[:IN_SEASON]->(season)
RETURN otherEpisode.title as episode, season.number AS season, COUNT(r) AS topicsInCommon, COLLECT(topic.value)
ORDER BY topicsInCommon DESC
LIMIT 10
 
==> +-----------------------------------------------------------------------------------------------------------------------------------+
==> | episode                   | season | topicsInCommon | COLLECT(topic.value)                                                        |
==> +-----------------------------------------------------------------------------------------------------------------------------------+
==> | "Purple Giraffe"          | "1"    | 6              | ["Humour","Fiction","Kissing","Dating and Courtship","Flirting","Laughing"] |
==> | "Ten Sessions"            | "3"    | 5              | ["Humour","Puns","Dating and Courtship","Flirting","Laughing"]              |
==> | "How I Met Everyone Else" | "3"    | 4              | ["Humour","Fiction","Dating and Courtship","Laughing"]                      |
==> | "Farhampton"              | "8"    | 4              | ["Humour","Fiction","Kissing","Dating and Courtship"]                       |
==> | "Bedtime Stories"         | "9"    | 4              | ["Humour","Puns","Dating and Courtship","Laughing"]                         |
==> | "Definitions"             | "5"    | 4              | ["Kissing","Dating and Courtship","Flirting","Laughing"]                    |
==> | "Lobster Crawl"           | "8"    | 4              | ["Humour","Dating and Courtship","Flirting","Laughing"]                     |
==> | "Little Boys"             | "3"    | 4              | ["Humour","Puns","Dating and Courtship","Laughing"]                         |
==> | "Wait for It"             | "3"    | 4              | ["Fiction","Puns","Flirting","Laughing"]                                    |
==> | "Mary the Paralegal"      | "1"    | 4              | ["Humour","Dating and Courtship","Flirting","Laughing"]                     |
==> +-----------------------------------------------------------------------------------------------------------------------------------+

Overall 168 (out of 208) of the other episodes have a topic in common with the first episode so perhaps just having a topic in common isn’t the best indication of similarity.

An interesting next step would be to calculate cosine or jaccard similarity between the episodes and store that value in the graph for querying later on.
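
As a rough sketch of the Jaccard half of that idea – just an illustration that works off the topics.csv file we generated earlier and does a naive pair-wise comparison in Python rather than in the graph:

import csv
from collections import defaultdict
from itertools import combinations
 
# build a set of topic ids per episode from topics.csv
episode_topics = defaultdict(set)
with open("data/import/topics.csv", "r") as topics_file:
    reader = csv.reader(topics_file, delimiter=",")
    reader.next()
    for episode_id, topic_id, topic, score in reader:
        episode_topics[episode_id].add(topic_id)
 
# Jaccard similarity = size of the intersection / size of the union
def jaccard(one, other):
    return len(one & other) / float(len(one | other))
 
pairs = combinations(sorted(episode_topics.keys(), key=int), 2)
similarities = [(e1, e2, jaccard(episode_topics[e1], episode_topics[e2])) for e1, e2 in pairs]
 
# the 10 most similar pairs of episodes
print sorted(similarities, key=lambda t: t[2], reverse=True)[:10]

Each score could then be written back into the graph as a relationship property so it can be queried with Cypher later on.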

I’ve also calculated the most common bigrams across all the transcripts so it would be interesting to see if there are any interesting insights at the intersection of episodes, topics and phrases.

Categories: Programming

Beta Channel for the Android WebView

Android Developers Blog - Fri, 02/13/2015 - 16:59

Posted by Richard Coles, Software Engineer, Google London

Many Android apps use a WebView for displaying HTML content. In Android 5.0 Lollipop, Google has the ability to update WebView independently of the Android platform. Beginning today, developers can use a new beta channel to test the latest version of WebView and provide feedback.

WebView updates bring numerous bug fixes, new web platform APIs and updates from Chromium. If you’re making use of the WebView in your app, becoming a beta channel tester will give you an early start with new APIs as well as the chance to test your app before the WebView rolls out to your users.

The first version offered in the beta channel will be based on Chrome 40 and you can find a full list of changes on the chromium blog entry.

To become a beta tester, join the community which will enable you to sign up for the Beta program; you’ll then be able to install the beta version of the WebView via the Play Store. If you find any bugs, please file them on the Chromium issue tracker.

Join the discussion on +Android Developers
Categories: Programming