Simplified MapReduce

I believe one of the best ways to solve a programming problem is to find a paper or article about it and read it for clues. Of course, Wikipedia, IBM, and other sources are really helpful, but somehow reading them is a nightmare for me because of the complexity of the explanations. Thus, I prefer to read a straightforward article to find the clue.

You have a problem you find a paper about it. Now you have one and a half problems. Understanding the paper, and implementing it.

— Amir Mohammad Saied (@gluegadget) February 6, 2014

And now, I want to give a straightforward description of one widely used algorithm, called MapReduce. Perhaps you have heard of it before in Hadoop, MongoDB, or NoSQL discussions.

Here is an introduction to what MapReduce is:

MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.

From: http://en.wikipedia.org/wiki/MapReduce

Mainly, MapReduce is used to gather information from massive datasets, faster and more easily. The algorithm consists of two main functions, map and reduce. The map function collects data from the input; at this step, it breaks the input into smaller chunks. In the reduce function, we aggregate the map function’s results to produce a single result.

The reduce function is always performed after the map function.
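
Before the news example below, here is a minimal, framework-free sketch of the two phases in plain JavaScript; the input strings and variable names are made up for illustration. The map phase emits a (key, value) pair per word, and the reduce phase sums the values of each key:

// A minimal, framework-free sketch of the map/reduce idea in plain JavaScript.
// Map phase: break each input line into (key, value) pairs, one per word.
var inputs = ['hello world', 'hello news'];

var mapped = [];
inputs.forEach(function (line) {
  line.split(' ').forEach(function (word) {
    mapped.push({ key: word, value: 1 });
  });
});

// Group the emitted values by key, as a MapReduce framework would do between the two phases.
var grouped = {};
mapped.forEach(function (pair) {
  (grouped[pair.key] = grouped[pair.key] || []).push(pair.value);
});

// Reduce phase: combine each key's group of values into a single total.
var totals = {};
Object.keys(grouped).forEach(function (key) {
  totals[key] = grouped[key].reduce(function (sum, value) {
    return sum + value;
  }, 0);
});

// totals is now { hello: 2, world: 1, news: 1 }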

To understand the process better, I’d like to give an example. Suppose we have a news website, and each news item is an entity in our database. Each news item has an array of keywords that describe it. The following is a sample news item:

{
  title: 'Hello world!',
  description: 'Hello world! This is the first post from our awesome news portal; we will publish more news here. Thanks.',
  keywords: [{
    word: 'hello',
    count: 1
  }, {
    word: 'world',
    count: 1
  }, {
    word: 'news',
    count: 2
  }, {
    word: 'post',
    count: 1
  }]
}

So, what do we want to do? We have a lot of news items, and an array of keywords inside each one. We are going to determine the popular keywords across all news items.

First of all, the map function breaks each news item into smaller pieces: inside the map function we emit each keyword together with its repeat count. The emit function pushes new values into a temporary key-value collection; this collection is later used by the reduce function to generate a single value per key.

Following is an example of map function source code:

function () {
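  // for each keyword of this news item, emit its word as the key and its count as the value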
  this.keywords.forEach(function (doc) {
    emit(doc.word, doc.count);
  })
}

To understand the map function better, here is an example of its output. When the word "hello" appears in two news items, with counts of 1 and 3, the output will be:

{ "hello": [1, 3] }

And when the word "post" appears in just one news item, with a count of 2, the output will be:

{ "post": [2] }

Then we have the reduce function. Inside the reduce function we wrap up the map function’s results to create a single value. That single value is a keyword with the total count of its repetitions across all news items.

Following is the reduce function source code:

function (key, values) {
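  // sum all of the counts emitted for this key into a single total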
  return Array.sum(values);
};

So, for the first map output above, the result of the reduce function is:

{ id: "hello", value: 4 }

And for the second map output, the result will be:

{ id: "post", value: 2 }

After performing the reduce function, we have a set of keywords with their total repetition counts across all news items; that is, the list of popular keywords.

Of course, the above explanation was only a brief look at the MapReduce algorithm. There are a lot of MapReduce frameworks, and you can find MapReduce support in NoSQL databases, MongoDB for instance.
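
As a sketch of how the two functions above could be wired together in the MongoDB shell: the collection name news and the output collection keyword_counts are assumptions for illustration, not taken from a real project.

// Sketch: running the map and reduce functions above with MongoDB's mapReduce command.
// The collection name (news) and output collection (keyword_counts) are just examples.
var mapFn = function () {
  this.keywords.forEach(function (doc) {
    emit(doc.word, doc.count);
  });
};

var reduceFn = function (key, values) {
  return Array.sum(values);
};

db.news.mapReduce(mapFn, reduceFn, { out: 'keyword_counts' });

// The popular keywords end up in the keyword_counts collection as
// documents of the form { _id: "hello", value: 4 }.
db.keyword_counts.find().sort({ value: -1 });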


Programming languages war

I frequently hear a tedious conversation between my colleagues: "PHP is better than foo", ".NET is better than boo", and so on. Each time I hear this sort of dialogue I ask for the reasoning behind the comparison, but so far no one has given a proper answer, since such a comparison is somehow impossible.

Up till now, I’ve coded in JavaScript, PHP, C#, and a little bit of Python. I’m still a newbie in the programming industry, but at least I’ve worked on a lot of shared projects with some experts. Being part of those projects taught me how to tackle a problem, how to choose adequate tools or a language, and how to prepare the environment to solve it. In almost all cases, picking the programming language wasn’t the bottleneck; we still chose it by considering the problem’s parameters. Designing a good architecture and implementing it correctly was the main goal in our projects.

I’d like to point out that, obviously, the programming environment is not the only parameter in building a robust application. The most important factor in getting a good result is the knowledge of the programmers, not the features of the programming language. I don’t use more than 50% of a programming language’s features during development, and I bet no one else does either.

As time passes, old-fashioned programming languages are retired and new technologies enter the battlefield. Consequently, knowing a programming language well is not that valuable in itself; it is better to know the concepts.

DISCLAIMER: The above is just my own opinion, and obviously you don’t necessarily have to agree with it.


Async vs. Sync I/O benchmark in NodeJs

As you know, NodeJs is a non-blocking I/O platform that lets you write non-blocking, event-based code. It has async methods for I/O, but it also provides sync versions of those methods. This means you can write to a file with the async/non-blocking methods, or do the same with the sync methods.

So, in this post I want to show you the difference between using async (non-blocking) I/O and sync I/O. Here I have an HTTP server with a simple job: it reads a static file from disk and returns the file’s content to the user in response to an HTTP request. There are two different ways to read a file from disk in NodeJs: with fs.open (async) or fs.openSync (sync).
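
To make the setup concrete, here is a minimal sketch of what such a server might look like; it is not the exact code from the benchmark repo, and the file name static.html and the ports are assumptions. It uses fs.readFile and fs.readFileSync, which wrap the stat/open/read/close steps mentioned below:

// Minimal sketch of the benchmark server, assuming a static file called static.html.
var http = require('http');
var fs = require('fs');

// Async version: the event loop stays free while the file is being read.
http.createServer(function (req, res) {
  fs.readFile('./static.html', function (err, data) {
    if (err) {
      res.writeHead(500);
      return res.end('Error reading file');
    }
    res.writeHead(200, { 'Content-Type': 'text/html' });
    res.end(data);
  });
}).listen(8081);

// Sync version: every request blocks the event loop until the read finishes.
// http.createServer(function (req, res) {
//   var data = fs.readFileSync('./static.html');
//   res.writeHead(200, { 'Content-Type': 'text/html' });
//   res.end(data);
// }).listen(8082);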

The results speak for themselves, as expected. When we read a file in async mode, all the steps of reading a file (stat, open, read, close) are asynchronous, which means the reading process does not block the request (less request time). In sync mode, each step has to wait for the previous step’s result, so it takes longer than async mode.

I used Apache Benchmark (ab) for these tests, with these parameters:

ab -n 1000 -c 1000 -vhr http://localhost:8081/

And the test system is:

CentOS, Linux 2.6.18-164.el5, NodeJs v0.8.8, 512MB memory, QEMU virtual CPU.

Well, let’s see the results.

Async mode:

Time taken for tests: 3.800 seconds
Requests per second: 263.19 [#/sec] (mean)
Time per request: 3799.512 [ms] (mean)
Time per request: 3.800 [ms] (mean, across all concurrent requests)

Percentage of the requests served within a certain time (ms)
50% 2667
66% 2682
75% 3752
80% 3752
90% 3761
95% 3765
98% 3765
99% 3765
100% 3765 (longest request)

Sync mode:

Time taken for tests: 4.809 seconds
Requests per second: 207.95 [#/sec] (mean)
Time per request: 4808.944 [ms] (mean)
Time per request: 4.809 [ms] (mean, across all concurrent requests)

Percentage of the requests served within a certain time (ms)
50% 2418
66% 3152
75% 3585
80% 3827
90% 4320
95% 4551
98% 4712
99% 4760
100% 4809 (longest request)

You can see that in async mode it handles about 263 requests per second, while in sync mode it’s about 208.

I made this test to show the power of async I/O functions in NodeJs, and also to show NodeJs developers that using the sync I/O functions is not a good solution to callback hell. There are several better approaches to the callback hell problem, so keep using the async functions.
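
For example (a sketch, not code from the benchmark repo; the file names are made up), simply naming your callbacks instead of nesting anonymous ones keeps the code flat while staying fully asynchronous:

// Sketch: flattening nested callbacks with named functions instead of
// falling back to the blocking *Sync APIs. File names are just examples.
var fs = require('fs');

function onRead(err, data) {
  if (err) return console.error(err);
  fs.writeFile('./copy.txt', data, onWrite);
}

function onWrite(err) {
  if (err) return console.error(err);
  console.log('done, without blocking the event loop');
}

fs.readFile('./original.txt', onRead);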

You can download and run this test yourself; I made a GitHub repo where you can get it: https://github.com/afshinm/Async-Sync-IO-benchmark


Migrating Git Repositories

Moving a Git repository from one server to another is a common situation that any developer may face, for example moving repositories from Bitbucket to GitHub or vice versa.

This task is really tedious when you have to move a lot of repositories: you have to clone each one manually and then push it to the target server. Boring.

I wrote a shell script that helps you move repositories from any Git server to another; you simply configure it and then hit Enter.

Configuration

OK, to get started you should clone or download this repository from GitHub: https://github.com/afshinm/git-migrate

After that, you will find two files, migrate.sh and CONFIG. We use the CONFIG file to configure our migration; it contains the from and to servers.

Our CONFIG file looks like this:

repoName1:fromServer1:toServer1
repoName2:fromServer2:toServer2
repoName3:fromServer3:toServer3

Each repository goes on its own line, and each line has three fields separated by the : character: the repository name, the source Git server, and the destination server. You can choose anything for the name part of the config; it doesn’t matter.

Here you can see an example of the CONFIG file:

test:git@bitbucket.org:afshinm/test.git:git@github.com:afshinm/test.git

In the above example, the test repository is moved from Bitbucket to GitHub. Please note that if any field contains the : character, you should put a backslash before it to prevent a conflict between the field separators and the values.

You can use both HTTPS and SSH URLs for the from and to servers, but I prefer the SSH form (in that case you need to create an SSH key and add it to both the from and to servers; see this article).

Executing

After saving the CONFIG file, everything is ready for the migration. Just type the command below in your shell and press Enter:

./migrate.sh

Then you will see a log of the migration in your shell. You will also notice if there are any errors during the migration.


Using CSS Fallback Properties for Better Cross-browser Compatibility

As you may know, Internet Explorer has supported something called conditional comments, which allow you to include specific HTML or CSS based on the result of a condition. Conditional comments in HTML first appeared in Microsoft’s Internet Explorer 5 browser, but they have been deprecated as of Internet Explorer 10.

Internet Explorer has several problems with CSS, especially in IE 6, 7, and 8. Web developers have used conditional comments to provide better browser compatibility for Internet Explorer, often by including extra CSS files that fix bad behaviors and rendering mistakes. This works, but it can actually be done in a better and simpler way.

CSS Basics

Before going further, let’s discuss a little about how CSS works. Suppose you have this code:

.me {
    color: #ccc;
}

In the above example, we selected an element with the class me and set the color property to a hexadecimal value, #ccc. Now look at the following code:

.me {
    color: #ccc;
    color: #000;
}

Because we set the color property again to #000 after setting it to #ccc, the second value is used and the color of the text inside the element will be #000.

Ok, let’s do something strange. I want to use an invalid CSS function to change the value:

.me {
    color: #ccc;
    color: boo(1);
}

Because boo(1) is not a valid CSS function, browsers do not replace the color value; they keep #ccc as the value of the color property. We can use this browser behavior to do something cool, as you’ll see.

rgba()

Going back to conditional comments: web developers use them to provide better browser compatibility, especially for CSS3. In newer technologies like CSS3, many functions and features are unavailable in older versions of IE. For example, consider the rgba() function, a useful function for setting a color with alpha transparency.

[Chart: rgba() browser support]

As you can see in the above chart from CanIUse.com, we can’t use rgba() in IE 8 and older versions. What can we do to support these legacy browsers? The first, commonly used approach is conditional comments: include a CSS file for modern browsers, then include another CSS file inside a conditional comment for the older versions of IE, like this:

<link href="modern.css" rel="stylesheet" />

<!--[if lte IE 8]>
    <link href="ie8only.css" rel="stylesheet">
<![endif]-->

In modern.css we have:

.me {
    color: rgba(0, 0, 0, 0.5);
}

And in ie8only.css we have:

.me {
    color: #ccc;
}

When a user visits with Internet Explorer 8 or older, the browser loads the ie8only.css file and the color property is set to #ccc.

CSS Fallback Properties

However, we can do this better using CSS fallback properties within a single CSS file, like this:

.me {
    color: #ccc;
    color: rgba(0, 0, 0, 0.5);
}

You can probably guess what happens with the above code. Setting the color property to #ccc works everywhere, because it’s a valid value in all browsers. On the next line we use the rgba() function. In modern browsers it’s a valid function, so it works without any problem and the browser uses the second value for the color property. But in IE 8 and older versions, it’s an invalid value, so the browser ignores it and keeps the first value, #ccc.

What we’ve done here is use the CSS fallback properties technique: when a function or value is invalid, the browser keeps the last valid value it has for that property. With this technique, you don’t need to create two separate files or write confusing conditions in your HTML files. Your application also doesn’t need to send two separate HTTP requests, one for the modern CSS file and another for the IE fix file.

You can use this technique in many situations and I believe it’s a better approach than conditional comments.
