
Hadoop Big Data

This week I got the chance to study big data and Hadoop, and I've written this short document about it.

What is big data?

Big data is an amount of data so large that it becomes difficult to store and process on a single server.

It can come from the user activity of large applications such as social media (Facebook), search engines (Google), or the Internet of Things (sensors, machines, vehicles, etc.).

Today a 1 TB hard drive costs around 50 euros, so storing large amounts of data is cheap; processing it with a SQL database, however, takes a long time. NoSQL databases are a better way to store this information, and there are other frameworks, such as Hadoop, for processing it.

Big Data technologies

There are two classes of technologies for handling big data; they are complementary and usually used together:

  • Operational big data: technologies for storing and processing big data, such as NoSQL databases
  • Analytical big data: systems for massively parallel processing, such as Hadoop

Apache Hadoop

Hadoop is an open-source Apache framework inspired by Google's MapReduce and GFS papers. In the usual client-server approach, a single client gets the information from the database.

Hadoop instead splits the task into small parts, assigns each part to a computer connected to the network, and combines all the partial results afterwards. This is the MapReduce algorithm.

This means that Hadoop has the following components:

  • its own distributed file system, called HDFS: it's made of a master NameNode which manages many slave DataNodes. A file is divided into blocks stored on a set of DataNodes, and the NameNode keeps the mapping between files and blocks (see the sketch after this list)
  • its own common libraries
  • the resource-management framework called YARN
  • the MapReduce algorithm: made of a Map task, which converts the input into a set of key/value pairs, and a Reduce task, which combines those pairs into a smaller set of tuples. MapReduce has a single master called JobTracker, responsible for resource management and job scheduling, and slave TaskTrackers, which execute the tasks (in Hadoop 2, YARN takes over these duties with a ResourceManager and NodeManagers)
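
As a concrete illustration of the HDFS part, here is a minimal sketch that talks to HDFS through the Java FileSystem API (it assumes a reachable cluster configured in core-site.xml, and the file path is just a placeholder): metadata operations such as listStatus are answered by the NameNode, while the file content is streamed from the DataNodes that hold its blocks.

package com.michele.rizzi.hadoop;

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {

    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Metadata call: answered by the NameNode, which holds the file-to-block mapping.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "  blockSize=" + status.getBlockSize());
        }

        // Reading a file: the client streams the blocks from the DataNodes.
        Path file = new Path("/tmp/example.txt");  // placeholder path
        if (fs.exists(file)) {
            try (BufferedReader reader =
                    new BufferedReader(new InputStreamReader(fs.open(file)))) {
                System.out.println(reader.readLine());
            }
        }
        fs.close();
    }
}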

Hadoop works as follows:

  • the user submits a job to Hadoop, specifying the input/output locations, a MapReduce implementation, and the job configuration
  • the JobTracker takes responsibility for the job and distributes it to the slaves
  • each TaskTracker executes the tasks assigned to it

Here is a guide on how to install Hadoop

Here is a list of the Hadoop file system commands

I tested a couple of commands, mkdir and ls, and they work fine.
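
For example, something like the following should work (the directory names are just placeholders):

hdfs dfs -mkdir -p /user/michele/input
hdfs dfs -ls /user/michele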

Finally, I've written a simple MapReduce example, the classic word count:


package com.michele.rizzi.hadoop;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits a (word, 1) pair for every token in the input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The reducer is also used as a combiner to pre-aggregate counts on the map side.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The mapper writes each word into the context with key=word and value=1.

If the input text file is like the one below

Michele Michele Rizzi

the context at the end of the map phase will contain

key=Michele, value=1
key=Michele, value=1
key=Rizzi, value=1

The framework automatically groups the values with the same key before calling the reducer, so the reducer's input parameters will be

key=Michele, values={1, 1}
key=Rizzi, values={1}

and the final output of the job will be Michele 2 and Rizzi 1.
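
Once the class is compiled against the Hadoop client libraries and packaged into a jar, the job can be submitted to the cluster. A sketch of the commands, where the jar name and HDFS paths are just placeholders and the input files are assumed to be already uploaded:

hadoop jar wc.jar com.michele.rizzi.hadoop.WordCount /user/michele/input /user/michele/output
hdfs dfs -cat /user/michele/output/part-r-00000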


Differences between Flink and Hadoop

MapReduce is a model where each transformation typically takes two phases: a map phase and a reduce phase. Between these phases, the intermediate results are spilled to disk. If you have more than one transformation, the jobs stack up, and every step pays the cost of a full write to and read from disk, as in the sketch below.
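
A minimal sketch of that stacking, assuming hypothetical HDFS paths: both jobs below just rely on Hadoop's default identity mapper and reducer to keep the example short, but the point is that the first job must write its whole output to HDFS before the second one can start reading it.

package com.michele.rizzi.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStepJob {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // First transformation: its full output is spilled to HDFS...
        Job first = Job.getInstance(conf, "step 1");
        first.setJarByClass(TwoStepJob.class);  // identity map/reduce by default
        FileInputFormat.addInputPath(first, new Path("/data/input"));           // placeholder
        FileOutputFormat.setOutputPath(first, new Path("/data/intermediate"));  // written to disk
        if (!first.waitForCompletion(true)) {
            System.exit(1);
        }

        // ...and only then can the second transformation read it back from disk.
        Job second = Job.getInstance(conf, "step 2");
        second.setJarByClass(TwoStepJob.class);
        FileInputFormat.addInputPath(second, new Path("/data/intermediate"));
        FileOutputFormat.setOutputPath(second, new Path("/data/output"));       // placeholder
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}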

One of the limitations of Hadoop MapReduce is that the framework works in batch mode only: a job has to read its whole input and write its whole output before the next step can begin.

MapReduce may be a legacy system for many companies that started their big data journey when Hadoop first came out, and the framework has been the workhorse of large data projects. One of the significant challenges of systems built on it is their high-latency, batch-mode response. Even though MapReduce has evolved over the last ten years, some developers complain that it is difficult to maintain because of inherent inefficiencies in its design and code.

Flink, instead, tries to combine all the transformations into a single job. When the job is started, the transformations are executed inside it and the intermediate results are kept in memory. One of the highlights of Flink is its own memory management, which reduces garbage-collection pauses and prevents memory overflows.

Flink provides a true data-streaming platform built on a high-performance dataflow architecture. It is also a strong tool for batch processing, since Flink handles batch as a special case of streaming: it processes streaming data in real time by pipelining data elements as soon as they arrive.
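
To make the contrast concrete, here is a minimal sketch of the same word count written as a single Flink job (it assumes the Flink 1.x DataSet API is on the classpath, and the input is hard-coded just for the example): all the transformations are chained into one job, and the intermediate (word, 1) pairs stay in memory instead of being spilled to disk.

package com.michele.rizzi.flink;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class FlinkWordCount {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Hard-coded input, just for the example.
        DataSet<String> text = env.fromElements("Michele Michele Rizzi");

        DataSet<Tuple2<String, Integer>> counts = text
                // "map" step: emit a (word, 1) pair for every token
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.split("\\s+")) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                })
                // "reduce" step: group by the word and sum the counts,
                // chained in the same job with the results kept in memory
                .groupBy(0)
                .sum(1);

        counts.print();  // prints (Michele,2) and (Rizzi,1)
    }
}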


The source code of the application is available here.
