Monday 16 May 2016

DISTRIBUTED CACHE IN HADOOP

Introduction:

In the Using third party jars and files in your MapReduce application (Distributed cache) entry I blogged about how to use the Distributed Cache in Hadoop through the DistributedCache API. You also have the option of shipping files from the command line (a sketch of that is shown below). This post walks through the steps for using the Distributed Cache programmatically: first change your MapReduce Driver class to call job.addCacheFile(), then read the cached file from your Mapper.

The Hadoop MapReduce framework provides this facility through something called the DistributedCache.
The Distributed Cache is configured through the job configuration, and it makes read-only data available to every machine in the cluster.
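
As an aside, if your driver parses generic options via ToolRunner/GenericOptionsParser, the same file can also be shipped from the command line with Hadoop's generic -files option instead of calling the API yourself. A rough example (the jar name, class name, and input/output paths are placeholders):

 # hadoop jar wordcount.jar WordCountDriver -files /tmp/file1 /input /output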

Step 1: Put the file into HDFS

In order to use a file with the DistributedCache API, it has to be available at an hdfs:// or http:// URL that is accessible to all the cluster members. So the first step is to upload the file you are interested in to HDFS; in my case I used the following command to copy /tmp/file1 into HDFS as /cachefile1.


 # hdfs dfs -put /tmp/file1 /cachefile1  
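
To confirm that the upload worked, you can list the file in HDFS:

 # hdfs dfs -ls /cachefile1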

Step 2: Add the cache file to the Job Configuration

The next step is to change the Driver class and add a job.addCacheFile(new URI("/cachefile1")); call. This call takes the HDFS URI of the file that you just uploaded and registers it with the distributed cache.

 Configuration conf = new Configuration();
 Job job = Job.getInstance(conf, "wordcount");
 // Register the HDFS file with the distributed cache
 job.addCacheFile(new URI("/cachefile1"));
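
For context, here is a minimal driver sketch (not from the original post) showing where that call fits in a word count job; the WordCountMapper and WordCountReducer class names and the input/output arguments are placeholders for your own job:

 import java.net.URI;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 public class WordCountDriver {
   public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();
     Job job = Job.getInstance(conf, "wordcount");
     job.setJarByClass(WordCountDriver.class);
     job.setMapperClass(WordCountMapper.class);    // placeholder mapper class
     job.setReducerClass(WordCountReducer.class);  // placeholder reducer class
     job.setOutputKeyClass(Text.class);
     job.setOutputValueClass(IntWritable.class);
     // Register the HDFS file with the distributed cache
     job.addCacheFile(new URI("/cachefile1"));
     FileInputFormat.addInputPath(job, new Path(args[0]));
     FileOutputFormat.setOutputPath(job, new Path(args[1]));
     System.exit(job.waitForCompletion(true) ? 0 : 1);
   }
 }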

Step 3: Access the cached file

Now in your Mapper class you can read the cached file using the normal Java file API:

 // Local paths of the files registered with the distributed cache
 Path[] cacheFiles = context.getLocalCacheFiles();
 FileInputStream fileStream = new FileInputStream(cacheFiles[0].toString());
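
To put Step 3 in context, here is a minimal Mapper sketch (not from the original post) that loads the cached file in setup(); using it as a lookup list to filter words in map() is purely illustrative:

 import java.io.BufferedReader;
 import java.io.FileReader;
 import java.io.IOException;
 import java.util.HashSet;
 import java.util.Set;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;

 public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

   private final Set<String> lookup = new HashSet<String>();

   @Override
   protected void setup(Context context) throws IOException, InterruptedException {
     // Local paths of the files registered with job.addCacheFile()
     Path[] cacheFiles = context.getLocalCacheFiles();
     if (cacheFiles != null && cacheFiles.length > 0) {
       BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
       try {
         String line;
         while ((line = reader.readLine()) != null) {
           lookup.add(line.trim());  // hypothetical lookup list loaded from the cached file
         }
       } finally {
         reader.close();
       }
     }
   }

   @Override
   protected void map(LongWritable key, Text value, Context context)
       throws IOException, InterruptedException {
     // Illustrative use of the cached data: only count words present in the lookup list
     for (String word : value.toString().split("\\s+")) {
       if (lookup.contains(word)) {
         context.write(new Text(word), new IntWritable(1));
       }
     }
   }
 }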


You may also like reading: Hadoop-1.0.4.tar.gz Directory Structure

If you know anyone who has started learning Hadoop and Java, why not help them out! Just share this post with them. Thanks for studying today!
