Introduction:
In the Using third party jars and files in your MapReduce application (Distributed Cache) entry, I blogged about how to use the Distributed Cache in Hadoop. You can use it either through a command line option or programmatically through the DistributedCache API; the following steps show the programmatic route. To use it, first change your MapReduce Driver class to register the file you want to distribute.
The Hadoop MapReduce project provides this facility through a feature called the DistributedCache. The Distributed Cache is configured through the Job Configuration, and it makes read-only data available to every machine in the cluster.
Step 1: Put the file into HDFS
To use a file with the DistributedCache API, it has to be available at an hdfs:// or http:// URL that is accessible to all the cluster members. So the first step is to upload the file you are interested in into HDFS; in my case I used the following command to copy the /tmp/file1 file to HDFS.
# hdfs dfs -put /tmp/file1 /cachefile1
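If you prefer to do this step from Java instead of the shell, the upload can also be done with the HDFS FileSystem API. Here is a minimal sketch, reusing the /tmp/file1 and /cachefile1 paths from the command above (the class name UploadCacheFile is just for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadCacheFile {
    public static void main(String[] args) throws Exception {
        // Copy the local file into HDFS so every node in the cluster can read it
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/tmp/file1"), new Path("/cachefile1"));
    }
}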
Step 2: Add the cache file to the Job Configuration
The next step is to change the Driver class and register the file with the Distributed Cache. The call takes the HDFS URL of the file you just uploaded and passes it to the DistributedCache class.
// Driver code: register the HDFS file with the Distributed Cache
// (requires java.net.URI and org.apache.hadoop.filecache.DistributedCache)
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/cachefile1"), job.getConfiguration());
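Note that DistributedCache.addCacheFile() is the Hadoop 1.x style; on Hadoop 2.x and later the same registration is usually done on the Job itself. A minimal sketch of that variant, as a drop-in replacement for the three driver lines above:

// Hadoop 2.x+ driver fragment: Job.getInstance() supersedes the deprecated
// Job constructor, and job.addCacheFile() supersedes the DistributedCache helper.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "wordcount");
job.addCacheFile(new URI("/cachefile1"));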
Step 3: Access the cached file
Now, in your Mapper class, you can read file1 using the normal Java file API.
// In the Mapper: look up the local copies of the cached files on this node
Path[] cacheFiles = context.getLocalCacheFiles();
FileInputStream fileStream = new FileInputStream(cacheFiles[0].toString());
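For completeness, here is a minimal Mapper sketch that reads the cached file once in setup(); the class name WordCountMapper and the filtering logic in map() are illustrative assumptions, not part of the original example.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Set<String> cachedWords = new HashSet<String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // getLocalCacheFiles() returns the local paths the framework copied to this node
        Path[] cacheFiles = context.getLocalCacheFiles();
        if (cacheFiles != null && cacheFiles.length > 0) {
            BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    cachedWords.add(line.trim());
                }
            } finally {
                reader.close();
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Illustrative use of the cached data: emit only the words that
        // also appear in the cached file.
        for (String word : value.toString().split("\\s+")) {
            if (cachedWords.contains(word)) {
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }
}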
You may also like reading Hadoop-1.0.4.tar.gz Directory Structure.
If you know anyone who has started learning Hadoop and Java, why not help them out! Just share this post with them. Thanks for studying today!