Sunday 24 April 2016


HDFS Commands Reference

In this article, the basic syntax of the Hadoop file system (HDFS) is explained with examples and screenshots. This is very useful for beginners who are interested in exploring the big data world, and HDFS is the gateway to that world.

Hadoop is open-source software (a Java framework) that runs on a cluster of commodity hardware machines. It provides both storage (HDFS) and processing (MapReduce) in a distributed manner. It is capable of processing huge volumes of data, ranging from gigabytes to petabytes.

HDFS Commands

hdfs dfs

Screenshot:
hadoop fs:

Screenshot:
hdfs dfs/hadoop fs
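
Note: both command prefixes are interchangeable for file system operations, so every command below can be written either way. For example (the path is illustrative):

- hdfs dfs -ls /user/root
- hadoop fs -ls /user/root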

1. Creating a directory in HDFS

Syntax:

- hdfs dfs -mkdir <directory name along with path details>

Example:

- hdfs dfs -mkdir /user/root/hadoop_mahendhar

Screenshot:
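
A handy extension, assuming the nested directories do not exist yet: the -p flag creates any missing parent directories in one go (the path below is illustrative).

- hdfs dfs -mkdir -p /user/root/hadoop_mahendhar/input/2016
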
2. Listing the contents of a Hadoop directory

Syntax:

- hdfs dfs -ls <absolute path of the file or directory>

Example:

- hdfs dfs -ls /user/root/hadoop_mahendhar

Screenshot:
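
For reference, listing the parent directory after step 1 produces output of roughly this shape (the timestamp is illustrative):

- hdfs dfs -ls /user/root
Found 1 items
drwxr-xr-x   - root hdfs          0 2016-04-24 10:12 /user/root/hadoop_mahendhar
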
3. Create a file in the local file system and put it in HDFS

Create a file in the local file system with vi <file_name>, add some text, then save and exit.

Example: vi First_hadoop.txt

Putting the local file into the Hadoop file system:

Syntax:
- hdfs dfs -put <local file path with file name> <Hadoop destination path>

Example:
- hdfs dfs -put First_hadoop.txt /user/root/hadoop_mahendhar

Screenshot:
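
The whole step can also be scripted from the shell; the file contents here are illustrative.

- echo "my first hdfs file" > First_hadoop.txt
- hdfs dfs -put First_hadoop.txt /user/root/hadoop_mahendhar
- hdfs dfs -ls /user/root/hadoop_mahendhar
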
4. Moving a local file to the Hadoop file system

Syntax: 

- hdfs dfs -moveFromLocal <local file path with file name> <Hadoop destination path>

Example:

- hdfs dfs -moveFromLocal /root/Second_hadoop.txt /user/root/hadoop_mahendhar

Screenshot:

Note:

1. Before executing the above command, ensure that the Second_hadoop.txt file has been created in the local file system.

2. This operation moves the local file, so no local copy of the file exists after it completes.
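
A quick way to confirm the move is to look for the original file; the exact wording of the error varies by shell/OS:

- ls /root/Second_hadoop.txt
ls: cannot access /root/Second_hadoop.txt: No such file or directory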

5. Listing all directories and sub-directories recursively

Syntax:

- hdfs dfs -lsr <Hadoop directory>

Example:

- hdfs dfs -lsr /user/root/hadoop_mahendhar/

- Note: Create more directories and sub-directories to validate this command properly.
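
On Hadoop 2.x releases -lsr still works but is deprecated; the equivalent modern form is:

- hdfs dfs -ls -R /user/root/hadoop_mahendhar/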

6. Checking the size of files in HDFS

Syntax: 

- hdfs dfs -du <file path with file name>

Example:

- hdfs dfs -du /user/root/hadoop_mahendhar/

Screenshot:
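
Two useful variations: -h prints human-readable sizes and -s prints a single summary line for the whole directory.

- hdfs dfs -du -h /user/root/hadoop_mahendhar/
- hdfs dfs -du -s /user/root/hadoop_mahendhar/
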
7. Downloading a file from HDFS to the local file system

Syntax: 

- hdfs dfs -get <Hadoop file path with file name> <local destination path>

Example:

- hdfs dfs -get /user/root/hadoop_mahendhar/Second_hadoop.txt /root/local_files/

Screenshot:
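
To check the downloaded copy (its contents are whatever you typed into the file earlier):

- cat /root/local_files/Second_hadoop.txt
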
8. Getting a directory of files from HDFS and merging them into a single local file

Syntax: 

- hdfs dfs -getmerge [-nl] <HDFS source directory> <local file path with file name>

Example:

- hdfs dfs -getmerge /user/root/hadoop_mahendhar/ /root/local_files/hadoop_merge_file.txt

Screenshot:

Note:

1. The -nl option is optional; it simply adds a newline at the end of each merged file.

2. Before this, make sure you have created 2-3 files in HDFS so that you can compare the merged local file against them for content and size.
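
For example, with the newline option, followed by a quick check of the result:

- hdfs dfs -getmerge -nl /user/root/hadoop_mahendhar/ /root/local_files/hadoop_merge_file.txt
- cat /root/local_files/hadoop_merge_file.txt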

9. Copying data from one cluster/node to another in HDFS

Syntax:

- hadoop distcp <source path> <destination path>

Note: DistCp is a separate MapReduce-based tool invoked as hadoop distcp, not as an hdfs dfs sub-command.
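
A typical inter-cluster invocation looks like this (the NameNode host names and port are illustrative):

- hadoop distcp hdfs://namenode1:8020/user/root/hadoop_mahendhar hdfs://namenode2:8020/user/root/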

10. Displaying the contents of a file

Syntax: 

- hdfs dfs -cat <file path with file name>

Example:

- hdfs dfs -cat /user/root/hadoop_mahendhar/First_hadoop.txt

Screenshot:
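
For large files, -tail prints only the last kilobyte, which is handy for a quick peek:

- hdfs dfs -tail /user/root/hadoop_mahendhar/First_hadoop.txt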

11. Changing the group, owner, or permissions of a file or directory

Syntax: 

- hdfs dfs -chgrp [-R] <new group name> <file or directory>

- hdfs dfs -chmod [-R] <mode> <file or directory>

- hdfs dfs -chown [-R] <new owner>[:<new group>] <file or directory>
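
For example (the group and owner names are illustrative; -R applies the change recursively):

- hdfs dfs -chgrp hadoop /user/root/hadoop_mahendhar/First_hadoop.txt
- hdfs dfs -chmod 644 /user/root/hadoop_mahendhar/First_hadoop.txt
- hdfs dfs -chown -R root /user/root/hadoop_mahendhar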

12. Copying and Moving files within HDFS 

Syntax:

- hdfs dfs -cp <source file path> <destination path>

- hdfs dfs -mv <source file path> <destination path>
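
For example (the backup directory is illustrative and is created first):

- hdfs dfs -mkdir /user/root/backup
- hdfs dfs -cp /user/root/hadoop_mahendhar/First_hadoop.txt /user/root/backup/
- hdfs dfs -mv /user/root/hadoop_mahendhar/Second_hadoop.txt /user/root/backup/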

13. Emptying the Hadoop trash

Syntax: 

- hdfs dfs -expunge

Screenshot:
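
Note that files removed with -rm land in the trash only when trash is enabled (fs.trash.interval > 0 in core-site.xml); -expunge then permanently deletes trash checkpoints older than the retention interval.

- hdfs dfs -rm /user/root/hadoop_mahendhar/First_hadoop.txt
- hdfs dfs -expunge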

Apache Hadoop Terms/Abbreviations click here 

Saturday 23 April 2016

Apache Hadoop Abbreviations/Terms

Hadoop Terms
HDFS - Hadoop Distributed File System
GFS - Google File System
NN - NameNode
DN - DataNode
SNN - Secondary NameNode
JT - Job Tracker
TT - Task Tracker
HA NN - Highly Available NameNode (or NN HA - NameNode Highly Available)
REST - Representational State Transfer
HiveQL - Hive Query Language
HAR - Hadoop Archive
ORC - Optimized Row Columnar
JSON - JavaScript Object Notation
CDH - Cloudera’s Distribution Including Apache Hadoop
ZKFC - ZooKeeper Failover Controller
FUSE - Filesystem In Userspace
YARN - Yet Another Resource Negotiator
Amazon EC2 - Amazon Elastic Compute Cloud
Amazon S3 - Amazon Simple Storage Service
WASB - Windows Azure Storage Blobs
EMR - Elastic MapReduce
JAR - Java ARchive
RPC - Remote Procedure Call
UDFs - User-Defined Functions
ETL - Extract/Transform/Load 
Hadoop-1.0.4.tar.gz Directory Structure click here

Hadoop versions

This article will help you understand all of the Apache Hadoop versions released to date.


 Hadoop 2.7.2 (released on 25 January, 2016) 2.X.X -  current stable version
 Hadoop 2.7.1 (released on 06 July, 2015)
 Hadoop 2.7.0 (released on 21 April 2015)

 Hadoop 2.6.4 (released on 11 February, 2016)
 Hadoop 2.6.3 (released on 17 December, 2015)
 Hadoop 2.6.2 (released on 28 October, 2015)
 Hadoop 2.6.1 (released on 23 September, 2015)
 Hadoop 2.6.0 (released on 18 November, 2014)

 Hadoop 2.5.2 (released on 19 November, 2014)
 Hadoop 2.5.1 (released on 12 September, 2014)
 Hadoop 2.5.0 (released on 11 August, 2014)

 Hadoop 2.4.1 (released on 30 June, 2014)
 Hadoop 2.4.0 (released on 07 April, 2014)

 Hadoop 2.3.0 (released on 20 February, 2014)

 Hadoop 2.2.0 (released on 15 October, 2013)

 Hadoop 2.1.1 (released on 23 September, 2013) 2.X.X - beta version
 Hadoop 2.1.0 (released on 25 August, 2013)

 Hadoop 2.0.6 (released on 23 August, 2013) 2.X.X -  alpha version
 Hadoop 2.0.5 (released on 6 June, 2013)
 Hadoop 2.0.4 (released on 25 April, 2013)

 Hadoop 2.0.3-alpha (released on 14 February, 2013)        
 Hadoop 2.0.2-alpha (released on 9 October, 2012)
 Hadoop 2.0.1-alpha (released on 26 July, 2012)
 Hadoop 2.0.0-alpha (released on 23 May, 2012)


 Hadoop 1.2.1 (released on 1 Aug, 2013)                         1.2.1 - Stable version
 Hadoop 1.2.0 (released on 13 May, 2013)
 Hadoop 1.1.2 (released on 15 February, 2013)                1.1.X -  beta version
 Hadoop 1.1.1 (released on 1 December, 2012)
 Hadoop 1.1.0 (released on 13 October, 2012)
 Hadoop 1.0.4 (released on 12 October, 2012)                 1.0.X -  stable version
 Hadoop 1.0.3 (released on 16 May, 2012)
 Hadoop 1.0.2 (released on 3 Apr, 2012)
 Hadoop 1.0.1 (released on 10 Mar, 2012)
 Hadoop 1.0.0 (released on 27 December, 2011)

 Hadoop 0.23.11(released on 27 June, 2014)
 Hadoop 0.23.10(released on  11 December, 2013)
 Hadoop 0.23.9 (released on 8 July, 2013)
 Hadoop 0.23.8 (released on 5 June, 2013)
 Hadoop 0.23.7 (released on  18 April, 2013)
 Hadoop 0.23.6 (released on 7 February, 2013)           0.23.X - similar to 2.X.X but missing NN HA
 Hadoop 0.23.5 (released on 28 November, 2012)
 Hadoop 0.23.4 (released on 15 October, 2012)
 Hadoop 0.23.3 (released on 17 September, 2012)
 Hadoop 0.23.1 (released on 27 Feb, 2012)
 Hadoop 0.22.0 (released on 10 December, 2011)        0.22.X - does not include security
 Hadoop 0.23.0 (released on 11 Nov, 2011)
 Hadoop 0.20.205.0 (released on 17 Oct, 2011)
 Hadoop 0.20.204.0 (released on 5 Sep, 2011)
 Hadoop 0.20.203.0 (released on 11 May, 2011)          0.20.203.X - old legacy stable version
 Hadoop 0.21.0 (released on 23 August, 2010)
 Hadoop 0.20.2 (released on 26 February, 2010)         0.20.X - old legacy version
 Hadoop 0.20.1 (released on 14 September, 2009)
 Hadoop 0.19.2 (released on 23 July, 2009)
 Hadoop 0.20.0 (released on 22 April, 2009)
 Hadoop 0.19.1 (released on 24 February, 2009)
 Hadoop 0.18.3 (released on 29 January, 2009)
 Hadoop 0.19.0 (released on 21 November, 2008)
 Hadoop 0.18.2 (released on 3 November, 2008)
 Hadoop 0.18.1 (released on 17 September, 2008)
 Hadoop 0.18.0 (released on 22 August, 2008)
 Hadoop 0.17.2 (released on 19 August, 2008)
 Hadoop 0.17.1 (released on 23 June, 2008)
 Hadoop 0.17.0 (released on 20 May, 2008)
 Hadoop 0.16.4 (released on 5 May, 2008)
 Hadoop 0.16.3 (released on 16 April, 2008)
 Hadoop 0.16.2 (released on 2 April, 2008)
 Hadoop 0.16.1 (released on 13 March, 2008)
 Hadoop 0.16.0 (released on 7 February, 2008)
 Hadoop 0.15.3 (released on 18 January, 2008)
 Hadoop 0.15.2 (released on 2 January, 2008)
 Hadoop 0.15.1 (released on 27 November, 2007)
 Hadoop 0.14.4 (released on 26 November, 2007)
 Hadoop 0.15.0 (released on 29 October 2007)
 Hadoop 0.14.3 (released on 19 October, 2007)
 Hadoop 0.14.1 (released on 4 September, 2007)


Data Types in Hadoop click here
Hadoop-1.0.4.tar.gz Directory Structure click here


Thursday 21 April 2016

How to use the UNIX tool AWK

This article will help you understand how to work with the AWK utility in UNIX. It also gives the meaning of some of the AWK built-in variables.


These AWK one-liners give basic examples that will help you understand the fundamentals of this UNIX tool.

Meaning of some of the Awk Built-in Variables used below:

NF       : Number of fields in the current line/record
NR       : Ordinal number of the current line/record
FS       : Field Separator (the -F option can also be used)
OFS      : Output Field Separator (default = blank)
FILENAME : Name of the current input file

All of the following AWK one-liner examples are based on the input file 'test1.txt'.

$ cat test1.txt
Continent:Val
AS:12000
AF:9800
AS:12300
NA:3400
OC:12000
AF:500
AS:1000
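
A quick sketch tying these built-in variables together against test1.txt (output shown below the command):

awk 'BEGIN{FS=":"; OFS="|"} {print FILENAME, NR, NF, $1}' test1.txt

test1.txt|1|2|Continent
test1.txt|2|2|AS
test1.txt|3|2|AF
test1.txt|4|2|AS
test1.txt|5|2|NA
test1.txt|6|2|OC
test1.txt|7|2|AF
test1.txt|8|2|AS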

Know more about UNIX FILE PERMISSIONS: Click here


Scenario
Print 'line number' NR and 'Number of fields' NF for each line
Command
awk -F ":" '{print NR,NF}' test1.txt
Output
1 2
2 2
3 2
4 2
5 2
6 2
7 2
8 2


Scenario
Print first field, colon delimited
Command
awk -F ":" '{print $1}' test1.txt

Output
Continent
AS
AF
AS
NA
OC
AF
AS


Scenario
Print first field, colon delimited, but excluding the 'first line' (NR!=1)
Command
awk -F ":" 'NR!=1 {print $1}' test1.txt
Output
AS
AF
AS
NA
OC
AF
AS


Scenario
Print first field, colon delimited, but only for line number 1 (NR==1)
Command
awk -F ":" 'NR==1 {print $1}' test1.txt
Output
Continent


Scenario
Print first and second field, colon delimited, but excluding the 'first line' (NR!=1)
Command
awk -F ":" 'NR!=1 {print $1,$2}' test1.txt
Output
AS 12000
AF 9800
AS 12300
NA 3400
OC 12000
AF 500
AS 1000


Scenario
Setting output field separator as pipe
Command
awk -F ":" 'BEGIN{OFS="|"} NR!=1 {print $1,$2}' test1.txt
Output
AS|12000
AF|9800
AS|12300
NA|3400
OC|12000
AF|500
AS|1000


Scenario
Anything on BEGIN executes first
Command
awk 'BEGIN{FS=":"; OFS="|"; print "Con|SomeVal"} NR!=1 {print $1,$2}' test1.txt
Output
Con|SomeVal
AS|12000
AF|9800
AS|12300
NA|3400
OC|12000
AF|500
AS|1000


Scenario
Printing FILENAME; it is printed once for every input line
Command
awk -F ":" '{print FILENAME}' test1.txt
Output
test1.txt
test1.txt
test1.txt
test1.txt
test1.txt
test1.txt
test1.txt
test1.txt



Scenario
Printing FILENAME, but only its last instance, using the END clause
Command
awk -F ":" ' END {print FILENAME}' test1.txt
Output
test1.txt


Scenario
Printing the last field of each line; same as printing $2, since there are only 2 fields
Command
awk -F ":" '{print $NF}' test1.txt
Output
Val
12000
9800
12300
3400
12000
500
1000


Scenario
Matching: printing lines beginning with "AS"
Command
awk -F ":" '/^AS/' test1.txt
Output
AS:12000
AS:12300
AS:1000


Scenario
Matching: printing lines not beginning with "AS"
Command
awk -F ":" '!/^AS/' test1.txt
Output
Continent:Val
AF:9800
NA:3400
OC:12000
AF:500


Scenario
Direct matching, first field as "AS"
Command
awk -F ":" '$1=="AS"' test1.txt
Output
AS:12000
AS:12300
AS:1000


Scenario
Direct matching, first field as "AS", Print 2nd Column
Command
awk -F ":" '$1=="AS" {print $2}' test1.txt
Output
12000
12300
1000


Scenario
$0 prints the full line, same as {print}
Command 1
awk -F ":" '$1=="AS" {print $0}' test1.txt
Output
AS:12000
AS:12300
AS:1000

Command 2
awk -F ":" '$1=="AS" {print}' test1.txt
Output
AS:12000
AS:12300
AS:1000


Scenario
Using OR and AND together
Command
awk -F ":" '($1=="AS" || $1=="OC") && $NF > 11000 {print}' test1.txt
Output
AS:12000
AS:12300
OC:12000


Scenario
Partial Matching
Command
awk -F ":" '$1 ~ /A/ {print}' test1.txt
Output
AS:12000
AF:9800
AS:12300
NA:3400
AF:500
AS:1000


Scenario
Reading from STDIN (piped input)
Command
cat test1.txt | awk -F ":" '!/Continent/ {print $1}' | sort | uniq
Output
AF
AS
NA
OC


Scenario
Add 1000 to the 2nd field where the first field is "AF", then print all lines
Command
awk -F ":" '$1=="AF" {$2+=1000} {print}' test1.txt
Output
Continent:Val
AS:12000
AF 10800
AS:12300
NA:3400
OC:12000
AF 1500
AS:1000
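
Note the two modified lines are printed with a space instead of a colon: assigning to $2 makes awk rebuild the record using OFS, which defaults to a single space. Setting OFS=":" in a BEGIN block preserves the colons:

awk -F ":" 'BEGIN{OFS=":"} $1=="AF" {$2+=1000} {print}' test1.txt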


Scenario
Sum of the 2nd fields, excluding the first line
Command
awk -F ":" 'NR!=1 {sum+=$NF} END {print sum}' test1.txt
Output
51000
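
A small extension, computing the average as well (51000/7 = 7285.71):

awk -F ":" 'NR!=1 {sum+=$NF; n++} END {print sum, sum/n}' test1.txt

Output:
51000 7285.71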


Scenario
Set the 2nd field to 0 where the first field is "AS"
Command
awk -F ":" 'BEGIN {OFS=":"} $1=="AS" {$2=0} {print}' test1.txt
Output
Continent:Val
AS:0
AF:9800
AS:0
NA:3400
OC:12000
AF:500
AS:0