Sunday, 24 April 2016

Hortonworks or Cloudera Sandbox links

HDFS Commands Reference

This article explains the basic syntax of the Hadoop file system (HDFS) commands with examples. It is aimed at beginners who want to explore the big data world, to which HDFS is the gateway.

Hadoop is open source software (a Java framework) that runs on a cluster of commodity hardware machines. It provides both storage (HDFS) and processing (MapReduce) in a distributed manner, and it is capable of processing huge volumes of data, ranging from gigabytes to petabytes.

HDFS Commands

hdfs dfs

hadoop fs

hdfs dfs/hadoop fs

1.  Creating a directory [HDFS]

Syntax:

- hdfs dfs -mkdir <directory name with its full path>

Example:

- hdfs dfs -mkdir /user/root/hadoop_mahendhar



2.  Listing the contents of a Hadoop directory

Syntax:

- hdfs dfs -ls <absolute path of the file or directory>

Example:

- hdfs dfs -ls /user/root/hadoop_mahendhar






3.  Create a file in the local file system and put it in HDFS

Create a file in the local file system with vi <file_name>, add some text, then save and exit.

Example: vi First_hadoop.txt

Putting the local file into the Hadoop file system

Syntax:
- hdfs dfs -put <local file path with file name> <Hadoop destination path with file name>

Example:
- hdfs dfs -put First_hadoop.txt /user/root/hadoop_mahendhar
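
The steps above can also be run as one short script. This is a sketch under a few assumptions: the file contents are placeholder text, the -p option of -mkdir is available (it is in Hadoop 2.x), and DRY_RUN is a helper variable introduced here. When no hdfs client is found on the PATH, the script only prints the commands it would run.

```shell
# Dry-run fallback (assumption for illustration): without an hdfs client,
# print each command instead of executing it.
if ! command -v hdfs >/dev/null 2>&1; then
  DRY_RUN=1
  hdfs() { echo "+ hdfs $*"; }
fi

# Create the local file non-interactively; the two lines are placeholder text.
printf 'first line\nsecond line\n' > First_hadoop.txt

# -p creates missing parent directories and does not fail if the directory exists.
hdfs dfs -mkdir -p /user/root/hadoop_mahendhar

# Copy the local file into HDFS; the local copy is kept.
hdfs dfs -put First_hadoop.txt /user/root/hadoop_mahendhar

# Confirm that the file arrived.
hdfs dfs -ls /user/root/hadoop_mahendhar
```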








4. Moving a local file to the Hadoop file system

Syntax:

- hdfs dfs -moveFromLocal <local file path with file name> <Hadoop destination path with file name>

Example:

- hdfs dfs -moveFromLocal /root/Second_hadoop.txt /user/root/hadoop_mahendhar








Note:

1. Before executing the above command, ensure that Second_hadoop.txt has been created in the local file system.

2. This operation moves the file, so no local copy exists after it completes.

5. Listing all directories and sub-directories recursively

Syntax:

- hdfs dfs -ls -R <Hadoop directory>

Example:

- hdfs dfs -ls -R /user/root/hadoop_mahendhar/

- Note: The older form hdfs dfs -lsr is deprecated in Hadoop 2.x in favour of -ls -R. Create a few more directories and sub-directories to see the recursive listing properly.
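
One way to set up such a test is sketched below. The sub-directory names level1, level2 and other are hypothetical, as is the DRY_RUN helper variable; when no hdfs client is found on the PATH, the commands are only printed.

```shell
# Dry-run fallback (assumption for illustration): without an hdfs client,
# print each command instead of executing it.
if ! command -v hdfs >/dev/null 2>&1; then
  DRY_RUN=1
  hdfs() { echo "+ hdfs $*"; }
fi

# Create a small nested tree; the directory names are hypothetical.
hdfs dfs -mkdir -p /user/root/hadoop_mahendhar/level1/level2
hdfs dfs -mkdir -p /user/root/hadoop_mahendhar/other

# List the whole tree recursively.
hdfs dfs -ls -R /user/root/hadoop_mahendhar/
```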

6. Check the size of a file in HDFS

Syntax:

- hdfs dfs -du <file path with file name>

Example:

- hdfs dfs -du /user/root/hadoop_mahendhar/







7. Download a file from HDFS to the local file system

Syntax:

- hdfs dfs -get <HDFS file path with file name> <local file path with file name>

Example:

- hdfs dfs -get /user/root/hadoop_mahendhar/Second_hadoop.txt /root/local_files/






8. Getting a directory of files from HDFS and merging them into a single file in the local file system

Syntax:

- hdfs dfs -getmerge <HDFS directory> <local file path with file name> <add new line>

Example:

- hdfs dfs -getmerge /user/root/hadoop_mahendhar/ /root/local_files/hadoop_merge_file.txt






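Note:

1. The add new line argument is optional; it adds a newline at the end of each file. In recent Hadoop 2.x releases the command help lists it as an -nl option placed before the source path.

2. Before running this, make sure you have created 2-3 files in HDFS so that you can compare the contents and size of the merged local file with the files in HDFS.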
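
The size check from the second note might look like the sketch below. It assumes the HDFS directory holds only plain files (no sub-directories) and that the merge is done without the newline option, so the two byte counts should match. SRC, DEST and DRY_RUN are names introduced for this example; when no hdfs client is found on the PATH, the commands are only printed.

```shell
# Dry-run fallback (assumption for illustration): without an hdfs client,
# print each command instead of executing it.
if ! command -v hdfs >/dev/null 2>&1; then
  DRY_RUN=1
  hdfs() { echo "+ hdfs $*"; }
fi

SRC=/user/root/hadoop_mahendhar
DEST=/root/local_files/hadoop_merge_file.txt

# Merge every file in the HDFS directory into one local file.
hdfs dfs -getmerge "$SRC" "$DEST"

# Total size of the HDFS directory in bytes (-s prints a single summary line).
hdfs dfs -du -s "$SRC"

# Size of the merged local file in bytes, for comparison.
if [ -f "$DEST" ]; then
  wc -c < "$DEST"
fi
```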

9. Copying data between clusters (or between locations on the same cluster)

Syntax:

- hadoop distcp <source path details> <destination path details>

- Note: distcp is run as hadoop distcp; it is not an option of hdfs dfs.

10. Display the contents of a file

Syntax:

- hdfs dfs -cat <file path with file name>

Example:

- hdfs dfs -cat /user/root/hadoop_mahendhar/First_hadoop.txt




11. Change the group, owner, or permissions of a file or directory

Syntax:

- hdfs dfs -chgrp [-R] <new group name> <file or directory>

- hdfs dfs -chmod [-R] <mode> <file or directory>

- hdfs dfs -chown [-R] <new owner name> <file or directory>
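
As an illustration, the three commands might be applied to the directory created earlier as follows. The group name hadoop and the mode 755 are assumptions chosen for the example, and changing the owner normally requires HDFS superuser rights. DRY_RUN is a helper variable introduced here; when no hdfs client is found on the PATH, the commands are only printed.

```shell
# Dry-run fallback (assumption for illustration): without an hdfs client,
# print each command instead of executing it.
if ! command -v hdfs >/dev/null 2>&1; then
  DRY_RUN=1
  hdfs() { echo "+ hdfs $*"; }
fi

TARGET=/user/root/hadoop_mahendhar

# Change the group recursively; "hadoop" is a hypothetical group name.
hdfs dfs -chgrp -R hadoop "$TARGET"

# rwx for the owner, r-x for group and others, applied recursively.
hdfs dfs -chmod -R 755 "$TARGET"

# Change the owner recursively.
hdfs dfs -chown -R root "$TARGET"

# The listing of the parent directory shows the new permissions, owner and group.
hdfs dfs -ls /user/root
```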

12. Copying and moving files within HDFS

Syntax:

- hdfs dfs -cp <source file path> <destination file path>

- hdfs dfs -mv <source file path> <destination file path>

13. Empty the Hadoop trash

Syntax:

- hdfs dfs -expunge





Saturday, 23 April 2016

Apache Hadoop Abbreviations/Terms

Hadoop Terms
HDFS - Hadoop Distributed File System
GFS - Google File System
NN - NameNode
DN - DataNode
SNN - Secondary NameNode
JT - JobTracker
TT - TaskTracker
HA NN - Highly Available NameNode (or NN HA - NameNode High Availability)
REST - Representational State Transfer
HiveQL - Hive Query Language
HAR - Hadoop Archive
ORC - Optimized Row Columnar
JSON - JavaScript Object Notation
CDH - Cloudera’s Distribution Including Apache Hadoop
ZKFC - ZooKeeper Failover Controller
FUSE - Filesystem in Userspace
YARN - Yet Another Resource Negotiator
Amazon EC2 - Amazon Elastic Compute Cloud
Amazon S3 - Amazon Simple Storage Service
WASB - Windows Azure Storage Blobs
EMR - Elastic MapReduce
JAR - Java ARchive
RPC - Remote Procedure Call
UDFs - User-Defined Functions
ETL - Extract/Transform/Load

Hadoop versions

This article lists the Apache Hadoop releases and their release dates.


 Hadoop 2.7.2 (released on 25 January, 2016) 2.X.X -  current stable version
 Hadoop 2.7.1 (released on 06 July, 2015)
 Hadoop 2.7.0 (released on 21 April 2015)

 Hadoop 2.6.4 (released on 11 February, 2016)
 Hadoop 2.6.3 (released on 17 December, 2015)
 Hadoop 2.6.2 (released on 28 October, 2015)
 Hadoop 2.6.1 (released on 23 September, 2015)
 Hadoop 2.6.0 (released on 18 November, 2014)

 Hadoop 2.5.2 (released on 19 November, 2014)
 Hadoop 2.5.1 (released on 12 September, 2014)
 Hadoop 2.5.0 (released on 11 August, 2014)

 Hadoop 2.4.1 (released on 30 June, 2014)
 Hadoop 2.4.0 (released on 07 April, 2014)

 Hadoop 2.3.0 (released on 20 February, 2014)

 Hadoop 2.2.0 (released on 15 October, 2013)

 Hadoop 2.1.1 (released on 23 September, 2013) 2.X.X - beta version
 Hadoop 2.1.0 (released on 25 August, 2013)

 Hadoop 2.0.6 (released on 23 August, 2013) 2.X.X -  alpha version
 Hadoop 2.0.5 (released on 6 June, 2013)
 Hadoop 2.0.4 (released on 25 April, 2013)

 Hadoop 2.0.3-alpha (released on 14 February, 2013)        
 Hadoop 2.0.2-alpha (released on 9 October, 2012)
 Hadoop 2.0.1-alpha (released on 26 July, 2012)
 Hadoop 2.0.0-alpha (released on 23 May, 2012)


 Hadoop 1.2.1 (released on 1 Aug, 2013)                         1.2.1 - Stable version
 Hadoop 1.2.0 (released on 13 May, 2013)
 Hadoop 1.1.2 (released on 15 February, 2013)                1.1.X -  beta version
 Hadoop 1.1.1 (released on 1 December, 2012)
 Hadoop 1.1.0 (released on 13 October, 2012)
 Hadoop 1.0.4 (released on 12 October, 2012)                 1.0.X -  stable version
 Hadoop 1.0.3 (released on 16 May, 2012)
 Hadoop 1.0.2 (released on 3 Apr, 2012)
 Hadoop 1.0.1 (released on 10 Mar, 2012)
 Hadoop 1.0.0 (released on 27 December, 2011)

 Hadoop 0.23.11 (released on 27 June, 2014)
 Hadoop 0.23.10 (released on 11 December, 2013)
 Hadoop 0.23.9 (released on 8 July, 2013)
 Hadoop 0.23.8 (released on 5 June, 2013)
 Hadoop 0.23.7 (released on 18 April, 2013)
 Hadoop 0.23.6 (released on 7 February, 2013)           0.23.X - similar to 2.X.X but missing NN HA
 Hadoop 0.23.5 (released on 28 November, 2012)
 Hadoop 0.23.4 (released on 15 October, 2012)
 Hadoop 0.23.3 (released on 17 September, 2012)
 Hadoop 0.23.1 (released on 27 Feb, 2012)
 Hadoop 0.22.0 (released on 10 December, 2011)        0.22.X - does not include security
 Hadoop 0.23.0 (released on 11 Nov, 2011)
 Hadoop 0.20.205.0 (released on 17 Oct, 2011)
 Hadoop 0.20.204.0 (released on 5 Sep, 2011)
 Hadoop 0.20.203.0 (released on 11 May, 2011)          0.20.203.X - old legacy stable version
 Hadoop 0.21.0 (released on 23 August, 2010)
 Hadoop 0.20.2 (released on 26 February, 2010)         0.20.X - old legacy version
 Hadoop 0.20.1 (released on 14 September, 2009)
 Hadoop 0.19.2 (released on 23 July, 2009)
 Hadoop 0.20.0 (released on 22 April, 2009)
 Hadoop 0.19.1 (released on 24 February, 2009)
 Hadoop 0.18.3 (released on 29 January, 2009)
 Hadoop 0.19.0 (released on 21 November, 2008)
 Hadoop 0.18.2 (released on 3 November, 2008)
 Hadoop 0.18.1 (released on 17 September, 2008)
 Hadoop 0.18.0 (released on 22 August, 2008)
 Hadoop 0.17.2 (released on 19 August, 2008)
 Hadoop 0.17.1 (released on 23 June, 2008)
 Hadoop 0.17.0 (released on 20 May, 2008)
 Hadoop 0.16.4 (released on 5 May, 2008)
 Hadoop 0.16.3 (released on 16 April, 2008)
 Hadoop 0.16.2 (released on 2 April, 2008)
 Hadoop 0.16.1 (released on 13 March, 2008)
 Hadoop 0.16.0 (released on 7 February, 2008)
 Hadoop 0.15.3 (released on 18 January, 2008)
 Hadoop 0.15.2 (released on 2 January, 2008)
 Hadoop 0.15.1 (released on 27 November, 2007)
 Hadoop 0.14.4 (released on 26 November, 2007)
 Hadoop 0.15.0 (released on 29 October 2007)
 Hadoop 0.14.3 (released on 19 October, 2007)
 Hadoop 0.14.1 (released on 4 September, 2007)




Thursday, 21 April 2016

How to use unix tool AWK

This article will help you understand how to work with the AWK utility in UNIX. It also gives the meaning of some of the AWK built-in variables.


These AWK one-liners are basic, assorted examples that will help you understand the fundamentals of this UNIX tool.

Meaning of some of the AWK built-in variables used below:

NF               : Number of fields in current line/record

NR               : Ordinal number of current line/record
FS                : Field Separator (Also -F can be used)
OFS             : Output Field Separator (default=single space)

FILENAME : Name of current input file

All of the following AWK one-liner examples are based on the input file 'test1.txt'

$ cat test1.txt                                                               
Continent:Val
AS:12000
AF:9800                                                               
AS:12300
NA:3400
OC:12000
AF:500
AS:1000
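
To follow along without typing the file in an editor, the sample input can be recreated from the shell. This is a small sketch that writes the eight lines shown above into test1.txt in the current directory and then runs a quick sanity check; the variable name bad is only illustrative.

```shell
# Recreate the sample input used by the examples in this article.
cat > test1.txt <<'EOF'
Continent:Val
AS:12000
AF:9800
AS:12300
NA:3400
OC:12000
AF:500
AS:1000
EOF

# Sanity check: print the number of lines read and how many of them
# do not have exactly two colon-separated fields.
awk -F ":" 'NF != 2 {bad++} END {print NR, bad+0}' test1.txt
```

The check should print 8 0: eight lines read, none with a field count other than two.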



Scenario
Print 'line number' NR and 'Number of fields' NF for each line
Command
awk -F ":" '{print NR,NF}' test1.txt
Output
1 2
2 2
3 2
4 2
5 2
6 2
7 2
8 2


Scenario
Print first field, colon delimited
Command
awk -F ":" '{print $1}' test1.txt
Output
Continent
AS
AF
AS
NA
OC
AF
AS


Scenario
Print first field, colon delimited, but excluding the 'first line' (NR!=1)
Command
awk -F ":" 'NR!=1 {print $1}' test1.txt
Output
AS
AF
AS
NA
OC
AF
AS


Scenario
Print first field, colon delimited, but only for line number 1 (NR==1)
Command
awk -F ":" 'NR==1 {print $1}' test1.txt
Output
Continent


Scenario
Print first and second field, colon delimited, but excluding the 'first line' (NR!=1)
Command
awk -F ":" 'NR!=1 {print $1,$2}' test1.txt
Output
AS 12000
AF 9800
AS 12300
NA 3400
OC 12000
AF 500
AS 1000


Scenario
Setting output field separator as pipe
Command
awk -F ":" 'BEGIN{OFS="|"} NR!=1 {print $1,$2}' test1.txt
Output
AS|12000
AF|9800
AS|12300
NA|3400
OC|12000
AF|500
AS|1000


Scenario
Anything in the BEGIN block executes first, before any input line is read
Command
awk 'BEGIN{FS=":"; OFS="|"; print "Con|SomeVal"} NR!=1 {print $1,$2}' test1.txt
Output
Con|SomeVal
AS|12000
AF|9800
AS|12300
NA|3400
OC|12000
AF|500
AS|1000
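
BEGIN has a counterpart, END, which runs once after the last line has been read. As a sketch on the same file, the following prints a heading first, counts the data rows, and reports the count at the end; rows is an illustrative variable name and the heading text is arbitrary.

```shell
# Recreate the article's sample file so this snippet runs on its own.
printf '%s\n' Continent:Val AS:12000 AF:9800 AS:12300 NA:3400 OC:12000 AF:500 AS:1000 > test1.txt

# BEGIN runs before any input, the middle rule runs per line, END runs after the last line.
awk 'BEGIN {FS = ":"; print "Counting rows"} NR != 1 {rows++} END {print rows " data rows"}' test1.txt
```

With the sample file this prints Counting rows followed by 7 data rows.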


Scenario
Printing FILENAME; it is printed once for every line
Command
awk -F ":" '{print FILENAME}' test1.txt
Output
test1.txt
test1.txt
test1.txt
test1.txt
test1.txt
test1.txt
test1.txt
test1.txt



Scenario
Printing FILENAME only once, using the END clause
Command
awk -F ":" ' END {print FILENAME}' test1.txt
Output
test1.txt


Scenario
Printing the last field of each line ($NF); this is the same as printing $2 here, as there are only 2 fields
Command
awk -F ":" '{print $NF}' test1.txt
Output
Val
12000
9800
12300
3400
12000
500
1000


Scenario
Matching: printing lines beginning with "AS"
Command
awk -F ":" '/^AS/' test1.txt
Output
AS:12000
AS:12300
AS:1000


Scenario
Matching: printing lines not beginning with "AS"
Command
awk -F ":" '!/^AS/' test1.txt
Output
Continent:Val
AF:9800
NA:3400
OC:12000
AF:500


Scenario
Direct matching, first field as "AS"
Command
awk -F ":" '$1=="AS"' test1.txt
Output
AS:12000
AS:12300
AS:1000


Scenario
Direct matching, first field as "AS", Print 2nd Column
Command
awk -F ":" '$1=="AS" {print $2}' test1.txt
Output
12000
12300
1000


Scenario
$0 prints the full line, same as {print}
Command 1
awk -F ":" '$1=="AS" {print $0}' test1.txt
Output
AS:12000
AS:12300
AS:1000

Command 2
awk -F ":" '$1=="AS" {print}' test1.txt
Output
AS:12000
AS:12300
AS:1000


Scenario
'OR' (||) and 'AND' (&&) together
Command
awk -F ":" '($1=="AS" || $1=="OC") && $NF > 11000 {print}' test1.txt
Output
AS:12000
AS:12300
OC:12000
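
The OR part can also be written as a single regular expression match. This sketch should select the same three lines as the command above; anchoring the pattern with ^ and $ makes the match exact rather than partial.

```shell
# Recreate the article's sample file so this snippet runs on its own.
printf '%s\n' Continent:Val AS:12000 AF:9800 AS:12300 NA:3400 OC:12000 AF:500 AS:1000 > test1.txt

# First field is exactly AS or OC, and the last field is greater than 11000.
# With no action given, awk prints the matching line.
awk -F ":" '$1 ~ /^(AS|OC)$/ && $NF > 11000' test1.txt
```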


Scenario
Partial Matching
Command
awk -F ":" '$1 ~ /A/ {print}' test1.txt
Output
AS:12000
AF:9800
AS:12300
NA:3400
AF:500
AS:1000


Scenario
Reading from STDIN (input piped from another command)
Command
cat test1.txt | awk -F ":" '!/Continent/ {print $1}' | sort | uniq
Output
AF
AS
NA
OC
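
The same list of distinct continents can be produced by awk alone, without cat, sort or uniq. This is a sketch using an associative array, here named seen (an illustrative name); note that the values come out in first-seen order rather than sorted.

```shell
# Recreate the article's sample file so this snippet runs on its own.
printf '%s\n' Continent:Val AS:12000 AF:9800 AS:12300 NA:3400 OC:12000 AF:500 AS:1000 > test1.txt

# Skip the header (NR > 1) and print a continent only the first time it appears:
# seen[$1]++ is 0 (false) on the first occurrence, so !seen[$1]++ is true once per value.
awk -F ":" 'NR > 1 && !seen[$1]++ {print $1}' test1.txt
```

With the sample file this prints AS, AF, NA and OC, one per line.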


Scenario
Add 1000 to the 2nd field where the first field is "AF", then print every line
Command
awk -F ":" '$1=="AF" {$2+=1000} {print}' test1.txt
Output
Continent:Val
AS:12000
AF 10800
AS:12300
NA:3400
OC:12000
AF 1500
AS:1000
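
The AF lines in the output above lose their colon because assigning to a field makes awk rebuild the line using OFS, which defaults to a single space. A minimal sketch that keeps the colon sets OFS alongside FS:

```shell
# Recreate the article's sample file so this snippet runs on its own.
printf '%s\n' Continent:Val AS:12000 AF:9800 AS:12300 NA:3400 OC:12000 AF:500 AS:1000 > test1.txt

# FS splits the input on ":", OFS joins the fields with ":" when a modified line is rebuilt.
awk 'BEGIN {FS = OFS = ":"} $1 == "AF" {$2 += 1000} {print}' test1.txt
```

The two AF lines should now read AF:10800 and AF:1500, and the other lines are unchanged.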


Scenario
Sum of 2nd fields, exclude first line
Command
awk -F ":" 'NR!=1 {sum+=$NF} END {print sum}' test1.txt
Output
51000
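
The same idea extends to one sum per continent by using the first field as an array key. This is a sketch; total and c are illustrative names, and the output is piped through sort because awk does not guarantee the order of a for-in loop.

```shell
# Recreate the article's sample file so this snippet runs on its own.
printf '%s\n' Continent:Val AS:12000 AF:9800 AS:12300 NA:3400 OC:12000 AF:500 AS:1000 > test1.txt

# Accumulate the 2nd field per continent (skipping the header), then print each total.
awk -F ":" 'NR != 1 {total[$1] += $2} END {for (c in total) print c, total[c]}' test1.txt | sort
```

For the sample file the totals are AF 10300, AS 25300, NA 3400 and OC 12000, which add up to the 51000 shown above.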


Scenario
Set the 2nd field to 0 where the first field is "AS"
Command
awk -F ":" 'BEGIN {OFS=":"} $1=="AS" {$2=0} {print}' test1.txt
Output
Continent:Val
AS:0
AF:9800
AS:0
NA:3400