
Hadoop Quick Start Guide

Much of this article is adapted from Hadoop: Setting up a Single Node Cluster.

Note: the test environment should contain three virtual machines, each with its own NIC UUID, IP address, and hostname; the information below shows the configuration of only one of them.

Virtual Machine Information

  • CentOS Linux release 7.6.1810 (Core)

    Derived from Red Hat Enterprise Linux 7.6 (Source)

    Linux hadoop10 3.10.0-957.27.2.el7.x86_64

  • Account information

Username   Password
sun        123456
hadoop     123456
root       123456
  • Always operate as the hadoop user when testing Hadoop

Network Configuration Details

VMware-Side Configuration

Environment: VMware® Workstation 15 Pro 15.0.0 build-10134415

  • VMnet virtual network configuration
Name     Type   External connection   Host connection   DHCP   Subnet address
VMnet8   NAT    NAT                   Connected         -      192.168.100.0
  • Disable the DHCP service

  • Subnet IP: 192.168.100.0  Subnet mask: 255.255.255.0

  • Gateway IP: 192.168.100.2

  • Disable IPv6

Configuration Inside the Virtual Machine

  • ip addr

    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
        inet6 ::1/128 scope host
           valid_lft forever preferred_lft forever
    2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
        link/ether 00:0c:29:f1:16:d2 brd ff:ff:ff:ff:ff:ff
        inet 192.168.100.10/24 brd 192.168.100.255 scope global noprefixroute ens33
           valid_lft forever preferred_lft forever
        inet6 fe80::711a:c906:a3e3:aed6/64 scope link noprefixroute
           valid_lft forever preferred_lft forever

  • Main network settings: /etc/sysconfig/network-scripts/ifcfg-ens33 (a sketch for applying these settings follows after this list)

    BOOTPROTO=static
    ONBOOT=yes
    IPADDR=192.168.100.10
    NETMASK=255.255.255.0
    GATEWAY=192.168.100.2
    DNS1=114.114.114.114
    DNS2=8.8.8.8
  • Hostname: /etc/sysconfig/network

    NETWORKING=yes
    HOSTNAME=hadoop10
  • Hosts: /etc/hosts

    127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
    ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
    192.168.100.10 hadoop10
    192.168.100.11 hadoop11
    192.168.100.12 hadoop12
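
The original article does not show how these settings are applied. Below is a minimal sketch, assuming a CentOS 7 guest with the legacy network service available and run as root (or via sudo), for activating the static IP and hostname and checking that the nodes can reach each other by name.

    # Restart the network service so the static IP in ifcfg-ens33 takes effect
    $ systemctl restart network
    # Set the hostname (on CentOS 7 this also writes /etc/hostname)
    $ hostnamectl set-hostname hadoop10
    # Verify that the other nodes are reachable by name (relies on /etc/hosts above)
    $ ping -c 3 hadoop11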

Environment Variables

  • Directory holding Hadoop and the JDK (note: everything under /opt must be owned by the hadoop user)

    tree -L 1 /opt/modules/

    /opt/modules/
    ├── hadoop-3.1.2
    └── jdk1.8.0_201
  • Environment variables (see the verification sketch after this list)

    cat /etc/profile | tail -n 9

    #java
    export JAVA_HOME=/opt/modules/jdk1.8.0_201
    export PATH=$JAVA_HOME/bin:$PATH
    export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib

    #hadoop
    export HADOOP_HOME=/opt/modules/hadoop-3.1.2
    export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
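
As a quick check (not part of the original article), the sketch below reloads the profile in the current shell and confirms that both toolchains are on PATH; the versions reported should match those installed above.

    # Reload the profile without logging out
    $ source /etc/profile
    # Should report java version 1.8.0_201
    $ java -version
    # Should report Hadoop 3.1.2
    $ hadoop version
    # Should print /opt/modules/hadoop-3.1.2
    $ echo $HADOOP_HOME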

Running the Examples

Part 1. Standalone Operation

By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.

The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.

Ⅰ.Grep

  1. Create the input files (pwd: /opt/modules/hadoop-3.1.2/)

    [hadoop@hadoop10 hadoop-3.1.2]$ mkdir input
    [hadoop@hadoop10 hadoop-3.1.2]$ cp etc/hadoop/*.xml input
  2. Directory structure

    [hadoop@hadoop10 hadoop-3.1.2]$ tree input/
    input/
    ├── capacity-scheduler.xml
    ├── core-site.xml
    ├── hadoop-policy.xml
    ├── hdfs-site.xml
    ├── httpfs-site.xml
    ├── kms-acls.xml
    ├── kms-site.xml
    ├── mapred-site.xml
    └── yarn-site.xml
  3. Run the example (note: the job fails if the output directory already exists; see the cleanup sketch after this list)

    [hadoop@hadoop10 hadoop-3.1.2]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar grep input output 'dfs[a-z.]+'
  4. Result (matches of the regular expression)

    [hadoop@hadoop10 hadoop-3.1.2]$ cat output/part-r-00000
    1 dfsadmin
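
A note not in the original: besides part-r-00000, the output directory contains an empty _SUCCESS marker, and the directory must be removed before the example can be rerun. A minimal cleanup sketch:

    # The completed job also leaves an empty _SUCCESS marker
    $ ls output/
    _SUCCESS  part-r-00000
    # Remove the whole output directory before rerunning the example
    $ rm -rf output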

Ⅱ.WordCount

  1. Create the input file (pwd: /opt/modules/hadoop-3.1.2/)

    [hadoop@hadoop10 hadoop-3.1.2]$ mkdir wcinput
    [hadoop@hadoop10 hadoop-3.1.2]$ cd wcinput
    [hadoop@hadoop10 wcinput]$ touch wc.input
    # Contents added to wc.input
    hadoop mark
    hadoop yarn
    sun key
    sun key
  2. Run the example

    [hadoop@hadoop10 hadoop-3.1.2]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar wordcount wcinput wcoutput
  3. Result

    [hadoop@hadoop10 hadoop-3.1.2]$ cat wcoutput/part-r-00000 
    hadoop 2
    key 2
    mark 1
    sun 2
    yarn 1
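
An aside not covered in the original article: running the examples jar without arguments prints the full list of bundled programs, which is useful for finding demos beyond grep and wordcount (output abridged).

    $ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar
    An example program must be given as the first argument.
    Valid program names are:
      grep: A map/reduce program that counts the matches of a regex in the input.
      wordcount: A map/reduce program that counts the words in the input files.
      ...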

Part 2. Pseudo-Distributed Operation

Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.

  1. Configuration (pwd: /opt/modules/hadoop-3.1.2/)

    • vim etc/hadoop/core-site.xml
    <configuration>
        <!--
        Address of the NameNode for HDFS.
        Note: the standalone example in Part 1 relies on the local Linux filesystem,
        so once this property is set that example can no longer be run.
        -->
        <property>
            <name>fs.defaultFS</name>
            <!-- By convention, change localhost to the hostname -->
            <value>hdfs://hadoop10:9000</value>
        </property>
        <!-- Directory for files generated at Hadoop runtime -->
        <property>
            <name>hadoop.tmp.dir</name>
            <value>/opt/modules/hadoop-3.1.2/data/tmp</value>
        </property>
    </configuration>
    • etc/hadoop/hdfs-site.xml
    <configuration>
        <!-- Block replication factor; the default is 3 -->
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>
    • Set JAVA_HOME for Hadoop
    $ vim etc/hadoop/hadoop-env.sh 
    # Add or change to:
    export JAVA_HOME=/opt/modules/jdk1.8.0_201
  2. Set up passwordless SSH

    Now check that you can ssh to the localhost without a passphrase:

    $ ssh localhost

    If you cannot ssh to localhost without a passphrase, execute the following commands:

    $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
    $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    $ chmod 0600 ~/.ssh/authorized_keys
  3. Format the filesystem

    Note: the filesystem only needs to be formatted the first time the cluster is set up.

    Because Hadoop's bin and sbin directories are already on PATH (see the environment variables above), the bin/ and sbin/ prefixes can be omitted here.

    $ bin/hdfs namenode -format
  4. Start the NameNode and DataNode daemons (a sketch for checking and stopping HDFS follows at the end of this list)

    [hadoop@hadoop10 hadoop-3.1.2]$ start-dfs.sh 
    Starting namenodes on [hadoop10]
    Starting datanodes
    Starting secondary namenodes [hadoop10]
    # Check that the daemons started successfully
    $ jps
    3080 SecondaryNameNode
    2746 NameNode
    3213 Jps
    2863 DataNode
  5. Browse the web interface for the NameNode; by default it is available at:

    NameNode - http://localhost:9870/

  6. HDFS file operations

    # Create a directory
    $ hdfs dfs -mkdir -p /home/hadoop/input
    # List the directory tree
    $ hdfs dfs -ls -R /
    drwxr-xr-x - hadoop supergroup 0 2019-09-22 01:30 /home
    drwxr-xr-x - hadoop supergroup 0 2019-09-22 01:30 /home/hadoop
    drwxr-xr-x - hadoop supergroup 0 2019-09-22 01:30 /home/hadoop/input
    # Upload a local Linux file into HDFS
    $ hdfs dfs -put wcinput/wc.input /home/hadoop/input
    # List the directory tree again
    $ hdfs dfs -ls -R /
    drwxr-xr-x - hadoop supergroup 0 2019-09-22 01:30 /home
    drwxr-xr-x - hadoop supergroup 0 2019-09-22 01:30 /home/hadoop
    drwxr-xr-x - hadoop supergroup 0 2019-09-22 01:34 /home/hadoop/input
    -rw-r--r-- 1 hadoop supergroup 42 2019-09-22 01:34 /home/hadoop/input/wc.input
  7. WordCount example

    # Run the example (remember to delete the output directory first)
    $ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar wordcount /home/hadoop/input/wc.input /home/hadoop/output
    # Inspect the generated files
    $ hdfs dfs -ls -R /
    drwxr-xr-x - hadoop supergroup 0 2019-09-22 01:30 /home
    drwxr-xr-x - hadoop supergroup 0 2019-09-22 01:43 /home/hadoop
    drwxr-xr-x - hadoop supergroup 0 2019-09-22 01:34 /home/hadoop/input
    -rw-r--r-- 1 hadoop supergroup 42 2019-09-22 01:34 /home/hadoop/input/wc.input
    drwxr-xr-x - hadoop supergroup 0 2019-09-22 01:43 /home/hadoop/output
    -rw-r--r-- 1 hadoop supergroup 0 2019-09-22 01:43 /home/hadoop/output/_SUCCESS
    -rw-r--r-- 1 hadoop supergroup 35 2019-09-22 01:43 /home/hadoop/output/part-r-00000
    # View the output
    $ hdfs dfs -cat /home/hadoop/output/part*
    hadoop 2
    key 2
    mark 1
    sun 2
    yarn 1
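
Not part of the original article: a brief sketch for confirming that HDFS is healthy and for shutting the daemons down again when you are done experimenting.

    # Summarize capacity and live DataNodes (one live DataNode is expected here)
    $ hdfs dfsadmin -report
    # Stop the NameNode, DataNode and SecondaryNameNode daemons
    $ stop-dfs.sh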

Part 3. YARN on a Single Node

  1. Configuration (pwd: /opt/modules/hadoop-3.1.2/)

    • vim etc/hadoop/mapred-site.xml
    <configuration>
        <!-- Run MapReduce on YARN -->
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
        <property>
            <name>yarn.app.mapreduce.am.env</name>
            <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
        <property>
            <name>mapreduce.map.env</name>
            <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
        <property>
            <name>mapreduce.reduce.env</name>
            <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
    </configuration>
    • vim etc/hadoop/yarn-site.xml
    <configuration>
        <!-- How reducers obtain data (shuffle) -->
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <!-- Hostname of the YARN ResourceManager -->
        <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>hadoop10</value>
        </property>
    </configuration>
  2. Start the ResourceManager and NodeManager daemons (a sketch for checking and stopping YARN follows at the end of this list)

  • start-yarn.sh
  3. Check the processes

    [hadoop@hadoop10 hadoop-3.1.2]$ jps
    10800 Jps
    9988 ResourceManager
    5432 SecondaryNameNode
    5230 DataNode
    5119 NameNode
    10095 NodeManager
  4. Browse the web interface for the ResourceManager; by default it is available at:

    • ResourceManager - http://localhost:8088/
  5. WordCount example (delete the previous /home/hadoop/output in HDFS first if it still exists)

    [hadoop@hadoop10 hadoop-3.1.2]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar wordcount /home/hadoop/input/wc.input /home/hadoop/output
  6. View the result

    [hadoop@hadoop10 hadoop-3.1.2]$ hdfs dfs -cat /home/hadoop/output/part-r-00000
    hadoop 2
    key 2
    mark 1
    sun 2
    yarn 1
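
Also not in the original: a quick way to confirm that the job really went through YARN, plus the matching shutdown command.

    # Completed applications should include the word count job just submitted
    $ yarn application -list -appStates FINISHED
    # Stop the ResourceManager and NodeManager daemons
    $ stop-yarn.sh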

Part 4. The JobHistory Server

  1. Configure mapred-site.xml

    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
        <!-- JobHistory server IPC address -->
        <property>
            <name>mapreduce.jobhistory.address</name>
            <value>hadoop10:10020</value>
        </property>
        <!-- JobHistory server web UI address -->
        <property>
            <name>mapreduce.jobhistory.webapp.address</name>
            <value>hadoop10:19888</value>
        </property>

        <property>
            <name>yarn.app.mapreduce.am.env</name>
            <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
        <property>
            <name>mapreduce.map.env</name>
            <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
        <property>
            <name>mapreduce.reduce.env</name>
            <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
    </configuration>
  2. Start the JobHistory server (a stop command is sketched after this list)

    [hadoop@hadoop10 hadoop-3.1.2]$ mapred --daemon start historyserver
  3. Check the processes, then visit http://hadoop10:19888/ to view the job history

    [hadoop@hadoop10 hadoop-3.1.2]$ jps
    1408 NameNode
    2096 NodeManager
    3216 JobHistoryServer
    1970 ResourceManager
    1525 DataNode
    1736 SecondaryNameNode
    3309 Jps
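
For completeness (not shown in the original), the daemon can be stopped with the same mapred helper that started it:

    # Stop the JobHistory server when it is no longer needed
    $ mapred --daemon stop historyserver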

Part 5. Log Aggregation

Just a brief note here; the procedure is essentially the same as before.

  1. Configure yarn-site.xml

    <configuration>
        <!-- Enable log aggregation -->
        <property>
            <name>yarn.log-aggregation-enable</name>
            <value>true</value>
        </property>
        <!-- Retain aggregated logs for 7 days -->
        <property>
            <name>yarn.log-aggregation.retain-seconds</name>
            <value>604800</value>
        </property>
        <!-- How reducers obtain data (shuffle) -->
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <!-- Hostname of the YARN ResourceManager -->
        <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>hadoop10</value>
        </property>
    </configuration>
  2. Stop YARN (there is no need to restart HDFS or the JobHistory server here); the commands for steps 2-5 are sketched after this list

  3. Here I deleted the temporary files YARN had generated in HDFS, along with the output directory left over from the earlier WordCount run

  4. Restart YARN

  5. Run the WordCount example

  6. Visit http://hadoop10:19888/jobhistory

  7. Find the WordCount job that just finished; there should be only one

  8. Open its logs

  9. View the logs in the web UI
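
The original post does not spell out the commands behind steps 2-5. Here is a minimal sketch, assuming the defaults from the earlier parts are still in place, in particular that YARN/MapReduce staging data lives under /tmp in HDFS and that the old WordCount output sits at /home/hadoop/output.

    # 2. Stop YARN only; HDFS and the JobHistory server can stay up
    $ stop-yarn.sh
    # 3. Remove the temporary files YARN generated in HDFS and the old WordCount output
    $ hdfs dfs -rm -r /tmp
    $ hdfs dfs -rm -r /home/hadoop/output
    # 4. Restart YARN so the new yarn-site.xml takes effect
    $ start-yarn.sh
    # 5. Rerun the WordCount example
    $ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar wordcount /home/hadoop/input/wc.input /home/hadoop/output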