SparkSQL Usage: Spark SQL CLI

Spark SQL CLI Overview

The Spark SQL CLI makes it convenient to query Hive directly from SparkSQL through the Hive metastore. Note that in the current version the Spark SQL CLI cannot interact with the ThriftServer.

Before using the Spark SQL CLI, two things are needed (a combined sketch follows this list):

1. Copy the hive-site.xml configuration file into the $SPARK_HOME/conf directory.
2. Append the JDBC driver jar to SPARK_CLASSPATH in $SPARK_HOME/conf/spark-env.sh:

export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/Hadoop/software/mysql-connector-java-5.1.27-bin.jar
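A minimal sketch of the two setup steps as shell commands; the Hive configuration path /home/hadoop/app/hive/conf is an assumption, so adjust it to your own installation:

# Step 1: copy Hive's metastore configuration into Spark's conf directory
# (the source path below is hypothetical)
cp /home/hadoop/app/hive/conf/hive-site.xml $SPARK_HOME/conf/

# Step 2: append the MySQL JDBC driver jar to SPARK_CLASSPATH
echo 'export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/Hadoop/software/mysql-connector-java-5.1.27-bin.jar' >> $SPARK_HOME/conf/spark-env.sh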
Spark SQL CLI command-line options:

cd $SPARK_HOME/bin
spark-sql --help

Usage: ./bin/spark-sql [options] [cli option]
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Options:
  --master MASTER_URL       spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME               A name of your application.
  --jars JARS               Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --py-files PY_FILES       Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES             Comma-separated list of files to be placed in the working
                              directory of each executor.
  --conf PROP=VALUE         Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.
  --driver-memory MEM       Memory for driver (e.g. 1000M, 2G) (Default: 512M).
  --driver-java-options     Extra Java options to pass to the driver.
  --driver-library-path     Extra library path entries to pass to the driver.
  --driver-class-path       Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.
  --executor-memory MEM     Memory per executor (e.g. 1000M, 2G) (Default: 1G).
  --help, -h                  Show this help message and exit
  --verbose, -v             Print additional debug output

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).
  --supervise               If given, restarts the driver on failure.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 YARN-only:
  --executor-cores NUM        Number of cores per executor (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM       Number of executors to launch (Default: 2).
  --archives ARCHIVES       Comma separated list of archives to be extracted into the
                              working directory of each executor.

CLI options:
 -d,--define <key=value>          Variable substitution to apply to hive
                                  commands. e.g. -d A=B or --define A=B
    --database <databasename>   Specify the database to use
 -e <quoted-query-string>       SQL from command line
 -f <filename>                    SQL from files
 -h <hostname>                    connecting to Hive Server on remote host
    --hiveconf <property=value> Use value for given property
    --hivevar <key=value>       Variable substitution to apply to hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -p <port>                        connecting to Hive Server on port number
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                   Verbose mode (echo executed SQL to the console)

If no master is specified when starting spark-sql, it runs in local mode. The master can point either to a standalone cluster address or to YARN. When the master is set to yarn (spark-sql --master yarn), the execution of the whole job can be monitored at http://hadoop000:8088.

Note: if spark.master spark://hadoop000:7077 is configured in $SPARK_HOME/conf/spark-defaults.conf, then spark-sql runs on the standalone cluster even when started without an explicit master.
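The CLI options listed above also allow running SQL non-interactively. A small sketch, assuming a hypothetical script path /home/spark/sql/top_sessions.sql and variable name tbl:

# Run one statement with -e, substituting a variable via --hivevar
spark-sql --master yarn --hivevar tbl=page_views -e "SELECT count(*) FROM ${hivevar:tbl}"

# Run all statements in a file with -f, against an explicit database
spark-sql --database default -f /home/spark/sql/top_sessions.sql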
Using spark-sql

Starting spark-sql: since spark.master spark://hadoop000:7077 is already configured in spark-defaults.conf, I did not specify a master when starting spark-sql.

cd $SPARK_HOME/bin
spark-sql

SELECT track_time, url, session_id, referer, ip, end_user_id, city_id FROM page_views WHERE city_id = -1000 limit 10;
SELECT session_id, count(*) c FROM page_views group by session_id order by c desc limit 10;

The page_views table used by the two SQL statements above already exists in Hive in my setup; if it does not exist in yours, create it by hand. The creation script and the data-import script are as follows:

create table page_views(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY " ";

load data local inpath "/home/spark/software/data/page_views.dat" overwrite into table page_views;
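As a sanity check, the DDL and load statements can also be saved to a file and executed in one shot; the script path below is an assumption for illustration:

# Run the create/load script non-interactively (path is hypothetical)
spark-sql -f /home/spark/sql/create_page_views.sql

# Verify that the data was loaded
spark-sql -e "SELECT count(*) FROM page_views"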