Přejít na menu

Getting started with Impala

Správa článků

Vyhledávání Vyhledávání
25.3.2016 09:37
Počet přečtení: 1417

Obrázek ke článku Getting started with ImpalaMy excerpt from the book by John Russell.


  • Impala is MPP - massively parallel processing engine
  • intended for people familiar with SQL

Chapter 1: Why Impala?

  • Hadoop is data-centric, SQL simple to use => with Impala you can use Hadoop for data analytics without PhD in distributed computing:)
  • Impala brings flexibility for ETL - using Hadoop common data formats
  • from batch data processing to "interactive" in Impala (=human-scale type of responsiveness)
  • interactive queries
  • using BI tools for "clicking" the queries - IBM Cognos, SAP Business Objects, Tableau...
  • you don't need to reorganize the data - Impala works on original data
  • fast queries - possible iterative submitting new queries when inspecting the data - exploratory data analysis, data discovery, ad-hoc queries
  • from ad-hoc exploratory queries discover the frequent queries => store them into efficient parquet format

Chapter 2: Getting Up and Running with Impala

  • installation possibilities: Cloudera Live Demo (Hue web interface), Cloudera Quick Start VM (virtual image), Cloudera Manager, Manual installation (need to install on all nodes - with Puppet, Salt etc.), build from source from GitHub
  • Impala service = impalad, statestored, catalogd
  • impala-shell command: connects to the impala server (impalad) - default localhost and port 21000, specify another host with -i flag
  • toying with impala-shell: try CREATE TABLE, INSERT, SELECT...

Chapter 3: Impala for the Database Developer

SQL language

  • SQL-92 standard with some enhancements from later SQL standards
  • SELECT: joins, views, aggregate, order by, group by, casts, column aliases, built-in functions...
  • DECIMAL data type: precise setting of number precision (number of digits and number of digits after decimal point) - for example DECIMAL(5,2) for 123.45; suitable for financial and precise scientific calculations
  • missing OLTP commands UPDATE and DELETE - Hadoop is optimized for bulk inserts and appending
  • NO transactions, foreign keys, constraints, unique, indexes, NOT NULL option
  • NO DELETE - instead of it use DROP TABLE or DROP PARTITION
  • REFRESH table after changing data externally

Big data considerations

  • Impala performance shines on big data sizes - GB, TB and large. For small data you can use scripts or grep
  • do tests on data with similar size to your real-world data - typically too big for one machine
  • bring Impala to your system before the data become too huge
  • HDFS block size 128 MB is OK, Parquet had default block size 1GB (note: since Impala 2.0 are both 256MB)
  • Parquet block size: 1GB (default 256MB since Impla 2.0), but can be smaller (512, 128 or 64 GB)
  • parallelism: NUM_NODES * NUM_CORES * PARQUET_SIZE, ex. cluster with 100 nodes, each 16 cores, 1GB parquet block size => 1.6 TB data can be processed in parallel

How Impala is Like a Data Warehouse

  • no indexes and data normalisation
  • full scans - efficient on all columns
  • minimizes I/O operations and network traffic by local data operations - partitioning

Physical and logical data layout

  • HDFS: speed, reliability and low-cost
  • HDFS block: 64MB or 128MB, since Impala 2.0 256 MB

Distributed queries

  • "Distributed queries are the heart and soul of Impala. Once upon a time, you needed a doctorate in parallel computing to even be able to think about doing such esoteric, obscure operations. Now, with Impala running on Hadoop, you just need…a laptop!"
  • TODO: distributed query schema
  • writing on each node produces one separate file!
  • data processing on specific node is not guaranteed - it is not deterministic, so results are also not deterministic (ex. SELECT * FROM my_table LIMIT 10)
  • expressions are computed on coordinator if possible - ex. NOW()
  • time needed to transmit the result back to coordinator is proportional to the size of result set

Normalized and Denormalized Data

  • normalized: more tables, OLTP, denormalized: OLAP (no overhead with processing - determining IDs from other tables in real-time systems...)
  • Impala supports both - normalized with JOINs, denormalized with Parquet format (efficient dictionary compression)

File formats

  • data in Hadoop:
    • creating: Sqoop, Flume
    • transforming: Hive, Pig
    • analysing: Impala, MR, Spark
  • text file format: simple to create, read, fix, grep, re-interpret data types... but it BULKY - good for experimenting, but not for big data
  • Parquet file: columnar, appropriate data types (int, datatime, bool as single bit...), compression: RLE (run-time length), dictionary (for < 16K distinct values)
  • Parquet - can be splitted to more files, each holds several columns
  • commands SHOW TABLE STATS my_table and DESCRIBE FORMATTED my_table
  • change file format: ALTER TABLE… SET FILEFORMAT, table can have different partitions in different formats

Chapter 4: Common Developer Tasks for Impala

Getting Data into an Impala Table

  • INSERT ... VALUES - not suitable for production, because each command creates the separate data file
  • INSERT ... SELECT - copy data from existing table - good for filtering, changing storage format etc.
  • INSERT OVERWRITE - additionally clears the data from table
  • LOAD DATA - moves the data file(s) from HDFS to Impala's internal HDFS directory - suitable when creating data internally and Impala is the final data consumer
  • external tables (CREATE EXTERNAL TABLE ...) - Impala works with data files in specified HDFS path - suitable when there are several consumers of this data file
  • use hadoop distcp -pb srcpath dstpath for manual copying parquet files - it preserves the block size
  • Hive: data files are compatible (when using compatible data types) - Impala can use Hive tables directly, just call INVALIDATE METADATA and REFRESH after changes made through Hive
  • Sqoop - exporting data from OLTP systems (MySQL etc.) - uses JDBC, inlude --null-string parameter
  • Sqoop - can export to text, Avro, SequenceFile and since 1.5.2 version to Parquet
  • Kite - converts text file directly to HDFS to Avro or Parquet file
  • for Sqoop and Avro: do not forget to call INVALIDATE METADATA and REFRESH commands after data changes to notify Impala about it

Porting SQL Code to Impala

  • different data types than on other databases like MySQL
  • different DDL - specifying partitions, location, sperators etc.
  • no UPDATE and DELETE commands
  • less build-in functions than MySQL
  • restricted implicit conversions (regarding to MySQL)

Using Impala from a JDBC or ODBC Application, 

  • close the connection after the query to free up the memory!
  • JDBC - Hadoop JDBC driver for Java
  • ODBC - for C++ and Python

Using Impala with a Scripting Language

  • impala-shell interpreter, options:
    • -q query
    • -f file with SQL
    • -o save output to the file on local filesystem
  • impyla package for Python

Optimizing Impala Performance

  • biggest performance killers are JOINs - table statistics helps to choose the right plan
  • call COMPUTE STATS command after every table size change bigger than 30%
  • use optimized Parquet format
  • table size does not impact the RAM usage - Impala uses the modest size read buffer
  • ORDER BY, UNION, DISTINCT increases the RAM usage (since version 1.4 the intermediate results can be spilled to the disk)
  • RAM usage should be automatically distributed on the cluster
  • aggregation functions MAX, AVG, SUM - size of intermedate data is proportional to the number of groups
  • UNION is more complicated than UNION ALL and it uses more RAM, because it demoves duplicates - so if you know there will be no duplicates, prefer UNION ALL
  • LIMIT clause substantially decreases the number of RAM used (it cuts off the records over the limit on each node)
  • use the smallest INT possible - TINYINT, SMALLINT, BIGINT - RAM usage will be lower
  • TIMESTAMP - you can separate date parts using EXTRACT function
  • parquet writing can use lot memory - 1GB  * NUM_NODES * NUM_CORES
  • parallel writes - RAM usage 1GB for each partitioned block, INSERT query can fail because of limit of simulaniously opened files in HDFS
  • how to minimize memory when writing the data:
    • with computed stats, Impala does it automatically
    • use SHUFFLE when INSERT ... SELECTing the data (= INSERT ... SELECT SHUFFLE ...)
    • write partitions in separate queries (use SELECT DISTINCT with CONCAT subquery)
  • partitioning - use columns which appeaas most frequently in WHERE clauses, date columns are suitable almost everytime
  • partitions should correspond to HDFS block size (64MB - 1GB) - too small are inefficient, because from HDFS point of view is almost equal to read few megabytes and the whole data block
  • loading new partitions: command LOAD DATA INPATH

Writing User-Defined Functions

  • UDFs are faster and simpler to use and write than writing the application on top of Impala
  • Impala supports UDFs written in C++ UDFs (faster) and Java (for compatibility with Hive) and Python (impyla)
  • Scalar UDFs - work on single row, fast
  • UDAFs - User-defined aggregate functions

Collaborating with Your Administrators

  • recommended to dedicate 80% of RAM to Impala
  • out of memory is often caused by invalid statistics, which causes bad Impala query planning
  • data in production - use Sentry; group tables into databases and set access for it based on user roles
  • Impala can use YARN for resources allocation - uses Lllama for retaining the results in memory until the memory is requested for another process by YARN
  • again: use COMPUTE STATS after about 30% and more data changed (from Impala or from external application) in tables
  • HDFS caching:
    • reduces I/O
    • uses special memory cache area
    • suitable for frequently used tables
    • can be set per table with CREATE TABLE or ALTER TABLE command
  • developing for Impala:
    • testing on one node or virtual machine installation
    • later on small cluster with namenode on the same host as datanode
    • production cluster with namenode on separate host; need for monitoring Impala nodes for performance
  • do not forget to close all queries
Vytvořil 25. března 2016 v 09:42:31 mira. Upravováno 938x, naposledy 29. března 2016 v 19:40:55, mira

Diskuze ke článku

Vložení nového komentáře