Getting started with Impala

Správa článků

Vyhledávání

My excerpt from the book by John Russell.

Introduction

Impala is MPP - massively parallel processing engine
intended for people familiar with SQL

Chapter 1: Why Impala?

Hadoop is data-centric, SQL simple to use => with Impala you can use Hadoop for data analytics without PhD in distributed computing:)
Impala brings flexibility for ETL - using Hadoop common data formats
from batch data processing to "interactive" in Impala (=human-scale type of responsiveness)
interactive queries
using BI tools for "clicking" the queries - IBM Cognos, SAP Business Objects, Tableau...
you don't need to reorganize the data - Impala works on original data
fast queries - possible iterative submitting new queries when inspecting the data - exploratory data analysis, data discovery, ad-hoc queries
from ad-hoc exploratory queries discover the frequent queries => store them into efficient parquet format

Chapter 2: Getting Up and Running with Impala

installation possibilities: Cloudera Live Demo (Hue web interface), Cloudera Quick Start VM (virtual image), Cloudera Manager, Manual installation (need to install on all nodes - with Puppet, Salt etc.), build from source from GitHub
Impala service = impalad, statestored, catalogd
impala-shell command: connects to the impala server (impalad) - default localhost and port 21000, specify another host with -i flag
toying with impala-shell: try CREATE TABLE, INSERT, SELECT...

Chapter 3: Impala for the Database Developer

SQL language

SQL-92 standard with some enhancements from later SQL standards
SELECT: joins, views, aggregate, order by, group by, casts, column aliases, built-in functions...
CREATE TABLE, INSERT
EXPLIAN
SQL data types: STRING, INT (TINYINT, BIGINT), DOUBLE, FLOAT, TIMESTAMP, BOOLEAN, DECIMAL
DECIMAL data type: precise setting of number precision (number of digits and number of digits after decimal point) - for example DECIMAL(5,2) for 123.45; suitable for financial and precise scientific calculations
missing OLTP commands UPDATE and DELETE - Hadoop is optimized for bulk inserts and appending
data ingestion: LOAD DATA, INSERT... SELECT, INSERT OVERWRITE
NO transactions, foreign keys, constraints, unique, indexes, NOT NULL option
NO DELETE - instead of it use DROP TABLE or DROP PARTITION
REFRESH table after changing data externally

Big data considerations

Impala performance shines on big data sizes - GB, TB and large. For small data you can use scripts or grep
do tests on data with similar size to your real-world data - typically too big for one machine
bring Impala to your system before the data become too huge
HDFS block size 128 MB is OK, Parquet had default block size 1GB (note: since Impala 2.0 are both 256MB)
Parquet block size: 1GB (default 256MB since Impla 2.0), but can be smaller (512, 128 or 64 GB)
parallelism: NUM_NODES * NUM_CORES * PARQUET_SIZE, ex. cluster with 100 nodes, each 16 cores, 1GB parquet block size => 1.6 TB data can be processed in parallel

How Impala is Like a Data Warehouse

no indexes and data normalisation
full scans - efficient on all columns
minimizes I/O operations and network traffic by local data operations - partitioning

Physical and logical data layout

HDFS: speed, reliability and low-cost
HDFS block: 64MB or 128MB, since Impala 2.0 256 MB

Distributed queries

"Distributed queries are the heart and soul of Impala. Once upon a time, you needed a doctorate in parallel computing to even be able to think about doing such esoteric, obscure operations. Now, with Impala running on Hadoop, you just need…a laptop!"
TODO: distributed query schema
writing on each node produces one separate file!
data processing on specific node is not guaranteed - it is not deterministic, so results are also not deterministic (ex. SELECT * FROM my_table LIMIT 10)
expressions are computed on coordinator if possible - ex. NOW()
time needed to transmit the result back to coordinator is proportional to the size of result set

Normalized and Denormalized Data

normalized: more tables, OLTP, denormalized: OLAP (no overhead with processing - determining IDs from other tables in real-time systems...)
Impala supports both - normalized with JOINs, denormalized with Parquet format (efficient dictionary compression)

File formats

data in Hadoop:
- creating: Sqoop, Flume
- transforming: Hive, Pig
- analysing: Impala, MR, Spark
text file format: simple to create, read, fix, grep, re-interpret data types... but it BULKY - good for experimenting, but not for big data
Parquet file: columnar, appropriate data types (int, datatime, bool as single bit...), compression: RLE (run-time length), dictionary (for < 16K distinct values)
Parquet - can be splitted to more files, each holds several columns
commands SHOW TABLE STATS my_table and DESCRIBE FORMATTED my_table
change file format: ALTER TABLE… SET FILEFORMAT, table can have different partitions in different formats

Chapter 4: Common Developer Tasks for Impala

Getting Data into an Impala Table

INSERT ... VALUES - not suitable for production, because each command creates the separate data file
INSERT ... SELECT - copy data from existing table - good for filtering, changing storage format etc.
INSERT OVERWRITE - additionally clears the data from table
LOAD DATA - moves the data file(s) from HDFS to Impala's internal HDFS directory - suitable when creating data internally and Impala is the final data consumer
external tables (CREATE EXTERNAL TABLE ...) - Impala works with data files in specified HDFS path - suitable when there are several consumers of this data file
use hadoop distcp -pb srcpath dstpath for manual copying parquet files - it preserves the block size
Hive: data files are compatible (when using compatible data types) - Impala can use Hive tables directly, just call INVALIDATE METADATA and REFRESH after changes made through Hive
Sqoop - exporting data from OLTP systems (MySQL etc.) - uses JDBC, inlude --null-string parameter
Sqoop - can export to text, Avro, SequenceFile and since 1.5.2 version to Parquet
Kite - converts text file directly to HDFS to Avro or Parquet file
for Sqoop and Avro: do not forget to call INVALIDATE METADATA and REFRESH commands after data changes to notify Impala about it

Porting SQL Code to Impala

different data types than on other databases like MySQL
different DDL - specifying partitions, location, sperators etc.
no UPDATE and DELETE commands
less build-in functions than MySQL
restricted implicit conversions (regarding to MySQL)

Using Impala from a JDBC or ODBC Application,

close the connection after the query to free up the memory!
JDBC - Hadoop JDBC driver for Java
ODBC - for C++ and Python

Using Impala with a Scripting Language

impala-shell interpreter, options:
- -q query
- -f file with SQL
- -o save output to the file on local filesystem
impyla package for Python

Optimizing Impala Performance

biggest performance killers are JOINs - table statistics helps to choose the right plan
call COMPUTE STATS command after every table size change bigger than 30%
use optimized Parquet format
table size does not impact the RAM usage - Impala uses the modest size read buffer
ORDER BY, UNION, DISTINCT increases the RAM usage (since version 1.4 the intermediate results can be spilled to the disk)
RAM usage should be automatically distributed on the cluster
aggregation functions MAX, AVG, SUM - size of intermedate data is proportional to the number of groups
UNION is more complicated than UNION ALL and it uses more RAM, because it demoves duplicates - so if you know there will be no duplicates, prefer UNION ALL
LIMIT clause substantially decreases the number of RAM used (it cuts off the records over the limit on each node)
use the smallest INT possible - TINYINT, SMALLINT, BIGINT - RAM usage will be lower
TIMESTAMP - you can separate date parts using EXTRACT function
parquet writing can use lot memory - 1GB * NUM_NODES * NUM_CORES
parallel writes - RAM usage 1GB for each partitioned block, INSERT query can fail because of limit of simulaniously opened files in HDFS
how to minimize memory when writing the data:
- with computed stats, Impala does it automatically
- use SHUFFLE when INSERT ... SELECTing the data (= INSERT ... SELECT SHUFFLE ...)
- write partitions in separate queries (use SELECT DISTINCT with CONCAT subquery)
partitioning - use columns which appeaas most frequently in WHERE clauses, date columns are suitable almost everytime
partitions should correspond to HDFS block size (64MB - 1GB) - too small are inefficient, because from HDFS point of view is almost equal to read few megabytes and the whole data block
loading new partitions: command LOAD DATA INPATH

Writing User-Defined Functions

UDFs are faster and simpler to use and write than writing the application on top of Impala
Impala supports UDFs written in C++ UDFs (faster) and Java (for compatibility with Hive) and Python (impyla)
Scalar UDFs - work on single row, fast
UDAFs - User-defined aggregate functions

Collaborating with Your Administrators

recommended to dedicate 80% of RAM to Impala
out of memory is often caused by invalid statistics, which causes bad Impala query planning
data in production - use Sentry; group tables into databases and set access for it based on user roles
Impala can use YARN for resources allocation - uses Lllama for retaining the results in memory until the memory is requested for another process by YARN
again: use COMPUTE STATS after about 30% and more data changed (from Impala or from external application) in tables
HDFS caching:
- reduces I/O
- uses special memory cache area
- suitable for frequently used tables
- can be set per table with CREATE TABLE or ALTER TABLE command
developing for Impala:
- testing on one node or virtual machine installation
- later on small cluster with namenode on the same host as datanode
- production cluster with namenode on separate host; need for monitoring Impala nodes for performance
do not forget to close all queries

Odkaz na článek

Vytvořil 25. března 2016 v 09:42:31 mira. Upravováno 938x, naposledy 29. března 2016 v 19:40:55, mira

Getting started with Impala

Správa článků

Vyberte atribut, podle kterého se bude vyhledávat:

Introduction

Chapter 1: Why Impala?

Chapter 2: Getting Up and Running with Impala

Chapter 3: Impala for the Database Developer

SQL language

Big data considerations

How Impala is Like a Data Warehouse

Physical and logical data layout

Distributed queries

Normalized and Denormalized Data

File formats

Chapter 4: Common Developer Tasks for Impala

Getting Data into an Impala Table

Porting SQL Code to Impala

Using Impala from a JDBC or ODBC Application,

Using Impala with a Scripting Language

Optimizing Impala Performance

Writing User-Defined Functions

Collaborating with Your Administrators

Diskuze ke článku