mapreduce - How to parallelize a query in Apache Hive for a (small) dataset


I am testing the latest Hive on parts of our data set. It's just a couple of GB of log files that we read through a custom SerDe.

When we run simple GROUP BY queries (4 MR jobs), we get logs such as:

  • map : 100%
  • reduce : 0%
  • map : 85%
  • reduce : 0%
  • map : 86%
  • reduce : 0%

All this while using only 1 core on an 8-core server. That's kind of a waste...

I have activated the parallel option, but it still won't parallelize. I have also set the number of reduce tasks to 8.
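For reference, the two options mentioned above are usually set per session along these lines. This is only a sketch, assuming the classic (MRv1-era) property names; newer Hadoop releases use `mapreduce.job.reduces` for the reducer count. Note that `hive.exec.parallel` only lets independent stages of one query run at the same time. It does not change how many map tasks run inside a single stage.

```sql
-- Allow independent stages (MR jobs) of one query to run concurrently.
SET hive.exec.parallel=true;
-- Upper bound on concurrently running stages (8 is also the usual default).
SET hive.exec.parallel.thread.number=8;
-- Request 8 reduce tasks per job (classic property name).
SET mapred.reduce.tasks=8;
```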

My expectation was that, since the data set is partitioned (=> different files), at least some of the map-reduce phases would run in parallel on these files.

Is my understanding wrong? Is there a specific way to write the queries?

Thanks.

If you are doing nothing but a simple GROUP BY, with no real processing or comparison, the work isn't that hard to begin with. That said, how many mappers are you running? A TaskTracker does not parallelize a task on its own. Rather, Hadoop banks on multiple tasks and multiple TaskTrackers running at once to parallelize. If you're only running 1 map task per node, you won't see any parallelism.
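One way to check this from the Hive CLI is to print the relevant properties. In Hive, `SET` with a property name and no value prints its current value. This is a rough diagnostic sketch, assuming MRv1 property names. The values shown are the client's view of the configuration. The slot limits that actually apply are the ones in `mapred-site.xml` on each TaskTracker, where the classic default is 2 map slots and 2 reduce slots, and changing them requires a TaskTracker restart.

```sql
-- If this prints "local", jobs run in-process through the local job runner
-- rather than on a JobTracker/TaskTracker cluster.
SET mapred.job.tracker;
-- Maximum number of map tasks a single TaskTracker runs at once.
SET mapred.tasktracker.map.tasks.maximum;
-- Maximum number of reduce tasks a single TaskTracker runs at once.
SET mapred.tasktracker.reduce.tasks.maximum;
```

On a single 8-core machine, a job tracker value of `local` or a low map-slot limit would be consistent with seeing only one busy core.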

Another possibility is that, because you're only doing a GROUP BY, you're bound on I/O rather than on the processor, so there's no need to bring multiple cores to bear on it.

