mapreduce - How to parallelize a query in Apache Hive for a (small) dataset

- May 15, 2012

I am testing the latest Hive on part of a data set. It is a couple of GB of log files, read through a custom SerDe.

When I run simple GROUP BY queries (4 MR jobs), I get logs such as:

map : 100%
reduce : 0%
map : 85%
reduce : 0%
map : 86%
reduce : 0%

All the while, it is using only 1 core on an 8-core server. Kind of a waste...

I have activated the parallel option, but it still won't parallelize. I have also set the number of reduce tasks to 8.

My expectation was that, since the data set is partitioned (=> different files), at least some of the map-reduce phases would run in parallel on the files.

Is my understanding wrong? Is there a specific way to write the queries?

Thanks.

If you are doing nothing but a simple GROUP BY, the real processing is comparison, which isn't hard. That said, how many mappers are running? A TaskTracker does not parallelize the work inside a single task. Rather, Hadoop banks on multiple tasks running at the same time to parallelize. If you're running only 1 map task per node, you won't see anything.
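As a minimal sketch of how the mapper count could be checked and raised from the Hive CLI, assuming a Hadoop 1.x-era setup (the property names below are the classic `mapred.*` ones, and newer releases rename them). The 64 MB split size and the table and column names `logs` and `host` are hypothetical, chosen only for illustration.

```sql
-- Print the current values (SET with no value just displays the setting).
SET mapred.reduce.tasks;
SET hive.exec.parallel;

-- Presumably the options described in the question: run independent
-- stages concurrently, and use 8 reducers.
SET hive.exec.parallel=true;
SET mapred.reduce.tasks=8;

-- Assumption: smaller input splits yield more map tasks on a small data set.
-- 67108864 bytes = 64 MB, an illustrative value only.
SET mapred.max.split.size=67108864;

-- Hypothetical table and column. The job output reports how many
-- mappers and reducers each stage launched.
SELECT host, COUNT(*) FROM logs GROUP BY host;
```

Note that `hive.exec.parallel` only lets stages that do not depend on each other run at the same time. The number of tasks that actually run concurrently on one node is still capped by the TaskTracker slot settings (`mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum` in Hadoop 1.x), which are configured on the daemon rather than per query.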

Another possibility is that, because you are doing a GROUP BY, you are bound by I/O and not by the processor, so there's no need to bring multiple cores to it.
