[HDFS] How the Hadoop Balancer Works

Overview

This post looks at why data becomes unbalanced in a Hadoop cluster and how the balancer evens it out.

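Data skew in a Hadoop cluster

Adding new DataNodes

Unless you run the balancer separately, existing data is not moved to the new DataNodes. Only newly written blocks can land on them.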

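Client applications

A client requests a write to HDFS. But the NameNode decides which DataNodes receive the data, so can the client program really cause skew? One likely explanation: when the client itself runs on a DataNode, the default placement policy puts the first replica on that local node, so a client that always writes from the same machines keeps filling those machines.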

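HDFS block allocation

When HDFS allocates a block, it does not try to even out how full each DataNode is, so the cluster drifts out of balance over time.
There are placement rules, though. With the default of 3 replicas, the first replica goes to the writer's local node (when the client runs on a DataNode), the second to a node on a different rack, and the third to a different node on the same rack as the second.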
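To make the capacity point concrete, here is a minimal sketch. It assumes writes end up spread evenly across nodes (a simplification of the random choice HDFS actually makes), and the node names and sizes are made up for illustration.

```python
def utilization_after_even_allocation(capacities_gb, total_written_gb):
    """Spread writes evenly over all nodes, ignoring capacity.

    Returns the percentage of each node's capacity that ends up used.
    """
    share = total_written_gb / len(capacities_gb)
    return {node: 100.0 * share / cap for node, cap in capacities_gb.items()}


# Hypothetical cluster: three DataNodes with different disk sizes (in GB).
capacities = {"dn1": 1000, "dn2": 2000, "dn3": 4000}
usage = utilization_after_even_allocation(capacities, 1200)

for node, pct in usage.items():
    print(f"{node}: {pct:.1f}% used")
```

Every node receives the same 400 GB, yet the smallest node ends up four times as full as the largest one, which is the kind of gap the balancer is meant to close.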

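Hadoop cluster balancing process

1) Storage group classification

Source groups

Over-Utilized
Above-Average

Target groups

Below-Average
Under-Utilized

Each storage is classified by comparing its utilization with the cluster average. Storages more than the threshold above the average are Over-Utilized, and those more than the threshold below it are Under-Utilized. The threshold defaults to 10% and can be changed with the balancer's -threshold option. A smaller threshold forces DataNode utilization into a narrower band, so the balancer run takes longer.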
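The classification can be sketched roughly as follows. This is an illustration, not the balancer's actual code: it assumes all nodes have equal capacity (so the cluster average is simply the mean of the percentages), and the treatment of values exactly on a boundary is my own choice. The function name and node names are hypothetical.

```python
def classify(utilization_pct, threshold_pct=10.0):
    """Sort nodes into the four balancer groups by distance from the average."""
    avg = sum(utilization_pct.values()) / len(utilization_pct)
    groups = {
        "Over-Utilized": [],
        "Above-Average": [],
        "Below-Average": [],
        "Under-Utilized": [],
    }
    for node, used in sorted(utilization_pct.items()):
        diff = used - avg
        if diff > threshold_pct:
            groups["Over-Utilized"].append(node)
        elif diff > 0:
            groups["Above-Average"].append(node)
        elif diff >= -threshold_pct:
            groups["Below-Average"].append(node)
        else:
            groups["Under-Utilized"].append(node)
    return avg, groups


avg, groups = classify({"dn1": 80.0, "dn2": 55.0, "dn3": 45.0, "dn4": 20.0})
print(avg)
for name, nodes in groups.items():
    print(name, nodes)
```

With the default 10% threshold and an average of 50%, dn1 (80%) is Over-Utilized and dn4 (20%) is Under-Utilized, while dn2 and dn3 sit within the threshold on either side of the average.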

2) Storage group pairing (source group - target group)

Pairs on the same rack are matched first!

When the source and target are on the same rack
- Over-Utilized → Under-Utilized
- Over-Utilized → Below-Average
- Above-Average → Under-Utilized

When the source and target are on different racks
- Over-Utilized → Under-Utilized
- Over-Utilized → Below-Average
- Above-Average → Under-Utilized
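The priority order above can be sketched as a simple enumeration. Note the assumptions: the real balancer also tracks how many bytes each source and target can still give or take, which this sketch ignores, and the names PAIRING_ORDER, candidate_pairs, and the dn/rack labels are hypothetical.

```python
from itertools import product

# Group pairings in priority order, as listed above.
PAIRING_ORDER = [
    ("Over-Utilized", "Under-Utilized"),
    ("Over-Utilized", "Below-Average"),
    ("Above-Average", "Under-Utilized"),
]


def candidate_pairs(groups, rack_of):
    """List (source, target) pairs: same-rack pairs first, then cross-rack."""
    pairs = []
    for same_rack in (True, False):
        for src_group, dst_group in PAIRING_ORDER:
            sources = groups.get(src_group, [])
            targets = groups.get(dst_group, [])
            for src, dst in product(sources, targets):
                if (rack_of[src] == rack_of[dst]) == same_rack:
                    pairs.append((src, dst))
    return pairs


groups = {
    "Over-Utilized": ["dn1"],
    "Above-Average": ["dn2"],
    "Below-Average": ["dn3"],
    "Under-Utilized": ["dn4"],
}
rack_of = {"dn1": "rackA", "dn2": "rackB", "dn3": "rackA", "dn4": "rackB"}

pairs = candidate_pairs(groups, rack_of)
print(pairs)
```

Here the same-rack pairs dn1 → dn3 and dn2 → dn4 come out ahead of the cross-rack pair dn1 → dn4, even though Over-Utilized → Under-Utilized ranks highest within each rack category.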

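3) Block move scheduling

For each storage group pair, the balancer picks blocks to move out of the source group. A block must satisfy all of the following conditions.

The storage type of the block replica in the source DataNode is the same as the target storage type.
- What is the storage type of a block replica? Presumably the HDFS storage types such as DISK, SSD, ARCHIVE, and RAM_DISK.

The storage type of the block replica is not already scheduled.
The target does not already have the same block replica.
The number of racks of the block is not reduced after the move.
- So the number of distinct racks holding the block's replicas (3 replicas by default) must not drop after the move?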
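The four conditions can be written down as one predicate. This is only a sketch of how I read them: the second condition is interpreted as "this block is not already scheduled for a move", and the Replica class, can_move function, and all node, rack, and block names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    block_id: str
    node: str
    rack: str
    storage_type: str


def can_move(replica, all_replicas, target_node, target_rack,
             target_storage_type, scheduled_block_ids):
    """Return True if moving `replica` to the target satisfies all four rules."""
    same_block = [r for r in all_replicas if r.block_id == replica.block_id]
    # 1. Source and target storage types must match.
    if replica.storage_type != target_storage_type:
        return False
    # 2. The block must not already be scheduled (interpretation, see above).
    if replica.block_id in scheduled_block_ids:
        return False
    # 3. The target must not already hold a replica of this block.
    if any(r.node == target_node for r in same_block):
        return False
    # 4. The move must not reduce the number of racks holding the block.
    racks_before = {r.rack for r in same_block}
    racks_after = {r.rack for r in same_block if r != replica} | {target_rack}
    return len(racks_after) >= len(racks_before)


replicas = [
    Replica("blk_1", "dn1", "rackA", "DISK"),
    Replica("blk_1", "dn2", "rackB", "DISK"),
    Replica("blk_1", "dn3", "rackB", "DISK"),
]

# Moving the only rackA replica into rackB would leave the block on one rack.
print(can_move(replicas[0], replicas, "dn4", "rackB", "DISK", set()))
# Moving it to another rackA node keeps two racks.
print(can_move(replicas[0], replicas, "dn5", "rackA", "DISK", set()))
```

The first call is rejected by the rack rule and the second is allowed, which matches the reading that the count of distinct racks must not drop.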

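4) Block move execution

The block is copied to the target DataNode.
When the copy finishes, the target DataNode notifies the NameNode.
The NameNode then deletes the replica from the original source DataNode.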
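The copy-then-delete order means the block briefly has one replica too many. A toy simulation of the three steps, assuming a plain dictionary stands in for the NameNode's block map (move_block and the block and node names are hypothetical):

```python
def move_block(block_locations, block_id, source, target):
    """Apply a copy-then-delete move and record the replica count at each step."""
    replicas = block_locations[block_id]
    counts = [len(replicas)]      # before the move

    replicas.add(target)          # steps 1-2: target copies the block and reports it
    counts.append(len(replicas))  # the block is now over-replicated

    replicas.discard(source)      # step 3: the old replica on the source is deleted
    counts.append(len(replicas))
    return counts


locations = {"blk_1": {"dn1", "dn2", "dn3"}}
counts = move_block(locations, "blk_1", "dn1", "dn4")
print(counts)
print(sorted(locations["blk_1"]))
```

The replica count goes from 3 to 4 and back to 3, so the data is never under-replicated during the move.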

References

Rebalancing HDFS Data | HDFS Commands, HDFS Permissions and HDFS Storage | InformIT (Sam R. Alapati, Jan 25, 2017), www.informit.com
Why HDFS data Becomes unbalanced, docs.cloudera.com
Cluster balancing algorithm, docs.cloudera.com

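Closing thoughts

So a block can be briefly over-replicated while the balancer is running.

So that is how the blocks to move are chosen.

I marked the parts I was unsure about along the way. If you know the answers, please leave a comment!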
