
HBase Interview Questions - HBase Question Part-3

Q21. Can major compaction be triggered manually?
Ans : Yes. Major compactions can be triggered manually from the shell, for an entire table or for a particular region. This is a relatively expensive operation and isn't done often. Minor compactions, on the other hand, are relatively lightweight and happen more frequently.
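
From the HBase shell this is done with the major_compact command (e.g. major_compact 'mytable'). As a minimal sketch using the classic pre-1.0 Java client, with "mytable" as a hypothetical table name, the same can be triggered programmatically:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
admin.majorCompact("mytable");              // compact the whole table
// admin.majorCompact("regionName");        // or a single region by name
admin.close();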

Q22. Can you explain data versioning?
Ans : In addition to being a schema-less database, HBase is also versioned.
Every time you perform an operation on a cell, HBase implicitly stores a new version. Creating, modifying, and deleting a cell are all treated identically; they’re all new versions.
Values within a cell are versioned, and versions are identified by their timestamp, a long. When a version isn't specified, the current timestamp is used as the basis for the operation. The number of cell value versions retained by HBase is configured per column family, and the default is three. When a cell exceeds the maximum number of versions, the extra records are dropped during the next major compaction. Instead of operating on an entire cell, you can also operate on a specific version or versions within that cell.
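
As a hedged illustration of reading multiple versions with the classic Java client (the row key "myrow" and the already opened 'table' instance below are hypothetical):

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

Get get = new Get(Bytes.toBytes("myrow"));
get.setMaxVersions(3);                       // request up to three versions per cell
Result result = table.get(get);              // 'table' is an open HTable instance
for (Cell cell : result.rawCells()) {
    // Each version is identified by its timestamp, a long.
    System.out.println(cell.getTimestamp() + " => "
        + Bytes.toString(CellUtil.cloneValue(cell)));
}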

Q23. Which process or component is responsible for managing the HBase RegionServers?

Ans : HMaster is the implementation of the Master Server. The Master server is responsible for monitoring all RegionServer instances in the cluster, and is the interface for all metadata changes. In a distributed cluster, the Master typically runs on the NameNode.

Q24. Which component is responsible for managing and monitoring Regions?

Ans : HRegionServer is the RegionServer implementation. It is responsible for serving and managing regions. In a distributed cluster, a RegionServer runs on a DataNode.

Q25. Why Would You Need HBase?
Ans : Use HBase when you need fault-tolerant, random, real-time read/write access to data stored in HDFS, and when you need strong data consistency. HBase provides Bigtable-like capabilities on top of Hadoop; its goal is the hosting of very large tables (billions of rows by millions of columns) atop clusters of commodity hardware.
HBase manages structured data on top of HDFS for you, using the underlying replicated storage as a backing store to gain the benefits of its fault tolerance, data availability, and locality.

Q26. When Would You Not Want To Use HBase?
Ans: A. When your data access patterns are largely sequential over immutable data; use plain MapReduce instead.
B. When your data is not large.
C. When the large overhead of extract-transform-load (ETL) of your data into an alternative such as Hive is not an issue, because you operate on the data purely in batch mode and can afford to wait, and some feature of the alternative is simply a must-have.
D. If you need to make a different trade-off between consistency and availability. HBase is a strongly consistent system, and HBase regions can be temporarily unavailable during fault recovery.
E. If you just can’t live without SQL.
F. When you really do require normalized schemas or a relational query engine.

Q27. What is the use of "HColumnDescriptor"?

Ans : An HColumnDescriptor contains information about a column family, such as the number of versions, compression settings, etc. It is used as input when creating a table or adding a column. Once set, the parameters that specify a column cannot be changed without deleting the column and recreating it. If there is data stored in the column, it will be deleted when the column is deleted.
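
A minimal sketch of the typical usage, assuming the classic pre-1.0 client API and hypothetical table ("mytable") and column family ("cf") names:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

HColumnDescriptor family = new HColumnDescriptor("cf");
family.setMaxVersions(5);                    // retain up to five cell versions

HTableDescriptor descriptor = new HTableDescriptor(TableName.valueOf("mytable"));
descriptor.addFamily(family);                // HColumnDescriptor as input to table creation

HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
admin.createTable(descriptor);
admin.close();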

Q28. In HBase, what is the problem with "Time Series Data", and can you explain hotspotting?
Ans : When dealing with stream processing of events, the most common use case is time series data. Such data could be coming from a sensor in a power grid, a stock exchange, or a monitoring system for computer systems. Its salient feature is that its row key represents the event time. This poses a problem for the way HBase arranges its rows: they are all stored sorted, in distinct ranges, namely regions with specific start and stop keys. The sequential, monotonically increasing nature of time series data causes all incoming data to be written to the same region. And since this region is hosted by a single server, all the updates tax only this one machine. This can cause regions to run hot under the load of accesses and, in the process, slow down the perceived overall performance of the cluster, because inserting data is now bound to the performance of a single machine. This is the hotspotting problem.

Q29. What is salting, and how does it help with the time series hotspotting problem?

Answer : You can overcome this problem by ensuring that data is spread over all region servers instead. This can be done, for example, by prefixing the row key with a nonsequential prefix. A common choice is salting: prepending a salt prefix to the key that guarantees a spread of all rows across all region servers. For example:

import org.apache.hadoop.hbase.util.Bytes;

// RS_COUNT stands in for the <number of region servers> placeholder.
byte prefix = (byte) (Long.hashCode(timestamp) % RS_COUNT);
byte[] rowkey = Bytes.add(new byte[] { prefix }, Bytes.toBytes(timestamp));

This formula will generate enough prefix numbers to ensure that rows are sent to all region servers. Of course, the formula assumes a specific number of servers, and if you are planning to grow your cluster you should set this number to a multiple of the expected server count instead. The generated row keys might look like this:

0myrowkey-1, 1myrowkey-2, 2myrowkey-3, 0myrowkey-4, 1myrowkey-5, 2myrowkey-6, ...

When these keys are sorted and sent to the various regions, the order would be:

0myrowkey-1
0myrowkey-4
1myrowkey-2
1myrowkey-5
2myrowkey-3
2myrowkey-6

In other words, the updates for row keys 0myrowkey-1 and 0myrowkey-4 would be sent to one region (assuming they do not overlap two regions, in which case there would be an even broader spread), and 1myrowkey-2 and 1myrowkey-5 are sent to another.

The drawback of this approach is that access to a range of rows must be fanned out in your own code and read with <number of region servers> get or scan calls. On the upside, you could use multiple threads to read this data from distinct servers, therefore parallelizing read access. This is akin to a small map-only MapReduce job, and should result in increased I/O performance. 
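
A hedged sketch of such a fanned-out range read, reusing the RS_COUNT constant from the salting formula above (startTime, endTime, the open 'table' instance, and the process(...) callback are all hypothetical):

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// One scan per salt prefix; each loop iteration could equally run in its own thread.
for (byte prefix = 0; prefix < RS_COUNT; prefix++) {
    Scan scan = new Scan(
        Bytes.add(new byte[] { prefix }, Bytes.toBytes(startTime)),
        Bytes.add(new byte[] { prefix }, Bytes.toBytes(endTime)));
    ResultScanner scanner = table.getScanner(scan);
    for (Result result : scanner) {
        process(result);                     // hypothetical per-row callback
    }
    scanner.close();
}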

Q30. What is "field swap/promotion"?

Answer : You can move the timestamp field of the row key, or prefix it with another field. This approach uses the composite row key concept to move the sequential, monotonically increasing timestamp to a secondary position in the row key. If you already have a row key with more than one field, you can swap them. If you have only the timestamp as the current row key, you need to promote another field (from the column keys, or even the value) into the row key. There is also a drawback to moving the time to the right-hand side of the composite key: you can only access data, especially time ranges, for a given swapped or promoted field.
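
A minimal sketch of such a composite key, with a hypothetical metricId promoted ahead of the timestamp (a fixed-width metricId is assumed so that keys sort correctly):

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Composite key <metricId><timestamp> instead of a bare <timestamp>: writes for
// different metrics spread across regions, while rows for one metric stay
// contiguous and time-ordered.
byte[] rowkey = Bytes.add(Bytes.toBytes(metricId), Bytes.toBytes(timestamp));

// The drawback in code: a time-range scan now works only per metric.
Scan scan = new Scan(
    Bytes.add(Bytes.toBytes(metricId), Bytes.toBytes(startTime)),
    Bytes.add(Bytes.toBytes(metricId), Bytes.toBytes(endTime)));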

