
HBase Interview Questions - HBase Questions Part 4


Q31. How does "randomization" help with time series data?

Answer : A totally different approach is to randomize the row key using, for example:

byte[] rowkey = MD5(timestamp)

Using a hash function like MD5 gives you a random distribution of the key across all available region servers. For time series data this approach is obviously less than ideal, since there is no way to scan entire ranges of consecutive timestamps. On the other hand, since you can re-create the row key by hashing the requested timestamp, it is still very well suited to random lookups of single rows. When your data is not scanned in ranges but accessed randomly, you can use this strategy.

Summarizing the various approaches, you can see that it is not trivial to find the right balance between optimizing for read and write performance. It depends on your access pattern, which ultimately drives the decision on how to structure your row keys.
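
A minimal Java sketch of the hashing idea above. The class name and the epoch-millisecond encoding via HBase's Bytes utility are assumptions; any stable byte encoding of the timestamp works, as long as lookups use the same one:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.hbase.util.Bytes;

public class RandomizedKey {

    // Hash the timestamp so rows spread evenly across all region servers.
    // The same timestamp always hashes to the same key, so single-row
    // lookups remain possible; range scans by time do not.
    public static byte[] rowKeyFor(long timestampMillis) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("MD5").digest(Bytes.toBytes(timestampMillis));
    }
}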

Q32. List the main components of HBase.

Answer: ZooKeeper, Catalog Tables, Master, RegionServer, Region

Q33. Which operational commands in HBase have you used?

Answer: There are five main operational commands in HBase (a usage sketch follows the list):

1. Get
2. Put
3. Delete
4. Scan
5. Increment
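
A hedged sketch of all five commands against an already-opened Table handle (see Q34 for opening the connection); the row, family, and qualifier names are illustrative:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class OperationalCommands {

    static void demo(Table table) throws IOException {
        byte[] row = Bytes.toBytes("row1");
        byte[] cf  = Bytes.toBytes("cf");

        // Put: write (or overwrite) a cell
        table.put(new Put(row).addColumn(cf, Bytes.toBytes("q"), Bytes.toBytes("value")));

        // Get: read a single row
        Result result = table.get(new Get(row));

        // Scan: iterate over a range of rows
        try (ResultScanner scanner = table.getScanner(new Scan())) {
            for (Result r : scanner) {
                // process each row
            }
        }

        // Increment: atomically bump a counter cell
        table.incrementColumnValue(row, cf, Bytes.toBytes("counter"), 1L);

        // Delete: remove the row (a tombstone until compaction)
        table.delete(new Delete(row));
    }
}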

Q34. Write down the Java code snippet to open a connection in HBase.

Answer: If you are opening a connection through the Java API, the following code creates one:

Configuration myConf = HBaseConfiguration.create();        // loads hbase-site.xml / hbase-default.xml from the classpath
HTableInterface usersTable = new HTable(myConf, "users");  // opens the "users" table (pre-1.0 client API)
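
HTable/HTableInterface belong to the older (pre-1.0) client API. Since HBase 1.0 the idiomatic route is ConnectionFactory; a minimal sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;

public class OpenConnection {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table usersTable = connection.getTable(TableName.valueOf("users"))) {
            // work with usersTable here; both handles close automatically
        }
    }
}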

Q35. What is the difference between HBase and Hadoop/HDFS?

Answer: HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS is a distributed file system that is well suited to the storage of large files. Its documentation states that it is not, however, a general-purpose file system, and it does not provide fast individual record lookups in files.

HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed StoreFiles that live on HDFS, which enables high-speed lookups.

The assumptions and goals of HDFS are:

  1. Hardware Failure
  2. Streaming Data Access
  3. Large Data Sets
  4. Simple Coherency Model
  5. Moving Computation is Cheaper than Moving Data
  6. Portability Across Heterogeneous Hardware and Software Platforms

HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.

Q36. What is the maximum recommended cell size?

Answer : A rough rule of thumb, with little empirical validation, is to keep the data in HDFS and store pointers to the data in HBase if you expect the cell size to be consistently above 10 MB. If you do expect large cell values and you still plan to use HBase for the storage of cell contents, you'll want to increase the block size and the maximum region size for the table to keep the index size reasonable and the split frequency acceptable.
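
A hedged sketch of those two knobs using the HBase 2.x descriptor-builder API; the table name, family name, and sizes are illustrative, and the table is assumed to exist already:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class LargeCellTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableDescriptor td = TableDescriptorBuilder.newBuilder(TableName.valueOf("bigcells"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cf"))
                    .setBlocksize(256 * 1024)              // raise the block size from the 64 KB default
                    .build())
                .setMaxFileSize(20L * 1024 * 1024 * 1024)  // let regions grow to 20 GB before splitting
                .build();
            admin.modifyTable(td);
        }
    }
}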

Q37. What happens if we change the block size of a column family on an already populated database?

Answer: When we change the block size of a column family, new data takes the new block size while the old data stays within the old block size. When compaction occurs, the old data takes the new block size as well. "New files, as they are flushed, will have the new block size, whereas existing data will continue to be read correctly. After the next major compaction, all data should be converted to the new block size."

Q38. What is the difference between HBase and an RDBMS?

Answer:

HBase:
1. It is a distributed, column-oriented, versioned data storage system.
2. HDFS is the underlying layer of HBase and provides fault tolerance and linear scalability.
3. It does not support secondary indexes and stores data as key-value pairs.
4. It supports dynamic addition of columns to the table schema; it is not a relational database.
5. It helps Hadoop overcome the challenges of random reads and writes.

RDBMS:
1. It is designed to follow a fixed schema.
2. It is row-oriented and does not natively scale to distributed storage.
3. It supports secondary indexes and improves data retrieval through the SQL language.
4. It has a gentle learning curve and supports complex joins and aggregate functions.

Q39. Explain what WAL and HLog are in HBase.

Answer: The WAL (Write-Ahead Log) is similar to the MySQL binlog: it records all changes made to the data. It is a standard Hadoop sequence file, and it stores HLogKeys. These keys consist of a sequence number as well as the actual data, and they are used to replay not-yet-persisted data after a server crash. So, in case of a server failure, the WAL works as a lifeline from which the lost data can be recovered.
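
One place the WAL surfaces in the client API is per-mutation durability: a write is acknowledged only after it reaches the WAL, and a client can explicitly trade that safety for speed. A hedged sketch; the table handle and names are assumptions:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WalDemo {
    // 'table' is an already-opened Table handle (see Q34).
    static void writeWithoutWal(Table table) throws IOException {
        Put put = new Put(Bytes.toBytes("row1"))
            .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
        put.setDurability(Durability.SKIP_WAL);  // faster, but this edit is lost if the region server crashes before a flush
        table.put(put);
    }
}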

Q40. What are column families in HBase?
Answer: Column families comprise the basic unit of physical storage in HBase, and features such as compression are applied at the column-family level.
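
For example, a table can be created with compression enabled on one family. A hedged sketch using the HBase 2.x builder API; the table name, family name, and the choice of Snappy are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class CompressedFamily {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            admin.createTable(TableDescriptorBuilder.newBuilder(TableName.valueOf("events"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("raw"))
                    .setCompressionType(Compression.Algorithm.SNAPPY)  // compression is configured per column family
                    .build())
                .build());
        }
    }
}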
