www.HadoopExam.com

HadoopExam Learning Resources


Spark SQL: Query parquet file

This is the parquet file's embedded schema metadata (the parquet.avro.schema key, truncated):

parquet.avro.schema = {"type":"record","name":"Events","namespace":"com.sample.schema.avro.ev","fields":[{"name":"eventtype","type":"string"},{"name":"event1001","type":["null",{"type":"record","name":"fieldset1001","fields":[{"name":"id","type":["null","string"],"default":null},{"name":"eventtype","type":...

This is the .avsc:

{
    "namespace" : "com.sample.schema.avro.ev",
    "name"      : "Events",
    "type"      : "record",
    "fields"    : [
    {
        "name" : "eventtype",
        "type" : "string"
    },
    {   "name" : "event1001",
        "type" : ["null",
        {
            "type" : "record",
            "name" : "fieldset1001",
            "fields" : [
            { "name" : "id", "type" : ["null", "string"], "default" : null },
            { "name" : "eventtype", "type" : ["null", "string"], "default" : null }
            ]
         } ],
        "default" : null
    },
    {   "name" : "event1002",
        "type" : ["null",
        {
            "type" : "record",
            "name" : "fieldset1002",
            "fields" : [
            { "name" : "id", "type" : ["null", "string"], "default" : null },
            { "name" : "eventtype", "type" : ["null", "string"], "default" : null },
            ...

How can I query the parquetFile to get only selected fields? I am only interested in event1001 and event2009. I then want to merge values into one row when they share the same id.

For example,

in event1001: [id|type|date1|date2]

4929102|EV02|2015-01-20 10:44:39|
4929103|EV02|2015-01-20 10:44:39|
4929104|EV02|2015-01-20 10:44:39|

in event2009: [id|type|date1|date2]

4929101|EV02||2015-01-20 20:44:39
4929102|EV02||2015-01-20 20:44:39
4929105|EV02||2015-01-20 20:44:39

The result would be (sorted by id): [eventid|id|type|date1|date2]

event2009|4929101|EV02||2015-01-20 20:44:39
event1001|4929102|EV02|2015-01-20 10:44:39|2015-01-20 20:44:39
event1001|4929103|EV02|2015-01-20 10:44:39|
event1001|4929104|EV02|2015-01-20 10:44:39|
event2009|4929105|EV02||2015-01-20 20:44:39
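Independent of Spark, the merge rule implied by the expected output can be sketched in plain Scala over the sample rows above: a full outer join on id, keeping whichever date each source supplies. The `Ev` case class and the column names `date1`/`date2` are assumptions taken from the pipe-delimited examples, not from the actual Avro schema.

```scala
// Hypothetical row type mirroring the pipe-delimited columns in the question
// ("typ" because "type" is a Scala keyword). None = empty field.
case class Ev(id: String, typ: String, date1: Option[String], date2: Option[String])

val event1001 = Seq(
  Ev("4929102", "EV02", Some("2015-01-20 10:44:39"), None),
  Ev("4929103", "EV02", Some("2015-01-20 10:44:39"), None),
  Ev("4929104", "EV02", Some("2015-01-20 10:44:39"), None))

val event2009 = Seq(
  Ev("4929101", "EV02", None, Some("2015-01-20 20:44:39")),
  Ev("4929102", "EV02", None, Some("2015-01-20 20:44:39")),
  Ev("4929105", "EV02", None, Some("2015-01-20 20:44:39")))

// Tag each row with its source event, group by id (a full outer join:
// ids present on only one side still produce a row), then collapse each
// group by taking whichever date is present on either side.
val merged = (event1001.map(("event1001", _)) ++ event2009.map(("event2009", _)))
  .groupBy { case (_, ev) => ev.id }
  .toSeq
  .sortBy { case (id, _) => id }
  .map { case (id, rows) =>
    val (eventid, first) = rows.head // first source that contributed this id
    val d1 = rows.flatMap(_._2.date1).headOption.getOrElse("")
    val d2 = rows.flatMap(_._2.date2).headOption.getOrElse("")
    s"$eventid|$id|${first.typ}|$d1|$d2"
  }

merged.foreach(println)
```

This reproduces the five result rows above, including the merged 4929102 row carrying both dates.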

This is my code:

val parquetFile = sqlContext.parquetFile("part-r-00000.snappy.parquet")
parquetFile.registerTempTable("parquetFile")
val events = sqlContext.sql("SELECT * FROM parquetFile")
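One possible way to continue from here (an untested sketch, not a verified answer): Spark SQL can reach into the nested Avro-derived records with dot syntax, so you can project just the two event structs and combine them with a FULL OUTER JOIN on id. This assumes the nested records really contain the `date1`/`date2` columns shown in the examples, and that your SQL dialect (e.g. a HiveContext on Spark 1.x) supports FULL OUTER JOIN and CASE; the table alias `events` is chosen here for readability.

```scala
// Sketch only: field names date1/date2 and the shape of event2009 are assumed
// from the example rows, not read from the real schema.
val parquetFile = sqlContext.parquetFile("part-r-00000.snappy.parquet")
parquetFile.registerTempTable("events")

val merged = sqlContext.sql("""
  SELECT CASE WHEN a.id IS NOT NULL THEN 'event1001' ELSE 'event2009' END AS eventid,
         COALESCE(a.id, b.id)               AS id,
         COALESCE(a.eventtype, b.eventtype) AS type,
         a.date1,
         b.date2
  FROM (SELECT event1001.id, event1001.eventtype, event1001.date1
        FROM events WHERE event1001 IS NOT NULL) a
  FULL OUTER JOIN
       (SELECT event2009.id, event2009.eventtype, event2009.date2
        FROM events WHERE event2009 IS NOT NULL) b
  ON a.id = b.id
  ORDER BY id
""")

merged.collect().foreach(println)
```

The FULL OUTER JOIN keeps ids that appear in only one of the two events, and COALESCE picks the non-null side, which matches the merged result shown above.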

