Spark SQL: Query parquet file
This is the parquet file:
parquet.avro.schema: {"type":"record","name":"Events","namespace":"com.sample.schema.avro.ev","fields":[{"name":"eventtype","type":"string"},{"name":"event1001","type":["null",{"type":"record","name":"fieldset1001","fields":[{"name":"id","type":["null","string"],"default":null},{"name":"eventtype","type":...
This is the .avsc:
{
"namespace" : "com.sample.schema.avro.ev",
"name" : "Events",
"type" : "record",
"fields" : [
{
"name" : "eventtype",
"type" : "string"
},
{ "name" : "event1001",
"type" : ["null",
{
"type" : "record",
"name" : "fieldset1001",
"fields" : [
{ "name" : "id", "type" : ["null", "string"], "default" : null },
{ "name" : "eventtype", "type" : ["null", "string"], "default" : null }
]
} ],
"default" : null
},
{ "name" : "event1002",
"type" : ["null",
{
"type" : "record",
"name" : "fieldset1002",
"fields" : [
{ "name" : "id", "type" : ["null", "string"], "default" : null },
{ "name" : "eventtype", "type" : ["null", "string"], "default" : null },
How can I query parquetFile to get only selected fields? I am interested only in event1001 and event2009. I then want to merge values into one row when they share the same id.
For example,
in event1001: [id|type|date1|date2]
4929102|EV02|2015-01-20 10:44:39||
4929103|EV02|2015-01-20 10:44:39||
4929104|EV02|2015-01-20 10:44:39||
in event2009: [id|type|date1|date2]
4929101|EV02||2015-01-20 20:44:39
4929102|EV02||2015-01-20 20:44:39
4929105|EV02||2015-01-20 20:44:39
The result would be (sorted by id): [eventid|id|type|date1|date2]
event2009|4929101|EV02||2015-01-20 20:44:39
event1001|4929102|EV02|2015-01-20 10:44:39|2015-01-20 20:44:39
event1001|4929103|EV02|2015-01-20 10:44:39||
event1001|4929104|EV02|2015-01-20 10:44:39||
event2009|4929105|EV02||2015-01-20 20:44:39
This is my code:
val parquetFile = sqlContext.parquetFile("part-r-00000.snappy.parquet")
parquetFile.registerTempTable("parquetFile")
val events = sqlContext.sql("SELECT * FROM parquetFile")
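One hedged way to approach this (not a definitive answer): Spark SQL lets you reach into the nested records with dotted paths, e.g. `SELECT event1001.id, event1001.eventtype FROM parquetFile WHERE event1001 IS NOT NULL`, so the merge becomes a FULL OUTER JOIN of the two projections on id, coalescing the date columns. The merge rule itself, sketched on plain Scala collections with hypothetical names (the `event1001` tag winning for ids present in both sets is an assumption read off the example result above):

```scala
// In-memory stand-in for the two projected event sets.
case class Ev(id: String, eventtype: String, date1: Option[String], date2: Option[String])

val event1001 = Seq(
  Ev("4929102", "EV02", Some("2015-01-20 10:44:39"), None),
  Ev("4929103", "EV02", Some("2015-01-20 10:44:39"), None),
  Ev("4929104", "EV02", Some("2015-01-20 10:44:39"), None))

val event2009 = Seq(
  Ev("4929101", "EV02", None, Some("2015-01-20 20:44:39")),
  Ev("4929102", "EV02", None, Some("2015-01-20 20:44:39")),
  Ev("4929105", "EV02", None, Some("2015-01-20 20:44:39")))

// Full outer join by id: rows unique to either side pass through;
// shared ids are merged, coalescing date1/date2 from both sides.
val a = event1001.map(r => r.id -> r).toMap
val b = event2009.map(r => r.id -> r).toMap
val merged = (a.keySet ++ b.keySet).toSeq.sorted.map { id =>
  (a.get(id), b.get(id)) match {
    case (Some(x), Some(y)) =>
      ("event1001", id, x.eventtype, x.date1.orElse(y.date1), x.date2.orElse(y.date2))
    case (Some(x), None) => ("event1001", id, x.eventtype, x.date1, x.date2)
    case (None, Some(y)) => ("event2009", id, y.eventtype, y.date1, y.date2)
    case (None, None)    => sys.error("unreachable: id came from one of the maps")
  }
}
merged.foreach(println)
```

On the data above this reproduces the five-row result table, with 4929102 carrying both dates. In Spark the same shape would come from joining the two subqueries on id and applying `COALESCE` to the date columns.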