common features in Hadoop and Spark
stateless / memoryless
no randomization
Hadoop features
disk-based
if a machine fails, uses replicas
Spark features
uses RDDs
stores data in memory
lazy execution
lineage tracking
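
Lazy execution and lineage tracking can be sketched in plain Python. This is a toy model, not the Spark API: the class name LazyDataset and its methods are hypothetical stand-ins for an RDD. Transformations only record a step in the lineage; the action (collect) replays the lineage over the source data, which is also how a lost result could be recomputed.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's implementation)."""

    def __init__(self, source, lineage=()):
        self.source = source      # the original input data
        self.lineage = lineage    # ordered (name, function) steps

    def map(self, fn):
        # Transformation: record the step, compute nothing.
        return LazyDataset(self.source, self.lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: record the step, compute nothing.
        return LazyDataset(self.source, self.lineage + (("filter", pred),))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        data = iter(self.source)
        for name, fn in self.lineage:
            data = map(fn, data) if name == "map" else filter(fn, data)
        return list(data)


calls = []  # records every element that double() is actually applied to


def double(x):
    calls.append(x)
    return 2 * x


ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
ran_before_action = list(calls)          # [] : nothing has run yet
steps = [name for name, _ in ds.lineage]  # ["map", "filter"]
result = ds.collect()                    # [6, 8]
```

Calling collect a second time replays the same lineage from the source, which mirrors recomputation after a failure.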
transformation operations
filter
mapPartitions and other maps
sample
reduceByKey
distinct
groupByKey
action operations
reduce
aggregation
takeSample
countByKey
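
The semantics of a few of the key-based operations listed above can be illustrated on a plain Python list of (key, value) pairs. This is only a sketch of what the operations compute; the helper names reduce_by_key and count_by_key are hypothetical and are not Spark functions. The transformation-like helper returns a new collection of pairs, while the action-like helpers return plain values to the caller.

```python
from collections import Counter
from functools import reduce

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]


def reduce_by_key(pairs, fn):
    """Transformation-like: returns a new list of (key, combined value)."""
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return sorted(acc.items())


def count_by_key(pairs):
    """Action-like: returns a plain dict of per-key counts."""
    return dict(Counter(k for k, _ in pairs))


summed = reduce_by_key(pairs, lambda x, y: x + y)           # [("a", 9), ("b", 6)]
counts = count_by_key(pairs)                                # {"a": 3, "b": 2}
total = reduce(lambda x, y: x + y, (v for _, v in pairs))   # 15
unique_keys = sorted({k for k, _ in pairs})                 # distinct keys: ["a", "b"]
```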
wide dependency
reduceByKey
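
Why reduceByKey is a wide dependency can be sketched with lists standing in for partitions. This is a simplified model under stated assumptions: integer keys, a key % num_out partitioner, and hypothetical helper names; real Spark partitioning and shuffling are more involved. A filter reads one input partition per output partition (narrow), while the by-key reduction routes records from several input partitions into each output partition (wide).

```python
def filter_partitions(partitions, pred):
    # Narrow dependency: output partition i depends only on input partition i.
    return [[kv for kv in part if pred(kv)] for part in partitions]


def reduce_by_key(partitions, fn, num_out):
    # Wide dependency: an output partition may read from many input partitions.
    out = [dict() for _ in range(num_out)]
    sources = [set() for _ in range(num_out)]  # input partitions read by each output
    for i, part in enumerate(partitions):
        for k, v in part:
            j = k % num_out                    # assumed partitioner: route by key
            sources[j].add(i)
            out[j][k] = fn(out[j][k], v) if k in out[j] else v
    return [sorted(d.items()) for d in out], sources


partitions = [[(1, 10), (2, 20)], [(1, 5), (3, 7)], [(2, 1), (4, 4)]]

kept = filter_partitions(partitions, lambda kv: kv[1] >= 5)
# [[(1, 10), (2, 20)], [(1, 5), (3, 7)], []]

reduced, sources = reduce_by_key(partitions, lambda x, y: x + y, num_out=2)
# reduced == [[(2, 21), (4, 4)], [(1, 15), (3, 7)]]
# sources == [{0, 2}, {0, 1}]
```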
difference between Spark RDD and Spark SQL filter function
Spark SQL filter function is more efficient
Spark SQL understands the filter's logic, while Spark RDD sees the predicate only as an opaque function
Key difference between Spark RDD and Spark SQL
Spark SQL can achieve higher performance since it understands the logic of the operations
Spark SQL employs the Catalyst query optimizer
Spark SQL is aware of the data schema, while Spark RDD is schema-agnostic
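
The difference between an opaque predicate and one the engine can inspect can be sketched in plain Python. This is an illustration only, not how Catalyst is implemented: the tuple expression format and the helpers rdd_filter, sql_filter and columns_needed are hypothetical. With a function, the engine can only call it on each row; with the predicate expressed as data, the engine can also reason about it, for example to work out which columns need to be read.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}

rows = [
    {"name": "Ann", "age": 34, "city": "Oslo"},
    {"name": "Bob", "age": 25, "city": "Rome"},
    {"name": "Cy", "age": 41, "city": "Lima"},
]


def rdd_filter(rows, pred):
    # RDD style: pred is an opaque function; the engine can only call it.
    return [r for r in rows if pred(r)]


def sql_filter(rows, expr):
    # SQL style: the predicate is data (column, operator, literal).
    column, op, literal = expr
    return [r for r in rows if OPS[op](r[column], literal)]


def columns_needed(expr, projection):
    # Toy optimizer step: because the expression is visible, the minimal
    # set of columns to read can be derived (column pruning).
    return sorted({expr[0], *projection})


via_function = rdd_filter(rows, lambda r: r["age"] > 30)
via_expression = sql_filter(rows, ("age", ">", 30))
needed = columns_needed(("age", ">", 30), ["name"])   # ["age", "name"]
```

Both filters return the same rows; only the expression form gives the engine something to optimize.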
operations
filter
projection
cross product
aggregation
union
intersection
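
The operations listed above can be shown on two small relations modelled as Python sets of tuples. This is a sketch of the relational semantics only, not Spark SQL code; the relations r and s are made-up example data.

```python
from itertools import product

r = {(1, "a"), (2, "b"), (3, "c")}
s = {(2, "b"), (4, "d")}

selected = {t for t in r if t[0] >= 2}        # filter: keep rows matching a predicate
projected = {(t[1],) for t in r}              # projection: keep only the second column
crossed = {x + y for x, y in product(r, s)}   # cross product: every pairing of rows
total = sum(t[0] for t in r)                  # aggregation: sum of the first column
unioned = r | s                               # union
common = r & s                                # intersection
```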