Ingesting a JSON File in Azure Databricks




We have a JSON file and need to create a DataFrame from it. The Python code below reads the file into a DataFrame.


Python Code


# The JSON reader infers the schema by default; "header" and "inferSchema"
# are CSV options and have no effect on a JSON read, so they are omitted.
Dataframe_df = spark.read.json("/FileStore/dataset/constructors.json")
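
By default, Spark's JSON reader expects one JSON object per line (the JSON Lines layout). As a rough sketch of what such a file looks like, the snippet below writes a small sample file using only the Python standard library. The row values, the file name constructors_sample.json, and the temporary directory are assumptions made up for illustration; only the field names come from the columns used in this post.

```python
import json
import os
import tempfile

# Hypothetical sample rows; only the field names mirror the columns in this post.
sample_rows = [
    {"constructorid": 1, "constructorRef": "team_a", "name": "Team A",
     "nationality": "British", "url": "http://example.com/team_a"},
    {"constructorid": 2, "constructorRef": "team_b", "name": "Team B",
     "nationality": "German", "url": "http://example.com/team_b"},
]

# Write one JSON object per line (JSON Lines).
sample_path = os.path.join(tempfile.mkdtemp(), "constructors_sample.json")
with open(sample_path, "w", encoding="utf-8") as handle:
    for row in sample_rows:
        handle.write(json.dumps(row) + "\n")

# Read the file back to confirm that every line parses on its own.
with open(sample_path, encoding="utf-8") as handle:
    parsed_rows = [json.loads(line) for line in handle]

print(len(parsed_rows), "rows written to", sample_path)
```

If a file instead holds pretty-printed objects or a single JSON array spanning several lines, the reader needs .option("multiLine", True) before the .json(...) call.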


Use the command below to view the schema of the DataFrame.


Python Code


Dataframe_df.printSchema()

Once the DataFrame is created, use the command below to check the contents that were loaded from the JSON file.


Python Code


display(Dataframe_df)

To remove a column from a DataFrame, use the command below. Note that drop returns a new DataFrame and leaves the original unchanged.


Python Code


Dataframe_df_Dropped=Dataframe_df.drop('url')


After dropping the column, we can display the DataFrame again. The "url" column is no longer present.


Python Code


display(Dataframe_df_Dropped)


To remove multiple columns from a DataFrame, pass every column name to drop, as in the command below.


Python Code


Dataframe_df_Drop_multiple_columns=Dataframe_df.drop('url','name')
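
One thing to watch for: drop quietly ignores any name that does not match a column, so a typo in a column name goes unnoticed. A minimal sketch of a guard is shown below. The helper drop_columns_checked is hypothetical (it is not part of PySpark), and it assumes Spark's default case-insensitive column resolution.

```python
def drop_columns_checked(df, columns_to_drop):
    """Drop the given columns, raising if any of them is not in the DataFrame."""
    # Compare case-insensitively, on the assumption that Spark is running with
    # its default (case-insensitive) column resolution.
    existing = {name.lower() for name in df.columns}
    missing = [name for name in columns_to_drop if name.lower() not in existing]
    if missing:
        raise ValueError(f"Columns not found: {missing}")
    # drop() accepts several column names, so unpack the list into arguments.
    return df.drop(*columns_to_drop)
```

In a notebook this could be called as drop_columns_checked(Dataframe_df, ["url", "name"]).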


You can again check the structure and content of the DataFrame after removing multiple columns.


Python Code


display(Dataframe_df_Drop_multiple_columns)


To rename a column in a DataFrame, use the command below.


Python Code


Dataframe_df_Final=Dataframe_df_Drop_multiple_columns.withColumnRenamed("constructorid","Constructor_id")


To rename multiple columns, chain the withColumnRenamed calls as shown below.


Python Code


Dataframe_df_Final=Dataframe_df_Drop_multiple_columns.withColumnRenamed("constructorid","Constructor_id") \
                                                     .withColumnRenamed("constructorRef","Constructor_Ref") \
                                                     .withColumnRenamed("nationality","Nationality")
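
When the list of renames grows, the chained calls can be replaced by a loop over a mapping. The sketch below is one possible way to do that. The helper rename_columns and the dictionary constructor_renames are hypothetical names introduced here; the old and new column names are the ones used in the chain above.

```python
def rename_columns(df, renames):
    """Apply each old -> new rename in turn and return the resulting DataFrame."""
    for old_name, new_name in renames.items():
        # withColumnRenamed returns a new DataFrame; the input is left unchanged.
        df = df.withColumnRenamed(old_name, new_name)
    return df


# Mapping of existing column names to their new names.
constructor_renames = {
    "constructorid": "Constructor_id",
    "constructorRef": "Constructor_Ref",
    "nationality": "Nationality",
}
```

In a notebook this could be used as Dataframe_df_Final = rename_columns(Dataframe_df_Drop_multiple_columns, constructor_renames). Like drop, withColumnRenamed does nothing when the old name is not found, so it is worth checking the schema afterwards.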


The code below renames multiple columns and also adds a new column that holds the current timestamp.


Python Code


from pyspark.sql.functions import current_timestamp
Dataframe_df_Final=Dataframe_df_Drop_multiple_columns.withColumnRenamed("constructorid","Constructor_id") \
                                                     .withColumnRenamed("constructorRef","Constructor_Ref") \
                                                     .withColumnRenamed("nationality","Nationality") \
                                                     .withColumn("Ingestion_date",current_timestamp())

