Examples of adding a new column to a DataFrame in PySpark
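Pythonistas who know pandas will be used to adding a column to a DataFrame with a simple dictionary-style assignment. PySpark works differently: a DataFrame is immutable, so every new column produces a new DataFrame. After some experimenting, I found the following ways to do it.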
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions

# Create (or reuse) a SparkSession with the default configuration
spark = SparkSession.builder.config(conf=SparkConf()).getOrCreate()

# "detail" holds a JSON-encoded copy of the other three fields
data = [
    ['Alice', 19, 'blue', '["Alice", 19, "blue"]'],
    ['Jane', 20, 'green', '["Jane", 20, "green"]'],
    ['Mary', 21, 'blue', '["Mary", 21, "blue"]'],
]
frame = spark.createDataFrame(data, schema=["name", "age", "eye_color", "detail"])
frame.cache()
frame.show()
+-----+---+---------+--------------------+
| name|age|eye_color|              detail|
+-----+---+---------+--------------------+
|Alice| 19|     blue|["Alice", 19, "bl...|
| Jane| 20|    green|["Jane", 20, "gre...|
| Mary| 21|     blue|["Mary", 21, "blue"]|
+-----+---+---------+--------------------+
1. Adding a constant column
# functions.lit wraps a Python literal as a Column, so every row gets the same value
frame2 = frame.withColumn("constant", functions.lit(10))
frame2.show()
+-----+---+---------+--------------------+--------+
| name|age|eye_color|              detail|constant|
+-----+---+---------+--------------------+--------+
|Alice| 19|     blue|["Alice", 19, "bl...|      10|
| Jane| 20|    green|["Jane", 20, "gre...|      10|
| Mary| 21|     blue|["Mary", 21, "blue"]|      10|
+-----+---+---------+--------------------+--------+
2. Simple computation based on an existing column
2.1 Using withColumn
# withColumn keeps all existing columns and appends the new one
frame3_1 = frame.withColumn("name_length", functions.length(frame.name))
frame3_1.show()
+-----+---+---------+--------------------+-----------+
| name|age|eye_color|              detail|name_length|
+-----+---+---------+--------------------+-----------+
|Alice| 19|     blue|["Alice", 19, "bl...|          5|
| Jane| 20|    green|["Jane", 20, "gre...|          4|
| Mary| 21|     blue|["Mary", 21, "blue"]|          4|
+-----+---+---------+--------------------+-----------+
2.2 Using select
# select returns only the listed columns, so name every column you want to keep
frame3_2 = frame.select(["name", functions.length(frame.name).alias("name_length")])
frame3_2.show()
+-----+-----------+
| name|name_length|
+-----+-----------+
|Alice|          5|
| Jane|          4|
| Mary|          4|
+-----+-----------+
2.3 Using selectExpr
# selectExpr takes SQL expression strings instead of Column objects
frame3_3 = frame.selectExpr(["name", "length(name) as name_length"])
frame3_3.show()
+-----+-----------+
| name|name_length|
+-----+-----------+
|Alice|          5|
| Jane|          4|
| Mary|          4|
+-----+-----------+
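3. Custom computation based on an existing column
What if I want to apply some operation to a column, but there is no built-in function for it? Build one yourself, as a user-defined function (UDF).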
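import json
from pyspark.sql.types import IntegerType

# Option 1: wrap a lambda as a UDF. The return type is declared explicitly,
# because a UDF returns strings by default.
detail_length = functions.udf(lambda obj: len(json.loads(obj)), IntegerType())
frame4 = frame.withColumn("detail_length", detail_length(frame.detail))

# Option 2: wrap a named function
def length_detail(obj):
    return len(json.loads(obj))

frame4 = frame.withColumn("detail_length", functions.udf(length_detail, IntegerType())(frame.detail))
frame4.show()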
+-----+---+---------+--------------------+-------------+
| name|age|eye_color|              detail|detail_length|
+-----+---+---------+--------------------+-------------+
|Alice| 19|     blue|["Alice", 19, "bl...|            3|
| Jane| 20|    green|["Jane", 20, "gre...|            3|
| Mary| 21|     blue|["Mary", 21, "blue"]|            3|
+-----+---+---------+--------------------+-------------+
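Because a UDF just wraps an ordinary Python function that Spark calls once per row, the function can be checked on plain strings before it is attached to a DataFrame. The sketch below does that for length_detail. It also adds a hypothetical variant, safe_length_detail, which is not part of the original example and is only an assumption about how one might guard against null or malformed values instead of letting the exception surface inside a Spark task.

```python
import json


def length_detail(obj):
    """Number of elements in a JSON-encoded array string."""
    return len(json.loads(obj))


def safe_length_detail(obj):
    """Hypothetical variant: return None instead of raising on bad input."""
    if obj is None:
        return None
    try:
        parsed = json.loads(obj)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    # Only a JSON array has a meaningful element count here
    return len(parsed) if isinstance(parsed, list) else None


# The same strings as the "detail" column above
rows = ['["Alice", 19, "blue"]', '["Jane", 20, "green"]', '["Mary", 21, "blue"]']
print([length_detail(r) for r in rows])   # [3, 3, 3]
print(safe_length_detail(None))           # None
print(safe_length_detail("not json"))     # None
```

Either function can then be wrapped as a UDF in the same way as length_detail above. A None returned from the Python function shows up as null in the resulting column.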