pyspark.sql.DataFrame

class pyspark.sql.DataFrame(jdf, sql_ctx)

A distributed collection of data grouped into named columns.

A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SparkSession:

    people = spark.read.parquet("...")

Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in DataFrame and Column.

To select a column from the DataFrame, use the apply method:

    ageCol = people.age

A more concrete example:

    # To create a DataFrame using SparkSession
    people = spark.read.parquet("...")
    department = spark.read.parquet("...")

    people.filter(people.age > 30) \
        .join(department, people.deptId == department.id) \
        .groupBy(department.name, "gender") \
        .agg({"salary": "avg", "age": "max"})

New in version 1.3.0.
Methods
agg(*exprs): Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()).
alias(alias): Returns a new DataFrame with an alias set.
approxQuantile(col, probabilities, relativeError): Calculates the approximate quantiles of numerical columns of a DataFrame.
cache(): Persists the DataFrame with the default storage level (MEMORY_AND_DISK).
checkpoint([eager]): Returns a checkpointed version of this DataFrame.
coalesce(numPartitions): Returns a new DataFrame that has exactly numPartitions partitions.
colRegex(colName): Selects a column based on the column name specified as a regex and returns it as a Column.
collect(): Returns all the records as a list of Row.
corr(col1, col2[, method]): Calculates the correlation of two columns of a DataFrame as a double value.
count(): Returns the number of rows in this DataFrame.
cov(col1, col2): Calculates the sample covariance for the given columns, specified by their names, as a double value.
createGlobalTempView(name): Creates a global temporary view with this DataFrame.
createOrReplaceGlobalTempView(name): Creates or replaces a global temporary view using the given name.
createOrReplaceTempView(name): Creates or replaces a local temporary view with this DataFrame.
createTempView(name): Creates a local temporary view with this DataFrame.
crossJoin(other): Returns the cartesian product with another DataFrame.
crosstab(col1, col2): Computes a pair-wise frequency table of the given columns.
cube(*cols): Creates a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregations on them.
describe(*cols): Computes basic statistics for numeric and string columns.
distinct(): Returns a new DataFrame containing the distinct rows in this DataFrame.
drop(*cols): Returns a new DataFrame that drops the specified column.
dropDuplicates([subset]): Returns a new DataFrame with duplicate rows removed, optionally only considering certain columns.
drop_duplicates([subset]): drop_duplicates() is an alias for dropDuplicates().
dropna([how, thresh, subset]): Returns a new DataFrame omitting rows with null values.
exceptAll(other): Returns a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates.
explain([extended, mode]): Prints the (logical and physical) plans to the console for debugging purposes.
fillna(value[, subset]): Replaces null values; alias for na.fill().
filter(condition): Filters rows using the given condition.
first(): Returns the first row as a Row.
foreach(f): Applies the f function to all Rows of this DataFrame.
foreachPartition(f): Applies the f function to each partition of this DataFrame.
freqItems(cols[, support]): Finds frequent items for columns, possibly with false positives.
groupBy(*cols): Groups the DataFrame using the specified columns, so we can run aggregation on them.
groupby(*cols): groupby() is an alias for groupBy().
head([n]): Returns the first n rows.
hint(name, *parameters): Specifies some hint on the current DataFrame.
inputFiles(): Returns a best-effort snapshot of the files that compose this DataFrame.
intersect(other): Returns a new DataFrame containing rows only in both this DataFrame and another DataFrame.
intersectAll(other): Returns a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates.
isLocal(): Returns True if the collect() and take() methods can be run locally (without any Spark executors).
join(other[, on, how]): Joins with another DataFrame, using the given join expression.
limit(num): Limits the result count to the number specified.
localCheckpoint([eager]): Returns a locally checkpointed version of this DataFrame.
mapInPandas(func, schema): Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a DataFrame.
orderBy(*cols, **kwargs): Returns a new DataFrame sorted by the specified column(s).
persist([storageLevel]): Sets the storage level to persist the contents of the DataFrame across operations after the first time it is computed.
printSchema(): Prints out the schema in the tree format.
randomSplit(weights[, seed]): Randomly splits this DataFrame with the provided weights.
registerTempTable(name): Registers this DataFrame as a temporary table using the given name.
repartition(numPartitions, *cols): Returns a new DataFrame partitioned by the given partitioning expressions.
repartitionByRange(numPartitions, *cols): Returns a new DataFrame partitioned by the given partitioning expressions.
replace(to_replace[, value, subset]): Returns a new DataFrame replacing a value with another value.
rollup(*cols): Creates a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them.
sameSemantics(other): Returns True when the logical query plans inside both DataFrames are equal and therefore return the same results.
sample([withReplacement, fraction, seed]): Returns a sampled subset of this DataFrame.
sampleBy(col, fractions[, seed]): Returns a stratified sample without replacement based on the fraction given on each stratum.
select(*cols): Projects a set of expressions and returns a new DataFrame.
selectExpr(*expr): Projects a set of SQL expressions and returns a new DataFrame.
semanticHash(): Returns a hash code of the logical query plan against this DataFrame.
show([n, truncate, vertical]): Prints the first n rows to the console.
sort(*cols, **kwargs): Returns a new DataFrame sorted by the specified column(s).
sortWithinPartitions(*cols, **kwargs): Returns a new DataFrame with each partition sorted by the specified column(s).
subtract(other): Returns a new DataFrame containing rows in this DataFrame but not in another DataFrame.
summary(*statistics): Computes specified statistics for numeric and string columns.
tail(num): Returns the last num rows as a list of Row.
take(num): Returns the first num rows as a list of Row.
toDF(*cols): Returns a new DataFrame with new specified column names.
toJSON([use_unicode]): Converts a DataFrame into an RDD of string.
toLocalIterator([prefetchPartitions]): Returns an iterator that contains all of the rows in this DataFrame.
toPandas(): Returns the contents of this DataFrame as a pandas.DataFrame.
to_koalas([index_col]): Deprecated alias of to_pandas_on_spark().
to_pandas_on_spark([index_col]): Converts the existing DataFrame into a pandas-on-Spark DataFrame.
transform(func): Returns a new DataFrame; concise syntax for chaining custom transformations.
union(other): Returns a new DataFrame containing the union of rows in this and another DataFrame.
unionAll(other): Returns a new DataFrame containing the union of rows in this and another DataFrame.
unionByName(other[, allowMissingColumns]): Returns a new DataFrame containing the union of rows in this and another DataFrame, matching columns by name.
unpersist([blocking]): Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk.
where(condition): where() is an alias for filter().
withColumn(colName, col): Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
withColumnRenamed(existing, new): Returns a new DataFrame by renaming an existing column.
withWatermark(eventTime, delayThreshold): Defines an event time watermark for this DataFrame.
writeTo(table): Creates a write configuration builder for v2 sources.
Attributes
columns: Returns all column names as a list.
dtypes: Returns all column names and their data types as a list.
isStreaming: Returns True if this DataFrame contains one or more sources that continuously return data as it arrives.
na: Returns a DataFrameNaFunctions for handling missing values.
rdd: Returns the content as a pyspark.RDD of Row.
schema: Returns the schema of this DataFrame as a pyspark.sql.types.StructType.
stat: Returns a DataFrameStatFunctions for statistic functions.
storageLevel: Gets the DataFrame's current storage level.
write: Interface for saving the content of the non-streaming DataFrame out into external storage.
writeStream: Interface for saving the content of the streaming DataFrame out into external storage.