Class: Dataset

eclairjs/sql.Dataset

A distributed collection of data organized into named columns. A Dataset is equivalent to a relational table in Spark SQL.

Constructor

new Dataset()

Source:
Examples
var people = sqlContext.read().parquet("...");
// Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in:
// Dataset (this class), Column, and functions.
// To select a column from the Dataset:
var ageCol = people.col("age");

Methods

agg() → {module:eclairjs/sql.Dataset}

Aggregates on the entire Dataset without groups.
Parameters:
Name Type Description
exprs hashMap map of column name to aggregate method, for example {"age": "max"}
Source:
Returns:
Type
module:eclairjs/sql.Dataset
Example
// df.agg(...) is a shorthand for df.groupBy().agg(...)
var map = {};
map["age"] = "max";
map["salary"] = "avg";
df.agg(map)
df.groupBy().agg(map)

alias(alias) → {module:eclairjs/sql.Dataset}

Returns a new Dataset with an alias set. Same as `as`.
Parameters:
Name Type Description
alias string
Since:
  • EclairJS 0.7 Spark 2.0.0
Source:
Returns:
Type
module:eclairjs/sql.Dataset
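Example (illustrative sketch, not from the original reference; df and its "id" column are assumed)
// Alias both sides of a self-join so the columns can be told apart.
var left = df.alias("l");
var right = df.alias("r");
var joined = left.join(right, left.col("l.id").equalTo(right.col("r.id")));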

apply(colName) → {module:eclairjs/sql.Column}

Selects a column based on the column name and returns it as a Column. Note that the column name can also reference a nested column like a.b.
Parameters:
Name Type Description
colName string
Source:
Returns:
Type
module:eclairjs/sql.Column

as(alias) → {module:eclairjs/sql.Dataset}

Returns a new Dataset with an alias set.
Parameters:
Name Type Description
alias string
Source:
Returns:
Type
module:eclairjs/sql.Dataset

cache() → {module:eclairjs/sql.Dataset}

Persist this Dataset with the default storage level (`MEMORY_ONLY`).
Source:
Returns:
Type
module:eclairjs/sql.Dataset

coalesce(numPartitions) → {module:eclairjs/sql.Dataset}

Returns a new Dataset that has exactly numPartitions partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.
Parameters:
Name Type Description
numPartitions integer
Source:
Returns:
Type
module:eclairjs/sql.Dataset

col(name) → {module:eclairjs/sql.Column}

Selects a column based on the column name and returns it as a Column.
Parameters:
Name Type Description
name string
Source:
Returns:
Type
module:eclairjs/sql.Column

collect() → {Array.<object>}

Returns an array that contains all of the objects in this Dataset.
Source:
Returns:
Type
Array.<object>
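Example (illustrative sketch, not from the original reference; df is an assumed small Dataset whose first column is a string)
var rows = df.collect();
// rows is a plain JavaScript array of Row objects
var firstValue = rows[0].getString(0); // reads the first column of the first row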

columns() → {Array.<string>}

Returns all column names as an array.
Source:
Returns:
Type
Array.<string>

count() → {integer}

Returns the number of rows in the Dataset.
Source:
Returns:
Type
integer

createOrReplaceTempView(viewName)

Creates a temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.
Parameters:
Name Type Description
viewName string
Since:
  • EclairJS 0.7 Spark 2.0.0
Source:
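Example (illustrative sketch, not from the original reference; sparkSession and df are assumed to exist)
df.createOrReplaceTempView("people");
// the view can now be queried with SQL through the owning SparkSession
var adults = sparkSession.sql("SELECT name FROM people WHERE age > 20");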

createTempView(viewName)

Creates a temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.
Parameters:
Name Type Description
viewName string
Since:
  • EclairJS 0.7 Spark 2.0.0
Source:
Throws:
AnalysisException if the view name already exists

cube() → {module:eclairjs/sql.GroupedData}

Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them.
Parameters:
Name Type Description
cols ...string | module:eclairjs/sql.Column
Source:
Returns:
Type
module:eclairjs/sql.GroupedData
Example
var df = peopleDataset.cube("age", "expense");

describe() → {module:eclairjs/sql.Dataset}

Computes statistics for numeric columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical columns. This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.
Parameters:
Name Type Description
cols ...string
Source:
Returns:
Type
module:eclairjs/sql.Dataset
Example
var df = peopleDataset.describe("age", "expense");

distinct() → {module:eclairjs/sql.Dataset}

Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for dropDuplicates.
Source:

drop(col) → {module:eclairjs/sql.Dataset}

Returns a new Dataset with a column dropped.
Parameters:
Name Type Description
col string | module:eclairjs/sql.Column
Source:
Returns:
Type
module:eclairjs/sql.Dataset

dropDuplicates(colNamesopt) → {module:eclairjs/sql.Dataset}

Returns a new Dataset that contains only the unique rows from this Dataset; if colNames is given, only that subset of columns is considered.
Parameters:
Name Type Attributes Description
colNames Array.<string> <optional>
Source:
Returns:
Type
module:eclairjs/sql.Dataset
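Example (illustrative sketch, not from the original reference; df and its "name" column are assumed)
// unique rows across all columns
var unique = df.dropDuplicates();
// unique rows considering only the "name" column
var uniqueNames = df.dropDuplicates(["name"]);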

dtypes() → {Array}

Returns all column names and their data types as an array of arrays. ex. [["name","StringType"],["age","IntegerType"],["expense","IntegerType"]]
Source:
Returns:
Array of Array[2]
Type
Array

except(otherDataset) → {module:eclairjs/sql.Dataset}

Returns a new Dataset containing rows in this Dataset but not in the other Dataset. This is equivalent to EXCEPT in SQL.
Parameters:
Name Type Description
otherDataset module:eclairjs/sql.Dataset to compare to this Dataset
Source:
Returns:
Type
module:eclairjs/sql.Dataset
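Example (illustrative sketch, not from the original reference; df1 and df2 are assumed Datasets with the same schema)
// rows of df1 that do not also appear in df2
var onlyInDf1 = df1.except(df2);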

explain(extended)

Prints the plans (logical and physical) to the console for debugging purposes.
Parameters:
Name Type Description
extended boolean If false, only the physical plan is printed.
Source:

filter() → {module:eclairjs/sql.Dataset}

Filters rows using the given SQL expression string, Column, or function.
Parameters:
Type Description
string | module:eclairjs/sql.Column | function
Source:
Returns:
Type
module:eclairjs/sql.Dataset
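Example (illustrative sketch, not from the original reference; df and its numeric "age" column are assumed)
// SQL expression string
var adults = df.filter("age > 20");
// equivalent Column expression
var adults2 = df.filter(df.col("age").gt(20));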

flatMap(func, encoder, bindArgsopt) → {module:eclairjs/sql.Dataset}

Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
Parameters:
Name Type Attributes Description
func function
encoder module:eclairjs/sql.Encoder
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
Type
module:eclairjs/sql.Dataset
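Example (illustrative sketch, not from the original reference; assumes lines is a Dataset of strings and that an eclairjs/sql/Encoders module exposing STRING() is available)
var Encoders = require('eclairjs/sql/Encoders');
// split each line into words, producing one string element per word
var words = lines.flatMap(function(line) {
  return line.split(" ");
}, Encoders.STRING());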

foreach(Function, bindArgsopt) → {void}

Applies a function to all elements of this Dataset.
Parameters:
Name Type Attributes Description
Function function with one parameter
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
Type
void
Example
df.foreach(function(record) {
   var connection = createNewConnection();
   connection.send(record);
   connection.close();
});

foreachPartition(Function, bindArgsopt) → {void}

Applies a function to each partition of this Dataset.
Parameters:
Name Type Attributes Description
Function function with one Array parameter
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
Type
void
Example
df.foreachPartition(function(partitionOfRecords) {
   var connection = createNewConnection();
   partitionOfRecords.forEach(function(record){
      connection.send(record);
   });
   connection.close();
});

groupBy() → {module:eclairjs/sql.RelationalGroupedDataset}

Groups the Dataset using the specified columns, so we can run aggregation on them.
Parameters:
Type Description
Array.<string> | Array.<module:eclairjs/sql.Column> Array of Column objects or of column name strings
Source:
Returns:
Type
module:eclairjs/sql.RelationalGroupedDataset
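Example (illustrative sketch, not from the original reference; df and its "age" and "salary" columns are assumed)
// count rows per age
var countsByAge = df.groupBy("age").count();
// average salary per age, using the grouped agg function
var avgSalaryByAge = df.groupBy(df.col("age")).agg({"salary": "avg"});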

groupByKey(func, encoder) → {module:eclairjs/sql.KeyValueGroupedDataset}

:: Experimental :: Returns a KeyValueGroupedDataset where the data is grouped by the given key `func`.
Parameters:
Name Type Description
func MapFunction
encoder module:eclairjs/sql.Encoder
Since:
  • EclairJS 0.7 Spark 2.0.0
Source:
Returns:
Type
module:eclairjs/sql.KeyValueGroupedDataset

head(nopt) → {module:eclairjs/sql.Row}

Returns the first row.
Parameters:
Name Type Attributes Description
n number <optional>
Source:
Returns:
Type
module:eclairjs/sql.Row

inputFiles() → {Array.<string>}

Returns a best-effort snapshot of the files that compose this Dataset. This method simply asks each constituent BaseRelation for its respective files and takes the union of all results. Depending on the source relations, this may not find all input files. Duplicates are removed.
Source:
Returns:
files
Type
Array.<string>

intersect(other) → {module:eclairjs/sql.Dataset}

Returns a new Dataset containing rows only in both this Dataset and the other Dataset. This is equivalent to INTERSECT in SQL.
Parameters:
Name Type Description
other module:eclairjs/sql.Dataset
Source:
Returns:
Type
module:eclairjs/sql.Dataset

isLocal() → {boolean}

Returns true if the collect and take methods can be run locally (without any Spark executors).
Source:
Returns:
Type
boolean

isStreaming() → {boolean}

Returns true if this Dataset contains one or more sources that continuously return data as it arrives. A Dataset that reads data from a streaming source must be executed as a StreamingQuery using the `start()` method in DataStreamWriter. Methods that return a single answer, e.g. `count()` or `collect()`, will throw an AnalysisException when there is a streaming source present.
Since:
  • EclairJS 0.7 Spark 2.0.0
Source:
Returns:
Type
boolean

join(Right, columnNamesOrJoinExpropt, joinTypeopt) → {module:eclairjs/sql.Dataset}

Joins this Dataset with another Dataset. With no join columns or join expression this is a cartesian join; note that cartesian joins are very expensive without an extra filter that can be pushed down.
Parameters:
Name Type Attributes Description
Right module:eclairjs/sql.Dataset Right side of the join operation.
columnNamesOrJoinExpr string | Array.<string> | module:eclairjs/sql.Column <optional>
If a string or array of strings (column names), performs an inner equi-join with the other Dataset using the given columns; different from other join functions, the join columns appear only once in the output, similar to SQL's JOIN USING syntax. If a Column object, performs an inner join using the given join expression.
joinType string <optional>
only valid if using Column joinExprs.
Source:
Returns:
Type
module:eclairjs/sql.Dataset
Example
var joinedDf = df1.join(df2);
// or
var joinedDf = df1.join(df2,"age");
// or
var joinedDf = df1.join(df2, ["age", "DOB"]);
// or Column joinExpr
var joinedDf = df1.join(df2, df1.col("name").equalTo(df2.col("name")));
// or Column joinExpr
var joinedDf = df1.join(df2, df1.col("name").equalTo(df2.col("name")), "outer");

joinWith(other, condition, joinTypeopt) → {module:eclairjs/sql.Dataset}

:: Experimental :: Joins this Dataset with another Dataset, returning a module:eclairjs.Tuple2 for each pair where `condition` evaluates to true. This is similar to the relational `join` function with one important difference in the result schema. Since `joinWith` preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names `_1` and `_2`. This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common.
Parameters:
Name Type Attributes Description
other module:eclairjs/sql.Dataset Right side of the join.
condition module:eclairjs/sql.Column Join expression.
joinType string <optional>
One of: `inner`, `outer`, `left_outer`, `right_outer`, `leftsemi`.
Since:
  • EclairJS 0.7 Spark 1.6.0
Source:
Returns:
Type
module:eclairjs/sql.Dataset

limit(number) → {module:eclairjs/sql.Dataset}

Returns a new Dataset by taking the first n rows. The difference between this function and head is that head returns an array while limit returns a new Dataset.
Parameters:
Name Type Description
number integer
Source:
Returns:
Type
module:eclairjs/sql.Dataset

map(func, encoder, bindArgsopt) → {module:eclairjs/sql.Dataset}

Returns a new Dataset that contains the result of applying func to each element.
Parameters:
Name Type Attributes Description
func function
encoder module:eclairjs/sql.Encoder
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
Type
module:eclairjs/sql.Dataset
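Example (illustrative sketch, not from the original reference; assumes the first column of df is a string name and that an eclairjs/sql/Encoders module exposing STRING() is available)
var Encoders = require('eclairjs/sql/Encoders');
// build a Dataset of greeting strings, one per input row
var greetings = df.map(function(row) {
  return "Hello " + row.getString(0);
}, Encoders.STRING());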

mapPartitions(func, encoder, bindArgsopt) → {module:eclairjs/sql.Dataset}

Returns a new Dataset that contains the result of applying func to each partition. Similar to map, but runs separately on each partition (block) of the Dataset, so func must accept an Array. func should return an array rather than a single item.
Parameters:
Name Type Attributes Description
func function
encoder module:eclairjs/sql.Encoder
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
Type
module:eclairjs/sql.Dataset

na() → {module:eclairjs/sql.DataframeNaFunctions}

Returns a DataframeNaFunctions for working with missing data.
Source:
Returns:
Type
module:eclairjs/sql.DataframeNaFunctions

orderBy() → {module:eclairjs/sql.Dataset}

Returns a new Dataset sorted by the specified columns; when column names are given, the sort is ascending. This is an alias of the sort function.
Parameters:
Name Type Description
columnName, ...columnName | sortExpr, ...sortExpr string | module:eclairjs/sql.Column
Source:
Returns:
Type
module:eclairjs/sql.Dataset

persist(newLevelopt) → {module:eclairjs/sql.Dataset}

Persists this Dataset with the given storage level.
Parameters:
Name Type Attributes Description
newLevel module:eclairjs/storage.StorageLevel <optional>
Source:
Returns:
Type
module:eclairjs/sql.Dataset
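Example (illustrative sketch, not from the original reference; the eclairjs/storage/StorageLevel module path and its MEMORY_AND_DISK() factory are assumptions)
var StorageLevel = require('eclairjs/storage/StorageLevel');
var cached = df.persist(StorageLevel.MEMORY_AND_DISK());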

printSchema()

Prints the schema to the console in a nice tree format.
Source:

queryExecution() → {SQLContextQueryExecution}

Source:
Returns:
Type
SQLContextQueryExecution

randomSplit(weights, seedopt) → {Array.<module:eclairjs/sql.Dataset>}

Randomly splits this Dataset with the provided weights.
Parameters:
Name Type Attributes Description
weights Array.<float> weights for splits, will be normalized if they don't sum to 1.
seed int <optional>
Seed for sampling.
Source:
Returns:
Type
Array.<module:eclairjs/sql.Dataset>
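Example (illustrative sketch, not from the original reference; df is an assumed Dataset)
// 70/30 split with a fixed seed for reproducibility
var splits = df.randomSplit([0.7, 0.3], 1);
var trainingData = splits[0];
var testData = splits[1];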

rdd() → {module:eclairjs.RDD}

Represents the content of the Dataset as an RDD of Rows.
Source:
Returns:
Type
module:eclairjs.RDD

reduce(func) → {object}

:: Experimental :: Reduces the elements of this Dataset using the specified binary function. The given `func` must be commutative and associative or the result may be non-deterministic.
Parameters:
Name Type Description
func ReduceFunction
Since:
  • EclairJS 0.7 Spark 1.6.0
Source:
Returns:
Type
object

registerTempTable(tableName)

Registers this Dataset as a temporary table using the given name.
Parameters:
Name Type Description
tableName string
Source:

repartition(numPartitions) → {module:eclairjs/sql.Dataset}

Returns a new Dataset that has exactly numPartitions partitions.
Parameters:
Name Type Description
numPartitions integer
Source:
Returns:
Type
module:eclairjs/sql.Dataset

rollup(columnName, ...columnName) → {module:eclairjs/sql.GroupedData}

Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
Parameters:
Name Type Description
columnName, ...columnName string | module:eclairjs/sql.Column
Source:
Returns:
Type
module:eclairjs/sql.GroupedData
Example
var result = peopleDataset.rollup("age", "networth").count();
// or
var col = peopleDataset.col("age");
var result = peopleDataset.rollup(col).count();

sample(withReplacement, fraction, seedopt) → {module:eclairjs/sql.Dataset}

Returns a new Dataset by sampling a fraction of rows, optionally using a given random seed.
Parameters:
Name Type Attributes Description
withReplacement boolean
fraction float
seed integer <optional>
Source:
Returns:
Type
module:eclairjs/sql.Dataset
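Example (illustrative sketch, not from the original reference; df is an assumed Dataset)
// roughly 10% of the rows, sampled without replacement, with a fixed seed
var sampled = df.sample(false, 0.1, 7);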

schema() → {module:eclairjs/sql/types.StructType}

Returns the schema of this Dataset.
Source:
Returns:
Type
module:eclairjs/sql/types.StructType

select() → {module:eclairjs/sql.Dataset}

Selects a set of column-based expressions.
Parameters:
Type Description
Array.<module:eclairjs/sql.Column> | Array.<module:eclairjs/sql.TypedColumn> | Array.<string>
Source:
Returns:
Type
module:eclairjs/sql.Dataset
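Example (illustrative sketch, not from the original reference; df and its "name" and "age" columns are assumed)
// select by column name (the parameter listing above also allows an array of names or Columns)
var subset = df.select("name", "age");
// select with Column expressions
var agePlusOne = df.select(df.col("name"), df.col("age").plus(1));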

selectExpr() → {module:eclairjs/sql.Dataset}

Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.
Parameters:
Name Type Description
exprs,...exprs string
Source:
Returns:
Type
module:eclairjs/sql.Dataset
Example
var result = peopleDataset.selectExpr("name", "age > 19");

show(numberOfRowsOrTruncateopt, truncateopt)

Displays the Dataset rows in a tabular form.
Parameters:
Name Type Attributes Description
numberOfRowsOrTruncate integer | boolean <optional>
Defaults to 20.
truncate boolean <optional>
Defaults to false. Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be aligned right.
Source:
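Example (illustrative sketch, not from the original reference; df is an assumed Dataset)
df.show();        // first 20 rows
df.show(5, true); // first 5 rows, truncating long strings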

sort() → {module:eclairjs/sql.Dataset}

Returns a new Dataset sorted by the specified columns; when column names are given, the sort is ascending.
Parameters:
Name Type Description
columnName, ...columnName | sortExpr, ...sortExpr string | module:eclairjs/sql.Column
Source:
Returns:
Type
module:eclairjs/sql.Dataset
Example
var result = peopleDataset.sort("age", "name");
// or
var col = peopleDataset.col("age");
var colExpr = col.desc();
var result = peopleDataset.sort(colExpr);

sortWithinPartitions(sortCol, sortCols) → {module:eclairjs/sql.Dataset}

Returns a new Dataset with each partition sorted by the given expressions. This is the same operation as "SORT BY" in SQL (Hive QL).
Parameters:
Name Type Description
sortCol string | module:eclairjs/sql.Column
sortCols string | module:eclairjs/sql.Column
Since:
  • EclairJS 0.7 Spark 2.0.0
Source:
Returns:
Type
module:eclairjs/sql.Dataset

sparkSession() → {module:eclairjs/sql.SparkSession}

Returns the SparkSession associated with this Dataset.
Source:
Returns:
Type
module:eclairjs/sql.SparkSession

sqlContext() → {module:eclairjs/sql.SQLContext}

Returns the SQLContext associated with this Dataset.
Source:
Returns:
Type
module:eclairjs/sql.SQLContext

stat() → {module:eclairjs/sql.DatasetStatFunctions}

Returns a DatasetStatFunctions for statistic functions support.
Source:
Returns:
Type
module:eclairjs/sql.DatasetStatFunctions
Example
var stat = peopleDataset.stat().cov("income", "networth");

take(n) → {Array.<module:eclairjs/sql.Row>}

Returns the first n rows in the Dataset.
Parameters:
Name Type Description
n number
Source:
Returns:
Type
Array.<module:eclairjs/sql.Row>

toDF() → {module:eclairjs/sql.Dataset}

Returns a new Dataset with columns renamed. This can be quite convenient when converting an RDD of tuples into a Dataset with meaningful names.
Parameters:
Name Type Description
colNames,...colNames string
Source:
Returns:
Type
module:eclairjs/sql.Dataset
Example
var result = nameAgeDF.toDF("newName", "newAge");

toJSON() → {object}

Returns the content of the Dataset as JSON.
Source:
Returns:
Type
object

toRDD() → {module:eclairjs.RDD}

Represents the content of the Dataset as an RDD of Rows.
Source:
Returns:
Type
module:eclairjs.RDD

union(other) → {module:eclairjs/sql.Dataset}

Returns a new Dataset containing the union of rows in this Dataset and another Dataset. This is equivalent to `UNION ALL` in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by distinct.
Parameters:
Name Type Description
other module:eclairjs/sql.Dataset
Since:
  • EclairJS 0.7 Spark 2.0.0
Source:
Returns:
Type
module:eclairjs/sql.Dataset
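Example (illustrative sketch, not from the original reference; df1 and df2 are assumed Datasets with the same schema)
// UNION ALL semantics: duplicate rows are kept
var combined = df1.union(df2);
// SQL-style set union: follow with distinct to drop duplicates
var combinedDistinct = df1.union(df2).distinct();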

unionAll(other) → {module:eclairjs/sql.Dataset}

Returns a new Dataset containing the union of rows in this Dataset and another Dataset. This is equivalent to UNION ALL in SQL.
Parameters:
Name Type Description
other module:eclairjs/sql.Dataset
Source:
Returns:
Type
module:eclairjs/sql.Dataset

unpersist(blockingopt) → {module:eclairjs/sql.Dataset}

Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
Parameters:
Name Type Attributes Description
blocking boolean <optional>
Whether to block until all blocks are deleted.
Since:
  • EclairJS 0.7 Spark 1.6.0
Source:
Returns:
Type
module:eclairjs/sql.Dataset

where(condition) → {module:eclairjs/sql.Dataset}

Filters rows using the given Column or SQL expression.
Parameters:
Name Type Description
condition module:eclairjs/sql.Column | string
Source:
Returns:
Type
module:eclairjs/sql.Dataset
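Example (illustrative sketch, not from the original reference; df and its "name" column are assumed)
// SQL expression string
var andys = df.where("name = 'Andy'");
// equivalent Column expression
var andys2 = df.where(df.col("name").equalTo("Andy"));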

withColumn(name, col) → {module:eclairjs/sql.Dataset}

Returns a new Dataset by adding a column or replacing the existing column that has the same name.
Parameters:
Name Type Description
name string
col module:eclairjs/sql.Column
Source:
Returns:
Type
module:eclairjs/sql.Dataset
Example
var col = peopleDataset.col("age");
var df1 = peopleDataset.withColumn("newCol", col);

withColumnRenamed(existingName, newName) → {module:eclairjs/sql.Dataset}

Returns a new Dataset with a column renamed. This is a no-op if schema doesn't contain existingName.
Parameters:
Name Type Description
existingName string
newName string
Source:
Returns:
Type
module:eclairjs/sql.Dataset

write() → {module:eclairjs/sql.DatasetWriter}

Interface for saving the content of the Dataset out into external storage.
Source:
Returns:
Type
module:eclairjs/sql.DatasetWriter
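Example (illustrative sketch, not from the original reference; the output path and the mode/json writer calls are assumptions)
// save as JSON, overwriting any existing output at the path
df.write().mode("overwrite").json("/tmp/people-out");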

writeStream() → {module:eclairjs/sql/streaming.DataStreamWriter}

Interface for saving the content of the streaming Dataset out into external storage.
Source:
Returns:
Type
module:eclairjs/sql/streaming.DataStreamWriter
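Example (illustrative sketch, not from the original reference; the console sink and the outputMode/format/start calls are assumptions)
// start a streaming query that continuously writes new rows to the console
var query = streamingDf.writeStream().outputMode("append").format("console").start();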