Constructor
new Dataset()
- Source:
Examples
var people = sqlContext.read().parquet("...");
// Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in:
// Dataset (this class), Column, and functions.
// To select a column from the data frame:
var ageCol = people.col("age");
Methods
agg() → {module:eclairjs/sql.Dataset}
Aggregates on the entire Dataset without groups.
Parameters:
Type | Description |
---|---|
hashMap | Map of column name to aggregate function name |
- Source:
Returns:
Example
// df.agg(...) is a shorthand for df.groupBy().agg(...)
var map = {};
map["age"] = "max";
map["salary"] = "avg";
df.agg(map);
// equivalent to:
df.groupBy().agg(map);
alias(alias) → {module:eclairjs/sql.Dataset}
Returns a new Dataset with an alias set. Same as `as`.
Parameters:
Name | Type | Description |
---|---|---|
alias | string | |
- Since:
- EclairJS 0.7 Spark 2.0.0
- Source:
Returns:
apply(colName) → {module:eclairjs/sql.Column}
Selects a column based on the column name and returns it as a Column.
Note that the column name can also refer to a nested column like a.b.
Parameters:
Name | Type | Description |
---|---|---|
colName | string | |
- Source:
Returns:
as(alias) → {module:eclairjs/sql.Dataset}
Returns a new Dataset with an alias set.
Parameters:
Name | Type | Description |
---|---|---|
alias | string | |
- Source:
Returns:
cache() → {module:eclairjs/sql.Dataset}
Persists this Dataset with the default storage level (`MEMORY_AND_DISK`).
- Source:
Returns:
coalesce(numPartitions) → {module:eclairjs/sql.Dataset}
Returns a new Dataset that has exactly numPartitions partitions.
Similar to coalesce defined on an RDD, this operation results in a narrow dependency:
if you go from 1000 partitions to 100 partitions, there is no shuffle;
instead, each of the 100 new partitions claims 10 of the current partitions.
Parameters:
Name | Type | Description |
---|---|---|
numPartitions | integer | |
- Source:
Returns:
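The partition arithmetic described for coalesce can be illustrated without Spark. The following is a minimal plain-JavaScript sketch, not EclairJS code: `coalescePartitions` is a hypothetical helper that models each partition as an array and merges adjacent partitions whole, so no individual row is redistributed. How Spark actually assigns parent partitions to new ones is an implementation detail this sketch does not attempt to reproduce.

```javascript
// Hypothetical model: a "partition" is just an array of rows.
// Merge `partitions` into `numPartitions` groups of adjacent partitions.
function coalescePartitions(partitions, numPartitions) {
  var result = [];
  for (var i = 0; i < numPartitions; i++) {
    // Each new partition claims a contiguous slice of the parent partitions.
    var start = Math.floor(i * partitions.length / numPartitions);
    var end = Math.floor((i + 1) * partitions.length / numPartitions);
    var merged = [];
    for (var p = start; p < end; p++) {
      merged = merged.concat(partitions[p]);
    }
    result.push(merged);
  }
  return result;
}

// 1000 single-row partitions coalesced into 100 partitions.
var parents = [];
for (var k = 0; k < 1000; k++) {
  parents.push([k]);
}
var coalesced = coalescePartitions(parents, 100);
```

In this model each of the 100 resulting partitions holds the rows of exactly 10 parent partitions.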
col(name) → {module:eclairjs/sql.Column}
Selects a column based on the column name and returns it as a Column.
Parameters:
Name | Type | Description |
---|---|---|
name | string | |
- Source:
Returns:
collect() → {Array.<object>}
Returns an array that contains all of the objects in this Dataset.
- Source:
Returns:
- Type
- Array.<object>
columns(name) → {Array.<string>}
Returns all column names as an array.
Parameters:
Name | Type | Description |
---|---|---|
name | string | |
- Source:
Returns:
- Type
- Array.<string>
count() → {integer}
Returns the number of rows in the Dataset.
- Source:
Returns:
- Type
- integer
createOrReplaceTempView(viewName)
Creates a temporary view using the given name, replacing any existing view with that name. The lifetime of this
temporary view is tied to the SparkSession that was used to create this Dataset.
Parameters:
Name | Type | Description |
---|---|---|
viewName | string | |
- Since:
- EclairJS 0.7 Spark 2.0.0
- Source:
createTempView(viewName)
Creates a temporary view using the given name. The lifetime of this
temporary view is tied to the SparkSession that was used to create this Dataset.
Parameters:
Name | Type | Description |
---|---|---|
viewName | string | |
- Since:
- EclairJS 0.7 Spark 2.0.0
- Source:
Throws:
AnalysisException if the view name already exists
cube() → {module:eclairjs/sql.GroupedData}
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them.
Parameters:
Name | Type | Description |
---|---|---|
cols... | string | Column | |
- Source:
Returns:
- Type
- module:eclairjs/sql.GroupedData
Example
var df = peopleDataset.cube("age", "expense");
describe() → {module:eclairjs/sql.Dataset}
Computes statistics for numeric columns, including count, mean, stddev, min, and max.
If no columns are given, this function computes statistics for all numerical columns.
This function is meant for exploratory data analysis, as we make no guarantee about the backward
compatibility of the schema of the resulting Dataset. If you want to programmatically compute
summary statistics, use the agg function instead.
Parameters:
Name | Type | Description |
---|---|---|
cols... | string | |
- Source:
Returns:
Example
var df = peopleDataset.describe("age", "expense");
distinct()
Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for dropDuplicates.
- Source:
drop(col) → {module:eclairjs/sql.Dataset}
Returns a new Dataset with a column dropped.
Parameters:
Name | Type | Description |
---|---|---|
col | string | module:eclairjs/sql.Column | |
- Source:
Returns:
dropDuplicates(colNamesopt) → {module:eclairjs/sql.Dataset}
Returns a new Dataset that contains only the unique rows from this Dataset. If colNames is given, only that subset of columns is considered when comparing rows.
Parameters:
Name | Type | Attributes | Description |
---|---|---|---|
colNames | Array.<string> | <optional> | |
- Source:
Returns:
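To make the effect of colNames concrete, here is a small plain-JavaScript sketch that does not use EclairJS. `dropDuplicateRows` is a hypothetical helper that models rows as plain objects and keeps the first row seen for each key; which of several duplicate rows Spark keeps is not something this page specifies, so treat that choice as an assumption of the sketch.

```javascript
// Hypothetical model: rows are plain objects; keep the first row seen for each key.
function dropDuplicateRows(rows, colNames) {
  var seen = {};
  return rows.filter(function (row) {
    // With no colNames, every column of the row takes part in the comparison.
    var names = colNames || Object.keys(row).sort();
    var key = JSON.stringify(names.map(function (name) { return row[name]; }));
    if (seen[key]) { return false; }
    seen[key] = true;
    return true;
  });
}

var rows = [
  { name: "Ann", age: 30 },
  { name: "Ann", age: 30 },
  { name: "Ann", age: 41 }
];
var uniqueRows = dropDuplicateRows(rows);            // full-row comparison
var uniqueNames = dropDuplicateRows(rows, ["name"]); // compare on "name" only
```

Comparing whole rows leaves two rows here, while restricting the comparison to "name" leaves one.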
dtypes() → {Array}
Returns all column names and their data types as an array of arrays, for example [["name","StringType"],["age","IntegerType"],["expense","IntegerType"]].
- Source:
Returns:
Array of Array[2]
- Type
- Array
except(otherDataset) → {module:eclairjs/sql.Dataset}
Returns a new Dataset containing the rows that are in this Dataset but not in the other Dataset. This is equivalent to EXCEPT in SQL.
Parameters:
Name | Type | Description |
---|---|---|
otherDataset | module:eclairjs/sql.Dataset | Dataset to compare to this Dataset |
- Source:
Returns:
explain(if)
Prints the plans (logical and physical) to the console for debugging purposes.
Parameters:
Name | Type | Description |
---|---|---|
if | boolean | If false, prints only the physical plan. |
- Source:
filter() → {module:eclairjs/sql.Dataset}
Filters rows using the given SQL expression string, Column, or function.
Parameters:
Type | Description |
---|---|
string | module:eclairjs/sql.Column | function |
- Source:
Returns:
flatMap(func, encoder, bindArgsopt) → {module:eclairjs/sql.Dataset}
Returns a new Dataset by first applying a function to all elements of this Dataset,
and then flattening the results.
Parameters:
Name | Type | Attributes | Description |
---|---|---|---|
func | function | | |
encoder | module:eclairjs/sql.Encoder | | |
bindArgs | Array.<Object> | <optional> | Array whose values will be added to func's argument list. |
- Source:
Returns:
foreach(Function, bindArgsopt) → {void}
Applies a function to all elements of this Dataset.
Parameters:
Name | Type | Attributes | Description |
---|---|---|---|
Function | function | | Function with one parameter. |
bindArgs | Array.<Object> | <optional> | Array whose values will be added to func's argument list. |
- Source:
Returns:
- Type
- void
Example
df.foreach(function(record) {
  // createNewConnection is a placeholder for application code
  var connection = createNewConnection();
  connection.send(record);
  connection.close();
});
foreachPartition(Function, bindArgsopt) → {void}
Applies a function to each partition of this Dataset.
Parameters:
Name | Type | Attributes | Description |
---|---|---|---|
Function | function | | Function with one Array parameter. |
bindArgs | Array.<Object> | <optional> | Array whose values will be added to func's argument list. |
- Source:
Returns:
- Type
- void
Example
df.foreachPartition(function(partitionOfRecords) {
  // createNewConnection is a placeholder for application code
  var connection = createNewConnection();
  partitionOfRecords.forEach(function(record) {
    connection.send(record);
  });
  connection.close();
});
groupBy() → {module:eclairjs/sql.RelationalGroupedDataset}
Groups the Dataset using the specified columns so that aggregations can be run on them.
Parameters:
Type | Description |
---|---|
Array.<string> | Array.<module:eclairjs/sql.Column> | Array of Column objects or column name strings |
- Source:
Returns:
groupByKey(func, encoder) → {module:eclairjs/sql.KeyValueGroupedDataset}
:: Experimental ::
(Java-specific)
Returns a KeyValueGroupedDataset where the data is grouped by the given key `func`.
Parameters:
Name | Type | Description |
---|---|---|
func | MapFunction | |
encoder | module:eclairjs/sql.Encoder | |
- Since:
- EclairJS 0.7 Spark 2.0.0
- Source:
Returns:
- Type
- module:eclairjs/sql.KeyValueGroupedDataset
head(nopt) → {module:eclairjs/sql.Row}
Returns the first row.
Parameters:
Name | Type | Attributes | Description |
---|---|---|---|
n | number | <optional> | |
- Source:
Returns:
inputFiles() → {Array.<string>}
Returns a best-effort snapshot of the files that compose this Dataset. This method simply asks each constituent
BaseRelation for its respective files and takes the union of all results. Depending on the source relations,
this may not find all input files. Duplicates are removed.
- Source:
Returns:
files
- Type
- Array.<string>
intersect(other) → {module:eclairjs/sql.Dataset}
Returns a new Dataset containing only the rows that are in both this Dataset and the other Dataset. This is equivalent to INTERSECT in SQL.
Parameters:
Name | Type | Description |
---|---|---|
other | module:eclairjs/sql.Dataset | |
- Source:
Returns:
isLocal() → {boolean}
Returns true if the collect and take methods can be run locally (without any Spark executors).
- Source:
Returns:
- Type
- boolean
isStreaming() → {boolean}
Returns true if this Dataset contains one or more sources that continuously
return data as it arrives. A Dataset that reads data from a streaming source
must be executed as a StreamingQuery using the `start()` method in
DataStreamWriter. Methods that return a single answer, e.g. `count()` or
`collect()`, will throw an AnalysisException when there is a streaming
source present.
- Since:
- EclairJS 0.7 Spark 2.0.0
- Source:
Returns:
- Type
- boolean
join(Right, columnNamesOrJoinExpropt, joinTypeopt) → {module:eclairjs/sql.Dataset}
Joins with another Dataset. When only the right side is given, this is a Cartesian join; note that Cartesian joins are very expensive without an extra filter that can be pushed down.
Parameters:
Name | Type | Attributes | Description |
---|---|---|---|
Right | module:eclairjs/sql.Dataset | | Right side of the join operation. |
columnNamesOrJoinExpr | string | Array.<string> | module:eclairjs/sql.Column | <optional> | If a string or an array of strings, the column names for an inner equi-join with the other Dataset. Unlike the other join forms, the join columns appear only once in the output, similar to SQL's JOIN USING syntax. If a Column, the join expression for an inner join with the other Dataset. |
joinType | string | <optional> | Only valid when a Column join expression is used. |
- Source:
Returns:
Example
var joinedDf = df1.join(df2);
// or
var joinedDf = df1.join(df2,"age");
// or
var joinedDf = df1.join(df2, ["age", "DOB"]);
// or Column joinExpr
var joinedDf = df1.join(df2, df1.col("name").equalTo(df2.col("name")));
// or Column joinExpr
var joinedDf = df1.join(df2, df1.col("name").equalTo(df2.col("name")), "outer");
joinWith(other, condition, joinTypeopt) → {module:eclairjs/sql.Dataset}
:: Experimental ::
Joins this Dataset returning a module:eclairjs.Tuple2 for each pair where `condition` evaluates to
true.
This is similar to the relational `join` function with one important difference in the
result schema. Since `joinWith` preserves objects present on either side of the join, the
result schema is similarly nested into a tuple under the column names `_1` and `_2`.
This type of join can be useful both for preserving type-safety with the original object
types as well as working with relational data where either side of the join has column
names in common.
Parameters:
Name | Type | Attributes | Description |
---|---|---|---|
other | module:eclairjs/sql.Dataset | | Right side of the join. |
condition | module:eclairjs/sql.Column | | Join expression. |
joinType | string | <optional> | One of: `inner`, `outer`, `left_outer`, `right_outer`, `leftsemi`. |
- Since:
- EclairJS 0.7 Spark 1.6.0
- Source:
Returns:
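The difference in result shape between `join` and `joinWith` can be sketched with plain JavaScript objects. This is an illustrative model only, not EclairJS code: the `left_`/`right_` prefixes in the flattened row are an assumption made here to keep the shared column names apart, and the `id` comparison simply stands in for `condition`.

```javascript
// Hypothetical model of the two result shapes, using plain objects as rows.
var left = [{ id: 1, name: "Ann" }];
var right = [{ id: 1, name: "Bob" }];

// join-like result: columns from both sides are flattened into one row,
// so the shared column names need prefixes here to stay distinguishable.
var flat = [];
// joinWith-like result: each side is kept whole under _1 and _2.
var nested = [];

left.forEach(function (l) {
  right.forEach(function (r) {
    if (l.id === r.id) { // stands in for `condition`
      flat.push({ left_id: l.id, left_name: l.name, right_id: r.id, right_name: r.name });
      nested.push({ _1: l, _2: r });
    }
  });
});
```

In the nested form both sides keep their original `name` field, which is why this shape is convenient when the two sides have column names in common.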
limit(number) → {module:eclairjs/sql.Dataset}
Returns a new Dataset by taking the first n rows. The difference between this function and head is that head
returns an array while limit returns a new Dataset.
Parameters:
Name | Type | Description |
---|---|---|
number | integer | |
- Source:
Returns:
map(func, encoder, bindArgsopt) → {module:eclairjs/sql.Dataset}
Returns a new Dataset that contains the result of applying func to each element.
Parameters:
Name | Type | Attributes | Description |
---|---|---|---|
func | function | | |
encoder | module:eclairjs/sql.Encoder | | |
bindArgs | Array.<Object> | <optional> | Array whose values will be added to func's argument list. |
- Source:
Returns:
mapPartitions(func, encoder, bindArgsopt) → {module:eclairjs/sql.Dataset}
Returns a new Dataset that contains the result of applying `func` to each partition.
Similar to map, but runs separately on each partition (block) of the Dataset, so func must accept an Array.
func should return an array rather than a single item.
Parameters:
Name | Type | Attributes | Description |
---|---|---|---|
func | function | | |
encoder | module:eclairjs/sql.Encoder | | |
bindArgs | Array.<Object> | <optional> | Array whose values will be added to func's argument list. |
- Source:
Returns:
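The contract difference between map and mapPartitions (one element in and one item out, versus one Array in and one array out) can be modelled in plain JavaScript. The helpers `mapEach` and `mapEachPartition` below are hypothetical and only mimic the calling convention; they are not part of EclairJS and ignore encoders and bindArgs.

```javascript
// Hypothetical model: a Dataset as an array of partitions (arrays of elements).
function mapEach(partitions, func) {
  // map: func is called once per element and returns one item.
  return partitions.map(function (partition) {
    return partition.map(function (element) { return func(element); });
  });
}

function mapEachPartition(partitions, func) {
  // mapPartitions: func is called once per partition with an Array
  // and must return an array.
  return partitions.map(function (partition) { return func(partition); });
}

var partitions = [[1, 2], [3, 4, 5]];
var doubled = mapEach(partitions, function (x) { return x * 2; });
var sums = mapEachPartition(partitions, function (part) {
  var total = part.reduce(function (a, b) { return a + b; }, 0);
  return [total]; // one-element array per partition
});
```

A per-partition function is useful when some setup cost (such as the connection in the foreachPartition example) should be paid once per partition rather than once per element.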
na() → {module:eclairjs/sql.DataframeNaFunctions}
Returns a DatasetNaFunctions for working with missing data.
- Source:
Returns:
- Type
- module:eclairjs/sql.DataframeNaFunctions
orderBy() → {module:eclairjs/sql.Dataset}
Returns a new Dataset sorted by the specified columns. When column names are given, the sort is in ascending order.
This is an alias of the sort function.
Parameters:
Name | Type | Description |
---|---|---|
columnName,...columnName | string | module:eclairjs/sql.Column | Column names or sort expressions (sortExprs,...sortExprs) |
- Source:
Returns:
persist(newLevelopt) → {module:eclairjs/sql.Dataset}
Persists this Dataset with the given storage level.
Parameters:
Name | Type | Attributes | Description |
---|---|---|---|
newLevel | module:eclairjs/storage.StorageLevel | <optional> | |
- Source:
Returns:
printSchema()
Prints the schema to the console in a nice tree format.
- Source:
queryExecution() → {SQLContextQueryExecution}
- Source:
Returns:
- Type
- SQLContextQueryExecution
randomSplit(weights, seedopt) → {Array.<module:eclairjs/sql.Dataset>}
Randomly splits this Dataset with the provided weights.
Parameters:
Name | Type | Attributes | Description |
---|---|---|---|
weights | Array.<float> | | Weights for the splits; they are normalized if they don't sum to 1. |
seed | integer | <optional> | Seed for sampling. |
- Source:
Returns:
- Type
- Array.<module:eclairjs/sql.Dataset>
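As a quick illustration of what normalizing the weights means, the sketch below scales an array of weights so that it sums to 1. `normalizeWeights` is a hypothetical helper written for this example; it is not an EclairJS function and says nothing about how rows are actually assigned to splits.

```javascript
// Hypothetical helper: scale weights so that they sum to 1.
function normalizeWeights(weights) {
  var total = weights.reduce(function (a, b) { return a + b; }, 0);
  return weights.map(function (w) { return w / total; });
}

// Weights of 3 and 1 describe the same proportions as 0.75 and 0.25.
var normalized = normalizeWeights([3, 1]);
```

Under this reading, passing `[3, 1]` as weights would be expected to request roughly a 75%/25% split.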
rdd() → {module:eclairjs.RDD}
Represents the content of the Dataset as an RDD of Rows.
- Source:
Returns:
- Type
- module:eclairjs.RDD
reduce(func) → {object}
:: Experimental ::
Reduces the elements of this Dataset using the specified binary function. The given `func`
must be commutative and associative or the result may be non-deterministic.
Parameters:
Name | Type | Description |
---|---|---|
func | ReduceFunction | |
- Since:
- EclairJS 0.7 Spark 1.6.0
- Source:
Returns:
- Type
- object
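Why `func` must be commutative and associative can be seen in a small plain-JavaScript model, independent of EclairJS. The hypothetical `reducePartitioned` helper reduces each partition first and then combines the partial results, so the grouping of elements depends on how the data happens to be partitioned.

```javascript
// Hypothetical model: reduce each partition, then reduce the partial results.
function reducePartitioned(partitions, func) {
  var partials = partitions.map(function (partition) {
    return partition.reduce(func);
  });
  return partials.reduce(func);
}

function add(a, b) { return a + b; }       // commutative and associative
function subtract(a, b) { return a - b; }  // neither

// The same four elements under two different partition layouts.
var layoutA = [[1, 2], [3, 4]];
var layoutB = [[1], [2, 3, 4]];

var sumA = reducePartitioned(layoutA, add);
var sumB = reducePartitioned(layoutB, add);
var diffA = reducePartitioned(layoutA, subtract);
var diffB = reducePartitioned(layoutB, subtract);
```

Addition gives the same total for both layouts, while subtraction gives a different answer for each, which is the kind of non-determinism the description warns about.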
registerTempTable(tableName)
Registers this Dataset as a temporary table using the given name.
Parameters:
Name | Type | Description |
---|---|---|
tableName | string | |
- Source:
repartition(numPartitions) → {module:eclairjs/sql.Dataset}
Returns a new Dataset that has exactly numPartitions partitions.
Parameters:
Name | Type | Description |
---|---|---|
numPartitions | integer | |
- Source:
Returns:
rollup(columnName,...columnName) → {module:eclairjs/sql.GroupedData}
Create a multi-dimensional rollup for the current Dataset using the specified columns,
so we can run aggregation on them. See GroupedData for all the available aggregate functions.
Parameters:
Name | Type | Description |
---|---|---|
columnName,...columnName | string | module:eclairjs/sql.Column | Column names or Column expressions |
- Source:
Returns:
- Type
- module:eclairjs/sql.GroupedData
Example
var result = peopleDataset.rollup("age", "networth").count();
// or
var col = peopleDataset.col("age");
var result = peopleDataset.rollup(col).count();
sample(withReplacement, fraction, seedopt) → {module:eclairjs/sql.Dataset}
Returns a new Dataset by sampling a fraction of rows, using a random seed.
Parameters:
Name | Type | Attributes | Description |
---|---|---|---|
withReplacement | boolean | | |
fraction | float | | |
seed | integer | <optional> | |
- Source:
Returns:
schema() → {module:eclairjs/sql/types.StructType}
Returns the schema of this Dataset.
- Source:
Returns:
select() → {module:eclairjs/sql.Dataset}
Selects a set of column-based expressions.
Parameters:
Type | Description |
---|---|
Array.<module:eclairjs/sql.Column> | Array.<module:eclairjs/sql.TypedColumn> | Array.<string> |
- Source:
Returns:
selectExpr() → {module:eclairjs/sql.Dataset}
Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.
Parameters:
Name | Type | Description |
---|---|---|
exprs,...exprs | string | |
- Source:
Returns:
Example
var result = peopleDataset.selectExpr("name", "age > 19");
show(numberOfRowsOrTruncateopt, truncateopt)
Displays the Dataset rows in a tabular form.
Parameters:
Name | Type | Attributes | Description |
---|---|---|---|
numberOfRowsOrTruncate | integer | boolean | <optional> | Number of rows to show (defaults to 20), or a boolean truncate flag. |
truncate | boolean | <optional> | Whether to truncate long strings (defaults to false). If true, strings longer than 20 characters are truncated and all cells are right-aligned. |
- Source:
sort() → {module:eclairjs/sql.Dataset}
Returns a new Dataset sorted by the specified columns. When column names are given, the sort is in ascending order.
Parameters:
Name | Type | Description |
---|---|---|
columnName,...columnName | string | module:eclairjs/sql.Column | Column names or sort expressions (sortExprs,...sortExprs) |
- Source:
Returns:
Example
var result = peopleDataset.sort("age", "name");
// or
var col = peopleDataset.col("age");
var colExpr = col.desc();
var result = peopleDataset.sort(colExpr);
sortWithinPartitions(sortCol, sortCols) → {module:eclairjs/sql.Dataset}
Returns a new Dataset with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
Parameters:
Name | Type | Description |
---|---|---|
sortCol | string | module:eclairjs/sql.Column | |
sortCols | string | module:eclairjs/sql.Column | |
- Since:
- EclairJS 0.7 Spark 2.0.0
- Source:
Returns:
sparkSession() → {module:eclairjs/sql.SparkSession}
Returns SparkSession
- Source:
Returns:
sqlContext() → {module:eclairjs/sql.SQLContext}
Returns SQLContext
- Source:
Returns:
stat() → {module:eclairjs/sql.DatasetStatFunctions}
Returns a DatasetStatFunctions for working with statistic functions.
- Source:
Returns:
- Type
- module:eclairjs/sql.DatasetStatFunctions
Example
var stat = peopleDataset.stat().cov("income", "networth");
take(n) → {Array.<module:eclairjs/sql.Row>}
Returns the first n rows in the Dataset.
Parameters:
Name | Type | Description |
---|---|---|
n | number | |
- Source:
Returns:
- Type
- Array.<module:eclairjs/sql.Row>
toDF() → {module:eclairjs/sql.Dataset}
Returns a new Dataset with columns renamed. This can be quite convenient when converting an
RDD of tuples into a Dataset with meaningful names.
Parameters:
Name | Type | Description |
---|---|---|
colNames,...colNames | string | |
- Source:
Returns:
Example
var result = nameAgeDF.toDF("newName", "newAge");
toJSON() → {object}
Returns the content of the Dataset as JSON.
- Source:
Returns:
- Type
- object
toRDD() → {module:eclairjs.RDD}
Represents the content of the Dataset as an RDD of Rows.
- Source:
Returns:
- Type
- module:eclairjs.RDD
union(other) → {module:eclairjs/sql.Dataset}
Returns a new Dataset containing the union of the rows in this Dataset and another Dataset.
This is equivalent to `UNION ALL` in SQL.
To do a SQL-style set union (that does deduplication of elements), use this function followed
by a distinct.
Parameters:
Name | Type | Description |
---|---|---|
other | module:eclairjs/sql.Dataset | |
- Since:
- EclairJS 0.7 Spark 2.0.0
- Source:
Returns:
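The distinction between `UNION ALL` semantics and a deduplicating set union can be shown with plain arrays. This is a minimal sketch, not EclairJS code: `unionAllRows` and `distinctRows` are hypothetical helpers that stand in for `union` followed by `distinct`, with rows modelled as plain objects.

```javascript
// Hypothetical model: rows are plain objects held in arrays.
function unionAllRows(a, b) {
  return a.concat(b); // keeps duplicates, like UNION ALL
}

function distinctRows(rows) {
  var seen = {};
  return rows.filter(function (row) {
    var key = JSON.stringify(row);
    if (seen[key]) { return false; }
    seen[key] = true;
    return true;
  });
}

var first = [{ id: 1 }, { id: 2 }];
var second = [{ id: 2 }, { id: 3 }];
var all = unionAllRows(first, second);  // id 2 appears twice
var deduplicated = distinctRows(all);   // id 2 appears once
```

With a real Dataset the corresponding chain would presumably be `ds1.union(ds2).distinct()`, using the two methods documented on this page.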
unionAll(other) → {module:eclairjs/sql.Dataset}
Returns a new Dataset containing the union of the rows in this Dataset and another Dataset. This is equivalent to UNION ALL in SQL.
Parameters:
Name | Type | Description |
---|---|---|
other | module:eclairjs/sql.Dataset | |
- Source:
Returns:
unpersist(blockingopt) → {module:eclairjs/sql.Dataset}
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
Parameters:
Name | Type | Attributes | Description |
---|---|---|---|
blocking | boolean | <optional> | Whether to block until all blocks are deleted. |
- Since:
- EclairJS 0.7 Spark 1.6.0
- Source:
Returns:
where(condition) → {module:eclairjs/sql.Dataset}
Filters rows using the given Column or SQL expression.
Parameters:
Name | Type | Description |
---|---|---|
condition | module:eclairjs/sql.Column | string | |
- Source:
Returns:
withColumn(name, col) → {module:eclairjs/sql.Dataset}
Returns a new Dataset by adding a column or replacing the existing column that has the same name.
Parameters:
Name | Type | Description |
---|---|---|
name | string | |
col | module:eclairjs/sql.Column | |
- Source:
Returns:
Example
var col = peopleDataset.col("age");
var df1 = peopleDataset.withColumn("newCol", col);
withColumnRenamed(existingName, newName) → {module:eclairjs/sql.Dataset}
Returns a new Dataset with a column renamed. This is a no-op if the schema doesn't contain existingName.
Parameters:
Name | Type | Description |
---|---|---|
existingName | string | |
newName | string | |
- Source:
Returns:
write() → {module:eclairjs/sql.DatasetWriter}
Interface for saving the content of the Dataset out into external storage.
- Source:
Returns:
- Type
- module:eclairjs/sql.DatasetWriter
writeStream() → {module:eclairjs/sql/streaming.DataStreamWriter}
Interface for saving the content of the streaming Dataset out into external storage.
- Source: