JSDoc: Class: DataFrameStatFunctions

new DataFrameStatFunctions()

Statistic functions for DataFrames.

Since:

EclairJS 0.1 Spark 1.4.0

Source:

sql/DataFrameStatFunctions.js, line 27

Methods

corr(col1, col2, method) → {number}

Calculates the correlation of two columns of a DataFrame. Currently only supports the Pearson Correlation Coefficient. For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.

Parameters:

Name	Type	Attributes	Description
`col1`	string		the name of the column
`col2`	string		the name of the column to calculate the correlation against
`method`	string	<optional>	The correlation method. Only "pearson" is currently supported.

Since:

EclairJS 0.1 Spark 1.4.0

Source:

sql/DataFrameStatFunctions.js, line 73

Returns:

The Pearson Correlation Coefficient.

Type: number

Example

var stat = peopleDataFrame.stat().corr("income", "networth", "pearson");
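
For reference, the quantity that `corr` reports can be sketched in plain JavaScript. This is only an illustration of the Pearson formula on two in-memory arrays; `pearson` is a hypothetical helper name and is not part of EclairJS or Spark.

```javascript
// Hypothetical helper (not an EclairJS API): Pearson correlation of two
// equal-length numeric arrays.
function pearson(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("need two equal-length arrays with at least 2 values");
  }
  var n = xs.length;
  var meanX = xs.reduce(function (a, b) { return a + b; }, 0) / n;
  var meanY = ys.reduce(function (a, b) { return a + b; }, 0) / n;
  var sxy = 0, sxx = 0, syy = 0;
  for (var i = 0; i < n; i++) {
    var dx = xs[i] - meanX;
    var dy = ys[i] - meanY;
    sxy += dx * dy;   // co-deviation of x and y
    sxx += dx * dx;   // squared deviation of x
    syy += dy * dy;   // squared deviation of y
  }
  // NaN when either column is constant (zero variance).
  return sxy / Math.sqrt(sxx * syy);
}
```

The result lies between -1 and 1; a perfectly linear increasing relationship gives 1 and a perfectly linear decreasing one gives -1.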

cov(col1, col2) → {Promise.<number>}

Calculates the sample covariance of two numerical columns of a DataFrame.

Parameters:

Name	Type	Description
`col1`	string	the name of the first column
`col2`	string	the name of the second column

Since:

EclairJS 0.1 Spark 1.4.0

Source:

sql/DataFrameStatFunctions.js, line 47

Returns:

The sample covariance of the two columns.

Type: Promise.<number>

Example

// Scala example carried over from the Spark API documentation
val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
     .withColumn("rand2", rand(seed=27))
   df.stat.cov("rand1", "rand2")
   res1: Double = 0.065...
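
The sample covariance itself can be sketched in plain JavaScript as follows. `sampleCovariance` is a hypothetical helper used only to show the formula (note the `n - 1` divisor); it is not an EclairJS call and does not reproduce the Promise-based return value listed above.

```javascript
// Hypothetical helper (not an EclairJS API): sample covariance of two
// equal-length numeric arrays, using the unbiased n - 1 divisor.
function sampleCovariance(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("need two equal-length arrays with at least 2 values");
  }
  var n = xs.length;
  var meanX = xs.reduce(function (a, b) { return a + b; }, 0) / n;
  var meanY = ys.reduce(function (a, b) { return a + b; }, 0) / n;
  var sum = 0;
  for (var i = 0; i < n; i++) {
    sum += (xs[i] - meanX) * (ys[i] - meanY);
  }
  return sum / (n - 1);
}
```

Passing the same array twice yields the sample variance of that array.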

crosstab(col1, col2) → {module:eclairjs/sql.DataFrame}

Computes a pair-wise frequency table of the given columns. Also known as a contingency table. The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned. The first column of each row will be the distinct values of `col1` and the column names will be the distinct values of `col2`. The name of the first column will be `$col1_$col2`. Counts will be returned as `Long`s. Pairs that have no occurrences will have zero as their counts. Null elements will be replaced by "null", and back ticks will be dropped from elements if they exist.

Parameters:

Name	Type	Description
`col1`	string	The name of the first column. Distinct items will make the first item of each row.
`col2`	string	The name of the second column. Distinct items will make the column names of the DataFrame.

Since:

EclairJS 0.1 Spark 1.4.0

Source:

sql/DataFrameStatFunctions.js, line 117

Returns:

A DataFrame containing the contingency table.

Type: module:eclairjs/sql.DataFrame

Example

// Scala example carried over from the Spark API documentation
val df = sqlContext.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2),
     (3, 3))).toDF("key", "value")
   val ct = df.stat.crosstab("key", "value")
   ct.show()
   +---------+---+---+---+
   |key_value|  1|  2|  3|
   +---------+---+---+---+
   |        2|  2|  0|  1|
   |        1|  1|  1|  0|
   |        3|  0|  1|  1|
   +---------+---+---+---+
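
The shape of the contingency table can be illustrated with a small plain-JavaScript sketch over an in-memory array of row objects. `crosstab` here is a hypothetical stand-alone function, not the EclairJS method; it follows the description above by naming the first field `col1_col2`, rendering null as "null", and filling missing pairs with zero.

```javascript
// Hypothetical stand-alone sketch (not the EclairJS implementation).
function crosstab(rows, col1, col2) {
  var counts = new Map();   // value of col1 -> Map(value of col2 -> count)
  var colNames = [];        // distinct values of col2, in first-seen order
  rows.forEach(function (row) {
    var r = String(row[col1]);   // String(null) === "null"
    var c = String(row[col2]);
    if (!counts.has(r)) counts.set(r, new Map());
    if (colNames.indexOf(c) === -1) colNames.push(c);
    var inner = counts.get(r);
    inner.set(c, (inner.get(c) || 0) + 1);
  });
  var table = [];
  counts.forEach(function (inner, r) {
    var out = {};
    out[col1 + "_" + col2] = r;
    // Pairs that never occur get a count of zero.
    colNames.forEach(function (c) { out[c] = inner.get(c) || 0; });
    table.push(out);
  });
  return table;
}

// Same (key, value) pairs as in the example above.
var rows = [[1, 1], [1, 2], [2, 1], [2, 1], [2, 3], [3, 2], [3, 3]]
  .map(function (p) { return { key: p[0], value: p[1] }; });
var table = crosstab(rows, "key", "value");
```

Each element of `table` corresponds to one row of the printed table, for example the entry with `key_value` equal to "2" holds the counts 2, 0 and 1 for the values 1, 2 and 3.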

freqItems(cols, support) → {module:eclairjs/sql.DataFrame}

Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou (http://dx.doi.org/10.1145/762471.762473). The `support` should be greater than 1e-4. This function is meant for exploratory data analysis; no guarantee is made about the backward compatibility of the schema of the resulting DataFrame.

Parameters:

Name	Type	Description
`cols`	Array.<string>	the names of the columns to search frequent items in.
`support`	number	Optional. The minimum frequency for an item to be considered `frequent`. Should be greater than 1e-4. Defaults to 1% (0.01).

Since:

EclairJS 0.1 Spark 1.4.0

Source:

sql/DataFrameStatFunctions.js, line 156

Returns:

A local DataFrame with an array of frequent items for each column.

Type: module:eclairjs/sql.DataFrame

Example

// find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
   // "a" and "b"
   var freqSingles = df.stat().freqItems(["a", "b"], 0.4);
   freqSingles.show();
   +-----------+-------------+
   |a_freqItems|  b_freqItems|
   +-----------+-------------+
   |    [1, 99]|[-1.0, -99.0]|
   +-----------+-------------+
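
To show why the result may contain false positives, here is a minimal plain-JavaScript sketch of a counter-based frequent-items pass in the style of Karp, Schenker, and Papadimitriou. It is an assumption-laden illustration, not Spark's or EclairJS's implementation: `frequentItems` is a hypothetical name, it works on a single in-memory array, and it keeps `ceil(1 / support) - 1` counters.

```javascript
// Hypothetical sketch (not the Spark/EclairJS implementation).
// Returns a superset of the items whose relative frequency exceeds `support`.
function frequentItems(values, support) {
  if (!(support > 0 && support <= 1)) {
    throw new Error("support must be in (0, 1]");
  }
  var capacity = Math.ceil(1 / support) - 1;  // maximum number of counters
  var counters = new Map();                   // candidate item -> counter
  values.forEach(function (v) {
    if (counters.has(v)) {
      counters.set(v, counters.get(v) + 1);
    } else if (counters.size < capacity) {
      counters.set(v, 1);
    } else {
      // No free counter: decrement every counter and drop those that hit zero.
      counters.forEach(function (count, key) {
        if (count === 1) counters.delete(key);
        else counters.set(key, count - 1);
      });
    }
  });
  // Survivors include every truly frequent item, plus possible false positives.
  return Array.from(counters.keys());
}
```

A single pass guarantees that no item above the threshold is missed; removing the false positives would need a second pass that counts the surviving candidates exactly.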

sampleBy(col, fractions, seed) → {module:eclairjs/sql.DataFrame}

Returns a stratified sample without replacement based on the fraction given on each stratum.

Parameters:

Name	Type	Description
`col`	string	column that defines strata
`fractions`	object	Sampling fraction for each stratum, given as an object that maps each stratum value to a fraction between 0 and 1. A stratum that is not listed is treated as having fraction zero.
`seed`	integer	random seed

Since:

EclairJS 0.1 Spark 1.5.0

Source:

sql/DataFrameStatFunctions.js, line 190

Returns:

A new DataFrame that represents the stratified sample.

Type: module:eclairjs/sql.DataFrame

Example

var df = sqlContext.createDataFrame([[1,1], [1,2], [2,1], [2,1], [2,3], [3,2], [3,3]], schema).toDF("key", "value");
   var fractions = {"1": 1.0, "3": 0.5};
   df.stat().sampleBy("key", fractions, 36).show();
   +---+-----+
   |key|value|
   +---+-----+
   |  1|    1|
   |  1|    2|
   |  3|    2|
   +---+-----+
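
The idea behind stratified sampling without replacement can be sketched in plain JavaScript: each row is kept independently with the probability assigned to its stratum. Everything below is illustrative and assumed rather than taken from EclairJS: `sampleBy` is a stand-alone function over an array of row objects, and `mulberry32` is a small seeded generator chosen only to make the sketch repeatable, so the selected rows will not match Spark's output for the same seed.

```javascript
// Small seeded pseudo-random generator (mulberry32); an assumption of this
// sketch, unrelated to the generator Spark uses.
function mulberry32(seed) {
  var state = seed >>> 0;
  return function () {
    state = (state + 0x6D2B79F5) >>> 0;
    var t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;  // value in [0, 1)
  };
}

// Hypothetical stand-alone sketch (not the EclairJS implementation).
function sampleBy(rows, col, fractions, seed) {
  var random = mulberry32(seed);
  return rows.filter(function (row) {
    var stratum = String(row[col]);
    // Strata missing from `fractions` get fraction zero and are never kept.
    var fraction = Object.prototype.hasOwnProperty.call(fractions, stratum)
      ? fractions[stratum]
      : 0;
    return random() < fraction;
  });
}

var sampleRows = [[1, 1], [1, 2], [2, 1], [2, 1], [2, 3], [3, 2], [3, 3]]
  .map(function (p) { return { key: p[0], value: p[1] }; });
var sample = sampleBy(sampleRows, "key", { "1": 1.0, "3": 0.5 }, 36);
```

With these fractions every row of stratum 1 is kept, no row of stratum 2 is kept, and each row of stratum 3 is kept with probability 0.5; the same seed always selects the same rows.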