Class: DataFrameStatFunctions

eclairjs/sql.DataFrameStatFunctions

new DataFrameStatFunctions()

Statistic functions for DataFrames.
Since:
  • EclairJS 0.1 Spark 1.4.0

Methods

corr(col1, col2, method<opt>) → {number}

Calculates the correlation of two columns of a DataFrame. Currently only supports the Pearson Correlation Coefficient. For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.
Parameters:
Name Type Attributes Description
col1 string the name of the column
col2 string the name of the column to calculate the correlation against
method string <optional> the correlation method; currently only "pearson" is supported
Since:
  • EclairJS 0.1 Spark 1.4.0
Returns:
The Pearson Correlation Coefficient.
Type
number
Example
var stat = peopleDataFrame.stat().corr("income", "networth", "pearson");
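
To make the returned value concrete, the following plain-JavaScript sketch computes the Pearson Correlation Coefficient for two in-memory arrays. It does not use EclairJS or Spark; the function name `pearson` and the sample data are assumptions made for illustration only.

```javascript
// Plain-JavaScript sketch of the Pearson Correlation Coefficient.
// Illustration only: this is not the EclairJS or Spark implementation.
function pearson(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("need two equal-length arrays with at least 2 values");
  }
  var n = xs.length;
  var meanX = xs.reduce(function (a, b) { return a + b; }, 0) / n;
  var meanY = ys.reduce(function (a, b) { return a + b; }, 0) / n;
  var sxy = 0, sxx = 0, syy = 0;
  for (var i = 0; i < n; i++) {
    var dx = xs[i] - meanX;
    var dy = ys[i] - meanY;
    sxy += dx * dy; // co-deviation of the two columns
    sxx += dx * dx; // squared deviation of the first column
    syy += dy * dy; // squared deviation of the second column
  }
  // The result is NaN if either column is constant (zero variance).
  return sxy / Math.sqrt(sxx * syy);
}

// Hypothetical sample data: networth rises linearly with income.
var income = [10, 20, 30, 40];
var networth = [15, 25, 35, 45];
console.log(pearson(income, networth)); // prints 1
```

A value near 1 or -1 indicates a strong linear relationship; a value near 0 indicates little linear relationship.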

cov(col1, col2) → {Promise.<number>}

Calculates the sample covariance of two numerical columns of a DataFrame.
Parameters:
Name Type Description
col1 string the name of the first column
col2 string the name of the second column
Since:
  • EclairJS 0.1 Spark 1.4.0
Returns:
the covariance of the two columns.
Type
Promise.<number>
Example
val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
     .withColumn("rand2", rand(seed=27))
   df.stat.cov("rand1", "rand2")
   res1: Double = 0.065...
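
As a reference for what "sample covariance" means, here is a minimal plain-JavaScript sketch that divides the summed co-deviations by n - 1. It does not call EclairJS; the function name `sampleCovariance` and the input arrays are assumptions for illustration.

```javascript
// Plain-JavaScript sketch of the sample covariance (divisor n - 1).
// Illustration only: this is not the EclairJS or Spark implementation.
function sampleCovariance(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("need two equal-length arrays with at least 2 values");
  }
  var n = xs.length;
  var meanX = xs.reduce(function (a, b) { return a + b; }, 0) / n;
  var meanY = ys.reduce(function (a, b) { return a + b; }, 0) / n;
  var sxy = 0;
  for (var i = 0; i < n; i++) {
    sxy += (xs[i] - meanX) * (ys[i] - meanY);
  }
  return sxy / (n - 1);
}

// Hypothetical data: the second column is twice the first.
console.log(sampleCovariance([1, 2, 3, 4], [2, 4, 6, 8])); // about 3.33 (10 / 3)
```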

crosstab(col1, col2) → {module:eclairjs/sql.DataFrame}

Computes a pair-wise frequency table of the given columns. Also known as a contingency table. The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned. The first column of each row will be the distinct values of `col1` and the column names will be the distinct values of `col2`. The name of the first column will be `$col1_$col2`. Counts will be returned as `Long`s. Pairs that have no occurrences will have zero as their counts. Null elements will be replaced by "null", and back ticks will be dropped from elements if they exist.
Parameters:
Name Type Description
col1 string The name of the first column. Distinct items will make the first item of each row.
col2 string The name of the second column. Distinct items will make the column names of the DataFrame.
Since:
  • EclairJS 0.1 Spark 1.4.0
Returns:
A DataFrame containing the contingency table.
Type
module:eclairjs/sql.DataFrame
Example
val df = sqlContext.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2),
     (3, 3))).toDF("key", "value")
   val ct = df.stat.crosstab("key", "value")
   ct.show()
   +---------+---+---+---+
   |key_value|  1|  2|  3|
   +---------+---+---+---+
   |        2|  2|  0|  1|
   |        1|  1|  1|  0|
   |        3|  0|  1|  1|
   +---------+---+---+---+
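
The shape of the result can be sketched in plain JavaScript by counting pairs in an array of objects. This is an illustration under stated assumptions, not the EclairJS implementation: the standalone function `crosstab` and the `rows` array are hypothetical, and the size limits described above are not enforced.

```javascript
// Plain-JavaScript sketch of a pair-wise frequency (contingency) table.
// Illustration only: this is not the EclairJS or Spark implementation.
function crosstab(rows, col1, col2) {
  var counts = {};    // counts[v1][v2] = number of rows with that pair
  var colValues = []; // distinct values of col2, in first-seen order
  rows.forEach(function (row) {
    var v1 = String(row[col1]); // String(null) yields "null"
    var v2 = String(row[col2]);
    if (!counts.hasOwnProperty(v1)) { counts[v1] = {}; }
    if (colValues.indexOf(v2) === -1) { colValues.push(v2); }
    counts[v1][v2] = (counts[v1][v2] || 0) + 1;
  });
  // One output row per distinct col1 value; missing pairs get a count of 0.
  return Object.keys(counts).map(function (v1) {
    var out = {};
    out[col1 + "_" + col2] = v1;
    colValues.forEach(function (v2) { out[v2] = counts[v1][v2] || 0; });
    return out;
  });
}

// Same pairs as the example above.
var rows = [[1, 1], [1, 2], [2, 1], [2, 1], [2, 3], [3, 2], [3, 3]]
  .map(function (p) { return { key: p[0], value: p[1] }; });
var table = crosstab(rows, "key", "value");
console.log(table);
```

For the key 2, the sketch reports counts of 2, 0, and 1 for the values 1, 2, and 3, matching the corresponding row of the table shown above.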

freqItems(cols, support) → {module:eclairjs/sql.DataFrame}

Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou (http://dx.doi.org/10.1145/762471.762473). The `support` should be greater than 1e-4. This function is meant for exploratory data analysis, as no guarantee is made about the backward compatibility of the schema of the resulting DataFrame.
Parameters:
Name Type Description
cols Array.<string> the names of the columns to search frequent items in.
support number Optional. The minimum frequency for an item to be considered `frequent`. Should be greater than 1e-4. Defaults to 1% (0.01).
Since:
  • EclairJS 0.1 Spark 1.4.0
Returns:
A Local DataFrame with the Array of frequent items for each column.
Type
module:eclairjs/sql.DataFrame
Example
// find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
   // "a" and "b"
   var freqSingles = df.stat().freqItems(["a", "b"], 0.4)
   freqSingles.show()
   +-----------+-------------+
   |a_freqItems|  b_freqItems|
   +-----------+-------------+
   |    [1, 99]|[-1.0, -99.0]|
   +-----------+-------------+
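
The counter-based idea behind this kind of frequent-item search, and why false positives can occur, can be sketched in plain JavaScript. The sketch below is a single-pass, bounded-counter summary in the spirit of the cited algorithm; it is not the Spark or EclairJS implementation, and the function name `frequentItems` and the sample data are assumptions for illustration.

```javascript
// Plain-JavaScript sketch of a bounded-counter frequent-item summary.
// Illustration only: this is not the EclairJS or Spark implementation.
function frequentItems(items, support) {
  // With m counters, every item occurring more than n / (m + 1) times
  // survives, which covers every item with frequency above `support`.
  var maxCounters = Math.ceil(1 / support) - 1;
  var counters = new Map();
  items.forEach(function (item) {
    if (counters.has(item)) {
      counters.set(item, counters.get(item) + 1);
    } else if (counters.size < maxCounters) {
      counters.set(item, 1);
    } else {
      // No free counter: decrement all and drop those that reach zero.
      Array.from(counters.keys()).forEach(function (key) {
        var c = counters.get(key) - 1;
        if (c === 0) { counters.delete(key); } else { counters.set(key, c); }
      });
    }
  });
  // No second pass is made to verify counts, so the result may contain
  // items whose true frequency is at or below the support threshold.
  return Array.from(counters.keys());
}

// Hypothetical column: 1 occurs 5 times, 99 occurs 4 times, 5 occurs once.
var column = [1, 99, 1, 99, 1, 99, 1, 99, 1, 5];
console.log(frequentItems(column, 0.4)); // prints [ 1, 99 ]
```

Here 1 occurs in 50% of the values and is guaranteed to be reported. 99 occurs in exactly 40%, which is not above the threshold, yet it is still reported: an instance of the false positives mentioned above.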

sampleBy(col, fractions, seed) → {module:eclairjs/sql.DataFrame}

Returns a stratified sample without replacement based on the fraction given on each stratum.
Parameters:
Name Type Description
col string column that defines strata
fractions object the sampling fraction for each stratum, given as an object whose keys are stratum values and whose values are fractions of type `number`. If a stratum is not specified, its fraction is treated as zero.
seed integer random seed
Since:
  • EclairJS 0.1 Spark 1.5.0
Returns:
a new DataFrame that represents the stratified sample
Type
module:eclairjs/sql.DataFrame
Example
var df = sqlContext.createDataFrame([[1,1], [1,2], [2,1], [2,1], [2,3], [3,2], [3,3]], schema).toDF("key", "value");
   var fractions = {"1": 1.0, "3": 0.5};
   df.stat().sampleBy("key", fractions, 36).show()
   +---+-----+
   |key|value|
   +---+-----+
   |  1|    1|
   |  1|    2|
   |  3|    2|
   +---+-----+
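
One common way to draw a stratified sample without replacement is to keep each row independently with the probability assigned to its stratum. The plain-JavaScript sketch below illustrates that approach; it is an assumption about the general technique, not the EclairJS or Spark implementation. The helper `mulberry32` is a small seedable generator used here only because `Math.random` cannot be seeded, and the standalone `sampleBy` function and `rows` data are hypothetical.

```javascript
// Small seedable pseudo-random generator returning values in [0, 1).
// Stand-in for illustration; Spark uses its own generator.
function mulberry32(seed) {
  var a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    var t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Plain-JavaScript sketch of stratified sampling without replacement.
// Illustration only: this is not the EclairJS or Spark implementation.
function sampleBy(rows, col, fractions, seed) {
  var rand = mulberry32(seed);
  return rows.filter(function (row) {
    var stratum = String(row[col]);
    // Strata missing from `fractions` are sampled with fraction 0.
    var fraction = fractions.hasOwnProperty(stratum) ? fractions[stratum] : 0;
    return rand() < fraction; // each row is kept at most once
  });
}

var rows = [[1, 1], [1, 2], [2, 1], [2, 1], [2, 3], [3, 2], [3, 3]]
  .map(function (p) { return { key: p[0], value: p[1] }; });
var sample = sampleBy(rows, "key", { "1": 1.0, "3": 0.5 }, 36);
console.log(sample);
```

With these fractions every row of stratum 1 is kept, every row of stratum 2 is dropped, and each row of stratum 3 is kept with probability 0.5, so the number of stratum 3 rows in the result depends on the seed.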