new DataFrameStatFunctions()
Statistical functions for DataFrames.
- Since:
- EclairJS 0.1 Spark 1.4.0
Methods
corr(col1, col2, method) → {number}
Calculates the correlation of two columns of a DataFrame. Currently only supports the Pearson
Correlation Coefficient. For Spearman Correlation, consider using RDD methods found in
MLlib's Statistics.
Parameters:
Name | Type | Attributes | Description |
---|---|---|---|
col1 | string | | the name of the column |
col2 | string | | the name of the column to calculate the correlation against |
method | string | <optional> | the correlation method; currently only "pearson" is supported |
- Since:
- EclairJS 0.1 Spark 1.4.0
Returns:
The Pearson Correlation Coefficient.
- Type
- number
Example
var stat = peopleDataFrame.stat().corr("income", "networth", "pearson");
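To make the returned value concrete, here is a minimal plain-JavaScript sketch of the Pearson coefficient over two numeric arrays. It does not call EclairJS; the helper name `pearson` and the sample data are hypothetical and only illustrate the quantity `corr` is documented to return.

```javascript
// Hypothetical helper, not part of EclairJS: Pearson correlation of two
// equal-length numeric arrays.
function pearson(xs, ys) {
  var n = xs.length;
  var meanX = 0, meanY = 0, i;
  for (i = 0; i < n; i++) {
    meanX += xs[i];
    meanY += ys[i];
  }
  meanX /= n;
  meanY /= n;

  // Accumulate the co-deviation and the two squared deviations.
  var sxy = 0, sxx = 0, syy = 0;
  for (i = 0; i < n; i++) {
    var dx = xs[i] - meanX;
    var dy = ys[i] - meanY;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

var income = [10, 20, 30, 40];
var networth = [15, 25, 35, 45]; // exactly linear in income
var r = pearson(income, networth);
console.log(r);
```

The result always lies between -1 and 1; values near either end indicate a strong linear relationship between the two columns.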
cov(col1, col2) → {number}
Calculate the sample covariance of two numerical columns of a DataFrame.
Parameters:
Name | Type | Description |
---|---|---|
col1 | string | the name of the first column |
col2 | string | the name of the second column |
- Since:
- EclairJS 0.1 Spark 1.4.0
Returns:
the covariance of the two columns.
- Type
- number
Example
var stat = peopleDataFrame.stat().cov("income", "networth");
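As an illustration of what "sample covariance" means here, the following plain-JavaScript sketch divides the summed co-deviations by `n - 1` rather than `n`. The helper name `sampleCovariance` and the data are hypothetical; this is not the EclairJS implementation.

```javascript
// Hypothetical helper, not part of EclairJS: sample covariance of two
// equal-length numeric arrays (divides by n - 1).
function sampleCovariance(xs, ys) {
  var n = xs.length;
  var meanX = 0, meanY = 0, i;
  for (i = 0; i < n; i++) {
    meanX += xs[i];
    meanY += ys[i];
  }
  meanX /= n;
  meanY /= n;

  var sum = 0;
  for (i = 0; i < n; i++) {
    sum += (xs[i] - meanX) * (ys[i] - meanY);
  }
  return sum / (n - 1);
}

var c = sampleCovariance([1, 2, 3, 4], [2, 4, 6, 8]);
console.log(c);
```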
crosstab(col1, col2) → {module:eclairjs/sql.DataFrame}
Computes a pair-wise frequency table of the given columns. Also known as a contingency table.
The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero
pair frequencies will be returned.
The first column of each row will be the distinct values of `col1` and the column names will
be the distinct values of `col2`. The name of the first column will be `$col1_$col2`. Counts
will be returned as `Long`s. Pairs that have no occurrences will have zero as their counts.
Null elements will be replaced by "null", and back ticks will be dropped from elements if they
exist.
Parameters:
Name | Type | Description |
---|---|---|
col1 | string | The name of the first column. Distinct items will make the first item of each row. |
col2 | string | The name of the second column. Distinct items will make the column names of the DataFrame. |
- Since:
- EclairJS 0.1 Spark 1.4.0
Returns:
A DataFrame containing the contingency table.
- Type
- module:eclairjs/sql.DataFrame
Example
var df = sqlContext.createDataFrame([[1,1], [1,2], [2,1], [2,1], [2,3], [3,2], [3,3]], schema);
var ct = df.stat().crosstab("key", "value");
ct.show();
+---------+---+---+---+
|key_value|  1|  2|  3|
+---------+---+---+---+
|        2|  2|  0|  1|
|        1|  1|  1|  0|
|        3|  0|  1|  1|
+---------+---+---+---+
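The counting behind a contingency table can be sketched in plain JavaScript as follows. The helper `crosstabCounts` is hypothetical and works on an in-memory array of two-element rows instead of a DataFrame; it uses the same data as the example above and fills in zero for pairs that never occur.

```javascript
// Hypothetical helper, not part of EclairJS: count how often each
// (first value, second value) pair occurs in an array of two-element rows.
function crosstabCounts(rows) {
  var counts = {};    // counts[rowValue][colValue] -> frequency
  var colValues = {}; // distinct values seen in the second position
  var has = Object.prototype.hasOwnProperty;

  rows.forEach(function (row) {
    var r = String(row[0]);
    var c = String(row[1]);
    if (!has.call(counts, r)) counts[r] = {};
    counts[r][c] = (has.call(counts[r], c) ? counts[r][c] : 0) + 1;
    colValues[c] = true;
  });

  // Pairs with no occurrences get an explicit zero count.
  Object.keys(counts).forEach(function (r) {
    Object.keys(colValues).forEach(function (c) {
      if (!has.call(counts[r], c)) counts[r][c] = 0;
    });
  });
  return counts;
}

var table = crosstabCounts([[1, 1], [1, 2], [2, 1], [2, 1], [2, 3], [3, 2], [3, 3]]);
console.log(JSON.stringify(table));
```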
freqItems(cols, support) → {module:eclairjs/sql.DataFrame}
Finds frequent items for columns, possibly with false positives, using the
frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou
(http://dx.doi.org/10.1145/762471.762473).
The `support` should be greater than 1e-4.
This function is meant for exploratory data analysis, as we make no guarantee about the
backward compatibility of the schema of the resulting DataFrame.
Parameters:
Name | Type | Description |
---|---|---|
cols | Array.<string> | The names of the columns to search frequent items in. |
support | number | The minimum frequency for an item to be considered `frequent`. Should be greater than 1e-4. Defaults to 1% (0.01). |
- Since:
- EclairJS 0.1 Spark 1.4.0
Returns:
A Local DataFrame with the Array of frequent items for each column.
- Type
- module:eclairjs/sql.DataFrame
Example
// find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
// "a" and "b"
var freqSingles = df.stat().freqItems(["a", "b"], 0.4);
freqSingles.show();
+-----------+-------------+
|a_freqItems|  b_freqItems|
+-----------+-------------+
|    [1, 99]|[-1.0, -99.0]|
+-----------+-------------+
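The reason false positives can occur is easier to see from a sketch of the counter-based approach. The plain-JavaScript function below is a hypothetical illustration of that family of algorithms, not the EclairJS or Spark implementation; in particular, the counter budget of `floor(1 / support)` is an assumption made for this example.

```javascript
// Hypothetical sketch, not part of EclairJS: one-pass frequent-item
// candidates using a bounded set of counters.
// Assumption: at most floor(1 / support) counters are kept.
function frequentItems(values, support) {
  var maxCounters = Math.floor(1 / support);
  var counters = {};
  var size = 0;
  var has = Object.prototype.hasOwnProperty;

  values.forEach(function (v) {
    var key = String(v);
    if (has.call(counters, key)) {
      counters[key] += 1;
    } else if (size < maxCounters) {
      counters[key] = 1;
      size += 1;
    } else {
      // No free counter: decrement every counter and drop those that hit zero.
      Object.keys(counters).forEach(function (k) {
        counters[k] -= 1;
        if (counters[k] === 0) {
          delete counters[k];
          size -= 1;
        }
      });
    }
  });

  // Surviving keys are candidates; infrequent items may remain among them.
  return Object.keys(counters);
}

// "1" makes up 6 of the 10 values, so it must survive with support 0.4.
var items = frequentItems([1, 1, 2, 1, 3, 1, 4, 1, 5, 1], 0.4);
console.log(items);
```

Every item whose relative frequency exceeds `support` is guaranteed to survive the pass, but an item that happens to hold a counter at the end is returned even if it is rare. A second pass that counts the candidates exactly would remove such false positives.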
sampleBy(col, fractions, seed) → {module:eclairjs/sql.DataFrame}
Returns a stratified sample without replacement based on the fraction given on each stratum.
Parameters:
Name | Type | Description |
---|---|---|
col | string | column that defines strata |
fractions | object | Sampling fraction for each stratum, given as an object whose keys are stratum values of `col` and whose values are of type `number`. If a stratum is not specified, its fraction is treated as zero. |
seed | integer | random seed |
- Since:
- EclairJS 0.1 Spark 1.5.0
Returns:
A new DataFrame that represents the stratified sample.
- Type
- module:eclairjs/sql.DataFrame
Example
var df = sqlContext.createDataFrame([[1,1], [1,2], [2,1], [2,1], [2,3], [3,2], [3,3]], schema).toDF("key", "value");
var fractions = {"1": 1.0, "3": 0.5};
df.stat().sampleBy("key", fractions, 36).show();
+---+-----+
|key|value|
+---+-----+
|  1|    1|
|  1|    2|
|  3|    2|
+---+-----+
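Stratified sampling without replacement can be pictured as keeping each row independently with the probability assigned to its stratum. The plain-JavaScript sketch below illustrates that idea on an in-memory array. The helpers `seededRandom` and `sampleRowsBy` are hypothetical, and the small linear congruential generator is only a stand-in so the result is repeatable; it is not the generator EclairJS or Spark uses, so the selected rows will not match the output above.

```javascript
// Hypothetical stand-in PRNG: a linear congruential generator returning
// values in [0, 1). Used only to make the sketch repeatable for a given seed.
function seededRandom(seed) {
  var state = seed >>> 0;
  return function () {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Hypothetical helper, not part of EclairJS: keep each row with the
// probability assigned to its stratum; unspecified strata get fraction 0.
function sampleRowsBy(rows, keyIndex, fractions, seed) {
  var rand = seededRandom(seed);
  var has = Object.prototype.hasOwnProperty;
  return rows.filter(function (row) {
    var stratum = String(row[keyIndex]);
    var fraction = has.call(fractions, stratum) ? fractions[stratum] : 0;
    return rand() < fraction;
  });
}

var rows = [[1, 1], [1, 2], [2, 1], [2, 1], [2, 3], [3, 2], [3, 3]];
var sample = sampleRowsBy(rows, 0, {"1": 1.0, "3": 0.5}, 36);
console.log(JSON.stringify(sample));
```

With these fractions every row in stratum `1` is kept, no row in stratum `2` is kept, and each row in stratum `3` is kept with probability 0.5. Reusing the same seed reproduces the same sample.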