Class: RDD

eclairjs. RDD

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.

Constructor

new RDD()

Source:

Methods

aggregate(zeroValue, func1, func2, bindArgs1opt, bindArgs2opt) → {object}

Aggregate the elements of each partition, and then the results for all the partitions, using given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are allowed to modify and return their first argument instead of creating a new U to avoid memory allocation.
Parameters:
Name Type Attributes Description
zeroValue module:eclairjs.RDD (undocumented)
func1 function seqOp - (undocumented) Function with two parameters
func2 function combOp - (undocumented) Function with two parameters
bindArgs1 Array.<Object> <optional>
array whose values will be added to func1's argument list.
bindArgs2 Array.<Object> <optional>
array whose values will be added to func2's argument list.
Source:
Returns:
Type
object

cache() → {module:eclairjs.RDD}

Persist this RDD with the default storage level (`MEMORY_ONLY`).
Source:
Returns:
Type
module:eclairjs.RDD

cartesian(other) → {module:eclairjs.RDD}

Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in `this` and b is in `other`.
Parameters:
Name Type Description
other module:eclairjs.RDD (undocumented)
Source:
Returns:
Type
module:eclairjs.RDD

checkpoint()

Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with `SparkContext#setCheckpointDir` and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.
Source:
Returns:
void

coalesce(numPartitions, shuffle) → {module:eclairjs.RDD}

Return a new RDD that is reduced into `numPartitions` partitions. This results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions. However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is). Note: With shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large. Calling coalesce(1000, shuffle = true) will result in 1000 partitions with the data distributed using a hash partitioner.
Parameters:
Name Type Description
numPartitions int
shuffle boolean
Source:
Returns:
Type
module:eclairjs.RDD

collect() → {Array}

Return an array that contains all of the elements in this RDD.
Source:
Returns:
Type
Array

context() → {module:eclairjs.SparkContext}

Return the SparkContext that this RDD was created on.
Source:
Returns:
Type
module:eclairjs.SparkContext

count() → {integer}

Return the number of elements in the RDD.
Source:
Returns:
Type
integer

countApprox(timeout, confidenceopt) → {module:eclairjs/partial.PartialResult}

:: Experimental :: Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.
Parameters:
Name Type Attributes Description
timeout number (undocumented)
confidence number <optional>
(undocumented)
Source:
Returns:
Type
module:eclairjs/partial.PartialResult

countApproxDistinct(relativeSD) → {number}

Return approximate number of distinct elements in the RDD. The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", available here.
Parameters:
Name Type Description
relativeSD number Relative accuracy. Smaller values create counters that require more space. It must be greater than 0.000017.
Source:
Returns:
Type
number

countByValueApprox(timeout, confidenceopt) → {module:eclairjs/partial.PartialResult}

:: Experimental :: Approximate version of countByValue().
Parameters:
Name Type Attributes Description
timeout number (undocumented)
confidence number <optional>
(undocumented)
Source:
Returns:
Type
module:eclairjs/partial.PartialResult

distinct(numPartitionsopt) → {module:eclairjs.RDD}

Return a new RDD containing the distinct elements in this RDD.
Parameters:
Name Type Attributes Description
numPartitions int <optional>
Source:
Returns:
Type
module:eclairjs.RDD

filter(func, bindArgsopt) → {module:eclairjs.RDD}

Return a new RDD containing only the elements that satisfy a predicate.
Parameters:
Name Type Attributes Description
func function (undocumented) Function with one parameter
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
Type
module:eclairjs.RDD

first() → {object}

Return the first element in this RDD.
Source:
Returns:
Type
object

flatMap(func, bindArgsopt) → {module:eclairjs.RDD}

Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
Parameters:
Name Type Attributes Description
func function (undocumented) - Function with one parameter
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
Type
module:eclairjs.RDD

flatMapToPair(bindArgsopt) → {module:eclairjs.PairRDD}

Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
Parameters:
Name Type Attributes Description
function
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
Type
module:eclairjs.PairRDD

foreach(func, bindArgsopt) → {void}

Applies a function to all elements of this RDD.
Parameters:
Name Type Attributes Description
func function Function with one parameter that returns void
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
Type
void
Example
rdd3.foreach(function(record) {
   var connection = createNewConnection()
   connection.send(record);
   connection.close()
});

foreach(func, bindArgsopt) → {void}

Applies a function to each partition of this RDD.
Parameters:
Name Type Attributes Description
func function Function with one Array parameter that returns void
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
Type
void
Example
rdd3.foreachPartition(function(partitionOfRecords) {
   var connection = createNewConnection()
   partitionOfRecords.forEach(function(record){
      connection.send(record);
   });
   connection.close()
});

getCheckpointFile() → {string}

Gets the name of the directory to which this RDD was checkpointed. This is not defined if the RDD is checkpointed locally.
Source:
Returns:
Type
string

getStorageLevel() → {module:eclairjs/storage.StorageLevel}

Source:
Returns:
Type
module:eclairjs/storage.StorageLevel

glom() → {module:eclairjs.RDD}

Return an RDD created by coalescing all elements within each partition into an array.
Source:
Returns:
Type
module:eclairjs.RDD

groupBy(func, numPartitionsopt, bindArgsopt) → {module:eclairjs.RDD}

Return an RDD of grouped items. Each group consists of a key and a sequence of elements mapping to that key. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting RDD is evaluated. Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using aggregateByKey or reduceByKey will provide much better performance.
Parameters:
Name Type Attributes Description
func function (undocumented) Function with one parameter
numPartitions number <optional>
How many partitions to use in the resulting RDD (if non-zero partitioner is ignored)
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
Type
module:eclairjs.RDD

groupBy() → {int}

A unique ID for this RDD (within its SparkContext).
Source:
Returns:
Type
int

intersection(other) → {module:eclairjs.RDD}

Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did. Note that this method performs a shuffle internally.
Parameters:
Name Type Description
other module:eclairjs.RDD the other RDD
Source:
Returns:
Type
module:eclairjs.RDD

isCheckpointed() → {boolean}

Return whether this RDD is checkpointed and materialized, either reliably or locally.
Source:
Returns:
Type
boolean

isEmpty() → {boolean}

Source:
Returns:
true if and only if the RDD contains no elements at all. Note that an RDD
Type
boolean

keyBy(func, bindArgsopt) → {module:eclairjs.RDD}

Creates tuples of the elements in this RDD by applying `f`.
Parameters:
Name Type Attributes Description
func function (undocumented)
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
Type
module:eclairjs.RDD

map(func, bindArgsopt) → {module:eclairjs.RDD}

Return a new RDD by applying a function to all elements of this RDD.
Parameters:
Name Type Attributes Description
func function (undocumented) Function with one parameter
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
Type
module:eclairjs.RDD

mapPartitions(func, preservesPartitioningopt, bindArgsopt) → {module:eclairjs.RDD}

Return a new RDD by applying a function to each partition of this RDD. Similar to map, but runs separately on each partition (block) of the RDD, so func must accept an Array. func should return a array rather than a single item.
Parameters:
Name Type Attributes Description
func function (undocumented) Function with one parameter
preservesPartitioning boolean <optional>
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
Type
module:eclairjs.RDD

mapPartitionsWithIndex(func, preservesPartitioningopt, bindArgsopt) → {module:eclairjs.RDD}

Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition. `preservesPartitioning` indicates whether the input function preserves the partitioner, which should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
Parameters:
Name Type Attributes Description
func function (undocumented) Function with one parameter
preservesPartitioning boolean <optional>
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
Type
module:eclairjs.RDD

mapToFloat((function), bindArgsopt) → {module:eclairjs.FloatRDD}

Return a new RDD by applying a function to all elements of this RDD.
Parameters:
Name Type Attributes Description
(function) func - (undocumented) Function with one parameter that returns tuple
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
Type
module:eclairjs.FloatRDD

mapToPair((function), bindArgsopt) → {module:eclairjs.PairRDD}

Return a new RDD by applying a function to all elements of this RDD.
Parameters:
Name Type Attributes Description
(function) func - (undocumented) Function with one parameter that returns tuple
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
Type
module:eclairjs.PairRDD

max((function), bindArgsopt) → {object}

Returns the max of this RDD as defined by the implicit Ordering[T].
Parameters:
Name Type Attributes Description
(function) comparator - Compares its two arguments for order. Returns a negative integer, zero, or a positive integer as the first argument is less than, equal to, or greater than the second.
bindArgs Array.<Object> <optional>
array whose values will be added to comparator's argument list.
Source:
Returns:
the maximum element of the RDD
Type
object

min((function), bindArgsopt) → {object}

Returns the min of this RDD as defined by the implicit Ordering[T].
Parameters:
Name Type Attributes Description
(function) comparator - Compares its two arguments for order. Returns a negative integer, zero, or a positive integer as the second argument is less than, equal to, or greater than the first.
bindArgs Array.<Object> <optional>
array whose values will be added to compartor's argument list.
Source:
Returns:
the minimum element of the RDD
Type
object

name() → {string}

A friendly name for this RDD
Source:
Returns:
Type
string

persist(newLevel) → {module:eclairjs.RDD}

Parameters:
Name Type Description
newLevel module:eclairjs/storage.StorageLevel
Source:
Returns:
Type
module:eclairjs.RDD

pipe(command, env) → {module:eclairjs.RDD}

Return an RDD created by piping elements to a forked external process. The print behavior can be customized by providing two functions.
Parameters:
Name Type Description
command List | string command to run in forked process.
env Map environment variables to set.
Source:
Returns:
the result RDD
Type
module:eclairjs.RDD

reduce(func, bindArgsopt) → {object}

Reduces the elements of this RDD using the specified commutative and associative binary operator.
Parameters:
Name Type Attributes Description
func function (undocumented) Function with two parameters
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
Type
object

repartition(numPartitions) → {module:eclairjs.RDD}

Return a new RDD that has exactly numPartitions partitions. Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using `coalesce`, which can avoid performing a shuffle.
Parameters:
Name Type Description
numPartitions int (undocumented)
Source:
Returns:
Type
module:eclairjs.RDD

sample(withReplacement, fraction, seedopt) → {module:eclairjs.RDD}

Return a sampled subset of this RDD.
Parameters:
Name Type Attributes Description
withReplacement boolean can elements be sampled multiple times (replaced when sampled out)
fraction number expected size of the sample as a fraction of this RDD's size without replacement: probability that each element is chosen; fraction must be [0, 1] with replacement: expected number of times each element is chosen; fraction must be >= 0
seed number <optional>
seed for the random number generator
Source:
Returns:
Type
module:eclairjs.RDD

saveAsObjectFile(path, overwriteopt) → {void}

Save this RDD as a SequenceFile of serialized objects.
Parameters:
Name Type Attributes Description
path string
overwrite boolean <optional>
defaults to false, if true overwrites file if it exists
Source:
Returns:
Type
void

saveAsTextFile(path, overwriteopt) → {void}

Save this RDD as a text file, using string representations of elements.
Parameters:
Name Type Attributes Description
path string
overwrite boolean <optional>
defaults to false, if true overwrites file if it exists
Source:
Returns:
Type
void

setName() → {module:eclairjs.RDD}

Assign a name to this RDD.
Source:
Returns:
Type
module:eclairjs.RDD

sortBy(func, ascending, numPartitions, bindArgsopt) → {module:eclairjs.RDD}

Return this RDD sorted by the given key function.
Parameters:
Name Type Attributes Description
func function (undocumented) Function with one parameter
ascending boolean
numPartitions int
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
Type
module:eclairjs.RDD

sparkContext() → {module:eclairjs.SparkContext}

The SparkContext that created this RDD.
Source:
Returns:
Type
module:eclairjs.SparkContext

subtract(other, numPartitionsopt) → {module:eclairjs.RDD}

Return an RDD with the elements from `this` that are not in `other`.
Parameters:
Name Type Attributes Description
other module:eclairjs.RDD
numPartitions int <optional>
Source:
Returns:
Type
module:eclairjs.RDD

take(num) → {Array}

Take the first num elements of the RDD.
Parameters:
Name Type Description
num int
Source:
Returns:
Type
Array

takeOrdered(num, funcopt, bindArgsopt) → {Array}

Returns the first k (smallest) elements from this RDD as defined by the specified implicit Ordering[T] and maintains the ordering. This does the opposite of top.
Parameters:
Name Type Attributes Description
num number the number of elements to return
func function <optional>
compares to arguments
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
an array of top elements
Type
Array
Example
var result = rdd.takeOrdered(25, function(a, b){
      return a > b ? -1 : a == b? 0 : 1;
    });

takeSample(withReplacement, num, seedopt) → {Array}

Return a fixed-size sampled subset of this RDD in an array
Parameters:
Name Type Attributes Description
withReplacement boolean whether sampling is done with replacement
num number size of the returned sample
seed number <optional>
seed for the random number generator
Source:
Returns:
sample of specified size in an array
Type
Array

toArray() → {Array}

Return an array that contains all of the elements in this RDD.
Source:
Returns:
Type
Array

toDebugString() → {string}

A description of this RDD and its recursive dependencies for debugging.
Source:
Returns:
Type
string

top(num) → {Array}

Returns the top k (largest) elements from this RDD as defined by the specified implicit Ordering[T]. This does the opposite of takeOrdered. For example: {{{ sc.parallelize(Seq(10, 4, 2, 12, 3)).top(1) // returns Array(12) sc.parallelize(Seq(2, 3, 4, 5, 6)).top(2) // returns Array(6, 5) }}}
Parameters:
Name Type Description
num number k, the number of top elements to return
Source:
Returns:
an array of top elements
Type
Array

toString() → {string}

Source:
Returns:
Type
string

treeAggregate(zeroValue, func1, func2, bindArgs1opt, bindArgs2opt) → {object}

Aggregates the elements of this RDD in a multi-level tree pattern.
Parameters:
Name Type Attributes Description
zeroValue (undocumented)
func1 function (undocumented) Function with two parameters
func2 function combOp - (undocumented) Function with two parameters
bindArgs1 Array.<Object> <optional>
array whose values will be added to func1's argument list.
bindArgs2 Array.<Object> <optional>
array whose values will be added to func2's argument list.
Source:
See:
  • [[org.apache.spark.rdd.RDD#aggregate]]
Returns:
Type
object

treeReduce(func, depth, bindArgsopt) → {object}

Reduces the elements of this RDD in a multi-level tree pattern.
Parameters:
Name Type Attributes Description
func function (undocumented) Function with one parameter
depth number suggested depth of the tree (default: 2)
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
See:
  • [[org.apache.spark.rdd.RDD#reduce]]
Returns:
Type
object

union(other) → {module:eclairjs.RDD}

Return the union of this RDD and another one. Any identical elements will appear multiple times (use `.distinct()` to eliminate them).
Parameters:
Name Type Description
other module:eclairjs.RDD (undocumented)
Source:
Returns:
Type
module:eclairjs.RDD

unpersist(blocking) → {module:eclairjs.RDD}

Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
Parameters:
Name Type Description
blocking boolean Whether to block until all blocks are deleted.
Source:
Returns:
This RDD.
Type
module:eclairjs.RDD

zip(other) → {module:eclairjs.RDD}

Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the *same number of partitions* and the *same number of elements in each partition* (e.g. one was made through a map on the other).
Parameters:
Name Type Description
other module:eclairjs.RDD (undocumented)
Source:
Returns:
Type
module:eclairjs.RDD

zipPartitions(rdd2, func, bindArgsopt) → {module:eclairjs.RDD}

Zip this RDD's partitions with another RDD and return a new RDD by applying a function to the zipped partitions. Assumes that both the RDDs have the same number of partitions, but does not require them to have the same number of elements in each partition.
Parameters:
Name Type Attributes Description
rdd2 module:eclairjs.RDD
func function Function with two parameters
bindArgs Array.<Object> <optional>
array whose values will be added to func's argument list.
Source:
Returns:
Type
module:eclairjs.RDD

zipWithIndex() → {module:eclairjs.RDD}

Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type. This method needs to trigger a spark job when this RDD contains more than one partitions. Note that some RDDs, such as those returned by groupBy(), do not guarantee order of elements in a partition. The index assigned to each element is therefore not guaranteed, and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
Source:
Returns:
Type
module:eclairjs.RDD

zipWithUniqueId() → {module:eclairjs.RDD}

Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k, 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]]. Note that some RDDs, such as those returned by groupBy(), do not guarantee order of elements in a partition. The unique ID assigned to each element is therefore not guaranteed, and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
Source:
Returns:
Type
module:eclairjs.RDD