Package com.gengoai.stream.spark
Class SparkStream<T>
- java.lang.Object
-
- com.gengoai.stream.spark.SparkStream<T>
-
- Type Parameters:
T
- the component type of the stream
- All Implemented Interfaces:
MStream<T>
,Serializable
,AutoCloseable
,Iterable<T>
public class SparkStream<T> extends Object implements MStream<T>, Serializable
A MStream wrapper around a Spark RDD.- Author:
- David B. Bracewell
- See Also:
- Serialized Form
-
-
Constructor Summary
Constructors Constructor Description SparkStream(MStream<T> mStream)
Instantiates a new Spark stream.SparkStream(org.apache.spark.api.java.JavaRDD<T> rdd)
Instantiates a new Spark stream.
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description SparkStream<T>
cache()
Caches the stream.void
close()
List<T>
collect()
Collects the items in the stream as a list<R> R
collect(Collector<? super T,?,R> collector)
Performs a reduction on the string using hte given collector.long
count()
The number of items in the streamMap<T,Long>
countByValue()
Counts the number of times each item occurs in the streamSparkStream<T>
distinct()
Removes duplicates from the streamSparkStream<T>
filter(SerializablePredicate<? super T> predicate)
Filters the stream.Optional<T>
first()
Gets the first item in the stream<R> SparkStream<R>
flatMap(SerializableFunction<? super T,Stream<? extends R>> mapper)
Maps the objects in this stream to one or more new objects using the given function.<R,U>
SparkPairStream<R,U>flatMapToPair(SerializableFunction<? super T,Stream<? extends Map.Entry<? extends R,? extends U>>> function)
Maps the objects in this stream to one or more new key-value pairs using the given function.T
fold(T zeroValue, SerializableBinaryOperator<T> operator)
Performs a reduction on the elements of this stream using the given binary operator.void
forEach(SerializableConsumer<? super T> consumer)
Performs an operation on each item in the streamvoid
forEachLocal(SerializableConsumer<? super T> consumer)
Performs an operation on each item in the stream ensuring that is done locally and not distributed.SparkStreamingContext
getContext()
Gets the context used to create the streamorg.apache.spark.api.java.JavaRDD<T>
getRDD()
Gets the wrapped rdd.<U> SparkPairStream<U,Iterable<T>>
groupBy(SerializableFunction<? super T,? extends U> function)
Groups the items in the stream using the given function that maps objects to key valuesMStream<T>
intersection(MStream<T> other)
Returns a new MStream containing the intersection of elements in this stream and the argument stream.boolean
isDistributed()
Is distributed boolean.boolean
isEmpty()
Determines if the stream is empty or notIterator<T>
iterator()
Gets an iterator for the streamStream<T>
javaStream()
Converts this stream into a java streamSparkStream<T>
limit(long number)
Limits the stream to the firstnumber
items.<R> SparkStream<R>
map(SerializableFunction<? super T,? extends R> function)
Maps the objects in the stream using the given function<R> SparkStream<R>
mapPartitions(SerializableFunction<Iterator<? super T>,Stream<R>> function)
Maps the objects in the stream by block using the given functioncom.gengoai.stream.spark.SparkDoubleStream
mapToDouble(SerializableToDoubleFunction<? super T> function)
Maps objects in this stream to double values<R,U>
SparkPairStream<R,U>mapToPair(SerializableFunction<? super T,? extends Map.Entry<? extends R,? extends U>> function)
Maps the objects in this stream to a key-value pair using the given function.Optional<T>
max(SerializableComparator<? super T> comparator)
Returns the max item in the stream using the given comparator to compare items.Optional<T>
min(SerializableComparator<? super T> comparator)
Returns the min item in the stream using the given comparator to compare items.MStream<T>
onClose(SerializableRunnable closeHandler)
Sets the handler to call when the stream is closed.SparkStream<T>
parallel()
Ensures that the stream is parallel or distributed.MStream<Stream<T>>
partition(long partitionSize)
Partitions the stream into iterables each of size<=
partitionSize
.MStream<T>
persist(StorageLevel storageLevel)
Persists the stream to the given storage levelOptional<T>
reduce(SerializableBinaryOperator<T> reducer)
Performs a reduction on the elements of this stream using the given binary operator.SparkStream<T>
repartition(int numPartitions)
Repartitions the stream to the given number of partitions.SparkStream<T>
sample(boolean withReplacement, int number)
Randomly samplesnumber
items from the stream.void
saveAsTextFile(Resource location)
Save as the stream to a text file at the given location.void
saveAsTextFile(String location)
Save as the stream to a text file at the given location.SparkStream<T>
shuffle()
Shuffles the items in the stream.SparkStream<T>
shuffle(Random random)
Shuffles the items in the string using the givenRandom
object.SparkStream<T>
skip(long n)
Skips the firstn
items in the stream<R extends Comparable<R>>
MStream<T>sortBy(boolean ascending, SerializableFunction<? super T,? extends R> keyFunction)
Sorts the items in the stream in ascending or descending order using the given keyFunction to determine how to compare.List<T>
take(int n)
Takes the firstn
items from the stream.SparkStream<T>
toDistributedStream()
To distributed stream spark stream.SparkStream<T>
union(MStream<T> other)
Unions this stream with another.void
updateConfig()
Updates the config instance used for this stream<U> SparkPairStream<T,U>
zip(MStream<U> other)
Zips (combines) this stream together with the given other creating a pair stream.SparkPairStream<T,Long>
zipWithIndex()
Creates a pair stream where the keys are items in this stream and values are the index (starting at 0) of the item in the stream.-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface java.lang.Iterable
forEach, spliterator
-
-
-
-
Method Detail
-
toDistributedStream
public SparkStream<T> toDistributedStream()
Description copied from interface:MStream
To distributed stream spark stream.- Specified by:
toDistributedStream
in interfaceMStream<T>
- Returns:
- A distributed version of the stream
-
cache
public SparkStream<T> cache()
Description copied from interface:MStream
Caches the stream.
-
close
public void close() throws IOException
- Specified by:
close
in interfaceAutoCloseable
- Throws:
IOException
-
collect
public <R> R collect(Collector<? super T,?,R> collector)
Description copied from interface:MStream
Performs a reduction on the string using hte given collector.
-
collect
public List<T> collect()
Description copied from interface:MStream
Collects the items in the stream as a list
-
count
public long count()
Description copied from interface:MStream
The number of items in the stream
-
countByValue
public Map<T,Long> countByValue()
Description copied from interface:MStream
Counts the number of times each item occurs in the stream- Specified by:
countByValue
in interfaceMStream<T>
- Returns:
- a map of object - long counts
-
distinct
public SparkStream<T> distinct()
Description copied from interface:MStream
Removes duplicates from the stream
-
filter
public SparkStream<T> filter(SerializablePredicate<? super T> predicate)
Description copied from interface:MStream
Filters the stream.
-
first
public Optional<T> first()
Description copied from interface:MStream
Gets the first item in the stream
-
flatMap
public <R> SparkStream<R> flatMap(SerializableFunction<? super T,Stream<? extends R>> mapper)
Description copied from interface:MStream
Maps the objects in this stream to one or more new objects using the given function.
-
flatMapToPair
public <R,U> SparkPairStream<R,U> flatMapToPair(SerializableFunction<? super T,Stream<? extends Map.Entry<? extends R,? extends U>>> function)
Description copied from interface:MStream
Maps the objects in this stream to one or more new key-value pairs using the given function.- Specified by:
flatMapToPair
in interfaceMStream<T>
- Type Parameters:
R
- the key type parameterU
- the value type parameter- Parameters:
function
- the function to use to map objects- Returns:
- the new pair stream
-
fold
public T fold(T zeroValue, SerializableBinaryOperator<T> operator)
Description copied from interface:MStream
Performs a reduction on the elements of this stream using the given binary operator.
-
forEach
public void forEach(SerializableConsumer<? super T> consumer)
Description copied from interface:MStream
Performs an operation on each item in the stream
-
forEachLocal
public void forEachLocal(SerializableConsumer<? super T> consumer)
Description copied from interface:MStream
Performs an operation on each item in the stream ensuring that is done locally and not distributed.- Specified by:
forEachLocal
in interfaceMStream<T>
- Parameters:
consumer
- the consumer action to perform
-
getContext
public SparkStreamingContext getContext()
Description copied from interface:MStream
Gets the context used to create the stream- Specified by:
getContext
in interfaceMStream<T>
- Returns:
- the context
-
getRDD
public org.apache.spark.api.java.JavaRDD<T> getRDD()
Gets the wrapped rdd.- Returns:
- the rdd
-
groupBy
public <U> SparkPairStream<U,Iterable<T>> groupBy(SerializableFunction<? super T,? extends U> function)
Description copied from interface:MStream
Groups the items in the stream using the given function that maps objects to key values
-
isEmpty
public boolean isEmpty()
Description copied from interface:MStream
Determines if the stream is empty or not
-
iterator
public Iterator<T> iterator()
Description copied from interface:MStream
Gets an iterator for the stream
-
javaStream
public Stream<T> javaStream()
Description copied from interface:MStream
Converts this stream into a java stream- Specified by:
javaStream
in interfaceMStream<T>
- Returns:
- the java stream
-
limit
public SparkStream<T> limit(long number)
Description copied from interface:MStream
Limits the stream to the firstnumber
items.
-
map
public <R> SparkStream<R> map(SerializableFunction<? super T,? extends R> function)
Description copied from interface:MStream
Maps the objects in the stream using the given function
-
mapPartitions
public <R> SparkStream<R> mapPartitions(SerializableFunction<Iterator<? super T>,Stream<R>> function)
Maps the objects in the stream by block using the given function- Type Parameters:
R
- the component type of the returning stream- Parameters:
function
- the function to use to map objects- Returns:
- the new stream
-
mapToDouble
public com.gengoai.stream.spark.SparkDoubleStream mapToDouble(SerializableToDoubleFunction<? super T> function)
Description copied from interface:MStream
Maps objects in this stream to double values- Specified by:
mapToDouble
in interfaceMStream<T>
- Parameters:
function
- the function to convert objects to doubles- Returns:
- the new double stream
-
mapToPair
public <R,U> SparkPairStream<R,U> mapToPair(SerializableFunction<? super T,? extends Map.Entry<? extends R,? extends U>> function)
Description copied from interface:MStream
Maps the objects in this stream to a key-value pair using the given function.
-
max
public Optional<T> max(SerializableComparator<? super T> comparator)
Description copied from interface:MStream
Returns the max item in the stream using the given comparator to compare items.
-
min
public Optional<T> min(SerializableComparator<? super T> comparator)
Description copied from interface:MStream
Returns the min item in the stream using the given comparator to compare items.
-
onClose
public MStream<T> onClose(SerializableRunnable closeHandler)
Description copied from interface:MStream
Sets the handler to call when the stream is closed. Typically, this is to clean up any open resources, such as file handles.
-
persist
public MStream<T> persist(StorageLevel storageLevel)
Description copied from interface:MStream
Persists the stream to the given storage level
-
parallel
public SparkStream<T> parallel()
Description copied from interface:MStream
Ensures that the stream is parallel or distributed.
-
partition
public MStream<Stream<T>> partition(long partitionSize)
Description copied from interface:MStream
Partitions the stream into iterables each of size<=
partitionSize
.
-
reduce
public Optional<T> reduce(SerializableBinaryOperator<T> reducer)
Description copied from interface:MStream
Performs a reduction on the elements of this stream using the given binary operator.
-
repartition
public SparkStream<T> repartition(int numPartitions)
Description copied from interface:MStream
Repartitions the stream to the given number of partitions. This may be a no-op for some streams, i.e. Local Streams.- Specified by:
repartition
in interfaceMStream<T>
- Parameters:
numPartitions
- the number of partitions the stream should have- Returns:
- the new stream
-
sample
public SparkStream<T> sample(boolean withReplacement, int number)
Description copied from interface:MStream
Randomly samplesnumber
items from the stream.
-
saveAsTextFile
public void saveAsTextFile(Resource location)
Description copied from interface:MStream
Save as the stream to a text file at the given location. Writing may result in multiple files being created.- Specified by:
saveAsTextFile
in interfaceMStream<T>
- Parameters:
location
- the location to write the stream to
-
saveAsTextFile
public void saveAsTextFile(String location)
Description copied from interface:MStream
Save as the stream to a text file at the given location. Writing may result in multiple files being created.- Specified by:
saveAsTextFile
in interfaceMStream<T>
- Parameters:
location
- the location to write the stream to
-
shuffle
public SparkStream<T> shuffle()
Description copied from interface:MStream
Shuffles the items in the stream.
-
shuffle
public SparkStream<T> shuffle(Random random)
Description copied from interface:MStream
Shuffles the items in the string using the givenRandom
object.
-
skip
public SparkStream<T> skip(long n)
Description copied from interface:MStream
Skips the firstn
items in the stream
-
sortBy
public <R extends Comparable<R>> MStream<T> sortBy(boolean ascending, SerializableFunction<? super T,? extends R> keyFunction)
Description copied from interface:MStream
Sorts the items in the stream in ascending or descending order using the given keyFunction to determine how to compare.- Specified by:
sortBy
in interfaceMStream<T>
- Type Parameters:
R
- the type parameter- Parameters:
ascending
- determines if the items should be sorted in ascending (true) or descending (false) orderkeyFunction
- function to use to convert the items in the stream to something that is comparable.- Returns:
- the new stream
-
take
public List<T> take(int n)
Description copied from interface:MStream
Takes the firstn
items from the stream.
-
isDistributed
public boolean isDistributed()
Description copied from interface:MStream
Is distributed boolean.- Specified by:
isDistributed
in interfaceMStream<T>
- Returns:
- True if the stream is distributed
-
intersection
public MStream<T> intersection(MStream<T> other)
Description copied from interface:MStream
Returns a new MStream containing the intersection of elements in this stream and the argument stream.- Specified by:
intersection
in interfaceMStream<T>
- Parameters:
other
- Stream to perform intersection with- Returns:
- the new stream
-
union
public SparkStream<T> union(MStream<T> other)
Description copied from interface:MStream
Unions this stream with another.
-
zip
public <U> SparkPairStream<T,U> zip(MStream<U> other)
Description copied from interface:MStream
Zips (combines) this stream together with the given other creating a pair stream. For example, if this stream contains [1,2,3] and stream 2 contains [4,5,6] the result would be a pair stream containing the key value pairs [(1,4), (2,5), (3,6)]. Note that the length of the resulting stream will be the minimum of the two streams.
-
zipWithIndex
public SparkPairStream<T,Long> zipWithIndex()
Description copied from interface:MStream
Creates a pair stream where the keys are items in this stream and values are the index (starting at 0) of the item in the stream.- Specified by:
zipWithIndex
in interfaceMStream<T>
- Returns:
- the new pair stream
-
updateConfig
public void updateConfig()
Description copied from interface:MStream
Updates the config instance used for this stream- Specified by:
updateConfig
in interfaceMStream<T>
-
-