Package com.gengoai.stream.spark
Class SparkPairStream<T,U>
- java.lang.Object
-
- com.gengoai.stream.spark.SparkPairStream<T,U>
-
- Type Parameters:
T - the key type parameter
U - the value type parameter
- All Implemented Interfaces:
MPairStream<T,U>, Serializable, AutoCloseable
public class SparkPairStream<T,U> extends Object implements MPairStream<T,U>, Serializable
An MPairStream implementation backed by a JavaPairRDD.
- Author:
- David B. Bracewell
- See Also:
- Serialized Form
-
-
Constructor Summary
Constructor | Description
SparkPairStream(Map<? extends T,? extends U> map) | Instantiates a new Spark pair stream.
SparkPairStream(org.apache.spark.api.java.JavaPairRDD<T,U> rdd) | Instantiates a new Spark pair stream.
-
Method Summary
Modifier and Type | Method | Description
MPairStream<T,U> | cache() | Caches the stream.
void | close() |
List<Map.Entry<T,U>> | collectAsList() | Collects the items in the stream as a list.
Map<T,U> | collectAsMap() | Collects the items in the stream as a map.
long | count() | The number of items in the stream.
MPairStream<T,U> | filter(SerializableBiPredicate<? super T,? super U> predicate) | Filters the stream.
MPairStream<T,U> | filterByKey(SerializablePredicate<T> predicate) | Filters the stream by key.
MPairStream<T,U> | filterByValue(SerializablePredicate<U> predicate) | Filters the stream by value.
<R,V> SparkPairStream<R,V> | flatMapToPair(SerializableBiFunction<? super T,? super U,Stream<Map.Entry<? extends R,? extends V>>> function) | Maps the key-value pairs to one or more new key-value pairs.
void | forEach(SerializableBiConsumer<? super T,? super U> consumer) | Performs an operation on each item in the stream.
void | forEachLocal(SerializableBiConsumer<? super T,? super U> consumer) | Performs an operation on each item in the stream, ensuring that it is done locally and not distributed.
StreamingContext | getContext() | Gets the context used to create the stream.
MPairStream<T,Iterable<U>> | groupByKey() | Groups the items in the stream by key.
boolean | isDistributed() |
boolean | isEmpty() | Determines if the stream is empty or not.
boolean | isReusable() | Can this stream be consumed more than once?
Stream<Map.Entry<T,U>> | javaStream() |
<V> MPairStream<T,Map.Entry<U,V>> | join(MPairStream<? extends T,? extends V> stream) | Performs an inner join between this stream and another using the key to match.
MStream<T> | keys() | Returns a stream over the keys in this pair stream.
<V> MPairStream<T,Map.Entry<U,V>> | leftOuterJoin(MPairStream<? extends T,? extends V> stream) | Performs a left outer join between this stream and another using the key to match.
<R> MStream<R> | map(SerializableBiFunction<? super T,? super U,? extends R> function) | Maps the key-value pairs to a new object.
MDoubleStream | mapToDouble(SerializableToDoubleBiFunction<? super T,? super U> function) | Maps the key-value pairs to doubles.
<R,V> MPairStream<R,V> | mapToPair(SerializableBiFunction<? super T,? super U,? extends Map.Entry<? extends R,? extends V>> function) | Maps the key-value pairs to new key-value pairs.
Optional<Map.Entry<T,U>> | max(SerializableComparator<Map.Entry<T,U>> comparator) | Returns the max item in the stream using the given comparator to compare items.
Optional<Map.Entry<T,U>> | min(SerializableComparator<Map.Entry<T,U>> comparator) | Returns the min item in the stream using the given comparator to compare items.
MPairStream<T,U> | onClose(SerializableRunnable closeHandler) | Sets the handler to call when the stream is closed.
MPairStream<T,U> | parallel() | Ensures that the stream is parallel or distributed.
MPairStream<T,U> | persist(StorageLevel storageLevel) |
MPairStream<T,U> | reduceByKey(SerializableBinaryOperator<U> operator) | Performs a reduction by key on the elements of this stream using the given binary operator.
MPairStream<T,U> | repartition(int partitions) | Repartitions the stream to the given number of partitions.
<V> MPairStream<T,Map.Entry<U,V>> | rightOuterJoin(MPairStream<? extends T,? extends V> stream) | Performs a right outer join between this stream and another using the key to match.
MPairStream<T,U> | sample(boolean withReplacement, long number) | Randomly samples number items from the stream.
MPairStream<T,U> | shuffle(Random random) | Shuffles the items in the stream using the given Random object.
MPairStream<T,U> | sortByKey(SerializableComparator<T> comparator) | Sorts the items in the stream by key using the given comparator.
MPairStream<T,U> | union(MPairStream<? extends T,? extends U> other) | Unions this stream with another.
void | updateConfig() | Updates the config instance used for this stream.
MStream<U> | values() | Returns a stream of values.
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface com.gengoai.stream.MPairStream
maxByKey, maxByKey, maxByValue, maxByValue, minByKey, minByKey, minByValue, minByValue, shuffle, sortByKey
-
-
-
-
Method Detail
-
cache
public MPairStream<T,U> cache()
Description copied from interface: MPairStream
Caches the stream.
- Specified by:
cache in interface MPairStream<T,U>
- Returns:
- the cached stream
-
close
public void close() throws Exception
- Specified by:
close in interface AutoCloseable
- Throws:
Exception
-
collectAsList
public List<Map.Entry<T,U>> collectAsList()
Description copied from interface: MPairStream
Collects the items in the stream as a list.
- Specified by:
collectAsList in interface MPairStream<T,U>
- Returns:
- the list of items in the stream
-
collectAsMap
public Map<T,U> collectAsMap()
Description copied from interface: MPairStream
Collects the items in the stream as a map.
- Specified by:
collectAsMap in interface MPairStream<T,U>
- Returns:
- the map of items in the stream
-
count
public long count()
Description copied from interface: MPairStream
The number of items in the stream.
- Specified by:
count in interface MPairStream<T,U>
- Returns:
- the number of items in the stream
-
filter
public MPairStream<T,U> filter(SerializableBiPredicate<? super T,? super U> predicate)
Description copied from interface: MPairStream
Filters the stream.
- Specified by:
filter in interface MPairStream<T,U>
- Parameters:
predicate - the predicate to use to determine which objects are kept
- Returns:
- the new stream
-
filterByKey
public MPairStream<T,U> filterByKey(SerializablePredicate<T> predicate)
Description copied from interface: MPairStream
Filters the stream by key.
- Specified by:
filterByKey in interface MPairStream<T,U>
- Parameters:
predicate - the predicate to apply to keys in order to determine which objects are kept
- Returns:
- the new stream
-
filterByValue
public MPairStream<T,U> filterByValue(SerializablePredicate<U> predicate)
Description copied from interface: MPairStream
Filters the stream by value.
- Specified by:
filterByValue in interface MPairStream<T,U>
- Parameters:
predicate - the predicate to apply to values in order to determine which objects are kept
- Returns:
- the new stream
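The three filter variants differ only in what the predicate is given. The following is a minimal plain-Java sketch of that distinction over a local list of Map.Entry values. It does not use SparkPairStream; the class FilterSketch and its helpers are hypothetical names that only mirror the documented semantics.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;
import java.util.function.Predicate;

// Hypothetical local stand-in; not part of com.gengoai.stream.
final class FilterSketch {
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleImmutableEntry<>(key, value);
    }

    // filter: the predicate is given both the key and the value.
    static <T, U> List<Map.Entry<T, U>> filter(List<Map.Entry<T, U>> pairs,
                                               BiPredicate<? super T, ? super U> predicate) {
        List<Map.Entry<T, U>> kept = new ArrayList<>();
        for (Map.Entry<T, U> e : pairs) {
            if (predicate.test(e.getKey(), e.getValue())) {
                kept.add(e);
            }
        }
        return kept;
    }

    // filterByKey: the predicate is given only the key.
    static <T, U> List<Map.Entry<T, U>> filterByKey(List<Map.Entry<T, U>> pairs,
                                                    Predicate<T> predicate) {
        return filter(pairs, (key, value) -> predicate.test(key));
    }

    // filterByValue: the predicate is given only the value.
    static <T, U> List<Map.Entry<T, U>> filterByValue(List<Map.Entry<T, U>> pairs,
                                                      Predicate<U> predicate) {
        return filter(pairs, (key, value) -> predicate.test(value));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs =
                Arrays.asList(pair("a", 1), pair("b", 2), pair("c", 3));
        System.out.println(filterByKey(pairs, key -> !key.equals("b")));
        System.out.println(filterByValue(pairs, value -> value > 1));
    }
}
```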
-
flatMapToPair
public <R,V> SparkPairStream<R,V> flatMapToPair(SerializableBiFunction<? super T,? super U,Stream<Map.Entry<? extends R,? extends V>>> function)
Description copied from interface: MPairStream
Maps the key-value pairs to one or more new key-value pairs.
- Specified by:
flatMapToPair in interface MPairStream<T,U>
- Type Parameters:
R - the new key type parameter
V - the new value type parameter
- Parameters:
function - the function to map key-value pairs
- Returns:
- the new pair stream
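As an illustration of the one-to-many mapping described above, here is a small plain-Java sketch in which each (document id, text) pair expands into one (word, document id) pair per word. It uses java.util.stream on a local list, not SparkPairStream; FlatMapToPairSketch, pair and wordsByDocument are hypothetical names, and the function type is simplified to Stream<Map.Entry<R,V>> without the wildcards of the real signature.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical local stand-in; not part of com.gengoai.stream.
final class FlatMapToPairSketch {
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleImmutableEntry<>(key, value);
    }

    // Each input pair may produce zero, one, or many output pairs.
    static <T, U, R, V> List<Map.Entry<R, V>> flatMapToPair(
            List<Map.Entry<T, U>> pairs,
            BiFunction<? super T, ? super U, Stream<Map.Entry<R, V>>> function) {
        return pairs.stream()
                    .flatMap(e -> function.apply(e.getKey(), e.getValue()))
                    .collect(Collectors.toList());
    }

    // Splits each document's text on spaces and emits (word, documentId).
    static List<Map.Entry<String, Integer>> wordsByDocument(
            List<Map.Entry<Integer, String>> docs) {
        return flatMapToPair(docs, (id, text) ->
                Arrays.stream(text.split(" ")).map(word -> pair(word, id)));
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> docs =
                Arrays.asList(pair(1, "a b"), pair(2, "b"));
        System.out.println(wordsByDocument(docs));
    }
}
```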
-
forEach
public void forEach(SerializableBiConsumer<? super T,? super U> consumer)
Description copied from interface: MPairStream
Performs an operation on each item in the stream.
- Specified by:
forEach in interface MPairStream<T,U>
- Parameters:
consumer - the consumer action to perform
-
forEachLocal
public void forEachLocal(SerializableBiConsumer<? super T,? super U> consumer)
Description copied from interface: MPairStream
Performs an operation on each item in the stream, ensuring that it is done locally and not distributed.
- Specified by:
forEachLocal in interface MPairStream<T,U>
- Parameters:
consumer - the consumer action to perform
-
getContext
public StreamingContext getContext()
Description copied from interface: MPairStream
Gets the context used to create the stream.
- Specified by:
getContext in interface MPairStream<T,U>
- Returns:
- the context
-
groupByKey
public MPairStream<T,Iterable<U>> groupByKey()
Description copied from interface: MPairStream
Groups the items in the stream by key.
- Specified by:
groupByKey in interface MPairStream<T,U>
- Returns:
- the new stream
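To make the shape of the MPairStream<T,Iterable<U>> result concrete, the sketch below groups a local list of pairs by key using only the JDK. GroupByKeySketch is a hypothetical name; the ordering of groups shown here (first appearance of each key) is a property of this sketch and is not something this page states for the Spark-backed stream.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical local stand-in; not part of com.gengoai.stream.
final class GroupByKeySketch {
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleImmutableEntry<>(key, value);
    }

    // Collects all values that share a key into one list per key.
    static <T, U> Map<T, List<U>> groupByKey(List<Map.Entry<T, U>> pairs) {
        Map<T, List<U>> groups = new LinkedHashMap<>();
        for (Map.Entry<T, U> e : pairs) {
            groups.computeIfAbsent(e.getKey(), key -> new ArrayList<>()).add(e.getValue());
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs =
                Arrays.asList(pair("a", 1), pair("b", 2), pair("a", 3));
        System.out.println(groupByKey(pairs));
    }
}
```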
-
isEmpty
public boolean isEmpty()
Description copied from interface: MPairStream
Determines if the stream is empty or not.
- Specified by:
isEmpty in interface MPairStream<T,U>
- Returns:
- True if empty, False otherwise
-
isReusable
public boolean isReusable()
Description copied from interface: MPairStream
Can this stream be consumed more than once?
- Specified by:
isReusable in interface MPairStream<T,U>
- Returns:
- True if the stream can be reused multiple times, False if the stream can only be used once
-
join
public <V> MPairStream<T,Map.Entry<U,V>> join(MPairStream<? extends T,? extends V> stream)
Description copied from interface: MPairStream
Performs an inner join between this stream and another using the key to match.
- Specified by:
join in interface MPairStream<T,U>
- Type Parameters:
V - the value type parameter of the stream to join with
- Parameters:
stream - the other stream to inner join with
- Returns:
- the new stream
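The result type MPairStream<T,Map.Entry<U,V>> pairs each key with an entry holding the left and right values. A minimal local sketch of inner-join semantics, assuming the usual behavior that keys missing from either side are dropped and that duplicate keys yield every left/right combination (this page does not spell that out). JoinSketch and its methods are hypothetical names.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical local stand-in; not part of com.gengoai.stream.
final class JoinSketch {
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleImmutableEntry<>(key, value);
    }

    static <T, U, V> List<Map.Entry<T, Map.Entry<U, V>>> join(
            List<Map.Entry<T, U>> left, List<Map.Entry<T, V>> right) {
        // Index the right side by key so each left pair can look up its matches.
        Map<T, List<V>> rightByKey = new HashMap<>();
        for (Map.Entry<T, V> e : right) {
            rightByKey.computeIfAbsent(e.getKey(), key -> new ArrayList<>()).add(e.getValue());
        }
        List<Map.Entry<T, Map.Entry<U, V>>> joined = new ArrayList<>();
        for (Map.Entry<T, U> l : left) {
            // Assumption: left pairs without a match on the right are dropped.
            for (V r : rightByKey.getOrDefault(l.getKey(), Collections.emptyList())) {
                joined.add(pair(l.getKey(), pair(l.getValue(), r)));
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> left = Arrays.asList(pair("a", 1), pair("b", 2));
        List<Map.Entry<String, String>> right = Arrays.asList(pair("a", "x"), pair("c", "y"));
        System.out.println(join(left, right));
    }
}
```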
-
keys
public MStream<T> keys()
Description copied from interface: MPairStream
Returns a stream over the keys in this pair stream.
- Specified by:
keys in interface MPairStream<T,U>
- Returns:
- the key stream
-
leftOuterJoin
public <V> MPairStream<T,Map.Entry<U,V>> leftOuterJoin(MPairStream<? extends T,? extends V> stream)
Description copied from interface: MPairStream
Performs a left outer join between this stream and another using the key to match.
- Specified by:
leftOuterJoin in interface MPairStream<T,U>
- Type Parameters:
V - the value type parameter of the stream to join with
- Parameters:
stream - the other stream to left outer join with
- Returns:
- the new stream
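A left outer join keeps every pair from this stream even when the other stream has no matching key. The sketch below shows that on local lists. How the Spark-backed stream represents a missing right-hand value is not documented on this page; the sketch assumes null, and LeftOuterJoinSketch is a hypothetical name.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical local stand-in; not part of com.gengoai.stream.
final class LeftOuterJoinSketch {
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleImmutableEntry<>(key, value);
    }

    static <T, U, V> List<Map.Entry<T, Map.Entry<U, V>>> leftOuterJoin(
            List<Map.Entry<T, U>> left, List<Map.Entry<T, V>> right) {
        Map<T, List<V>> rightByKey = new HashMap<>();
        for (Map.Entry<T, V> e : right) {
            rightByKey.computeIfAbsent(e.getKey(), key -> new ArrayList<>()).add(e.getValue());
        }
        List<Map.Entry<T, Map.Entry<U, V>>> joined = new ArrayList<>();
        for (Map.Entry<T, U> l : left) {
            List<V> matches = rightByKey.get(l.getKey());
            if (matches == null) {
                // Assumption: an unmatched left pair is kept with null on the right.
                Map.Entry<U, V> values = pair(l.getValue(), null);
                joined.add(pair(l.getKey(), values));
            } else {
                for (V r : matches) {
                    joined.add(pair(l.getKey(), pair(l.getValue(), r)));
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> left = Arrays.asList(pair("a", 1), pair("b", 2));
        List<Map.Entry<String, String>> right = Arrays.asList(pair("a", "x"));
        System.out.println(leftOuterJoin(left, right));
    }
}
```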
-
map
public <R> MStream<R> map(SerializableBiFunction<? super T,? super U,? extends R> function)
Description copied from interface: MPairStream
Maps the key-value pairs to a new object.
- Specified by:
map in interface MPairStream<T,U>
- Type Parameters:
R - the type parameter being mapped to
- Parameters:
function - the function to map from key-value pairs to objects of type R
- Returns:
- the new stream
-
mapToDouble
public MDoubleStream mapToDouble(SerializableToDoubleBiFunction<? super T,? super U> function)
Description copied from interface: MPairStream
Maps the key-value pairs to doubles.
- Specified by:
mapToDouble in interface MPairStream<T,U>
- Parameters:
function - the function to map from key-value pairs to double values
- Returns:
- the new double stream
-
mapToPair
public <R,V> MPairStream<R,V> mapToPair(SerializableBiFunction<? super T,? super U,? extends Map.Entry<? extends R,? extends V>> function)
Description copied from interface: MPairStream
Maps the key-value pairs to new key-value pairs.
- Specified by:
mapToPair in interface MPairStream<T,U>
- Type Parameters:
R - the new key type parameter
V - the new value type parameter
- Parameters:
function - the function to map key-value pairs
- Returns:
- the new pair stream
-
max
public Optional<Map.Entry<T,U>> max(SerializableComparator<Map.Entry<T,U>> comparator)
Description copied from interface: MPairStream
Returns the max item in the stream using the given comparator to compare items.
- Specified by:
max in interface MPairStream<T,U>
- Parameters:
comparator - the comparator to use to compare values in the stream
- Returns:
- the optional containing the max value
-
min
public Optional<Map.Entry<T,U>> min(SerializableComparator<Map.Entry<T,U>> comparator)
Description copied from interface: MPairStream
Returns the min item in the stream using the given comparator to compare items.
- Specified by:
min in interface MPairStream<T,U>
- Returns:
- the optional containing the min value
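Both max and min compare whole Map.Entry items, so the comparator decides whether keys, values, or both matter. A short plain-Java sketch of that idea, using java.util.stream and the JDK's Map.Entry.comparingByValue(); MinMaxSketch is a hypothetical name and the code does not call SparkPairStream.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical local stand-in; not part of com.gengoai.stream.
final class MinMaxSketch {
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleImmutableEntry<>(key, value);
    }

    // Largest entry according to the comparator; empty if there are no entries.
    static <T, U> Optional<Map.Entry<T, U>> max(List<Map.Entry<T, U>> pairs,
                                                Comparator<Map.Entry<T, U>> comparator) {
        return pairs.stream().max(comparator);
    }

    // Smallest entry according to the comparator; empty if there are no entries.
    static <T, U> Optional<Map.Entry<T, U>> min(List<Map.Entry<T, U>> pairs,
                                                Comparator<Map.Entry<T, U>> comparator) {
        return pairs.stream().min(comparator);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs =
                Arrays.asList(pair("a", 3), pair("b", 7), pair("c", 5));
        Comparator<Map.Entry<String, Integer>> byValue =
                Map.Entry.<String, Integer>comparingByValue();
        System.out.println(max(pairs, byValue));
        System.out.println(min(pairs, byValue));
    }
}
```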
-
onClose
public MPairStream<T,U> onClose(SerializableRunnable closeHandler)
Description copied from interface: MPairStream
Sets the handler to call when the stream is closed. Typically, this is to clean up any open resources, such as file handles.
- Specified by:
onClose in interface MPairStream<T,U>
- Parameters:
closeHandler - the handler to run when the stream is closed.
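The JDK's own java.util.stream.Stream has the same onClose/close pairing, which makes it a convenient analogue for the pattern. The sketch below registers a handler and lets try-with-resources trigger it. It uses the JDK stream only; that SparkPairStream invokes its handler at the same point is an assumption, and OnCloseSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical analogue built on java.util.stream.Stream, not on SparkPairStream.
final class OnCloseSketch {
    // Returns a log showing that the handler runs after the stream is consumed and closed.
    static List<String> run() {
        List<String> log = new ArrayList<>();
        try (Stream<String> stream = Stream.of("a", "b").onClose(() -> log.add("closed"))) {
            stream.forEachOrdered(item -> log.add("saw " + item));
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [saw a, saw b, closed]
    }
}
```

Because SparkPairStream implements AutoCloseable, the same try-with-resources shape should apply to it, with close() declared to throw Exception.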
-
parallel
public MPairStream<T,U> parallel()
Description copied from interface: MPairStream
Ensures that the stream is parallel or distributed.
- Specified by:
parallel in interface MPairStream<T,U>
- Returns:
- the new stream
-
reduceByKey
public MPairStream<T,U> reduceByKey(SerializableBinaryOperator<U> operator)
Description copied from interface: MPairStream
Performs a reduction by key on the elements of this stream using the given binary operator.
- Specified by:
reduceByKey in interface MPairStream<T,U>
- Parameters:
operator - the binary operator used to combine two objects
- Returns:
- the new stream containing keys and reduced values
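A reduction by key folds all values that share a key into one value, leaving one pair per distinct key. The following local sketch shows this with Map.merge from the JDK. ReduceByKeySketch is a hypothetical name, and the sketch says nothing about how the Spark-backed stream distributes the work.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical local stand-in; not part of com.gengoai.stream.
final class ReduceByKeySketch {
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleImmutableEntry<>(key, value);
    }

    // Combines the values of equal keys pairwise with the operator.
    static <T, U> Map<T, U> reduceByKey(List<Map.Entry<T, U>> pairs,
                                        BinaryOperator<U> operator) {
        Map<T, U> reduced = new LinkedHashMap<>();
        for (Map.Entry<T, U> e : pairs) {
            // First value for a key is stored as-is; later ones are merged in.
            reduced.merge(e.getKey(), e.getValue(), operator);
        }
        return reduced;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs =
                Arrays.asList(pair("a", 1), pair("b", 2), pair("a", 3));
        System.out.println(reduceByKey(pairs, Integer::sum)); // prints {a=4, b=2}
    }
}
```

In a distributed setting the order in which values are combined is generally not fixed, so an associative and commutative operator such as addition is the safe choice.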
-
repartition
public MPairStream<T,U> repartition(int partitions)
Description copied from interface: MPairStream
Repartitions the stream to the given number of partitions. This may be a no-op for some streams, e.g. local streams.
- Specified by:
repartition in interface MPairStream<T,U>
- Parameters:
partitions - the number of partitions the stream should have
- Returns:
- the new stream
-
rightOuterJoin
public <V> MPairStream<T,Map.Entry<U,V>> rightOuterJoin(MPairStream<? extends T,? extends V> stream)
Description copied from interface: MPairStream
Performs a right outer join between this stream and another using the key to match.
- Specified by:
rightOuterJoin in interface MPairStream<T,U>
- Type Parameters:
V - the value type parameter of the stream to join with
- Parameters:
stream - the other stream to right outer join with
- Returns:
- the new stream
-
sample
public MPairStream<T,U> sample(boolean withReplacement, long number)
Description copied from interface: MPairStream
Randomly samples number items from the stream.
- Specified by:
sample in interface MPairStream<T,U>
- Parameters:
withReplacement - true to allow a single item to be represented in the sample multiple times, false to allow a single item to be picked only once
number - the number of items desired in the sample
- Returns:
- the new stream
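The effect of the withReplacement flag can be seen in a small local sketch: with replacement, the same item may be drawn repeatedly; without it, each item appears at most once, so the sample cannot exceed the input size. SampleSketch is a hypothetical name, the count is an int here rather than the long of the real signature, and the explicit Random parameter is added for illustration. Whether the Spark-backed stream returns exactly number items is not stated on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical local stand-in; not part of com.gengoai.stream.
final class SampleSketch {
    static <E> List<E> sample(List<E> items, boolean withReplacement, int number, Random random) {
        List<E> result = new ArrayList<>();
        if (items.isEmpty()) {
            return result;
        }
        if (withReplacement) {
            // Independent draws: the same item may be picked more than once.
            for (int i = 0; i < number; i++) {
                result.add(items.get(random.nextInt(items.size())));
            }
        } else {
            // Shuffle a copy and take a prefix: each item is picked at most once.
            List<E> copy = new ArrayList<>(items);
            Collections.shuffle(copy, random);
            result.addAll(copy.subList(0, Math.min(number, copy.size())));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> items = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(sample(items, false, 3, new Random()));
        System.out.println(sample(items, true, 8, new Random()));
    }
}
```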
-
javaStream
public Stream<Map.Entry<T,U>> javaStream()
- Specified by:
javaStream in interface MPairStream<T,U>
-
shuffle
public MPairStream<T,U> shuffle(Random random)
Description copied from interface: MPairStream
Shuffles the items in the stream using the given Random object.
- Specified by:
shuffle in interface MPairStream<T,U>
- Parameters:
random - the random number generator
- Returns:
- the new stream
-
sortByKey
public MPairStream<T,U> sortByKey(SerializableComparator<T> comparator)
Description copied from interface: MPairStream
Sorts the items in the stream by key using the given comparator.
- Specified by:
sortByKey in interface MPairStream<T,U>
- Parameters:
comparator - the comparator to use to compare keys
- Returns:
- the new stream
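Sorting by key means the comparator is applied to keys only and values simply travel with them. A minimal local sketch using the JDK's Map.Entry.comparingByKey(Comparator) follows. SortByKeySketch is a hypothetical name, and the sketch returns a sorted copy rather than a stream.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical local stand-in; not part of com.gengoai.stream.
final class SortByKeySketch {
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleImmutableEntry<>(key, value);
    }

    // Returns a new list ordered by key; the input list is left untouched.
    static <T, U> List<Map.Entry<T, U>> sortByKey(List<Map.Entry<T, U>> pairs,
                                                  Comparator<T> comparator) {
        List<Map.Entry<T, U>> sorted = new ArrayList<>(pairs);
        sorted.sort(Map.Entry.<T, U>comparingByKey(comparator));
        return sorted;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs =
                Arrays.asList(pair("b", 1), pair("a", 2), pair("c", 3));
        System.out.println(sortByKey(pairs, Comparator.<String>naturalOrder()));
        System.out.println(sortByKey(pairs, Comparator.<String>reverseOrder()));
    }
}
```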
-
union
public MPairStream<T,U> union(MPairStream<? extends T,? extends U> other)
Description copied from interface: MPairStream
Unions this stream with another.
- Specified by:
union in interface MPairStream<T,U>
- Parameters:
other - the other stream to add to this one.
- Returns:
- the new stream
-
updateConfig
public void updateConfig()
Description copied from interface: MPairStream
Updates the config instance used for this stream.
- Specified by:
updateConfig in interface MPairStream<T,U>
-
values
public MStream<U> values()
Description copied from interface: MPairStream
Returns a stream of values.
- Specified by:
values in interface MPairStream<T,U>
- Returns:
- the new stream of values
-
isDistributed
public boolean isDistributed()
- Specified by:
isDistributed in interface MPairStream<T,U>
-
persist
public MPairStream<T,U> persist(StorageLevel storageLevel)
- Specified by:
persist in interface MPairStream<T,U>
-
-