sparklyr is an open source, modern interface for scaling data science and machine learning workflows using Apache Spark™, R, and a rich extension ecosystem.
It makes Apache Spark easy to use from R by providing access to core functionality such as installing, connecting to, and managing Spark, as well as to Spark's MLlib, Spark Structured Streaming, and Spark Pipelines.
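As a minimal sketch of what that looks like, the following fits an MLlib linear regression from R; it assumes a local Spark installation (covered below) and uses the built-in mtcars data:

```r
library(sparklyr)

# Connect to a local Spark instance (see the connection section below)
sc <- spark_connect(master = "local")

# Copy a small R data frame into Spark
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Fit a linear regression with Spark MLlib
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)

spark_disconnect(sc)
```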
It supports well-known R packages such as dplyr, DBI, and broom, reducing the cognitive overhead of having to re-learn libraries.
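For example, a sketch of using dplyr and DBI against a Spark table (the table mtcars_tbl and the connection sc are carried over from the sketch above):

```r
library(dplyr)
library(DBI)

# dplyr verbs are translated to Spark SQL and executed inside Spark
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

# The same connection also works as a DBI backend for raw SQL
dbGetQuery(sc, "SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl")
```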
It also enables a rich ecosystem of extensions to use with Spark and R, including XGBoost, MLeap, GraphFrames, and H2O, and can optionally use Apache Arrow to significantly improve data-transfer performance.
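As one concrete example from that ecosystem, here is a sketch of enabling Arrow: attaching the arrow package is enough for sparklyr to use Arrow when moving data between R and Spark (the data size here is illustrative):

```r
library(sparklyr)
library(dplyr)
library(arrow)  # attaching arrow switches sparklyr to Arrow-based serialization

sc <- spark_connect(master = "local")

# Copying to and collecting from Spark now goes through Arrow,
# which is typically much faster for larger data frames
big_tbl <- copy_to(sc, data.frame(x = runif(1e6)), "big", overwrite = TRUE)
local_df <- collect(big_tbl)

spark_disconnect(sc)
```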
Through Spark, this lets you scale your data science workflows on Hadoop YARN, Mesos, Kubernetes, or through Apache Livy.
To connect to a local cluster, install R and Java 8, then run from R:
```r
# Run once
install.packages("sparklyr")
sparklyr::spark_install()

# Connect to Spark locally
library(sparklyr)
sc <- spark_connect(master = "local")
```
To connect to any other Spark cluster:
```r
# Connect to Hadoop YARN
sc <- spark_connect(master = "yarn")

# Connect to Mesos
sc <- spark_connect(master = "mesos://host:port")

# Connect to Kubernetes
sc <- spark_connect(master = "k8s://https://server")

# Connect to Apache Livy
sc <- spark_connect(master = "http://server/livy", method = "livy")
```
To connect through specific distributions, cloud providers, and tools, use the following resources:
Useful resources for learning sparklyr:
Many organizations use sparklyr to scale their data science and machine learning workflows with R and Apache Spark. Logos coming soon!
Current committers to sparklyr are sponsored by: Databricks, Qubole, and RStudio.