Plug and Play
Data Pipelines

Fast and simple framework for building and scaling data pipelines


Too much time is spent on setting up the data

With well-designed data pipelines, machine learning experiments iterate faster and produce more accurate models.

> pip install hub
  • Generate datasets using plug-and-play data pipelines

    Use the Python-native framework to seamlessly build data pipelines for feature extraction, machine learning, and deep learning. Automatically ingest, clean, and transform your raw data as new data arrives.

  • Test locally, then scale to the cloud with no code change

    Snark lets you build streamable data pipelines that run locally and scale to thousands of machines in the cloud, with no cloud infrastructure to configure.

    Leverage the most cost-efficient hardware in the cloud with support for preemptible/spot instances.

  • Collaborate with your team

    A data versioning and synchronization protocol is implemented for you, so datasets can be shared across teams. User access management comes with encryption at rest and in transit. Access your data from anywhere.

  • Visualize data at any step

    View results with our visualization engine, deployed on premises or in the cloud. Preview slices of data with no load time and keep track of your feature engineering pipeline.
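To make the plug-and-play idea concrete, here is a minimal sketch of a pipeline as a chain of small, composable stages. The helper `make_pipeline` and the stage functions are hypothetical illustrations, not part of the hub package:

```python
from functools import reduce

# Hypothetical helper, not part of the hub package: compose stages
# left-to-right into a single callable.
def make_pipeline(*stages):
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)

# Example stages: clean raw records, then extract a simple feature.
def clean(records):
    # Drop empty records, normalize whitespace and case.
    return [r.strip().lower() for r in records if r.strip()]

def extract_lengths(records):
    # A toy "feature extraction" stage: record lengths.
    return [len(r) for r in records]

pipeline = make_pipeline(clean, extract_lengths)
print(pipeline(["  Cat ", "", "Horse"]))  # [3, 5]
```

Because each stage is just a function of data in, data out, stages can be swapped, reordered, or rerun on new data as it arrives.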



Create an array

Create a large array that you can read and write from anywhere. When you write one slice of the array, it automatically syncs to the cloud. You can lazy-load an existing array on-demand or connect to any other storage.

import hub
import numpy as np

# Create a large array that you can read/write from anywhere.
datahub = hub.fs('./data').connect()
bigarray = datahub.array('your_array_name',
                         shape=(100000, 512, 512, 3),
                         chunk=(100, 512, 512, 3))

# Writing to one slice of the array. Automatically syncs to cloud.
image = np.random.random((512, 512, 3))
bigarray[0, :, :, :] = image

# Lazy-Load an existing array from cloud on-demand
bigarray = datahub.open('your_array_name')
bigarray[0, :, :, :].mean()
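The `chunk` shape above determines the unit of storage and transfer: a write only touches, and only syncs, the chunks its slice intersects. A rough sketch of that bookkeeping along the first axis (a hypothetical helper, not Hub's internals):

```python
# Hypothetical sketch of chunk bookkeeping, not Hub's internals.
def chunk_index(index, chunk_size):
    """Return which chunk along the first axis a given index falls into."""
    return index // chunk_size

# With chunk=(100, 512, 512, 3), writing bigarray[0] touches chunk 0,
# while writing bigarray[250] touches chunk 2 -- only those chunks sync.
print(chunk_index(0, 100))    # 0
print(chunk_index(250, 100))  # 2
```

Picking a chunk shape that matches your access pattern (here, batches of 100 images) keeps reads and writes from pulling more data than they need.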


Connect to the storage service of your choice

Connect your pipelines to any type of structured or unstructured data in a powerful cloud-native array data warehouse.

  • Amazon S3
  • Google Cloud Storage
  • Amazon Redshift
  • Google BigQuery
  • PostgreSQL

Make data work for you
Whatever you need, Snark can help. Talk to our data experts.