
How to get BioConnect Data

This document describes how to access data stored in Google Cloud Storage (a.k.a. "Google buckets") using a command-line tool called gsutil. gsutil connects to Cloud Storage over an encrypted, authenticated connection.

Prerequisites

  1. A login username with permission to access the data. If you do not have access permission, follow the instructions found under the "Access Error" menu item.
  2. A Python interpreter, which is needed to run the gsutil program. Python version 3.8+ is preferred. If you are installing on your own computer, a new Python install may be required. If you are accessing Google bucket data from one of the HPC systems (e.g., Sumner, Winter), you can either follow the container method or use an existing Python interpreter from within the environment.
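To confirm the Python prerequisite before installing, a check along the following lines may help. This is a minimal sketch, not part of the official instructions: the helper name `python_version_ok` is hypothetical, and the script assumes the interpreter is available as `python3` on your PATH.

```shell
# Hypothetical helper: succeeds when a "major.minor[.patch]" version
# string is at least 3.8.
python_version_ok() {
    ver="$1"
    major="${ver%%.*}"      # text before the first dot
    rest="${ver#*.}"        # text after the first dot
    minor="${rest%%.*}"     # text before the next dot, if any
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 8 ]; }
}

# Ask the interpreter on PATH for its version, then check it.
# `python3 --version` prints a line such as "Python 3.10.12";
# awk keeps only the version number.
if command -v python3 >/dev/null 2>&1; then
    found="$(python3 --version 2>&1 | awk '{print $2}')"
    if python_version_ok "$found"; then
        echo "Python $found is new enough for gsutil"
    else
        echo "Python $found is older than 3.8; consider upgrading"
    fi
else
    echo "python3 not found on PATH"
fi
```

On the HPC systems, the same check can be run after loading whichever Python environment you intend to use.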

Installing gsutil

The installation instructions below describe the process for the Windows, macOS, and Linux (HPC) operating systems. The instructions for macOS and Windows follow the general instructions here. The instructions for the HPC environment use a Singularity container that is accessible in that environment.

Note: If you have already installed gsutil, you do not need to repeat the process.
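If you are unsure whether the tools are already installed, a quick check like the one below can tell you. It is only a sketch: the helper name `have_tool` is hypothetical, and it assumes a POSIX shell (macOS Terminal or the HPC login shell) rather than PowerShell.

```shell
# Hypothetical helper: succeeds when the named program is found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report on both command-line tools used in this document.
for tool in gcloud gsutil; do
    if have_tool "$tool"; then
        echo "$tool is already installed at $(command -v "$tool")"
    else
        echo "$tool not found; follow the installation steps for your system"
    fi
done
```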

Windows installation

To install on Windows, open a PowerShell terminal and run the following PowerShell commands.

(New-Object Net.WebClient).DownloadFile("https://dl.google.com/dl/cloudsdk/channels/rapid/GoogleCloudSDKInstaller.exe", "$env:Temp\GoogleCloudSDKInstaller.exe")
& $env:Temp\GoogleCloudSDKInstaller.exe 

For additional assistance, please refer to the instructions outlined here.

macOS installation

To install on macOS you will first need Python 3.8+, and then you can run the following commands in Terminal:

curl https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-346.0.0-darwin-x86_64.tar.gz -o google-cloud-sdk.tar.gz
tar -xzf google-cloud-sdk.tar.gz
./google-cloud-sdk/install.sh 

For additional assistance, please refer to the instructions outlined here.

Linux/HPC installation

In the HPC environment, the quickest way to access data is via a Singularity container. First, connect to the JAX VPN and log in to Sumner with:

ssh login.sumner.jax.org

From your home directory, load singularity:

module load singularity

Run the gcp_sdk Singularity container directly from the JAX container registry:

singularity shell shub://jaxreg.jax.org/cube/gcp_sdk.sif:latest

Inside the container, log in to GCP:

~/singularity> gcloud auth login

Copy and paste the URL into a browser, log in to GCP, and click the “Allow” button. Then copy the verification code and paste it into the terminal window. See the image below for a sample of what this looks like.

Sample singularity login

Alternatively, in the HPC environment, you can install the SDK directly by running the following commands in a terminal. (You can launch a terminal by pressing Ctrl+Alt+T.)

wget https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz
tar -xvf google-cloud-sdk.tar.gz
./google-cloud-sdk/install.sh

For additional assistance, please refer to the instructions outlined here.

Login

From here, the instructions are the same for all approaches. To log in, run the commands below and follow the instructions on the screen.

  1. Open up Terminal/Command Prompt and type the following command:

    gcloud auth login 
    
    After you run this command, your computer will ask to open a URL in a new browser window. Once allowed, it will open a Google login page.

  2. Log in to your JAX Google account using your JAX email address and password. This process verifies that you are authorized to view JAX data. Once logged in, you will see this page.

    google-cloud-sdk

    Press "Allow".

    Note: If you are unable to log in with your JAX credentials, please refer to the "Google Account Authentication" menu item, which walks you through how best to contact helpdesk@jax.org.

  3. After you've logged in to Google, return to your terminal and set the data project by running the following command:

    gcloud config set project jax-cube-prd-ctrl-01

    With the project selected, you can now begin following the steps outlined below to access your specific datasets.

    Note: If you are still experiencing issues with access after your Google account has been created, please refer to the "Access Error" menu item. You will need to reach out to IT to ensure they have granted you full access to the data buckets.
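Before moving on, it can be useful to confirm that the login and project settings took effect. The sketch below is one way to do that. The helper name `project_is_set` is hypothetical; `gcloud auth list` and `gcloud config get-value project` are standard gcloud commands, and the project ID is the one given in step 3.

```shell
# Project ID taken from step 3 of the login instructions.
EXPECTED_PROJECT="jax-cube-prd-ctrl-01"

# Hypothetical helper: succeeds when gcloud's active project matches
# the project ID passed as the first argument.
project_is_set() {
    current="$(gcloud config get-value project 2>/dev/null)"
    [ "$current" = "$1" ]
}

# Usage (run after `gcloud auth login`):
#   gcloud auth list        # shows which account is currently active
#   project_is_set "$EXPECTED_PROJECT" && echo "project OK"
```

If the check fails, re-running the `gcloud config set project` command from step 3 is usually the first thing to try.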

Using gsutil

The following steps walk you through the commands to enter in order to access the data. The example output under each command shows what to look for to verify that you have access to the data.

To use the data, you will need to download, or 'copy', the files to your local computer. We start by finding the data's location and then use its URL to copy the specific files we need.

Note: You do not need to follow all of these steps in order. Once you have completed steps 1-3, you can proceed to step 4, 5, or 6, depending on how many files you would like to copy at one time.

Step 4 walks you through copying a single file, whereas steps 5 and 6 show you how to copy multiple files at once.

1. List bucket

This command lists the top-level storage locations (buckets) available to you in the JAX data project.
Command:

gsutil ls
Example Output:
[liangh@sumner103 ~]$ gsutil ls
gs://artifacts.jax-cube-prd-ctrl-01.appspot.com/
gs://dataproc-staging-us-east1-121275253190-gx00mz7u/
gs://dataproc-temp-us-east1-121275253190-ie6yztoa/
gs://jax-cube-prd-ctrl-01-bigquery-data/
gs://jax-cube-prd-ctrl-01-export-bucket/
Note: If you are experiencing issues with access, please refer to the "Access Error" menu item.

2. List Bucket Directory

Follow this step verbatim. This command lists the contents of a single bucket, in this case the Cube project data bucket. Listing the bucket lets you narrow in on the specific folders and data it contains.
Command:

gsutil ls gs://jax-cube-prd-ctrl-01-project-data/
Example Output:
[liangh@sumner103 ~]$ gsutil ls gs://jax-cube-prd-ctrl-01-project-data/
gs://jax-cube-prd-ctrl-01-project-data/20191223_19-churchill-004/
gs://jax-cube-prd-ctrl-01-project-data/20191223_19-cube-001/
gs://jax-cube-prd-ctrl-01-project-data/20191224_19-churchill-004/

3. List Bucket Directory Continued

By continuing to narrow down with "gsutil ls", you locate the data itself. In this step, the URL you enter after the command will vary based on the data you need to access.
Command:

gsutil ls gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046
Example Output:

[liangh@sumner103 ~]$ gsutil ls gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046
gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046/pipeline-metadata.json
gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046/processed-metadata.json
gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046/processed-metadata.old.json
gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046/qc-pass.txt
gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046/cellranger/
gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046/fastq/
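In the listing above, entries that end in "/" are folders you can list further, while the others are files you can copy. As a small sketch, the hypothetical filters below (`only_dirs` and `only_files` are illustrative names, not gsutil features) split a listing on that trailing slash.

```shell
# Hypothetical helpers: read `gsutil ls` output on stdin and split it by kind.
# Entries ending in "/" are folders; everything else is a file (object).
only_dirs()  { grep '/$'; }
only_files() { grep -v '/$'; }

# Usage:
#   gsutil ls gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046 | only_files
#   gsutil ls gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046 | only_dirs
```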

4. Copy Single File from Bucket to Local

Once you have located the data file you wish to access, enter the command "gsutil cp" followed by the file location specified on the Data Schedule page and a destination. In the example below the destination is ".", which means the current directory.
Command:

gsutil cp gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046/pipeline-metadata.json .
Example Output:
[liangh@sumner103 tmp]$ gsutil cp gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046/pipeline-metadata.json .
Copying gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046/pipeline-metadata.json...
/ [1 files][  7.7 KiB/  7.7 KiB]
Operation completed over 1 objects/7.7 KiB. 
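If you would rather keep downloads in a dedicated folder and avoid re-downloading files you already have, a wrapper like the following is one option. It is a sketch under stated assumptions: `fetch_one` is a hypothetical helper name, and it relies on the `-n` (no-clobber) option of `gsutil cp`, which leaves existing local files untouched.

```shell
# Hypothetical helper: copy one object into a local directory,
# creating the directory first if needed.
fetch_one() {
    url="$1"
    dest="${2:-.}"          # default destination: current directory
    mkdir -p "$dest" || return 1
    # -n (no-clobber) skips the download when the file already exists locally.
    gsutil cp -n "$url" "$dest/"
}

# Usage:
#   fetch_one gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046/pipeline-metadata.json ./MS19046
```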

5. Copy Files from Bucket to Local Using Wildcard

By using a "wildcard", that is, inserting a '*' into the file location, you are able to download more than one file at a time. The wildcard symbol '*' stands in for any run of characters in the file name, allowing you to download or 'copy' all files that match in everything but the part replaced by the '*' symbol.
Command:

gsutil cp "gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046/*.json" .
Example Output:
[liangh@sumner103 tmp]$ gsutil cp gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046/*.json .
Copying gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046/pipeline-metadata.json...
Copying gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046/processed-metadata.json...
Copying gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046/processed-metadata.old.json...
/ [3 files][ 31.3 KiB/ 31.3 KiB]
Operation completed over 3 objects/31.3 KiB.
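The quotation marks around the URL in the command matter because your shell also treats '*' as a wildcard for local file names. Quoting passes the pattern to gsutil unchanged. In zsh, the default shell in the macOS Terminal, an unquoted pattern that matches no local file is reported as an error before gsutil ever runs. The sketch below illustrates the difference using scratch files only; the file names are made up for the demonstration and no bucket is contacted.

```shell
# Create a scratch directory holding two local .json files.
scratch="$(mktemp -d)"
cd "$scratch" || exit 1
touch local-a.json local-b.json

# Unquoted: the shell replaces the pattern with the matching local file names.
unquoted="$(echo *.json)"
# Quoted: the pattern is passed through unchanged, which is what gsutil needs.
quoted="$(echo "*.json")"

echo "unquoted -> $unquoted"
echo "quoted   -> $quoted"
```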

6. Copy Directory Recursive to Local

Use this step if you want to copy an entire folder, including all of its subfolders and files, to your system.

Command line explanation
| Option | Description |
| :--- | :--- |
| -m | "Causes supported operations (acl ch, acl set, cp, mv, rm, rsync, and setmeta) to run in parallel. This can significantly improve performance if you are performing operations on a large number of files over a reasonably fast network connection." Source |
| -r | "The -r or -R option for "recursive" means that it will copy all of the files including the files inside of subfolders." Source |

Command:

gsutil -m cp -r gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046 .
Example Output:
[liangh@sumner103 tmp]$ gsutil -m cp -r gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046 .
Copying gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046/cellranger/MS19046.mri.tgz...
Copying gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046/cellranger/_files/_cmdline...
Copying gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046/cellranger/_files/_filelist...
Copying gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046/cellranger/_files/_finalstate...
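Sequencing folders can be large, so it may be worth checking the total size before starting a recursive copy. The sketch below does that with `gsutil du`, using `-s` for a single summary line and `-h` for human-readable units. The helper name `download_dir` is hypothetical, and you should confirm you have enough free disk space before running the copy.

```shell
# Hypothetical helper: report a folder's total size, then copy it recursively.
download_dir() {
    url="$1"
    dest="${2:-.}"          # default destination: current directory
    # du -s prints one summary line for the folder; -h uses readable units.
    gsutil du -s -h "$url" || return 1
    # -m runs the copy in parallel; -r descends into subfolders.
    gsutil -m cp -r "$url" "$dest"
}

# Usage:
#   download_dir gs://jax-cube-prd-ctrl-01-project-data/single_cell/MS19046 .
```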