Register h5ad files of cellxgene-census

Setup

# !lamin init --storage s3://lamindata --name cellxgene-census --schema bionty
# !lamin close
2023-09-19 14:02:40,119:INFO - Found credentials in shared credentials file: ~/.aws/credentials
2023-09-19 14:02:40,903:INFO - Found credentials in shared credentials file: ~/.aws/credentials
❗ storage exists already
✅ registered instance on hub: https://lamin.ai/sunnyosun/cellxgene-census
✅ saved: User(id='kmvZDIX9', handle='sunnyosun', email='xs338@nyu.edu', name='Sunny Sun', updated_at=2023-09-19 12:02:50)
✅ saved: Storage(id='B4O1DDsR', root='s3://lamindata', type='s3', region='us-east-1', updated_at=2023-09-19 12:02:50, created_by_id='kmvZDIX9')
💡 loaded instance: sunnyosun/cellxgene-census
❗ locked instance (to unlock and push changes to the cloud SQLite file, call: lamin close)

!lamin load laminlabs/cellxgene-census
💡 loaded instance: laminlabs/cellxgene-census

import lamindb as ln
import lnschema_bionty as lb
import cellxgene_census
💡 lamindb instance: laminlabs/cellxgene-census
ln.track()
💡 notebook imports: cellxgene_census lamindb==0.54.4 lnschema_bionty==0.31.2
💡 Transform(uid='nhGTqlIHEyn7z8', name='Register h5ad files of cellxgene-census', short_name='files', version='0', type='notebook', reference='https://cellxgene-census-lamin-c192.netlify.app/notebooks/files', reference_type='cellxgene-census-lamin', updated_at=2023-10-16 15:04:08, latest_report_id=852, source_file_id=851, created_by_id=1)
💡 Run(uid='u6hWPhTQXwCTlNSi8Iaj', run_at=2023-10-24 15:48:23, transform_id=1, created_by_id=2)
census_version = "2023-07-25"

Register datasets

census = cellxgene_census.open_soma(census_version=census_version)
The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.
census
<Collection 's3://cellxgene-data-public/cell-census/2023-07-25/soma/' (open for 'r') (2 items)
    'census_data': 's3://cellxgene-data-public/cell-census/2023-07-25/soma/census_data' (unopened)
    'census_info': 's3://cellxgene-data-public/cell-census/2023-07-25/soma/census_info' (unopened)>
census["census_data"]
<Collection 's3://cellxgene-data-public/cell-census/2023-07-25/soma/census_data' (open for 'r') (2 items)
    'mus_musculus': 's3://cellxgene-data-public/cell-census/2023-07-25/soma/census_data/mus_musculus' (unopened)
    'homo_sapiens': 's3://cellxgene-data-public/cell-census/2023-07-25/soma/census_data/homo_sapiens' (unopened)>
census["census_info"]
<Collection 's3://cellxgene-data-public/cell-census/2023-07-25/soma/census_info' (open for 'r') (3 items)
    'summary': 's3://cellxgene-data-public/cell-census/2023-07-25/soma/census_info/summary' (unopened)
    'summary_cell_counts': 's3://cellxgene-data-public/cell-census/2023-07-25/soma/census_info/summary_cell_counts' (unopened)
    'datasets': 's3://cellxgene-data-public/cell-census/2023-07-25/soma/census_info/datasets' (unopened)>
datasets_df = census["census_info"]["datasets"].read().concat().to_pandas()
datasets_df.shape
(593, 8)
datasets_df.head()
 | soma_joinid | collection_id | collection_name | collection_doi | dataset_id | dataset_title | dataset_h5ad_path | dataset_total_cell_count
0 | 0 | e2c257e7-6f79-487c-b81c-39451cd4ab3c | Spatial multiomics map of trophoblast developm... | 10.1038/s41586-023-05869-0 | f171db61-e57e-4535-a06a-35d8b6ef8f2b | donor_p13_trophoblasts | f171db61-e57e-4535-a06a-35d8b6ef8f2b.h5ad | 31497
1 | 1 | e2c257e7-6f79-487c-b81c-39451cd4ab3c | Spatial multiomics map of trophoblast developm... | 10.1038/s41586-023-05869-0 | ecf2e08e-2032-4a9e-b466-b65b395f4a02 | All donors trophoblasts | ecf2e08e-2032-4a9e-b466-b65b395f4a02.h5ad | 67070
2 | 2 | e2c257e7-6f79-487c-b81c-39451cd4ab3c | Spatial multiomics map of trophoblast developm... | 10.1038/s41586-023-05869-0 | 74cff64f-9da9-4b2a-9b3b-8a04a1598040 | All donors all cell states (in vivo) | 74cff64f-9da9-4b2a-9b3b-8a04a1598040.h5ad | 286326
3 | 3 | f7cecffa-00b4-4560-a29a-8ad626b8ee08 | Mapping single-cell transcriptomes in the intr... | 10.1016/j.ccell.2022.11.001 | 5af90777-6760-4003-9dba-8f945fec6fdf | Single-cell transcriptomic datasets of Renal c... | 5af90777-6760-4003-9dba-8f945fec6fdf.h5ad | 270855
4 | 4 | 3f50314f-bdc9-40c6-8e4a-b0901ebfbe4c | Single-cell sequencing links multiregional imm... | 10.1016/j.ccell.2021.03.007 | bd65a70f-b274-4133-b9dd-0d1431b6af34 | Single-cell sequencing links multiregional imm... | bd65a70f-b274-4133-b9dd-0d1431b6af34.h5ad | 167283
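# Register all h5ad files of this census release. The path is built from
# census_version so it stays in sync with the release opened above.
files = ln.File.from_dir(
    f"s3://cellxgene-data-public/cell-census/{census_version}/h5ads"
)
ln.save(files)
dataset = ln.Dataset(files, name="cellxgene-census", version=census_version)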
❗ returning existing dataset with same hash: Dataset(uid='EAUF1AaT4kOVyHYnZsUJ', name='cellxgene-census', version='2023-07-25', hash='pEJ9uvIeTLvHkZW2TBT5', updated_at=2023-10-24 16:00:07, transform_id=1, run_id=9, created_by_id=2)
dataset.save()
# One row per collection, indexed by collection_id.
collections_df = (
    datasets_df[["collection_id", "collection_name", "collection_doi"]]
    .drop_duplicates()
    .set_index("collection_id")
)
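# Create one ULabel per collection: the DOI goes into the description and the
# CELLxGENE collection ID into the reference field.
collections = []
for collection_id, row in collections_df.iterrows():
    collection = ln.ULabel(
        name=row.collection_name,
        description=row.collection_doi,
        reference=collection_id,
        reference_type="collection_id",
    )
    collections.append(collection)

ln.save(collections)

# Group all collection labels under a parent label "is_collection".
is_collection = ln.ULabel(name="is_collection")
is_collection.save()
is_collection.children.set(collections)
# Rebind to the saved children so they can be queried by reference below.
collections = is_collection.children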
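The deduplication step assumes that every collection_id maps to a single set of metadata. A minimal plain-Python sketch of that assumption follows; `unique_collections` is a hypothetical helper introduced here for illustration, and the rows are the collection IDs and DOIs from the five table rows shown above.

```python
def unique_collections(rows):
    """Map collection_id -> collection_doi, rejecting conflicting metadata."""
    seen = {}
    for collection_id, doi in rows:
        # setdefault stores the first DOI seen for an ID and returns it afterwards
        if seen.setdefault(collection_id, doi) != doi:
            raise ValueError(f"conflicting DOI for collection {collection_id}")
    return seen


# (collection_id, collection_doi) pairs copied from datasets_df.head()
rows = [
    ("e2c257e7-6f79-487c-b81c-39451cd4ab3c", "10.1038/s41586-023-05869-0"),
    ("e2c257e7-6f79-487c-b81c-39451cd4ab3c", "10.1038/s41586-023-05869-0"),
    ("e2c257e7-6f79-487c-b81c-39451cd4ab3c", "10.1038/s41586-023-05869-0"),
    ("f7cecffa-00b4-4560-a29a-8ad626b8ee08", "10.1016/j.ccell.2022.11.001"),
    ("3f50314f-bdc9-40c6-8e4a-b0901ebfbe4c", "10.1016/j.ccell.2021.03.007"),
]
collections_by_id = unique_collections(rows)
print(len(collections_by_id))  # 3 distinct collections among the 5 rows
```

If a collection ID ever appeared with two different DOIs, `drop_duplicates()` would keep both rows and the index would no longer be unique; a check of this kind would surface that early.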
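files = ln.File.filter()
# Feature under which the collection label is attached to each file.
feature_collection = ln.Feature(name="collection", type="category")
feature_collection.save()
for _, row in datasets_df.iterrows():
    # File keys end in "<dataset_id>.h5ad"; .one() expects exactly one match.
    file = files.filter(key__endswith=f"{row.dataset_id}.h5ad").one()
    # Store title and ID in the description so files can be found by dataset_id later.
    file.description = f"{row.dataset_title}|{row.dataset_id}"
    file.save()
    file.labels.add(collections.get(reference=row.collection_id), feature_collection)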
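The loop above relies on two conventions: each file key ends in `<dataset_id>.h5ad`, and each description is written as `<dataset_title>|<dataset_id>`. The following lamindb-free sketch illustrates both. `match_key` and `split_description` are hypothetical helpers, not part of lamindb, and the key prefix for the first entry is assumed to follow the same layout as the registered files.

```python
def match_key(keys, dataset_id):
    """Return the single key ending in '<dataset_id>.h5ad'; fail otherwise."""
    suffix = f"{dataset_id}.h5ad"
    hits = [key for key in keys if key.endswith(suffix)]
    if len(hits) != 1:
        raise LookupError(
            f"expected exactly one key for {dataset_id}, found {len(hits)}"
        )
    return hits[0]


def split_description(description):
    """Split '<dataset_title>|<dataset_id>' on the last '|', so titles may contain '|'."""
    title, dataset_id = description.rsplit("|", 1)
    return title, dataset_id


keys = [
    "cell-census/2023-07-25/h5ads/f171db61-e57e-4535-a06a-35d8b6ef8f2b.h5ad",
    "cell-census/2023-07-25/h5ads/8c42cfd0-0b0a-46d5-910c-fc833d83c45e.h5ad",
]
key = match_key(keys, "8c42cfd0-0b0a-46d5-910c-fc833d83c45e")
title, dataset_id = split_description(
    "Krasnow Lab Human Lung Cell Atlas, 10X|8c42cfd0-0b0a-46d5-910c-fc833d83c45e"
)
print(key)
print(title)
print(dataset_id)
```

Because dataset IDs are UUIDs, a substring or suffix match is very unlikely to hit the wrong file, which is what makes the later `description__contains` lookups workable.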

Annotate with species

feature_organism = ln.Feature(name="organism", type="category")
feature_organism.save()
files = ln.File.filter()
lb.settings.organism = "human"

# Read only the dataset_id column of the human obs table and keep the unique IDs.
human_datasets = (
    census["census_data"][lb.settings.organism.scientific_name]
    .obs.read(column_names=["dataset_id"])
    .concat()
    .to_pandas()
    .drop_duplicates()
)
print(human_datasets.shape)

# Label every file whose description contains one of these dataset IDs as human.
for dataset_id in human_datasets.dataset_id:
    file = files.filter(description__contains=dataset_id).one()
    file.labels.add(lb.settings.organism, feature_organism)
(511, 1)
lb.settings.organism = "mouse"

# Same procedure for mouse: unique dataset IDs from the mouse obs table.
mouse_datasets = (
    census["census_data"][lb.settings.organism.scientific_name]
    .obs.read(column_names=["dataset_id"])
    .concat()
    .to_pandas()
    .drop_duplicates()
)
print(mouse_datasets.shape)

# Label every file whose description contains one of these dataset IDs as mouse.
for dataset_id in mouse_datasets.dataset_id:
    file = files.filter(description__contains=dataset_id).one()
    file.labels.add(lb.settings.organism, feature_organism)
(82, 1)
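The two shapes printed above, 511 human and 82 mouse datasets, add up to the 593 rows of datasets_df, which suggests that every dataset received exactly one organism label. A small sketch of that consistency check on plain Python collections follows; `check_partition` is a hypothetical helper, and the toy IDs are made up for illustration.

```python
def check_partition(all_ids, *groups):
    """Return True if the groups are pairwise disjoint and together equal all_ids."""
    seen = set()
    for group in groups:
        group = set(group)
        overlap = seen & group
        if overlap:
            raise ValueError(f"IDs in more than one group: {sorted(overlap)}")
        seen |= group
    if seen != set(all_ids):
        mismatch = sorted(seen ^ set(all_ids))
        raise ValueError(f"IDs not covered or unexpected: {mismatch}")
    return True


# Counts printed in this notebook: 511 human + 82 mouse, 593 rows in datasets_df.
assert 511 + 82 == 593

# Toy IDs, made up for illustration.
all_ids = ["d1", "d2", "d3"]
human_ids = ["d1", "d2"]
mouse_ids = ["d3"]
print(check_partition(all_ids, human_ids, mouse_ids))
```

With the real data, the arguments would be `datasets_df.dataset_id`, `human_datasets.dataset_id`, and `mouse_datasets.dataset_id`.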
file.describe()
File(id='0sbCRBKbqkEuSjhzfp42', key='cell-census/2023-07-25/h5ads/8c42cfd0-0b0a-46d5-910c-fc833d83c45e.h5ad', suffix='.h5ad', accessor='AnnData', description='Krasnow Lab Human Lung Cell Atlas, 10X|8c42cfd0-0b0a-46d5-910c-fc833d83c45e', size=588959280, hash='N0yW4Iksvgw93PzdE_4M0w-71', hash_type='md5-n', updated_at=2023-10-05 16:06:49)

Provenance:
  🗃️ storage: Storage(id='oIYGbD74', root='s3://cellxgene-data-public', type='s3', region='us-west-2', updated_at=2023-09-19 13:17:56, created_by_id='kmvZDIX9')
  📔 transform: Transform(id='nhGTqlIHEyn7z8', name='Register h5ad files of cellxgene-census', short_name='files', version='0', type='notebook', reference='https://github.com/laminlabs/cellxgene-census-lamin/blob/2553c2690909976efe380ca96d9e4d6b9a6c6749/docs/notebooks/datasets.ipynb', reference_type='github', updated_at=2023-10-05 14:04:28, created_by_id='kmvZDIX9')
  👣 run: Run(id='60jqKpxivkwkpEFZr8mp', run_at=2023-10-05 15:31:55, transform_id='nhGTqlIHEyn7z8', created_by_id='kmvZDIX9')
  👤 created_by: User(id='kmvZDIX9', handle='sunnyosun', email='xs338@nyu.edu', name='Sunny Sun', updated_at=2023-09-19 14:58:33)
Features:
  external: FeatureSet(id='OHD9LSDGO1FtSWUtcpqG', n=2, registry='core.Feature', hash='NspE1QMvOo8aoOOrotmH', updated_at=2023-10-05 16:06:49, modality_id='FyZj4S3Z', created_by_id='kmvZDIX9')
    🔗 organism (1, bionty.Species): 'human'
    🔗 collection (1, core.ULabel): 'A molecular cell atlas of the human lung from single cell RNA sequencing'
Labels:
  🏷️ species (1, bionty.Species): 'human'
  🏷️ ulabels (1, core.ULabel): 'A molecular cell atlas of the human lung from single cell RNA sequencing'
census.close()