IRSA Tutorials

Querying the CosmoDC2 Mock v1 Catalogs

This tutorial demonstrates how to access and query the CosmoDC2 Mock v1 catalogs using IRSA’s Table Access Protocol (TAP) service. Background information on the catalogs is available on the IRSA CosmoDC2 page.

The catalogs are served through IRSA’s Virtual Observatory–standard TAP interface, which you can access programmatically in Python via the PyVO library. TAP queries are written in the Astronomical Data Query Language (ADQL) — a SQL-like language designed for astronomical catalogs (see the ADQL specification).

If you are new to PyVO’s query modes, the documentation provides a helpful comparison between synchronous and asynchronous execution: PyVO: Synchronous vs. Asynchronous Queries

Tips for Working with CosmoDC2 via TAP

# Uncomment the next line to install dependencies if needed.
# !pip install numpy matplotlib pyvo
import pyvo as vo
import numpy as np
import matplotlib.pyplot as plt
service = vo.dal.TAPService("https://irsa.ipac.caltech.edu/TAP")

1. List the available DC2 tables

tables = service.tables
for tablename in tables.keys():
    # Skip the TAP_SCHEMA metadata tables and keep only the DC2 catalogs.
    if "tap_schema" not in tablename and "dc2" in tablename:
        tables[tablename].describe()
cosmodc2mockv1
    CosmoDC2MockV1 Catalog - unabridged, spatially partitioned

cosmodc2mockv1_heavy
    CosmoDC2MockV1 Catalog - stellar mass > 10^7 Msun

cosmodc2mockv1_new
    CosmoDC2MockV1 Catalog - unabridged
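A similar listing can usually be obtained with an ADQL query against the standard TAP_SCHEMA.tables metadata table. The sketch below only builds the query string; the helper name `tables_matching_query` is hypothetical, and it assumes the service matches table names with a case-sensitive LIKE.

```python
def tables_matching_query(keyword):
    """Build an ADQL query listing tables whose name contains `keyword`."""
    # Double any single quote so the keyword stays inside the string literal.
    safe = keyword.replace("'", "''")
    return (
        "SELECT table_name, description "
        "FROM TAP_SCHEMA.tables "
        f"WHERE table_name LIKE '%{safe}%'"
    )


adql_tables = tables_matching_query("dc2")
print(adql_tables)
```

The resulting string can be passed to `service.run_sync(adql_tables)` in the same way as the other queries in this tutorial.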

2. Choose the DC2 catalog you want to work with.

IRSA currently offers 3 versions of the DC2 catalog.

If you are new to the DC2 catalog, we recommend starting with cosmodc2mockv1_heavy, the abridged table restricted to galaxies with stellar mass > 10^7 Msun.

# Choose the abridged table to start with.
# Queries should be faster on smaller tables.

tablename = 'cosmodc2mockv1_heavy'

3. What is the default maximum number of rows returned by the service?

This service will return a maximum of 2 billion rows by default.

service.maxrec
2000000000

This default can be changed, and the service reports no hard upper limit on the value you choose.

print(service.hardlimit)
None
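While exploring, it can help to cap the number of rows explicitly instead of relying on the default. The sketch below passes PyVO's `maxrec` keyword to `run_sync` and flags results that may have been cut short; the helper name `run_capped` is hypothetical, and the truncation check is a simple heuristic based on the row count.

```python
def run_capped(service, adql, maxrec):
    """Run a synchronous query that returns at most `maxrec` rows.

    Returns a tuple (results, possibly_truncated).
    """
    results = service.run_sync(adql, maxrec=maxrec)
    # If the row count reaches the cap, the server may have stopped early.
    possibly_truncated = len(results) >= maxrec
    return results, possibly_truncated


# Example call, using the service and table name defined earlier:
# results, truncated = run_capped(service, f"SELECT ra, dec FROM {tablename}", 1000)
```

A `TOP n` clause in the ADQL achieves a similar cap on the query side.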

4. List the columns in the chosen table

This table contains 301 columns.

columns = tables[tablename].columns
print(len(columns))
301

Let’s learn a bit more about them.

for col in columns:
    print(f'{f"{col.name}":30s}  {col.description}')
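With 301 columns, scanning the full listing by eye is tedious. A minimal sketch of a keyword filter follows; it assumes only that each column object exposes the `name` and `description` attributes used in the loop above, and the helper name `find_columns` is hypothetical.

```python
def find_columns(columns, keyword):
    """Return names of columns whose name or description mentions `keyword`."""
    keyword = keyword.lower()
    matches = []
    for col in columns:
        # Descriptions can be missing, so fall back to an empty string.
        text = f"{col.name} {col.description or ''}".lower()
        if keyword in text:
            matches.append(col.name)
    return matches


# Example call, using the `columns` list retrieved above:
# print(find_columns(columns, "mass"))
```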

5. Retrieve a list of galaxies within a small area

Since CosmoDC2 is a large catalog, we start with a spatial search over a small circular area (a cone with a radius of 0.05 degrees). The ADQL needed for the spatial constraint is shown below. We then show how to make a redshift histogram of the resulting sample.

# Setup the query
adql = f"""
SELECT redshift
FROM {tablename}
WHERE CONTAINS(
    POINT('ICRS', ra, dec),
    CIRCLE('ICRS', 54.0, -37.0, 0.05)
) = 1
"""

cone_results = service.run_sync(adql)
# How many redshifts does this return?
print(len(cone_results))
10640
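A row count like this one can be turned into a surface density by dividing by the solid angle of the cone. The sketch below uses the exact spherical-cap formula, which is nearly identical to pi * r^2 for small radii; the function names are hypothetical, and the density applies only to the galaxies in the chosen table.

```python
import math


def cone_area_deg2(radius_deg):
    """Solid angle of a cone of angular radius `radius_deg`, in square degrees."""
    radius_rad = math.radians(radius_deg)
    steradians = 2.0 * math.pi * (1.0 - math.cos(radius_rad))
    # One steradian is (180 / pi)^2 square degrees.
    return steradians * math.degrees(1.0) ** 2


def surface_density(count, radius_deg):
    """Number of sources per square degree inside the cone."""
    return count / cone_area_deg2(radius_deg)


# 10640 rows were returned for the 0.05 degree cone above.
density = surface_density(10640, 0.05)
print(f"{density:.3g} galaxies per square degree")
```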
# Now that we have a list of galaxy redshifts in that region, we can
# create a histogram of the redshifts to see what redshifts this survey includes.

# Plot a histogram
num_bins = 20
# the histogram of the data
n, bins, patches = plt.hist(cone_results['redshift'], num_bins,
                            facecolor='blue', alpha = 0.5)
plt.xlabel('Redshift')
plt.ylabel('Number')
plt.title(f'Redshift Histogram {tablename}')
<Figure size 640x480 with 1 Axes>

We can see from this plot that the simulated galaxies extend out to z = 3.
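The cone search used in this section can be wrapped in a small helper so that the position, radius, and selected columns are easy to vary. This is a minimal sketch: the function name `cone_query` and its parameters are hypothetical, and it assumes the table stores positions in columns named `ra` and `dec`, as the query above does.

```python
def cone_query(table, ra_deg, dec_deg, radius_deg, columns=("redshift",), top=None):
    """Build an ADQL cone-search query centered on (ra_deg, dec_deg)."""
    if radius_deg <= 0:
        raise ValueError("radius_deg must be positive")
    # An optional TOP clause limits the number of rows returned.
    top_clause = f"TOP {int(top)} " if top is not None else ""
    column_list = ", ".join(columns)
    return (
        f"SELECT {top_clause}{column_list}\n"
        f"FROM {table}\n"
        "WHERE CONTAINS(\n"
        "    POINT('ICRS', ra, dec),\n"
        f"    CIRCLE('ICRS', {ra_deg}, {dec_deg}, {radius_deg})\n"
        ") = 1"
    )


adql_cone = cone_query("cosmodc2mockv1_heavy", 54.0, -37.0, 0.05)
print(adql_cone)
```

The returned string reproduces the query used in this section and can be passed to `service.run_sync` unchanged.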

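6. Visualize galaxy colors at z = 2.0

First, we do a narrow redshift cut (1.95 to 2.05) with no spatial constraint, limited to the first 50,000 rows. Then, from that redshift sample, we visualize the distribution of galaxy color versus magnitude at z = 2.0.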

# Setup the query
adql = f"""
SELECT TOP 50000
    mag_r_lsst,
    (mag_g_lsst - mag_r_lsst) AS color,
    redshift
FROM {tablename}
WHERE redshift BETWEEN 1.95 AND 2.05
"""
redshift_results = service.run_sync(adql)
redshift_results
<DALResultsTable length=50000>
mag_r_lsst    color     redshift
   mag
 float32     float32    float32
---------- ------------ --------
    28.416  -0.09135628   2.0274
    29.968    0.6469059   2.0347
    30.732    0.8355446   2.0274
    29.940   0.54089165   2.0257
    27.383  -0.07013512   2.0289
    28.493  -0.21741104   2.0273
    27.624  -0.08807373   2.0323
    30.076   0.57766914   2.0473
    29.682   0.33118057   2.0490
       ...          ...      ...
    30.588    0.9645691   2.0490
    28.014  -0.03717804   2.0340
    27.949 -0.018476486   2.0240
    27.963  -0.18163872   2.0326
    29.253    0.2999897   2.0367
    29.318   0.30609512   2.0280
    29.904    0.6178303   2.0415
    29.076    0.2580967   2.0444
    27.677 -0.068496704   2.0455
# Construct a 2D histogram of the galaxy colors
plt.hist2d(redshift_results['mag_r_lsst'], redshift_results['color'],
            bins=100, cmap='plasma', cmax=500)

# Plot a colorbar with label.
cb = plt.colorbar()
cb.set_label('Number')

# Add title and labels to plot.
plt.xlabel('LSST Mag r')
plt.ylabel('LSST rest-frame g-r color')
<Figure size 640x480 with 2 Axes>

7. Suggestions for further queries:

TAP queries are extremely powerful and provide flexible ways to explore large catalogs like CosmoDC2, including spatial searches, photometric selections, cross-matching, and more. However, many valid ADQL queries can take minutes or longer to complete due to the size of the catalog, so we avoid running those directly in this tutorial. Instead, the examples here have so far focused on fast, lightweight queries that illustrate the key concepts without long wait times. If you are interested in exploring further, here are some additional query ideas that are scientifically useful but may take longer to run depending on server conditions.
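For queries that may run longer than a synchronous request allows, PyVO's asynchronous job interface is one option. The sketch below submits a job, waits for it to finish, and deletes it from the server afterwards; the wrapper name `run_async_query` and the 600-second timeout are illustrative choices, not IRSA recommendations.

```python
def run_async_query(service, adql, timeout=600.0):
    """Submit `adql` as an asynchronous TAP job and return its results."""
    job = service.submit_job(adql)
    try:
        job.run()
        # Block until the job reaches a terminal phase or the timeout expires.
        job.wait(phases=["COMPLETED", "ERROR", "ABORTED"], timeout=timeout)
        job.raise_if_error()
        return job.fetch_result()
    finally:
        # Remove the job from the server whether or not it succeeded.
        job.delete()


# Example call, using the service defined earlier:
# results = run_async_query(service, adql)
```

For simple cases, `service.run_async(adql)` performs the same submit, wait, and fetch cycle in a single call.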

Count the total number of redshifts in the chosen table

The answer for the 'cosmodc2mockv1_heavy' table is 597,488,849 redshifts.

adql = f"SELECT count(redshift) FROM {tablename}"

Count the galaxies within a cone

Generally useful for estimating source density, validating the spatial footprint, and testing spatial completeness.

adql = f"""
SELECT COUNT(*)
FROM {tablename}
WHERE CONTAINS(POINT('ICRS', ra, dec), CIRCLE('ICRS', 54.2, -37.5, 0.2)) = 1
"""

Retrieve positions, redshifts, and stellar masses for a sample of galaxies

The “TOP 5000” clause only limits the number of rows returned. Remove it if you want all rows, but keep in mind that such a query can take much longer.

adql = f"""
SELECT TOP 5000
    ra,
    dec,
    redshift,
    stellar_mass
FROM {tablename}"""

Explore the stellar–halo mass relation

adql = f"""
SELECT TOP 500000
    stellar_mass,
    halo_mass
FROM {tablename}
WHERE halo_mass > 1e11"""
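One way to summarize the relation once the rows are downloaded is the median stellar mass in bins of halo mass, both in log10. The sketch below uses only the standard library so that it has no extra dependencies; the function name `binned_median` and the choice of equal-width logarithmic bins are assumptions made for illustration.

```python
import math
from statistics import median


def binned_median(x, y, n_bins=10):
    """Median of log10(y) in equal-width bins of log10(x).

    Returns a list of (bin_center, median_log_y, n_points) for non-empty bins.
    """
    # Logarithms require positive values, so drop any non-positive pairs.
    pairs = [(math.log10(a), math.log10(b)) for a, b in zip(x, y) if a > 0 and b > 0]
    if not pairs:
        return []
    lo = min(p[0] for p in pairs)
    hi = max(p[0] for p in pairs)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all x are equal
    bins = [[] for _ in range(n_bins)]
    for log_x, log_y in pairs:
        # The largest value lands exactly on the upper edge; keep it in the last bin.
        index = min(int((log_x - lo) / width), n_bins - 1)
        bins[index].append(log_y)
    return [
        (lo + (i + 0.5) * width, median(values), len(values))
        for i, values in enumerate(bins)
        if values
    ]


# Example call, once the query above has been run into `results`:
# relation = binned_median(results["halo_mass"], results["stellar_mass"])
```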

Find the brightest galaxies at high redshift

Return the results in ascending (ASC) order of r-band magnitude, so that the brightest galaxies come first.

adql = f"""
SELECT TOP 10000
    ra, dec, redshift, mag_r_lsst
FROM {tablename}
WHERE redshift > 2.5
ORDER BY mag_r_lsst ASC
"""

About this notebook

Author: IRSA Data Science Team, including Vandana Desai, Jessica Krick, Troy Raen, Brigitta Sipőcz, Andreas Faisst, Jaladh Singhal

Updated: 2025-12-16

Contact: the IRSA Helpdesk with questions or to report problems.

Runtime: As of the date above, this notebook takes about 2 minutes to run to completion on a machine with 8 GB of RAM and 2 CPUs. Large variations in this runtime can be expected if the TAP server is busy with many queries at once.