How to participate
To get started and submit your first solution, you will need to go through the following steps:
Create an account on the CrunchDAO platform;
Set up your workspace to get access to the data;
Test your solution locally;
Get the confirmation that your code is running.
1. Create an account
Creating an account on the CrunchDAO platform will allow you to get access to the competition dataset. Follow the link below to join the competition.
2. Setup and data
Two types of submissions are possible. Your setup and the way you access the data differ slightly depending on whether you use a Python Notebook (.ipynb) or a Python Script (.py).
2.1 Notebook Participation Setup
# Get the crunch library in your workspace.
%pip install crunch-cli --upgrade
# To use the library, import the crunch package and instantiate it to be able to access its functionality.
# You can do that using the following lines:
import crunch
crunch = crunch.load_notebook(__name__)
# Authenticates your user, downloads your project workspace, and enables your access to the data
!crunch setup <competition> --token <token>
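Once the setup command has completed, the data can be loaded directly from the notebook. The exact call may vary with the crunch-cli version; a typical pattern, assuming the load_data helper is available, looks like this:
# Load the competition data as pandas DataFrames (assuming the crunch.load_data helper)
X_train, y_train, X_test = crunch.load_data()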
2.2 Script Participation Setup
Go to https://hub.crunchdao.com/competitions/venture-capital-portfolio-prediction/submit and click on the "reveal the command" button to access the commands that will set up your workspace. Execute the commands in a terminal, in a working directory of your choice.
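The revealed commands are personal because they contain your token, but they typically follow the same pattern as the notebook setup above. Shown here with placeholders only; copy the exact commands from the platform:
# Install or upgrade the CLI
pip install crunch-cli --upgrade
# Authenticate, create the project folder, and download the data (placeholders shown)
crunch setup <competition> --token <token>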

3. Your working directory
Once you run the setup commands, the crunch package will download the data and create a folder named after your username on the platform. Here is a snapshot of your working directory:
$ tree
.
├── data
│   ├── X_test.parquet
│   ├── X_train.parquet
│   └── y_train.parquet
├── main.py
├── requirements.txt
└── resources

3 directories, 5 files
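For script participation, main.py is expected to expose the train and infer functions that the platform calls (see the server loop in section 7). Below is a minimal, illustrative sketch only: the model choice, the target column, and the output column names (id, moon, prediction) are assumptions to adapt to the actual competition data and template.
import os
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

def train(X_train: pd.DataFrame, y_train: pd.DataFrame, model_directory_path: str) -> None:
    # Fit a simple baseline model (assuming the last column of y_train is the target)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train.select_dtypes("number"), y_train.iloc[:, -1])
    # Persist the model inside the resources directory so that infer can reload it
    joblib.dump(model, os.path.join(model_directory_path, "model.joblib"))

def infer(model_directory_path: str, X_test: pd.DataFrame) -> pd.DataFrame:
    # Reload the persisted model and predict 0/1 labels for the current date
    model = joblib.load(os.path.join(model_directory_path, "model.joblib"))
    labels = model.predict(X_test.select_dtypes("number"))
    # "id", "moon" and "prediction" are placeholder column names
    return X_test[["id", "moon"]].assign(prediction=labels)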
4. Testing your code locally
The crunch test command allows you to perform a local test of your code. The associated test set is purposefully very small and should only be used to check that your code runs.
This command also performs a series of checks to verify that your generated prediction file matches the format expected by the competition. The example_submission file in the data folder serves as a reference for the expected format.
⚠️ Failure to pass these tests will result in your prediction not being scored and subsequently rejected.
# Upgrade the crunch-cli library to make sure you have the latest version
pip install crunch-cli --upgrade
# Run a local test in a notebook
crunch.test(force_first_train=True)
# Run a local test in your terminal
crunch test --no-force-first-train
This function of the crunch package will run your code locally, simulating how it is called in the cloud.
In a notebook, force_first_train=True indicates that your model will be trained on the first date of the test set. Similarly, --no-force-first-train controls the same parameter for terminal calls (note that using this flag does the opposite of force_first_train=True in the notebook case).
Usage: crunch test [OPTIONS]

  Test your code locally.

Options:
  -m, --main-file TEXT       Entrypoint of your code.  [default: main.py]
  --model-directory TEXT     Directory where your model is stored.  [default: resources]
  --no-force-first-train     Do not force the train at the first loop.
  --train-frequency INTEGER  Train interval.  [default: 1]
  --help                     Show this message and exit.
The key tests performed are:
Column Names: the columns in your file must precisely match those in example_submission.
Values Integrity: the prediction_column_name column must not contain any NaN (Not-a-Number) or infinite values.
Binary Values: values in the prediction_column_name column must exclusively be 0 or 1.
Moon Verification: values in the moon_column_name column must match those found in the X_test received by the infer function.
ID Verification: values in the id_column_name column must match the corresponding ones in X_test for each moon.
The source code is public and can be accessed on the GitHub repository here.
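For intuition, here is a rough sketch of what these checks amount to. It is not the actual test implementation, and prediction, moon and id stand in for the real column names:
import numpy as np
import pandas as pd

def basic_checks(prediction: pd.DataFrame, example: pd.DataFrame, x_test: pd.DataFrame) -> None:
    # Column Names: must match the example submission exactly
    assert list(prediction.columns) == list(example.columns)
    # Values Integrity: no NaN or infinite values in the prediction column
    assert np.isfinite(prediction["prediction"]).all()
    # Binary Values: only 0 or 1 are allowed
    assert prediction["prediction"].isin([0, 1]).all()
    # Moon Verification: same moons as in the X_test passed to infer
    assert set(prediction["moon"]) == set(x_test["moon"])
    # ID Verification: same ids as in X_test for each moon
    for moon, group in prediction.groupby("moon"):
        expected = x_test.loc[x_test["moon"] == moon, "id"]
        assert set(group["id"]) == set(expected)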
5. Submit
Download your notebook in .ipynb format and upload it in the Submit section of the CrunchDAO platform.
Specifying package versions
Since submitting a Notebook does not include a requirements.txt, users can instead specify a package's version using requirement specifiers at the import level, in a comment on the same line.
# valid statement
import pandas # == 1.3
import sklearn # >= 1.2, < 2.0
import tqdm # [foo, bar]
import scikit # ~= 1.4.2
from requests import Session # == 1.5
Specifying a version for the same package multiple times will cause the submission to be rejected if the versions differ.
# inconsistent versions will be rejected
import pandas # == 1.3
import pandas # == 1.5
Specifying versions for standard library modules has no effect (but the submission will still be rejected if the versions are inconsistent).
# will be ignored
import os # == 1.3
import sys # == 1.5
5.1 Submit with Crunch CLI (optional)
Usage: crunch push [OPTIONS]

  Send the new submission of your code.

Options:
  -m, --message TEXT      Specify the change of your code. (like a commit message)
  -e, --main-file TEXT    Entrypoint of your code.  [default: main.py]
  --model-directory TEXT  Directory where your model is stored.  [default: resources]
  --help                  Show this message and exit.
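For example, from your project directory:
# Push the content of your working directory as a new submission, with a message
crunch push --message "first submission"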
6. Check your submission
If the submission is complete, you will see it appear in your submissions section.

The backend parses your submission to retrieve the code of the interface functions (i.e. train and infer) and the dependencies of your code. By clicking on the right-side arrow, you can access your submission's content.

7. Testing your code on the server
In order to run your submission on the cloud and get a score, you need to click on a submission and then on the Run in the Cloud button.

Your code is called on each individual date. The calls go through the dates sequentially but are otherwise independent. Keep in mind that, for each individual date, the data contains the cross-section of the investment vehicles in the universe at that time.
At each date, your code will access only the data available up to that point.
Here is a high-level overview of how your code will be called:
# This loops over the private test set dates to avoid leaking the X of future periods
for date in dates:
    # The wrapper blocks the logging of the user's code after the first 5 dates
    if date >= log_threshold:
        log = False

    # If the user asked for a retrain on the current date
    if retrain:
        # Cut the sample so that the user's code only accesses the data it is allowed to see
        X_train_cut = X_train[X_train.date < date - embargo]
        y_train_cut = y_train[y_train.date < date - embargo]

        # This is where your `train` code is called
        train(X_train_cut, y_train_cut, model_directory_path)

    # Keep only the current date
    X_test_date = X_test[X_test.date == date]

    # This is where your `infer` code is called
    prediction = infer(model_directory_path, X_test_date)

    if date > log_threshold:
        predictions.append(prediction)

# Concatenate all of the individual predictions
prediction = pandas.concat(predictions)

# Upload it to our servers
upload(prediction)

# Upload the model's files to our servers
for file_name in os.listdir(model_directory_path):
    upload(file_name)
8. Monitoring Your Code Runs
Once you successfully launch your run on the cloud, you can monitor its proper execution with the run logs.