feat: refactor for general data science #498

Draft
wants to merge 271 commits into
base: main
Commits
d04b280
remove value detection from data_science model evaluator
TPLin22 Dec 5, 2024
da8d0b7
data loader CoSTEER
XianBW Dec 6, 2024
9e8d74d
merge
XianBW Dec 6, 2024
b2a445c
ds model eval: init use gpt for shape evaluator
TPLin22 Dec 6, 2024
231bf91
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
TPLin22 Dec 6, 2024
dea5605
refactor: Update data loader evaluation and execution logic
you-n-g Dec 6, 2024
e023791
redundance
XianBW Dec 9, 2024
bf36de8
split spec.md
XianBW Dec 9, 2024
9674c16
ds model test: init evolving strategy and unit test
TPLin22 Dec 9, 2024
06e2043
data science scenario changes
XianBW Dec 10, 2024
4f76a0b
data science base file
XianBW Dec 10, 2024
14325de
proposal related
XianBW Dec 10, 2024
7031610
proposal related
XianBW Dec 10, 2024
075aa9e
complete judge
XianBW Dec 11, 2024
e2133a0
some changes
XianBW Dec 11, 2024
ca7d785
simple readme for data loader costeer
XianBW Dec 11, 2024
9c89f4b
proposal related
XianBW Dec 12, 2024
3352c99
Draft for Bowen
you-n-g Dec 12, 2024
318e457
add property
XianBW Dec 12, 2024
a26bdb4
fix knowledgemn
XianBW Dec 12, 2024
3c0883d
fix feedback prompt
XianBW Dec 12, 2024
bb4a0a4
fix feedback bug
XianBW Dec 12, 2024
67be799
fix json_mode bug
XianBW Dec 12, 2024
d24ba79
feature processing
WinstonLiyt Dec 12, 2024
7d7cab8
fix execute data volume problem
XianBW Dec 12, 2024
67ec2b0
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Dec 12, 2024
57293b1
proposal related
XianBW Dec 12, 2024
3a746fa
hypothesis2experiment base
XianBW Dec 13, 2024
68e4c1f
only hypothesis gen and task gen
XianBW Dec 13, 2024
a570408
proposal related
XianBW Dec 13, 2024
3324601
exp_gen base code
XianBW Dec 13, 2024
8d73ea3
dependency_codes inject
XianBW Dec 13, 2024
00526ff
proposal completed(not test)
XianBW Dec 13, 2024
6980e1d
rewrite ds model evaluate
TPLin22 Dec 13, 2024
1b051e3
fix data bug
XianBW Dec 13, 2024
db09a45
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Dec 13, 2024
a37071e
small code refinement on conf and other
peteryang1 Dec 13, 2024
f2ed789
load data in ds model
TPLin22 Dec 13, 2024
9bbf9a3
fix some bugs
WinstonLiyt Dec 13, 2024
93b6656
a debug llm tool app
XianBW Dec 15, 2024
b283cb1
model task base_code added
XianBW Dec 16, 2024
1728a90
return code dict in ds model evolvingstrategy
TPLin22 Dec 16, 2024
a23731f
fix ds_scen description_template
WinstonLiyt Dec 16, 2024
1178618
redundent prompts.yaml
XianBW Dec 16, 2024
505cc4f
fix some bugs
TPLin22 Dec 16, 2024
9fba70c
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Dec 16, 2024
f18478d
feature test change
XianBW Dec 16, 2024
a108072
fix a bug
WinstonLiyt Dec 16, 2024
636a1d5
fix prompts.yaml
XianBW Dec 16, 2024
0c3881c
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Dec 16, 2024
d7b6ca4
remove feature test local path
XianBW Dec 16, 2024
5c4490f
refine the structure of scene
WinstonLiyt Dec 16, 2024
6cf001e
fix some bugs
WinstonLiyt Dec 16, 2024
e1abb6f
exp_gen change
XianBW Dec 17, 2024
f7dff55
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Dec 17, 2024
1379654
exp_gen change
XianBW Dec 17, 2024
8d1eca9
spec & workspace changes
XianBW Dec 17, 2024
8ccd32f
init for workflow
TPLin22 Dec 17, 2024
2e2d153
fix
TPLin22 Dec 17, 2024
6926e18
ds model fit for spec & workspace change
TPLin22 Dec 17, 2024
3118c67
improve data_loader_spec
WinstonLiyt Dec 17, 2024
b630f79
ds model eval for more cases
TPLin22 Dec 17, 2024
84cf4d4
refine prompts
WinstonLiyt Dec 17, 2024
81a427a
spec change
XianBW Dec 17, 2024
7fae309
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Dec 17, 2024
b7aad31
spell check
qew21 Dec 18, 2024
12f1217
refine ds modal for more cases: eval and es
TPLin22 Dec 18, 2024
e86aa73
update model template
TPLin22 Dec 18, 2024
cf5e18c
prompts for model and ensemble
WinstonLiyt Dec 18, 2024
81b27b4
fix a bug
WinstonLiyt Dec 18, 2024
dc8f71c
fix a bug
WinstonLiyt Dec 18, 2024
b6acea3
init: ds workflow evovingstrategy
TPLin22 Dec 18, 2024
7f70ce2
Adding ensemble (#505)
xisen-w Dec 18, 2024
62dbcf5
data science loop changes
XianBW Dec 18, 2024
3e240f3
merge pull
XianBW Dec 18, 2024
13fae9a
data science loop base
XianBW Dec 18, 2024
999d133
ds loop feedback
XianBW Dec 19, 2024
b6241cd
fix
XianBW Dec 19, 2024
7e2874f
remove measure_time because it's duplicated (in LoopBase)
XianBW Dec 19, 2024
3335406
add the knowledge query for data_loader & feature
WinstonLiyt Dec 19, 2024
26da5c4
edit ds workflow evaluator
TPLin22 Dec 19, 2024
00ad54e
data_loader bug fix
XianBW Dec 19, 2024
35a1db9
stop evolving when all tasks completed
XianBW Dec 19, 2024
f96f9a2
llm app change
XianBW Dec 20, 2024
a0a3db5
fix break all complete strategy
peteryang1 Dec 20, 2024
74a2829
Adding queried knowledge (#508)
xisen-w Dec 20, 2024
737bdb9
fix loop bug
XianBW Dec 20, 2024
6234966
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Dec 20, 2024
ab41352
ds workflow evaluator; test; refine prompts
TPLin22 Dec 20, 2024
c2ed6e1
workflow spec
WinstonLiyt Dec 20, 2024
02ddf81
fix ci
WinstonLiyt Dec 20, 2024
bfa455a
feature task changes
XianBW Dec 20, 2024
61f0cb8
ds loop change
XianBW Dec 23, 2024
251688b
fix a bug in feat
WinstonLiyt Dec 23, 2024
438a569
add query knowledge for model and workflow
WinstonLiyt Dec 23, 2024
6497957
llm_debug info(for show) using pickle instead of json
XianBW Dec 23, 2024
4b7f4f2
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Dec 23, 2024
3920a5c
remove NextLoopException
peteryang1 Dec 23, 2024
e8a85a6
loop change
XianBW Dec 23, 2024
3114f78
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Dec 23, 2024
5845173
coder raise CoderError when all sub_tasks failed
XianBW Dec 23, 2024
3db73f0
rename code_dict to file_dict in FBWorkspace
XianBW Dec 23, 2024
7f85fdf
add CoSTEER unittest
XianBW Dec 23, 2024
9009b73
now show self.version in Task.get_task_information(), simplify CoSTEE…
XianBW Dec 23, 2024
39abb25
remove some properties in ModelTask, add model_type in it.
XianBW Dec 23, 2024
a6505d1
fix llm app bug
XianBW Dec 24, 2024
87dea18
llm web app bug fix
XianBW Dec 24, 2024
d2d88d9
ds loop bug fix
XianBW Dec 24, 2024
e8c2d6c
fix: give component code to feature&ens eval
XianBW Dec 24, 2024
0722d77
loop catch error bug
XianBW Dec 25, 2024
b53e03e
rename load_from_raw_data to load_data
XianBW Dec 25, 2024
01ad2e9
feat: Add debug data creation functionality for data science scenarios
you-n-g Dec 25, 2024
db1455b
support local folder (#511)
qew21 Dec 25, 2024
12a27ec
update sample data script
qew21 Dec 25, 2024
de2825f
make sure frac < 1
qew21 Dec 25, 2024
9a4ba5f
fix a bug
WinstonLiyt Dec 25, 2024
75b40d8
feature spec changes
XianBW Dec 25, 2024
84f0fb8
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Dec 25, 2024
e8f2410
fix
XianBW Dec 25, 2024
418c2ce
changeimport order
qew21 Dec 26, 2024
fe10da7
clear unnecessary std outputs
WinstonLiyt Dec 26, 2024
a4e3ced
fix a typo
WinstonLiyt Dec 26, 2024
e009fd7
create sample folder after unzip kaggle data
qew21 Dec 26, 2024
36d26ee
feature/model test script update
XianBW Dec 26, 2024
08df71a
Align the data types across modules.
WinstonLiyt Dec 26, 2024
c02fd79
fix a bug in model eval
WinstonLiyt Dec 26, 2024
d3e3f60
show line number
XianBW Dec 26, 2024
fa21c04
move sample entry point to app
qew21 Dec 26, 2024
6682711
spec & model prompt changes
XianBW Dec 27, 2024
f8113b2
Refine the competition specification to address the data type problem…
WinstonLiyt Dec 27, 2024
36b4191
fix some bugs
WinstonLiyt Dec 27, 2024
34aa750
add file filter in FBworkspace.code property
XianBW Dec 27, 2024
d30ff40
support non-binary prediction
qew21 Dec 27, 2024
72bfa90
avoid too much warnings
qew21 Dec 27, 2024
d8b5a4c
fix a bug in ensemble module
WinstonLiyt Dec 30, 2024
ea39d9f
filtered the knowledge query in all modules
WinstonLiyt Dec 30, 2024
ed305c1
delete RAG in idea proposal
WinstonLiyt Dec 30, 2024
d9d29b3
refine the code in ensemble
WinstonLiyt Dec 30, 2024
593854c
show exp workspace in llm_st
XianBW Dec 30, 2024
77d8b8b
exp_gen bug fix
XianBW Dec 30, 2024
6eb92ab
feedback bug fix
XianBW Dec 30, 2024
ab2aab4
use `feature` instead of `feat01`
XianBW Dec 30, 2024
7ef6a0e
Trace & method of judging if exp is completed change
XianBW Dec 31, 2024
5da62ac
fix a bug in package calling and execute ci
WinstonLiyt Jan 2, 2025
0d59c6c
fix code
qew21 Jan 2, 2025
716ba1d
bug fix
XianBW Jan 2, 2025
2e014f9
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Jan 2, 2025
7c046cf
bug fix
XianBW Jan 2, 2025
f722e6c
fix a bug
WinstonLiyt Jan 2, 2025
d44942d
fix some bugs
WinstonLiyt Jan 2, 2025
4e9aff3
fix a bug
WinstonLiyt Jan 2, 2025
288777d
refactor: Enhance error handling and feedback in data science loop
you-n-g Jan 2, 2025
5e2adfa
support different use_azure on chat and embedding models
peteryang1 Jan 2, 2025
2abdb96
multi-model proposal logic
WinstonLiyt Jan 2, 2025
e4af411
fix a small syntax error
peteryang1 Jan 2, 2025
f97092f
loopBase and some changes
XianBW Jan 2, 2025
89b7bec
merge pull
XianBW Jan 2, 2025
386fffa
ensemble scores change
XianBW Jan 2, 2025
749c6ca
fbworkspace.code -> .all_codes
XianBW Jan 2, 2025
f159b11
use all model codes in workflow coder
XianBW Jan 2, 2025
3b22e7c
check scores.csv's keys(model_names)
XianBW Jan 2, 2025
f4b1dd2
model name changes
XianBW Jan 2, 2025
ff710d6
add a todo in ensemble test
XianBW Jan 2, 2025
aac5349
sota_exp changes
XianBW Jan 2, 2025
07a3ef7
give model info in exp gen
XianBW Jan 2, 2025
f6b55f6
add runner time limit
XianBW Jan 3, 2025
9f4c84d
config using debug data or not in evals
XianBW Jan 3, 2025
f51b35e
exp to feedback base
XianBW Jan 3, 2025
3b2f15c
add feature code when writing model task
XianBW Jan 3, 2025
9240820
small problem
XianBW Jan 3, 2025
82d0635
copying during sampling
qew21 Jan 3, 2025
19e9b4c
update
peteryang1 Jan 3, 2025
3e945d2
Merge branch 'xuyang1/several_small_code_update_to_ds_refactor' into …
peteryang1 Jan 3, 2025
1fb000f
refactor: Simplify code handling and improve workspace management
you-n-g Jan 3, 2025
615d3b5
model part output fix
XianBW Jan 3, 2025
86180bf
print model's execution time
qew21 Jan 3, 2025
cfda303
bug fix
XianBW Jan 6, 2025
2ddcb24
ensemble test fix
XianBW Jan 6, 2025
28576df
ens small change
XianBW Jan 6, 2025
271e5f1
ens_test bug fix
XianBW Jan 6, 2025
8238802
Refine partial expansion logic to display only a few subfolders when …
WinstonLiyt Jan 6, 2025
43f8c1f
several update on prompts
peteryang1 Jan 6, 2025
3fd376c
Merge branch 'xuyang1/several_update_on_prompts' into ds_refactor
peteryang1 Jan 6, 2025
89000dd
Merge branch 'ds_refactor' into MM
XianBW Jan 6, 2025
1f5ce9a
sample subfolders
qew21 Jan 6, 2025
9900495
Filter the stdout after code execution to remove irrelevant informati…
WinstonLiyt Jan 6, 2025
6c23e7d
Add some more prompts and comments
you-n-g Jan 6, 2025
9295094
several update on the first init rounds
peteryang1 Jan 6, 2025
488a7eb
Merge branch 'xuyang1/several_new_updates' into ds_refactor
peteryang1 Jan 6, 2025
edeb337
model timeout as error
qew21 Jan 7, 2025
5e4f544
fix pattern of getting model codes in workspace
XianBW Jan 7, 2025
d657834
small bux fix on model prompts
peteryang1 Jan 7, 2025
0e9f0e2
Merge branch 'xuyang1/small_update_on_model_prompts' into ds_refactor
peteryang1 Jan 7, 2025
eb89153
remove get_code_with_key since we have regex pattern
peteryang1 Jan 7, 2025
9d27fe7
fix: Correct tqdm progress bar update logic in LoopBase class
you-n-g Jan 7, 2025
0e671ab
feat: Add diff generation and enhance feedback mechanism in data scie…
you-n-g Jan 7, 2025
a300ae4
update some fix to model and workflow prompts
peteryang1 Jan 7, 2025
684ca66
Merge branch 'xuyang1/several_update_on_model_and_workflow_prompt' in…
peteryang1 Jan 7, 2025
84d4891
refine the logic of progress bar filter
WinstonLiyt Jan 7, 2025
3e58cb5
add last_successful_exp in exp_gen
peteryang1 Jan 7, 2025
301b0c0
fix a one line bug
peteryang1 Jan 7, 2025
29e7149
add a hint in prompt
peteryang1 Jan 7, 2025
7e3d774
fix data sample for bms
qew21 Jan 7, 2025
67fdceb
fix data sample for bms
qew21 Jan 7, 2025
b35d7fb
hypothesis small fix
qew21 Jan 7, 2025
c86bbd9
crawler readme update
XianBW Jan 8, 2025
2018bc8
fix component gen
qew21 Jan 8, 2025
40faf5a
fix bug
XianBW Jan 8, 2025
4c1d8b6
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Jan 8, 2025
7cbf7a3
annotation change
XianBW Jan 8, 2025
3ce8242
load description.md if it exists
peteryang1 Jan 8, 2025
ae29d1d
refactor: Simplify SOTA description handling in feedback and prompts
you-n-g Jan 8, 2025
8492cd4
refactor: Use shared templates for feedback and experiment descriptions
you-n-g Jan 8, 2025
06515e8
change webapp for model codes changes
XianBW Jan 8, 2025
87cda02
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Jan 8, 2025
71506af
update proposal
qew21 Jan 8, 2025
0f4073e
add timeout message for docker run output
XianBW Jan 8, 2025
b9b9c76
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Jan 8, 2025
797acd5
fix
XianBW Jan 8, 2025
56d57ac
refine the code in docker time processing
WinstonLiyt Jan 8, 2025
6d8f476
use .shape instead of len() when do shape eval
XianBW Jan 8, 2025
02004b9
won't change size during iteration
qew21 Jan 8, 2025
ac145dc
support bson sample
qew21 Jan 8, 2025
c88a23d
sample support jsonl and bson
qew21 Jan 9, 2025
1895846
add former_code to coder prompts
peteryang1 Jan 9, 2025
0d17ad9
a little speed us in debug data creating
peteryang1 Jan 9, 2025
7adb539
filter progress bar when eval ens and main
XianBW Jan 9, 2025
cc2d18b
Merge commit 'af6af11' into HEAD
you-n-g Jan 9, 2025
262e242
avoid costeer makes no change to former code
peteryang1 Jan 9, 2025
462982a
fix several log error
peteryang1 Jan 9, 2025
10de120
add timeout judge threshold
XianBW Jan 9, 2025
e096bec
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Jan 9, 2025
c1c9f93
fix some bugs in the evaluation of component output shapes
WinstonLiyt Jan 9, 2025
fdbb4b8
File structure for supporting litellm (#517)
YeewahChan Jan 9, 2025
0c919ed
ignore submission and show processing
qew21 Jan 9, 2025
3096cf5
ignore submission and show processing
qew21 Jan 9, 2025
c9ef301
add efficiency notice
peteryang1 Jan 9, 2025
376d840
refactor: Enhance error message with detailed feedback summary
you-n-g Jan 9, 2025
814b06e
refactor: Simplify component handling in DSExpGen class
you-n-g Jan 9, 2025
da60a21
refactor: Update code structure and add docstring for clarity
you-n-g Jan 9, 2025
c2b41fa
Merge branch 'ds_refactor' of github.com:microsoft/RD-Agent into ds_r…
XianBW Jan 10, 2025
fe88b07
reserve one sample to each label in data sampling
peteryang1 Jan 10, 2025
a26b80e
add Evaluation info
qew21 Jan 10, 2025
091d687
refine costeer code to avoid giving same code twice
peteryang1 Jan 10, 2025
c58d5f6
use raw_description as plain text
peteryang1 Jan 10, 2025
9259839
add a prompt hint to avoid same dict key
peteryang1 Jan 10, 2025
cbadaa5
model task name bug in first model exp gen
XianBW Jan 10, 2025
dc01b8a
Merge branches 'ds_refactor' and 'ds_refactor' of github.com:microsof…
XianBW Jan 10, 2025
059f36a
fix a typo
peteryang1 Jan 10, 2025
13ae2ae
add some debug info in costeer tests
peteryang1 Jan 10, 2025
2 changes: 2 additions & 0 deletions .gitignore
@@ -4,6 +4,7 @@
 Pipfile
 public
 release-notes.md
+typescript

 # Byte-compiled / optimized / DLL files
 __pycache__/
@@ -170,3 +171,4 @@ mlruns/
 # shell script
 *.out
 *.sh
+.aider*
2 changes: 1 addition & 1 deletion rdagent/app/data_mining/conf.py
@@ -23,7 +23,7 @@ class MedBasePropSetting(BasePropSetting):
     runner: str = "rdagent.scenarios.data_mining.developer.model_runner.DMModelRunner"
     """Runner class"""

-    summarizer: str = "rdagent.scenarios.data_mining.developer.feedback.DMModelHypothesisExperiment2Feedback"
+    summarizer: str = "rdagent.scenarios.data_mining.developer.feedback.DMModelExperiment2Feedback"
     """Summarizer class"""

     evolving_n: int = 10
48 changes: 48 additions & 0 deletions rdagent/app/data_science/conf.py
@@ -0,0 +1,48 @@
+from rdagent.app.kaggle.conf import KaggleBasePropSetting
+from rdagent.core.conf import ExtendedSettingsConfigDict
+
+
+class DataScienceBasePropSetting(KaggleBasePropSetting):
+    model_config = ExtendedSettingsConfigDict(env_prefix="DS_", protected_namespaces=())
+
+    # Main components
+    ## Scen
+    scen: str = "rdagent.scenarios.data_science.scen.KaggleScen"
+    """Scenario class for data mining model"""
+
+    ## proposal
+    exp_gen: str = "rdagent.scenarios.data_science.proposal.exp_gen.DSExpGen"
+
+    # the two below should be used in ExpGen
+    # hypothesis_gen: str = "rdagent.scenarios.kaggle.proposal.proposal.KGHypothesisGen"
+    # """Hypothesis generation class"""
+    #
+    # hypothesis2experiment: str = "rdagent.scenarios.kaggle.proposal.proposal.KGHypothesis2Experiment"
+    # """Hypothesis to experiment class"""
+
+    ## dev/coder
+    data_loader_coder: str = "rdagent.components.coder.data_science.raw_data_loader.DataLoaderCoSTEER"
+    """Data Loader CoSTEER"""
+
+    # feature_coder: str = "rdagent.scenarios.kaggle.developer.coder.KGFactorCoSTEER"
+    # """Feature Coder class"""
+
+    # model_feature_selection_coder: str = "rdagent.scenarios.kaggle.developer.coder.KGModelFeatureSelectionCoder"
+    # """Model Feature Selection Coder class"""
+
+    # model_coder: str = "rdagent.scenarios.kaggle.developer.coder.KGModelCoSTEER"
+    # """Model Coder class"""
+
+    ## dev/runner
+    feature_runner: str = "rdagent.scenarios.kaggle.developer.runner.KGFactorRunner"
+    """Feature Runner class"""
+
+    model_runner: str = "rdagent.scenarios.kaggle.developer.runner.KGModelRunner"
+    """Model Runner class"""
+
+    ## feedback
+    summarizer: str = "rdagent.scenarios.kaggle.developer.feedback.KGExperiment2Feedback"
+    """Summarizer class"""
+
+
+DS_RD_SETTING = DataScienceBasePropSetting()
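Note on `env_prefix="DS_"`: the new settings class inherits every field from `KaggleBasePropSetting` but switches the prefix from `KG_` to `DS_`, which suggests that overrides are expected to arrive as environment variables such as `DS_LOCAL_DATA_PATH`. The stdlib-only sketch below illustrates that idea under stated assumptions: `PrefixedSettings` and `load_settings` are hypothetical names, only three fields are mirrored, and the real `ExtendedSettingsConfigDict` may resolve variables differently.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping


@dataclass
class PrefixedSettings:
    """Hypothetical stand-in for a settings class declared with env_prefix="DS_"."""

    scen: str = "rdagent.scenarios.data_science.scen.KaggleScen"
    local_data_path: str = ""
    auto_submit: bool = False


def load_settings(env: Mapping[str, str] = os.environ, prefix: str = "DS_") -> PrefixedSettings:
    """Build settings, letting PREFIX + upper-cased field name override a default."""
    overrides = {}
    for field in fields(PrefixedSettings):
        raw = env.get(prefix + field.name.upper())
        if raw is None:
            continue  # no override present: keep the class default
        if isinstance(field.default, bool):
            # Assumption of this sketch: a few common truthy spellings are accepted.
            overrides[field.name] = raw.strip().lower() in ("1", "true", "yes")
        else:
            overrides[field.name] = raw
    return PrefixedSettings(**overrides)


# A KG_-prefixed variable is ignored here because only the DS_ prefix is consulted.
demo_env = {"DS_LOCAL_DATA_PATH": "/data/kaggle", "KG_AUTO_SUBMIT": "true"}
settings = load_settings(demo_env)
print(settings.local_data_path)  # /data/kaggle
print(settings.auto_submit)      # False
```

If the real settings behave like this sketch, existing `KG_*` variables would not carry over to the data science loop and would need `DS_*` counterparts.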
6 changes: 6 additions & 0 deletions rdagent/app/data_science/debug.py
@@ -0,0 +1,6 @@
+import fire
+
+from rdagent.scenarios.data_science.debug.data import create_debug_data
+
+if __name__ == "__main__":
+    fire.Fire(create_debug_data)
151 changes: 151 additions & 0 deletions rdagent/app/data_science/loop.py
@@ -0,0 +1,151 @@
+from pathlib import Path
+from typing import Any
+
+import fire
+
+from rdagent.app.data_science.conf import DS_RD_SETTING
+from rdagent.components.coder.data_science.ensemble import EnsembleCoSTEER
+from rdagent.components.coder.data_science.feature import FeatureCoSTEER
+from rdagent.components.coder.data_science.model import ModelCoSTEER
+from rdagent.components.coder.data_science.raw_data_loader import DataLoaderCoSTEER
+from rdagent.components.coder.data_science.workflow import WorkflowCoSTEER
+from rdagent.components.workflow.conf import BasePropSetting
+from rdagent.components.workflow.rd_loop import RDLoop
+from rdagent.core.exception import CoderError, RunnerError
+from rdagent.core.proposal import ExperimentFeedback, HypothesisFeedback
+from rdagent.core.scenario import Scenario
+from rdagent.core.utils import import_class
+from rdagent.log import rdagent_logger as logger
+from rdagent.scenarios.data_science.dev.feedback import DSExperiment2Feedback
+from rdagent.scenarios.data_science.dev.runner import DSRunner
+from rdagent.scenarios.data_science.experiment.experiment import DSExperiment
+from rdagent.scenarios.data_science.proposal.exp_gen import DSExpGen, DSTrace
+from rdagent.scenarios.kaggle.kaggle_crawler import download_data
+
+
+class DataScienceRDLoop(RDLoop):
+    skip_loop_error = (CoderError, RunnerError)
+
+    def __init__(self, PROP_SETTING: BasePropSetting):
+        scen: Scenario = import_class(PROP_SETTING.scen)(PROP_SETTING.competition)
+
+        ### shared components in the workflow  # TODO: check if
+        knowledge_base = (
+            import_class(PROP_SETTING.knowledge_base)(PROP_SETTING.knowledge_base_path, scen)
+            if PROP_SETTING.knowledge_base != ""
+            else None
+        )
+
+        # 1) task generation from scratch
+        # self.scratch_gen: tuple[HypothesisGen, Hypothesis2Experiment] = DummyHypothesisGen(scen),
+
+        # 2) task generation from a complete solution
+        # self.exp_gen: ExpGen = import_class(PROP_SETTING.exp_gen)(scen)
+        self.exp_gen = DSExpGen(scen)
+        self.data_loader_coder = DataLoaderCoSTEER(scen)
+        self.feature_coder = FeatureCoSTEER(scen)
+        self.model_coder = ModelCoSTEER(scen)
+        self.ensemble_coder = EnsembleCoSTEER(scen)
+        self.workflow_coder = WorkflowCoSTEER(scen)
+
+        self.runner = DSRunner(scen)
+        # self.summarizer: Experiment2Feedback = import_class(PROP_SETTING.summarizer)(scen)
+        # logger.log_object(self.summarizer, tag="summarizer")
+
+        # self.trace = KGTrace(scen=scen, knowledge_base=knowledge_base)
+        self.trace = DSTrace(scen=scen)
+        self.summarizer = DSExperiment2Feedback(scen)
+        super(RDLoop, self).__init__()
+
+    def direct_exp_gen(self, prev_out: dict[str, Any]):
+        exp = self.exp_gen.gen(self.trace)
+        logger.log_object(exp, tag="debug_exp_gen")
+        return exp
+
+    def coding(self, prev_out: dict[str, Any]):
+        exp: DSExperiment = prev_out["direct_exp_gen"]
+        if exp.hypothesis.component == "DataLoadSpec":
+            exp = self.data_loader_coder.develop(exp)
+        elif exp.hypothesis.component == "FeatureEng":
+            exp = self.feature_coder.develop(exp)
+        elif exp.hypothesis.component == "Model":
+            exp = self.model_coder.develop(exp)
+        elif exp.hypothesis.component == "Ensemble":
+            exp = self.ensemble_coder.develop(exp)
+        elif exp.hypothesis.component == "Workflow":
+            exp = self.workflow_coder.develop(exp)
+        else:
+            raise NotImplementedError(f"Unsupported component in DataScienceRDLoop: {exp.hypothesis.component}")
+
+        return exp
+
+    def running(self, prev_out: dict[str, Any]):
+        exp: DSExperiment = prev_out["coding"]
+        if exp.next_component_required() is None:
+            return self.runner.develop(exp)
+        else:
+            return exp
+
+    def feedback(self, prev_out: dict[str, Any]) -> ExperimentFeedback:
+        exp: DSExperiment = prev_out["running"]
+        if exp.next_component_required() is None:
+            feedback = self.summarizer.generate_feedback(exp, self.trace)
+        else:
+            feedback = ExperimentFeedback(
+                reason=f"{exp.hypothesis.component} is completed.",
+                decision=True,
+            )
+        return feedback
+
+    def record(self, prev_out: dict[str, Any]):
+        e = prev_out.get(self.EXCEPTION_KEY, None)
+        if e is None:
+            self.trace.hist.append((prev_out["running"], prev_out["feedback"]))
+        else:
+            self.trace.hist.append(
+                (
+                    prev_out["direct_exp_gen"] if isinstance(e, CoderError) else prev_out["coding"],
+                    ExperimentFeedback.from_exception(e)
+                )
+            )
+
+
+def main(path=None, step_n=None, competition="bms-molecular-translation"):
+    """
+
+    Parameters
+    ----------
+    path :
+        path like `$LOG_PATH/__session__/1/0_propose`. It indicates that we restore the state that after finish the step 0 in loop1
+    step_n :
+        How many steps to run; if None, it will run forever until error or KeyboardInterrupt
+    competition :
+
+
+    Auto R&D Evolving loop for models in a kaggle{} scenario.
+    You can continue running session by
+    .. code-block:: bash
+        dotenv run -- python rdagent/app/data_science/loop.py [--competition titanic] $LOG_PATH/__session__/1/0_propose --step_n 1 # `step_n` is a optional parameter
+        rdagent kaggle --competition playground-series-s4e8 # You are encouraged to use this one.
+    """
+    if competition is not None:
+        DS_RD_SETTING.competition = competition
+
+    if DS_RD_SETTING.competition:
+        if DS_RD_SETTING.scen.endswith("KaggleScen"):
+            download_data(competition=DS_RD_SETTING.competition, settings=DS_RD_SETTING)
+        else:
+            if not Path(f"{DS_RD_SETTING.local_data_path}/{competition}").exists():
+                logger.error(f"Please prepare data for competition {competition} first.")
+                return
+    else:
+        logger.error("Please specify competition name.")
+    if path is None:
+        kaggle_loop = DataScienceRDLoop(DS_RD_SETTING)
+    else:
+        kaggle_loop = DataScienceRDLoop.load(path)
+    kaggle_loop.run(step_n=step_n)
+
+
+if __name__ == "__main__":
+    fire.Fire(main)
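Note on the step pipeline: `record` reads `prev_out.get(self.EXCEPTION_KEY)`, and `skip_loop_error` lists `CoderError` and `RunnerError`, so the loop base presumably runs the steps in order, stores each result in `prev_out` under the step name, and jumps to `record` when a skippable error is raised. The sketch below models that reading only; `run_once`, the value of `EXCEPTION_KEY`, and the local `CoderError` / `RunnerError` classes are stand-ins written for illustration, not the actual `LoopBase` code.

```python
from typing import Any, Callable

EXCEPTION_KEY = "_EXCEPTION"  # hypothetical key; the real constant lives on the loop base


class CoderError(Exception):
    """Local stand-in for rdagent.core.exception.CoderError."""


class RunnerError(Exception):
    """Local stand-in for rdagent.core.exception.RunnerError."""


Step = Callable[[dict[str, Any]], Any]
hist: list[tuple[Any, str]] = []  # plays the role of trace.hist


def record(prev_out: dict[str, Any]) -> None:
    """Append (experiment, feedback), falling back to the last experiment that exists."""
    exc = prev_out.get(EXCEPTION_KEY)
    if exc is None:
        hist.append((prev_out["running"], prev_out["feedback"]))
    else:
        # A coder failure means "coding" never produced output, so use the proposal.
        exp = prev_out["direct_exp_gen"] if isinstance(exc, CoderError) else prev_out["coding"]
        hist.append((exp, f"failed: {exc}"))


def run_once(steps: list[tuple[str, Step]], skip_errors: tuple[type[Exception], ...]) -> dict[str, Any]:
    """Run each step with the outputs of earlier steps, then always record."""
    prev_out: dict[str, Any] = {}
    for name, step in steps:
        try:
            prev_out[name] = step(prev_out)
        except skip_errors as exc:
            prev_out[EXCEPTION_KEY] = exc  # remember why the loop was cut short
            break
    record(prev_out)
    return prev_out


def failing_coder(prev_out: dict[str, Any]) -> Any:
    raise CoderError("all sub-tasks failed")


steps: list[tuple[str, Step]] = [
    ("direct_exp_gen", lambda prev_out: "exp-v0"),
    ("coding", failing_coder),
    ("running", lambda prev_out: prev_out["coding"]),
    ("feedback", lambda prev_out: "ok"),
]
out = run_once(steps, (CoderError, RunnerError))
print(hist)  # [('exp-v0', 'failed: all sub-tasks failed')]
```

Under this reading, a failed coding step still leaves one entry in the trace, so later proposals can see that the attempt was made and why it stopped.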
28 changes: 12 additions & 16 deletions rdagent/app/kaggle/conf.py
@@ -1,8 +1,7 @@
-from rdagent.components.workflow.conf import BasePropSetting
-from rdagent.core.conf import ExtendedSettingsConfigDict
+from rdagent.core.conf import ExtendedBaseSettings, ExtendedSettingsConfigDict


-class KaggleBasePropSetting(BasePropSetting):
+class KaggleBasePropSetting(ExtendedBaseSettings):
     model_config = ExtendedSettingsConfigDict(env_prefix="KG_", protected_namespaces=())

     # 1) overriding the default
@@ -30,7 +29,7 @@ class KaggleBasePropSetting(BasePropSetting):
     model_runner: str = "rdagent.scenarios.kaggle.developer.runner.KGModelRunner"
     """Model Runner class"""

-    summarizer: str = "rdagent.scenarios.kaggle.developer.feedback.KGHypothesisExperiment2Feedback"
+    summarizer: str = "rdagent.scenarios.kaggle.developer.feedback.KGExperiment2Feedback"
     """Summarizer class"""

     evolving_n: int = 10
@@ -45,12 +44,21 @@ class KaggleBasePropSetting(BasePropSetting):
     local_data_path: str = ""
     """Folder storing Kaggle competition data"""

+    if_using_mle_data: bool = False
+    auto_submit: bool = False
+    """Automatically upload and submit each experiment result to Kaggle platform"""
+    # Conditionally set the knowledge_base based on the use of graph RAG
+    knowledge_base: str = ""
+    """Knowledge base class, uses 'KGKnowledgeGraph' when advanced graph-based RAG is enabled, otherwise empty."""
     if_action_choosing_based_on_UCB: bool = False
     """Enable decision mechanism based on UCB algorithm"""

     domain_knowledge_path: str = "/data/userdata/share/kaggle/domain_knowledge"
     """Folder storing domain knowledge files in .case format"""

+    knowledge_base_path: str = "kg_graph.pkl"
+    """Advanced version of graph-based RAG"""
+
     rag_path: str = "git_ignore_folder/kaggle_vector_base.pkl"
     """Base version of vector-based RAG"""

@@ -60,20 +68,8 @@ class KaggleBasePropSetting(BasePropSetting):
     if_using_graph_rag: bool = False
     """Enable advanced graph-based RAG"""

-    # Conditionally set the knowledge_base based on the use of graph RAG
-    knowledge_base: str = ""
-    """Knowledge base class, uses 'KGKnowledgeGraph' when advanced graph-based RAG is enabled, otherwise empty."""
-
-    knowledge_base_path: str = "kg_graph.pkl"
-    """Advanced version of graph-based RAG"""
-
-    auto_submit: bool = False
-    """Automatically upload and submit each experiment result to Kaggle platform"""
-
     mini_case: bool = False
     """Enable mini-case study for experiments"""
-
-    if_using_mle_data: bool = False


 KAGGLE_IMPLEMENT_SETTING = KaggleBasePropSetting()
28 changes: 15 additions & 13 deletions rdagent/app/kaggle/loop.py
@@ -9,14 +9,13 @@
 from rdagent.core.developer import Developer
 from rdagent.core.exception import FactorEmptyError, ModelEmptyError
 from rdagent.core.proposal import (
+    Experiment2Feedback,
     Hypothesis2Experiment,
-    HypothesisExperiment2Feedback,
     HypothesisGen,
 )
 from rdagent.core.scenario import Scenario
 from rdagent.core.utils import import_class
 from rdagent.log import rdagent_logger as logger
-from rdagent.log.time import measure_time
 from rdagent.scenarios.kaggle.experiment.scenario import (
     KG_ACTION_FEATURE_ENGINEERING,
     KG_ACTION_FEATURE_PROCESSING,
@@ -28,7 +27,6 @@


 class KaggleRDLoop(RDLoop):
-    @measure_time
     def __init__(self, PROP_SETTING: BasePropSetting):
         with logger.tag("init"):
             scen: Scenario = import_class(PROP_SETTING.scen)(PROP_SETTING.competition)
@@ -55,27 +53,31 @@ def __init__(self, PROP_SETTING: BasePropSetting):
             logger.log_object(self.feature_runner, tag="feature runner")
             self.model_runner: Developer = import_class(PROP_SETTING.model_runner)(scen)
             logger.log_object(self.model_runner, tag="model runner")
-            self.summarizer: HypothesisExperiment2Feedback = import_class(PROP_SETTING.summarizer)(scen)
+            self.summarizer: Experiment2Feedback = import_class(PROP_SETTING.summarizer)(scen)
             logger.log_object(self.summarizer, tag="summarizer")
             self.trace = KGTrace(scen=scen, knowledge_base=knowledge_base)
             super(RDLoop, self).__init__()

-    @measure_time
     def coding(self, prev_out: dict[str, Any]):
         with logger.tag("d"): # develop
-            if prev_out["propose"].action in [KG_ACTION_FEATURE_ENGINEERING, KG_ACTION_FEATURE_PROCESSING]:
-                exp = self.feature_coder.develop(prev_out["exp_gen"])
-            elif prev_out["propose"].action == KG_ACTION_MODEL_FEATURE_SELECTION:
-                exp = self.model_feature_selection_coder.develop(prev_out["exp_gen"])
+            if prev_out["direct_exp_gen"]["propose"].action in [
+                KG_ACTION_FEATURE_ENGINEERING,
+                KG_ACTION_FEATURE_PROCESSING,
+            ]:
+                exp = self.feature_coder.develop(prev_out["direct_exp_gen"]["exp_gen"])
+            elif prev_out["direct_exp_gen"]["propose"].action == KG_ACTION_MODEL_FEATURE_SELECTION:
+                exp = self.model_feature_selection_coder.develop(prev_out["direct_exp_gen"]["exp_gen"])
             else:
-                exp = self.model_coder.develop(prev_out["exp_gen"])
+                exp = self.model_coder.develop(prev_out["direct_exp_gen"]["exp_gen"])
             logger.log_object(exp.sub_workspace_list, tag="coder result")
         return exp

-    @measure_time
     def running(self, prev_out: dict[str, Any]):
         with logger.tag("ef"): # evaluate and feedback
-            if prev_out["propose"].action in [KG_ACTION_FEATURE_ENGINEERING, KG_ACTION_FEATURE_PROCESSING]:
+            if prev_out["direct_exp_gen"]["propose"].action in [
+                KG_ACTION_FEATURE_ENGINEERING,
+                KG_ACTION_FEATURE_PROCESSING,
+            ]:
                 exp = self.feature_runner.develop(prev_out["coding"])
             else:
                 exp = self.model_runner.develop(prev_out["coding"])
@@ -126,7 +128,7 @@ def main(path=None, step_n=None, competition=None):
     """
     if competition:
         KAGGLE_IMPLEMENT_SETTING.competition = competition
-        download_data(competition=competition, local_path=KAGGLE_IMPLEMENT_SETTING.local_data_path)
+        download_data(competition=competition, settings=KAGGLE_IMPLEMENT_SETTING)
         if KAGGLE_IMPLEMENT_SETTING.if_using_graph_rag:
             KAGGLE_IMPLEMENT_SETTING.knowledge_base = (
                 "rdagent.scenarios.kaggle.knowledge_management.graph.KGKnowledgeGraph"
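Note on the `prev_out` shape change: the Kaggle loop now reads `prev_out["direct_exp_gen"]["propose"]` and `prev_out["direct_exp_gen"]["exp_gen"]` instead of the top-level `"propose"` and `"exp_gen"` keys, which implies that a single `direct_exp_gen` step returns both values in one dictionary. A minimal sketch of that shape follows; the `propose` and `exp_gen` helper functions and their string results are hypothetical placeholders, not the project's generators.

```python
from typing import Any


def propose(trace: list[str]) -> str:
    """Hypothetical hypothesis generator."""
    return f"hypothesis-{len(trace)}"


def exp_gen(hypothesis: str) -> str:
    """Hypothetical hypothesis-to-experiment converter."""
    return f"experiment-for-{hypothesis}"


def direct_exp_gen(prev_out: dict[str, Any], trace: list[str]) -> dict[str, Any]:
    """Run both proposal stages and return them under the old step names."""
    hypothesis = propose(trace)
    return {"propose": hypothesis, "exp_gen": exp_gen(hypothesis)}


def coding(prev_out: dict[str, Any]) -> str:
    """Later steps index through the combined step's output."""
    generated = prev_out["direct_exp_gen"]
    return f"coded({generated['exp_gen']}) for {generated['propose']}"


prev_out: dict[str, Any] = {}
prev_out["direct_exp_gen"] = direct_exp_gen(prev_out, trace=[])
print(coding(prev_out))  # coded(experiment-for-hypothesis-0) for hypothesis-0
```

Any other step that still reads `prev_out["propose"]` or `prev_out["exp_gen"]` directly would raise `KeyError` in this layout, so the collapsed parts of the file are worth checking for leftover lookups.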
4 changes: 2 additions & 2 deletions rdagent/app/qlib_rd_loop/conf.py
@@ -21,7 +21,7 @@ class ModelBasePropSetting(BasePropSetting):
     runner: str = "rdagent.scenarios.qlib.developer.model_runner.QlibModelRunner"
     """Runner class"""

-    summarizer: str = "rdagent.scenarios.qlib.developer.feedback.QlibModelHypothesisExperiment2Feedback"
+    summarizer: str = "rdagent.scenarios.qlib.developer.feedback.QlibModelExperiment2Feedback"
     """Summarizer class"""

     evolving_n: int = 10
@@ -47,7 +47,7 @@ class FactorBasePropSetting(BasePropSetting):
     runner: str = "rdagent.scenarios.qlib.developer.factor_runner.QlibFactorRunner"
     """Runner class"""

-    summarizer: str = "rdagent.scenarios.qlib.developer.feedback.QlibFactorHypothesisExperiment2Feedback"
+    summarizer: str = "rdagent.scenarios.qlib.developer.feedback.QlibFactorExperiment2Feedback"
     """Summarizer class"""

     evolving_n: int = 10
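Note on the renamed summarizer paths: these settings hold dotted-path strings that the loops pass to `import_class`, so a class rename such as `...HypothesisExperiment2Feedback` to `...Experiment2Feedback` has to be mirrored in every config string. The sketch below shows one common way such a loader can be written with `importlib`; `load_class` is a hypothetical name, and the project's `import_class` may be implemented differently.

```python
import importlib
from collections import OrderedDict


def load_class(dotted_path: str) -> type:
    """Resolve 'package.module.ClassName' to the class object it names."""
    module_path, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    # getattr raises AttributeError when the module has no such attribute.
    return getattr(module, class_name)


cls = load_class("collections.OrderedDict")
print(cls is OrderedDict)  # True

try:
    load_class("collections.OrderedDictOld")  # simulates a stale class name
except AttributeError as exc:
    print(f"stale path rejected: {exc}")
```

With a loader of this kind, a stale string fails only when the loop is constructed, not when the config module is imported, which is why the rename touches each `conf.py` in the same PR.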
2 changes: 0 additions & 2 deletions rdagent/app/qlib_rd_loop/factor.py
@@ -10,13 +10,11 @@
 from rdagent.components.workflow.rd_loop import RDLoop
 from rdagent.core.exception import FactorEmptyError
 from rdagent.log import rdagent_logger as logger
-from rdagent.log.time import measure_time


 class FactorRDLoop(RDLoop):
     skip_loop_error = (FactorEmptyError,)

-    @measure_time
     def running(self, prev_out: dict[str, Any]):
         with logger.tag("ef"): # evaluate and feedback
             exp = self.runner.develop(prev_out["coding"])
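Note on dropping `@measure_time`: one commit message in this PR states that the decorator is removed "because it's duplicated (in LoopBase)", which suggests that step timing is now handled once in the loop base rather than on each method. The sketch below illustrates that design under stated assumptions; `TimedLoop`, `DemoLoop`, and the `timings` attribute are hypothetical and do not reproduce the real `LoopBase`.

```python
import time
from typing import Any, Callable


class TimedLoop:
    """Hypothetical loop base that times every step in one place."""

    steps: list[str] = []

    def __init__(self) -> None:
        self.timings: dict[str, float] = {}

    def run_once(self) -> dict[str, Any]:
        prev_out: dict[str, Any] = {}
        for name in self.steps:
            step: Callable[[dict[str, Any]], Any] = getattr(self, name)
            start = time.perf_counter()
            prev_out[name] = step(prev_out)
            # Timing is recorded here, so subclasses need no per-method decorator.
            self.timings[name] = time.perf_counter() - start
        return prev_out


class DemoLoop(TimedLoop):
    steps = ["coding", "running"]

    def coding(self, prev_out: dict[str, Any]) -> str:
        return "code"

    def running(self, prev_out: dict[str, Any]) -> str:
        return prev_out["coding"] + "-ran"


loop = DemoLoop()
print(loop.run_once())       # {'coding': 'code', 'running': 'code-ran'}
print(sorted(loop.timings))  # ['coding', 'running']
```

Keeping the measurement in the base class means a newly added step is timed automatically, and no step is measured twice.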