11. Setup Configuration#
11.1. Configuration Overview & Template#
StructSense is configured through environment variables and a YAML config file. Pass the YAML file on the command line, e.g. `--config config/ner_agent.yaml`.
Do not rename these top-level YAML keys:

- `agent_config`
- `task_config`

Do not replace the runtime variables in braces `{}`; they are substituted at run time:

- `{literature}` — input text (e.g., extracted PDF content)
- `{extracted_structured_information}` — extractor output
- `{aligned_structured_information}` — alignment output
- `{judged_structured_information_with_human_feedback}` — judge output
- `{modification_context}`, `{user_feedback_text}` — inputs to the feedback agent
Config Template

A blank template is available in `config_template`.
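For orientation, the overall shape of a config can be sketched as follows. This is an illustrative skeleton with one agent/task pair shown (the actual template lists all four agents and all four tasks); it is not a drop-in replacement for `config_template`:

```yaml
agent_config:
  extractor_agent:
    role: >
      agent role
    goal: >
      goal
    backstory: >
      agent backstory
    llm:
      model: openrouter/openai/gpt-4o-mini
      base_url: https://openrouter.ai/api/v1
  # alignment_agent, judge_agent, humanfeedback_agent follow the same shape

task_config:
  extraction_task:
    description: >
      Extract structured information from the given literature.
      Input: {literature}
    expected_output: >
      Format: JSON
    agent_id: extractor_agent
  # alignment_task, judge_task, humanfeedback_task follow the same shape
```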
11.2. Agent Configuration#
These agent IDs are required and must not be renamed:

- `extractor_agent`
- `alignment_agent`
- `judge_agent`
- `humanfeedback_agent`
Each agent defines four fields: `role`, `goal`, `backstory`, and `llm`.
Example:

```yaml
agent_config:
  extractor_agent:
    role: >
      agent role
    goal: >
      goal
    backstory: >
      agent backstory
    llm:
      model: openrouter/openai/gpt-4o-mini
      base_url: https://openrouter.ai/api/v1
```
11.2.1. Using Ollama (Local Models)#
```yaml
agent_config:
  extractor_agent:
    role: >
      agent role
    goal: >
      goal
    backstory: >
      agent backstory
    llm:
      model: ollama/deepseek-r1:14b
      base_url: http://localhost:11434
```
Run without a paid API key:
```shell
structsense-cli extract \
  --source SOME.pdf \
  --config ner_config_gpt.yaml \
  --env_file .env
```
11.3. Task Configuration#
Required task IDs (do not rename):

- `extraction_task`
- `alignment_task`
- `judge_task`
- `humanfeedback_task`
Each task includes:

- `description` — includes the expected input (e.g., `{literature}`)
- `expected_output` — JSON output format or example
- `agent_id` — must match an agent ID from `agent_config`
Example:

```yaml
task_config:
  extraction_task:
    description: >
      Extract structured information from the given literature.
      Input: {literature}
    expected_output: >
      Format: JSON
      Example: {"entities": [...], "relations": [...]}
    agent_id: extractor_agent
```
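Because the fixed agent/task IDs and the `agent_id` cross-references are easy to get wrong, a config can be sanity-checked before running. The following is a hypothetical helper, not part of the structsense CLI; it operates on an already-parsed config dict (e.g., the result of `yaml.safe_load` on the config file):

```python
# Required IDs per the StructSense docs; do not rename them in the YAML.
REQUIRED_AGENTS = {"extractor_agent", "alignment_agent",
                   "judge_agent", "humanfeedback_agent"}
REQUIRED_TASKS = {"extraction_task", "alignment_task",
                  "judge_task", "humanfeedback_task"}

def validate_config(cfg):
    """Return a list of problems found in a parsed config dict (empty if OK)."""
    problems = []
    agents = cfg.get("agent_config") or {}
    tasks = cfg.get("task_config") or {}
    # Check that every required agent and task ID is present.
    problems += [f"missing agent: {a}" for a in sorted(REQUIRED_AGENTS - agents.keys())]
    problems += [f"missing task: {t}" for t in sorted(REQUIRED_TASKS - tasks.keys())]
    # Check that each task's agent_id points at a defined agent.
    for name in sorted(REQUIRED_TASKS & tasks.keys()):
        agent_id = tasks[name].get("agent_id")
        if agent_id not in agents:
            problems.append(f"{name}: agent_id {agent_id!r} not found in agent_config")
    return problems
```

A config that passes returns an empty list; each missing ID or dangling `agent_id` produces one human-readable problem string.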
11.4. Embeddings & Knowledge#
11.4.1. Embedding Configuration#
```yaml
embedder_config:
  provider: ollama
  config:
    api_base: http://localhost:11434
    model: nomic-embed-text:latest
```
11.4.2. Knowledge Source (Vector DB)#
`WEAVIATE_*` environment variables are optional and are only needed if you enable a knowledge source for schema/ontology lookup.
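If you do enable the knowledge source, the switch lives in the environment. A minimal `.env` fragment, assuming a reachable Weaviate instance (the API key value here is a placeholder):

```shell
# Enable the vector DB knowledge source
ENABLE_KG_SOURCE=true
# Required once the knowledge source is enabled (placeholder value)
WEAVIATE_API_KEY=your-weaviate-api-key
```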
11.5. Environment Variables#
11.5.1. Core#
| Variable | Description | Default |
|---|---|---|
| `ENABLE_KG_SOURCE` | Enable vector DB knowledge source | |
| `WEAVIATE_API_KEY` | Required when using Weaviate | — |
Note: `WEAVIATE_API_KEY` is required if you enable the knowledge source.
11.5.2. Weaviate (optional)#
| Variable | Description | Default |
|---|---|---|
| | HTTP host | |
| | HTTP port | |
| | Use HTTPS for the HTTP connection | |
| | gRPC host | |
| | gRPC port | |
| | Use secure gRPC | |
11.5.3. Weaviate Timeouts#
| Variable | Description | Default |
|---|---|---|
| | Initialization timeout (s) | |
| | Query timeout (s) | |
| | Insert timeout (s) | |
11.5.4. Ollama for Weaviate#
| Variable | Description | Default |
|---|---|---|
| `OLLAMA_API_ENDPOINT` | Ollama API endpoint | |
| `OLLAMA_MODEL` | Embedding model | |
If Ollama runs on the host and Weaviate runs in Docker, use `http://host.docker.internal:11434`. If both run in Docker on the same host network, use `http://localhost:11434`.
11.5.5. Experiment Tracking (optional)#
| Variable | Description | Default |
|---|---|---|
| | Enable W&B | |
| | Enable MLflow | |
| | MLflow tracking URL | |
11.6. Example .env#
```shell
ENABLE_KG_SOURCE=false
OLLAMAAPI_ENDPOINT=http://localhost:11434
OLLAMA_MODEL=nomic-embed-text:v1.5
```