
RepoWise

A Conversational Framework for Mining and Reasoning About Project Repositories

Department of Computer Science, University of California, Davis

Abstract

Open-source software (OSS) projects often face challenges in governance, sustainability, and contributor onboarding. Traditional analytics provide static metrics but lack interpretability and interactivity. RepoWise introduces a conversational framework powered by large language models (LLMs) that performs forensic-style reasoning over OSS repositories. It enables natural-language dialogue around key project documents—such as GOVERNANCE.md, CONTRIBUTING.md, and README.md—to surface insights into project health, sustainability risks, and actionable next steps.

By combining conversational AI with OSS analytics, RepoWise offers an interpretable, interactive approach to understanding the social and technical dynamics of open-source development. The system automatically extracts and indexes governance documents, contribution guidelines, commit data, and issue reports from GitHub repositories, then uses a five-stage intent classification pipeline with dual retrieval engines to generate context-grounded, evidence-backed responses.

We’d really appreciate your thoughts on this tool. Please share your feedback here:  https://forms.gle/GUQyYY6SijDbtUVe9

System Architecture

High-Level Overview

Figure 1: High-level system architecture showing major components

Detailed Architecture

Figure 2: Detailed architecture with RAG pipeline and data flow

Core Components

  1. User Interface (Frontend): React-based chat interface for natural language queries
  2. Intent Classification Module: Five-stage hierarchical pipeline for routing queries
  3. RAG Pipeline: Semantic search over project documentation using all-MiniLM-L6-v2 embeddings and ChromaDB
  4. CSV Data Pipeline: Structured retrieval of commit and issue metadata via GitHub API
  5. Prompt Assembly Engine: Merges retrieved context with anti-hallucination rules
  6. LLM Generation Module: Local Mistral 7B via Ollama for privacy-preserving reasoning
  7. Persistent Storage Layer: ChromaDB vector store, CSV cache, and file cache for reproducibility

Intent Classification System

The Intent Classification Module is the cognitive core of RepoWise. It determines how each user query should be processed by identifying its semantic intent and routing it to the corresponding retrieval engine. The classification mechanism follows a five-stage hierarchical logic that incrementally filters, analyzes, or refines the query based on linguistic and semantic cues until an appropriate intent label is assigned.

Figure 3: Five-stage intent classification pipeline

Five-Stage Classification Pipeline

  1. Stage 0 - Out-of-Scope Detection: Screens out greetings and irrelevant inputs (confidence ~0.99)
  2. Stage 1 - Procedural Detection: Identifies process-oriented questions about documentation using phrases like "how do I," "who maintains" (confidence ~0.95)
  3. Stage 2 - Statistical Detection: Checks for numerical or activity-based queries like "Show top contributors" or "Which issues have most comments" (confidence ~0.90)
  4. Stage 3 - Keyword Scoring: Computes weighted frequencies for governance-related tokens. Queries with cumulative score ≥1.5 are classified as PROJECT_DOC (confidence 0.50-0.95)
  5. Stage 4 - Heuristic Fallback: Captures edge cases through syntactic and semantic heuristics (confidence 0.40-0.65)
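
The staged control flow can be illustrated with a minimal Python sketch. The stage ordering, the PROJECT_DOC keyword threshold of 1.5, and the approximate confidence values follow the list above; the keyword lists, weights, and function name are illustrative assumptions, not RepoWise's internal code.

# Illustrative keyword weights for Stage 3; the actual RepoWise weights are not published here.
GOVERNANCE_KEYWORDS = {
    "governance": 1.0, "maintainer": 1.0, "contributing": 1.0,
    "license": 0.8, "policy": 0.8, "code of conduct": 1.2,
}

def classify_intent(query: str) -> tuple[str, float]:
    q = query.lower().strip()

    # Stage 0 - Out-of-scope detection: greetings and irrelevant input.
    if q in {"hi", "hello", "thanks", "thank you"}:
        return "OUT_OF_SCOPE", 0.99

    # Stage 1 - Procedural detection: process-oriented documentation questions.
    if any(p in q for p in ("how do i", "how can i", "who maintains", "what is the process")):
        return "PROJECT_DOC", 0.95

    # Stage 2 - Statistical detection: numerical / activity queries go to the CSV pipeline.
    if any(p in q for p in ("top contributors", "most comments", "how many", "latest commits")):
        return "ISSUES" if "issue" in q else "COMMITS", 0.90

    # Stage 3 - Keyword scoring: cumulative governance score >= 1.5 maps to PROJECT_DOC.
    score = sum(weight for kw, weight in GOVERNANCE_KEYWORDS.items() if kw in q)
    if score >= 1.5:
        return "PROJECT_DOC", min(0.95, 0.50 + score / 10)

    # Stage 4 - Heuristic fallback for everything else.
    return "GENERAL", 0.40

print(classify_intent("Who maintains this project?"))  # ('PROJECT_DOC', 0.95)
print(classify_intent("Show top contributors"))        # ('COMMITS', 0.90)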

Query Classification Categories

  • PROJECT_DOC: Governance, contribution guidelines, project policies → routed to RAG Pipeline
    • Example: "Who maintains this project?"
    • Example: "What are the contribution guidelines?"
  • COMMITS: Code changes, contributors, development activity → routed to CSV Data Pipeline
    • Example: "Who are the most active contributors?"
    • Example: "What files were changed recently?"
  • ISSUES: Bugs, feature requests, project health → routed to CSV Data Pipeline
    • Example: "How many open issues are there?"
    • Example: "What are the most commented issues?"

Prompt Templates

Each query is transformed into a task-specific prompt composed of five modular components. RepoWise currently defines eight intent-driven templates that balance factual grounding, task specialization, and evidence transparency.

Template Routing

Intent classification determines which prompt template is assembled. Out-of-scope requests are handled directly, while documentation, statistical, and general inquiries leverage structured or semantic retrieval pipelines.

| Intent                   | Template                  | Tokens |
|--------------------------|---------------------------|--------|
| OUT_OF_SCOPE             | Direct response (no LLM)  | 0      |
| PROJECT_DOC_BASED (WHO)  | WHO                       | ∼2000  |
| PROJECT_DOC_BASED (HOW)  | HOW                       | ∼2000  |
| PROJECT_DOC_BASED (WHAT) | WHAT                      | ∼2000  |
| PROJECT_DOC_BASED (LIST) | LIST                      | ∼2000  |
| COMMITS                  | COMMITS                   | ∼1000  |
| ISSUES                   | ISSUES                    | ∼1000  |
| GENERAL                  | GENERAL                   | ∼900   |

Component Structure

Every template combines universal guardrails with task-specific context. The table below summarizes how each component contributes to grounded, verifiable answers.

| Component          | Purpose                       | Tokens   | Scope    |
|--------------------|-------------------------------|----------|----------|
| System Role        | Establish analytical persona  | ∼50      | All      |
| Task Instructions  | Define extraction task        | 500–1200 | Specific |
| Anti-Hallucination | Constrain factual boundaries  | ∼3000    | All      |
| Retrieved Context  | Provide evidence base         | 500–4000 | Specific |
| User Question      | Present query                 | 50–200   | All      |

Component Templates

Component 1: System Role (Universal, ∼50 tokens)

You are a precise document analyst for the {project_name} project.

Principle: Establishes domain-specific context. Identical across all templates to ensure consistent analytical framing.

Component 2: Task Instructions (Template-Specific)

The second component tailors the extraction strategy to the user’s intent.

WHO Template (∼1000 tokens)
TASK: ENTITY EXTRACTION - Extract names, emails, GitHub usernames, and roles

1. Search the documents below for actual names, email addresses, and GitHub usernames.
2. Look for these patterns:
   - @username (GitHub usernames)
   - Name <email@example.com> (name with email)
   - "Maintained by: Name" (explicit roles)
   - "Team: [list of names]" (team structures)
   - "Contact: email@domain.com" (contact information)

3. ONLY extract what explicitly appears in the documents.
4. DO NOT invent names or assume maintainers.
5. DO NOT guess email addresses or GitHub usernames.

6. If no names found, state: "Based on the available project documents, I cannot find information about maintainers. The following documents were searched: [list documents]."

7. Distinguish between:
   - Project maintainers (ongoing stewardship)
   - Original authors/creators (historical)
   - Contributors (code contributions)
   - Organization/foundation (ownership)

Principle: Pattern-based entity extraction prevents hallucination of maintainer identities. Explicit role distinction clarifies governance structure.
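
In RepoWise these patterns are interpreted by the LLM rather than matched mechanically, but they can be made concrete with a small regex sketch; the expressions and sample text below are illustrative, not part of the system.

import re

text = """Maintained by: Jane Doe <jane@example.org>
Contact: team@example.org
Thanks to @octocat for the initial prototype."""

patterns = {
    "github_username": r"(?<![\w.])@([A-Za-z0-9-]+)",                      # @username
    "name_with_email": r"([A-Z][\w'.-]+(?:\s[\w'.-]+)*)\s*<([^<>\s]+@[^<>\s]+)>",  # Name <email>
    "explicit_role":   r"Maintained by:\s*(.+)",                            # "Maintained by: Name"
    "contact_email":   r"Contact:\s*([^<>\s]+@[^<>\s]+)",                   # "Contact: email"
}

# Print every occurrence of each pattern found in the sample text.
for label, pattern in patterns.items():
    for match in re.finditer(pattern, text):
        print(f"{label}: {match.groups()}")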

HOW Template (∼1200 tokens)
TASK: PROCEDURAL EXTRACTION - Extract step-by-step instructions

1. Extract the exact procedure/steps from the documents.
2. Present steps in sequential order (numbered or bulleted).
3. Include prerequisites if mentioned in documents.
4. Include links, references, or commands if provided.
5. Preserve the structure: setup → process → completion.

6. DO NOT add steps not explicitly stated in documents.
7. DO NOT assume standard practices not documented.

8. If procedure is incomplete, acknowledge gaps: "Steps X-Y are not documented."

9. If no procedure found, state: "Based on the available project documents, I cannot find a documented process for [topic]. The following documents were searched: [list documents]."

10. For multi-step processes, indicate:
    - Required actions (MUST do)
    - Optional actions (MAY do)
    - Conditional actions (IF condition, THEN do)

Principle: Sequential ordering with explicit requirement levels (MUST/MAY/IF) preserves procedural fidelity and acknowledges incomplete documentation.

WHAT Template (∼1000 tokens)
TASK: INFORMATION EXTRACTION - Extract specific information

1. Extract the exact information requested from documents.
2. Quote directly from documents when possible.
3. Include relevant details:
   - Dates or version numbers if applicable
   - Requirements or constraints
   - Exceptions or special cases

4. Provide document citations for all information.

5. DO NOT paraphrase unless necessary for clarity.
6. DO NOT combine information from multiple documents unless they complement each other.

7. If information is ambiguous, present all interpretations.

8. If information not found, state: "Based on the available project documents, I cannot find information about [topic]. The following documents were searched: [list documents]."

9. For policies or rules:
   - State the policy clearly
   - Include any exceptions
   - Cite the authoritative document

Principle: Direct quotation and multi-source acknowledgment preserve information accuracy and handle ambiguity transparently.

LIST Template (∼1100 tokens)
TASK: LIST EXTRACTION - Extract and format lists

1. Extract ALL items that match the question.
2. Format as bulleted or numbered list (preserve original format if specified).
3. Include descriptions or explanations if provided in documents.
4. Group related items if the document groups them.
5. Preserve hierarchical structure if present (main items, sub-items, detailed points).
6. Preserve order if specified in documents (e.g., "priority order", "sequence").

7. DO NOT add items not explicitly in documents.
8. DO NOT reorganize items unless necessary for clarity.

9. If list is incomplete, state: "Partial list provided. Complete list may not be documented."

10. If no list found, state: "Based on the available project documents, I cannot find a list of [topic]. The following documents were searched: [list documents]."

11. For each item, include:
    - The item name/title
    - Brief description (if provided)
    - Requirements or conditions (if applicable)

Principle: Structure preservation with explicit incompleteness handling maintains list fidelity and avoids arbitrary reordering.

COMMITS Template (∼800 tokens)
TASK: COMMIT DATA ANALYSIS - Analyze commit and contributor data

1. Analyze the CSV data provided below.
2. Answer questions based on commit statistics.
3. Include relevant information:
   - Commit dates and timestamps
   - Author names and emails
   - File changes (added, modified, deleted)
   - Commit messages and descriptions

4. For ranking queries ("top N contributors"):
   - Sort by relevant metric (commit count, file changes, etc.)
   - Provide top N as requested

5. For temporal queries ("latest commits", "recent activity"):
   - Sort by date/timestamp
   - Include timeframe in response

6. Calculate statistics when needed:
   - Counts
   - Averages
   - Trends

7. If data is unavailable, state: "Commit data is not available for this project."

Example Context (CSV format):

COMMIT DATA FOR vercel-swr:

| commit_id | author | author_email   | date            | message                | files_changed | additions | deletions |
|-----------|--------|----------------|-----------------|------------------------|---------------|-----------|-----------|
| abc123... | john   | j@example.com  | 2025-01-15 14:30 | Fix bug in parser      | 3             | 45        | 12        |
| def456... | jane   | jane@ex.com    | 2025-01-14 09:15 | Add new feature        | 7             | 234       | 18        |
| ghi789... | john   | j@example.com  | 2025-01-13 11:20 | Update docs            | 1             | 8         | 2         |

Example Query: “Who are the top contributors?”

Expected Response:

Top contributors for vercel-swr based on commit count:

1. john (j@example.com) - 2 commits, 53 additions, 14 deletions
2. jane (jane@ex.com) - 1 commit, 234 additions, 18 deletions

Principle: CSV-based retrieval enables direct aggregation, sorting, and ranking operations that are computationally inefficient with semantic search. Structured data maintains precise numerical relationships without semantic approximation.
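
The aggregation described above maps naturally onto dataframe operations. A minimal sketch, assuming the cached commit CSV uses the column names from the example table; the file path is hypothetical.

import pandas as pd

# Hypothetical path to the cached commit CSV for one project.
commits = pd.read_csv("cache/vercel-swr/commits.csv")

# "Top contributors" = group by author, then aggregate commit counts and line changes.
top = (
    commits.groupby(["author", "author_email"])
    .agg(commits=("commit_id", "count"),
         additions=("additions", "sum"),
         deletions=("deletions", "sum"))
    .sort_values("commits", ascending=False)
    .head(5)
    .reset_index()
)

for row in top.itertuples(index=False):
    print(f"{row.author} ({row.author_email}) - {row.commits} commits, "
          f"{row.additions} additions, {row.deletions} deletions")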

ISSUES Template (∼800 tokens)
TASK: ISSUE DATA ANALYSIS - Analyze issue and discussion data

1. Analyze the CSV data provided below.
2. Answer questions based on issue statistics.
3. Include relevant information:
   - Issue titles and descriptions
   - Comment counts and discussion activity
   - Issue state (open, closed)
   - Labels and tags
   - Created/closed dates

4. For ranking queries ("most commented issues"):
   - Sort by the relevant metric
   - Provide top N as requested

5. For filtering queries ("open issues", "high-priority"):
   - Filter by state, labels, or other criteria
   - Include counts

6. Calculate statistics when needed:
   - Counts
   - Ratios
   - Trends

7. If data is unavailable, state: "Issue data is not available for this project."

Example Context (CSV format):

ISSUE DATA FOR vercel-swr:

| issue_id | title                        | author | comments | state  | labels             | created_at  | closed_at   |
|----------|------------------------------|--------|----------|--------|--------------------|-------------|-------------|
| #123     | Bug: Parser fails on edge    | user1  | 45       | open   | bug,high-priority  | 2025-01-10  | null        |
| #124     | Feature: Add export func     | user2  | 32       | closed | enhancement        | 2025-01-08  | 2025-01-12  |
| #125     | Question: Configure cache?   | user3  | 18       | open   | question           | 2025-01-09  | null        |

Example Query: “Which issues have the most comments?”

Expected Response:

Most commented issues for vercel-swr:

1. #123: "Bug: Parser fails on edge case" (45 comments, open)
   Labels: bug, high-priority
   Created: January 10, 2025

2. #124: "Feature: Add export functionality" (32 comments, closed)
   Labels: enhancement
   Created: January 8, 2025 | Closed: January 12, 2025

3. #125: "Question: How to configure cache?" (18 comments, open)
   Labels: question
   Created: January 9, 2025

Principle: CSV format enables precise filtering, sorting, and aggregation across issue metadata. Structured tables preserve exact numeric values critical for ranking queries.

Why CSV over RAG for COMMITS/ISSUES
  1. Aggregation: Computing “top N” requires sorting entire datasets, not retrieving top-5 chunks.
  2. Precision: Numeric operations (counts, sums, averages) need exact values, not semantic similarity.
  3. Multi-field filtering: Queries like “open issues with >10 comments and label=bug” require Boolean logic across columns.
  4. Temporal ordering: Chronological sorting by date is a structured operation, not a semantic one.
  5. Performance: Table scans with indexing outperform vector similarity search for statistical queries.
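
For example, the multi-field filter mentioned in point 3 is a few lines over the issue table but has no clean equivalent in top-k vector retrieval. A minimal sketch, assuming the cached issue CSV uses the columns from the example above; the file path is hypothetical.

import pandas as pd

issues = pd.read_csv("cache/vercel-swr/issues.csv")  # hypothetical cache location

# Boolean logic across columns: open issues with more than 10 comments labeled "bug".
mask = (
    (issues["state"] == "open")
    & (issues["comments"] > 10)
    & issues["labels"].fillna("").str.contains("bug")
)
hot_bugs = issues[mask].sort_values("comments", ascending=False)

print(f"{len(hot_bugs)} matching issues")
print(hot_bugs[["issue_id", "title", "comments", "labels"]].to_string(index=False))
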
GENERAL Template (∼500 tokens)
TASK: PROVIDE GENERAL GUIDANCE - Help with general programming questions

1. Provide helpful, accurate information based on software engineering best practices.
2. Reference widely-accepted standards when applicable.
3. Keep answers concise and actionable.
4. Provide examples if helpful.

5. DO NOT make project-specific claims without evidence.
6. DO NOT assume user's context or requirements.

7. If question is too broad, ask for clarification or provide high-level overview.
8. Acknowledge when multiple valid approaches exist.

Principle: Generic guidance without project assumptions acknowledges technical diversity and avoids false specificity.

Component 3: Anti-Hallucination Rules (Universal, ∼3000 tokens)

CRITICAL INSTRUCTIONS - FOLLOW EXACTLY:

1. INFORMATION SOURCE
   Your ONLY source of information is the project documents provided below in the "AVAILABLE GOVERNANCE DOCUMENTS" or data tables section.

   Do not:
   - Use external knowledge, training data, or general information about similar projects
   - Assume information based on common practices in open source
   - Reference information from previous conversations or other projects

2. HANDLING MISSING INFORMATION
   If information is NOT in the provided documents, respond exactly like this:

   "Based on the available project documents, I cannot find information about [specific topic]. The following documents were searched: [list document names]."

   Never:
   - Say "typically", "usually", or similar generalization words
   - Fill gaps with assumptions
   - Suggest what the project "should" have without evidence

3. SOURCE ATTRIBUTION
   Always cite which specific document contains the information using:
   - "According to [DOCUMENT_NAME], ..."
   - If information appears in multiple documents, cite all sources

4. PRECISION OVER SPECULATION
   - Only state what the documents explicitly say
   - If ambiguous, acknowledge ambiguity
   - Do not extrapolate beyond what is written
   - Quote directly when precision is critical

5. DOCUMENT SCOPE
   - Only use the documents provided in this prompt
   - Do not reference documents that might exist but are not provided
   - If a relevant document seems to be missing, state that it may be in the missing document type

6. NO ASSUMPTIONS ABOUT PROJECT STRUCTURE
   Do not assume:
   - Organizational structure
   - Roles or responsibilities
   - Development processes
   - Technical decisions

   Unless explicitly documented.

7. EXPLICIT UNCERTAINTY
   State:
   - Conflicts when documents disagree
   - What is missing when information is partial

   Use phrases like:
   - "partially documented"
   - "not fully specified"
   - "unclear from available documents"
   - "requires clarification"

Principle: A seven-rule constraint system prioritizes factual accuracy over completeness. Explicit missing-information protocols prevent hallucination and maintain consistent factual standards.

Component 4: Retrieved Context (Template-Specific)

PROJECT_DOC Templates (∼3000 tokens)
AVAILABLE GOVERNANCE DOCUMENTS FOR {project_name}:

[README] README.md:
# SWR - React Hooks for Data Fetching

SWR is a React Hooks library for data fetching. The name "SWR" is derived from stale-while-revalidate, a cache invalidation strategy...

[CONTRIBUTING] CONTRIBUTING.md:
# Contributing to SWR

Thanks for your interest in contributing to SWR! Please follow these guidelines...

[CODE_OF_CONDUCT] .github/CODE_OF_CONDUCT.md:
# Contributor Covenant Code of Conduct

## Our Pledge
We as members, contributors, and leaders pledge to make participation...

[LICENSE] LICENSE:
MIT License
Copyright (c) 2025 Vercel, Inc.
...

Retrieval Process:

  1. Query embedding (all-MiniLM-L6-v2, 384 dimensions)
  2. Vector similarity search (ChromaDB)
  3. Hybrid reranking (semantic + keyword + recency)
  4. Top-5 chunks assembled (∼3000 tokens)
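
A minimal sketch of this retrieval step, using the embedding model and vector store named above; the collection name, persistence path, and metadata fields are assumptions, and the hybrid reranking step is omitted.

import chromadb
from sentence_transformers import SentenceTransformer

# Same embedding model RepoWise uses for document chunks (384-dimensional vectors).
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical persistent store and collection name for one indexed project.
client = chromadb.PersistentClient(path="chroma_store")
collection = client.get_or_create_collection("vercel-swr-docs")

query = "What are the contribution guidelines?"
query_embedding = embedder.encode(query).tolist()

# Vector similarity search; the top-5 chunks form the ~3000-token context window.
results = collection.query(query_embeddings=[query_embedding], n_results=5)

for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(meta.get("file_path", "unknown"), "->", doc[:80])
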
COMMITS/ISSUES Templates (∼500–1000 tokens)

See CSV format examples in the Task Instructions above.

Data Source: GitHub API → Local CSV cache.

GENERAL Template (∼200 tokens)
CONTEXT:
This is a general programming/development question not specific to a particular project.
Provide helpful, accurate information based on software engineering best practices.
The user has not selected a specific project, so avoid making project-specific claims.

Component 5: User Question (Universal, ∼50–200 tokens)

USER QUESTION: {question}

Your answer:

Principle: Clear separation between instructions and query ensures a consistent generation cue across all templates.
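
Putting the five components together, the Prompt Assembly Engine can be pictured as string composition in template order. The function and variable names below are illustrative, not RepoWise's internal API.

def assemble_prompt(project_name: str, task_instructions: str,
                    anti_hallucination_rules: str, retrieved_context: str,
                    question: str) -> str:
    """Compose the five prompt components in the order described above."""
    return "\n\n".join([
        # Component 1: System Role (universal, ~50 tokens)
        f"You are a precise document analyst for the {project_name} project.",
        # Component 2: Task Instructions (template-specific, e.g. the WHO template)
        task_instructions,
        # Component 3: Anti-Hallucination Rules (universal, ~3000 tokens)
        anti_hallucination_rules,
        # Component 4: Retrieved Context (RAG chunks or CSV tables)
        retrieved_context,
        # Component 5: User Question (universal generation cue)
        f"USER QUESTION: {question}\n\nYour answer:",
    ])

# Example (names hypothetical):
# prompt = assemble_prompt("vercel-swr", who_instructions, rules, context, "Who maintains this project?")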

Design Rationale

  • Modularity: Universal components (System Role, Anti-Hallucination, User Question) pair with task-specific instructions and context to enable rapid template extension.
  • Factual Grounding: Extensive anti-hallucination guidance (∼37.5% of total prompt) prioritizes accuracy over fluency.
  • Task Specialization: Distinct WHO/HOW/WHAT/LIST instructions optimize extraction for entities, procedures, facts, and lists.
  • Structured vs. Semantic Retrieval: CSV format for COMMITS/ISSUES enables SQL-like aggregation, sorting, and filtering that semantic search cannot perform efficiently.
  • Evidence Transparency: Mandatory source attribution supports independent verification and reproducible analysis.

Key Features

🤖 Five-Stage Intent Classification

Hierarchical pipeline achieving 86.4% accuracy in routing queries to appropriate retrieval engines

🔍 Dual Retrieval Engines

Semantic RAG for documentation and structured CSV pipeline for commits/issues data

🛡️ Anti-Hallucination Mechanisms

Context-grounded responses with strict factual boundaries and evidence-backed reasoning

📊 Evidence-Grounded Repository Mining

Forensic-style reasoning with source attribution, provenance tracking, and transparent citation of evidence

🔒 Privacy-Preserving Local LLM

Mistral 7B running locally via Ollama ensures data privacy and reproducibility

💬 Interactive Conversational Interface

Natural language dialogue with provenance metadata and source attribution

Technology Stack

Backend

  • Framework: FastAPI (Python)
  • Vector Database: ChromaDB for document embeddings
  • Embeddings: all-MiniLM-L6-v2 (sentence-transformers)
  • LLM: Mistral 7B via Ollama (local inference)
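
Because generation runs against a local Ollama server, the LLM call is a plain HTTP request to localhost. A minimal sketch, using Ollama's standard REST endpoint and the model name pulled in the setup steps below; the timeout and wrapper function are assumptions.

import requests

def generate_answer(prompt: str, model: str = "mistral:7b") -> str:
    """Send an assembled prompt to the local Ollama server and return the completion."""
    response = requests.post(
        "http://localhost:11434/api/generate",   # Ollama's default local endpoint
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["response"]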

Frontend

  • Framework: React 18 with Vite
  • UI Library: Tailwind CSS
  • State Management: TanStack Query (React Query)

Installation & Setup

Backend Setup

# Clone the repository
git clone https://github.com/RepoWise/backend.git
cd backend

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your GitHub token and settings

# Install Ollama (for local LLM)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull mistral:7b

# Start the server
./start_dev.sh

Frontend Setup

# Clone the repository
git clone https://github.com/RepoWise/frontend.git
cd frontend

# Install dependencies
npm install

# Configure environment
cp .env.example .env
# Edit .env with backend URL

# Start development server
npm run dev

Usage Example

Adding a Repository

POST /api/projects/add
{
  "github_url": "https://github.com/facebook/react"
}

Querying Documentation

POST /api/query
{
  "project_id": "facebook-react",
  "query": "What are the contribution guidelines?"
}

Response Format

{
  "answer": "To contribute to React, you should...",
  "sources": [
    {
      "file_path": "CONTRIBUTING.md",
      "score": 0.89,
      "content": "..."
    }
  ],
  "suggested_questions": [
    "How do I submit a pull request?",
    "What is the code review process?"
  ]
}
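
The two endpoints above can be exercised from any HTTP client. A minimal Python sketch; the base URL and port are assumptions, while the paths and payloads follow the examples above.

import requests

BASE_URL = "http://localhost:8000"  # assumed local FastAPI address

# Index a repository so its documents, commits, and issues can be queried.
requests.post(f"{BASE_URL}/api/projects/add",
              json={"github_url": "https://github.com/facebook/react"}).raise_for_status()

# Ask a documentation question and inspect the evidence behind the answer.
resp = requests.post(f"{BASE_URL}/api/query",
                     json={"project_id": "facebook-react",
                           "query": "What are the contribution guidelines?"})
resp.raise_for_status()
payload = resp.json()

print(payload["answer"])
for source in payload["sources"]:
    print(f"  {source['file_path']} (score {source['score']})")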

Research & Publications

RepoWise represents a paradigm shift in repository analytics—moving from static metrics to interactive, interpretable inquiry. By combining conversational retrieval with evidence-grounded LLM reasoning, the framework enables stakeholders to ask natural-language questions and receive contextual, verifiable answers with source citations, transforming how developers, maintainers, and researchers understand and navigate project repositories.

Use Cases

  • Contributor Onboarding: New contributors ask procedural questions in natural language ("How do I submit a pull request?", "What coding standards should I follow?") and receive step-by-step guidance extracted directly from project documentation with exact citations
  • Governance and Community Health: Maintainers query governance structure ("Who has commit access?"), contribution policies ("What's our code review process?"), and community dynamics ("Who are the most active contributors?") to assess transparency, engagement, and sustainability
  • Repository Forensics: Researchers and auditors perform forensic analysis to trace decision provenance ("When was the license changed?"), verify compliance ("Are there any dual-licensed files?"), detect documentation drift, and investigate historical development patterns

Key Research Contributions

  • Five-stage intent classification pipeline with 86.4% accuracy
  • Dual retrieval architecture combining semantic RAG and structured CSV data
  • Evidence-grounded reasoning with anti-hallucination mechanisms
  • Local LLM inference for privacy-preserving, reproducible analysis
  • Conversational framework for governance and sustainability forensics

Future Work

  • Agentic Multi-Turn Reasoning: Extend RepoWise with an autonomous, multi-turn reasoning framework capable of proactive retrieval and conversational memory management
  • Code-Level Analysis: Expand beyond documentation to include code summarization, dependency forensics, and automatic sustainability scoring
  • Foundation Recommendation: Develop AI-driven foundation alignment analysis to recommend which OSS foundation (Apache, Eclipse, OSGeo, etc.) best fits a project's governance model based on textual and social cues
  • Public Deployment: Deploy as a web-based service with user authentication and persistent repository contexts for community-driven evaluation
  • Longitudinal Studies: Conduct large-scale user studies to assess real-world usability, adoption patterns, and long-term maintenance behavior

Cite RepoWise

If you use RepoWise in your research or projects, please cite it using the following entry:

@software{RepoWise2025,
  author       = {RepoWise contributors},
  title        = {RepoWise — Repository sustainability tracker (website)},
  year         = {2025},
  url          = {https://repowise.github.io/RepoWise-website/},
}

Acknowledgments

This research was supported by the National Science Foundation under Grant No. 2020751, as well as by the Alfred P. Sloan Foundation through the OSPO for UC initiative (Award No. 2024-22424).

Contact

RepoWise is developed by the DECAL Lab at UC Davis.

GitHub Organization | Backend Repository | Frontend Repository

For questions or feedback, please open an issue on our GitHub repository.
