A powerful application for indexing and querying code repositories using AI. This tool provides intelligent code search, explanation, and documentation capabilities built on advanced RAG (Retrieval-Augmented Generation) techniques.
## Features

- **Specialized Code Embeddings**
  - CodeBERT and GraphCodeBERT integration
  - AST-based code embeddings
  - Cross-language code embeddings
  - Function-level and class-level embeddings
  - Language-specific parsers for multiple languages
- **Enhanced Code Analysis**
  - AST-based code structure analysis
  - Code dependency tracking
  - Code flow analysis
  - Code complexity metrics
  - Type inference and validation
- **Advanced Query Processing**
  - Query intent detection (explain, bug, feature, usage, implementation)
  - Query reformulation for better search results
  - Dynamic context window sizing
  - Conversation history support
  - Context-aware responses
- **Hybrid Search** (see the sketch after this list)
  - Dense and sparse retriever combination
  - Code-specific reranking using BM25
  - Vector similarity search
  - Code knowledge graph integration
  - Multi-stage retrieval pipeline
- **Comprehensive Documentation**
  - API documentation generation
  - Module documentation
  - Code examples
  - Project README generation
  - Documentation site generation
  - Code explanation capabilities
- **Analytics System**
  - System metrics tracking
  - Query performance analytics
  - Usage statistics
  - Metrics retention management
  - Performance monitoring
- **Enhanced Logging**
  - Rotating log files
  - Component-specific logging
  - Detailed error tracking
  - Performance monitoring
  - Debug information
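As a rough illustration of the hybrid search idea above, the sketch below blends normalized dense (embedding) and sparse (BM25) scores. It is not the project's implementation (that lives in `app/rag/`); it assumes the `sentence-transformers` and `rank-bm25` packages and uses `all-MiniLM-L6-v2` as a stand-in embedding model:

```python
# Illustrative sketch of dense + sparse hybrid retrieval with BM25 reranking.
# NOT the project's actual code (see app/rag/ for that); assumes the
# rank-bm25 and sentence-transformers packages are installed.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

def hybrid_search(query: str, chunks: list[str], alpha: float = 0.6, top_k: int = 3) -> list[str]:
    """Blend normalized dense (vector) and sparse (BM25) relevance scores."""
    # Dense scores: cosine similarity between query and chunk embeddings.
    model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model
    emb = model.encode([query] + chunks, normalize_embeddings=True)
    dense = emb[1:] @ emb[0]  # dot product of unit vectors = cosine similarity

    # Sparse scores: BM25 over whitespace-tokenized chunks, scaled to [0, 1].
    bm25 = BM25Okapi([chunk.split() for chunk in chunks])
    sparse = np.array(bm25.get_scores(query.split()))
    if sparse.max() > 0:
        sparse = sparse / sparse.max()

    # Weighted combination; alpha > 0.5 favors the dense retriever.
    combined = alpha * dense + (1 - alpha) * sparse
    return [chunks[i] for i in np.argsort(combined)[::-1][:top_k]]
```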
## Project Structure

```
ai-code-context/
├── app/
│   ├── analytics/              # Analytics and monitoring system
│   │   ├── monitor.py          # System metrics tracking
│   │   └── metrics.py          # Performance metrics
│   ├── config/                 # Configuration management
│   │   └── settings.py         # Application settings
│   ├── docs/                   # Documentation generation
│   │   └── auto_documenter.py  # Auto-documentation system
│   ├── github/                 # GitHub integration
│   │   ├── repo_scanner.py     # Repository scanning
│   │   └── indexer.py          # Code indexing
│   ├── rag/                    # RAG system components
│   │   ├── advanced_rag.py     # Advanced RAG implementation
│   │   ├── query_optimizer.py  # Query optimization
│   │   └── code_explainer.py   # Code explanation
│   ├── utils/                  # Utility functions
│   │   ├── code_chunker.py     # Code chunking
│   │   ├── llm.py              # LLM integration
│   │   └── text_processing.py  # Text processing
│   └── vector_store/           # Vector storage
│       └── chroma_store.py     # ChromaDB implementation
├── logs/                       # Application logs
├── metrics/                    # Analytics metrics
├── docs/                       # Generated documentation
├── .env                        # Environment variables
├── requirements.txt            # Dependencies
└── README.md                   # This file
```
## Architecture

**Repository Indexing**

```
GitHub Repository → Scanner → Code Chunker → Vector Store
```

**Query Processing**

```
User Query → Query Optimizer → RAG System → LLM → Response
```

**Documentation Generation**

```
Code → Auto Documenter → Documentation Site
```
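As a rough sketch of the Repository Indexing flow, the following assumes ChromaDB's Python client; `index_repository` and the `files` dict are hypothetical names, and the real scanner, chunker, and store live under `app/github/`, `app/utils/`, and `app/vector_store/`:

```python
# Rough sketch of the Repository Indexing flow above. Function and variable
# names are hypothetical; the real scanner, chunker, and store live in
# app/github/, app/utils/, and app/vector_store/.
import chromadb

def index_repository(files: dict[str, str], chunk_size: int = 1000, overlap: int = 200) -> None:
    """`files` maps a file path to its source text (the scanner's output).

    Defaults mirror the documented CHUNK_SIZE / CHUNK_OVERLAP settings.
    """
    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_or_create_collection(name="code_chunks")

    for path, source in files.items():
        # Chunker: fixed-size windows with overlap, so context that straddles
        # a boundary still appears intact in at least one chunk.
        step = chunk_size - overlap
        chunks = [source[i:i + chunk_size] for i in range(0, len(source), step)]
        # Store: ChromaDB embeds documents with its default embedding
        # function when none is supplied.
        collection.add(
            documents=chunks,
            ids=[f"{path}:{i}" for i in range(len(chunks))],
            metadatas=[{"path": path} for _ in chunks],
        )
```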
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/ai-code-context.git
  cd ai-code-context
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables:

  ```bash
  cp .env.example .env
  # Edit .env with your configuration
  ```

## Quick Start

Here's a quick guide to get up and running:
- Set up environment variables:

  ```bash
  cp .env.example .env  # Edit .env with your GitHub token and OpenAI/Anthropic API key
  ```

- Index a repository:

  ```bash
  python -m app.main index --repo owner/repo
  ```

- Query the codebase:

  ```bash
  python -m app.main query --query "How does X work?"
  ```
That's it! You'll get a natural language explanation of the code based on your query. For more advanced usage, see the Usage section below.
## Configuration

Configure the application by creating a `.env` file in the project root directory. You can copy the `.env.example` file and modify it as needed.
**Required**

- `GITHUB_ACCESS_TOKEN`: Your GitHub access token for repository access
- `OPENAI_API_KEY` or `ANTHROPIC_API_KEY`: At least one LLM API key is required

**GitHub Settings**

- `GITHUB_REPOSITORY`: Default repository to index in "owner/repo" format
- `GITHUB_BRANCH`: Default branch to index (defaults to "main")
- `SUPPORTED_FILE_TYPES`: Comma-separated list of file extensions to index (e.g., "py,js,ts,jsx,tsx")
- `EXCLUDED_DIRS`: Directories to exclude from indexing (e.g., "node_modules,.git,__pycache__")

**LLM Settings**

- `MODEL_NAME`: LLM model to use (default: "gpt-4")
- `USE_OPENAI`: Set to "true" to use OpenAI models
- `USE_ANTHROPIC`: Set to "true" to use Anthropic Claude models
- `TEMPERATURE`: Controls response randomness (0.0-1.0, default: 0.7)
- `MAX_TOKENS`: Maximum tokens in generated responses (default: 4000)

**Chunking Settings**

- `CHUNK_SIZE`: Size of code chunks for processing (default: 1000)
- `CHUNK_OVERLAP`: Overlap between chunks (default: 200)
- `MIN_CHUNK_SIZE`: Minimum size for each chunk (default: 100)
- `MAX_CHUNK_SIZE`: Maximum size for each chunk (default: 2000)

**Query Settings**

- `QUERY_REFORMULATION`: Enable query reformulation (default: true)
- `CONVERSATION_HISTORY`: Enable conversation history (default: true)
- `MAX_HISTORY_TURNS`: Maximum conversation turns to remember (default: 5)
- `CONTEXT_WINDOW`: Number of code snippets to include in context (default: 3)

**Vector Store Settings**

- `CHROMA_PERSISTENCE_DIR`: Directory for ChromaDB persistence (default: "./chroma_db")
- `CHROMA_COLLECTION_NAME`: Collection name in ChromaDB (default: "code_chunks")
- `SIMILARITY_METRIC`: Similarity metric for vector search (default: "cosine")
- `USE_DISTRIBUTED_STORE`: Whether to use distributed storage (default: false)

**Logging and Analytics**

- `LOG_LEVEL`: Logging verbosity (default: "INFO")
- `LOG_DIR`: Directory for log files (default: "logs")
- `TRACK_SYSTEM_METRICS`: Enable system metrics tracking (default: true)
- `TRACK_QUERY_METRICS`: Enable query metrics tracking (default: true)
For more advanced configuration options, see the `.env.example` file.
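As a starting point, a minimal `.env` might look like this (all values are placeholders; the options are those documented above):

```env
# Minimal example .env -- placeholder values
GITHUB_ACCESS_TOKEN=ghp_your_token_here
OPENAI_API_KEY=sk-your_key_here
GITHUB_REPOSITORY=owner/repo
GITHUB_BRANCH=main
SUPPORTED_FILE_TYPES=py,js,ts,jsx,tsx
EXCLUDED_DIRS=node_modules,.git,__pycache__
MODEL_NAME=gpt-4
USE_OPENAI=true
TEMPERATURE=0.7
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
LOG_LEVEL=INFO
```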
## Usage

### Indexing a Repository

The application needs to index a GitHub repository before it can answer questions about the code.

```bash
python -m app.main index --repo owner/repo --branch main
```

Parameters:

- `--repo`: The GitHub repository to index in the format "owner/repo" (overrides GITHUB_REPOSITORY from .env)
- `--branch`: The branch to index (overrides GITHUB_BRANCH from .env, defaults to "main")

Example:

```bash
python -m app.main index --repo microsoft/TypeScript --branch main
```

### Querying the Codebase

Once a repository is indexed, you can query it with natural language questions.
```bash
python -m app.main query --query "your question here"
```

Parameters:

- `--query`: Your natural language question about the code (required)
- `--history`: JSON string of conversation history for contextual queries (optional)
- `--show-snippets`: Display code snippets in the output (optional, off by default)
- `--explain`: Generate detailed explanations of the code snippets (optional, off by default)
- `--generate-docs`: Generate documentation based on the query (optional, off by default)
Example - Basic query:

```bash
python -m app.main query --query "How are React hooks used for state management?"
```

Example - Show code snippets:

```bash
python -m app.main query --query "How are React hooks used for state management?" --show-snippets
```

Example - With code explanations:

```bash
python -m app.main query --query "How are React hooks used for state management?" --explain
```

Example - With conversation history:

```bash
python -m app.main query --query "How are they implemented?" --history '[{"query": "What are React hooks?", "answer": "React hooks are functions that..."}]'
```

The output is structured as follows:
- **Response**: A natural language explanation answering your question
- **Code Snippets** (optional, with `--show-snippets`): Relevant code from the repository
- **Code Explanations** (optional, with `--explain`): Detailed explanation of each code snippet
Combining multiple flags:

```bash
python -m app.main query --query "Explain the implementation of useState hook" --show-snippets --explain
```

For documentation generation:

```bash
python -m app.main query --query "Generate documentation for the repository" --generate-docs
```

## Logging

The application maintains separate log files for different components:

- `logs/app.log`: Main application logs
- `logs/github_scanner.log`: GitHub scanning logs
- `logs/vector_store.log`: Vector store operations
- `logs/llm.log`: LLM interactions
- `logs/auto_documenter.log`: Documentation generation logs
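This kind of per-component rotating setup can be reproduced with Python's standard `logging` module. The sketch below is illustrative only; the rotation size, backup count, and log format are assumptions, not the application's actual values:

```python
# Sketch of a component-specific rotating logger using only the standard
# library. Rotation size, backup count, and format are illustrative
# assumptions, not the application's actual settings.
import logging
from logging.handlers import RotatingFileHandler

def get_component_logger(name: str, log_dir: str = "logs") -> logging.Logger:
    """Return a logger that writes to <log_dir>/<name>.log (dir must exist)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = RotatingFileHandler(
            f"{log_dir}/{name}.log", maxBytes=5_000_000, backupCount=3
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger

llm_logger = get_component_logger("llm")  # writes to logs/llm.log
llm_logger.info("LLM request issued")
```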
## Troubleshooting

- **LLM Service Unavailable**
  - Check your API keys in `.env`
  - Verify network connectivity
  - Check service status
- **Vector Store Errors**
  - Verify ChromaDB installation
  - Check disk space
  - Verify permissions
- **Documentation Generation Failures**
  - Check file permissions
  - Verify output directory exists
  - Check for syntax errors in code
- **Indexing Large Repositories** (see the tuning example after this list)
  - Adjust chunk size and overlap
  - Use batch processing
  - Monitor memory usage
- **Query Performance**
  - Enable caching
  - Optimize context window size
  - Use an appropriate model size
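As an illustration of the last two items, a large repository might be tuned in `.env` with values like these (illustrative numbers, using options documented in the Configuration section):

```env
# Illustrative tuning for a large repository
CHUNK_SIZE=2000      # larger chunks -> fewer vectors to embed and store
CHUNK_OVERLAP=100    # less overlap -> less duplicated text in the index
CONTEXT_WINDOW=2     # fewer snippets per query -> smaller prompts, faster responses
```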
## Contributing

- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- OpenAI for GPT models
- Anthropic for Claude models
- ChromaDB for vector storage
- Sentence Transformers for embeddings
- CodeBERT for code-specific embeddings
## Examples

Here are some examples of how to use the application for different purposes.

Index the repository and ask questions to quickly understand the codebase:

```bash
python -m app.main index --repo owner/repo
python -m app.main query --query "What is the high-level architecture of this project?"
python -m app.main query --query "What are the main components and how do they interact?"
```

Implementing features:

```bash
python -m app.main query --query "How do I implement authentication in this codebase?"
python -m app.main query --query "What's the pattern for adding a new API endpoint?"
```

Debugging:

```bash
python -m app.main query --query "Why might I be getting this error: [paste error message]"
python -m app.main query --query "What could cause this function to return null in these cases?"
```

Learning patterns and conventions:

```bash
python -m app.main query --query "How are React hooks used in this project?"
python -m app.main query --query "What design patterns are used for handling async operations?"
```

Code review and refactoring:

```bash
python -m app.main query --query "What areas of this codebase might need refactoring?"
python -m app.main query --query "Are there any potential security vulnerabilities in the authentication system?"
```

Contributing:

```bash
python -m app.main query --query "What's the code style and contribution process for this project?"
python -m app.main query --query "How are tests structured and implemented in this project?"
```