# CodeMind

A CLI tool for intelligent document analysis and commit message generation, using EmbeddingGemma-300m for embeddings, FAISS for vector storage, and Phi-2 for text generation.

## Features

- **Document Indexing**: Embed and index documents for semantic search
- **Semantic Search**: Find relevant documents using natural language queries
- **Smart Commit Messages**: Generate meaningful commit messages from staged git changes
- **RAG (Retrieval-Augmented Generation)**: Answer questions using indexed document context
## Setup

### Prerequisites

- Windows 11
- Conda (Miniconda or Anaconda)
- Git

### Installation
1. **Create a Conda environment:**

   ```bash
   conda create -n codemind python=3.9
   conda activate codemind
   ```

2. **Clone the repository:**

   ```bash
   git clone https://github.com/devjas1/codemind.git
   cd codemind
   ```

3. **Install dependencies:**

   ```bash
   pip install -r requirements.txt
   ```
4. **Download models:**

   **Embedding Model (EmbeddingGemma-300m):**

   - Download from Hugging Face: `google/embeddinggemma-300m`
   - Place in the `./models/embeddinggemma-300m/` directory

   **Generation Model (Phi-2 GGUF):**

   - Download the quantized Phi-2 model: `phi-2.Q4_0.gguf`
   - Place in the `./models/` directory
   - Download from: [Microsoft Phi-2 GGUF](https://huggingface.co/microsoft/phi-2-gguf) or similar quantized versions
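Optionally, both downloads can be scripted with the `huggingface_hub` package (an extra install, not in `requirements.txt`). This is only a sketch: it assumes the repo IDs above are correct and that you have accepted any model license on Hugging Face first.

```python
# Hedged helper: fetch both models with huggingface_hub.
# Verify the repo IDs on Hugging Face before running; the GGUF repo
# in particular may live under a different account.
from huggingface_hub import hf_hub_download, snapshot_download

# Embedding model: a full directory of model files
snapshot_download(
    repo_id="google/embeddinggemma-300m",
    local_dir="./models/embeddinggemma-300m",
)

# Generation model: a single quantized GGUF file
hf_hub_download(
    repo_id="microsoft/phi-2-gguf",  # assumption: swap for the repo that hosts phi-2.Q4_0.gguf
    filename="phi-2.Q4_0.gguf",
    local_dir="./models",
)
```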
### Directory Structure

```
CodeMind/
├── cli.py                    # Main CLI entry point
├── config.yaml               # Configuration file
├── requirements.txt          # Python dependencies
├── models/                   # Model storage
│   ├── embeddinggemma-300m/  # Embedding model directory
│   └── phi-2.Q4_0.gguf       # Phi-2 quantized model file
├── src/                      # Core modules
│   ├── config_loader.py      # Configuration management
│   ├── embedder.py           # Document embedding
│   ├── retriever.py          # Semantic search
│   ├── generator.py          # Text generation
│   └── diff_analyzer.py      # Git diff analysis
├── docs/                     # Documentation
└── vector_cache/             # FAISS index storage (auto-created)
```
## Usage

### Initialize Document Index

Index documents from a directory for semantic search:

```bash
python cli.py init ./docs/
```

This will:

- Embed all documents in the specified directory
- Create a FAISS index in `vector_cache/`
- Save metadata for retrieval
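Under the hood this amounts to roughly the following, using the libraries from `requirements.txt`. The file names and pipeline here are illustrative, not CodeMind's actual internals:

```python
import os

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("./models/embeddinggemma-300m")
docs = ["first document text", "second document text"]  # stand-ins for files in ./docs/

# Normalized float32 vectors, shape (n, 768) per config.yaml
embeddings = model.encode(docs, normalize_embeddings=True)

# Inner product on unit vectors is cosine similarity
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

os.makedirs("vector_cache", exist_ok=True)
faiss.write_index(index, "vector_cache/docs.index")  # hypothetical file name
```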
### Semantic Search

Search for relevant documents using natural language:

```bash
python cli.py search "how to configure the model"
```

Returns ranked results with similarity scores.
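In terms of the same libraries, a search is roughly the sketch below. This is illustrative only; the real retriever in `src/retriever.py` also applies `similarity_threshold` from `config.yaml`:

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("./models/embeddinggemma-300m")
index = faiss.read_index("vector_cache/docs.index")  # built by `init`

query = model.encode(["how to configure the model"], normalize_embeddings=True)
scores, ids = index.search(query, 5)  # 5 = retrieval.top_k in config.yaml
for rank, (score, doc_id) in enumerate(zip(scores[0], ids[0]), start=1):
    print(f"{rank}. doc {doc_id} (cosine similarity {score:.2f})")
```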
### Ask Questions (RAG)

Get answers based on your indexed documents:

```bash
python cli.py ask "What are the configuration options?"
```

Uses retrieval-augmented generation to provide contextual answers.
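The shape of that pipeline, assuming `llama-cpp-python` as listed in the dependencies. The prompt template here is a guess for illustration, not the one in `src/generator.py`:

```python
from llama_cpp import Llama

# Stand-ins for the top_k chunks returned by the search step
retrieved_chunks = ["Configuration lives in config.yaml ...", "Options include ..."]

llm = Llama(model_path="./models/phi-2.Q4_0.gguf", n_ctx=2048)
prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n\n".join(retrieved_chunks) + "\n\n"
    "Question: What are the configuration options?\nAnswer:"
)
out = llm(prompt, max_tokens=512)  # max_tokens from config.yaml
print(out["choices"][0]["text"].strip())
```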
### Git Commit Message Generation

Generate intelligent commit messages from staged changes:

```bash
# Preview commit message without applying
python cli.py commit --preview

# Show staged files and analysis without generating message
python cli.py commit --dry-run

# Generate and apply commit message
python cli.py commit --apply
```
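Conceptually, `--apply` reads the staged diff and asks the model for a conventional-commit summary. A minimal sketch of that flow follows; the prompt wording is an assumption, not CodeMind's actual template in `src/diff_analyzer.py`:

```python
import subprocess

from llama_cpp import Llama

# Read the staged diff (what `git commit` would include)
diff = subprocess.run(
    ["git", "diff", "--cached"], capture_output=True, text=True, check=True
).stdout

llm = Llama(model_path="./models/phi-2.Q4_0.gguf", n_ctx=2048)
prompt = (
    "Write a conventional-commit message (imperative mood, subject <= 72 chars) "
    f"for this staged diff:\n\n{diff}\n\nCommit message:"
)
message = llm(prompt, max_tokens=64)["choices"][0]["text"].strip()

# Roughly what --apply does; --preview would just print `message`
subprocess.run(["git", "commit", "-m", message], check=True)
```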
### Start API Server (Future Feature)

```bash
python cli.py serve --port 8000
```

_Note: API server functionality is planned for future releases._
## Configuration

Edit `config.yaml` to customize behavior:

```yaml
embedding:
  model_path: "./models/embeddinggemma-300m"
  dim: 768
  truncate_to: 128

generator:
  model_path: "./models/phi-2.Q4_0.gguf"
  quantization: "Q4_0"
  max_tokens: 512
  n_ctx: 2048

retrieval:
  vector_store: "faiss"
  top_k: 5
  similarity_threshold: 0.75

commit:
  tone: "imperative"
  style: "conventional"
  max_length: 72

logging:
  verbose: true
  telemetry: false
```
### Configuration Options

- **embedding.model_path**: Path to the EmbeddingGemma-300m model directory
- **generator.model_path**: Path to the Phi-2 GGUF model file
- **generator.max_tokens**: Maximum tokens for generation
- **generator.n_ctx**: Context window size for Phi-2
- **retrieval.top_k**: Number of documents to retrieve for context
- **retrieval.similarity_threshold**: Minimum similarity score for results
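A module reads these values with PyYAML along these lines; see `src/config_loader.py` for the real implementation, this is only a minimal sketch:

```python
import yaml

with open("config.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

top_k = config["retrieval"]["top_k"]                     # 5
threshold = config["retrieval"]["similarity_threshold"]  # 0.75
print(f"retrieving top {top_k} documents above {threshold}")
```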
## Dependencies

- `sentence-transformers>=2.2.2` - Document embedding
- `faiss-cpu>=1.7.4` - Vector similarity search
- `llama-cpp-python>=0.2.23` - Phi-2 model inference (Windows compatible)
- `typer>=0.9.0` - CLI framework
- `PyYAML>=6.0` - Configuration file parsing
## Troubleshooting

### Model Loading Issues

If you encounter model loading errors:

1. **Embedding Model**: Ensure `embeddinggemma-300m` is a directory containing all model files
2. **Phi-2 Model**: Ensure `phi-2.Q4_0.gguf` is a single GGUF file
3. **Paths**: All paths in `config.yaml` should be relative to the project root
### Memory Issues

For systems with limited RAM:

- Use Q4_0 quantization for Phi-2 (already configured)
- Reduce `n_ctx` in `config.yaml` if needed (see the sketch after this list)
- Process documents in smaller batches
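Lowering the context window shrinks the model's KV cache and saves memory. The corresponding `llama-cpp-python` call (normally driven by `config.yaml`, shown directly here for illustration):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/phi-2.Q4_0.gguf",
    n_ctx=1024,  # halved from the 2048 set in config.yaml
)
```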
### Windows-Specific Issues

- Ensure your `llama-cpp-python` version supports Windows
- Use PowerShell or Command Prompt for CLI commands
- Check file path separators in the configuration
## Development

To verify that all modules import cleanly:

```bash
python -c "from src import *; print('All modules imported successfully')"
```

To run in development mode:

```bash
python cli.py --help
```
## License

[Insert your license information here]

## Contributing

[Insert contribution guidelines here]