Update README.md
README.md
CHANGED
## CodeSage-Large-v2

### [Blogpost]
Please check out our [blogpost](https://code-representation-learning.github.io/codesage-v2.html) for more details.

### Model description
CodeSage is a family of open code embedding models with an encoder architecture that supports a wide range of source code understanding tasks. It was initially introduced in the paper:

### Training Data
This pretrained checkpoint is the same as the one used by our V1 model ([codesage/codesage-small](https://huggingface.co/codesage/codesage-small)), which was trained on [The Stack](https://huggingface.co/datasets/bigcode/the-stack-dedup) data. The contrastive learning data are extracted from [The Stack V2](https://huggingface.co/datasets/bigcode/the-stack-v2). As with our V1 model, we support the same nine languages: c, c-sharp, go, java, javascript, typescript, php, python, ruby.

### How to Use
This checkpoint consists of an encoder (a 1.3B-parameter model) that can be used to extract code embeddings of 2048 dimensions.

1. Accessing CodeSage via HuggingFace: it can be easily loaded using the AutoModel functionality and employs the [Starcoder Tokenizer](https://arxiv.org/pdf/2305.06161.pdf).

```
from transformers import AutoModel, AutoTokenizer

# the checkpoint ships custom model code, so pass trust_remote_code=True
tokenizer = AutoTokenizer.from_pretrained("codesage/codesage-large-v2", trust_remote_code=True)
model = AutoModel.from_pretrained("codesage/codesage-large-v2", trust_remote_code=True)

inputs = tokenizer.encode("def print_hello_world():\tprint('Hello World!')", return_tensors="pt")
embedding = model(inputs)[0]
```
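
The output above is one vector per input token (2048 dimensions each for this model). The sketch below, which reuses the `model` and `tokenizer` from the snippet above, mean-pools those token vectors into a single embedding per snippet and compares two snippets with cosine similarity; the mean pooling is an illustrative choice, not necessarily the pooling used for the reported results.

```
import torch
import torch.nn.functional as F

def embed(code: str) -> torch.Tensor:
    # token-level embeddings from the encoder, mean-pooled into one vector (illustrative pooling)
    inputs = tokenizer.encode(code, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(inputs)[0]
    return F.normalize(token_embeddings.mean(dim=1), dim=-1)

query = embed("def print_hello_world():\tprint('Hello World!')")
candidate = embed("def greet():\n    print('Hello World!')")
print(F.cosine_similarity(query, candidate).item())  # higher score = more similar snippets
```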

2. Accessing CodeSage via SentenceTransformer

```
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("codesage/codesage-large-v2", trust_remote_code=True)
```
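
As a small usage sketch (the query and candidate snippets below are made-up placeholders), `model.encode` returns one embedding per input string, which can then be ranked with cosine similarity for code search:

```
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("codesage/codesage-large-v2", trust_remote_code=True)

# toy example: rank candidate snippets against a natural-language query
query = "function that reverses a string"
candidates = [
    "def reverse(s):\n    return s[::-1]",
    "def add(a, b):\n    return a + b",
]

query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

scores = util.cos_sim(query_emb, cand_embs)[0]
best = int(scores.argmax())
print(candidates[best], float(scores[best]))
```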
### BibTeX entry and citation info
```
@inproceedings{