In [1]:
from google.colab import drive
drive.mount('/content/drive')
base_dir = "/content/drive/MyDrive/huggingface-rag"
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
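If the notebook is started on a fresh runtime, it may be worth confirming that the project folder produced by the indexing step actually exists on Drive before anything is loaded from it. A minimal sanity check (assuming `base_dir` was populated by the earlier indexing notebook):

```python
import os

# The Qdrant database and any cached artifacts are expected under base_dir.
assert os.path.isdir(base_dir), f"Project folder not found: {base_dir}"
print(sorted(os.listdir(base_dir)))
```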
In [2]:
!pip install -U sentence-transformers openai qdrant_client fastembed
Requirement already satisfied: sentence-transformers in /usr/local/lib/python3.12/dist-packages (5.1.2)
Requirement already satisfied: openai in /usr/local/lib/python3.12/dist-packages (2.8.1)
Requirement already satisfied: qdrant_client in /usr/local/lib/python3.12/dist-packages (1.16.1)
Requirement already satisfied: fastembed in /usr/local/lib/python3.12/dist-packages (0.7.3)
In [3]:
import os
import httpx
from openai import OpenAI
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer
from fastembed import SparseTextEmbedding

qdrant_path = f"{base_dir}/qdrant_hybrid_db"
collection_name = 'huggingface_transformers_docs'
dense_model = SentenceTransformer("fyerfyer/finetune-jina-transformers-v1", trust_remote_code=True)
sparse_model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")

lock_file = os.path.join(qdrant_path, ".lock")
if os.path.exists(lock_file):
  try:
    os.remove(lock_file)
    print(f"Removed stale lock file: {lock_file}")
  except Exception as e:
    print(f"Warning: Could not remove lock file: {e}")
Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]
Removed stale lock file: /content/drive/MyDrive/huggingface-rag/qdrant_hybrid_db/.lock
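As a quick sanity check, the two embedders can be compared directly: the dense model returns one fixed-length float vector per text, while the SPLADE model returns a sparse vector of token indices and weights. These are the two representations the hybrid search below fuses. A minimal sketch (the exact dense dimension depends on the fine-tuned model):

```python
sample = "How do I load a pretrained model?"

# Dense embedding: a single fixed-length float vector.
dense_vec = dense_model.encode(sample)
print("dense dim:", dense_vec.shape)

# Sparse embedding: token indices with non-zero weights (SPLADE-style).
sparse_vec = list(sparse_model.embed([sample]))[0]
print("sparse non-zeros:", len(sparse_vec.indices))
```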
In [4]:
from google.colab import userdata

class HFRAG:
  def __init__(self):
    self.dense_model = dense_model
    self.sparse_model = sparse_model
    self.db_client = QdrantClient(path=qdrant_path)
    if not self.db_client.collection_exists(collection_name):
      raise ValueError(f"Cannot find collection {collection_name}, please check qdrant path")
    print(f"Successfully connected to qdrant: {qdrant_path}")

    self.llm_client = OpenAI(
      api_key=userdata.get('DEEPSEEK_API_KEY'),
      base_url="https://api.deepseek.com",
      http_client=httpx.Client(proxy=None, trust_env=False) # needed if a proxy is enabled; otherwise the request fails
    )

  def retrieve(self, query: str, top_k: int = 5):
    # Generate dense vector
    query_dense_vec = self.dense_model.encode(query).tolist()

    # Generate sparse vector
    query_sparse_gen = list(self.sparse_model.embed([query]))[0]
    query_sparse_vec = models.SparseVector(
      indices=query_sparse_gen.indices.tolist(),
      values=query_sparse_gen.values.tolist()
    )

    # Create prefetch for dense retrieval
    prefetch_dense = models.Prefetch(
      query=query_dense_vec,
      using="text-dense",
      limit=20,
    )

    # Create prefetch for sparse retrieval
    prefetch_sparse = models.Prefetch(
      query=query_sparse_vec,
      using="text-sparse",
      limit=20,
    )

    # Hybrid search with RRF fusion
    results = self.db_client.query_points(
      collection_name=collection_name,
      prefetch=[prefetch_dense, prefetch_sparse],
      query=models.FusionQuery(fusion=models.Fusion.RRF),
      limit=top_k,
      with_payload=True
    ).points

    return results

  def generate(self, query: str, search_results):
    if not search_results:
      return "I'm sorry, but I couldn't find any relevant information in the knowledge base regarding your query."

    context_pieces = []
    for idx, hit in enumerate(search_results, 1):
      source = hit.payload.get('source', 'unknown')
      filename = source.split('/')[-1] if '/' in source else source
      text = hit.payload['text']

      piece = f"""<doc id="{idx}" source="{filename}">
  {text}
  </doc>"""
      context_pieces.append(piece)

    context_str = "\n\n".join(context_pieces)

    system_prompt = """You are an expert AI assistant specializing in the Hugging Face Transformers library and NLP technology.

  YOUR MISSION:
  Answer the user's question using ONLY the provided "Retrieved Context". Do not rely on your internal knowledge base unless it is to explain syntax or general programming concepts not covered in the documents.

  GUIDELINES:
  1. **Grounding**: Base your answer strictly on the provided context chunks.
  2. **Code First**: If the context contains code examples, prioritize showing them in your answer using Python markdown blocks.
  3. **Citation**: When referencing specific information, cite the source file name (e.g., `[model_doc.md]`).
  4. **Honesty**: If the provided context does not contain enough information to answer the question, state: "The provided documents do not contain the answer to this question." Do not hallucinate or make up parameters.
  5. **Clarity**: Keep explanations concise and technical.

  Output Format:
  - Use Markdown for formatting.
  - Use `code blocks` for function names and parameters.
  """

    # User prompt: the query plus the retrieved context
    user_prompt = f"""
  ### User Query
  {query}

  ### Retrieved Context
  Please use the following documents to answer the query above:

  {context_str}

  ### Answer
  """

    print(f"\nThinking (Processing {len(search_results)} context chunks)...")

    try:
      response = self.llm_client.chat.completions.create(
        model="deepseek-chat",
        messages=[
          {"role": "system", "content": system_prompt},
          {"role": "user", "content": user_prompt},
        ],
        temperature=0.1,
        max_tokens=4096,
        stream=True
      )

      full_response = ""
      print("-" * 60)
      for chunk in response:
        if chunk.choices[0].delta.content:
          content = chunk.choices[0].delta.content
          print(content, end="", flush=True)
          full_response += content
      print("\n" + "-" * 60)
      return full_response

    except Exception as e:
      return f"Error calling LLM: {e}"

  def chat(self, query: str):
    print(f"\nUser: {query}")
    results = self.retrieve(query)
    self.generate(query, results)
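`FusionQuery(fusion=models.Fusion.RRF)` tells Qdrant to merge the dense and sparse candidate lists with Reciprocal Rank Fusion: each point's fused score is the sum of `1 / (k + rank)` over the lists it appears in, so documents ranked well by either retriever rise to the top. A minimal sketch of that scoring, assuming the textbook constant `k = 60` (Qdrant's internal constant may differ):

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion over several ranked lists of document ids."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)  # best first

# "b" is ranked well by both lists, so it ends up on top.
print(rrf_fuse([["a", "b", "c"], ["b", "d", "a"]]))  # ['b', 'a', 'd', 'c']
```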
In [ ]:
if __name__ == "__main__":
  rag = HFRAG()

  print("\nHuggingFace RAG assitant is started! Input 'quit' to exit")
  while True:
    user_input = input("\nPlease input your question: ")
    if user_input.lower() in ['quit', 'exit']:
      break

    rag.chat(user_input)
Successfully connected to qdrant: /content/drive/MyDrive/huggingface-rag/qdrant_hybrid_db

HuggingFace RAG assistant started! Type 'quit' or 'exit' to stop

Please input your question: How to use AutoModel?

User: How to use AutoModel?

Thinking (Processing 5 context chunks)...
------------------------------------------------------------
Based on the provided documents, here's how to use `AutoModel`:

## Overview

`AutoModel` is a convenient class that automatically selects the correct model architecture based on the pretrained model name or path. It eliminates the need to know the exact model class name [`auto.md`].

## Basic Usage

The fundamental pattern for using `AutoModel` is:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("model-name-or-path")
```

For example:
```python
model = AutoModel.from_pretrained("google-bert/bert-base-cased")
```
This automatically creates an instance of the appropriate model class (e.g., `BertModel`) [`auto.md`].

## Task-Specific AutoModels

There are specialized `AutoModel` classes for different tasks. The documents show several examples:

### For Masked Language Modeling
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2")
model = AutoModelForMaskedLM.from_pretrained(
    "albert/albert-base-v2",
    dtype=torch.float16,
    attn_implementation="sdpa",
    device_map="auto"
)
```
[`albert.md`]

### For Automatic Speech Recognition (CTC)
```python
from transformers import AutoModelForCTC, AutoProcessor
import torch

processor = AutoProcessor.from_pretrained("facebook/hubert-base-ls960")
model = AutoModelForCTC.from_pretrained(
    "facebook/hubert-base-ls960", 
    dtype=torch.float16, 
    device_map="auto", 
    attn_implementation="sdpa"
)
```
[`hubert.md`]

### For Multiple Tasks with Same Model
```python
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoModelForQuestionAnswering

# Use the same model for different tasks
model1 = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model2 = AutoModelForSequenceClassification.from_pretrained("meta-llama/Llama-2-7b-hf")
model3 = AutoModelForQuestionAnswering.from_pretrained("meta-llama/Llama-2-7b-hf")
```
[`models.md`]

## Key Benefits

- **Automatic architecture detection**: No need to know the exact model class name [`auto.md`]
- **Easy model switching**: Use the same API for different models and tasks [`models.md`]
- **Task-specific loading**: Specialized classes for different NLP tasks [`models.md`]

The main advantage is that you can easily switch between models or tasks while using the same consistent API [`models.md`].
------------------------------------------------------------
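For scripted, non-interactive use, the same pipeline can also be driven directly without the input loop. A minimal sketch using the class defined above:

```python
rag = HFRAG()
query = "How do I resize token embeddings?"
hits = rag.retrieve(query, top_k=3)   # hybrid dense + sparse retrieval
answer = rag.generate(query, hits)    # streams and returns the answer text
```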