A Hands-On Tutorial: Build a Modular LLM Evaluation Pipeline with Google Generative AI and LangChain – Gadgets Solutions

Evaluating LLMs has emerged as a pivotal challenge in advancing the reliability and utility of artificial intelligence in both academic and industrial settings. As the capabilities of these models expand, so does the need for rigorous, reproducible, and multi-faceted evaluation methods. In this tutorial, we examine one of the field's most important frontiers: systematically evaluating the strengths and limitations of LLMs across several dimensions of performance. Using Google's Generative AI models as benchmarks and the LangChain library as our orchestration tool, we present a robust, modular evaluation pipeline designed to run in Google Colab. The framework integrates criterion-based scoring (correctness, relevance, coherence, and conciseness) with pairwise model comparison and visual analytics to deliver nuanced, actionable insights. Grounded in expert-validated question sets and objective ground-truth answers, this approach balances quantitative rigor with practical adaptability, offering researchers and developers a ready-to-use, extensible toolkit for high-fidelity LLM evaluation.

!pip install langchain langchain-google-genai ragas pandas matplotlib

We install the key Python libraries for building and running AI-powered workflows: LangChain for orchestrating LLM interactions (with the langchain-google-genai extension for Google's Generative AI), ragas for evaluating retrieval-augmented generation, and pandas plus matplotlib for data manipulation and visualization. Note that ragas is installed here, although the code that follows does not import it.

import os
import pandas as pd
import matplotlib.pyplot as plt
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.evaluation import load_evaluator
from langchain.schema import HumanMessage

We import core Python utilities, including os for environment management, pandas for handling DataFrames, and matplotlib.pyplot for plotting, alongside LangChain's Google Generative AI client, prompt templating, chain construction, the evaluator loader, and the HumanMessage schema, which together let us build and assess conversational LLM pipelines.

os.environ["GOOGLE_API_KEY"] = "Use Your API Key"

Here, we configure the environment by storing your Google API key in the GOOGLE_API_KEY environment variable, which lets the LangChain Google Generative AI client authenticate its requests. Replace the placeholder string with your own key.
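Hard-coding the key in a notebook cell makes it easy to leak when the notebook is shared. The following is a minimal sketch of an alternative, assuming you would rather reuse an existing environment variable and only prompt when it is missing. `resolve_api_key` is a hypothetical helper introduced for illustration; it is not part of LangChain or the Google client.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return the Google API key, prompting only if it is not already set.

    Hypothetical helper: `env` and `prompt` are injectable so the logic
    can be exercised without touching the real environment.
    """
    key = env.get("GOOGLE_API_KEY", "").strip()
    if not key:
        key = prompt("Enter your Google API key: ").strip()
        if not key:
            raise ValueError("GOOGLE_API_KEY is required")
        # Store it so later client constructors can pick it up.
        env["GOOGLE_API_KEY"] = key
    return key
```

Calling `resolve_api_key()` once at the top of the notebook would then replace the hard-coded assignment.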

def create_evaluation_dataset():
    """Create a simple dataset for evaluation."""
    questions = [
        "Explain the concept of quantum computing in simple terms.",
        "How does a neural network learn?",
        "What are the main differences between SQL and NoSQL databases?",
        "Explain how blockchain technology works.",
        "What is the difference between supervised and unsupervised learning?"
    ]
   
    ground_truth = [
        "Quantum computing uses quantum bits or qubits that can exist in multiple states simultaneously, unlike classical bits. This allows quantum computers to process certain types of information much faster than classical computers for specific problems.",
        "Neural networks learn through a process called backpropagation where they adjust the weights between neurons based on the error between predicted and actual outputs, gradually minimizing this error through many iterations of training data.",
        "SQL databases are relational with structured schemas, fixed tables, and use SQL for queries. NoSQL databases are non-relational, schema-flexible, and designed for specific data models like document, key-value, wide-column, or graph formats.",
        "Blockchain is a distributed ledger technology where data is stored in blocks that are linked cryptographically. Each block contains transaction data and a timestamp, creating an immutable chain. Consensus mechanisms verify transactions without central authority.",
        "Supervised learning uses labeled data where the algorithm learns to predict outputs based on input-output pairs. Unsupervised learning works with unlabeled data to find patterns or structures without predefined outputs."
    ]
   
    return pd.DataFrame({"question": questions, "ground_truth": ground_truth})

We build a small evaluation DataFrame by pairing five example questions on AI and database concepts with their corresponding ground-truth answers, which makes it easy to benchmark an LLM's responses against predefined correct outputs.
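Because each ground-truth answer is matched to its question purely by position, a misaligned or blank entry would silently skew the correctness scores. The sketch below shows one way to sanity-check the two lists before building the DataFrame. `validate_pairs` is a hypothetical helper, not something the tutorial defines.

```python
def validate_pairs(questions, ground_truth):
    """Return cleaned (question, answer) pairs after basic sanity checks.

    Hypothetical helper for illustration only.
    """
    if len(questions) != len(ground_truth):
        raise ValueError(
            f"{len(questions)} questions but {len(ground_truth)} answers"
        )
    pairs = []
    for index, (question, answer) in enumerate(zip(questions, ground_truth)):
        # A blank entry on either side makes the pair unusable for scoring.
        if not question.strip() or not answer.strip():
            raise ValueError(f"blank question or answer at index {index}")
        pairs.append((question.strip(), answer.strip()))
    return pairs
```

Extending the dataset then only means appending to both lists and re-running the check.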

def setup_models():
    """Set up different Google Generative AI models for comparison."""
    models = {
        "gemini-2.0-flash-lite": ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0),
        "gemini-2.0-flash": ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0)
    }
    return models

This function instantiates two zero-temperature ChatGoogleGenerativeAI clients, one using the lightweight "gemini-2.0-flash-lite" model and the other the full "gemini-2.0-flash" model, so you can easily compare their outputs side by side.

def generate_responses(models, dataset):
    """Generate responses from each model for the questions in the dataset."""
    responses = {}

    for model_name, model in models.items():
        model_responses = []
        for question in dataset["question"]:
            try:
                # Chat models take a list of messages and return a message object.
                response = model.invoke([HumanMessage(content=question)])
                model_responses.append(response.content)
            except Exception as e:
                # Log the failure and keep a placeholder so list positions stay aligned.
                print(f"Error with model {model_name} on question: {question}")
                print(f"Error: {e}")
                model_responses.append("Error generating response")

        responses[model_name] = model_responses

    return responses

This function loops over each configured model and each question in the dataset, invokes the model to generate a response, catches any errors (logging them and inserting a placeholder), and returns a dictionary mapping each model's name to its list of generated answers.
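The error-handling path can be exercised without making any API calls. The sketch below uses stand-in objects that only mimic the `invoke(...).content` shape of a chat model. `FakeModel`, `FailingModel`, and `collect_responses` are hypothetical names, and the loop is a simplified version of the function above that passes plain strings instead of `HumanMessage` objects.

```python
from types import SimpleNamespace

ERROR_PLACEHOLDER = "Error generating response"


class FakeModel:
    """Stand-in for a chat model: echoes the prompt back."""

    def invoke(self, prompt):
        return SimpleNamespace(content=f"echo: {prompt}")


class FailingModel:
    """Stand-in that always fails, to trigger the except branch."""

    def invoke(self, prompt):
        raise RuntimeError("simulated API failure")


def collect_responses(models, questions):
    """Map each model name to one answer (or placeholder) per question."""
    responses = {}
    for name, model in models.items():
        answers = []
        for question in questions:
            try:
                answers.append(model.invoke(question).content)
            except Exception as error:
                print(f"Error with model {name}: {error}")
                answers.append(ERROR_PLACEHOLDER)
        responses[name] = answers
    return responses


demo = collect_responses({"ok": FakeModel(), "bad": FailingModel()}, ["q1", "q2"])
```

Every list in `demo` has one entry per question, which is what the later index-based lookups rely on.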

def evaluate_responses(models, dataset, responses):
    """Evaluate model responses using different evaluation criteria."""
    evaluator_model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)

    # Criteria judged against the ground truth vs. criteria judged on the response alone.
    reference_criteria = ["correctness"]
    reference_free_criteria = [
        "relevance",
        "coherence",
        "conciseness"
    ]

    results = {model_name: {criterion: [] for criterion in reference_criteria + reference_free_criteria}
               for model_name in models.keys()}

    for criterion in reference_criteria:
        evaluator = load_evaluator("labeled_criteria", criteria=criterion, llm=evaluator_model)

        for model_name in models.keys():
            for i, question in enumerate(dataset["question"]):
                ground_truth = dataset["ground_truth"][i]
                response = responses[model_name][i]

                if response != "Error generating response":
                    eval_result = evaluator.evaluate_strings(
                        prediction=response,
                        reference=ground_truth,
                        input=question
                    )
                    # The criteria evaluators report a binary score (1 = criterion met, 0 = not met);
                    # a missing or None score is treated as 0. The factor of 2 is the tutorial's scaling.
                    normalized_score = float(eval_result.get("score") or 0) * 2
                    results[model_name][criterion].append(normalized_score)
                else:
                    results[model_name][criterion].append(0)

    for criterion in reference_free_criteria:
        evaluator = load_evaluator("criteria", criteria=criterion, llm=evaluator_model)

        for model_name in models.keys():
            for i, question in enumerate(dataset["question"]):
                response = responses[model_name][i]

                if response != "Error generating response":
                    eval_result = evaluator.evaluate_strings(
                        prediction=response,
                        input=question
                    )
                    normalized_score = float(eval_result.get("score") or 0) * 2
                    results[model_name][criterion].append(normalized_score)
                else:
                    results[model_name][criterion].append(0)
    return results

This function uses a "gemini-2.0-flash-lite" evaluator to score each model's answers on a reference-based criterion (correctness) and on reference-free criteria (relevance, coherence, conciseness). It normalizes those scores and returns a nested dictionary mapping each model and criterion to its list of evaluation results.
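The normalization step is the easiest place for the pipeline to fail quietly, so it is worth isolating. The sketch below assumes the evaluator returns a dictionary with a `score` key that holds 1, 0, or `None` when the judge's verdict could not be parsed. That shape is an assumption about the LangChain criteria evaluators rather than something the tutorial states, and `normalize_score` is a hypothetical helper.

```python
def normalize_score(eval_result, scale=2.0):
    """Scale an evaluator score, treating a missing or None score as 0.

    Hypothetical helper; `scale` defaults to the factor used in the tutorial.
    """
    raw = eval_result.get("score")
    if raw is None:
        # float(None) would raise TypeError, so handle it explicitly.
        return 0.0
    return float(raw) * scale


passed = normalize_score({"reasoning": "...", "value": "Y", "score": 1})
unparsed = normalize_score({"reasoning": "...", "value": None, "score": None})
```

Under these assumptions each individual score is either 0 or the scale factor, so the averages computed later are effectively pass rates multiplied by that factor.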

def calculate_average_scores(evaluation_results):
    """Calculate average scores for each model and criterion."""
    avg_scores = {}

    for model_name, criteria in evaluation_results.items():
        avg_scores[model_name] = {}

        # Per-criterion mean across all questions.
        for criterion, scores in criteria.items():
            if scores:
                avg_scores[model_name][criterion] = sum(scores) / len(scores)
            else:
                avg_scores[model_name][criterion] = 0

        # Overall mean: pool every individual score across all criteria.
        all_scores = [score for criterion_scores in criteria.values() for score in criterion_scores if score is not None]
        if all_scores:
            avg_scores[model_name]["overall"] = sum(all_scores) / len(all_scores)
        else:
            avg_scores[model_name]["overall"] = 0

    return avg_scores

This function processes the nested evaluation results to compute, for each model, the average score per criterion across all questions. It also computes an overall average by pooling all individual metric scores. The returned dictionary maps each model to its per-criterion averages and an aggregate "overall" performance score.
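Pooling all scores is not always the same as averaging the per-criterion averages: the two differ whenever criteria have different numbers of scores. The sketch below is a compact re-implementation of the same logic with made-up numbers to show the difference. `average_scores` and the `example` data are illustrative only.

```python
def average_scores(results):
    """Per-criterion means plus a pooled 'overall' mean for each model."""
    averages = {}
    for model, criteria in results.items():
        per_criterion = {
            name: (sum(scores) / len(scores) if scores else 0)
            for name, scores in criteria.items()
        }
        pooled = [score for scores in criteria.values() for score in scores]
        per_criterion["overall"] = sum(pooled) / len(pooled) if pooled else 0
        averages[model] = per_criterion
    return averages


# Made-up scores: two for correctness, four for relevance.
example = {"model-x": {"correctness": [2, 0], "relevance": [2, 2, 2, 2]}}
averages = average_scores(example)
mean_of_means = (averages["model-x"]["correctness"] + averages["model-x"]["relevance"]) / 2
```

Here the pooled overall score is 10/6 (about 1.67), while the mean of the two criterion averages is 1.5. In the tutorial's pipeline every criterion is scored once per question, so the two values coincide.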

def visualize_results(avg_scores):
    """Visualize evaluation results with bar charts."""
    models = list(avg_scores.keys())
    criteria = list(avg_scores[models[0]].keys())

    plt.figure(figsize=(14, 8))

    # Split a fixed group width of 0.8 evenly among the models.
    bar_width = 0.8 / len(models)

    positions = range(len(criteria))

    for i, model in enumerate(models):
        model_scores = [avg_scores[model][criterion] for criterion in criteria]
        plt.bar([p + i * bar_width for p in positions], model_scores,
                width=bar_width, label=model)

    plt.xlabel('Evaluation Criteria', fontsize=12)
    plt.ylabel('Average Score', fontsize=12)
    plt.title('LLM Model Comparison by Evaluation Criteria', fontsize=14)
    # Center each tick label under its group of bars.
    plt.xticks([p + bar_width * (len(models) - 1) / 2 for p in positions], criteria)
    plt.legend()
    plt.grid(axis="y", linestyle="--", alpha=0.7)

    plt.tight_layout()
    plt.show()

    plt.figure(figsize=(10, 8))

    categories = [c for c in criteria if c != 'overall']
    N = len(categories)

    # One angle per category; repeat the first to close the polygon.
    angles = [n / float(N) * 2 * 3.14159 for n in range(N)]
    angles += angles[:1]

    plt.polar(angles, [0] * (N + 1))
    plt.xticks(angles[:-1], categories)

    for model in models:
        values = [avg_scores[model][c] for c in categories]
        values += values[:1]
        plt.polar(angles, values, label=model)

    plt.legend(loc="upper right")
    plt.title('LLM Model Comparison - Radar Chart', fontsize=14)
    plt.tight_layout()
    plt.show()

This function creates a side-by-side bar chart comparing each model's average scores across all evaluation criteria. It then renders a radar chart of their performance profiles, which makes relative strengths and weaknesses quick to spot.
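The layout arithmetic behind both charts can be checked separately from matplotlib. The sketch below reproduces the grouped-bar offsets and the closed radar polygon as plain functions. `grouped_bar_positions` and `radar_angles` are hypothetical helpers, and `math.pi` is used in place of the literal 3.14159.

```python
import math


def grouped_bar_positions(n_groups, n_series, group_width=0.8):
    """Return (bar_width, positions[series][group], tick_centers)."""
    bar_width = group_width / n_series
    positions = [
        [group + series * bar_width for group in range(n_groups)]
        for series in range(n_series)
    ]
    # Tick labels sit at the midpoint of each group's bar centers.
    ticks = [group + bar_width * (n_series - 1) / 2 for group in range(n_groups)]
    return bar_width, positions, ticks


def radar_angles(n_axes):
    """Evenly spaced angles, with the first repeated to close the polygon."""
    angles = [2 * math.pi * k / n_axes for k in range(n_axes)]
    return angles + angles[:1]


bar_width, positions, ticks = grouped_bar_positions(n_groups=5, n_series=2)
angles = radar_angles(4)
```

With two models and five criteria, each bar is 0.4 wide and the first tick sits at 0.2, halfway between the two bar centers at 0 and 0.4. The radar angles list has one more entry than there are axes because the first angle is repeated at the end.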

def main():
    print("Creating evaluation dataset...")
    dataset = create_evaluation_dataset()

    print("Setting up models...")
    models = setup_models()

    print("Generating responses...")
    responses = generate_responses(models, dataset)

    print("Evaluating responses...")
    evaluation_results = evaluate_responses(models, dataset, responses)

    print("Calculating average scores...")
    avg_scores = calculate_average_scores(evaluation_results)

    print("Average scores:")
    for model, scores in avg_scores.items():
        print(f"\n{model}:")
        for criterion, score in scores.items():
            print(f"  {criterion}: {score:.2f}")

    print("\nVisualizing results...")
    visualize_results(avg_scores)

    print("Saving results to CSV...")
    # One row per (model, criterion) pair, including the "overall" average.
    summary_rows = [
        {"Model": model, "Criterion": criterion, "Score": score}
        for model, criteria in avg_scores.items()
        for criterion, score in criteria.items()
    ]
    results_df = pd.DataFrame(summary_rows, columns=["Model", "Criterion", "Score"])

    results_df.to_csv("llm_evaluation_results.csv", index=False)
    print("Results saved to llm_evaluation_results.csv")

    # One row per question, with each model's response in its own column.
    detailed_rows = []
    for i, question in enumerate(dataset["question"]):
        row = {
            "Question": question,
            "Ground Truth": dataset["ground_truth"][i]
        }

        for model_name in models.keys():
            row[model_name] = responses[model_name][i]

        detailed_rows.append(row)

    detailed_df = pd.DataFrame(detailed_rows, columns=["Question", "Ground Truth"] + list(models.keys()))

    detailed_df.to_csv("llm_response_comparison.csv", index=False)
    print("Detailed responses saved to llm_response_comparison.csv")

The main function orchestrates the entire evaluation workflow end to end: it builds the dataset, initializes the models, generates and scores responses, computes and prints the average metrics, visualizes performance with charts, and finally exports both summary and detailed results as CSV files.

def pairwise_model_comparison(models, dataset, responses):
    """Compare two models side by side using an LLM as judge."""
    evaluator_model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)

    pairwise_template = """
    Question: {question}

    Response A: {response_a}

    Response B: {response_b}

    Which response better answers the user's question? Consider factors like accuracy,
    helpfulness, clarity, and completeness.

    First, analyze each response point by point. Then conclude with your choice of either:
    A is better, B is better, or They are equally good/bad.

    Your analysis:
    """

    pairwise_prompt = PromptTemplate(
        input_variables=["question", "response_a", "response_b"],
        template=pairwise_template
    )

    pairwise_chain = LLMChain(llm=evaluator_model, prompt=pairwise_prompt)

    model_names = list(models.keys())

    # One result list per unordered model pair, matching the pairs compared below.
    pairwise_results = {
        f"{model_a} vs {model_b}": []
        for j, model_a in enumerate(model_names)
        for model_b in model_names[j+1:]
    }

    for i, question in enumerate(dataset["question"]):
        for j, model_a in enumerate(model_names):
            for model_b in model_names[j+1:]:
                response_a = responses[model_a][i]
                response_b = responses[model_b][i]

                # Skip the comparison if either model failed on this question.
                if response_a != "Error generating response" and response_b != "Error generating response":
                    comparison_result = pairwise_chain.run(
                        question=question,
                        response_a=response_a,
                        response_b=response_b
                    )

                    key_ab = f"{model_a} vs {model_b}"
                    pairwise_results[key_ab].append({
                        "question": question,
                        "result": comparison_result
                    })

    return pairwise_results

This function runs head-to-head comparisons for each unique model pair by prompting a "gemini-2.0-flash-lite" judge to analyze and rank their responses on accuracy, helpfulness, clarity, and completeness. It collects the per-question verdicts in a structured dictionary for side-by-side review.
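The judge returns free-text analysis, so counting wins requires extracting the final verdict from each result. The sketch below is one heuristic way to do that, assuming the judge ends with one of the phrases requested in the prompt ("A is better", "B is better", or "equally good/bad"). `extract_verdict` and `tally_verdicts` are hypothetical helpers, and a judge that phrases its conclusion differently will land in the "unknown" bucket.

```python
import re
from collections import Counter

# Phrases the pairwise prompt asks the judge to conclude with.
VERDICT_PATTERN = re.compile(
    r"\b(A is better|B is better|equally good|equally bad)\b", re.IGNORECASE
)


def extract_verdict(analysis):
    """Return 'A', 'B', 'tie', or 'unknown' based on the last verdict phrase."""
    matches = VERDICT_PATTERN.findall(analysis)
    if not matches:
        return "unknown"
    # The prompt asks for analysis first, so the final match is the conclusion.
    last = matches[-1].lower()
    if last.startswith("a "):
        return "A"
    if last.startswith("b "):
        return "B"
    return "tie"


def tally_verdicts(entries):
    """Count verdicts over a list of {'question': ..., 'result': ...} dicts."""
    return Counter(extract_verdict(entry["result"]) for entry in entries)
```

Applied to one value of the dictionary returned by the function above, `tally_verdicts` gives a per-pair win count. Reviewing the "unknown" cases by hand is advisable, since string matching cannot capture every way a model may word its conclusion.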

def enhanced_main():
    """Enhanced main function with additional evaluations."""
    print("Creating evaluation dataset...")
    dataset = create_evaluation_dataset()

    print("Setting up models...")
    models = setup_models()

    print("Generating responses...")
    responses = generate_responses(models, dataset)

    print("Evaluating responses...")
    evaluation_results = evaluate_responses(models, dataset, responses)

    print("Calculating average scores...")
    avg_scores = calculate_average_scores(evaluation_results)

    print("Average scores:")
    for model, scores in avg_scores.items():
        print(f"\n{model}:")
        for criterion, score in scores.items():
            print(f"  {criterion}: {score:.2f}")

    print("\nVisualizing results...")
    visualize_results(avg_scores)

    print("\nPerforming pairwise model comparison...")
    pairwise_results = pairwise_model_comparison(models, dataset, responses)

    print("\nPairwise comparison results:")
    for comparison, results in pairwise_results.items():
        print(f"\n{comparison}:")
        # Preview only the first two questions and the first 100 characters of each analysis.
        for i, result in enumerate(results[:2]):
            print(f"  Question {i+1}: {result['question']}")
            print(f"  Analysis: {result['result'][:100]}...")

    print("\nSaving all results...")
    # Summary scores: one row per (model, criterion) pair.
    summary_rows = [
        {"Model": model, "Criterion": criterion, "Score": score}
        for model, criteria in avg_scores.items()
        for criterion, score in criteria.items()
    ]
    results_df = pd.DataFrame(summary_rows, columns=["Model", "Criterion", "Score"])

    results_df.to_csv("llm_evaluation_results.csv", index=False)

    # Detailed responses: one row per question, one column per model.
    detailed_rows = []
    for i, question in enumerate(dataset["question"]):
        row = {
            "Question": question,
            "Ground Truth": dataset["ground_truth"][i]
        }

        for model_name in models.keys():
            row[model_name] = responses[model_name][i]

        detailed_rows.append(row)

    detailed_df = pd.DataFrame(detailed_rows, columns=["Question", "Ground Truth"] + list(models.keys()))

    detailed_df.to_csv("llm_response_comparison.csv", index=False)

    # Pairwise analyses: one row per (comparison, question).
    pairwise_rows = [
        {
            "Comparison": comparison,
            "Question": result["question"],
            "Analysis": result["result"]
        }
        for comparison, results in pairwise_results.items()
        for result in results
    ]
    pairwise_df = pd.DataFrame(pairwise_rows, columns=["Comparison", "Question", "Analysis"])

    pairwise_df.to_csv("llm_pairwise_comparison.csv", index=False)

    print("All results saved to CSV files.")

The enhanced_main function extends the core evaluation pipeline by adding automated pairwise model comparisons, printing concise progress updates at each stage, and exporting three CSV files (summary scores, detailed responses, and pairwise analyses), so you end up with a complete, side-by-side evaluation workspace.

if __name__ == "__main__":
    enhanced_main()

Finally, this guard ensures that when the script is executed directly (rather than imported), it calls enhanced_main() to run the full evaluation and comparison pipeline end to end.

In conclusion, this tutorial has presented a versatile and principled framework for evaluating and comparing LLM performance, using Google's Generative AI capabilities together with the LangChain library for orchestration. Unlike simplistic accuracy-based metrics, the methodology presented here embraces the multidimensional nature of language understanding, combining granular criterion-based evaluation, structured model-to-model comparison, and intuitive visualizations. By capturing key attributes, including correctness, relevance, coherence, and conciseness, the evaluation pipeline helps practitioners identify subtle yet significant performance differences that directly affect downstream applications. The outputs, including CSV-based reports, radar plots, and bar charts, not only support transparent benchmarking but also guide data-driven decisions in model selection and deployment.

Asif Razzaq is the CEO of MarkTechPost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, MarkTechPost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has over 2 million monthly views, reflecting its popularity among readers.
