import React, { Component } from 'react'
// React-MDL imports
import { Grid, Cell } from "react-mdl";

export class DataAndPrep extends Component {
  render() {
    return (
      <div>
        <Grid className="HomePageGrid">
          <Cell col={12}>
            <div className="Profile-title-text">

              <h2 className="Profile-subsubtitle-text">Data and Prep</h2>

              <p className="Intro">The dataset used in this project consists of over 450 PMP practice questions, each categorized into one of three difficulty levels: Easy, Medium, or Hard. The questions are scenario-based and cover the key PMP exam domains: People, Process, and Business Environment. Each question is stored as text, so the question wording itself is the input for model training and difficulty prediction. The project uses this dataset to develop an adaptive system that improves study efficiency for PMP candidates.</p>

              <img src="https://firebasestorage.googleapis.com/v0/b/ketofinder-a8370.appspot.com/o/Data01.png?alt=media&token=b569b988-c028-4f2c-ac68-9204ec5a9e02" alt="Question Distribution" className="pic" />

              <p className="Intro">The dataset was enriched by extracting PMP practice questions from relevant YouTube videos. Each video was split into individual frames, sampled at regular intervals to capture all question content while minimizing redundancy. OpenCV was used for frame extraction, and Tesseract OCR was used to detect and extract the text in each frame.</p>
              <p className="Intro">The main challenges were variations in font style, formatting, background noise, and image clarity. Preprocessing was applied to improve text quality and to filter out irrelevant or low-quality frames, resulting in an expanded dataset of meaningful text segments for model training. The four images below illustrate the process, from raw video frame to extracted text.</p>
              <img src="https://firebasestorage.googleapis.com/v0/b/ketofinder-a8370.appspot.com/o/Data02.jpg?alt=media&token=792b6844-2daf-4761-9a44-6986a21385d3" alt="Question Sample" className="pic" />
              <img src="https://firebasestorage.googleapis.com/v0/b/ketofinder-a8370.appspot.com/o/Data03.png?alt=media&token=95fa7194-576b-4e68-8747-91b3fd5e1aec" alt="Question Extraction code" className="pic" />
              <img src="https://firebasestorage.googleapis.com/v0/b/ketofinder-a8370.appspot.com/o/Data04.png?alt=media&token=890549e7-4d43-45d8-b8a9-b5c426c2f2fe" alt="Question Extraction code" className="pic" />
              <img src="https://firebasestorage.googleapis.com/v0/b/ketofinder-a8370.appspot.com/o/Data05.png?alt=media&token=5e1723a5-8109-40b6-8031-d1a50354f90e" alt="Question Extraction text" className="pic" />


              <p className="Intro">To prepare the dataset for analysis, several cleaning steps were applied. Irrelevant and duplicate frames were removed, non-question content was filtered out, and the extracted text was standardized into structured questions and answers. The text was then normalized by converting it to lowercase and removing special characters, and tokenized so that it was suitable for embedding and model training. These steps produced a cleaner dataset, which makes difficulty classification more reliable.</p>
              <p className="Intro">Each question in the dataset was converted into a 768-dimensional vector using BERT embeddings. These embeddings capture the contextual meaning of each question, so the classification models receive informative input, which helps accuracy even with a limited dataset size.</p>
              <p className="Intro">The data preparation process was crucial for accurate difficulty classification. By transforming raw text into structured, meaningful representations, the project enabled the model to differentiate between Easy, Medium, and Hard questions. This preparation improves the model's performance and provides a foundation for future improvements, such as expanding the dataset and incorporating adaptive learning features.</p>

              <p><a href="https://colab.research.google.com/drive/13nwu9WzsbsxEZ2-FU9Y_Z3I04ffAAeQJ?usp=sharing">Link to sample code on Colab</a></p>
              <p><a href="https://docs.google.com/spreadsheets/d/13LXtvifTQQCZbXlyyVkXrLVqIfWPrMOw/edit?usp=sharing&ouid=111430603714665417345&rtpof=true&sd=true">Link to sample data on Google Drive</a></p>
              <p><a href="https://www.youtube.com/watch?v=Zht0-j03NfQ">Link to the sampled video on YouTube</a></p>

            </div>
          </Cell>

        </Grid>
      </div>

    )
  }
}

export default DataAndPrep