ProText-Analyzer

Note

Some tasks do not use the NLTK package. Instead, TextBlob (sentiment), syllapy (syllable counts), and regular expressions (pronoun detection) are used, as described in the Text Analysis Process section below.

Project Structure

🗂 Directories and Files

📁 Cleaned Articles

📂 Extracted Articles

📚 Master Dictionary

📑 Project Introduction

🧪 Test Assessment

💻 Code and Markdown

🚫 Stop Words

📊 Text Analysis

📈 Additional Files

Blackcoffer Test Assignment

Company Information

Assignment Overview

  1. Objective: Extract textual data from provided URLs and perform text analysis.
  2. Data Extraction:
    • Input from Input.xlsx
    • Tools: Python, BeautifulSoup, Selenium, Scrapy.
  3. Data Analysis:
    • Output in CSV or Excel format.
    • Variables include Positive Score, Negative Score, Polarity Score, etc.
  4. Timeline: 6 days.
  5. Submission: Via Google Form with required files.

Methodology

Objective:
The ProText-Analyzer project extracts article content from provided URLs and performs text analysis tasks such as sentiment scoring and readability measurement. The results are saved in a clean, organized format, ready for review and further use.

Project Overview

The goal of ProText-Analyzer is to:

  1. Extract Textual Data: Fetch the article content from URLs provided in the Input.xlsx file.
  2. Perform Textual Analysis: Calculate the following metrics:
    • Sentiment scores (positive, negative, polarity, subjectivity)
    • Readability scores (Fog Index, Avg. Sentence Length)
    • Word count, syllable count, and other word statistics

Technologies Used

  • Python
  • requests and BeautifulSoup for fetching and parsing article pages
  • TextBlob for sentiment scoring
  • syllapy for syllable counts
  • Regular expressions for personal pronoun detection

Installation

  1. Clone the repository to your local machine:
    git clone https://github.com/rubydamodar/ProText-Analyzer.git
    cd ProText-Analyzer
    
  2. Install the required Python libraries:
    pip install -r requirements.txt
    

Data Extraction Process

The ProText-Analyzer extracts the article title and body from each URL listed in the Input.xlsx file and stores the text for further analysis.

Process Overview:

  1. Read Input File: Load the URLs and their associated IDs from Input.xlsx.
  2. Extract Article Content:
    • Fetch HTML content using requests.
    • Parse the HTML using BeautifulSoup to extract the article's title and body.
    • Save the extracted content into text files named after the URL_ID.
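The parse step above can be sketched as follows. Note the tag choices (`h1` for the title, `p` for the body) are illustrative assumptions, since the target sites' markup is not specified here:

```python
from bs4 import BeautifulSoup

def extract_article(html):
    # h1/p selectors are hypothetical; real pages may need different ones.
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("h1")
    title = title_tag.get_text(strip=True) if title_tag else ""
    body = "\n".join(p.get_text(strip=True) for p in soup.find_all("p"))
    return title, body

# Offline demo; the real pipeline first fetches each URL from Input.xlsx
# with requests, then saves the result to a file named after the URL_ID.
sample = "<html><body><h1>My Article</h1><p>First para.</p><p>Second.</p></body></html>"
title, body = extract_article(sample)
```

In the actual pipeline, `html` would come from `requests.get(url).text` for each row of Input.xlsx, and the returned text would be written to a file named after the URL_ID.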

File Management:


Text Analysis Process

The extracted text undergoes several analysis steps to compute the following variables:

  1. Sentiment Analysis:
    • Implemented using TextBlob to compute Positive Score, Negative Score, Polarity Score, and Subjectivity Score.
    • Text is cleaned by removing stop words and irrelevant characters.
  2. Readability Analysis:
    • Calculated using the Gunning Fog Index.
    • Additional metrics: Average Sentence Length, Percentage of Complex Words, and Fog Index.
  3. Word-Level Metrics:
    • Word Count, Complex Word Count, Syllable Count per Word (via syllapy), Personal Pronouns Count (using regex), and Average Word Length.
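The readability and word-level metrics above can be sketched in pure Python. This is a minimal illustration, not the project's exact code: it approximates syllapy's syllable counts with a vowel-group heuristic, uses the standard Gunning Fog formula, and omits the TextBlob sentiment step:

```python
import re

def count_syllables(word):
    # Vowel-group heuristic, a rough stand-in for syllapy
    groups = re.findall(r"[aeiouy]+", word.lower())
    count = len(groups)
    if word.lower().endswith("e") and count > 1:
        count -= 1  # discount a trailing silent 'e'
    return max(count, 1)

def analyze(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [count_syllables(w) for w in words]
    complex_words = sum(1 for n in syllables if n >= 3)  # 3+ syllables = complex
    avg_sentence_length = len(words) / max(len(sentences), 1)
    pct_complex = complex_words / max(len(words), 1)
    # Standard Gunning Fog: 0.4 * (avg sentence length + % complex words)
    fog_index = 0.4 * (avg_sentence_length + 100 * pct_complex)
    # Personal pronouns via regex; "US" (the country) is excluded
    pronouns = len(re.findall(r"\b(?:I|we|my|ours|us)\b", text, re.IGNORECASE))
    pronouns -= len(re.findall(r"\bUS\b", text))
    return {
        "word_count": len(words),
        "complex_word_count": complex_words,
        "avg_sentence_length": avg_sentence_length,
        "fog_index": fog_index,
        "personal_pronouns": pronouns,
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
    }

result = analyze("I love simple text. We analyze it carefully.")
```

The heuristic syllable counter will disagree with syllapy on some words (e.g. "simple"), which is acceptable for a sketch but would shift the complex-word and Fog scores slightly in practice.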

Output Structure

The results are saved in Excel/CSV format as per the structure outlined in Output Data Structure.xlsx. The following variables are included:


How to Run

  1. Data Extraction: Run the script to extract article data from the URLs:
    python data_extraction.py
    
  2. Text Analysis: Run the text analysis script to process the extracted articles:
    python text_analysis.py
    

The results will be saved in the output directory in .csv or .xlsx format.


Challenges and Solutions


Contributing

We welcome contributions to enhance ProText-Analyzer! To contribute:

  1. Fork the repository.
  2. Create a new branch for your changes.
  3. Submit a pull request with a detailed description of your changes.

License

This project is licensed under the MIT License.


Project Maintainer

Ruby Poddar
Email: rubypoddarr@gmail.com